Skip to content

chore(spanner): optimize lock contention and skipped tablet reporting#12719

Merged
rahul2393 merged 9 commits intomainfrom
optimize-locks-skipped-tablet-uid
Apr 9, 2026
Merged

chore(spanner): optimize lock contention and skipped tablet reporting#12719
rahul2393 merged 9 commits intomainfrom
optimize-locks-skipped-tablet-uid

Conversation

@rahul2393
Copy link
Copy Markdown
Contributor

@rahul2393 rahul2393 commented Apr 8, 2026

Summary

  • Replace synchronized with ReadWriteLock in KeyRangeCache to reduce lock contention on the hot path. Read operations (findServer, getActiveAddresses, size, debugString) now take a read lock, allowing concurrent lookups. Group updates are performed outside the write lock to minimize critical section duration.
  • Move cache updates to an async executor in ChannelFinder via updateAsync(), using a sequential executor backed by a shared daemon thread pool. This prevents CacheUpdate processing from blocking the gRPC response listener thread.
  • Report skipped tablets for recently evicted TRANSIENT_FAILURE endpoints. Previously, when an endpoint was evicted after repeated TRANSIENT_FAILURE probes and not yet recreated, its tablets were silently skipped — the server never learned the client considered them unhealthy. Now EndpointLifecycleManager tracks addresses evicted for TRANSIENT_FAILURE and KeyRangeCache includes their tablet UIDs in skipped_tablets, giving the server better routing visibility.
  • Deduplicate skipped tablet UIDs to avoid reporting the same tablet multiple times when it appears across replicas.
  • Report known TRANSIENT_FAILURE replicas even when an earlier healthy replica was already selected, so the server gets a complete picture of unhealthy tablets for the group.
  • Mark CachedRange.lastAccess as volatile for safe cross-thread reads.

@rahul2393 rahul2393 requested review from a team as code owners April 8, 2026 18:32
@rahul2393 rahul2393 requested a review from olavloite April 8, 2026 18:32
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces asynchronous cache updates in ChannelFinder and KeyAwareChannel to improve performance, and enhances EndpointLifecycleManager to track and report transient failure evictions. Additionally, it refactors KeyRangeCache to use ReentrantReadWriteLock for better concurrency. My review highlights critical issues: the addRanges method in KeyRangeCache is not exception-safe and risks reference count leaks, the readLock release in fillRoutingHint creates a race condition that requires CachedGroup to be fully synchronized, and the use of Executors.newCachedThreadPool in ChannelFinder poses a risk of resource exhaustion.

Comment on lines +154 to +158
readLock.lock();
try {
targetRange = findRangeLocked(key, hintBuilder.getLimitKey(), rangeMode);
} finally {
readLock.unlock();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Releasing the readLock before calling targetRange.group.fillRoutingHint() introduces a race condition because addRanges now performs group updates outside the global lock. Since CachedGroup relies on monitor-based synchronization (as seen in getActiveAddresses at line 180), both CachedGroup.update and CachedGroup.fillRoutingHint must be synchronized to ensure thread safety. Without this, concurrent updates and reads of the group's internal state (like tablets and leaderIndex) could lead to inconsistent behavior or data corruption.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CachedGroup.update() is synchronized and CachedGroup.fillRoutingHint() now synchronizes on the same monitor

for (Set<String> addresses : activeAddressesPerFinder.values()) {
allActive.addAll(addresses);
}
transientFailureEvictedAddresses.retainAll(allActive);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call is synchronized on activeAddressLock. However, the probe method also adds to this set without holding a lock. This means that this call (and also the add call from probe) are not guaranteed to be atomic, and that the outcome is undefined. It won't cause any exceptions, as it is a ConcurrentHashSet, so if we are fine with this being slightly undeterministic, then this is fine. Otherwise, we should take the same lock in probe as well as here.

}

@VisibleForTesting
void awaitPendingUpdates() throws InterruptedException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this method to src/test/java? Or otherwise add a comment that it should only be called from test code? (The @VisibleForTesting annotation is intended to indicate that a method has higher visibility than otherwise needed to be able to test, it does not indicate that it should only be invoked from test code.)

@rahul2393 rahul2393 merged commit be795b5 into main Apr 9, 2026
118 of 119 checks passed
@rahul2393 rahul2393 deleted the optimize-locks-skipped-tablet-uid branch April 9, 2026 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants