apurtell opened a new pull request, #7363: URL: https://github.com/apache/hbase/pull/7363
This patch modifies the `AsyncBufferedMutatorImpl` class to improve its performance under concurrent usage. While `AsyncTable#batch()` is largely asynchronous in nature, it can exhibit blocking behavior during its preparation phase, for instance while looking up region locations. In the original implementation of `AsyncBufferedMutatorImpl`, calls to `AsyncTable#batch()` occur within a synchronized block, potentially causing severe contention and stalling other threads trying to buffer their mutations. The original implementation relied on coarse-grained synchronized blocks for thread safety, so when one thread triggered a buffer flush (either because the buffer was full or a periodic timer fired), all other threads attempting to add mutations via the `mutate` method were blocked until the `table.batch()` call completed, which could take a surprisingly long time.

The new implementation replaces the broad synchronized blocks with a `ReentrantLock`. This lock is acquired only for the brief period needed to safely copy the current batch of mutations and futures into local variables and swap in a new internal buffer. Immediately after this quick operation, the lock is released, and the `batch()` call is then executed outside of the locked section. This allows other threads to continue adding new mutations concurrently while the flush of the previous batch proceeds independently. The client has, by definition, already opted in to asynchronous and potentially interleaved commit of the mutations submitted to `AsyncBufferedMutator`. Minimizing the scope of the critical section minimizes thread contention and significantly boosts throughput under load. Other related profiler-driven efficiency changes are also included, such as elimination of Stream API and array-resizing hotspots identified by the profiler.
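The copy-and-swap idea can be sketched as follows. This is an illustrative simplification, not the actual patch: the class and the `batchFn` callback (standing in for `AsyncTable#batch()`) are hypothetical names, and the real mutator also tracks per-mutation futures and a periodic flush timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Sketch of the narrowed critical section: the lock is held only long enough
// to append to the buffer and, when full, steal the whole batch. The slow
// "batch" call runs after the lock has been released.
public final class SwapAndFlush<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final int maxMutations;
    private final Consumer<List<T>> batchFn; // hypothetical stand-in for AsyncTable#batch
    private List<T> buffer = new ArrayList<>();

    public SwapAndFlush(int maxMutations, Consumer<List<T>> batchFn) {
        this.maxMutations = maxMutations;
        this.batchFn = batchFn;
    }

    public void mutate(T mutation) {
        List<T> toFlush = null;
        lock.lock();
        try {
            buffer.add(mutation);
            if (buffer.size() >= maxMutations) {
                toFlush = buffer;            // steal the full batch...
                buffer = new ArrayList<>();  // ...and swap in an empty one
            }
        } finally {
            lock.unlock(); // critical section ends before any slow work
        }
        if (toFlush != null) {
            batchFn.accept(toFlush); // runs outside the lock; other threads keep buffering
        }
    }
}
```

Because the flush happens entirely outside the lock, a slow `batchFn` delays only the thread that happened to trigger the flush, not every concurrent writer.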
To validate the performance improvement of these changes, a JMH benchmark, `AsyncBufferedMutatorBenchmark`, was created to measure the performance of the `mutate` method under various conditions. It focuses specifically on the overhead and concurrency management of `AsyncBufferedMutatorImpl` itself, not the underlying network communication. To achieve this, it uses the Mockito framework to create a mock `AsyncTable` that instantly returns completed futures, isolating the mutator's buffering logic for measurement. It runs tests with 1, 10, and 100 threads to simulate no, medium, and high levels of concurrency. It uses a low value (100) for `maxMutations` to force frequent flushes based on the number of mutations, and a very high value (100,000) to ensure flushes are rare in that measurement case. The benchmark measures the average time per operation in microseconds, where a lower score indicates better performance and higher throughput.

With a single thread and no contention, the performance of both implementations is nearly identical. The minor variations are negligible and show that the new locking mechanism does not introduce any performance regression in the non-concurrent case. For example, with a 10MB buffer and high `maxMutations`, the NEW implementation scored 0.167 µs/op while the OLD scored 0.169 µs/op, a statistically insignificant difference.

When the test is run with 10 threads, a noticeable gap appears. In the scenario designed to cause frequent flushes (`maxMutations` = 100), the NEW implementation is approximately 12 times faster than the OLD one (14.250 µs/op for NEW vs. 172.463 µs/op for OLD). This is because the OLD implementation forces threads to wait while flushes occur, and flushes incur a synthetic thread sleep of 1 ms to simulate occasional unexpected blocking behavior in `AsyncTable#batch()`, whereas the NEW implementation allows them to proceed without contention.
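The mocked-table technique can be illustrated with plain JDK code (the actual benchmark uses Mockito; the class and method names below are hypothetical): the stub's `batch` returns one already-completed future per mutation, so no network or region-location work is ever on the measured path and the timings reflect only the mutator's own buffering and locking overhead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the benchmark's mocked AsyncTable: every
// submitted mutation gets a future that is complete before batch() returns.
public final class InstantBatchStub {
    public static <T> List<CompletableFuture<Void>> batch(List<T> mutations) {
        List<CompletableFuture<Void>> results = new ArrayList<>(mutations.size());
        for (int i = 0; i < mutations.size(); i++) {
            results.add(CompletableFuture.completedFuture(null)); // done instantly, no I/O
        }
        return results;
    }
}
```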
The most significant results come from the 100-thread tests, which simulate high contention. In the frequent-flush scenario (`maxMutations` = 100), the NEW implementation is 114 times faster in the synthetic benchmark scenario (16.123 µs/op for NEW vs. 1847.567 µs/op for OLD). Note that blocking IO observed in a real client, e.g. for region location lookups, can produce a much more significant impact. With the OLD code, 100 threads are constantly competing for a lock that is held for a long duration, leading to a contention storm. The NEW code's reduced locking scope almost entirely eliminates this bottleneck.

OS: Apple Silicon (aarch64) M1 Max / 64 GB
JVM: openjdk version "17.0.11" 2024-04-16 LTS / OpenJDK 64-Bit Server VM Zulu17.50+19-CA (build 17.0.11+9-LTS, mixed mode, sharing)

| Threads | Max Mutations | OLD Implementation (µs/op) | NEW Implementation (µs/op) | Performance Gain (OLD / NEW) |
| :--- | :--- | :--- | :--- | :--- |
| 1 | 100 | 14.091 | 16.313 | 0.86x (comparable) |
| 1 | 100,000 | 0.169 | 0.167 | 1.01x (comparable) |
| 10 | 100 | 172.463 | 14.250 | 12.10x |
| 10 | 100,000 | 2.465 | 1.072 | 2.30x |
| 100 | 100 | 1847.567 | 16.123 | **114.59x** |
| 100 | 100,000 | 24.125 | 12.796 | 1.89x |
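The contention difference behind these numbers can be demonstrated directly. The sketch below (simplified, not HBase code; class names are illustrative) contrasts the two locking patterns: in the OLD style a slow batch call runs while holding the same monitor that `mutate()` needs, so a concurrent writer parks in `BLOCKED` state for the whole flush, whereas in the NEW style the lock covers only the buffer swap, so `mutate()` returns even while a flush is still in flight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public final class ContentionDemo {

    // OLD pattern: flush holds the monitor for the duration of the batch call.
    static final class OldStyle {
        private List<String> buffer = new ArrayList<>();
        synchronized void mutate(String m) { buffer.add(m); }
        synchronized void flush(Runnable slowBatch) {
            slowBatch.run();            // slow work inside the critical section
            buffer = new ArrayList<>();
        }
    }

    // NEW pattern: lock only to swap buffers; the batch call runs unlocked.
    static final class NewStyle {
        private final ReentrantLock lock = new ReentrantLock();
        private List<String> buffer = new ArrayList<>();
        void mutate(String m) {
            lock.lock();
            try { buffer.add(m); } finally { lock.unlock(); }
        }
        void flush(Runnable slowBatch) {
            lock.lock();
            try { buffer = new ArrayList<>(); } finally { lock.unlock(); }
            slowBatch.run();            // outside the lock: writers are not stalled
        }
    }

    /** Returns true once a concurrent mutate() is observed BLOCKED behind the flush. */
    public static boolean oldMutateBlocksDuringFlush() throws InterruptedException {
        OldStyle old = new OldStyle();
        CountDownLatch inFlush = new CountDownLatch(1);
        CountDownLatch finishFlush = new CountDownLatch(1);
        Thread flusher = new Thread(() -> old.flush(() -> {
            inFlush.countDown();
            try { finishFlush.await(); } catch (InterruptedException ignored) { }
        }));
        flusher.start();
        inFlush.await();                          // flush is now holding the monitor
        Thread mutator = new Thread(() -> old.mutate("m"));
        mutator.start();
        while (mutator.getState() != Thread.State.BLOCKED) Thread.onSpinWait();
        finishFlush.countDown();
        flusher.join();
        mutator.join();
        return true;
    }

    /** Returns true if mutate() completes while a flush is still running. */
    public static boolean newMutateProceedsDuringFlush() throws InterruptedException {
        NewStyle nu = new NewStyle();
        CountDownLatch inFlush = new CountDownLatch(1);
        CountDownLatch finishFlush = new CountDownLatch(1);
        Thread flusher = new Thread(() -> nu.flush(() -> {
            inFlush.countDown();
            try { finishFlush.await(); } catch (InterruptedException ignored) { }
        }));
        flusher.start();
        inFlush.await();                          // flush in progress, lock released
        nu.mutate("m");                           // returns immediately
        boolean flushStillRunning = finishFlush.getCount() == 1;
        finishFlush.countDown();
        flusher.join();
        return flushStillRunning;
    }
}
```

Under 100 threads the OLD pattern multiplies this per-flush stall across every waiting writer, which is the contention storm the benchmark captures.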
