apurtell opened a new pull request, #7363:
URL: https://github.com/apache/hbase/pull/7363

   This patch modifies the `AsyncBufferedMutatorImpl` class to improve its performance under concurrent usage.
   
   While `AsyncTable#batch()` is largely asynchronous in nature, it can exhibit blocking behavior during its preparation phase, for instance while looking up region locations. The original implementation of `AsyncBufferedMutatorImpl` relied on coarse-grained synchronized blocks for thread safety, with calls to `AsyncTable#batch()` occurring inside a synchronized block. As a result, when one thread triggered a buffer flush (either because the buffer was full or because the periodic flush timer fired), all other threads attempting to add mutations via the `mutate` method were blocked until the `batch()` call completed, which could take a surprisingly long time and cause severe contention.
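   The old shape, as a toy sketch (illustrative only, not the actual HBase source; `CoarseGrainedBuffer` and `submitBatch` are hypothetical stand-ins for the mutator internals and the potentially blocking `AsyncTable#batch()` path):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the coarse-grained locking shape described above.
// Because the flush happens inside the synchronized method, every
// concurrent mutate() caller stalls for the full duration of the flush.
class CoarseGrainedBuffer<M> {
  private List<M> buffer = new ArrayList<>();
  private final int maxMutations;

  CoarseGrainedBuffer(int maxMutations) {
    this.maxMutations = maxMutations;
  }

  public synchronized void mutate(M mutation) {
    buffer.add(mutation);
    if (buffer.size() >= maxMutations) {
      submitBatch(buffer); // stand-in for AsyncTable#batch(); may block
      buffer = new ArrayList<>();
    }
  }

  private void submitBatch(List<M> batch) {
    // placeholder for the potentially blocking preparation and submission
  }
}
```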
   
   The new implementation replaces the broad synchronized blocks with a `ReentrantLock`. The lock is acquired only for the brief period needed to safely copy the current batch of mutations and futures into local variables and swap in a fresh internal buffer, and it is released immediately afterward. The `batch()` call then executes outside the locked section, allowing other threads to continue adding new mutations concurrently while the flush of the previous batch proceeds independently. The client has, by definition, already opted in to asynchronous and potentially interleaved commit of the mutations submitted to `AsyncBufferedMutator`. Minimizing the scope of the critical section minimizes thread contention and significantly boosts throughput under load. Other profiler-driven efficiency changes are also included, such as the elimination of Stream API and array-resizing hotspots identified by the profiler.
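   A minimal sketch of the reduced critical section (again illustrative and omitting the futures bookkeeping; `LockSwapBuffer` and `submitBatch` are hypothetical stand-ins, not the actual patch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of the new shape: the lock covers only the copy-and-swap
// of the internal buffer, and the potentially blocking batch submission
// happens after the lock has been released.
class LockSwapBuffer<M> {
  private final ReentrantLock lock = new ReentrantLock();
  private List<M> buffer = new ArrayList<>();
  private final int maxMutations;

  LockSwapBuffer(int maxMutations) {
    this.maxMutations = maxMutations;
  }

  public void mutate(M mutation) {
    List<M> toFlush = null;
    lock.lock();
    try {
      buffer.add(mutation);
      if (buffer.size() >= maxMutations) {
        toFlush = buffer;           // take ownership of the full batch
        buffer = new ArrayList<>(); // swap in a fresh buffer
      }
    } finally {
      lock.unlock();
    }
    if (toFlush != null) {
      // Stand-in for AsyncTable#batch(), now executed outside the lock,
      // so other threads can keep buffering while the flush proceeds.
      submitBatch(toFlush);
    }
  }

  private void submitBatch(List<M> batch) {
    // placeholder for the asynchronous submission path
  }
}
```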
   
   To validate the performance improvement of these changes, a JMH benchmark, `AsyncBufferedMutatorBenchmark`, was created to measure the performance of the `mutate` method under various conditions. It focuses specifically on the overhead and concurrency management of `AsyncBufferedMutatorImpl` itself, not the underlying network communication. To achieve this, it uses the Mockito framework to create a mock `AsyncTable` that instantly returns completed futures, isolating the mutator's buffering logic for measurement. It runs tests with 1, 10, and 100 threads to simulate no, medium, and high concurrency. It uses a low `maxMutations` value (100) to force frequent flushes based on the number of mutations, and a very high value (100,000) to ensure flushes are rare in that measurement case. The benchmark measures the average time per operation in microseconds; a lower score indicates better performance and higher throughput.
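   A skeleton of the benchmark approach (illustrative; `createMutator` is a hypothetical placeholder for however the benchmark actually constructs `AsyncBufferedMutatorImpl` around the mock table):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

import org.apache.hadoop.hbase.client.AsyncBufferedMutator;
import org.apache.hadoop.hbase.client.AsyncTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.mockito.Mockito;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class AsyncBufferedMutatorBenchmarkSketch {

  @Param({ "100", "100000" })
  int maxMutations;

  AsyncBufferedMutator mutator;

  @Setup
  public void setup() {
    // Mock AsyncTable whose batch() returns already-completed futures, so
    // only the mutator's buffering and locking overhead is measured. (The
    // real benchmark also injects a synthetic 1 ms sleep into flushes to
    // simulate occasional blocking in AsyncTable#batch().)
    AsyncTable<?> table = Mockito.mock(AsyncTable.class);
    Mockito.when(table.batch(Mockito.anyList())).thenAnswer(inv -> {
      List<?> actions = inv.getArgument(0);
      return actions.stream()
        .map(a -> CompletableFuture.completedFuture(null))
        .collect(Collectors.toList());
    });
    mutator = createMutator(table, maxMutations);
  }

  @Benchmark
  @Threads(100) // the real benchmark also runs with 1 and 10 threads
  public CompletableFuture<Void> mutate() {
    // Returning the future prevents JMH dead-code elimination.
    return mutator.mutate(new Put(Bytes.toBytes("row"))
      .addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v")));
  }

  private AsyncBufferedMutator createMutator(AsyncTable<?> table, int max) {
    // Hypothetical placeholder: wire an AsyncBufferedMutatorImpl (or
    // equivalent) around the mock table here; the real construction
    // depends on HBase-internal APIs.
    throw new UnsupportedOperationException("construct the mutator here");
  }
}
```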
   
   With a single thread and no contention, the performance of both implementations is nearly identical. The minor variations are negligible and show that the new locking mechanism does not introduce a performance regression in the non-concurrent case. For example, with a 10 MB buffer and high `maxMutations`, the NEW implementation scored 0.167 µs/op while the OLD scored 0.169 µs/op, a statistically insignificant difference.
   
   When the test is run with 10 threads, a noticeable gap appears. In the scenario designed to cause frequent flushes (`maxMutations` = 100), the NEW implementation is approximately 12 times faster than the OLD one (14.250 µs/op vs. 172.463 µs/op). This is because the OLD implementation forces threads to wait while flushes occur, and each flush incurs a synthetic 1 ms thread sleep to simulate occasional unexpected blocking behavior in `AsyncTable#batch()`, whereas the NEW implementation allows threads to proceed without contention.
   
   The most significant results come from the 100-thread tests, which simulate high contention. In the frequent-flush scenario (`maxMutations` = 100), the NEW implementation is roughly 114 times faster in the synthetic benchmark (16.123 µs/op vs. 1847.567 µs/op). Note that the blocking IO observed in a real client, for example during region location lookups, can have a much larger impact. With the OLD code, 100 threads constantly compete for a lock that is held for a long duration, leading to a contention storm. The NEW code's reduced locking scope almost entirely eliminates this bottleneck.
   
   Hardware: Apple Silicon (aarch64) M1 Max / 64 GB RAM
   
   JVM: openjdk version "17.0.11" 2024-04-16 LTS / OpenJDK 64-Bit Server VM Zulu17.50+19-CA (build 17.0.11+9-LTS, mixed mode, sharing)
   
   
   | Threads | Max Mutations | OLD Implementation (µs/op) | NEW Implementation (µs/op) | Performance Gain (OLD / NEW) |
   | :--- | :--- | :--- | :--- | :--- |
   | 1 | 100 | 14.091 | 16.313 | 0.86x (comparable) |
   | 1 | 100,000 | 0.169 | 0.167 | 1.01x (comparable) |
   | 10 | 100 | 172.463 | 14.250 | 12.10x |
   | 10 | 100,000 | 2.465 | 1.072 | 2.30x |
   | 100 | 100 | 1847.567 | 16.123 | **114.59x** |
   | 100 | 100,000 | 24.125 | 12.796 | 1.89x |
   

