Apache9 commented on PR #7363: URL: https://github.com/apache/hbase/pull/7363#issuecomment-3381440722
> > In general, all rpc request like locating a region should be an asynchronous call, do you have more details on what makes the table.batch call blocking for a long time?
>
> I assume we're not waiting for any RPC responses on the `internalFlush` thread, but with large enough buffers and large enough numbers of mutations and high enough concurrency on incoming mutations, it seems even `AsyncBatchRpcRetryingCaller#groupAndSend` can take long enough (milliseconds?) to decrease overall throughput.
>
> If the region location is in the cache, then the future completes synchronously in `AsyncNonMetaRegionLocator#getRegionLocationsInternal`, which allows more work under the `synchronized` block, which prevents further mutations from being accepted.
>
> https://github.com/apache/hbase/blob/d0b94780f509156c66ea9d297a89e73940257eb1/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncNonMetaRegionLocator.java#L504-L513
>
> When the number of mutations could be >100,000 and number of region locations could be >10,000, and most of those locations are in the cache, `groupAndSend` of those Multi RPCs yields a non-trivial amount of work.

So maybe we should check the depth of the call stack? Netty has a trick for this: if a future keeps completing synchronously and builds up a very deep call stack, it force-schedules an asynchronous task, which prevents a stack overflow and also reduces the blocking execution time.
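To make the idea concrete, here is a rough sketch of such a depth guard. It is not Netty's or HBase's actual code: `DepthLimitedRunner`, `runChain`, the limit of 8 and the queue standing in for an event loop are all made up for illustration. A callback runs inline while the per-thread nesting depth is below the limit; beyond that it is handed to an executor, so the stack unwinds before the work continues.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

/** Hypothetical helper for illustration only; not part of HBase or Netty. */
class DepthLimitedRunner {
  // Per-thread count of callbacks currently nested on the stack.
  private static final ThreadLocal<Integer> DEPTH = ThreadLocal.withInitial(() -> 0);

  private final int maxDepth;
  private final Executor fallback;
  private int deferred; // number of callbacks handed to the fallback executor

  DepthLimitedRunner(int maxDepth, Executor fallback) {
    this.maxDepth = maxDepth;
    this.fallback = fallback;
  }

  static int currentDepth() {
    return DEPTH.get();
  }

  /** Runs the callback inline if the stack is shallow, otherwise defers it. */
  void run(Runnable callback) {
    int depth = DEPTH.get();
    if (depth < maxDepth) {
      DEPTH.set(depth + 1);
      try {
        callback.run();
      } finally {
        DEPTH.set(depth);
      }
    } else {
      deferred++;
      // Too deep: let the current stack unwind and continue from the executor.
      fallback.execute(() -> run(callback));
    }
  }

  /**
   * Simulates a chain of `total` callbacks that each complete synchronously
   * and trigger the next one. Returns {stepsDone, maxDepthSeen, deferredCount}.
   */
  static int[] runChain(int total, int maxDepth) {
    Queue<Runnable> loop = new ArrayDeque<>(); // stand-in for an event loop
    DepthLimitedRunner runner = new DepthLimitedRunner(maxDepth, loop::add);
    int[] state = new int[2]; // [0] = steps done, [1] = max depth seen
    Runnable[] step = new Runnable[1];
    step[0] = () -> {
      state[1] = Math.max(state[1], currentDepth());
      if (++state[0] < total) {
        runner.run(step[0]);
      }
    };
    runner.run(step[0]);
    // Drain deferred work only after the inline chain has fully unwound.
    Runnable task;
    while ((task = loop.poll()) != null) {
      task.run();
    }
    return new int[] {state[0], state[1], runner.deferred};
  }

  public static void main(String[] args) {
    int[] r = runChain(100_000, 8);
    System.out.println("steps=" + r[0] + " maxDepth=" + r[1] + " deferred=" + r[2]);
  }
}
```

With the guard in place the chain of 100,000 synchronous completions never nests deeper than the limit. The cost is an extra hop through the executor every few callbacks. For the case discussed here, that hop would also be the point where the `synchronized` block could be released, although whether that fits `groupAndSend` would need checking against the real code.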
