daviscook477 opened a new pull request, #12801: URL: https://github.com/apache/lucene/pull/12801
### Description

Within Lucene Monitor, a thread contention issue manifests as multi-second latencies in `Monitor.match` when it is called concurrently with `Monitor.register`/`Monitor.deleteById` and `QueryIndex.purgeCache`. The source of the issue is `purgeLock`, a `ReentrantReadWriteLock` used in three places within `QueryIndex`:

1. `commit` (called by `Monitor.register`/`Monitor.deleteById`) acquires the read lock.
2. `search` (called by `Monitor.match`) acquires the read lock.
3. `purgeCache` (called by a background thread at a fixed interval) acquires the write lock.

`search` calls are very fast and hold the `purgeLock` only briefly. `commit` calls, however, are quite long and can hold the `purgeLock` for multiple seconds. Since both `commit` and `search` need only the read lock, they should ostensibly be able to execute concurrently, and the long duration of `commit` should not impact `search`. That stops being true once `purgeCache` attempts to acquire the write lock: while a writer is waiting, new attempts to acquire the read lock [probabilistically](https://github.com/openjdk/jdk17u-dev/blob/2486ba2899f1e51418fa73f6216b5224b4a348c5/src/java.base/share/classes/java/util/concurrent/locks/ReentrantReadWriteLock.java#L686) wait for the write lock to be acquired and released first (otherwise writers could be starved). This means that if `commit` is holding the read lock when `purgeCache` attempts to acquire the write lock, some calls to `match` that want the read lock will be queued behind the pending write-lock acquisition.
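The queuing behavior described above can be reproduced outside Lucene with the JDK lock alone. The following is a minimal, self-contained sketch (thread names mirror the Monitor operations; none of this is Lucene code): one thread holds the read lock for a long time ("commit"), a second thread queues for the write lock ("purgeCache"), and a third thread ("search") then finds that it cannot acquire the read lock even though only a read lock is held.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReaderBehindWriterDemo {
  public static void main(String[] args) throws Exception {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    CountDownLatch commitHoldsRead = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    // "commit": holds the read lock for a long time
    Thread commit = new Thread(() -> {
      lock.readLock().lock();
      commitHoldsRead.countDown();
      try { release.await(); } catch (InterruptedException ignored) {}
      lock.readLock().unlock();
    });
    commit.start();
    commitHoldsRead.await();

    // "purgeCache": queues a writer behind the held read lock
    Thread purge = new Thread(() -> {
      lock.writeLock().lock();
      lock.writeLock().unlock();
    });
    purge.start();
    // wait until the writer is actually parked in the lock's wait queue
    while (lock.getQueueLength() == 0) Thread.onSpinWait();

    // "search": a fresh reader cannot get past the queued writer, even
    // though only a read lock is currently held
    boolean[] acquired = new boolean[1];
    Thread search = new Thread(() -> {
      try {
        acquired[0] = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired[0]) lock.readLock().unlock();
      } catch (InterruptedException ignored) {}
    });
    search.start();
    search.join();
    System.out.println("search acquired read lock while writer queued: " + acquired[0]);

    release.countDown();
    commit.join();
    purge.join();
  }
}
```

Here the timed `tryLock` returns `false`: the reader parks behind the queued writer, which in turn is blocked on the "commit" reader, which is exactly the chain that delays `Monitor.match`.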
Because `purgeCache` cannot acquire the write lock until `commit` releases the read lock, and because those queued `search` calls cannot acquire the read lock until `purgeCache` has acquired and released the write lock, those `search` calls end up waiting for `commit` to complete. This makes it possible for a `search` call to take as long as a `commit` (multiple seconds rather than milliseconds).

### Diagrams

For illustration, the following two diagrams show the problem. For both, assume that `commit` takes 3 seconds while `purgeCache` and `search` each take 3 milliseconds. The first diagram shows the wait order on the `ReentrantReadWriteLock` in the high-latency match scenario. Most `search` calls will not end up in this scenario, but without the fix included in this PR *some* of them will, and those will be slow. How `search` calls end up being slow is shown more clearly in the second diagram, a timeline containing a multi-second delayed `search` call (under the same runtime assumptions): `Search Thread A` avoids the problem and has the expected latency, while `Search Thread B` runs into it and has very high latency.

### Solution

This issue can be resolved by ensuring that `purgeCache` never attempts to acquire the write lock while `commit` is holding the read lock. A simple mutual-exclusion lock between the two suffices, since:

1. `purgeCache` is not time-sensitive, so it can afford to wait until `commit` completes.
2. `commit` is arguably time-sensitive, but `purgeCache` is so quick that occasionally waiting on it makes no meaningful difference to how long `commit` takes overall.
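The locking scheme described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code; the class, field, and method bodies are placeholders. The key point is that `commit` and `purgeCache` serialize on a separate mutual-exclusion lock, so `purgeCache` can only queue for the write lock when no `commit` is holding the read lock, and `search` (which takes only the read lock) is never parked behind a queued writer for the length of a `commit`.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class QueryIndexSketch {
  private final ReentrantReadWriteLock purgeLock = new ReentrantReadWriteLock();
  // commit and purgeCache exclude each other here, so a writer is never
  // queued on purgeLock while a long commit holds the read lock
  private final ReentrantLock commitLock = new ReentrantLock();

  void commit() {
    commitLock.lock();
    try {
      purgeLock.readLock().lock();
      try {
        // slow: add/delete queries, refresh the index, update caches
      } finally {
        purgeLock.readLock().unlock();
      }
    } finally {
      commitLock.unlock();
    }
  }

  void search() {
    // fast path: only the read lock, never the commitLock
    purgeLock.readLock().lock();
    try {
      // fast: run the match query against the query index
    } finally {
      purgeLock.readLock().unlock();
    }
  }

  void purgeCache() {
    commitLock.lock();
    try {
      purgeLock.writeLock().lock();
      try {
        // fast: drop stale cache entries
      } finally {
        purgeLock.writeLock().unlock();
      }
    } finally {
      commitLock.unlock();
    }
  }

  public static void main(String[] args) {
    QueryIndexSketch q = new QueryIndexSketch();
    q.commit();
    q.search();
    q.purgeCache();
    System.out.println("ok");
  }
}
```

With this arrangement, the only time `search` can wait behind a queued writer is while `purgeCache` itself is running, which takes milliseconds rather than seconds.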
### Test Case / Demo

To demonstrate the problem, this [gist](https://gist.github.com/daviscook477/a1b1fbebe75b0fdc41c889e542a2ed42) first runs `commit` and `search` concurrently while recording the durations of the `search` calls, then runs `commit`, `search`, and `purgeCache` concurrently while again recording `search` durations. Comparing the p99 thru p100 durations with and without the concurrent `purgeCache` exposes the high-latency behavior of combining all three operations.

Without this change, running the gist on my laptop gives:

1. No `purgeCache`: mean of p99 thru p100 is 20ms
2. With `purgeCache`: mean of p99 thru p100 is 651ms

With this change, running the gist on my laptop gives:

1. No `purgeCache`: mean of p99 thru p100 is 16ms
2. With `purgeCache`: mean of p99 thru p100 is 9ms

Even though the latencies in this test are noisy, the order-of-magnitude difference in the mean p99 thru p100 when using this change with `commit`, `search`, and `purgeCache` run concurrently is statistically significant. This issue is important to fix because it prevents Lucene Monitor from delivering reliably low-latency match performance.

### Merge Request

If this PR gets merged, can you please use my `dcoo...@bloomberg.net` email address for the squash+merge. Thank you.
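As a side note on the statistic quoted above, the "mean of p99 thru p100" is the average of the slowest 1% of recorded `search` durations. A small sketch of computing it (the sample values here are made up purely to illustrate a heavy tail, not taken from the gist):

```java
import java.util.Arrays;

public class TailLatency {
  // Mean of the p99-through-p100 slice of a sample of durations (ms).
  static double meanOfTopPercentile(long[] durationsMs, double fromPercentile) {
    long[] sorted = durationsMs.clone();
    Arrays.sort(sorted);
    int from = (int) Math.floor(sorted.length * fromPercentile);
    return Arrays.stream(sorted, from, sorted.length).average().orElse(Double.NaN);
  }

  public static void main(String[] args) {
    long[] sample = new long[200];
    Arrays.fill(sample, 3);   // most searches are fast...
    sample[0] = 3000;         // ...but a few wait out an entire commit
    sample[1] = 2800;
    System.out.println((long) meanOfTopPercentile(sample, 0.99));
  }
}
```

A handful of searches stuck behind a multi-second `commit` dominate this statistic even when the median stays in the millisecond range, which is why it surfaces the bug so clearly.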