daviscook477 opened a new pull request, #12801:
URL: https://github.com/apache/lucene/pull/12801

   ### Description
   Within Lucene Monitor, there is a thread contention issue that manifests as 
multi-second latencies in the `Monitor.match` function when it is called 
concurrently with `Monitor.register/Monitor.deleteById` and 
`QueryIndex.purgeCache`. The source of the issue is the usage of `purgeLock`, a 
ReentrantReadWriteLock.
   
   Within the QueryIndex, there are 3 usages of the `purgeLock`. They are 
`commit` (called by `Monitor.register/Monitor.deleteById`), ‘search’ (called by 
Monitor.match), and `purgeCache` (called by a background thread at a consistent 
interval). Within `commit` and `search`, the read lock is acquired, and within 
`purgeCache`, the write lock is acquired. `search` calls are very fast and only 
hold the `purgeLock` for short durations. However, `commit` calls are quite 
long and will hold the `purgeLock` for multi-second durations. Ostensibly 
because both `commit` and `search` only need the read lock, they should be able 
to execute concurrently and the long duration of `commit` should not impact 
`search`. However, when `purgeCache` attempts to acquire the write lock this no 
longer holds true. This is because once `purgeCache` is waiting to acquire the 
write lock, attempts to acquire the read lock 
[probabilistically](https://github.com/openjdk/jdk17u-dev/blob/2486ba2899f1e51418fa73f6216
 
b5224b4a348c5/src/java.base/share/classes/java/util/concurrent/locks/ReentrantReadWriteLock.java#L686)
 might wait for the write lock to be acquired + released (otherwise attempts to 
acquire the write lock would be starved). What this means is that if `commit` 
is holding the read lock, and then `purgeCache` attempts to acquire the write 
lock, some amount of calls to `match` that want to acquire the read lock will 
be queued behind the `purgeCache` attempt to acquire the write lock. Because 
`purgeCache` cannot acquire the write lock until `commit` releases the read 
lock, and because those `search` calls cannot acquire the read lock until 
`purgeCache` acquires + releases the write lock, those `search` calls end up 
having to wait for `commit` to complete. This makes it possible for `search` 
calls to take as long as `commit` to complete (multi-second rather than 
millisecond).
   
   ### Diagrams
   For illustration purposes I have the following 2 diagrams of the problem. 
The first diagram demonstrates the wait order for the `ReentrantReadWriteLock` 
in the high latency match scenario. For the purpose of the diagram I am 
assuming that `commit` takes 3 seconds, `purgeCache` takes 3 milliseconds, and 
`search` takes 3 milliseconds as well. Most of the calls to `search` will not 
end up in this scenario, but without the fix included in this PR *some* of the 
calls to `search` will end up in this scenario and be slow.
   
![image](https://github.com/apache/lucene/assets/10566967/e05fe134-2ea2-4f05-8319-64773ae4feff)
   The way in which the calls to `search` end up being slow is more clearly 
shown in the second diagram which shows a timeline that has a multisecond 
delayed `search` call - making the same assumptions about the runtime of the 3 
operations as the first diagram. In the below diagram, `Search Thread A` avoids 
the problem and has expected latency, but `Search Thread B` runs into the 
problem and has very high latency.
   
![image](https://github.com/apache/lucene/assets/10566967/7c462aa3-18fc-4183-a2d8-d98d8341f489)
   
   ### Solution
   This issue can be resolved by ensuring that `purgeCache` never attempts to 
acquire the write lock when `commit` is holding the read lock. This can be done 
by  simply using a mutual exclusion lock between the two since:
   1. `purgeCache` is not time-sensitive so it can afford to wait until 
`commit` completes.
   2. `commit` is arguably time sensitive but `purgeCache` is a very quick 
operation so sometimes having to wait on it won’t make a meaningful difference 
in the time it takes for `commit` to complete by itself.
   
   ### Test Case / Demo
   To demonstrate the problem I have this 
[gist](https://gist.github.com/daviscook477/a1b1fbebe75b0fdc41c889e542a2ed42) 
that runs `commit` and `search` concurrently while recording the durations of 
the calls to `search`. Then it runs `commit`, `search` and `purgeCache` 
concurrently while recording the durations of `search`. Comparing the p99 - 
p100 durations between when `purgeCache` wasn’t run concurrently vs when it was 
run concurrently shows the high latency behavior for combining all 3 operations.
   
   Without this change, running the gist on my laptop gives:
   1. No `purgeCache` - mean of p99 thru p100 is 20ms
   2. Yes `purgeCache` - mean of p99 thru p100 is 651ms
   
   With this change, running the gist on my laptop gives:
   1. No `purgeCache` - mean of p99 thru p100 is 16ms
   2. Yes `purgeCache` - mean of p99 thru p100 is 9ms
   
   Even though the latencies in this test are noisy, the order of magnitude 
difference in the mean p99 thru p100 when using this change with `commit`, 
`search` and `purgeCache` run concurrently is statistically significant. This 
issue is important to fix because it prevents Lucene Monitor from having 
reliable low latency match performance.
   
   ### Merge Request
   If this PR gets merged, can you please use my `dcoo...@bloomberg.net` email 
address for the squash+merge. Thank you.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to