daviscook477 opened a new pull request, #12801: URL: https://github.com/apache/lucene/pull/12801
### Description

Within Lucene Monitor, a thread contention issue manifests as multi-second latencies in `Monitor.match` when it is called concurrently with `Monitor.register`/`Monitor.deleteById` and `QueryIndex.purgeCache`. The source of the issue is `purgeLock`, a `ReentrantReadWriteLock` used in three places within `QueryIndex`:

1. `commit` (called by `Monitor.register`/`Monitor.deleteById`) acquires the read lock.
2. `search` (called by `Monitor.match`) acquires the read lock.
3. `purgeCache` (called by a background thread at a fixed interval) acquires the write lock.

`search` calls are very fast and hold the `purgeLock` only briefly. `commit` calls, however, are quite long and can hold the `purgeLock` for multiple seconds. Since both `commit` and `search` need only the read lock, they should ostensibly be able to execute concurrently, and the long duration of `commit` should not impact `search`. That stops being true once `purgeCache` attempts to acquire the write lock: while a writer is waiting, new attempts to acquire the read lock [probabilistically](https://github.com/openjdk/jdk17u-dev/blob/2486ba2899f1e51418fa73f6216b5224b4a348c5/src/java.base/share/classes/java/util/concurrent/locks/ReentrantReadWriteLock.java#L686) wait for the write lock to be acquired and released first (otherwise writers could be starved). This means that if `commit` is holding the read lock when `purgeCache` attempts to acquire the write lock, some calls to `match` that want the read lock will be queued behind the pending write-lock acquisition.
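The queuing behavior described above can be reproduced outside Lucene with the JDK lock alone. The following is a minimal, self-contained sketch (thread names mirror the Monitor operations; none of this is Lucene code): one thread holds the read lock for a long time ("commit"), a second thread queues for the write lock ("purgeCache"), and a third thread ("search") then finds that it cannot acquire the read lock even though only a read lock is held.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReaderBehindWriterDemo {
  public static void main(String[] args) throws Exception {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    CountDownLatch commitHoldsRead = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    // "commit": holds the read lock for a long time
    Thread commit = new Thread(() -> {
      lock.readLock().lock();
      commitHoldsRead.countDown();
      try { release.await(); } catch (InterruptedException ignored) {}
      lock.readLock().unlock();
    });
    commit.start();
    commitHoldsRead.await();

    // "purgeCache": queues a writer behind the held read lock
    Thread purge = new Thread(() -> {
      lock.writeLock().lock();
      lock.writeLock().unlock();
    });
    purge.start();
    // wait until the writer is actually parked in the lock's wait queue
    while (lock.getQueueLength() == 0) Thread.onSpinWait();

    // "search": a fresh reader cannot get past the queued writer, even
    // though only a read lock is currently held
    boolean[] acquired = new boolean[1];
    Thread search = new Thread(() -> {
      try {
        acquired[0] = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired[0]) lock.readLock().unlock();
      } catch (InterruptedException ignored) {}
    });
    search.start();
    search.join();
    System.out.println("search acquired read lock while writer queued: " + acquired[0]);

    release.countDown();
    commit.join();
    purge.join();
  }
}
```

Here the timed `tryLock` returns `false`: the reader parks behind the queued writer, which in turn is blocked on the "commit" reader, which is exactly the chain that delays `Monitor.match`.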
Because `purgeCache` cannot acquire the write lock until `commit` releases the read lock, and because those queued `search` calls cannot acquire the read lock until `purgeCache` has acquired and released the write lock, those `search` calls end up waiting for `commit` to complete. This makes it possible for a `search` call to take as long as a `commit` (multiple seconds rather than milliseconds).

### Diagrams

For illustration, the following two diagrams show the problem. For both, assume that `commit` takes 3 seconds while `purgeCache` and `search` each take 3 milliseconds. The first diagram shows the wait order on the `ReentrantReadWriteLock` in the high-latency match scenario. Most `search` calls will not end up in this scenario, but without the fix included in this PR *some* of them will, and those will be slow. How `search` calls end up being slow is shown more clearly in the second diagram, a timeline containing a multi-second delayed `search` call (under the same runtime assumptions): `Search Thread A` avoids the problem and has the expected latency, while `Search Thread B` runs into it and has very high latency.

### Solution

This issue can be resolved by ensuring that `purgeCache` never attempts to acquire the write lock while `commit` is holding the read lock. A simple mutual-exclusion lock between the two suffices, since:

1. `purgeCache` is not time-sensitive, so it can afford to wait until `commit` completes.
2. `commit` is arguably time-sensitive, but `purgeCache` is so quick that occasionally waiting on it makes no meaningful difference to how long `commit` takes overall.
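The locking scheme described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code; the class, field, and method bodies are placeholders. The key point is that `commit` and `purgeCache` serialize on a separate mutual-exclusion lock, so `purgeCache` can only queue for the write lock when no `commit` is holding the read lock, and `search` (which takes only the read lock) is never parked behind a queued writer for the length of a `commit`.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class QueryIndexSketch {
  private final ReentrantReadWriteLock purgeLock = new ReentrantReadWriteLock();
  // commit and purgeCache exclude each other here, so a writer is never
  // queued on purgeLock while a long commit holds the read lock
  private final ReentrantLock commitLock = new ReentrantLock();

  void commit() {
    commitLock.lock();
    try {
      purgeLock.readLock().lock();
      try {
        // slow: add/delete queries, refresh the index, update caches
      } finally {
        purgeLock.readLock().unlock();
      }
    } finally {
      commitLock.unlock();
    }
  }

  void search() {
    // fast path: only the read lock, never the commitLock
    purgeLock.readLock().lock();
    try {
      // fast: run the match query against the query index
    } finally {
      purgeLock.readLock().unlock();
    }
  }

  void purgeCache() {
    commitLock.lock();
    try {
      purgeLock.writeLock().lock();
      try {
        // fast: drop stale cache entries
      } finally {
        purgeLock.writeLock().unlock();
      }
    } finally {
      commitLock.unlock();
    }
  }

  public static void main(String[] args) {
    QueryIndexSketch q = new QueryIndexSketch();
    q.commit();
    q.search();
    q.purgeCache();
    System.out.println("ok");
  }
}
```

With this arrangement, the only time `search` can wait behind a queued writer is while `purgeCache` itself is running, which takes milliseconds rather than seconds.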
### Test Case / Demo

To demonstrate the problem, this [gist](https://gist.github.com/daviscook477/a1b1fbebe75b0fdc41c889e542a2ed42) first runs `commit` and `search` concurrently while recording the durations of the `search` calls, then runs `commit`, `search`, and `purgeCache` concurrently while again recording `search` durations. Comparing the p99 thru p100 durations with and without the concurrent `purgeCache` exposes the high-latency behavior of combining all three operations.

Without this change, running the gist on my laptop gives:

1. No `purgeCache`: mean of p99 thru p100 is 20ms
2. With `purgeCache`: mean of p99 thru p100 is 651ms

With this change, running the gist on my laptop gives:

1. No `purgeCache`: mean of p99 thru p100 is 16ms
2. With `purgeCache`: mean of p99 thru p100 is 9ms

Even though the latencies in this test are noisy, the order-of-magnitude difference in the mean p99 thru p100 when using this change with `commit`, `search`, and `purgeCache` run concurrently is statistically significant. This issue is important to fix because it prevents Lucene Monitor from delivering reliably low-latency match performance.

### Merge Request

If this PR gets merged, can you please use my `dcoo...@bloomberg.net` email address for the squash+merge. Thank you.
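As a side note on the statistic quoted above, the "mean of p99 thru p100" is the average of the slowest 1% of recorded `search` durations. A small sketch of computing it (the sample values here are made up purely to illustrate a heavy tail, not taken from the gist):

```java
import java.util.Arrays;

public class TailLatency {
  // Mean of the p99-through-p100 slice of a sample of durations (ms).
  static double meanOfTopPercentile(long[] durationsMs, double fromPercentile) {
    long[] sorted = durationsMs.clone();
    Arrays.sort(sorted);
    int from = (int) Math.floor(sorted.length * fromPercentile);
    return Arrays.stream(sorted, from, sorted.length).average().orElse(Double.NaN);
  }

  public static void main(String[] args) {
    long[] sample = new long[200];
    Arrays.fill(sample, 3);   // most searches are fast...
    sample[0] = 3000;         // ...but a few wait out an entire commit
    sample[1] = 2800;
    System.out.println((long) meanOfTopPercentile(sample, 0.99));
  }
}
```

A handful of searches stuck behind a multi-second `commit` dominate this statistic even when the median stays in the millisecond range, which is why it surfaces the bug so clearly.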