mikemccand commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2176161076
We need to be careful interpreting the QPS results from `luceneutil`: these are not actual red-line (capacity) QPS numbers (CPU is not normally saturated during these runs), but rather "effective QPS". The benchy measures [`t` = median elapsed wall-clock time to run the query (after discarding the 10% slowest outliers and warmup), averaged across all JVMs for each task, and computes and reports the effective QPS as `1.0 / t`](https://github.com/mikemccand/luceneutil/blob/04f7c6ced4621fbbe0bf16a25afa7583fae44f8a/src/python/benchUtil.py#L1301-L1467).

This means that if a run uses more concurrent threads it may finish (in wall-clock elapsed time) faster and so appear to have higher QPS, but that is false: it is net/net a zero-sum game (still burning the same-ish or more total CPU, just spread across more threads).

Except, if Lucene is actually more efficient (lower total CPU) when running more threads at once -- because e.g. collecting the best hits across more segments concurrently means we can stop searching the non-competitive docs across segments sooner -- then that is an actual QPS (red-line / capacity) win, not just a reporting artifact.

With time/improvements this should in fact be a big contributor to more efficient (total CPU) search... I think much innovation remains in this part of Lucene :) E.g. maybe some segments tend to contribute strongly to most queries, and if Lucene could instrument/record this, it should kick off those segments first so they quickly find the competitive docs and quickly cause the other segments to stop early. Intra-segment concurrency is another innovation we have yet to successfully tackle... and explicit early termination (not just BMW) with all this concurrency is more exploration still...

But in your [first run](https://github.com/apache/lucene/pull/13472#issuecomment-2170618011), since you ran `main` with three threads and `branch` with two threads, I think the effective concurrency is the same (three threads actually searching slices?), so this artifact of `luceneutil` QPS reporting can't explain the gains in that run... but it's hard to believe context switching is so costly? Hmm, though your profiling seems to show far fewer calls to `collect`, which might mean cross-segment concurrent efficiency is somehow kicking in? Or, if context switching really explains it all, and `collect` is much faster, it would be sampled less... we can't read too much into the sample counts in profiler output.

> This is main vs main, no concurrency vs 4 threads:

For these results, I would expect to see a ~4X QPS gain (ish) simply because wall-clock elapsed time for the query got ~4X faster (assuming perfect concurrency, which won't happen in practice... it depends of course on the relative sizes of the segments, whether there is a long-pole outlier task limiting the actual concurrency, etc.). So it's interesting that some tasks are > 400% faster -- maybe the added cross-segment concurrent efficiency is contributing here? It's also spooky that some tasks got ~2X slower. Can this really be due to context-switching overhead?

Nightly benchmarks did indeed switch to "concurrent" search but with only 1 worker thread, just to exercise that code path without actually using concurrency. Yet we didn't see slowdowns in the `BrowseXXX` facet tasks similar to those [above](https://github.com/apache/lucene/pull/13472#issuecomment-2173609575), e.g. [`BrowseDateTaxoFacets`](https://home.apache.org/~mikemccand/lucenebench/BrowseDateTaxoFacets.html). So the slowdown doesn't happen just because we are exercising the concurrent path.
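To make the zero-sum point concrete, here is a rough Python sketch of the effective-QPS computation as described above (illustrative only -- the warmup/discard details are simplified, the timing data is made up, and `effective_qps` is a hypothetical helper; see `benchUtil.py` in luceneutil for the real logic):

```python
import statistics

def effective_qps(times_per_jvm, warmup=1, discard_frac=0.10):
    # Per JVM: drop warmup iterations, sort, discard the slowest ~10%,
    # take the median wall-clock time; then average the medians across
    # JVMs and report 1/t as "effective QPS".
    medians = []
    for times in times_per_jvm:
        kept = sorted(times[warmup:])
        kept = kept[: max(1, int(len(kept) * (1 - discard_frac)))]
        medians.append(statistics.median(kept))
    t = sum(medians) / len(medians)
    return 1.0 / t

# Hypothetical per-iteration wall-clock times (seconds) for one JVM.
# Doubling the searcher threads roughly halves wall-clock time, so the
# reported "effective QPS" doubles -- even though total CPU burned per
# query is unchanged.
one_thread = [[0.100, 0.102, 0.098, 0.101]]
two_threads = [[0.050, 0.051, 0.049, 0.050]]
```

This is exactly why the numbers are "effective" rather than capacity QPS: the metric is purely a function of per-query wall-clock latency, not of total CPU consumed.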
Actually, I don't think the `BrowseXXX` tasks even use concurrency at all: if you look at [`SearchTask.java`](https://github.com/mikemccand/luceneutil/blob/04f7c6ced4621fbbe0bf16a25afa7583fae44f8a/src/main/perf/SearchTask.java#L229) it's just calling `FastTaxonomyFacetCounts` on the whole index, which [runs sequentially, segment by segment, I think](https://github.com/apache/lucene/blob/7e31f56ea1130909948bcf9e5beefc161ceef137/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java#L120-L175)? So now I don't understand why this `main vs main (4 threads)` run is showing any slowdown for these `BrowseXXX` tasks... confused.

[I'll open a spinoff issue that Lucene's facet counting should maybe also tap into this executor for concurrent counting. It's likely tricky though...]

@original-brownbear -- what does your Lucene index look like? Can you run `CheckIndex` and share the output from your runs? I'm curious about the actual segment geometry... also, make sure you are sharing the same single index across `main` and `branch`.
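To illustrate the cross-segment efficiency point raised earlier -- that collecting hits from many segments into one shared top-k queue raises the min-competitive threshold sooner -- here is a toy Python model. Everything here is hypothetical (made-up scores, a made-up `count_collects` helper); real Lucene collection with BMW block skipping is far more involved:

```python
import heapq

def count_collects(scores, k=10):
    # Toy top-k collection: a hit "collects" (enters the shared
    # priority queue) only if it beats the current k-th best score.
    # Not Lucene code -- just a model of the shared-queue effect.
    heap = []  # min-heap holding the current top-k scores
    collects = 0
    for score in scores:
        if len(heap) < k or score > heap[0]:
            collects += 1
            heapq.heappush(heap, score)
            if len(heap) > k:
                heapq.heappop(heap)
    return collects

# Hypothetical segments: one whose docs dominate the top-k, two weak ones.
strong = [100.0 + i for i in range(100)]
weak1 = [0.1 * i for i in range(100)]
weak2 = [0.2 * i for i in range(100)]

# Sequential: both weak segments are fully searched before the strong
# one has raised the bar, so many non-competitive hits still collect.
seq = count_collects(weak1 + weak2 + strong)

# Round-robin across segments (a stand-in for concurrent slices sharing
# one queue): the strong segment raises the threshold early, so the
# weak segments' hits are rejected much sooner.
inter = count_collects([s for trio in zip(weak1, weak2, strong) for s in trio])
```

In this toy model the interleaved order collects far fewer hits than the sequential order because the strong segment raises the bar early -- the same intuition as kicking off the "strong" segments first so the others can stop searching non-competitive docs sooner.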