mikemccand commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2176161076
We need to be careful interpreting the QPS results from `luceneutil`: these are not actual red-line (capacity) QPS numbers (CPU is not normally saturated during these runs), but rather "effective QPS". The benchy measures [`t` = median elapsed wall-clock time to run the query (after discarding the 10% slowest outliers and warmup), averaged across all JVMs for each task, and computes and reports the effective QPS as `1.0 / t`](https://github.com/mikemccand/luceneutil/blob/04f7c6ced4621fbbe0bf16a25afa7583fae44f8a/src/python/benchUtil.py#L1301-L1467).

This means that if a run uses more concurrent threads it may finish (in wall-clock elapsed time) faster and so appear to have higher QPS, but that is false: it is net/net a zero-sum game (still burning the same-ish or more total CPU, just spread across more threads).

Except, if Lucene is actually more efficient (lower total CPU) when running more threads at once -- because e.g. collecting the best hits across more segments concurrently means we can stop searching the non-competitive docs across segments sooner -- then that is an actual QPS (red-line / capacity) win, not just a reporting artifact.

With time/improvements this should in fact be a big contributor to more efficient (total CPU) search... I think much innovation remains in this part of Lucene :) E.g. maybe some segments tend to contribute strongly to most queries, and if Lucene could instrument/record this, it should kick off those segments first so they quickly find the competitive docs and quickly cause the other segments to stop early. Intra-segment concurrency is another innovation we have yet to successfully tackle... and explicit early termination (not just BMW) with all this concurrency is more exploration still...

But in your [first run](https://github.com/apache/lucene/pull/13472#issuecomment-2170618011), since you ran `main` with three threads and `branch` with two threads, I think the effective concurrency is the same (three threads actually searching slices?), so this artifact of `luceneutil` QPS reporting can't explain the gains in that run... but it's hard to believe context switching is so costly? Hmm, though your profiling seems to show far fewer calls to `collect`, which might mean cross-segment concurrent efficiency is somehow kicking in? Or, if context switching really explains it all, and `collect` is much faster, it would be sampled less... we can't read too much into the sample counts in profiler output.

> This is main vs main, no concurrency vs 4 threads:

For these results, I would expect to see a ~4X QPS gain (ish) simply because wall-clock elapsed time for the query got ~4X faster (assuming perfect concurrency, which won't happen in practice... it depends of course on the relative sizes of the segments, whether there is a long-pole outlier task limiting the actual concurrency, etc.). So it's interesting that some tasks are > 400% faster -- maybe the added cross-segment concurrent efficiency is contributing here? It's also spooky that some tasks got ~2X slower. Can this really be due to context-switching overhead?

Nightly benchmarks did indeed switch to "concurrent" search but with only 1 worker thread, just to exercise that code path without actually using concurrency. Yet we didn't see slowdowns in the `BrowseXXX` facet tasks similar to those [above](https://github.com/apache/lucene/pull/13472#issuecomment-2173609575), e.g. [`BrowseDateTaxoFacets`](https://home.apache.org/~mikemccand/lucenebench/BrowseDateTaxoFacets.html). So the slowdown doesn't happen just because we are exercising the concurrent path.
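To make the zero-sum point concrete, here is a rough Python sketch of the effective-QPS computation as described above (illustrative only -- the warmup/discard details are simplified, the timing data is made up, and `effective_qps` is a hypothetical helper; see `benchUtil.py` in luceneutil for the real logic):

```python
import statistics

def effective_qps(times_per_jvm, warmup=1, discard_frac=0.10):
    # Per JVM: drop warmup iterations, sort, discard the slowest ~10%,
    # take the median wall-clock time; then average the medians across
    # JVMs and report 1/t as "effective QPS".
    medians = []
    for times in times_per_jvm:
        kept = sorted(times[warmup:])
        kept = kept[: max(1, int(len(kept) * (1 - discard_frac)))]
        medians.append(statistics.median(kept))
    t = sum(medians) / len(medians)
    return 1.0 / t

# Hypothetical per-iteration wall-clock times (seconds) for one JVM.
# Doubling the searcher threads roughly halves wall-clock time, so the
# reported "effective QPS" doubles -- even though total CPU burned per
# query is unchanged.
one_thread = [[0.100, 0.102, 0.098, 0.101]]
two_threads = [[0.050, 0.051, 0.049, 0.050]]
```

This is exactly why the numbers are "effective" rather than capacity QPS: the metric is purely a function of per-query wall-clock latency, not of total CPU consumed.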
Actually, I don't think the `BrowseXXX` tasks even use concurrency at all: if you look at [`SearchTask.java`](https://github.com/mikemccand/luceneutil/blob/04f7c6ced4621fbbe0bf16a25afa7583fae44f8a/src/main/perf/SearchTask.java#L229) it's just calling `FastTaxonomyFacetCounts` on the whole index, which [runs sequentially, segment by segment, I think](https://github.com/apache/lucene/blob/7e31f56ea1130909948bcf9e5beefc161ceef137/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java#L120-L175)? So now I don't understand why this `main vs main (4 threads)` run is showing any slowdown for these `BrowseXXX` tasks... confused.

[I'll open a spinoff issue that Lucene's facet counting should maybe also tap into this executor for concurrent counting. It's likely tricky though...]

@original-brownbear -- what does your Lucene index look like? Can you run `CheckIndex` and share the output from your runs? I'm curious about the actual segment geometry... also, make sure you are sharing the same single index across `main` and `branch`.
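To illustrate the cross-segment efficiency point raised earlier -- that collecting hits from many segments into one shared top-k queue raises the min-competitive threshold sooner -- here is a toy Python model. Everything here is hypothetical (made-up scores, a made-up `count_collects` helper); real Lucene collection with BMW block skipping is far more involved:

```python
import heapq

def count_collects(scores, k=10):
    # Toy top-k collection: a hit "collects" (enters the shared
    # priority queue) only if it beats the current k-th best score.
    # Not Lucene code -- just a model of the shared-queue effect.
    heap = []  # min-heap holding the current top-k scores
    collects = 0
    for score in scores:
        if len(heap) < k or score > heap[0]:
            collects += 1
            heapq.heappush(heap, score)
            if len(heap) > k:
                heapq.heappop(heap)
    return collects

# Hypothetical segments: one whose docs dominate the top-k, two weak ones.
strong = [100.0 + i for i in range(100)]
weak1 = [0.1 * i for i in range(100)]
weak2 = [0.2 * i for i in range(100)]

# Sequential: both weak segments are fully searched before the strong
# one has raised the bar, so many non-competitive hits still collect.
seq = count_collects(weak1 + weak2 + strong)

# Round-robin across segments (a stand-in for concurrent slices sharing
# one queue): the strong segment raises the threshold early, so the
# weak segments' hits are rejected much sooner.
inter = count_collects([s for trio in zip(weak1, weak2, strong) for s in trio])
```

In this toy model the interleaved order collects far fewer hits than the sequential order because the strong segment raises the bar early -- the same intuition as kicking off the "strong" segments first so the others can stop searching non-competitive docs sooner.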