mikemccand commented on PR #13472:
URL: https://github.com/apache/lucene/pull/13472#issuecomment-2176161076

   We need to be careful interpreting the QPS results from `luceneutil`:
   
   These are not actual red-line (capacity) QPS numbers (CPU is not normally 
saturated during these runs), but rather "effective QPS".  The benchy measures 
`t`, the median elapsed wall-clock time to run the query (after discarding the 
10% slowest outliers and warmup), averaged across all JVMs for each task, and 
then [computes and reports the effective QPS as `1.0 / 
t`](https://github.com/mikemccand/luceneutil/blob/04f7c6ced4621fbbe0bf16a25afa7583fae44f8a/src/python/benchUtil.py#L1301-L1467).
  E.g. this means a run using more concurrent threads may finish faster (in 
wall-clock elapsed time) and appear to have higher QPS, but that is misleading 
since this is net/net a zero-sum game (still burning the same-ish or more total 
CPU, just spread across more threads).
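
   To make that arithmetic concrete, here is a toy sketch (all numbers are 
invented, purely to illustrate the zero-sum point):

```java
// Toy illustration of "effective QPS" vs. true capacity (all numbers invented).
public class EffectiveQpsSketch {
  public static void main(String[] args) {
    double tOneThread = 0.100;   // 100 ms median wall-clock with 1 search thread
    double tFourThreads = 0.025; // 25 ms with 4 threads, assuming perfect scaling

    // luceneutil-style "effective QPS" = 1.0 / t: looks like a 4X win ...
    System.out.println("effective QPS, 1 thread:  " + 1.0 / tOneThread);   // 10.0
    System.out.println("effective QPS, 4 threads: " + 1.0 / tFourThreads); // 40.0

    // ... but total CPU burned per query is unchanged, so red-line capacity is too:
    System.out.println("CPU per query, 1 thread:  " + 1 * tOneThread);   // 0.1 sec CPU
    System.out.println("CPU per query, 4 threads: " + 4 * tFourThreads); // 0.1 sec CPU
  }
}
```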
   
   Except: if Lucene is actually more efficient (lower total CPU) when running 
more threads at once, because e.g. collecting the best hits across more segments 
concurrently means we can stop searching the non-competitive docs across 
segments sooner, then that is an actual QPS (red-line / capacity) win, not just 
a reporting artifact.  With time/improvements this should in fact be a big 
contributor to more efficient (total CPU) search... I think much innovation 
remains in this part of Lucene :)  E.g. maybe some segments tend to contribute 
strongly to most queries, and if Lucene could instrument/record this, it could 
kick off those segments first so they quickly find the competitive docs and 
quickly cause the other segments to stop early.  Intra-segment concurrency is 
another innovation we have yet to successfully tackle ... explicit early 
termination (not just BMW) with all this concurrency is more exploration ...
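
   (For context, the cross-segment concurrency in question is enabled by passing 
an executor to `IndexSearcher`; here's a minimal sketch, with made-up index path 
and pool size. Slices, i.e. groups of segments, are then searched on the pool 
and the per-slice results merged.)

```java
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ConcurrentSearchSketch {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4); // worker threads for slices
    try (DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
      // Each slice is searched concurrently on the pool; collectors can share a
      // global competitive bottom so non-competitive docs are skipped sooner.
      IndexSearcher searcher = new IndexSearcher(reader, pool);
      TopDocs top = searcher.search(new MatchAllDocsQuery(), 10);
      System.out.println("totalHits=" + top.totalHits);
    } finally {
      pool.shutdown();
    }
  }
}
```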
   
   But, in your [first 
run](https://github.com/apache/lucene/pull/13472#issuecomment-2170618011), 
since you ran `main` with three threads and `branch` with two threads, I think 
the concurrency is the same (three threads actually searching slices?), so this 
artifact of luceneutil's QPS reporting can't explain the gains in that run ... 
but it's hard to believe context switching is so costly?  Hmm, though your 
profiling seems to show far fewer calls to `collect`, which might mean the 
cross-segment concurrent efficiency is somehow kicking in?  Or, if context 
switching really does explain it all, and `collect` is much faster, it would be 
sampled less ... we can't read too much into the sample counts from profiler 
output.
   
   > This is main vs main, no concurrency vs 4 threads:
   
   So for these results, I would expect to see a ~4X QPS gain (ish) simply 
because wall-clock elapsed time for each query got ~4X faster (assuming perfect 
concurrency, which won't happen in practice ... it depends of course on the 
relative sizes of the segments, whether there is a long-pole outlier task 
limiting the actual concurrency, etc.).  So it's interesting that some tasks are 
> 400% faster -- maybe the added cross-segment concurrent efficiency is 
contributing here?
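
   As a back-of-the-envelope way to see the long-pole effect (toy numbers, not 
from the actual run): with one thread per slice, elapsed time is bounded below 
by the biggest slice, so the speedup is rarely a clean N X:

```java
import java.util.Arrays;

// Toy long-pole model: with one thread per slice, elapsed wall-clock time is the
// max slice cost, not total/N. Slice costs are invented for illustration.
public class LongPoleSketch {
  public static void main(String[] args) {
    double[] sliceCostMs = {60, 20, 15, 5}; // e.g. proportional to segment sizes
    double sequential = Arrays.stream(sliceCostMs).sum();          // 100 ms, 1 thread
    double concurrent = Arrays.stream(sliceCostMs).max().orElse(0); // 60 ms: long pole
    System.out.printf("speedup = %.2fX (not 4X)%n", sequential / concurrent);
  }
}
```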
   
   It's also spooky that some tasks got ~2X slower.  Can this really be due to 
context-switching overhead?  Nightly benchmarks did indeed switch to 
"concurrent" search, but with only 1 worker thread, just to exercise that code 
path without actually using concurrency.  Yet we didn't see slowdowns in the 
BrowseXXX facet tasks similar to those 
[above](https://github.com/apache/lucene/pull/13472#issuecomment-2173609575), 
e.g. 
[`BrowseDateTaxoFacets`](https://home.apache.org/~mikemccand/lucenebench/BrowseDateTaxoFacets.html).
  So the slowdown doesn't happen just because we are exercising the concurrent 
path.  Actually, I don't think the `BrowseXXX` tasks even use concurrency at 
all: if you look at 
[`SearchTask.java`](https://github.com/mikemccand/luceneutil/blob/04f7c6ced4621fbbe0bf16a25afa7583fae44f8a/src/main/perf/SearchTask.java#L229)
 it's just calling `FastTaxonomyFacetCounts` on the whole index, which [runs 
sequentially segment by segment I 
think](https://github.com/apache/lucene/blob/7e31f56ea1130909948bcf9e5beefc161ceef137/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java#L120-L175)?
  So now I don't understand why this `main vs main (4 threads)` run shows any 
slowdown for these `BrowseXXX` tasks ... confused.
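
   Here is roughly that sequential shape, as I understand it (a simplified 
sketch, not the real `FastTaxonomyFacetCounts` code, which also handles doc 
values, deleted docs, etc.; `countSegment` is a hypothetical stand-in):

```java
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

// Simplified shape of sequential whole-index taxonomy facet counting.
class SequentialCountSketch {
  static int[] countAll(IndexReader reader, TaxonomyReader taxoReader) {
    int[] counts = new int[taxoReader.getSize()];
    for (LeafReaderContext ctx : reader.leaves()) { // one segment at a time, caller thread
      countSegment(ctx, counts); // hypothetical per-segment counting helper
    }
    return counts;
  }

  static void countSegment(LeafReaderContext ctx, int[] counts) {
    // ... read the segment's taxonomy ordinals and increment counts ...
  }
}
```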
   
   [I'll open a spinoff issue here: Lucene's facet counting should maybe also 
tap into this executor for concurrent counting.  It's likely tricky, 
though ...]
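
   A rough (entirely hypothetical) sketch of what executor-backed counting could 
look like -- count each segment into a private array on the pool, then merge; 
the merge and the extra allocation are part of why it's tricky:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

// Entirely hypothetical sketch of executor-backed facet counting: count each
// segment into a private array on the pool, then merge the per-segment arrays.
class ConcurrentCountSketch {
  static int[] countAll(IndexReader reader, ExecutorService pool, int taxoSize)
      throws Exception {
    List<Future<int[]>> futures = new ArrayList<>();
    for (LeafReaderContext ctx : reader.leaves()) {
      futures.add(pool.submit(() -> {
        int[] local = new int[taxoSize]; // private per-segment counts: no contention
        countSegment(ctx, local);        // hypothetical per-segment helper
        return local;
      }));
    }
    int[] merged = new int[taxoSize];
    for (Future<int[]> f : futures) {
      int[] local = f.get();
      for (int ord = 0; ord < taxoSize; ord++) {
        merged[ord] += local[ord];
      }
    }
    return merged;
  }

  static void countSegment(LeafReaderContext ctx, int[] counts) {
    // ... increment counts from the segment's taxonomy ordinals ...
  }
}
```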
   
   @original-brownbear -- what does your Lucene index look like?  Can you run 
`CheckIndex` and share the output from your runs?  I'm curious about the actual 
segment geometry ... also, make sure you are sharing the same single index 
across `main` and `branch`.

