original-brownbear commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2173609575
@msokolov they are astounding, but in the opposite direction: in fact it's mostly the concurrency that's the problem. This is `main` vs `main`, no concurrency vs 4 threads:

```
Task                         QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff            p-value
BrowseDayOfYearTaxoFacets       14.81 (0.4%)     5.97 (0.4%)    -59.7% ( -60% -  -59%)  0.000
BrowseDateTaxoFacets            14.20 (9.0%)     5.85 (0.2%)    -58.8% ( -62% -  -54%)  0.000
IntNRQ                          70.46 (1.3%)    30.29 (3.2%)    -57.0% ( -60% -  -53%)  0.000
BrowseRandomLabelTaxoFacets     11.61 (2.7%)     5.08 (0.3%)    -56.3% ( -57% -  -54%)  0.000
Fuzzy1                          72.82 (5.7%)    44.58 (1.1%)    -38.8% ( -43% -  -33%)  0.000
BrowseDayOfYearSSDVFacets        7.66 (1.0%)     4.78 (0.6%)    -37.6% ( -38% -  -36%)  0.000
OrHighMed                       74.56 (2.4%)    51.86 (3.2%)    -30.4% ( -35% -  -25%)  0.000
AndHighHigh                     47.99 (2.7%)    34.33 (2.6%)    -28.5% ( -32% -  -23%)  0.000
AndHighMed                      67.95 (1.5%)    52.17 (2.4%)    -23.2% ( -26% -  -19%)  0.000
LowSloppyPhrase                 45.51 (1.7%)    37.92 (2.4%)    -16.7% ( -20% -  -12%)  0.000
MedPhrase                       11.68 (5.0%)     9.74 (0.3%)    -16.6% ( -20% -  -11%)  0.000
BrowseMonthTaxoFacets           12.23 (2.6%)    10.73 (27.7%)   -12.2% ( -41% -   18%)  0.378
OrHighHigh                      45.32 (2.7%)    39.79 (4.2%)    -12.2% ( -18% -   -5%)  0.000
BrowseMonthSSDVFacets            5.49 (4.2%)     4.85 (1.1%)    -11.7% ( -16% -   -6%)  0.000
HighSloppyPhrase                 2.01 (2.2%)     1.81 (7.4%)    -10.2% ( -19% -    0%)  0.008
Wildcard                       123.17 (2.5%)   115.43 (0.9%)     -6.3% (  -9% -   -2%)  0.000
OrNotHighLow                   908.00 (2.2%)   865.22 (1.4%)     -4.7% (  -8% -   -1%)  0.000
LowIntervalsOrdered             57.32 (3.2%)    54.78 (3.9%)     -4.4% ( -11% -    2%)  0.077
MedTermDayTaxoFacets            22.22 (0.6%)    21.57 (2.9%)     -2.9% (  -6% -    0%)  0.049
BrowseDateSSDVFacets             1.46 (2.0%)     1.45 (2.1%)     -0.5% (  -4% -    3%)  0.743
BrowseRandomLabelSSDVFacets      3.75 (0.6%)     3.74 (0.2%)     -0.2% (  -1% -    0%)  0.551
OrHighMedDayTaxoFacets           1.20 (1.1%)     1.21 (4.4%)      0.9% (  -4% -    6%)  0.678
Respell                         52.55 (1.4%)    53.25 (2.9%)      1.3% (  -2% -    5%)  0.407
AndHighMedDayTaxoFacets         11.46 (0.8%)    11.76 (2.7%)      2.6% (   0% -    6%)  0.067
AndHighHighDayTaxoFacets        12.74 (1.3%)    13.23 (2.1%)      3.8% (   0% -    7%)  0.002
MedSpanNear                      8.28 (2.4%)     9.50 (5.0%)     14.7% (   7% -   22%)  0.000
AndHighLow                     624.28 (22.4%)  726.83 (3.6%)     16.4% (  -7% -   54%)  0.147
Fuzzy2                          51.95 (23.2%)   60.73 (2.7%)     16.9% (  -7% -   55%)  0.147
MedSloppyPhrase                 12.94 (4.1%)    15.57 (10.9%)    20.3% (   5% -   36%)  0.001
Prefix3                        158.65 (23.1%)  213.31 (3.8%)     34.5% (   6% -   79%)  0.003
PKLookup                       175.73 (6.3%)   247.50 (0.6%)     40.8% (  31% -   50%)  0.000
HighPhrase                      24.79 (6.3%)    37.67 (1.4%)     52.0% (  41% -   63%)  0.000
LowPhrase                      153.31 (1.4%)   244.54 (1.6%)     59.5% (  55% -   63%)  0.000
OrHighLow                      232.73 (23.8%)  371.84 (4.1%)     59.8% (  25% -  115%)  0.000
HighSpanNear                     2.93 (3.5%)     4.82 (13.3%)    64.7% (  46% -   84%)  0.000
LowSpanNear                     51.65 (6.0%)    98.03 (9.8%)     89.8% (  69% -  112%)  0.000
HighTermTitleBDVSort             4.37 (4.4%)     8.65 (1.6%)     98.0% (  88% -  108%)  0.000
MedIntervalsOrdered              9.46 (7.3%)    19.51 (13.2%)   106.1% (  79% -  136%)  0.000
HighIntervalsOrdered             4.26 (6.5%)     8.81 (13.6%)   106.9% (  81% -  135%)  0.000
LowTerm                        232.68 (3.8%)   485.59 (7.5%)    108.7% (  93% -  124%)  0.000
MedTerm                        202.48 (26.4%)  535.61 (18.8%)   164.5% (  94% -  285%)  0.000
OrHighNotLow                   172.52 (3.4%)   516.56 (7.3%)    199.4% ( 182% -  217%)  0.000
OrNotHighHigh                   69.11 (4.1%)   224.30 (11.7%)   224.6% ( 200% -  250%)  0.000
OrHighNotHigh                   77.32 (2.7%)   271.59 (12.7%)   251.3% ( 229% -  274%)  0.000
TermDTSort                      62.88 (4.5%)   224.63 (5.5%)    257.2% ( 236% -  279%)  0.000
HighTerm                       106.32 (3.1%)   385.12 (25.1%)   262.2% ( 227% -  299%)  0.000
OrNotHighMed                    64.14 (10.1%)  247.41 (19.2%)   285.7% ( 232% -  350%)  0.000
OrHighNotMed                    78.41 (5.3%)   306.67 (10.6%)   291.1% ( 261% -  324%)  0.000
HighTermMonthSort              395.36 (38.7%) 2712.69 (16.2%)   586.1% ( 382% - 1046%)  0.000
HighTermDayOfYearSort           67.77 (4.8%)   524.03 (18.3%)   673.3% ( 620% -  731%)  0.000
HighTermTitleSort               15.06 (3.6%)   131.50 (5.9%)    773.4% ( 737% -  811%)  0.000
```

A large number of these tasks are actually showing extreme regressions from forking; even this branch is about 50% behind the no-concurrency run on some of them. This is in fact how I got to opening this PR.
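To make the forking overhead concrete outside of Lucene, here is a minimal, self-contained JDK sketch (not Lucene's code; just a plain `ExecutorService` with a deliberately tiny task) comparing fork-then-`Future#get` against simply running the same work on the calling thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ForkOverheadDemo {

    // A deliberately tiny unit of work: far cheaper than a fork + wake-up round trip.
    static int tinyTask(int x) {
        return x * 2;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        final int iterations = 20_000;

        // Variant 1: fork each tiny task to the pool and block on Future#get,
        // paying the sleep/wake cost on the calling thread every time.
        long t0 = System.nanoTime();
        long forkedSum = 0;
        for (int i = 0; i < iterations; i++) {
            Future<Integer> f = pool.submit(() -> tinyTask(1));
            forkedSum += f.get();
        }
        long forkedNanos = System.nanoTime() - t0;

        // Variant 2: execute the same work directly on the calling thread.
        t0 = System.nanoTime();
        long inlineSum = 0;
        for (int i = 0; i < iterations; i++) {
            inlineSum += tinyTask(1);
        }
        long inlineNanos = System.nanoTime() - t0;

        System.out.println("forkedMs=" + forkedNanos / 1_000_000
            + " inlineMs=" + inlineNanos / 1_000_000);
        System.out.println("resultsEqual=" + (forkedSum == inlineSum));
        System.out.println("forkingSlower=" + (forkedNanos > inlineNanos));
        pool.shutdown();
    }
}
```

For work this small, the fork and wake-up round trip dominates; as per-task cost grows, the gap shrinks, which matches the pattern in the numbers above.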
When profiling ES benchmark runs I saw a bunch of sections where the overhead of forking a given task was higher than the cost of just executing that same task right away. It's a little hard to show this quantitatively in a flame graph, but the qualitative problem is visible here. This is the profiling with vanilla Lucene:

<img width="2497" alt="image" src="https://github.com/apache/lucene/assets/6490959/c7415ecc-8625-4ada-ab14-e8122b8a3d6f">

And this is the same situation with my changes in Lucene:

<img width="2519" alt="image" src="https://github.com/apache/lucene/assets/6490959/fecc99be-449f-45d4-8c2d-9ed74f4afe00">

For weight creation, the forking overhead is still overwhelming, but at least we save the `future.get` overhead of putting the calling thread to sleep and waking it up again. Only for longer-running search tasks is the forking overhead "ok", I think. As I tried to show with the `perf` output, the cache effects of context switching often outweigh any benefits of parallelizing the IO. I could even see a point where IO parallelization causes harm: not from the IO itself, but because page faulting isn't super scalable in Linux, so even if you make an NVMe drive run faster, the contention on page-fault handling might destroy any benefit from pushing the disk (assuming a fast disk, that is) harder.
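One generic JDK illustration of the "just execute it right away" idea — not what this PR implements, merely a sketch of the fallback pattern using the standard `ThreadPoolExecutor.CallerRunsPolicy`, which runs a task on the submitting thread when the pool can't take it:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch hold = new CountDownLatch(1);
        List<String> ranOn = Collections.synchronizedList(new ArrayList<>());

        // Two workers and a one-slot queue, so the pool is easy to saturate deterministically.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(1),
            new ThreadPoolExecutor.CallerRunsPolicy());

        // Occupy both workers until we are done submitting.
        for (int i = 0; i < 2; i++) {
            pool.execute(() -> {
                try { hold.await(); } catch (InterruptedException ignored) { }
            });
        }

        Runnable recordThread = () -> ranOn.add(Thread.currentThread().getName());
        pool.execute(recordThread); // workers busy -> parked in the queue
        pool.execute(recordThread); // queue full -> rejected -> runs on the caller, right away

        hold.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("ranOn=" + ranOn);
        System.out.println("callerRanTask=" + ranOn.contains("main"));
    }
}
```

The fourth `execute` call never forks at all: the caller does the work itself, skipping both the hand-off and the sleep/wake cycle that `future.get` would cost.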