jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1685250650
I think it's starting to look better now. I worked on some inefficiencies and applied some of the optimizations suggested by Mackenzie et al. in ["Tradeoff Options for Bipartite Graph Partitioning"](https://assets.amazon.science/a2/af/ae2fd9c14e6dbf75838160c37a34/tradeoff-options-for-bipartite-graph-partitioning.pdf):
 - Use a simplified estimator that only requires two log computations.
 - Simulated annealing: stop iterating when the gain would be small, using the iteration number as a threshold.

With the suggested defaults of minDocFreq=4,096 and minPartitionSize=32, I'm getting the following performance numbers on wikimedium10m (10M docs):
 - indexing (24 threads): 6.5 minutes
 - force-merging (single thread): 4.2 minutes
 - reordering doc IDs, including building a forward index by uninverting the inverted index (24 threads): 5.6 minutes
 - serializing the reordered view via addIndexes (single thread): 7.4 minutes

Then comparing query performance, I'm getting interesting results. I had to disable verification of scores and counts because of the reordering. A quick manual check suggests that results are valid. I can guess why some queries like conjunctions are faster, but I'm not sure why `OrHighLow` or `HighPhrase` are slower. Regarding sorting tasks, their performance is highly dependent on the index order, so I'm considering them as noise.
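To illustrate the two optimizations mentioned above, here is a minimal Python sketch. It is not the code in this PR: the function names (`move_bias`, `docs_to_swap`), the exact form of the two-log estimate, and the pairing strategy are assumptions made for illustration only.

```python
import math

def move_bias(doc_terms, left_df, right_df):
    """Hypothetical two-log estimator.

    Assumption: each term of the doc contributes
    log2(doc freq in right half) - log2(doc freq in left half),
    i.e. two log computations per term. A positive total means the
    doc is attracted to the right half, a negative one to the left.
    A doc freq of 0 is treated like 1 (contributes 0).
    """
    bias = 0.0
    for term in doc_terms:
        l = left_df.get(term, 0)
        r = right_df.get(term, 0)
        bias += (math.log2(r) if r > 0 else 0.0) - (math.log2(l) if l > 0 else 0.0)
    return bias

def docs_to_swap(left_biases, right_biases, iteration):
    """Hypothetical annealing-style stop rule.

    Pairs the most right-attracted docs of the left half with the most
    left-attracted docs of the right half, and stops as soon as the
    estimated gain of a swap is not greater than the iteration number.
    """
    left = sorted(left_biases.items(), key=lambda kv: kv[1], reverse=True)
    right = sorted(right_biases.items(), key=lambda kv: kv[1])
    swaps = []
    for (left_doc, lb), (right_doc, rb) in zip(left, right):
        if lb - rb <= iteration:  # gain too small for this iteration
            break
        swaps.append((left_doc, right_doc))
    return swaps
```

With this rule the threshold rises with each iteration, so later iterations swap fewer pairs and the loop can stop once no pair clears the bar.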
```
Task                    QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                p-value
OrHighLow                     583.26   (5.9%)                   400.94   (5.9%)    -31.3% (-40% - -20%)  0.000
HighPhrase                     28.64   (7.6%)                    20.78   (5.0%)    -27.4% (-37% - -16%)  0.000
TermDTSort                    113.57   (2.5%)                    90.23   (1.2%)    -20.5% (-23% - -17%)  0.000
HighTermTitleSort              72.11   (1.5%)                    63.72   (1.4%)    -11.6% (-14% -  -8%)  0.000
PKLookup                      290.91   (3.5%)                   264.17   (3.0%)     -9.2% (-15% -  -2%)  0.000
HighTerm                      634.93   (6.4%)                   584.08   (5.9%)     -8.0% (-19% -   4%)  0.000
IntNRQ                         54.77  (16.8%)                    50.44  (13.2%)     -7.9% (-32% -  26%)  0.098
HighTermMonthSort            7652.28   (2.8%)                  7294.07   (3.2%)     -4.7% (-10% -   1%)  0.000
OrHighNotLow                  600.38   (5.5%)                   600.98   (5.3%)      0.1% (-10% -  11%)  0.953
Respell                       272.80   (2.2%)                   274.82   (2.3%)      0.7% ( -3% -   5%)  0.301
Wildcard                      157.34   (4.4%)                   160.16   (3.9%)      1.8% ( -6% -  10%)  0.172
HighSloppyPhrase               19.99   (5.5%)                    20.74   (4.3%)      3.8% ( -5% -  14%)  0.016
Prefix3                       840.82   (4.7%)                   882.90   (5.8%)      5.0% ( -5% -  16%)  0.002
Fuzzy1                        361.19   (2.8%)                   383.68   (3.7%)      6.2% (  0% -  13%)  0.000
HighIntervalsOrdered            8.51   (4.9%)                     9.10   (4.3%)      6.9% ( -2% -  16%)  0.000
OrHighNotHigh                 408.11   (4.8%)                   440.27   (5.1%)      7.9% ( -1% -  18%)  0.000
HighSpanNear                   23.57   (3.2%)                    25.52   (3.7%)      8.3% (  1% -  15%)  0.000
OrNotHighHigh                 367.13   (4.2%)                   397.89   (4.2%)      8.4% (  0% -  17%)  0.000
Fuzzy2                        188.81   (2.1%)                   204.96   (2.6%)      8.6% (  3% -  13%)  0.000
MedIntervalsOrdered            28.61   (4.8%)                    31.32   (4.3%)      9.5% (  0% -  19%)  0.000
LowSpanNear                    51.15   (3.3%)                    56.30   (2.6%)     10.1% (  4% -  16%)  0.000
MedSpanNear                    46.95   (3.1%)                    51.69   (2.9%)     10.1% (  3% -  16%)  0.000
MedSloppyPhrase                59.70   (5.0%)                    66.11   (4.1%)     10.7% (  1% -  20%)  0.000
OrHighNotMed                  514.44   (5.3%)                   577.94   (5.7%)     12.3% (  1% -  24%)  0.000
LowIntervalsOrdered            78.83   (3.8%)                    88.74   (3.7%)     12.6% (  4% -  20%)  0.000
LowSloppyPhrase                54.64   (4.4%)                    62.23   (3.6%)     13.9% (  5% -  22%)  0.000
OrHighHigh                     60.79   (8.6%)                    69.68   (9.4%)     14.6% ( -3% -  35%)  0.000
MedTerm                       864.76   (5.3%)                  1024.07   (7.0%)     18.4% (  5% -  32%)  0.000
LowPhrase                      80.72   (4.5%)                    97.66   (5.1%)     21.0% ( 10% -  32%)  0.000
MedPhrase                      49.16   (4.5%)                    60.54   (5.1%)     23.1% ( 12% -  34%)  0.000
AndHighMed                    183.79   (4.8%)                   238.88   (6.2%)     30.0% ( 18% -  42%)  0.000
AndHighHigh                    88.11   (5.8%)                   114.97   (7.0%)     30.5% ( 16% -  46%)  0.000
OrNotHighMed                  462.38   (3.5%)                   606.15   (4.7%)     31.1% ( 22% -  40%)  0.000
HighTermTitleBDVSort           27.04   (1.7%)                    35.47   (5.8%)     31.2% ( 23% -  39%)  0.000
OrNotHighLow                 1666.47   (3.9%)                  2265.52   (3.6%)     35.9% ( 27% -  45%)  0.000
OrHighMed                     170.06   (4.6%)                   242.18   (8.7%)     42.4% ( 27% -  58%)  0.000
LowTerm                      1032.01   (4.4%)                  1520.01   (7.2%)     47.3% ( 34% -  61%)  0.000
AndHighLow                   1577.56   (2.9%)                  2337.64   (6.8%)     48.2% ( 37% -  59%)  0.000
HighTermDayOfYearSort         199.31   (2.3%)                   357.90   (2.9%)     79.6% ( 72% -  86%)  0.000
```
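As a sanity check on how to read the `Pct diff` column, the small sketch below recomputes it for three rows of the table. It assumes the column is the relative change of the mean QPS of the modified version against the baseline; the helper name `pct_diff` is made up for this example.

```python
def pct_diff(baseline_qps, candidate_qps):
    """Relative change in QPS, in percent (assumed meaning of 'Pct diff')."""
    return 100.0 * (candidate_qps - baseline_qps) / baseline_qps

# (baseline QPS, modified QPS) copied from the table above.
rows = {
    "OrHighLow": (583.26, 400.94),
    "AndHighLow": (1577.56, 2337.64),
    "HighTermDayOfYearSort": (199.31, 357.90),
}

for task, (base, cand) in rows.items():
    print(f"{task}: {pct_diff(base, cand):+.1f}%")
# OrHighLow: -31.3%
# AndHighLow: +48.2%
# HighTermDayOfYearSort: +79.6%
```

Under that assumption the recomputed values match the reported ones, so a negative number means the reordered index is slower for that task.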