jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712779097
Results on Wikibigall. Less space is spent on doc values this time since I did not enable indexing of facets. There is a more significant size reduction of postings this time (-10.5%). This is consistent with the reproducibility paper, which observed size reductions of 18% with partitioned Elias-Fano and 5% with SVByte on the Wikipedia dataset. I would expect PFor to be somewhere in between, as it is better able to take advantage of small gaps between docs than SVByte, but less so than partitioned Elias-Fano.

| File | before (MB) | after (MB) |
| - | - | - |
| terms (tim) | 767 | 766 |
| postings (doc) | 2779 | 2489 |
| positions (pos) | 11356 | 10569 |
| points (kdd) | 100 | 99 |
| doc values (dvd) | 456 | 461 |
| stored fields (fdt) | 249 | 257 |
| norms (nvd) | 13 | 13 |
| total | 15734 | 14669 |

Benchmarks still show slowdowns on phrase queries and speedups on conjunctions, though the effect is less spectacular than on wikimedium10m.

```
Task                  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff                  p-value
MedTerm                     652.41  (7.5%)                   493.97  (2.6%)    -24.3%  ( -31% - -15%)  0.000
HighPhrase                   30.86  (3.5%)                    23.85  (2.6%)    -22.7%  ( -27% - -17%)  0.000
LowPhrase                    51.09  (3.1%)                    42.38  (2.2%)    -17.1%  ( -21% - -12%)  0.000
LowTerm                    1057.76  (5.4%)                   881.22  (2.5%)    -16.7%  ( -23% - -9%)   0.000
MedPhrase                    82.18  (3.0%)                    71.88  (1.7%)    -12.5%  ( -16% - -8%)   0.000
HighTermMonthSort          6482.52  (4.5%)                  5739.50  (3.5%)    -11.5%  ( -18% - -3%)   0.000
PKLookup                    293.95  (3.2%)                   276.15  (3.7%)     -6.1%  ( -12% - 0%)    0.000
MedSloppyPhrase               8.68  (2.7%)                     8.20  (2.9%)     -5.5%  ( -10% - 0%)    0.000
OrHighLow                   578.06  (4.4%)                   550.49  (4.0%)     -4.8%  ( -12% - 3%)    0.016
HighSloppyPhrase              7.43  (2.2%)                     7.10  (4.0%)     -4.4%  ( -10% - 1%)    0.003
Fuzzy1                      244.70  (2.9%)                   238.49  (3.3%)     -2.5%  ( -8% - 3%)     0.080
OrHighHigh                   39.76  (9.5%)                    39.21  (6.1%)     -1.4%  ( -15% - 15%)   0.717
HighTerm                    370.57  (8.5%)                   367.09  (4.4%)     -0.9%  ( -12% - 13%)   0.768
LowSloppyPhrase              13.68  (2.3%)                    13.71  (3.3%)      0.2%  ( -5% - 5%)     0.868
Respell                     204.23  (1.8%)                   204.98  (2.0%)      0.4%  ( -3% - 4%)     0.679
Prefix3                     225.23  (5.1%)                   226.74  (5.5%)      0.7%  ( -9% - 11%)    0.786
Wildcard                    170.34  (4.0%)                   171.63  (3.4%)      0.8%  ( -6% - 8%)     0.665
IntNRQ                       92.30 (11.9%)                    95.15 (10.2%)      3.1%  ( -17% - 28%)   0.555
MedSpanNear                   5.79  (6.8%)                     5.99  (9.3%)      3.4%  ( -11% - 20%)   0.378
OrHighMed                   104.41  (7.3%)                   107.99  (5.3%)      3.4%  ( -8% - 17%)    0.253
HighSpanNear                  2.47  (4.2%)                     2.56  (4.1%)      3.7%  ( -4% - 12%)    0.059
Fuzzy2                      139.96  (2.8%)                   146.77  (2.6%)      4.9%  ( 0% - 10%)     0.000
LowSpanNear                  42.96  (3.6%)                    45.21  (2.5%)      5.2%  ( 0% - 11%)     0.000
AndHighHigh                  33.24  (6.2%)                    36.20  (4.3%)      8.9%  ( -1% - 20%)    0.000
AndHighMed                  131.84  (5.2%)                   144.31  (3.2%)      9.5%  ( 0% - 18%)     0.000
HighTermDayOfYearSort       186.67  (2.9%)                   208.78  (3.2%)     11.8%  ( 5% - 18%)     0.000
AndHighLow                  590.69  (3.2%)                   677.22  (2.2%)     14.6%  ( 9% - 20%)     0.000
```
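
For anyone who wants to recompute the relative size changes from the file-size table in this comment, here is a minimal Python sketch. The names `sizes_mb` and `pct_change` are made up for illustration and are not part of Lucene or luceneutil. The inputs are the rounded MB figures from the table, so the results can differ slightly from percentages computed on exact byte counts: postings come out at about -10.4% rather than the -10.5% quoted above.

```python
# Index sizes in MB, copied from the table in this comment: (before, after).
# The dict and function names are illustrative only.
sizes_mb = {
    "terms (tim)": (767, 766),
    "postings (doc)": (2779, 2489),
    "positions (pos)": (11356, 10569),
    "points (kdd)": (100, 99),
    "doc values (dvd)": (456, 461),
    "stored fields (fdt)": (249, 257),
    "norms (nvd)": (13, 13),
    "total": (15734, 14669),
}


def pct_change(before, after):
    """Relative change from before to after, in percent (negative = smaller)."""
    return (after - before) / before * 100.0


for name, (before, after) in sizes_mb.items():
    print(f"{name:<22} {pct_change(before, after):+6.1f}%")
```

The per-file rows do not sum exactly to the `total` row, which presumably also counts smaller index files that are not listed individually.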