slow-J opened a new issue, #12696: URL: https://github.com/apache/lucene/issues/12696
### Description

Background: In https://github.com/Tony-X/search-benchmark-game we were comparing the performance of Tantivy and Lucene. "One difference between Lucene and Tantivy is Lucene uses the "patch" FOR, meaning the large values in a block are held out as exceptions so that the remaining values can use a smaller number of bits to encode, a tradeoff of CPU for lower storage space."

In https://github.com/Tony-X/search-benchmark-game/issues/46, I disabled the patching in Lucene to match how Tantivy encodes, and ran the search-benchmark-game to test the change.

Lucene modifications for testing: I cloned `PForUtil` and removed all logic related to patching the exceptions. I modified `Lucene90PostingsReader` and `Lucene90PostingsWriter` to use the util with no patching logic; see the sample code at https://github.com/slow-J/lucene/commit/83ec5a8b9f7ed39b8aa3ee948ffe5288a9d3fb16

Hardware used: EC2 Graviton3 instance, m6g.4xlarge

Results from the search-benchmark-game: https://github.com/Tony-X/search-benchmark-game/issues/46#issuecomment-1693714327

We saw Lucene's latency improve: -2% in COUNT, -2% in TOP_10_COUNT, -2.07% in TOP_100.
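
To make the patched-FOR tradeoff quoted above concrete, here is a minimal sketch of how holding out large values changes the packed width of a block. It is a toy model, not Lucene's `PForUtil`: the function names, the limit of 3 exceptions, and the flat cost of 16 bits per exception are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def bits_required(value: int) -> int:
    """Number of bits needed to store a non-negative integer (at least 1)."""
    return max(1, value.bit_length())


def plain_for_bits(block: List[int]) -> int:
    """Plain FOR: every value is packed at the width of the largest value."""
    return bits_required(max(block)) * len(block)


def patched_for_bits(
    block: List[int],
    max_exceptions: int = 3,        # assumption for this toy model
    exception_cost_bits: int = 16,  # assumption: position plus overflow bits
) -> Tuple[int, int, int]:
    """Return (width, num_exceptions, total_bits) for the cheapest layout.

    The k largest values are treated as exceptions. All values still occupy
    a slot at the reduced width; each exception pays a fixed extra cost.
    """
    ordered = sorted(block)
    n = len(ordered)
    best = (bits_required(ordered[-1]), 0, plain_for_bits(block))
    for k in range(1, min(max_exceptions, n - 1) + 1):
        width = bits_required(ordered[n - 1 - k])
        total = width * n + k * exception_cost_bits
        if total < best[2]:
            best = (width, k, total)
    return best


block = [3, 5, 2, 7, 1, 4, 6, 1000]
print(plain_for_bits(block))    # 80
print(patched_for_bits(block))  # (3, 1, 40)
```

In this example a single outlier forces plain FOR to spend 10 bits on every value, while the patched layout packs at 3 bits and pays once for the exception. The decoder, however, has to apply that exception after unpacking, which is the CPU cost the benchmark in this issue tries to measure.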

I then ran a Lucene benchmark with [luceneutil](https://github.com/mikemccand/luceneutil): `python3 src/python/localrun.py -source wikimediumall -r`

Hardware used: EC2 Graviton3 instance, m6g.4xlarge

Posting results below:

```
Task                         QPS baseline  StdDev  QPS my_modified_version  StdDev              Pct diff  p-value
BrowseDateSSDVFacets                 0.95  (4.5%)                     0.95  (5.9%)   -1.0% (-10% -   9%)  0.566
HighTermMonthSort                 2558.85  (2.6%)                  2539.66  (4.6%)   -0.7% ( -7% -   6%)  0.526
BrowseRandomLabelSSDVFacets          2.48  (4.3%)                     2.47  (1.4%)   -0.7% ( -6% -   5%)  0.486
Prefix3                            139.77  (1.7%)                   139.41  (3.1%)   -0.3% ( -4% -   4%)  0.735
BrowseRandomLabelTaxoFacets          3.38  (1.5%)                     3.37  (1.8%)   -0.2% ( -3% -   3%)  0.689
Fuzzy2                              50.20  (2.0%)                    50.11  (2.1%)   -0.2% ( -4% -   3%)  0.776
MedTerm                            439.13  (3.9%)                   438.94  (4.3%)   -0.0% ( -7% -   8%)  0.974
BrowseMonthSSDVFacets                3.48  (1.8%)                     3.48  (1.5%)    0.0% ( -3% -   3%)  0.949
Respell                             37.97  (2.9%)                    37.99  (2.7%)    0.1% ( -5% -   5%)  0.945
LowTerm                            323.73  (4.2%)                   323.93  (3.9%)    0.1% ( -7% -   8%)  0.960
Fuzzy1                              45.01  (2.5%)                    45.06  (2.6%)    0.1% ( -4% -   5%)  0.896
PKLookup                           154.24  (2.6%)                   154.41  (2.1%)    0.1% ( -4% -   4%)  0.886
Wildcard                            71.96  (1.4%)                    72.04  (1.9%)    0.1% ( -3% -   3%)  0.823
BrowseDayOfYearSSDVFacets            3.32  (1.8%)                     3.33  (1.9%)    0.2% ( -3% -   4%)  0.789
IntNRQ                              27.77  (8.5%)                    27.84  (9.6%)    0.3% (-16% -  19%)  0.929
OrHighNotLow                       183.77  (6.9%)                   184.66  (5.4%)    0.5% (-11% -  13%)  0.805
HighTermTitleBDVSort                 4.81  (3.0%)                     4.83  (3.0%)    0.5% ( -5% -   6%)  0.566
HighTerm                           496.37  (5.5%)                   499.12  (5.2%)    0.6% ( -9% -  11%)  0.745
TermDTSort                         109.66  (1.4%)                   110.31  (1.3%)    0.6% ( -1% -   3%)  0.155
HighSloppyPhrase                    23.39  (4.2%)                    23.56  (3.8%)    0.7% ( -7% -   9%)  0.580
BrowseMonthTaxoFacets                3.88  (2.5%)                     3.91  (1.7%)    0.8% ( -3% -   5%)  0.261
BrowseDayOfYearTaxoFacets            3.89  (2.5%)                     3.92  (0.9%)    0.8% ( -2% -   4%)  0.171
BrowseDateTaxoFacets                 3.88  (2.4%)                     3.91  (0.8%)    0.9% ( -2% -   4%)  0.117
OrHighNotHigh                      270.96  (5.3%)                   273.88  (4.2%)    1.1% ( -7% -  11%)  0.474
OrHighNotMed                       202.84  (6.3%)                   205.09  (4.7%)    1.1% ( -9% -  12%)  0.531
HighIntervalsOrdered                 4.36  (4.2%)                     4.41  (4.3%)    1.1% ( -7% -  10%)  0.399
HighTermTitleSort                   20.12  (5.5%)                    20.35  (5.6%)    1.2% ( -9% -  12%)  0.511
MedTermDayTaxoFacets                 9.42  (3.2%)                     9.53  (4.3%)    1.2% ( -6% -   8%)  0.325
OrHighHigh                          12.07  (4.9%)                    12.22  (4.9%)    1.3% ( -8% -  11%)  0.415
HighSpanNear                         9.92  (3.5%)                    10.06  (3.3%)    1.3% ( -5% -   8%)  0.212
OrNotHighHigh                      356.64  (4.5%)                   361.84  (3.3%)    1.5% ( -6% -   9%)  0.242
MedIntervalsOrdered                  3.09  (2.9%)                     3.14  (3.0%)    1.5% ( -4% -   7%)  0.119
OrHighMedDayTaxoFacets               1.54  (4.6%)                     1.56  (5.3%)    1.5% ( -8% -  11%)  0.338
OrNotHighMed                       346.18  (3.3%)                   351.65  (3.6%)    1.6% ( -5% -   8%)  0.144
LowSloppyPhrase                     21.58  (2.6%)                    21.94  (2.0%)    1.7% ( -2% -   6%)  0.025
OrHighMed                           48.51  (3.4%)                    49.50  (3.5%)    2.0% ( -4% -   9%)  0.062
MedSpanNear                         26.76  (1.4%)                    27.35  (2.2%)    2.2% ( -1% -   5%)  0.000
HighTermDayOfYearSort              223.99  (2.7%)                   228.97  (2.1%)    2.2% ( -2% -   7%)  0.004
OrHighLow                          208.96  (3.2%)                   213.65  (3.0%)    2.2% ( -3% -   8%)  0.022
AndHighHigh                         13.93  (3.6%)                    14.25  (3.7%)    2.4% ( -4% -  10%)  0.042
MedSloppyPhrase                     19.45  (2.5%)                    19.92  (2.0%)    2.4% ( -1% -   7%)  0.001
AndHighHighDayTaxoFacets             2.12  (3.9%)                     2.17  (5.2%)    2.4% ( -6% -  11%)  0.094
LowSpanNear                          8.99  (1.6%)                     9.24  (1.4%)    2.7% (  0% -   5%)  0.000
LowIntervalsOrdered                  7.46  (2.6%)                     7.67  (3.1%)    2.8% ( -2% -   8%)  0.002
HighPhrase                          12.54  (2.8%)                    12.91  (3.3%)    2.9% ( -3% -   9%)  0.003
MedPhrase                           25.45  (1.8%)                    26.22  (2.1%)    3.0% (  0% -   6%)  0.000
AndHighMedDayTaxoFacets             16.49  (1.7%)                    17.08  (2.3%)    3.6% (  0% -   7%)  0.000
AndHighMed                          91.80  (2.2%)                    95.26  (2.3%)    3.8% (  0% -   8%)  0.000
LowPhrase                          144.04  (1.5%)                   150.03  (2.0%)    4.2% (  0% -   7%)  0.000
OrNotHighLow                       315.83  (1.8%)                   330.04  (2.6%)    4.5% (  0% -   9%)  0.000
AndHighLow                         352.57  (2.7%)                   379.44  (3.7%)    7.6% (  1% -  14%)  0.000
```

The tasks at the bottom of the table had the largest QPS improvements. AndHighLow has a QPS improvement of +7.6%!

Size of test candidate index: 17.626 GiB total
Size of test baseline index: 18.401 GiB total

This change would bring about a 4.4% increase in index size.
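
The headline percentages above can be re-derived from the reported figures. The sketch below uses a hypothetical helper, `pct_change`, which is not part of luceneutil. For the index size it assumes the difference is measured relative to the smaller of the two reported sizes.

```python
def pct_change(reference: float, other: float) -> float:
    """Relative change from `reference` to `other`, in percent."""
    return (other - reference) / reference * 100.0


# QPS for AndHighLow as reported in the table: baseline, then modified.
and_high_low = pct_change(352.57, 379.44)

# The two reported index sizes in GiB; the smaller one is the reference.
sizes = (17.626, 18.401)
size_difference = pct_change(min(sizes), max(sizes))

print(f"AndHighLow QPS change: {and_high_low:.1f}%")
print(f"Index size difference: {size_difference:.5f}%")
```

Both values agree with the figures quoted in the issue: roughly 7.6% for AndHighLow and roughly 4.4% for the index size.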

### Proposal

I propose adding an option to Lucene's codec that allows users to disable patching in the PFOR encoding, providing the option to leverage the performance benefits observed here at a cost of index size. I would appreciate all feedback and further evaluation of this idea by the Lucene community.