[ https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404882#comment-17404882 ]
Greg Miller commented on LUCENE-10062:
--------------------------------------

The performance improvement, as measured by {{luceneutil}} benchmarks, is borderline unbelievable by moving to numeric doc values (instead of the custom binary encoded values). It feels too good to be true, but all tests pass and I pulled the change into our internal fork and ran all of our tests and correctness suites, which also all pass. *I'm seeing almost 400% QPS improvement on the three taxonomy browsing tasks with this change*. The following results are using {{wikimediumall}}:

{noformat}
Task                       QPS baseline   StdDev  QPS candidate   StdDev  Pct diff                   p-value
OrHighNotMed                     638.52   (5.3%)         602.16   (8.5%)     -5.7%  ( -18% -    8%)    0.011
OrHighNotLow                     609.27   (4.2%)         588.95   (5.8%)     -3.3%  ( -12% -    6%)    0.036
PKLookup                         136.76   (3.4%)         133.32   (3.3%)     -2.5%  (  -8% -    4%)    0.018
OrHighNotHigh                    535.46   (4.6%)         523.63   (5.7%)     -2.2%  ( -12% -    8%)    0.181
OrNotHighMed                     516.79   (5.6%)         507.71   (6.7%)     -1.8%  ( -13% -   11%)    0.367
OrNotHighLow                     543.98   (4.6%)         535.62   (6.8%)     -1.5%  ( -12% -   10%)    0.403
OrHighLow                        222.57   (2.7%)         219.42   (3.8%)     -1.4%  (  -7% -    5%)    0.171
Prefix3                           52.18   (6.1%)          51.50   (6.0%)     -1.3%  ( -12% -   11%)    0.499
Fuzzy1                            49.69   (3.3%)          49.12   (4.1%)     -1.1%  (  -8% -    6%)    0.340
Wildcard                          23.73   (4.1%)          23.53   (4.0%)     -0.8%  (  -8% -    7%)    0.512
OrNotHighHigh                    471.12   (3.7%)         467.29   (5.6%)     -0.8%  (  -9% -    8%)    0.589
HighSloppyPhrase                   4.67   (4.8%)           4.63   (5.5%)     -0.8%  ( -10% -   10%)    0.635
MedTerm                         1510.23   (5.3%)        1498.61   (7.9%)     -0.8%  ( -13% -   13%)    0.718
LowIntervalsOrdered               71.09   (3.3%)          70.59   (3.6%)     -0.7%  (  -7% -    6%)    0.523
HighPhrase                        15.80   (3.1%)          15.72   (3.5%)     -0.5%  (  -6% -    6%)    0.607
MedSloppyPhrase                   12.99   (2.2%)          12.94   (2.7%)     -0.4%  (  -5% -    4%)    0.614
MedPhrase                         11.68   (2.7%)          11.63   (2.7%)     -0.4%  (  -5% -    5%)    0.646
Respell                           42.75   (2.3%)          42.59   (2.7%)     -0.4%  (  -5% -    4%)    0.645
LowSloppyPhrase                    6.80   (2.3%)           6.77   (2.5%)     -0.3%  (  -5% -    4%)    0.682
IntNRQ                            32.19   (1.7%)          32.11   (1.8%)     -0.3%  (  -3% -    3%)    0.633
LowPhrase                         16.49   (2.6%)          16.45   (2.3%)     -0.3%  (  -4% -    4%)    0.738
Fuzzy2                            12.52   (3.0%)          12.49   (3.9%)     -0.2%  (  -6% -    6%)    0.831
LowTerm                         1338.97   (5.7%)        1336.19   (7.1%)     -0.2%  ( -12% -   13%)    0.919
HighIntervalsOrdered               5.48   (2.3%)           5.47   (2.7%)     -0.2%  (  -5% -    4%)    0.827
AndHighLow                       295.57   (2.4%)         295.11   (3.2%)     -0.2%  (  -5% -    5%)    0.861
LowSpanNear                       39.91   (1.4%)          39.86   (1.5%)     -0.1%  (  -3% -    2%)    0.775
HighTerm                        1014.28   (4.6%)        1013.17   (6.4%)     -0.1%  ( -10% -   11%)    0.951
BrowseMonthSSDVFacets              3.23   (5.0%)           3.23   (4.9%)     -0.1%  (  -9% -   10%)    0.956
MedSpanNear                       10.01   (2.1%)          10.01   (2.2%)     -0.1%  (  -4% -    4%)    0.931
AndHighHigh                       50.17   (2.5%)          50.17   (2.8%)     -0.0%  (  -5% -    5%)    0.997
HighSpanNear                       0.90   (1.3%)           0.90   (1.7%)      0.0%  (  -2% -    3%)    0.997
MedIntervalsOrdered               18.18   (1.9%)          18.20   (2.2%)      0.1%  (  -3% -    4%)    0.853
OrHighHigh                        15.91   (1.7%)          15.93   (2.1%)      0.1%  (  -3% -    3%)    0.820
HighTermDayOfYearSort             20.48   (8.0%)          20.54   (6.6%)      0.3%  ( -13% -   16%)    0.903
OrHighMed                         33.57   (1.9%)          33.68   (2.7%)      0.3%  (  -4% -    5%)    0.637
BrowseDayOfYearSSDVFacets          2.99   (5.5%)           3.00   (5.0%)      0.4%  (  -9% -   11%)    0.809
AndHighMed                        44.35   (3.2%)          44.60   (3.1%)      0.6%  (  -5% -    7%)    0.574
HighTermMonthSort                 41.42  (14.7%)          41.92  (15.8%)      1.2%  ( -25% -   37%)    0.805
HighTermTitleBDVSort              34.18  (12.8%)          34.70  (11.7%)      1.5%  ( -20% -   29%)    0.699
TermDTSort                        45.24   (9.5%)          45.93   (9.7%)      1.5%  ( -16% -   22%)    0.616
BrowseDateTaxoFacets               0.72   (3.5%)           3.51  (62.8%)    388.3%  ( 311% -  471%)    0.000
BrowseDayOfYearTaxoFacets          0.72   (3.4%)           3.52  (61.3%)    389.9%  ( 314% -  470%)    0.000
BrowseMonthTaxoFacets              0.76   (3.5%)           3.95  (84.1%)    419.2%  ( 320% -  525%)    0.000
{noformat}

Digging a little deeper, here's what I'm seeing as top CPU time:

baseline:
{noformat}
PERCENT  CPU SAMPLES  STACK
12.89%   286328       org.apache.lucene.util.packed.DirectMonotonicReader#get()
7.18%    159607       org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$15#binaryValue()
6.90%    153297       org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
6.25%    138833       org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
{noformat}

candidate:
{noformat}
PERCENT  CPU SAMPLES  STACK
4.77%    62575        org.apache.lucene.index.SingletonSortedNumericDocValues#nextDoc()
4.30%    56479        org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
4.20%    55120        org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
3.97%    52068        org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$18#nextDoc()
3.77%    49425        org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$4#longValue()
3.35%    43952        org.apache.lucene.queries.spans.NearSpansOrdered#nextStartPosition()
3.29%    43142        org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#advance()
2.86%    37556        org.apache.lucene.queries.spans.TermSpans#nextStartPosition()
2.83%    37102        org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#advance()
2.62%    34434        org.apache.lucene.queries.spans.NearSpansOrdered#stretchToOrder()
2.53%    33236        org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
1.86%    24417        org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
1.85%    24271        org.apache.lucene.queries.spans.SpanScorer#setFreqCurrentDoc()
1.73%    22668        org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
1.70%    22365        org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
{noformat}

So a drop from 6.25% CPU time to 1.7% for {{FastTaxonomyFacetCounts#countAll}}.

On top of this, the index actually gets smaller (by ~1.4%).

{noformat}
11504472  wikimediumall.baseline.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd33.3326M
11334516  wikimediumall.candidate.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd33.3326M
{noformat}

And... I haven't even optimized the single-value case yet (which will be easy to do and may squeeze out a little more performance based on what we saw with SSDV faceting). Like I said, almost too good to be true.
I've uploaded a PR here and would appreciate another set of eyes to see if I have something fundamentally wrong: https://github.com/apache/lucene/pull/264

> Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-10062
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10062
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Greg Miller
>            Assignee: Greg Miller
>            Priority: Minor
>
> We currently encode taxonomy ordinals using varint style packing in a binary
> doc values field. I suspect there have been a number of improvements to
> SortedNumericDocValues since taxonomy faceting was first introduced, and I
> plan to explore replacing the custom binary format we have today with a
> SORTED_NUMERIC type dv field instead.
> I'll report benchmark results and index size impact here.
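
For readers unfamiliar with the "varint style packing" mentioned in the issue description, here is a minimal, Lucene-free sketch of the general idea: a document's sorted ordinals are delta-encoded and each delta is written as a variable-length integer into one byte array. The class and method names ({{OrdinalEncodingSketch}}, {{encode}}, {{decode}}) are hypothetical, and the byte layout is a generic low-bits-first varint chosen for illustration; it is an assumption, not necessarily the layout the facet module actually writes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative only: hypothetical names, generic varint layout (an assumption).
public class OrdinalEncodingSketch {

    // Delta + varint encode an ascending list of non-negative ordinals.
    public static byte[] encode(int[] sortedOrdinals) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrdinals) {
            int delta = ord - prev;
            // Emit 7 bits at a time, low bits first; the high bit marks "more bytes follow".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            prev = ord;
        }
        return out.toByteArray();
    }

    // Reverse of encode: every read has to walk the bytes and rebuild each ordinal.
    public static int[] decode(byte[] bytes) {
        int[] ords = new int[bytes.length]; // upper bound: at least one byte per ordinal
        int count = 0;
        int prev = 0;
        int pos = 0;
        while (pos < bytes.length) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            ords[count++] = prev;
        }
        return Arrays.copyOf(ords, count);
    }

    public static void main(String[] args) {
        int[] ordinals = {3, 17, 18, 1000};
        byte[] packed = encode(ordinals);
        System.out.println("bytes: " + packed.length);        // prints "bytes: 5"
        System.out.println(Arrays.toString(decode(packed)));  // prints "[3, 17, 18, 1000]"
    }
}
```

With a SORTED_NUMERIC doc values field, the per-document ordinals would instead be handed to the codec as numeric values, so a byte-by-byte decode loop like the one in {{decode}} would presumably no longer be needed in the faceting code. That is consistent with {{binaryValue()}} disappearing from the candidate profile above, though the sketch itself makes no claim about where the measured speedup comes from.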