[jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting

Greg Miller (Jira) Wed, 25 Aug 2021 20:33:06 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404882#comment-17404882
 ]


Greg Miller commented on LUCENE-10062:
--------------------------------------

The performance improvement from moving to numeric doc values (instead of the 
custom binary-encoded values), as measured by the {{luceneutil}} benchmarks, is 
borderline unbelievable. It feels too good to be true, but all tests pass. I 
also pulled the change into our internal fork and ran all of our tests and 
correctness suites, which pass as well.
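
For readers less familiar with the two formats, here is a minimal sketch of the encodings being compared. It assumes the binary format sorts and deduplicates each document's ordinals, delta-codes them, and packs each delta as a variable-length int, which is one reading of the "varint style packing" in the issue description. The class and method names are hypothetical and the code uses only the JDK, not Lucene's API.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class OrdinalEncodingSketch {

  /** Sorts and dedups the ordinals, delta-codes them, and packs each delta as a varint. */
  static byte[] encodeVarintDeltas(int[] ordinals) {
    int[] sorted = Arrays.stream(ordinals).sorted().distinct().toArray();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int previous = 0;
    for (int ordinal : sorted) {
      int delta = ordinal - previous;
      // 7 payload bits per byte; the high bit marks "more bytes follow".
      while ((delta & ~0x7F) != 0) {
        out.write((delta & 0x7F) | 0x80);
        delta >>>= 7;
      }
      out.write(delta);
      previous = ordinal;
    }
    return out.toByteArray();
  }

  /** Reverses encodeVarintDeltas; this byte-by-byte loop runs once per document. */
  static int[] decodeVarintDeltas(byte[] bytes) {
    int[] ordinals = new int[bytes.length]; // upper bound: at most one ordinal per byte
    int count = 0;
    int previous = 0;
    int value = 0;
    int shift = 0;
    for (byte b : bytes) {
      value |= (b & 0x7F) << shift;
      if ((b & 0x80) != 0) {
        shift += 7; // continuation bit set: keep accumulating this delta
      } else {
        previous += value; // undo the delta coding
        ordinals[count++] = previous;
        value = 0;
        shift = 0;
      }
    }
    return Arrays.copyOf(ordinals, count);
  }

  /** The numeric alternative: hand the sorted ordinals over as plain longs. */
  static long[] asSortedNumericValues(int[] ordinals) {
    return Arrays.stream(ordinals).sorted().distinct().asLongStream().toArray();
  }

  public static void main(String[] args) {
    int[] ordinals = {300, 7, 42, 7};
    byte[] packed = encodeVarintDeltas(ordinals);
    System.out.println(Arrays.toString(decodeVarintDeltas(packed)));
    System.out.println(Arrays.toString(asSortedNumericValues(ordinals)));
  }
}
```

Under this assumption, the decode loop is the work a reader of the binary field has to do itself for every document, while the numeric form leaves packing and unpacking to the doc values codec.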

*I'm seeing a roughly 400% QPS improvement (388% to 419%) on the three taxonomy 
browsing tasks with this change*. 

The following results are using {{wikimediumall}}:

 
{noformat}
                     Task  QPS baseline    StdDev  QPS candidate    StdDev  Pct diff                 p-value
             OrHighNotMed        638.52    (5.3%)         602.16    (8.5%)     -5.7% ( -18% -    8%)   0.011
             OrHighNotLow        609.27    (4.2%)         588.95    (5.8%)     -3.3% ( -12% -    6%)   0.036
                 PKLookup        136.76    (3.4%)         133.32    (3.3%)     -2.5% (  -8% -    4%)   0.018
            OrHighNotHigh        535.46    (4.6%)         523.63    (5.7%)     -2.2% ( -12% -    8%)   0.181
             OrNotHighMed        516.79    (5.6%)         507.71    (6.7%)     -1.8% ( -13% -   11%)   0.367
             OrNotHighLow        543.98    (4.6%)         535.62    (6.8%)     -1.5% ( -12% -   10%)   0.403
                OrHighLow        222.57    (2.7%)         219.42    (3.8%)     -1.4% (  -7% -    5%)   0.171
                  Prefix3         52.18    (6.1%)          51.50    (6.0%)     -1.3% ( -12% -   11%)   0.499
                   Fuzzy1         49.69    (3.3%)          49.12    (4.1%)     -1.1% (  -8% -    6%)   0.340
                 Wildcard         23.73    (4.1%)          23.53    (4.0%)     -0.8% (  -8% -    7%)   0.512
            OrNotHighHigh        471.12    (3.7%)         467.29    (5.6%)     -0.8% (  -9% -    8%)   0.589
         HighSloppyPhrase          4.67    (4.8%)           4.63    (5.5%)     -0.8% ( -10% -   10%)   0.635
                  MedTerm       1510.23    (5.3%)        1498.61    (7.9%)     -0.8% ( -13% -   13%)   0.718
      LowIntervalsOrdered         71.09    (3.3%)          70.59    (3.6%)     -0.7% (  -7% -    6%)   0.523
               HighPhrase         15.80    (3.1%)          15.72    (3.5%)     -0.5% (  -6% -    6%)   0.607
          MedSloppyPhrase         12.99    (2.2%)          12.94    (2.7%)     -0.4% (  -5% -    4%)   0.614
                MedPhrase         11.68    (2.7%)          11.63    (2.7%)     -0.4% (  -5% -    5%)   0.646
                  Respell         42.75    (2.3%)          42.59    (2.7%)     -0.4% (  -5% -    4%)   0.645
          LowSloppyPhrase          6.80    (2.3%)           6.77    (2.5%)     -0.3% (  -5% -    4%)   0.682
                   IntNRQ         32.19    (1.7%)          32.11    (1.8%)     -0.3% (  -3% -    3%)   0.633
                LowPhrase         16.49    (2.6%)          16.45    (2.3%)     -0.3% (  -4% -    4%)   0.738
                   Fuzzy2         12.52    (3.0%)          12.49    (3.9%)     -0.2% (  -6% -    6%)   0.831
                  LowTerm       1338.97    (5.7%)        1336.19    (7.1%)     -0.2% ( -12% -   13%)   0.919
     HighIntervalsOrdered          5.48    (2.3%)           5.47    (2.7%)     -0.2% (  -5% -    4%)   0.827
               AndHighLow        295.57    (2.4%)         295.11    (3.2%)     -0.2% (  -5% -    5%)   0.861
              LowSpanNear         39.91    (1.4%)          39.86    (1.5%)     -0.1% (  -3% -    2%)   0.775
                 HighTerm       1014.28    (4.6%)        1013.17    (6.4%)     -0.1% ( -10% -   11%)   0.951
    BrowseMonthSSDVFacets          3.23    (5.0%)           3.23    (4.9%)     -0.1% (  -9% -   10%)   0.956
              MedSpanNear         10.01    (2.1%)          10.01    (2.2%)     -0.1% (  -4% -    4%)   0.931
              AndHighHigh         50.17    (2.5%)          50.17    (2.8%)     -0.0% (  -5% -    5%)   0.997
             HighSpanNear          0.90    (1.3%)           0.90    (1.7%)      0.0% (  -2% -    3%)   0.997
      MedIntervalsOrdered         18.18    (1.9%)          18.20    (2.2%)      0.1% (  -3% -    4%)   0.853
               OrHighHigh         15.91    (1.7%)          15.93    (2.1%)      0.1% (  -3% -    3%)   0.820
    HighTermDayOfYearSort         20.48    (8.0%)          20.54    (6.6%)      0.3% ( -13% -   16%)   0.903
                OrHighMed         33.57    (1.9%)          33.68    (2.7%)      0.3% (  -4% -    5%)   0.637
BrowseDayOfYearSSDVFacets          2.99    (5.5%)           3.00    (5.0%)      0.4% (  -9% -   11%)   0.809
               AndHighMed         44.35    (3.2%)          44.60    (3.1%)      0.6% (  -5% -    7%)   0.574
        HighTermMonthSort         41.42   (14.7%)          41.92   (15.8%)      1.2% ( -25% -   37%)   0.805
     HighTermTitleBDVSort         34.18   (12.8%)          34.70   (11.7%)      1.5% ( -20% -   29%)   0.699
               TermDTSort         45.24    (9.5%)          45.93    (9.7%)      1.5% ( -16% -   22%)   0.616
     BrowseDateTaxoFacets          0.72    (3.5%)           3.51   (62.8%)    388.3% ( 311% -  471%)   0.000
BrowseDayOfYearTaxoFacets          0.72    (3.4%)           3.52   (61.3%)    389.9% ( 314% -  470%)   0.000
    BrowseMonthTaxoFacets          0.76    (3.5%)           3.95   (84.1%)    419.2% ( 320% -  525%)   0.000
{noformat}
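
To put the "Pct diff" column in concrete terms, here is a small sketch that recomputes the relative change for the three taxonomy tasks from the rounded QPS values in the table. It assumes "Pct diff" means (candidate - baseline) / baseline. The class and method names are made up for illustration, and because the inputs are the rounded table values, the results can differ slightly from the reported percentages.

```java
import java.util.Locale;

final class PctDiffSketch {

  /** Relative change of the candidate QPS versus the baseline QPS, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  /** Speedup factor; a +400% change corresponds to a 5x speedup. */
  static double speedup(double baselineQps, double candidateQps) {
    return candidateQps / baselineQps;
  }

  public static void main(String[] args) {
    String[] tasks = {
      "BrowseDateTaxoFacets", "BrowseDayOfYearTaxoFacets", "BrowseMonthTaxoFacets"
    };
    // {baseline QPS, candidate QPS}, copied from the table (rounded to two decimals).
    double[][] qps = {{0.72, 3.51}, {0.72, 3.52}, {0.76, 3.95}};
    for (int i = 0; i < tasks.length; i++) {
      System.out.printf(
          Locale.ROOT,
          "%-26s %+7.1f%%  (%.2fx)%n",
          tasks[i],
          pctDiff(qps[i][0], qps[i][1]),
          speedup(qps[i][0], qps[i][1]));
    }
  }
}
```

In other words, a roughly 400% improvement means the taxonomy browse tasks run at close to five times the baseline QPS.
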
Digging a little deeper, here's what I'm seeing as top CPU time:

baseline:
{noformat}
PERCENT       CPU SAMPLES   STACK
12.89%        286328        org.apache.lucene.util.packed.DirectMonotonicReader#get()
7.18%         159607        org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$15#binaryValue()
6.90%         153297        org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
6.25%         138833        org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
{noformat}
candidate:
{noformat}
PERCENT       CPU SAMPLES   STACK
4.77%         62575         org.apache.lucene.index.SingletonSortedNumericDocValues#nextDoc()
4.30%         56479         org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
4.20%         55120         org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
3.97%         52068         org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$18#nextDoc()
3.77%         49425         org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$4#longValue()
3.35%         43952         org.apache.lucene.queries.spans.NearSpansOrdered#nextStartPosition()
3.29%         43142         org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#advance()
2.86%         37556         org.apache.lucene.queries.spans.TermSpans#nextStartPosition()
2.83%         37102         org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#advance()
2.62%         34434         org.apache.lucene.queries.spans.NearSpansOrdered#stretchToOrder()
2.53%         33236         org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
1.86%         24417         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
1.85%         24271         org.apache.lucene.queries.spans.SpanScorer#setFreqCurrentDoc()
1.73%         22668         org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
1.70%         22365         org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
{noformat}
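So {{FastTaxonomyFacetCounts#countAll}} drops from 6.25% of CPU time to 1.70%, 
and {{DirectMonotonicReader#get}} (12.89%) and {{binaryValue}} (7.18%) no longer 
show up among the top entries listed for the candidate.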

On top of this, the index actually gets smaller (by ~1.5%).
{noformat}
11504472        wikimediumall.baseline.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd33.3326M
11334516        wikimediumall.candidate.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd33.3326M
{noformat}
And... I haven't even optimized the single-value case yet (which will be easy 
to do and may squeeze out a little more performance based on what we saw with 
SSDV faceting).

Like I said, almost too good to be true. I've uploaded a PR here and would 
appreciate another set of eyes to see if I have something fundamentally wrong: 
https://github.com/apache/lucene/pull/264

> Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for 
> faceting
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-10062
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10062
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Greg Miller
>            Assignee: Greg Miller
>            Priority: Minor
>
> We currently encode taxonomy ordinals using varint style packing in a binary 
> doc values field. I suspect there have been a number of improvements to 
> SortedNumericDocValues since taxonomy faceting was first introduced, and I 
> plan to explore replacing the custom binary format we have today with a 
> SORTED_NUMERIC type dv field instead.
> I'll report benchmark results and index size impact here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
