[GitHub] [lucene] gf2121 edited a comment on pull request #557: LUCENE-10333: Speed up BinaryDocValues with a batch reading on LongValues

GitBox Wed, 22 Dec 2021 03:57:54 -0800


gf2121 edited a comment on pull request #557:
URL: https://github.com/apache/lucene/pull/557#issuecomment-999516790



   Hi @rmuir @jpountz, thanks a lot for all the discussion on this! I think I 
**probably** found a better approach:
   
   > Actually, what I thought of at first was to change only the structure of 
addresses, implementing a new LongValues to replace the DirectReader or 
DirectMonotonicReader for reading addresses, e.g. a ForUtilLongValues. When users 
request a long by index, it will use ForUtil to decompress the 
required block (caching the block, of course, so that if the next call falls in 
the same block we can reuse it). 
   
   I implemented this idea and saw good benchmark results on the 
**newest** luceneutil (wiki10m), with no obviously slower tasks. I 
opened a new issue for it: 
https://issues.apache.org/jira/browse/LUCENE-10334
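
To make the block-caching idea above concrete, here is a minimal sketch. It is written in Python only to match the script later in this comment; it is not Lucene code. `BlockCachedLongValues`, `decode_block`, `decode_count`, and the block size of 128 are all hypothetical names and values chosen for illustration, and `decode_block` merely stands in for a bulk decoder such as ForUtil.

```python
from typing import Callable, List

BLOCK_SIZE = 128  # assumed block size, for illustration only


class BlockCachedLongValues:
    """Random-access reader that decodes a whole block on demand and caches it.

    Hypothetical sketch: `decode_block` stands in for a bulk decoder such as
    ForUtil; it is not a Lucene API.
    """

    def __init__(self, decode_block: Callable[[int], List[int]],
                 block_size: int = BLOCK_SIZE) -> None:
        self._decode_block = decode_block
        self._block_size = block_size
        self._cached_block_id = -1          # no block decoded yet
        self._cached_block: List[int] = []
        self.decode_count = 0               # how many blocks were decoded

    def get(self, index: int) -> int:
        block_id, offset = divmod(index, self._block_size)
        if block_id != self._cached_block_id:
            # Cache miss: decode the whole block once and remember it.
            self._cached_block = self._decode_block(block_id)
            self._cached_block_id = block_id
            self.decode_count += 1
        # Cache hit (or freshly decoded block): a plain array lookup.
        return self._cached_block[offset]


# Toy "storage": the decoder just slices a plain list.
stored = [i * 3 for i in range(1000)]


def decode(block_id: int) -> List[int]:
    start = block_id * BLOCK_SIZE
    return stored[start:start + BLOCK_SIZE]


reader = BlockCachedLongValues(decode)
first_block = [reader.get(i) for i in range(BLOCK_SIZE)]  # 128 reads
print(reader.decode_count)  # prints 1: all 128 reads shared one decode
```

Sequential access pays for one decode per block, while access that jumps between blocks on every call would decode a block each time, which is the case where a slowdown would be expected to show up.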
   
   **Edit**: In addition, the new optimization does not target exactly the 
same place as this PR. This PR tries to optimize BinaryDocValues, while 
the new optimization uses the new reader for 
`NumericDocValues#LongValue`. This is because the effect on 
`NumericDocValues#LongValue` is easier to see in the newest luceneutil, and I 
think that if the new reader is justified for `NumericDocValues` we can easily 
port it to the DirectMonotonicReader :)
   
   ```
                          Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                 Pct diff  p-value
                 OrNotHighHigh        694.17   (8.2%)                   685.83   (7.0%)    -1.2% ( -15% -   15%)    0.618
                       Respell         75.15   (2.7%)                    74.32   (2.0%)    -1.1% (  -5% -    3%)    0.146
                       Prefix3        220.11   (5.1%)                   217.78   (5.8%)    -1.1% ( -11% -   10%)    0.541
                      Wildcard        129.75   (3.7%)                   128.63   (2.5%)    -0.9% (  -6% -    5%)    0.383
                   LowSpanNear         68.54   (2.1%)                    68.00   (2.4%)    -0.8% (  -5% -    3%)    0.269
                  OrNotHighMed        732.90   (6.8%)                   727.49   (5.3%)    -0.7% ( -12% -   12%)    0.703
   BrowseRandomLabelTaxoFacets      11879.03   (8.6%)                 11799.33   (5.5%)    -0.7% ( -13% -   14%)    0.769
              HighSloppyPhrase          6.87   (2.9%)                     6.83   (2.3%)    -0.6% (  -5% -    4%)    0.496
                  OrHighNotMed        827.54   (9.2%)                   822.94   (8.0%)    -0.6% ( -16% -   18%)    0.838
                   MedSpanNear         18.92   (5.7%)                    18.82   (5.6%)    -0.5% ( -11% -   11%)    0.759
        OrHighMedDayTaxoFacets         10.27   (4.0%)                    10.21   (4.3%)    -0.5% (  -8% -    8%)    0.676
                      PKLookup        207.98   (4.0%)                   206.85   (2.7%)    -0.5% (  -7% -    6%)    0.621
           LowIntervalsOrdered        159.17   (2.3%)                   158.32   (2.2%)    -0.5% (  -4% -    3%)    0.445
                  HighSpanNear          6.32   (4.2%)                     6.28   (4.1%)    -0.5% (  -8% -    8%)    0.691
           MedIntervalsOrdered         85.31   (3.2%)                    84.88   (2.9%)    -0.5% (  -6% -    5%)    0.607
                      HighTerm       1170.55   (5.8%)                  1164.79   (3.9%)    -0.5% (  -9% -    9%)    0.753
               LowSloppyPhrase         14.54   (3.1%)                    14.48   (2.9%)    -0.4% (  -6% -    5%)    0.651
                    HighPhrase        112.81   (4.4%)                   112.39   (4.1%)    -0.4% (  -8% -    8%)    0.781
                  OrNotHighLow        858.02   (5.9%)                   854.99   (4.8%)    -0.4% ( -10% -   10%)    0.835
          HighIntervalsOrdered         25.08   (2.8%)                    25.00   (2.6%)    -0.3% (  -5% -    5%)    0.701
                     MedPhrase         27.20   (2.1%)                    27.11   (2.9%)    -0.3% (  -5% -    4%)    0.689
          MedTermDayTaxoFacets         81.55   (2.3%)                    81.35   (2.9%)    -0.3% (  -5% -    5%)    0.762
                        IntNRQ         63.36   (2.0%)                    63.21   (2.5%)    -0.2% (  -4% -    4%)    0.740
                        Fuzzy2         73.24   (5.5%)                    73.10   (6.2%)    -0.2% ( -11% -   12%)    0.916
       AndHighMedDayTaxoFacets         76.08   (3.5%)                    75.98   (3.4%)    -0.1% (  -6% -    7%)    0.905
                   AndHighHigh         62.20   (2.0%)                    62.18   (2.4%)    -0.0% (  -4% -    4%)    0.954
         BrowseMonthTaxoFacets      11993.48   (6.7%)                 11989.53   (4.8%)    -0.0% ( -10% -   12%)    0.986
                  OrHighNotLow        732.82   (7.2%)                   732.80   (6.2%)    -0.0% ( -12% -   14%)    0.999
                        Fuzzy1         46.43   (5.3%)                    46.45   (6.0%)     0.0% ( -10% -   11%)    0.989
                       LowTerm       1608.25   (6.0%)                  1608.84   (4.9%)     0.0% ( -10% -   11%)    0.983
                     OrHighMed         75.90   (2.3%)                    75.93   (1.8%)     0.0% (  -3% -    4%)    0.939
                     LowPhrase        273.81   (2.9%)                   274.04   (3.3%)     0.1% (  -5% -    6%)    0.932
                    AndHighLow        717.24   (6.1%)                   718.17   (3.3%)     0.1% (  -8% -   10%)    0.933
      AndHighHighDayTaxoFacets         39.63   (2.5%)                    39.69   (2.6%)     0.1% (  -4% -    5%)    0.862
                    OrHighHigh         34.63   (1.8%)                    34.68   (2.0%)     0.1% (  -3% -    4%)    0.821
               MedSloppyPhrase        158.80   (2.8%)                   159.09   (2.6%)     0.2% (  -5% -    5%)    0.832
                     OrHighLow        257.77   (2.9%)                   258.46   (4.6%)     0.3% (  -7% -    8%)    0.826
                    AndHighMed        133.43   (2.1%)                   133.79   (2.7%)     0.3% (  -4% -    5%)    0.726
             HighTermMonthSort        145.28  (10.8%)                   145.88  (11.2%)     0.4% ( -19% -   25%)    0.905
                 OrHighNotHigh        834.99   (6.1%)                   839.62   (5.7%)     0.6% ( -10% -   13%)    0.766
                    TermDTSort         83.66   (9.6%)                    84.30  (11.1%)     0.8% ( -18% -   23%)    0.817
     BrowseDayOfYearTaxoFacets      11639.59   (5.1%)                 11777.38   (6.0%)     1.2% (  -9% -   12%)    0.502
                       MedTerm       1473.62   (7.4%)                  1493.79   (6.4%)     1.4% ( -11% -   16%)    0.530
          HighTermTitleBDVSort        114.98  (16.7%)                   117.30  (18.8%)     2.0% ( -28% -   45%)    0.720
         HighTermDayOfYearSort        128.29  (17.2%)                   132.83  (22.6%)     3.5% ( -30% -   52%)    0.577
          BrowseDateTaxoFacets         19.25  (20.4%)                    26.77   (3.7%)    39.1% (  12% -   79%)    0.000
   BrowseRandomLabelSSDVFacets         10.38   (3.5%)                    18.03   (6.8%)    73.7% (  61% -   87%)    0.000
         BrowseMonthSSDVFacets         15.71   (3.6%)                    34.59  (12.4%)   120.1% ( 100% -  141%)    0.000
     BrowseDayOfYearSSDVFacets         14.31   (3.3%)                    33.54  (12.9%)   134.4% ( 114% -  155%)    0.000
   ```
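
As a reading aid for the table above: the `Pct diff` column appears consistent with the relative change in mean QPS from baseline to candidate. The sketch below recomputes it for the four facet tasks that changed significantly, under that assumption. `pct_diff` is a hypothetical helper, not part of luceneutil, and because the table's QPS values are already rounded, a recomputed figure can differ from the table in the last digit.

```python
def pct_diff(baseline_qps: float, candidate_qps: float) -> float:
    """Relative change of the candidate over the baseline, in percent."""
    return (candidate_qps - baseline_qps) / baseline_qps * 100.0


# (baseline QPS, candidate QPS) copied from the table above.
rows = {
    "BrowseDateTaxoFacets": (19.25, 26.77),
    "BrowseRandomLabelSSDVFacets": (10.38, 18.03),
    "BrowseMonthSSDVFacets": (15.71, 34.59),
    "BrowseDayOfYearSSDVFacets": (14.31, 33.54),
}

for task, (base, cand) in rows.items():
    print(f"{task:>27}  {pct_diff(base, cand):6.1f}%")
```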
   
   > Unfortunately I noticed that the sorted queries that didn't become slower 
only didn't become slower because the field was also indexed with points, so 
the short-circuiting logic we have to progressively add a filter that only 
matches competitive documents hid the slowdown. If I hack the benchmark code to 
not use this optimization then sorted queries are all about 40-50% slower.
   
   Hi @jpountz, I saw the words above in 
https://issues.apache.org/jira/browse/LUCENE-10033. I wonder whether I also 
need a similar hack to see the slower tasks. Could you tell me the concrete 
changes? I'm not very familiar with that code. Thanks!
   
   And here is my localrun script:
   ```
   #!/usr/bin/env python
   
   # Licensed to the Apache Software Foundation (ASF) under one or more
   # contributor license agreements.  See the NOTICE file distributed with
   # this work for additional information regarding copyright ownership.
   # The ASF licenses this file to You under the Apache License, Version 2.0
   # (the "License"); you may not use this file except in compliance with
   # the License.  You may obtain a copy of the License at
   # 
   #     http://www.apache.org/licenses/LICENSE-2.0
   # 
   # Unless required by applicable law or agreed to in writing, software
   # distributed under the License is distributed on an "AS IS" BASIS,
   # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   # See the License for the specific language governing permissions and
   # limitations under the License.
   
   import competition
   import sys
   
   # Simple example that runs the benchmark with the WIKI_MEDIUM source and tasks files.
   # Baseline here is ../lucene_baseline versus ../lucene_candidate
   if __name__ == '__main__':
     sourceData = competition.sourceData()
     comp = competition.Competition()
   
     facets = (('taxonomy:Date', 'Date'),
               ('sortedset:Month', 'Month'),
               ('sortedset:DayOfYear', 'DayOfYear'),
               ('sortedset:RandomLabel', 'RandomLabel'))
     # Build one index per checkout so each side searches an index written by its own code.
     index = comp.newIndex('lucene_baseline', sourceData, facets=facets,
                           indexSort='dayOfYearNumericDV:long')
     candidate_index = comp.newIndex('lucene_candidate', sourceData, facets=facets,
                                     indexSort='dayOfYearNumericDV:long')
   
     # Warning -- Do not break the order of arguments
     # TODO -- Fix the following by using argparser
     if len(sys.argv) > 3 and sys.argv[3] == '-concurrentSearches':
       concurrentSearches = True
     else:
       concurrentSearches = False
   
     # Create a competitor named baseline with sources in the ../lucene_baseline folder.
     comp.competitor('baseline', 'lucene_baseline',
                     index = index, concurrentSearches = concurrentSearches)
   
     # Create a competitor named my_modified_version with sources in the
     # ../lucene_candidate folder. It searches its own index (candidate_index)
     # rather than reusing the baseline index.
     comp.competitor('my_modified_version', 'lucene_candidate',
                     index = candidate_index, concurrentSearches = concurrentSearches)
   
     # Start the benchmark - this can take long depending on your index and machines.
     comp.benchmark("baseline_vs_patch")
     
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


