gf2121 edited a comment on pull request #557: URL: https://github.com/apache/lucene/pull/557#issuecomment-999516790
Hi @rmuir @jpountz, thanks a lot for discussing this! I think I have **probably** found a better approach:

> Actually, what I thought at first was to only change the structure of addresses, implementing a new LongValues to replace the DirectReader or DirectMonotonicReader to read addresses, e.g. a ForUtilLongValues. When users try to get a long through an index, it will use ForUtil to decompress the required block (caching the block, of course, so that if the next call falls in the same block we can reuse it).

I implemented this idea and see good benchmark results on the **newest** luceneutil (wiki10m), without any obviously slower tasks. I raised a new issue about this: https://issues.apache.org/jira/browse/LUCENE-10334

**Edit**: In addition, the new optimization does not target exactly the same place as this PR. This PR tries to optimize BinaryDocValues, while the new optimization uses the new reader for `NumericDocValues#LongValue`. This is because the effect on `NumericDocValues#LongValue` is easier to see in the newest luceneutil, and I think that if the new reader is justified for `NumericDocValues` we can easily port it to the DirectMonotonicReader :)

```
Task                        QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                   p-value
OrNotHighHigh                     694.17   (8.2%)                   685.83   (7.0%)     -1.2%  ( -15% -   15%)  0.618
Respell                            75.15   (2.7%)                    74.32   (2.0%)     -1.1%  (  -5% -    3%)  0.146
Prefix3                           220.11   (5.1%)                   217.78   (5.8%)     -1.1%  ( -11% -   10%)  0.541
Wildcard                          129.75   (3.7%)                   128.63   (2.5%)     -0.9%  (  -6% -    5%)  0.383
LowSpanNear                        68.54   (2.1%)                    68.00   (2.4%)     -0.8%  (  -5% -    3%)  0.269
OrNotHighMed                      732.90   (6.8%)                   727.49   (5.3%)     -0.7%  ( -12% -   12%)  0.703
BrowseRandomLabelTaxoFacets     11879.03   (8.6%)                 11799.33   (5.5%)     -0.7%  ( -13% -   14%)  0.769
HighSloppyPhrase                    6.87   (2.9%)                     6.83   (2.3%)     -0.6%  (  -5% -    4%)  0.496
OrHighNotMed                      827.54   (9.2%)                   822.94   (8.0%)     -0.6%  ( -16% -   18%)  0.838
MedSpanNear                        18.92   (5.7%)                    18.82   (5.6%)     -0.5%  ( -11% -   11%)  0.759
OrHighMedDayTaxoFacets             10.27   (4.0%)                    10.21   (4.3%)     -0.5%  (  -8% -    8%)  0.676
PKLookup                          207.98   (4.0%)                   206.85   (2.7%)     -0.5%  (  -7% -    6%)  0.621
LowIntervalsOrdered               159.17   (2.3%)                   158.32   (2.2%)     -0.5%  (  -4% -    3%)  0.445
HighSpanNear                        6.32   (4.2%)                     6.28   (4.1%)     -0.5%  (  -8% -    8%)  0.691
MedIntervalsOrdered                85.31   (3.2%)                    84.88   (2.9%)     -0.5%  (  -6% -    5%)  0.607
HighTerm                         1170.55   (5.8%)                  1164.79   (3.9%)     -0.5%  (  -9% -    9%)  0.753
LowSloppyPhrase                    14.54   (3.1%)                    14.48   (2.9%)     -0.4%  (  -6% -    5%)  0.651
HighPhrase                        112.81   (4.4%)                   112.39   (4.1%)     -0.4%  (  -8% -    8%)  0.781
OrNotHighLow                      858.02   (5.9%)                   854.99   (4.8%)     -0.4%  ( -10% -   10%)  0.835
HighIntervalsOrdered               25.08   (2.8%)                    25.00   (2.6%)     -0.3%  (  -5% -    5%)  0.701
MedPhrase                          27.20   (2.1%)                    27.11   (2.9%)     -0.3%  (  -5% -    4%)  0.689
MedTermDayTaxoFacets               81.55   (2.3%)                    81.35   (2.9%)     -0.3%  (  -5% -    5%)  0.762
IntNRQ                             63.36   (2.0%)                    63.21   (2.5%)     -0.2%  (  -4% -    4%)  0.740
Fuzzy2                             73.24   (5.5%)                    73.10   (6.2%)     -0.2%  ( -11% -   12%)  0.916
AndHighMedDayTaxoFacets            76.08   (3.5%)                    75.98   (3.4%)     -0.1%  (  -6% -    7%)  0.905
AndHighHigh                        62.20   (2.0%)                    62.18   (2.4%)     -0.0%  (  -4% -    4%)  0.954
BrowseMonthTaxoFacets           11993.48   (6.7%)                 11989.53   (4.8%)     -0.0%  ( -10% -   12%)  0.986
OrHighNotLow                      732.82   (7.2%)                   732.80   (6.2%)     -0.0%  ( -12% -   14%)  0.999
Fuzzy1                             46.43   (5.3%)                    46.45   (6.0%)      0.0%  ( -10% -   11%)  0.989
LowTerm                          1608.25   (6.0%)                  1608.84   (4.9%)      0.0%  ( -10% -   11%)  0.983
OrHighMed                          75.90   (2.3%)                    75.93   (1.8%)      0.0%  (  -3% -    4%)  0.939
LowPhrase                         273.81   (2.9%)                   274.04   (3.3%)      0.1%  (  -5% -    6%)  0.932
AndHighLow                        717.24   (6.1%)                   718.17   (3.3%)      0.1%  (  -8% -   10%)  0.933
AndHighHighDayTaxoFacets           39.63   (2.5%)                    39.69   (2.6%)      0.1%  (  -4% -    5%)  0.862
OrHighHigh                         34.63   (1.8%)                    34.68   (2.0%)      0.1%  (  -3% -    4%)  0.821
MedSloppyPhrase                   158.80   (2.8%)                   159.09   (2.6%)      0.2%  (  -5% -    5%)  0.832
OrHighLow                         257.77   (2.9%)                   258.46   (4.6%)      0.3%  (  -7% -    8%)  0.826
AndHighMed                        133.43   (2.1%)                   133.79   (2.7%)      0.3%  (  -4% -    5%)  0.726
HighTermMonthSort                 145.28  (10.8%)                   145.88  (11.2%)      0.4%  ( -19% -   25%)  0.905
OrHighNotHigh                     834.99   (6.1%)                   839.62   (5.7%)      0.6%  ( -10% -   13%)  0.766
TermDTSort                         83.66   (9.6%)                    84.30  (11.1%)      0.8%  ( -18% -   23%)  0.817
BrowseDayOfYearTaxoFacets       11639.59   (5.1%)                 11777.38   (6.0%)      1.2%  (  -9% -   12%)  0.502
MedTerm                          1473.62   (7.4%)                  1493.79   (6.4%)      1.4%  ( -11% -   16%)  0.530
HighTermTitleBDVSort              114.98  (16.7%)                   117.30  (18.8%)      2.0%  ( -28% -   45%)  0.720
HighTermDayOfYearSort             128.29  (17.2%)                   132.83  (22.6%)      3.5%  ( -30% -   52%)  0.577
BrowseDateTaxoFacets               19.25  (20.4%)                    26.77   (3.7%)     39.1%  (  12% -   79%)  0.000
BrowseRandomLabelSSDVFacets        10.38   (3.5%)                    18.03   (6.8%)     73.7%  (  61% -   87%)  0.000
BrowseMonthSSDVFacets              15.71   (3.6%)                    34.59  (12.4%)    120.1%  ( 100% -  141%)  0.000
BrowseDayOfYearSSDVFacets          14.31   (3.3%)                    33.54  (12.9%)    134.4%  ( 114% -  155%)  0.000
```

> Unfortunately I noticed that the sorted queries that didn't become slower only didn't become slower because the field was also indexed with points, so the short-circuiting logic we have to progressively add a filter that only matches competitive documents hid the slowdown. If I hack the benchmark code to not use this optimization then sorted queries are all about 40-50% slower.

Hi @jpountz, I saw this comment in https://issues.apache.org/jira/browse/LUCENE-10033. I wonder whether I also need a similar hack to see the slower tasks? Could you tell me the concrete changes, since I'm not very familiar with this part? Thanks!

And here is my localrun script:

```
#!/usr/bin/env python

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import competition
import sys

# simple example that runs benchmark with WIKI_MEDIUM source and tasks files
# Baseline here is ../lucene_baseline versus ../lucene_candidate
if __name__ == '__main__':
  sourceData = competition.sourceData()
  comp = competition.Competition()

  facets = (('taxonomy:Date', 'Date'),
            ('sortedset:Month', 'Month'),
            ('sortedset:DayOfYear', 'DayOfYear'),
            ('sortedset:RandomLabel', "RandomLabel"))
  index = comp.newIndex('lucene_baseline', sourceData, facets=facets, indexSort='dayOfYearNumericDV:long')
  candidate_index = comp.newIndex('lucene_candidate', sourceData, facets=facets, indexSort='dayOfYearNumericDV:long')

  #Warning -- Do not break the order of arguments
  #TODO -- Fix the following by using argparser
  if len(sys.argv) > 3 and sys.argv[3] == '-concurrentSearches':
    concurrentSearches = True
  else:
    concurrentSearches = False

  # create a competitor named baseline with sources in the ../trunk folder
  comp.competitor('baseline', 'lucene_baseline',
                  index = index, concurrentSearches = concurrentSearches)

  # use the same index here
  # create a competitor named my_modified_version with sources in the ../patch folder
  # note that we haven't specified an index here, luceneutil will automatically use the index from the base competitor for searching
  # while the codec that is used for running this competitor is taken from this competitor.
  comp.competitor('my_modified_version', 'lucene_candidate',
                  index = candidate_index, concurrentSearches = concurrentSearches)

  # start the benchmark - this can take long depending on your index and machines
  comp.benchmark("baseline_vs_patch")
```
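
For illustration only, the block-caching reader idea described near the top of this comment (decode the block that contains the requested index, keep it, and reuse it when the next call falls in the same block) could be sketched roughly as below. This is a toy Python model, not the Lucene code: `BlockCachedLongValues`, `pack_block`, `unpack_block`, `decode_count`, and the block size of 128 are hypothetical names and values chosen for this sketch, and plain per-block bit-packing of non-negative values stands in for ForUtil.

```python
from typing import List, Tuple

BLOCK_SIZE = 128  # assumed block size, chosen only for this sketch

# (bits per value, number of values, packed bits)
PackedBlock = Tuple[int, int, int]


def pack_block(values: List[int]) -> PackedBlock:
    """Bit-pack non-negative ints into one big integer (stand-in for ForUtil encoding)."""
    bits_per_value = max(1, max(values).bit_length())
    packed = 0
    for i, value in enumerate(values):
        packed |= value << (i * bits_per_value)
    return bits_per_value, len(values), packed


def unpack_block(block: PackedBlock) -> List[int]:
    """Decode every value of a packed block at once."""
    bits_per_value, count, packed = block
    mask = (1 << bits_per_value) - 1
    return [(packed >> (i * bits_per_value)) & mask for i in range(count)]


class BlockCachedLongValues:
    """Random-access reader that decodes whole blocks and caches the last one."""

    def __init__(self, values: List[int]) -> None:
        self.blocks = [
            pack_block(values[start:start + BLOCK_SIZE])
            for start in range(0, len(values), BLOCK_SIZE)
        ]
        self.cached_block_id = -1          # no block decoded yet
        self.cached_values: List[int] = []
        self.decode_count = 0              # instrumentation for the sketch only

    def get(self, index: int) -> int:
        block_id, offset = divmod(index, BLOCK_SIZE)
        if block_id != self.cached_block_id:
            # Cache miss: decode the whole block that contains `index`.
            self.cached_values = unpack_block(self.blocks[block_id])
            self.cached_block_id = block_id
            self.decode_count += 1
        return self.cached_values[offset]


if __name__ == "__main__":
    addresses = [i * 3 for i in range(300)]
    reader = BlockCachedLongValues(addresses)
    assert all(reader.get(i) == addresses[i] for i in range(300))
    print("blocks decoded for a sequential scan:", reader.decode_count)
```

In this model a sequential scan decodes each block once, while lookups that jump between blocks pay a full block decode per call. That second access pattern is presumably where a slowdown would show up, if there is one, so it is the pattern worth checking in the benchmark.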
