[GitHub] [lucene] mmatela commented on issue #12080: SynonymGraphFilter: wrong output token position when input positions overlap

2023-02-17 Thread via GitHub


mmatela commented on issue #12080:
URL: https://github.com/apache/lucene/issues/12080#issuecomment-1434301348

   Turns out my initial solution lead to exceptions when a synonym appears at 
the beginning of the query or there are more tokens after the synonym. After 
some trial and error, it seems to work correctly with these changes: 
https://github.com/mmatela/lucene/commit/1d2df64e09cfbc89d42511274530248fe559befb


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] LuXugang opened a new pull request, #12153: Unrelated code in TestIndexSortSortedNumericDocValuesRangeQuery

2023-02-17 Thread via GitHub


LuXugang opened a new pull request, #12153:
URL: https://github.com/apache/lucene/pull/12153

   `SortedSetDocValuesField.newSlowRangeQuery`  appeared in 
`TestIndexSortSortedNumericDocValuesRangeQuery#toString` seems no reason?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir merged pull request #12132: Implement ScorerSupplier for Sorted(Set)DocValuesField#newSlowRangeQuery

2023-02-17 Thread via GitHub


rmuir merged PR #12132:
URL: https://github.com/apache/lucene/pull/12132


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on issue #12151: Benchmark Current Approaches for TermInSetQuery Evaluation

2023-02-17 Thread via GitHub


rmuir commented on issue #12151:
URL: https://github.com/apache/lucene/issues/12151#issuecomment-1434652878

   > First, imagine searching over a catalog of products, where products have 
been assigned a
   > [UNSPSC](https://en.wikipedia.org/wiki/UNSPSC) categorization identifier.
   
   You can use Points for this and it will be faster than all the benchmark 
here. I think you might be missing the forest for the trees focusing so much on 
Terms.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase opened a new pull request, #12154: Implement ScorerSupplier for LatLonDocValuesQuery

2023-02-17 Thread via GitHub


iverase opened a new pull request, #12154:
URL: https://github.com/apache/lucene/pull/12154

   Similar to https://github.com/apache/lucene/pull/12132, implement Score 
supplier for LatLonDocValuesQuery and move the creation of the Component2D in 
there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1434796647

   @jpountz I've found that applying this same idea to `TermInSetQuery` is 
really helpful for performance in our use-cases at Amazon product search. It's 
nice because the behavior of `TermInSetQuery` gradually changes from a standard 
boolean query to an up-front rewrite approach. We've had some trouble in the 
past with the hard flip-over in the behavior (e.g., one more term in a 
disjunction leads to completely different behavior).
   
   Do you mind if I extend this PR to include a similar change to 
`TermInSetQuery`? Not sure where you are with this work, or if you were 
planning to pick it back up?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


jpountz commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1434808901

   My attention has moved to a few other things, feel free to do whatever you 
want with this PR, I'll be happy to review.
   
   +1 on the nice property of gradually moving from a lazy disjunction to eager 
evaluation, this is what I was after with this change!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


gsmiller commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1109995482


##
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext 
context) throws IOException {
   }
   Query q = new ConstantScoreQuery(bq.build());
   final Weight weight = searcher.rewrite(q).createWeight(searcher, 
scoreMode, score());
-  return new WeightOrDocIdSet(weight);
+  return new WeightOrDocIdSetIterator(weight);
 }
 
 // Too many terms: go back to the terms we already collected and start 
building the bit set
-DocIdSetBuilder builder = new 
DocIdSetBuilder(context.reader().maxDoc(), terms);
+PriorityQueue highFrequencyTerms =
+new PriorityQueue(collectedTerms.size()) {
+  @Override
+  protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+return a.cost() < b.cost();

Review Comment:
   Looking at the internals of `insertWithOverflow`, I don't think we'll churn 
the smallest term here. If a later-visited term has the same cost as the top of 
the PQ, it will be rejected / immediately returned by `insertWithOverflow` and 
subsequently built into the bitset, which I think is what we want (to avoid 
unnecessary seeking later). Does that sound right? It's possible I'm missing 
something or misunderstanding the concern.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1434825645

   Thanks @jpountz. I pushed my `TermInSetQuery` changes, but still need to 
address a couple of Robert's comments on the original implementation. I'll 
update here when I think it's ready for a look. Thanks for the offer to review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase closed pull request #12154: Implement ScorerSupplier for LatLonDocValuesQuery

2023-02-17 Thread via GitHub


iverase closed pull request #12154: Implement ScorerSupplier for 
LatLonDocValuesQuery
URL: https://github.com/apache/lucene/pull/12154


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase commented on pull request #12154: Implement ScorerSupplier for LatLonDocValuesQuery

2023-02-17 Thread via GitHub


iverase commented on PR #12154:
URL: https://github.com/apache/lucene/pull/12154#issuecomment-1434848174

   Yikes, Adrien just make me realise that now I might be creating one of those 
Component2D objects per segment. I am going to close it for now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435332248

   OK, I think I've addressed the previous feedback and also brought in the 
same changes to `TermInSetQuery`. This should be ready for feedback @jpountz 
(whenever you have a free moment).
   
   On some internal benchmarks (Amazon product search), we see throughput 
increases ranging from ~7 - 63% (depending on a number of factors). We have 
some situations where pre-processing of long postings (in the existing TiS 
implementation) takes up a large share of CPU time (these tend to be cases 
where the actual matches are sparse but the TiS terms match a large number of 
docs). Being able to "hold back" these long postings into a 
`DisjunctionDisiAppox` while still pre-processing the shorter postings is a big 
win in these cases. Here are some flame charts showing the impact (heavily 
redacted of course):
   
   "Normal" TiS:
   https://user-images.githubusercontent.com/16479560/219803198-c28ce8f5-1807-4c86-b3ec-7d6e33096533.png";>
   
   TiS with this PR:
   https://user-images.githubusercontent.com/16479560/219803253-b785d652-cf37-42c6-bf7c-be6d18e69caf.png";>
   
   I also re-ran `luceneutil` benchmarks (`wikimedium10m`) and see consistent 
results with the initial PR:
   ```
   TaskQPS baseline  StdDevQPS candidate  
StdDevPct diff p-value
  BrowseMonthTaxoFacets   32.05 (14.3%)   30.06 
(23.2%)   -6.2% ( -38% -   36%) 0.309
  BrowseMonthSSDVFacets   14.85 (17.6%)   13.95  
(2.3%)   -6.0% ( -22% -   16%) 0.127
BrowseRandomLabelTaxoFacets   23.18 (12.4%)   21.93 
(19.8%)   -5.4% ( -33% -   30%) 0.303
   BrowseDateTaxoFacets   31.04 (14.2%)   29.42 
(22.7%)   -5.2% ( -36% -   37%) 0.383
  BrowseDayOfYearTaxoFacets   31.29 (14.4%)   29.66 
(22.9%)   -5.2% ( -37% -   37%) 0.389
  BrowseDayOfYearSSDVFacets   14.17 (17.7%)   13.85 
(14.0%)   -2.2% ( -28% -   35%) 0.659
 IntNRQ  156.59  (5.3%)  154.50  
(7.1%)   -1.3% ( -12% -   11%) 0.498
  HighTermTitleSort  100.27  (2.6%)   99.43  
(2.8%)   -0.8% (  -6% -4%) 0.329
  HighTermMonthSort 2633.69  (3.3%) 2614.59  
(3.1%)   -0.7% (  -6% -5%) 0.477
 AndHighLow 1085.57  (2.9%) 1080.12  
(2.5%)   -0.5% (  -5% -5%) 0.559
  OrNotHighHigh  795.28  (3.1%)  791.61  
(3.4%)   -0.5% (  -6% -6%) 0.656
 HighPhrase  145.14  (2.7%)  144.70  
(2.8%)   -0.3% (  -5% -5%) 0.725
   BrowseDateSSDVFacets3.81  (7.7%)3.80  
(7.9%)   -0.3% ( -14% -   16%) 0.905
 Fuzzy2   57.08  (1.3%)   56.95  
(1.2%)   -0.2% (  -2% -2%) 0.555
  LowPhrase  379.61  (2.4%)  378.74  
(2.8%)   -0.2% (  -5% -5%) 0.783
 Fuzzy1   76.62  (1.5%)   76.45  
(1.2%)   -0.2% (  -2% -2%) 0.621
  MedPhrase   19.20  (2.3%)   19.19  
(2.2%)   -0.1% (  -4% -4%) 0.898
 OrHighMedDayTaxoFacets   15.40  (3.4%)   15.40  
(3.3%)   -0.0% (  -6% -6%) 0.978
   OrNotHighLow 1186.45  (3.0%) 1186.35  
(2.2%)   -0.0% (  -5% -5%) 0.992
MedSpanNear8.28  (2.7%)8.28  
(2.6%)0.0% (  -5% -5%) 0.996
 TermDTSort  104.78  (1.2%)  104.79  
(1.3%)0.0% (  -2% -2%) 0.984
 AndHighMed  204.19  (3.2%)  204.21  
(3.7%)0.0% (  -6% -7%) 0.992
   MedTermDayTaxoFacets   51.30  (2.9%)   51.32  
(3.0%)0.0% (  -5% -6%) 0.970
   HighTermTitleBDVSort   21.28  (4.7%)   21.28  
(4.2%)0.0% (  -8% -9%) 0.975
LowSpanNear  179.68  (1.3%)  179.77  
(1.6%)0.0% (  -2% -3%) 0.919
AndHighMedDayTaxoFacets   25.84  (1.6%)   25.85  
(1.7%)0.0% (  -3% -3%) 0.923
  OrHighMed  106.73  (3.0%)  106.83  
(3.4%)0.1% (  -6% -6%) 0.932
   OrNotHighMed  329.39  (3.1%)  329.71  
(3.4%)0.1% (  -6% -6%) 0.925
Respell   49.02  (0.9%)   49.09  
(0.6%)0.1% (  -1% -1%) 0.546
MedIntervalsOrdered4.40  (5.8%)4.40  
(5.5%)0.1% ( -10% -   12%) 0.933
   AndHighHighDayTaxoFacets   13.00  (1.7%)   13.02  
(1.5%)0.2% (  -2% -3%) 0.760
  

[GitHub] [lucene] rmuir commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


rmuir commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1110449800


##
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext 
context) throws IOException {
   }
   Query q = new ConstantScoreQuery(bq.build());
   final Weight weight = searcher.rewrite(q).createWeight(searcher, 
scoreMode, score());
-  return new WeightOrDocIdSet(weight);
+  return new WeightOrDocIdSetIterator(weight);
 }
 
 // Too many terms: go back to the terms we already collected and start 
building the bit set
-DocIdSetBuilder builder = new 
DocIdSetBuilder(context.reader().maxDoc(), terms);
+PriorityQueue highFrequencyTerms =
+new PriorityQueue(collectedTerms.size()) {
+  @Override
+  protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+return a.cost() < b.cost();

Review Comment:
   dawid fixed the issue: `!lessThan()` is no longer used in the logic, and the 
"churn" should be avoided I think. 
https://github.com/apache/lucene/commit/ba9fee502b0c18fda312da55c6304ac8a463f509
   
   I was looking at outdated version of this file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


rmuir commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435401885

   I dont see any of my feedback addressed.  I'll repeat what i said before:
   * We shouldn't be forming booleanqueries from a `FILTER` rewrite, this is 
wrong to do and it causes some slowdowns in some cases. We need more rewrites 
instead of one 'wonder-do-it-all'.
   * We are still reusing postingsenum (now only some of the time, grr) and 
tossing them into priority queues. this is unsafe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


rmuir commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435403774

   I think it would help, a lot, to look thru history and see how "constant 
score auto rewrite" was implemented years ago, and then its removal, before 
adding it back again.
   
   but I'm firmly against doing this crap inside filter rewrite. If we really 
have good reason to "add  back" "auto rewrite", it would be good to add it as a 
separate thing like before. Also probably good to look at how it was done 
before.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


rmuir commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435411850

   also dropping the postings reuse is going to cause big performance 
degradation for many situations. For example with NIOFSDirectory, new postings 
reader means indexinput.clone() calls, buffer refills, etc etc. It translates 
into real I/O
   
   But the reuse isn't dropped in all cases, and we're still putting "reused 
enums" into things like PQs. Sorry, none of this looks even half-way baked at 
all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


rmuir commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435417952

   I suggest an easy path to success here:
   * Keep the simple filter rewrite without crazy boolean auto-optimizations, 
that does what the javadocs in MultiTermQuery says it does and nothing more. It 
will be useful as a bailout, e.g. for NIOFS users, in case the crazy heuristics 
dont work well. It will also make testing easy: there have been iterations here 
that are clearly incorrect and yet no tests fail. 
   * add a new auto rewrite, but first look at why it failed and was removed 
before and try to learn from those lessons. It is ok, for it to be the new 
"default" rewrite method, once it it is baked.
   * fix TermInSetQuery to extend MultiTermQuery so we don't duplicate the 
horror-show twice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435469343

   @rmuir thanks for the feedback. Let me see if I can respond to all of it 
here:
   
   > postings reuse problems
   
   Can you help me with where you see this as a problem? I went through the 
code and didn't see any obvious mistakes. I'm sure I'm missing it? The only 
tricky one is in the do/while loop in MTQCSW starting on line 214. But I think 
this is safe since we reassign `postings` to `dropped` at the end of the loop, 
so `postings` should be safe to reuse at that point. Again, not trying to 
challenge this feedback, just trying to understand what I'm overlooking.
   
   > objection to modifying MTQCSW rewrite in general / learn from "constant 
score auto rewrite" history
   
   Thanks for raising this. I wasn't aware of the history here so I'll have to 
go do some digging. I'll take that on as a next step.
   
   > have TermInSetQuery extend MultiTermQuery
   
   I'd looked at this in the past and had found performance regressions with 
this approach, which I believe were caused by the different term intersection 
approach (seekCeil vs. seekExact). I don't recall the amount of regression 
though. I'll see if I can dig that up. Hopefully I documented it somewhere. But 
if not, I'll try to re-benchmark. Maybe the regressions don't justify the 
separate implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-17 Thread via GitHub


gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435477300

   Found the issue where the "constant score auto rewrite" implementation was 
removed: [LUCENE-5938](https://issues.apache.org/jira/browse/LUCENE-5938). If 
I'm understanding the history, it seems like the auto rewrite logic was trying 
to balance two things:
   1. Number of terms. If there are "few" terms (< 350) it would favor using a 
boolean query. When it passed the term threshold, it would use a filter rewrite.
   2. Sparsity of docs (total docFreq over visited terms relative to total docs 
in segment). If the "density" of the docs passed a certain point (0.1%), it 
would favor a filter approach instead of a boolean query. This point seems to 
have been in place to account for the fact that rewriting required using a 
fixed bitset, which wasn't efficient when very sparse.
   
   It looks like this rewrite method was removed in LUCENE-5938 since it 
introduced the idea of a sparse bitset, which removed the issue with #2 above.
   
   In my opinion, it seems like #1 is still a very valid trade-off (many terms 
are inefficient to manage in a boolean query due to the associated PQ). This, 
of course, is what the current rewrite method takes into consideration (with a 
threshold of 16 terms). What I still _don't_ like about the existing 
implementation is how it completely changes behavior to a full bitset rewrite 
after passing 16 terms. I _do_ think it's a nice win overall to rewrite in the 
way proposed by this PR. As far as I can tell, the former implementation never 
"incrementally" pre-processed postings into a filter bitset. It was an "all or 
nothing" approach. I think the key benefit of this PR is to allow for 
"incremental" processing.
   
   But, I also recognize it might not be applicable in all cases, for all 
users, and/or for all file systems. I think it's really good feedback to 
introduce this idea as a new rewrite option instead of modifying the existing 
one in-place. I'll look into that as a next step.
   
   @rmuir / @jpountz / @mikemccand - since you all were involved in LUCENE-5938 
and the earlier implementation of the "auto rewrite," please let me know if I'm 
missing anything. I the best "digital archeology" I could, but it's very 
possible I'm missing something. Thanks again for the feedback!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org