[GitHub] [lucene] jpountz commented on pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

2022-09-09 Thread GitBox


jpountz commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1241631941

   Thank you for your comments, I think I understand the bug now. I think a 
better description of the bug is that `BitSetConjunctionDISI#docID()` doesn't 
honor its contract that it must return `NO_MORE_DOCS` when the iterator is 
exhausted. And this issue may only occur if any of the bitsets involved in the 
conjunction have a length that is less than `maxDoc`, which is not typical. 
This would explain why we haven't seen this bug earlier.
   
   > Is it valid for an iterator to advance outside of the ConjunctionDISI? I 
cant find anywhere that prevents this, but I was under the assumption it should 
be invalid to do this.
   
   It is invalid indeed. The goal of the assertion that all iterators are on 
the same doc is to catch such issues.
   
   > I was trying to keep all sub-iterators on the same doc all the time.
   
   I have a preference for not advancing other iterators, because it should not 
be necessary, and if it is then this means we have another bug somewhere else. 
The `ConjunctionDISI` that doesn't optimize for bitsets only advances other 
iterators when reaching `NO_MORE_DOCS` because avoiding to do this would 
require introducing more conditionals, which would have overhead. But it isn't 
necessary.
   
   I would suggest to no longer advance other iterators, and change the test to 
make sure that `docID()` returns `NO_MORE_DOCS` when the iterator is exhausted, 
instead of checking if all sub iterators are on the same doc ID?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #11753: Added interface to relate a LatLonShape with another shape represented as Component2D

2022-09-09 Thread GitBox


jpountz commented on PR #11753:
URL: https://github.com/apache/lucene/pull/11753#issuecomment-1241643263

   I had missed that the constructor was not publicly available. Am I guessing 
correctly that your goal is to create custom collectors that work with shapes? 
Maybe we should make the ctor that takes a `BytesRef` public for such 
use-cases? @nknize I wonder if you have better ideas as to how to unloch such 
use-cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

2022-09-09 Thread GitBox


jpountz commented on PR #11741:
URL: https://github.com/apache/lucene/pull/11741#issuecomment-1241681411

   > If we assume a scenario where we have a TermInSetQuery over very selective 
terms (low docFreqs for each), we'd want to use the index query unless there's 
another clause that can lead that query that's significantly more restrictive. 
So there's no benefit of using IndexOrDocValuesQuery in that scenario, but 
moving the term lookup to the scorerSupplier also shouldn't hurt this case 
since we have to do it anyway.
   
   I'm not sure if this is true. I've seen users run `TermInSetQuery`s with 10k 
terms or more, a typical use-case being implementing some form of join where a 
first query collects IDs of interest and a second query uses them as a filter. 
Terms dictionary lookups tend to be expensive, so looking up these 10k terms is 
not cheap, and if `IndexOrDocValuesQuery` decides that the doc-values approach 
is better because of costs, then Lucene will perform these 10k lookups again in 
the terms dictionary of the doc values field, which would be wasteful?
   
   Not directly related to this discussion, but `TermInSetQuery` feels more 
complicated than ranges when it comes to figuring out the best data structure 
to run the query, and maybe we should fold the logic about whether or not to 
use doc-values directly into `TermInSetQuery` instead of expecting users to 
build an `IndexOrDocValuesQuery` themselves? This would allow more 
sophisticated logic, such as making different decisions depending on whether we 
expect terms to be selective or not?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] romseygeek opened a new issue, #11759: IntervalBuilder.NO_INTERVALS returns wrong docId when unpositioned

2022-09-09 Thread GitBox


romseygeek opened a new issue, #11759:
URL: https://github.com/apache/lucene/issues/11759

   ### Description
   
   DocIdSetIterators should return -1 when they are unpositioned, but 
   IntervalBuilder.NO_INTERVALS always returns NO_MORE_DOCS.  This
   can lead to exceptions when an empty interval query (for example, a
   text string made entirely of stopwords) is combined in a conjunction.
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] nknize commented on pull request #11753: Added interface to relate a LatLonShape with another shape represented as Component2D

2022-09-09 Thread GitBox


nknize commented on PR #11753:
URL: https://github.com/apache/lucene/pull/11753#issuecomment-1242048944

   > Maybe we should make the ctor that takes a `BytesRef` public for such 
use-cases?
   
   We can either make the `LatLonShapeDocValues` and `XYShapeDocValues` ctor 
public, or add new factory methods to `LatLonShape` and `XYShape` for creating 
the DocValues instances like we do the fields? I like the latter for 
consistency. 
   
   I also think it might be a good idea to add a new `public void 
resetBinaryValue(BytesRef binaryValue)` method to `ShapeDocValues`? This would 
enable us to reuse the same ShapeDocValue instance inside of a query by just 
resetting the backing data from the iterator values. 
   
   I'm happy to do this but I think these would be a great contribution to help 
build your merit. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] romseygeek opened a new pull request, #11760: Fix IntervalBuilder.NO_INTERVALS docId when unpositioned

2022-09-09 Thread GitBox


romseygeek opened a new pull request, #11760:
URL: https://github.com/apache/lucene/pull/11760

   IntervalBuilder.NO_INTERVALS should return -1 when unpositioned,
   not NO_MORE_DOCS.  This can trigger exceptions when an empty
   intervalquery is combined in a conjunction.
   
   Fixes #11759 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #11729: LUCENE-11728: Improve code clarity for OrdinalMap

2022-09-09 Thread GitBox


jpountz commented on PR #11729:
URL: https://github.com/apache/lucene/pull/11729#issuecomment-1242116139

   This class is indeed a bit subtle, but most of the comments that you are 
suggesting seem to be paraphrasing what the code is doing and I'm not sure that 
they would actually help someone who would try to get more familiar with this 
class.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] romseygeek merged pull request #11760: Fix IntervalBuilder.NO_INTERVALS docId when unpositioned

2022-09-09 Thread GitBox


romseygeek merged PR #11760:
URL: https://github.com/apache/lucene/pull/11760


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] romseygeek closed issue #11759: IntervalBuilder.NO_INTERVALS returns wrong docId when unpositioned

2022-09-09 Thread GitBox


romseygeek closed issue #11759: IntervalBuilder.NO_INTERVALS returns wrong 
docId when unpositioned
URL: https://github.com/apache/lucene/issues/11759


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mdmarshmallow commented on pull request #11729: LUCENE-11728: Improve code clarity for OrdinalMap

2022-09-09 Thread GitBox


mdmarshmallow commented on PR #11729:
URL: https://github.com/apache/lucene/pull/11729#issuecomment-1242374535

   That's fair, I think I also was getting a little too specific with the 
comments in some places. I'll go through and see if I can make the comments 
more general than they currently are.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] navneet1v commented on pull request #11753: Added interface to relate a LatLonShape with another shape represented as Component2D

2022-09-09 Thread GitBox


navneet1v commented on PR #11753:
URL: https://github.com/apache/lucene/pull/11753#issuecomment-1242385721

   > > Maybe we should make the ctor that takes a `BytesRef` public for such 
use-cases?
   > 
   > We can either make the `LatLonShapeDocValues` and `XYShapeDocValues` ctor 
public, or add new factory methods to `LatLonShape` and `XYShape` for creating 
the DocValues instances like we do the fields? I like the latter for 
consistency.
   > 
   > I also think it might be a good idea to add a new `public void 
resetBinaryValue(BytesRef binaryValue)` method to `ShapeDocValues`? This would 
enable us to reuse the same ShapeDocValue instance inside of a query by just 
resetting the backing data from the iterator values.
   > 
   > I'm happy to do this but I think these would be a great contribution to 
help build your merit.
   
   I will take the approach of creating a new factory method in LatLonShape and 
XYShape. On the function: resetBinaryValue I will check what I can do here. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

2022-09-09 Thread GitBox


gsmiller commented on PR #11741:
URL: https://github.com/apache/lucene/pull/11741#issuecomment-1242553881

   > I'm not sure if this is true. I've seen users run TermInSetQuerys with 10k 
terms or more, a typical use-case being implementing some form of join where a 
first query collects IDs of interest and a second query uses them as a filter.
   
   Right, that's fair. I'm curious about the cases you've seen though. The cost 
estimate for `TermInSetQuery` is pretty perfectly suited for this use-case 
right now (assuming the key uniquely identifies a document). Putting cost 
estimation aside though, I'd be curious when a doc-values approach would be 
more efficient here. Looking up the 10k terms has to be done for both query 
approaches, so I'd only expect the doc-values approach to be more efficient if 
there's a lead iterator with relatively few documents (relative to the number 
of unique terms in the join). Is that the sort of case you have in mind?
   
   > Terms dictionary lookups tend to be expensive, so looking up these 10k 
terms is not cheap, and if IndexOrDocValuesQuery decides that the doc-values 
approach is better because of costs, then Lucene will perform these 10k lookups 
again in the terms dictionary of the doc values field, which would be wasteful?
   
   +1. I think we're saying roughly the same thing. The problematic case with 
doing term lookup as part of cost estimation is duplicating that work when 
deciding to use doc-values. It's too bad we have to effectively do the same 
work twice in this case. It would be super cool if we could find a way to reuse 
the term lookup work, but that's obviously very non-trivial since the term 
dictionaries are completely different implementations, types of fields, etc.
   
   > Not directly related to this discussion, but TermInSetQuery feels more 
complicated than ranges when it comes to figuring out the best data structure 
to run the query, and maybe we should fold the logic about whether or not to 
use doc-values directly into TermInSetQuery instead of expecting users to build 
an IndexOrDocValuesQuery themselves? This would allow more sophisticated logic, 
such as making different decisions depending on whether we expect terms to be 
selective or not?
   
   I like this idea a lot actually. I can imagine a fairly different type of 
heuristic we might want to use as compared to numeric ranges. For instance, the 
number of terms matters. Whether-or-not we can identify that the field is a 
unique key matters. Hmmm... seems worth experimenting with a bit. Thanks for 
the idea!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mdmarshmallow opened a new issue, #11761: Expand TieredMergePolicy deletePctAllowed limits

2022-09-09 Thread GitBox


mdmarshmallow opened a new issue, #11761:
URL: https://github.com/apache/lucene/issues/11761

   ### Description
   
   I'm an engineer at Amazon Search and we have been experimenting with more 
aggressively getting rid of deleted documents. We use TieredMergePolicy and we 
would like to set `TieredMergePolicy#deletesPctAllowed` to be [lower than the 
current limit of 
20%](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L157).
 
   
   I was wondering why this limit was set in place. I'm sure I could be missing 
some context here. Maybe we could keep the limits in place but allow users to 
explicitly remove the checks? Any information would be much appreciated, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org