[GitHub] [lucene] jpountz commented on pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts
jpountz commented on PR #1068: URL: https://github.com/apache/lucene/pull/1068#issuecomment-1241631941 Thank you for your comments, I think I understand the bug now. I think a better description of the bug is that `BitSetConjunctionDISI#docID()` doesn't honor its contract that it must return `NO_MORE_DOCS` when the iterator is exhausted. And this issue may only occur if any of the bitsets involved in the conjunction have a length that is less than `maxDoc`, which is not typical. This would explain why we haven't seen this bug earlier. > Is it valid for an iterator to advance outside of the ConjunctionDISI? I cant find anywhere that prevents this, but I was under the assumption it should be invalid to do this. It is invalid indeed. The goal of the assertion that all iterators are on the same doc is to catch such issues. > I was trying to keep all sub-iterators on the same doc all the time. I have a preference for not advancing other iterators, because it should not be necessary, and if it is then this means we have another bug somewhere else. The `ConjunctionDISI` that doesn't optimize for bitsets only advances other iterators when reaching `NO_MORE_DOCS` because avoiding to do this would require introducing more conditionals, which would have overhead. But it isn't necessary. I would suggest to no longer advance other iterators, and change the test to make sure that `docID()` returns `NO_MORE_DOCS` when the iterator is exhausted, instead of checking if all sub iterators are on the same doc ID? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #11753: Added interface to relate a LatLonShape with another shape represented as Component2D
jpountz commented on PR #11753: URL: https://github.com/apache/lucene/pull/11753#issuecomment-1241643263 I had missed that the constructor was not publicly available. Am I guessing correctly that your goal is to create custom collectors that work with shapes? Maybe we should make the ctor that takes a `BytesRef` public for such use-cases? @nknize I wonder if you have better ideas as to how to unloch such use-cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost
jpountz commented on PR #11741: URL: https://github.com/apache/lucene/pull/11741#issuecomment-1241681411 > If we assume a scenario where we have a TermInSetQuery over very selective terms (low docFreqs for each), we'd want to use the index query unless there's another clause that can lead that query that's significantly more restrictive. So there's no benefit of using IndexOrDocValuesQuery in that scenario, but moving the term lookup to the scorerSupplier also shouldn't hurt this case since we have to do it anyway. I'm not sure if this is true. I've seen users run `TermInSetQuery`s with 10k terms or more, a typical use-case being implementing some form of join where a first query collects IDs of interest and a second query uses them as a filter. Terms dictionary lookups tend to be expensive, so looking up these 10k terms is not cheap, and if `IndexOrDocValuesQuery` decides that the doc-values approach is better because of costs, then Lucene will perform these 10k lookups again in the terms dictionary of the doc values field, which would be wasteful? Not directly related to this discussion, but `TermInSetQuery` feels more complicated than ranges when it comes to figuring out the best data structure to run the query, and maybe we should fold the logic about whether or not to use doc-values directly into `TermInSetQuery` instead of expecting users to build an `IndexOrDocValuesQuery` themselves? This would allow more sophisticated logic, such as making different decisions depending on whether we expect terms to be selective or not? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] romseygeek opened a new issue, #11759: IntervalBuilder.NO_INTERVALS returns wrong docId when unpositioned
romseygeek opened a new issue, #11759: URL: https://github.com/apache/lucene/issues/11759 ### Description DocIdSetIterators should return -1 when they are unpositioned, but IntervalBuilder.NO_INTERVALS always returns NO_MORE_DOCS. This can lead to exceptions when an empty interval query (for example, a text string made entirely of stopwords) is combined in a conjunction. ### Version and environment details _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] nknize commented on pull request #11753: Added interface to relate a LatLonShape with another shape represented as Component2D
nknize commented on PR #11753: URL: https://github.com/apache/lucene/pull/11753#issuecomment-1242048944 > Maybe we should make the ctor that takes a `BytesRef` public for such use-cases? We can either make the `LatLonShapeDocValues` and `XYShapeDocValues` ctor public, or add new factory methods to `LatLonShape` and `XYShape` for creating the DocValues instances like we do the fields? I like the latter for consistency. I also think it might be a good idea to add a new `public void resetBinaryValue(BytesRef binaryValue)` method to `ShapeDocValues`? This would enable us to reuse the same ShapeDocValue instance inside of a query by just resetting the backing data from the iterator values. I'm happy to do this but I think these would be a great contribution to help build your merit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] romseygeek opened a new pull request, #11760: Fix IntervalBuilder.NO_INTERVALS docId when unpositioned
romseygeek opened a new pull request, #11760: URL: https://github.com/apache/lucene/pull/11760 IntervalBuilder.NO_INTERVALS should return -1 when unpositioned, not NO_MORE_DOCS. This can trigger exceptions when an empty intervalquery is combined in a conjunction. Fixes #11759 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #11729: LUCENE-11728: Improve code clarity for OrdinalMap
jpountz commented on PR #11729: URL: https://github.com/apache/lucene/pull/11729#issuecomment-1242116139 This class is indeed a bit subtle, but most of the comments that you are suggesting seem to be paraphrasing what the code is doing and I'm not sure that they would actually help someone who would try to get more familiar with this class. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] romseygeek merged pull request #11760: Fix IntervalBuilder.NO_INTERVALS docId when unpositioned
romseygeek merged PR #11760: URL: https://github.com/apache/lucene/pull/11760 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] romseygeek closed issue #11759: IntervalBuilder.NO_INTERVALS returns wrong docId when unpositioned
romseygeek closed issue #11759: IntervalBuilder.NO_INTERVALS returns wrong docId when unpositioned URL: https://github.com/apache/lucene/issues/11759 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mdmarshmallow commented on pull request #11729: LUCENE-11728: Improve code clarity for OrdinalMap
mdmarshmallow commented on PR #11729: URL: https://github.com/apache/lucene/pull/11729#issuecomment-1242374535 That's fair, I think I also was getting a little too specific with the comments in some places. I'll go through and see if I can make the comments more general than they currently are. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] navneet1v commented on pull request #11753: Added interface to relate a LatLonShape with another shape represented as Component2D
navneet1v commented on PR #11753: URL: https://github.com/apache/lucene/pull/11753#issuecomment-1242385721 > > Maybe we should make the ctor that takes a `BytesRef` public for such use-cases? > > We can either make the `LatLonShapeDocValues` and `XYShapeDocValues` ctor public, or add new factory methods to `LatLonShape` and `XYShape` for creating the DocValues instances like we do the fields? I like the latter for consistency. > > I also think it might be a good idea to add a new `public void resetBinaryValue(BytesRef binaryValue)` method to `ShapeDocValues`? This would enable us to reuse the same ShapeDocValue instance inside of a query by just resetting the backing data from the iterator values. > > I'm happy to do this but I think these would be a great contribution to help build your merit. I will take the approach of creating a new factory method in LatLonShape and XYShape. On the function: resetBinaryValue I will check what I can do here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost
gsmiller commented on PR #11741: URL: https://github.com/apache/lucene/pull/11741#issuecomment-1242553881 > I'm not sure if this is true. I've seen users run TermInSetQuerys with 10k terms or more, a typical use-case being implementing some form of join where a first query collects IDs of interest and a second query uses them as a filter. Right, that's fair. I'm curious about the cases you've seen though. The cost estimate for `TermInSetQuery` is pretty perfectly suited for this use-case right now (assuming the key uniquely identifies a document). Putting cost estimation aside though, I'd be curious when a doc-values approach would be more efficient here. Looking up the 10k terms has to be done for both query approaches, so I'd only expect the doc-values approach to be more efficient if there's a lead iterator with relatively few documents (relative to the number of unique terms in the join). Is that the sort of case you have in mind? > Terms dictionary lookups tend to be expensive, so looking up these 10k terms is not cheap, and if IndexOrDocValuesQuery decides that the doc-values approach is better because of costs, then Lucene will perform these 10k lookups again in the terms dictionary of the doc values field, which would be wasteful? +1. I think we're saying roughly the same thing. The problematic case with doing term lookup as part of cost estimation is duplicating that work when deciding to use doc-values. It's too bad we have to effectively do the same work twice in this case. It would be super cool if we could find a way to reuse the term lookup work, but that's obviously very non-trivial since the term dictionaries are completely different implementations, types of fields, etc. > Not directly related to this discussion, but TermInSetQuery feels more complicated than ranges when it comes to figuring out the best data structure to run the query, and maybe we should fold the logic about whether or not to use doc-values directly into TermInSetQuery instead of expecting users to build an IndexOrDocValuesQuery themselves? This would allow more sophisticated logic, such as making different decisions depending on whether we expect terms to be selective or not? I like this idea a lot actually. I can imagine a fairly different type of heuristic we might want to use as compared to numeric ranges. For instance, the number of terms matters. Whether-or-not we can identify that the field is a unique key matters. Hmmm... seems worth experimenting with a bit. Thanks for the idea! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mdmarshmallow opened a new issue, #11761: Expand TieredMergePolicy deletePctAllowed limits
mdmarshmallow opened a new issue, #11761: URL: https://github.com/apache/lucene/issues/11761 ### Description I'm an engineer at Amazon Search and we have been experimenting with more aggressively getting rid of deleted documents. We use TieredMergePolicy and we would like to set `TieredMergePolicy#deletesPctAllowed` to be [lower than the current limit of 20%](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L157). I was wondering why this limit was set in place. I'm sure I could be missing some context here. Maybe we could keep the limits in place but allow users to explicitly remove the checks? Any information would be much appreciated, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org