[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1420793280 > 100%. The issue here is that `TermInSetQuery` gets rewritten to a `BooleanQuery` because there are fewer than 16 terms, so it doesn't have a chance to "self-optimize" to use doc values. We can fix this by not eagerly rewriting to a `BooleanQuery`, but I held off doing that for now. So this is "easily" fixable I think. Well right now, I'm not seeing any justification to mix up concerns and pile all into one query. Currently overall, IndexOrDocValues has the best performance: it is slightly slower in one case but massively faster in that case. Also there are some problems with the benchmark, i tried to wrestle with it but there is some serious noise/gc/something going on. if i just rearrange order of tests numbers change dramatically. I am still willing to optimize the one case where IndexOrDocValues might do better, but I honestly don't feel I need to "play defense" any more. I think we should use IndexOrDocValues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1420797458 I spent mine time her just to prevent that code is duplicated and prevent a mess, nothing more. I got no skin in the game and could care less about these stupid abusive "joins" that people do. But let's please look at the current situation and say "we are gonna keep it clean" so I don't have to stay up so late anymore to prevent code mess. There is no advantage to piling all this shit in one query. breathe in, breathe out, let it go. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12054: Introduce a new `KeywordField`.
rmuir commented on PR #12054: URL: https://github.com/apache/lucene/pull/12054#issuecomment-1420805369 > * added a `newSetQuery` that creates a `TermInSetQuery` and hopefully soon benefits from @gsmiller 's optimization You can make it `new IndexOrDocValuesQuery(new TermInSetQuery, SortedSetDocValuesField.newSlowSetQuery())` right now and it performs better than what is on that PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available
gsmiller commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1421121653 @rmuir thanks for the feedback and spending time having a look. I'm going to try summarizing where we've landed to make sure we're on the same page. I think we both agree on the following, but please let me know if I'm missing anything? 1. Ideally we want to leverage `IndexOrDocValues` instead of putting custom solutions into queries themselves (e.g., we'd prefer not to have query implementations decide between postings / doc values). This decoupling allows more reuse and separate innovations in the separate queries (e.g., we can independently optimize `TermInSetQuery` and `DocValuesTermsQuery`). This also provides more reuse across various use-cases, as opposed to lots of different query implementations having to re-invent this postings vs. doc values logic. 2. The only inputs `IndexOrDocValues` has for for making a decision are, 1) "lead cost" and 2) the stated "cost" of the wrapper query `ScoreSupplier`s. `TermInSetQuery` may significantly over-estimate cost in its currently implementation as it guarantees a cost ceiling, only looking at field-level statistics. 3. "Primary key" type cases work well with `IndexOrDocValues` since `TermInSetQuery` cost is able to recognize this situation and provide a better cost estimate. 4. The case where `IndexOrDocValues` doesn't do a great job is when the indexed terms cover many documents in general, but the terms used in the specific query cover few. Given this, I tend to agree that moving away from `IndexOrDocValues` to only solve one type of scenario (mentioned in `#4` above), probably isn't the right decision. To move forward, I'm going to explore other ways to solve this case while relying on the `IndexOrDocValues` abstraction. I can think of a couple ways to approach the problem: 1. Alter the way we compute `TermInSetQuery#cost`. Instead of providing a ceiling, we could look at some sort of "average" cost (e.g., determine average number of docs per term and multiply that out by the number of terms in the query). We could also term-seek up to some fixed number of terms up-front to get more information. This obviously ads cost but may be worth it 2. While significantly more complex, if it's not sufficient to use field-level statistics and a single cost value, I may further explore the idea of a "cost iterator" in `ScoreSupplier`. This still feels too complex to me and I don't really want to go in this direction. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12054: Introduce a new `KeywordField`.
gsmiller commented on PR #12054: URL: https://github.com/apache/lucene/pull/12054#issuecomment-1421143189 > You can make it new IndexOrDocValuesQuery(new TermInSetQuery, SortedSetDocValuesField.newSlowSetQuery()) right now and it performs better than what is on that PR. +1 to using `IndexOrDocValues` for now. Given the feedback in #12089, I'm going to see if I can come up with a way to help `IndexOrDocValuesQuery` make a better decision between postings/doc values for the case it currently doesn't handle well, as opposed to changing the guts of `TermInSetQuery`. I'll benchmark a couple different ideas for that and post a separate PR with what I'm able to come up with. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #12054: Introduce a new `KeywordField`.
jpountz merged PR #12054: URL: https://github.com/apache/lucene/pull/12054 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller closed issue #11736: Promote DocValuesTermsQuery functionality from sandbox module
gsmiller closed issue #11736: Promote DocValuesTermsQuery functionality from sandbox module URL: https://github.com/apache/lucene/issues/11736 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on issue #11736: Promote DocValuesTermsQuery functionality from sandbox module
gsmiller commented on issue #11736: URL: https://github.com/apache/lucene/issues/11736#issuecomment-1421239346 This was done as part of #12129. Resolving. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on issue #11740: Can we improve cost estimation in TermInSetQuery's ScoreSupplier?
gsmiller commented on issue #11740: URL: https://github.com/apache/lucene/issues/11740#issuecomment-1421245215 As a different approach, the idea of a "self-optimizing" `TermInSetQuery` as explored in #12089, working around the problem of trying to provide an up-front cost estimation to be used by `IndexOrDocValues`. There's some history/context there that's relevant to this idea. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent merged pull request #12050: Reuse HNSW graph for intialization during merge
benwtrent merged PR #12050: URL: https://github.com/apache/lucene/pull/12050 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent closed issue #11354: Reuse HNSW graphs when merging segments? [LUCENE-10318]
benwtrent closed issue #11354: Reuse HNSW graphs when merging segments? [LUCENE-10318] URL: https://github.com/apache/lucene/issues/11354 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #12050: Reuse HNSW graph for intialization during merge
benwtrent commented on PR #12050: URL: https://github.com/apache/lucene/pull/12050#issuecomment-1421381273 @jmazanec15 merged and I backported to branch_9x (some minor changes for java version stuff around switch statements). Good stuff! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jmazanec15 commented on pull request #12050: Reuse HNSW graph for intialization during merge
jmazanec15 commented on PR #12050: URL: https://github.com/apache/lucene/pull/12050#issuecomment-1421527701 Thanks @benwtrent! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #12126: Refactor part of IndexFileDeleter and ReplicaFileDeleter into a common utility class
zhaih commented on code in PR #12126: URL: https://github.com/apache/lucene/pull/12126#discussion_r1099470288 ## lucene/replicator/src/java/org/apache/lucene/replicator/nrt/CopyJob.java: ## @@ -206,7 +206,7 @@ private synchronized void _transferAndCancel(CopyJob prevJob) throws IOException if (Node.VERBOSE_FILES) { dest.message("remove partial file " + prevJob.current.tmpName); } - dest.deleter.deleteNewFile(prevJob.current.tmpName); + dest.deleter.deleteIfNoRef(prevJob.current.tmpName); Review Comment: Ah I actually want to do it the opposite way, I'm not 100% sure why we need a `deleteNewFile` (force delete) here rather than `deleteIfNoRef` but I don't want to introduce a (possibly) different behavior in this change so I kept it. Altho the `deleteNewFile` is a bit misleading so I changed the name to `forceDeleteFile` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] AKafakA commented on issue #11862: Concurrent rewrite for KnnVectorQuery
AKafakA commented on issue #11862: URL: https://github.com/apache/lucene/issues/11862#issuecomment-1421715239 Hey, here is Wei from Linkedin. I am interesting on this issue and will try to work on it. Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org