[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-07 Thread via GitHub


rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1420793280

   > 100%. The issue here is that `TermInSetQuery` gets rewritten to a 
`BooleanQuery` because there are fewer than 16 terms, so it doesn't have a 
chance to "self-optimize" to use doc values. We can fix this by not eagerly 
rewriting to a `BooleanQuery`, but I held off doing that for now. So this is 
"easily" fixable I think.
   
   Well right now, I'm not seeing any justification to mix up concerns and pile 
all into one query. Currently overall, IndexOrDocValues has the best 
performance: it is slightly slower in one case but massively faster in that 
case. 
   
   Also there are some problems with the benchmark, i tried to wrestle with it 
but there is some serious noise/gc/something going on. if i just rearrange 
order of tests numbers  change dramatically. 
   
   I am still willing to optimize the one case where IndexOrDocValues might do 
better, but I honestly don't feel I need to "play defense" any more. I think we 
should use IndexOrDocValues.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-07 Thread via GitHub


rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1420797458

   I spent mine time her just to prevent that code is duplicated and prevent a 
mess, nothing more. I got no skin in the game and could care less about these 
stupid abusive "joins" that people do.
   
   But let's please look at the current situation and say "we are gonna keep it 
clean" so I don't have to stay up so late anymore to prevent code mess. There 
is no advantage to piling all this shit in one query. breathe in, breathe out, 
let it go.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12054: Introduce a new `KeywordField`.

2023-02-07 Thread via GitHub


rmuir commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1420805369

   > * added a `newSetQuery` that creates a `TermInSetQuery` and hopefully 
soon benefits from @gsmiller 's optimization
   
   You can make it `new IndexOrDocValuesQuery(new TermInSetQuery, 
SortedSetDocValuesField.newSlowSetQuery())` right now and it performs better 
than what is on that PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-07 Thread via GitHub


gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1421121653

   @rmuir thanks for the feedback and spending time having a look. I'm going to 
try summarizing where we've landed to make sure we're on the same page. I think 
we both agree on the following, but please let me know if I'm missing anything?
   1. Ideally we want to leverage `IndexOrDocValues` instead of putting custom 
solutions into queries themselves (e.g., we'd prefer not to have query 
implementations decide between postings / doc values). This decoupling allows 
more reuse and separate innovations in the separate queries (e.g., we can 
independently optimize `TermInSetQuery` and `DocValuesTermsQuery`). This also 
provides more reuse across various use-cases, as opposed to lots of different 
query implementations having to re-invent this postings vs. doc values logic.
   2. The only inputs `IndexOrDocValues` has for for making a decision are, 1) 
"lead cost" and 2) the stated "cost" of the wrapper query `ScoreSupplier`s. 
`TermInSetQuery` may significantly over-estimate cost in its currently 
implementation as it guarantees a cost ceiling, only looking at field-level 
statistics.
   3. "Primary key" type cases work well with `IndexOrDocValues` since 
`TermInSetQuery` cost is able to recognize this situation and provide a better 
cost estimate.
   4. The case where `IndexOrDocValues` doesn't do a great job is when the 
indexed terms cover many documents in general, but the terms used in the 
specific query cover few.
   
   Given this, I tend to agree that moving away from `IndexOrDocValues` to only 
solve one type of scenario (mentioned in `#4` above), probably isn't the right 
decision. To move forward, I'm going to explore other ways to solve this case 
while relying on the `IndexOrDocValues` abstraction. I can think of a couple 
ways to approach the problem:
   1. Alter the way we compute `TermInSetQuery#cost`. Instead of providing a 
ceiling, we could look at some sort of "average" cost (e.g., determine average 
number of docs per term and multiply that out by the number of terms in the 
query). We could also term-seek up to some fixed number of terms up-front to 
get more information. This obviously ads cost but may be worth it
   2. While significantly more complex, if it's not sufficient to use 
field-level statistics and a single cost value, I may further explore the idea 
of a "cost iterator" in `ScoreSupplier`. This still feels too complex to me and 
I don't really want to go in this direction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12054: Introduce a new `KeywordField`.

2023-02-07 Thread via GitHub


gsmiller commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1421143189

   > You can make it new IndexOrDocValuesQuery(new TermInSetQuery, 
SortedSetDocValuesField.newSlowSetQuery()) right now and it performs better 
than what is on that PR.
   
   +1 to using `IndexOrDocValues` for now. Given the feedback in #12089, I'm 
going to see if I can come up with a way to help `IndexOrDocValuesQuery` make a 
better decision between postings/doc values for the case it currently doesn't 
handle well, as opposed to changing the guts of `TermInSetQuery`. I'll 
benchmark a couple different ideas for that and post a separate PR with what 
I'm able to come up with. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #12054: Introduce a new `KeywordField`.

2023-02-07 Thread via GitHub


jpountz merged PR #12054:
URL: https://github.com/apache/lucene/pull/12054


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller closed issue #11736: Promote DocValuesTermsQuery functionality from sandbox module

2023-02-07 Thread via GitHub


gsmiller closed issue #11736: Promote DocValuesTermsQuery functionality from 
sandbox module
URL: https://github.com/apache/lucene/issues/11736


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on issue #11736: Promote DocValuesTermsQuery functionality from sandbox module

2023-02-07 Thread via GitHub


gsmiller commented on issue #11736:
URL: https://github.com/apache/lucene/issues/11736#issuecomment-1421239346

   This was done as part of #12129. Resolving.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on issue #11740: Can we improve cost estimation in TermInSetQuery's ScoreSupplier?

2023-02-07 Thread via GitHub


gsmiller commented on issue #11740:
URL: https://github.com/apache/lucene/issues/11740#issuecomment-1421245215

   As a different approach, the idea of a "self-optimizing" `TermInSetQuery` as 
explored in #12089, working around the problem of trying to provide an up-front 
cost estimation to be used by `IndexOrDocValues`. There's some history/context 
there that's relevant to this idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent merged pull request #12050: Reuse HNSW graph for intialization during merge

2023-02-07 Thread via GitHub


benwtrent merged PR #12050:
URL: https://github.com/apache/lucene/pull/12050


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent closed issue #11354: Reuse HNSW graphs when merging segments? [LUCENE-10318]

2023-02-07 Thread via GitHub


benwtrent closed issue #11354: Reuse HNSW graphs when merging segments? 
[LUCENE-10318]
URL: https://github.com/apache/lucene/issues/11354


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #12050: Reuse HNSW graph for intialization during merge

2023-02-07 Thread via GitHub


benwtrent commented on PR #12050:
URL: https://github.com/apache/lucene/pull/12050#issuecomment-1421381273

   @jmazanec15 merged and I backported to branch_9x (some minor changes for 
java version stuff around switch statements). 
   
   Good stuff!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on pull request #12050: Reuse HNSW graph for intialization during merge

2023-02-07 Thread via GitHub


jmazanec15 commented on PR #12050:
URL: https://github.com/apache/lucene/pull/12050#issuecomment-1421527701

   Thanks @benwtrent!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #12126: Refactor part of IndexFileDeleter and ReplicaFileDeleter into a common utility class

2023-02-07 Thread via GitHub


zhaih commented on code in PR #12126:
URL: https://github.com/apache/lucene/pull/12126#discussion_r1099470288


##
lucene/replicator/src/java/org/apache/lucene/replicator/nrt/CopyJob.java:
##
@@ -206,7 +206,7 @@ private synchronized void _transferAndCancel(CopyJob 
prevJob) throws IOException
   if (Node.VERBOSE_FILES) {
 dest.message("remove partial file " + prevJob.current.tmpName);
   }
-  dest.deleter.deleteNewFile(prevJob.current.tmpName);
+  dest.deleter.deleteIfNoRef(prevJob.current.tmpName);

Review Comment:
   Ah I actually want to do it the opposite way, I'm not 100% sure why we need 
a `deleteNewFile` (force delete) here rather than `deleteIfNoRef` but I don't 
want to introduce a (possibly) different behavior in this change so I kept it. 
   Altho the `deleteNewFile` is a bit misleading so I changed the name to 
`forceDeleteFile`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] AKafakA commented on issue #11862: Concurrent rewrite for KnnVectorQuery

2023-02-07 Thread via GitHub


AKafakA commented on issue #11862:
URL: https://github.com/apache/lucene/issues/11862#issuecomment-1421715239

   Hey, here is Wei from Linkedin. I am interesting on this issue and will try 
to work on it.
   
   Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org