Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2025-01-23 Thread via GitHub
sam-herman commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-2611333530 I am actually in the process of extending Lucene Codec for JVector DiskANN integration. Note this work is part of https://github.com/opensearch-project/k-NN/issues/2386 I can

Re: [I] UnsupportedOperationException instead of IllegalArgumentException from PointInSetQuery when values are out of order [lucene]

2025-01-23 Thread via GitHub
jhinch-at-atlassian-com commented on issue #14161: URL: https://github.com/apache/lucene/issues/14161#issuecomment-2611152321 Having `BytesRefBuilder#toString` delegate to its underlying buffer or calling `BytesRefBuilder#get` within `PointInSetQuery` both seem like reasonable options. --

Re: [PR] Add knn result consistency test [lucene]

2025-01-23 Thread via GitHub
mayya-sharipova commented on PR #14167: URL: https://github.com/apache/lucene/pull/14167#issuecomment-2610983018 @benwtrent Thanks for raising this, this indeed happens because of MultiLeafKnnCollector and search threads exchanging info of the globally collected results. Because it is not d

[I] Query parser support for wildcards in phrase queries [lucene]

2025-01-23 Thread via GitHub
aliciavargas opened a new issue, #14168: URL: https://github.com/apache/lucene/issues/14168 ### Description The Lucene docs specify that wildcard search is only supported for single terms but not phrases ([link](https://lucene.apache.org/core/8_10_0/queryparser/org/apache/lucene/quer

Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-01-23 Thread via GitHub
benwtrent commented on code in PR #14160: URL: https://github.com/apache/lucene/pull/14160#discussion_r1927632268 ## lucene/core/src/java/org/apache/lucene/util/hnsw/FilteredHnswGraphSearcher.java: ## @@ -0,0 +1,332 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-01-23 Thread via GitHub
benchaplin commented on code in PR #14160: URL: https://github.com/apache/lucene/pull/14160#discussion_r1927620854 ## lucene/core/src/java/org/apache/lucene/util/hnsw/FilteredHnswGraphSearcher.java: ## @@ -0,0 +1,332 @@ +/* + * Licensed to the Apache Software Foundation (ASF) un

Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-01-23 Thread via GitHub
benchaplin commented on code in PR #14160: URL: https://github.com/apache/lucene/pull/14160#discussion_r1927609309 ## lucene/core/src/java/org/apache/lucene/util/hnsw/FilteredHnswGraphSearcher.java: ## @@ -0,0 +1,332 @@ +/* + * Licensed to the Apache Software Foundation (ASF) un

Re: [PR] Add knn result consistency test [lucene]

2025-01-23 Thread via GitHub
benwtrent commented on PR #14167: URL: https://github.com/apache/lucene/pull/14167#issuecomment-2610476022 OK, if I change to never use `MultiLeafKnnCollector`, the multi-threaded consistency test passes. But when using that collector, it fails a couple of times over 10k+ repeats. -- Th

Re: [PR] Add UnwrappingReuseStrategy for AnalyzerWrapper [lucene]

2025-01-23 Thread via GitHub
mayya-sharipova commented on PR #14154: URL: https://github.com/apache/lucene/pull/14154#issuecomment-2610406769 @jpountz @benwtrent I've addressed your comments in the last commit, please continue to review -- This is an automated message from the Apache Git Service. To respond to the me

Re: [PR] Add knn result consistency test [lucene]

2025-01-23 Thread via GitHub
benwtrent commented on PR #14167: URL: https://github.com/apache/lucene/pull/14167#issuecomment-2610393806 OK, I cleaned it all up, and have two separate tests, one multi-threaded and one single-threaded. The multi-threaded one is the only one that fails periodically, which expla

Re: [I] Remove the @Deprecated methods from TopScoreDocCollector and TopFieldCollector [lucene]

2025-01-23 Thread via GitHub
javanna commented on issue #13499: URL: https://github.com/apache/lucene/issues/13499#issuecomment-2610294122 @parastooGit you need to create collector managers instead of collectors. There is no static create method any longer; you need to create the collector managers using their constru

Re: [I] UnsupportedOperationException instead of IllegalArgumentException from PointInSetQuery when values are out of order [lucene]

2025-01-23 Thread via GitHub
gsmiller commented on issue #14161: URL: https://github.com/apache/lucene/issues/14161#issuecomment-2610240868 Oh gross. Good catch! It seems like the desire in this exception message is to print out the `previous` bytes ref in the same way as `current`. I wonder if we should implement `Byt

Re: [PR] Add knn result consistency test [lucene]

2025-01-23 Thread via GitHub
benwtrent commented on PR #14167: URL: https://github.com/apache/lucene/pull/14167#issuecomment-2610150959 > I think our comments relate to the observation that the test does not reproducibly fail with the same seed 🤦 for sure. Let me see if I can shore it up.

Re: [PR] Add knn result consistency test [lucene]

2025-01-23 Thread via GitHub
msokolov commented on PR #14167: URL: https://github.com/apache/lucene/pull/14167#issuecomment-2610145295 I think our comments relate to the observation that the test does not reproducibly fail with the same seed

Re: [PR] Add knn result consistency test [lucene]

2025-01-23 Thread via GitHub
benwtrent commented on PR #14167: URL: https://github.com/apache/lucene/pull/14167#issuecomment-2610139152 @msokolov @mikemccand maybe the consistency I am testing isn't clear. First: Index a bunch of vectors Second: do a single query on a static index to get the top-k Repeat-N:

Re: [PR] Add knn result consistency test [lucene]

2025-01-23 Thread via GitHub
msokolov commented on PR #14167: URL: https://github.com/apache/lucene/pull/14167#issuecomment-2610126329 As for the reproducibility problem, that may be caused by concurrent HNSW merging, which is nondeterministic.

Re: [PR] Add knn result consistency test [lucene]

2025-01-23 Thread via GitHub
mikemccand commented on PR #14167: URL: https://github.com/apache/lucene/pull/14167#issuecomment-2610089955 > Frustratingly, the seeded failures do not seem to be repeatable. Hmm that is bad ... it means there is a test bug or test infra bug (separate from the scary bug this test is

[PR] Add knn result consistency test [lucene]

2025-01-23 Thread via GitHub
benwtrent opened a new pull request, #14167: URL: https://github.com/apache/lucene/pull/14167 Inspired by some weird behavior I have seen, adding a consistency test. I found that indeed, this fails over some seeds. Frustratingly, the seeded failures do not seem to be repeatable

Re: [I] Remove the @Deprecated methods from TopScoreDocCollector and TopFieldCollector [lucene]

2025-01-23 Thread via GitHub
parastooGit commented on issue #13499: URL: https://github.com/apache/lucene/issues/13499#issuecomment-2610054733 What is the replacement for the create method?

Re: [PR] Add WrappingReuseStrategy for AnalyzerWrapper [lucene]

2025-01-23 Thread via GitHub
mayya-sharipova commented on PR #14154: URL: https://github.com/apache/lucene/pull/14154#issuecomment-2610017660 @benwtrent Thanks for the review, I am not happy with the design either, will see how I can incorporate your feedback. > I don't like that CompletionAnalyzer needs to trac

[PR] Remove `maxMergeAtOnce` option from `TieredMergePolicy`. [lucene]

2025-01-23 Thread via GitHub
jpountz opened a new pull request, #14165: URL: https://github.com/apache/lucene/pull/14165 `maxMergeAtOnce` increases merge amplification by running multiple merges when it could run a single merge, without giving significant benefits in exchange. We removed this parameter for forced merge

Re: [PR] Not maintain docBufferUpTo when only docs needed [lucene]

2025-01-23 Thread via GitHub
jpountz commented on code in PR #14164: URL: https://github.com/apache/lucene/pull/14164#discussion_r1926977463 ## lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java: ## @@ -388,6 +388,7 @@ private enum DeltaEncoding { final boolean needsOf

Re: [PR] Add small bias towards bit set encoding. [lucene]

2025-01-23 Thread via GitHub
jpountz merged PR #14155: URL: https://github.com/apache/lucene/pull/14155

Re: [PR] supports force merge based on specified segments. [lucene]

2025-01-23 Thread via GitHub
jpountz commented on PR #14163: URL: https://github.com/apache/lucene/pull/14163#issuecomment-2609834797 I don't think we should merge this change, but it's good that you were able to use it to confirm that merging would reclaim these deleted docs. Can you add your data about this iss

Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2025-01-23 Thread via GitHub
RS146BIJAY commented on issue #13387: URL: https://github.com/apache/lucene/issues/13387#issuecomment-2609475069 Makes sense. I think we can extend MultiReader functionality to use it as a combined view if we can support a couple of read-side features of IndexWriter, like opening a reader from

[PR] Not maintain docBufferUpTo when only docs needed [lucene]

2025-01-23 Thread via GitHub
gf2121 opened a new pull request, #14164: URL: https://github.com/apache/lucene/pull/14164 The `docBufferUpTo` variable is mainly maintained to obtain the corresponding value in the freq/pos buffers. We can avoid maintaining it when only docs are needed. Result on `wikimediumall`: ```