Re: [I] Figure out why hunspell tests occasionally fail and make them more consistent [lucene]

2025-02-13 Thread via GitHub
rmuir commented on issue #14235: URL: https://github.com/apache/lucene/issues/14235#issuecomment-2657956964 This one looks to me like another dictionary bug. Unfortunately the current options we have to "tolerate" such bugs don't work in this case, but perhaps they can be improved. T

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-13 Thread via GitHub
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1955595482 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,457 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-13 Thread via GitHub
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1955603323 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsReader.java: ## @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundatio

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-13 Thread via GitHub
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1955600494 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,457 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde

Re: [I] Figure out why hunspell tests occasionally fail and make them more consistent [lucene]

2025-02-13 Thread via GitHub
rmuir commented on issue #14235: URL: https://github.com/apache/lucene/issues/14235#issuecomment-2658111400 The last one was like this, too: https://github.com/apache/lucene/pull/14079 I think people just often have trouble counting and that's why we see errors around the counts, even wit

[PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-13 Thread via GitHub
houserjohn opened a new pull request, #14238: URL: https://github.com/apache/lucene/pull/14238 ### Description Adds additional unit tests to increase coverage of Dynamic Range Faceting. - Adds tests for varying TopN values - Adds test for inputs with the same weights - Adds te

Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-02-13 Thread via GitHub
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2658082076 Comparison of VectorAPI(Baseline) and InnerLoop(Candidate) ``` Task QPS baseline StdDev QPS my_modified_version StdDev Pct diff

Re: [PR] hunspell: improve tolerateAffixRuleCountMismatches() for common problems [lucene]

2025-02-13 Thread via GitHub
rmuir commented on PR #14239: URL: https://github.com/apache/lucene/pull/14239#issuecomment-2658093518 Using `mark()/reset()` like this can be an invitation for trouble, but the situation is contained: the parser always makes forward progress, so it can't go crazy or loop infinitely. Also, the
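A minimal sketch of the `mark()/reset()` peek pattern mentioned in the comment above, using only `java.io.BufferedReader`. It is not the code from PR #14239: the class `PeekLineSketch`, its methods, the `MAX_LINE_CHARS` limit, and the sample affix lines are all hypothetical, and the sketch assumes the parser peeks at the next line to decide whether it still belongs to the current rule group. The loop either consumes a line or stops, which is the forward-progress property the comment relies on.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not taken from the Lucene hunspell parser.
final class PeekLineSketch {
  // Assumed upper bound on line length; reset() throws if a line exceeds it.
  static final int MAX_LINE_CHARS = 8192;

  // Returns the next line without consuming it (null at end of input).
  static String peekLine(BufferedReader in) throws IOException {
    in.mark(MAX_LINE_CHARS);      // remember the current position
    String line = in.readLine();  // read ahead
    in.reset();                   // rewind to the remembered position
    return line;
  }

  // Counts consecutive lines starting with prefix and leaves the first
  // non-matching line unread for whoever parses next.
  static int countRuleLines(BufferedReader in, String prefix) throws IOException {
    int count = 0;
    for (String next = peekLine(in);
        next != null && next.startsWith(prefix);
        next = peekLine(in)) {
      in.readLine(); // consume the peeked line: one step forward per iteration
      count++;
    }
    return count;
  }

  // Convenience wrapper: returns "<count>|<first unread line>".
  static String demo(String text, String prefix) {
    try (BufferedReader in = new BufferedReader(new StringReader(text))) {
      int count = countRuleLines(in, prefix);
      return count + "|" + in.readLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Counting the rule lines that are actually present, rather than trusting a declared count, is one way such a parser could tolerate a mismatched header; whether the PR does this is an assumption here.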

[I] TestByteVectorSimilaryQuery failure on windows [lucene]

2025-02-13 Thread via GitHub
rmuir opened a new issue, #14230: URL: https://github.com/apache/lucene/issues/14230 ### Description From CI when pushing: ``` TestByteVectorSimilarityQuery > testFallbackToExact FAILED junit.framework.AssertionFailedError: Expected exception UnsupportedOperationException

Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-02-13 Thread via GitHub
tteofili commented on code in PR #14094: URL: https://github.com/apache/lucene/pull/14094#discussion_r1954405953 ## lucene/core/src/java/org/apache/lucene/search/HnswKnnCollector.java: ## @@ -0,0 +1,24 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-02-13 Thread via GitHub
tteofili commented on PR #14094: URL: https://github.com/apache/lucene/pull/14094#issuecomment-2656443339 updated results (Cohere 768 200k docs) baseline ``` recall latency (ms) nDoc topK fanout maxConn beamWidth quantized visited index s index docs/s num segments

Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-02-13 Thread via GitHub
tteofili commented on PR #14094: URL: https://github.com/apache/lucene/pull/14094#issuecomment-2656446320 reference [paper](https://cs.uwaterloo.ca/~jimmylin/publications/Teofili_Lin_ECIR2025.pdf) -- This is an automated message from the Apache Git Service. To respond to the message, plea

Re: [PR] Fix failure found by TestOperations.testGetRandomAcceptedString [lucene]

2025-02-13 Thread via GitHub
rmuir merged PR #14227: URL: https://github.com/apache/lucene/pull/14227

Re: [I] TestOperations.testGetRandomAcceptedString failing [lucene]

2025-02-13 Thread via GitHub
rmuir closed issue #14224: TestOperations.testGetRandomAcceptedString failing URL: https://github.com/apache/lucene/issues/14224

Re: [PR] Support DataInput as source for StoredField [lucene]

2025-02-13 Thread via GitHub
iverase commented on PR #14213: URL: https://github.com/apache/lucene/pull/14213#issuecomment-2656318509 @Tim-Brooks Could you add an entry in CHANGES.txt? It should be under the 10.2 version, thanks!

Re: [I] TestByteVectorSimilaryQuery failure on windows [lucene]

2025-02-13 Thread via GitHub
benwtrent commented on issue #14230: URL: https://github.com/apache/lucene/issues/14230#issuecomment-2656795520 Running with thousands of seeds, it will fail eventually on linux/macbook as well.

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-13 Thread via GitHub
stefanvodita commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1954612647 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[PR] Remove some randomness from flaky BaseVectorSimilarityQueryTestCase#testFallbackToExact [lucene]

2025-02-13 Thread via GitHub
benwtrent opened a new pull request, #14231: URL: https://github.com/apache/lucene/pull/14231 Periodically, the similarity requested according to the desired matched docs actually doesn't explore enough docs to fall back to exact. Since the purpose of this test is to verify that falli

Re: [PR] Fixed a flaky test TestKnnFloatVectorQuery.testFindFewer [lucene]

2025-02-13 Thread via GitHub
benwtrent merged PR #14223: URL: https://github.com/apache/lucene/pull/14223

[I] Evaluate bumping the minimum compile version. [lucene]

2025-02-13 Thread via GitHub
ChrisHegarty opened a new issue, #14229: URL: https://github.com/apache/lucene/issues/14229 This issue has been filed to help facilitate and track a discussion relating to bumping the minimum compile Java version. Currently the minimum compile version is Java 21, for both active deve

Re: [PR] Integrating GPU based Vector Search using cuVS [lucene]

2025-02-13 Thread via GitHub
ChrisHegarty commented on PR #14131: URL: https://github.com/apache/lucene/pull/14131#issuecomment-2656061680 > I think bumping main only for each non LTS release would be cool. Then we keep it at the next LTS (Java 25)? I filed the following issue to help facilitate the discussion re

Re: [PR] Clean up more dead states creation by Automata/Operations [lucene]

2025-02-13 Thread via GitHub
rmuir merged PR #14218: URL: https://github.com/apache/lucene/pull/14218

Re: [PR] Enable error-prone checks for NonFinalStaticField [lucene]

2025-02-13 Thread via GitHub
rmuir commented on PR #14228: URL: https://github.com/apache/lucene/pull/14228#issuecomment-2656373080 The hunspell test failure is likely an upstream issue with LibreOffice dictionaries. It happened recently to me with Mongolian dictionaries. Problems: * the test doesn't execute e

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-13 Thread via GitHub
msokolov commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2657365313 You can tune it by changing the magic number 3 to a bigger number. I found that with 15 I get slightly better recall and slightly lower latencies than the baseline for my test case.

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-13 Thread via GitHub
benwtrent commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2657364706 Ah, one other thought is that we are definitely scoring every seeded entry point twice. Once when they are gathered during initial query phase, then later through the seeded provisionin

Re: [I] Estimate memory usage for merges [lucene]

2025-02-13 Thread via GitHub
jpountz commented on issue #14225: URL: https://github.com/apache/lucene/issues/14225#issuecomment-2657354802 For what it's worth, I'm not a fan of throttling merges based on memory usage. Merge throttling is already complicated the way it is, so I'm not too excited about adding more constr

[PR] Fix TestForTooMuchCloning flakiness [lucene]

2025-02-13 Thread via GitHub
benwtrent opened a new pull request, #14232: URL: https://github.com/apache/lucene/pull/14232 Ever since: https://github.com/apache/lucene/pull/14165 This test has been flaky. It fails as the number of clone calls during indexing exceeds 500. I tried only updating the merge sc

Re: [I] org.apache.lucene.search.TestKnnFloatVectorQuery.testFindFewer ComparisonFailure: expected: but was: [lucene]

2025-02-13 Thread via GitHub
benwtrent closed issue #14175: org.apache.lucene.search.TestKnnFloatVectorQuery.testFindFewer ComparisonFailure: expected: but was: URL: https://github.com/apache/lucene/issues/14175

Re: [PR] Fix TestForTooMuchCloning flakiness [lucene]

2025-02-13 Thread via GitHub
benwtrent commented on PR #14232: URL: https://github.com/apache/lucene/pull/14232#issuecomment-2657320964 @jpountz I tried that while reverting my other changes and it was still over 500 over many runs. I will just merge what I got. Thanks!

Re: [PR] Fix TestForTooMuchCloning flakiness [lucene]

2025-02-13 Thread via GitHub
benwtrent merged PR #14232: URL: https://github.com/apache/lucene/pull/14232

Re: [I] TestForTooMuchCloning.test fails [lucene]

2025-02-13 Thread via GitHub
benwtrent closed issue #14220: TestForTooMuchCloning.test fails URL: https://github.com/apache/lucene/issues/14220

Re: [I] Evaluate bumping the minimum compile Java version [lucene]

2025-02-13 Thread via GitHub
jpountz commented on issue #14229: URL: https://github.com/apache/lucene/issues/14229#issuecomment-2657325084 This sounds like a safe bet.

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-02-13 Thread via GitHub
Vikasht34 commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2657428175 Interesting, let me quickly run those tests myself as well to see what the impact would be! Thanks for the logs.

Re: [I] Figure out why hunspell tests occasionally fail and make them more consistent [lucene]

2025-02-13 Thread via GitHub
rmuir commented on issue #14235: URL: https://github.com/apache/lucene/issues/14235#issuecomment-2657783923 @dweiss didn't mean for it to come as a complaint, i honestly have no ideas how to improve it. personally i would LIKE to see the failures on upstream dictionary updates and ge

[PR] Move `CombinedFieldQuery` to the core module. [lucene]

2025-02-13 Thread via GitHub
jpountz opened a new pull request, #14236: URL: https://github.com/apache/lucene/pull/14236 `CombinedFieldQuery` is Lucene's most robust way of scoring across multiple fields, let's move it to core and recommend using it to query multiple fields. While moving the class, I modified the

Re: [PR] Enable error-prone checks for NonFinalStaticField [lucene]

2025-02-13 Thread via GitHub
dweiss commented on PR #14228: URL: https://github.com/apache/lucene/pull/14228#issuecomment-2657770674 I'll create a separate issue for hunspell tests and take care of that tomorrow, no worries.

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-13 Thread via GitHub
benwtrent commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2657612038 I am not 100% sure what's up with the behavior. However, I switched to `16` (also happens to be the graph conn) instead of `3`. It's interesting how visited is lower, but recall is

Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-02-13 Thread via GitHub
benwtrent merged PR #14160: URL: https://github.com/apache/lucene/pull/14160

Re: [I] Look into ACORN-1, or another algorithm to aid in filtered HNSW search [lucene]

2025-02-13 Thread via GitHub
benwtrent closed issue #13940: Look into ACORN-1, or another algorithm to aid in filtered HNSW search URL: https://github.com/apache/lucene/issues/13940

Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-02-13 Thread via GitHub
jpountz commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2657293043 These results look even better than the results that you had previously reported for the vector API, is my understanding correct that it performs even better?

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-13 Thread via GitHub
jpountz commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1954925602 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-13 Thread via GitHub
msokolov commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2657373529 I think maybe what happens here is that the K controls not only how many hits are returned from each segment, but also what the "beam width" is during search, so we could have gotten be

Re: [PR] Upgrade errorprone to 2.36.0 [lucene]

2025-02-13 Thread via GitHub
risdenk commented on PR #14216: URL: https://github.com/apache/lucene/pull/14216#issuecomment-2657288304 thanks for merging @dweiss

[PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]

2025-02-13 Thread via GitHub
epotyom opened a new pull request, #14237: URL: https://github.com/apache/lucene/pull/14237 In the initial sandbox facet module PR @gsmiller [suggested](https://github.com/apache/lucene/pull/13568#issuecomment-2249005915) adding helpers to make common tasks easier. This implementatio

Re: [I] Figure out why hunspell tests occasionally fail and make them more consistent [lucene]

2025-02-13 Thread via GitHub
rmuir commented on issue #14235: URL: https://github.com/apache/lucene/issues/14235#issuecomment-2657795678 I think to fix it, we have to look at `checkoutHunspellRegressionRepos()`. It currently clones the default branch; I think we'd just want to pin to a hash for now. We could jus
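A minimal sketch of what "pin to a hash" could look like, assuming the checkout is done with plain `git` commands. It does not show how `checkoutHunspellRegressionRepos()` is implemented: the class `PinnedCheckoutSketch`, its `commands` method, the target directory name, and the example hash in the usage note are all hypothetical. The sketch only builds the argument lists; running them (for example with `ProcessBuilder`) is left out.

```java
import java.util.List;

// Hypothetical illustration only; not the build logic used by Lucene.
final class PinnedCheckoutSketch {

  // Returns the git invocations for checking out repoUrl at exactly sha,
  // instead of whatever the default branch currently points to.
  static List<List<String>> commands(String repoUrl, String targetDir, String sha) {
    // Require a full 40-character hex hash so a branch name such as "main"
    // cannot slip in and make the checkout a moving target again.
    if (!sha.matches("[0-9a-f]{40}")) {
      throw new IllegalArgumentException("expected a full commit hash, got: " + sha);
    }
    return List.of(
        // Clone without populating the working tree yet.
        List.of("git", "clone", "--no-checkout", repoUrl, targetDir),
        // Check out the pinned commit as a detached HEAD.
        List.of("git", "-C", targetDir, "checkout", "--detach", sha));
  }
}
```

For example, `PinnedCheckoutSketch.commands("https://github.com/LibreOffice/dictionaries", "dictionaries", "0123456789abcdef0123456789abcdef01234567")` yields a clone command followed by a detached checkout of the placeholder hash. With a pin in place, upstream dictionary changes would surface only when someone bumps the hash deliberately.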

Re: [PR] Fix TestForTooMuchCloning flakiness [lucene]

2025-02-13 Thread via GitHub
jpountz commented on PR #14232: URL: https://github.com/apache/lucene/pull/14232#issuecomment-2657728449 Thanks @benwtrent

Re: [I] TestForTooMuchCloning.test fails [lucene]

2025-02-13 Thread via GitHub
benwtrent commented on issue #14220: URL: https://github.com/apache/lucene/issues/14220#issuecomment-2657135532 This is failing pretty often. I tried upping the clone limit to 600, but it still fails periodically with more than 600 merges (604 in a local run). @jpountz what do you thi

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-13 Thread via GitHub
jpountz commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1954905919 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Remove some randomness from flaky BaseVectorSimilarityQueryTestCase#testFallbackToExact [lucene]

2025-02-13 Thread via GitHub
benwtrent merged PR #14231: URL: https://github.com/apache/lucene/pull/14231

Re: [I] TestByteVectorSimilaryQuery failure on windows [lucene]

2025-02-13 Thread via GitHub
benwtrent closed issue #14230: TestByteVectorSimilaryQuery failure on windows URL: https://github.com/apache/lucene/issues/14230

Re: [PR] Fixed a flaky test TestKnnFloatVectorQuery.testFindFewer [lucene]

2025-02-13 Thread via GitHub
navneet1v commented on PR #14223: URL: https://github.com/apache/lucene/pull/14223#issuecomment-2657079232 Thanks @benwtrent for approval and merging the code

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-13 Thread via GitHub
msokolov commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2657368730 also - we are not really tracking "visited" properly I think

[I] Hnsw format testRecall failing [lucene]

2025-02-13 Thread via GitHub
benwtrent opened a new issue, #14233: URL: https://github.com/apache/lucene/issues/14233 ### Description After a6a96cde1c6 Bugfix/fix hnsw search termination check (#14215) HNSW format recall tests started failing. Need to investigate. ``` TestLucene94HnswVectorsFormat > tes

[PR] Remove duplicates from the hnsw recall testing [lucene]

2025-02-13 Thread via GitHub
benwtrent opened a new pull request, #14234: URL: https://github.com/apache/lucene/pull/14234 We had many duplicates within the hnsw recall test index. This tripped over our duplicate score change where we don't explore further unless scores are strictly better: https://github.com/apache/l

Re: [PR] Fix TestForTooMuchCloning flakiness [lucene]

2025-02-13 Thread via GitHub
jpountz commented on PR #14232: URL: https://github.com/apache/lucene/pull/14232#issuecomment-2657233927 Thanks for looking into this and sorry for missing the build failures. The fact that this test has failures makes sense to me since merging is a bit more aggressive now, though I don't e

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-13 Thread via GitHub
msokolov commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2657317490 Well I ran some tests, and surprisingly, I saw a significant difference in both recall and latency (decreases in both). This surprised me: I expected to see more-or-less similar results,

Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-02-13 Thread via GitHub
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2657334178 > is my understanding correct that it performs even better? Yeah!

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-13 Thread via GitHub
benwtrent commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2657348683 OK, I ran it on 8M data set with 128 segments. Indeed it visits way fewer vectors (seemingly), and is consistent across multiple threads. ``` recall latency(ms)

Re: [I] Figure out why hunspell tests occasionally fail and make them more consistent [lucene]

2025-02-13 Thread via GitHub
rmuir commented on issue #14235: URL: https://github.com/apache/lucene/issues/14235#issuecomment-2657937921 I think this is the one triggering current failure: will dig into it https://github.com/LibreOffice/dictionaries/commit/762abe74008b94b2ff06db6f4024b59a8254c467

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-13 Thread via GitHub
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1955514351 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsFormat.java: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundatio

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-13 Thread via GitHub
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1955523940 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,457 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2025-02-13 Thread via GitHub
houserjohn commented on PR #13914: URL: https://github.com/apache/lucene/pull/13914#issuecomment-2658101119 Hey @HoustonPutman, I just published [GH#14238](https://github.com/apache/lucene/pull/14238) which contains all of the unit tests that I've created so far. Note that there was a sligh