Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-02-11 Thread via GitHub
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2652803442 > applied the 0xFF mask to scratch in the shift loop This helps generate `vpand` in assembly, but does not help performance much. > Sorry for pushing Not at all, it's in

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-11 Thread via GitHub
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1952063404 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,457 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-11 Thread via GitHub
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1952061362 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,457 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-11 Thread via GitHub
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1952041592 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,457 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-11 Thread via GitHub
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1952038874 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,457 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-11 Thread via GitHub
tveasey commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2652753715 This pull request relates only to OSQ, and thus the proper scope of discussion is regarding the concerns raised around its attribution. We have pursued multiple conversations and d

[I] Refactor QueryCache to improve concurrency and performance [lucene]

2025-02-11 Thread via GitHub
sgup432 opened a new issue, #14222: URL: https://github.com/apache/lucene/issues/14222 ### Description Given the significant role of LRUQueryCache in Lucene, I see opportunities to enhance its performance. Although there have been many discussions on this topic like [here](https://gi

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2025-02-11 Thread via GitHub
github-actions[bot] commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2652354185 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Consistent KNN query results with multiple leafs [lucene]

2025-02-11 Thread via GitHub
benwtrent commented on PR #14191: URL: https://github.com/apache/lucene/pull/14191#issuecomment-2652285114 OK, I ran a slightly modified version of this: https://github.com/apache/lucene/compare/main...benwtrent:lucene:feature/consistent-sharing-knn?expand=1 I indexed 8M docs with Ela

Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-02-11 Thread via GitHub
jpountz commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2652267868 > #current bpv=24 gets vectorized on the shift loop, but not for the remainder loop. This is an interesting observation. I wonder if a small refactoring could help it get auto-vec

Re: [PR] Clean up public non-final statics [lucene]

2025-02-11 Thread via GitHub
msfroh commented on code in PR #14221: URL: https://github.com/apache/lucene/pull/14221#discussion_r1951677461 ## lucene/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java: ## @@ -34,45 +34,44 @@ public class AnalyzerProfile { /** Global ind

Re: [PR] Consistent KNN query results with multiple leafs [lucene]

2025-02-11 Thread via GitHub
benwtrent commented on code in PR #14191: URL: https://github.com/apache/lucene/pull/14191#discussion_r1939664144 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -19,11 +19,7 @@ import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_

Re: [PR] Clean up public non-final statics [lucene]

2025-02-11 Thread via GitHub
risdenk commented on PR #14221: URL: https://github.com/apache/lucene/pull/14221#issuecomment-2652103797 https://github.com/apache/lucene/pull/14216 may help with this, specifically `NonFinalStaticField`, which was off since it was super noisy. https://errorprone.info/bugpattern/NonFinalStati

Re: [PR] Clean up public non-final statics [lucene]

2025-02-11 Thread via GitHub
risdenk commented on code in PR #14221: URL: https://github.com/apache/lucene/pull/14221#discussion_r1951627505 ## lucene/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java: ## @@ -34,45 +34,44 @@ public class AnalyzerProfile { /** Global in

Re: [PR] Clean up public non-final statics [lucene]

2025-02-11 Thread via GitHub
msokolov commented on PR #14221: URL: https://github.com/apache/lucene/pull/14221#issuecomment-2652083206 LGTM, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To u

Re: [PR] Add UnwrappingReuseStrategy for AnalyzerWrapper [lucene]

2025-02-11 Thread via GitHub
jpountz commented on PR #14154: URL: https://github.com/apache/lucene/pull/14154#issuecomment-2652082591 I suspect that `ReuseStrategy` hasn't been built with the idea that it could be used to invalidate components, but only as a way to describe an efficient caching strategy for analysis com

Re: [I] TestForTooMuchCloning.test fails [lucene]

2025-02-11 Thread via GitHub
benwtrent commented on issue #14220: URL: https://github.com/apache/lucene/issues/14220#issuecomment-2651975052 Took a bit to find the first commit where this fails. It sometimes succeeds, so git bisect was tricky. The first commit I could find is this one: It fails pretty rel

[PR] Clean up public non-final statics [lucene]

2025-02-11 Thread via GitHub
msfroh opened a new pull request, #14221: URL: https://github.com/apache/lucene/pull/14221 ### Description Following up on https://github.com/apache/lucene/issues/14151 and https://github.com/apache/lucene/issues/14152, I decided to grep for other `public static` non-`final` variable

Re: [PR] Correct hashCode of SynonymQuery [lucene]

2025-02-11 Thread via GitHub
mayya-sharipova merged PR #14217: URL: https://github.com/apache/lucene/pull/14217

[I] TestForTooMuchCloning.test fails [lucene]

2025-02-11 Thread via GitHub
benwtrent opened a new issue, #14220: URL: https://github.com/apache/lucene/issues/14220 ### Description I have seen this fail in periodic CI builds and on some PR builds. ``` TestForTooMuchCloning > test FAILED --   | java.lang.AssertionError: too many calls to IndexIn

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-11 Thread via GitHub
gsmiller commented on PR #14204: URL: https://github.com/apache/lucene/pull/14204#issuecomment-2651798767 This looks good to me @jpountz. I think it makes sense to put this in sandbox, but I'd personally also be fine with leaving it where you initially had it. (I think this also highlights

Re: [PR] Bugfix/fix hnsw search termination check [lucene]

2025-02-11 Thread via GitHub
benwtrent merged PR #14215: URL: https://github.com/apache/lucene/pull/14215

Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub
Vikasht34 commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651710710 @benwtrent Thanks ...

Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub
benwtrent commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651707316 @Vikasht34 there is already a general "make hnsw merges faster" issue, so I think you can just refer to it. I look forward to the results!

Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub
Vikasht34 commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651696435 @benwtrent Should we create a separate issue? I can take a stab at implementing it with this paper https://www.arxiv.org/pdf/1908.00814v5

Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub
benwtrent commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651681430 Ah, ok :) no offense meant. It just struck me, while they are ideas that might be worth exploring, they were a little too general (what I have seen from LLMs in the past).

Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub
Vikasht34 commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651658228 @benwtrent haha, it looks like it is from an LLM. Originally it came from this paper https://www.arxiv.org/pdf/1908.00814v5 ``` The upper layer graph is merged into the app

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-11 Thread via GitHub
gsmiller commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1951323035 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-11 Thread via GitHub
john-wagster commented on PR #14192: URL: https://github.com/apache/lucene/pull/14192#issuecomment-2651490994 @rmuir I made another pass based on your feedback, and I agree with keeping this simple for a first pass. To that end I've done the following: * CaseFolding is no

Re: [PR] Add UnwrappingReuseStrategy for AnalyzerWrapper [lucene]

2025-02-11 Thread via GitHub
mayya-sharipova commented on PR #14154: URL: https://github.com/apache/lucene/pull/14154#issuecomment-2651354310 @jpountz Thank you for your feedback. I am happy to discuss alternatives. Isn't the whole idea of ReuseStrategy to decide if an analyzer's components can be reused or shou

Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-02-11 Thread via GitHub
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2651208564 Thanks for the feedback! I implemented the fixed-size inner loop and printed out the assembly for all. [perf_asm.log](https://github.com/user-attachments/files/18752147/perf_asm.log) * When pr

Re: [PR] Bugfix/fix hnsw search termination check [lucene]

2025-02-11 Thread via GitHub
tteofili commented on PR #14215: URL: https://github.com/apache/lucene/pull/14215#issuecomment-2651170577 only perhaps it'd be nice if we could add an @Monster / nightly test to make sure we don't run into this again in the future.

Re: [PR] Bugfix/fix hnsw search termination check [lucene]

2025-02-11 Thread via GitHub
tteofili commented on code in PR #14215: URL: https://github.com/apache/lucene/pull/14215#discussion_r1951068471 ## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ## @@ -52,7 +52,7 @@ public boolean collect(int docId, float similarity) { @Override pu

Re: [PR] Bugfix/fix hnsw search termination check [lucene]

2025-02-11 Thread via GitHub
jimczi commented on code in PR #14215: URL: https://github.com/apache/lucene/pull/14215#discussion_r1950877363 ## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ## @@ -52,7 +52,7 @@ public boolean collect(int docId, float similarity) { @Override publ

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-11 Thread via GitHub
epotyom commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1950866681 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Bugfix/fix hnsw search termination check [lucene]

2025-02-11 Thread via GitHub
benwtrent commented on code in PR #14215: URL: https://github.com/apache/lucene/pull/14215#discussion_r1950821830 ## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ## @@ -52,7 +52,7 @@ public boolean collect(int docId, float similarity) { @Override p

Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub
benwtrent commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2650772079 @Vikasht34 I thank you for helpful suggestions, but these just seem like rehashes of already made suggestions or things that are completely unrelated. Were these generated via som

[I] NRT replication should make it possible/easy to use bite-sized commits [lucene]

2025-02-11 Thread via GitHub
mikemccand opened a new issue, #14219: URL: https://github.com/apache/lucene/issues/14219 ### Description At Amazon (product search) we use Lucene's [awesome near-real-time segment replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html) to effi

Re: [PR] Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves [lucene]

2025-02-11 Thread via GitHub
iverase commented on PR #14176: URL: https://github.com/apache/lucene/pull/14176#issuecomment-2650518843 >The [nightly geo benchy](https://benchmarks.mikemccandless.com/geobench.html) didn't seem impacted either way; maybe the tasks it runs are not exercising the optimized path here.

Re: [PR] Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves [lucene]

2025-02-11 Thread via GitHub
mikemccand commented on PR #14176: URL: https://github.com/apache/lucene/pull/14176#issuecomment-2650500796 > Incredible speedups here https://benchmarks.mikemccandless.com/FilteredIntNRQ.html and here https://benchmarks.mikemccandless.com/IntNRQ.html Yeah, wow! > These numb

Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-11 Thread via GitHub
jpountz commented on PR #14204: URL: https://github.com/apache/lucene/pull/14204#issuecomment-2650473311 I moved the code to the sandbox facet framework and applied suggestions, I hope I didn't miss any.

Re: [I] org.apache.lucene.search.TestKnnFloatVectorQuery.testFindFewer ComparisonFailure: expected: but was: [lucene]

2025-02-11 Thread via GitHub
navneet1v commented on issue #14175: URL: https://github.com/apache/lucene/issues/14175#issuecomment-2650199394 @ChrisHegarty I did some debugging on this failed test, and what I found is that for the particular seed the Lucene99ScalarQuantizedVectorsFormat is getting picked from a list of formats