Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2621481909 ### Description 1. Separate Faiss indexes are maintained per-segment per-field, in line with Lucene's architecture (and the current vector format) 2. Vectors are buffered in memory

[PR] Do not enable security manager on JDK 24+ [lucene]

2025-01-29 Thread via GitHub
ChrisHegarty opened a new pull request, #14179: URL: https://github.com/apache/lucene/pull/14179 This commit avoids setting the security manager on JDK 24+ - since it is no longer possible to enable it in JDK 24+ This is the minimum required to start testing with JDK 24 EA. -- Thi
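
A minimal sketch of the kind of version gate the description above implies, assuming the check is driven by the running JDK's feature version. The class and method names here are hypothetical and are not taken from the PR; only `Runtime.version().feature()` is a real JDK call.

```java
// Hypothetical illustration only; the actual change in PR #14179 may look different.
final class SecurityManagerGate {

  /** Returns true when the given JDK feature version still allows enabling a security manager. */
  static boolean mayInstallSecurityManager(int featureVersion) {
    // Per the PR description, enabling it is no longer possible on JDK 24+.
    return featureVersion < 24;
  }

  public static void main(String[] args) {
    // Runtime.version().feature() yields the major release number, e.g. 21 or 24.
    int feature = Runtime.version().feature();
    System.out.println(
        "JDK " + feature + ": install security manager? " + mayInstallSecurityManager(feature));
  }
}
```

Keeping the decision in a pure function of the feature version makes it easy to exercise both branches without running on two different JDKs.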

Re: [I] SpanWithinQuery - A SpanNotQuery that allows a specified number of intersections [LUCENE-777] [lucene]

2025-01-29 Thread via GitHub
stefanvodita closed issue #1852: SpanWithinQuery - A SpanNotQuery that allows a specified number of intersections [LUCENE-777] URL: https://github.com/apache/lucene/issues/1852 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

Re: [I] SpanWithinQuery - A SpanNotQuery that allows a specified number of intersections [LUCENE-777] [lucene]

2025-01-29 Thread via GitHub
stefanvodita commented on issue #1852: URL: https://github.com/apache/lucene/issues/1852#issuecomment-2621404654 `SpanWithinQuery` got added as part of #7145. Resolving.

Re: [PR] Adjusts Seeded knn searches to clean up user and internal interfaces [lucene]

2025-01-29 Thread via GitHub
cpoerschke commented on code in PR #14170: URL: https://github.com/apache/lucene/pull/14170#discussion_r1933824084 ## lucene/core/src/java/org/apache/lucene/util/hnsw/SeededHnswGraphSearcher.java: ## @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Adjusts Seeded knn searches to clean up user and internal interfaces [lucene]

2025-01-29 Thread via GitHub
cpoerschke commented on code in PR #14170: URL: https://github.com/apache/lucene/pull/14170#discussion_r1933828372 ## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingNearestChildrenKnnCollector.java: ## @@ -42,7 +43,20 @@ class DiversifyingNearestChildrenKnnCollec

[PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
kaivalnp opened a new pull request, #14178: URL: https://github.com/apache/lucene/pull/14178 ### Description Faiss (https://github.com/facebookresearch/faiss) is _"a library for efficient similarity search and clustering of dense vectors"_ It supports various features like vect

Re: [I] Index corruption can cause infinite spin loop when Lucene attempts to incorrectly uncompress fields [LUCENE-772] [lucene]

2025-01-29 Thread via GitHub
stefanvodita closed issue #1847: Index corruption can cause infinite spin loop when Lucene attempts to incorrectly uncompress fields [LUCENE-772] URL: https://github.com/apache/lucene/issues/1847

Re: [I] Index corruption can cause infinite spin loop when Lucene attempts to incorrectly uncompress fields [LUCENE-772] [lucene]

2025-01-29 Thread via GitHub
stefanvodita commented on issue #1847: URL: https://github.com/apache/lucene/issues/1847#issuecomment-2621354936 Closing based on Uwe's assessment. Doesn't seem like this is getting a fix.

[I] Multi-threaded vector search over multiple segments can lead to inconsistent results [lucene]

2025-01-29 Thread via GitHub
benwtrent opened a new issue, #14180: URL: https://github.com/apache/lucene/issues/14180 ### Description Related to: https://github.com/apache/lucene/pull/14167 But multi-threaded search over multiple segments in addition to multi-leaf collection (e.g. information sharing) can

Re: [PR] Add knn result consistency test [lucene]

2025-01-29 Thread via GitHub
benwtrent commented on PR #14167: URL: https://github.com/apache/lucene/pull/14167#issuecomment-2621748238 To aid in the conversation, I opened an issue: https://github.com/apache/lucene/issues/14180 I plan on merging this new test, but with the multi-threaded case muted until we can

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-01-29 Thread via GitHub
benwtrent commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2621652581 > For parentJoin benchmark run on main, there is a visible drop in recall when I disable merges (as compared to a main branch run with merges enabled). Is this expected? I wonde

Re: [PR] Adjust knn merge stability testing [lucene]

2025-01-29 Thread via GitHub
benwtrent merged PR #14172: URL: https://github.com/apache/lucene/pull/14172

Re: [I] testMergeStability failing for Knn formats [lucene]

2025-01-29 Thread via GitHub
benwtrent closed issue #13640: testMergeStability failing for Knn formats URL: https://github.com/apache/lucene/issues/13640

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
mikemccand commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2623225740 > > should report total CPU cycles consumed during indexing and searching (summed across all threads)... > > @mikemccand that would help these higher level multithreaded perform

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
mikemccand commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2623221564 > If I have learned one thing over the years, it's that benchmarking accurately is very difficult! Amen to that!!

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
mikemccand commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622774476 Really, `luceneutil` should report total CPU cycles consumed during indexing and searching (summed across all threads)... I'll open an issue for this.
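
One way to approximate the measurement suggested above, sketched with the JDK's `ThreadMXBean`. This is an illustration under stated assumptions, not how `luceneutil` works: it reports CPU time in nanoseconds rather than cycles, it only sees threads that are still alive when sampled, and the class name `CpuTimeSketch` and the workload loop are hypothetical stand-ins.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of luceneutil.
final class CpuTimeSketch {

  /** Sums CPU time in nanoseconds over all live threads, or returns -1 if the JVM cannot measure it. */
  static long totalCpuNanos() {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    if (!bean.isThreadCpuTimeSupported() || !bean.isThreadCpuTimeEnabled()) {
      return -1L;
    }
    long total = 0L;
    for (long id : bean.getAllThreadIds()) {
      long nanos = bean.getThreadCpuTime(id); // -1 if the thread has already died
      if (nanos >= 0) {
        total += nanos;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    long before = totalCpuNanos();
    double sink = 0; // stand-in workload in place of indexing or searching
    for (int i = 0; i < 5_000_000; i++) {
      sink += Math.sqrt(i);
    }
    long after = totalCpuNanos();
    System.out.println("approx. CPU nanos consumed: " + (after - before) + " (sink=" + sink + ")");
  }
}
```

Because threads that exit between the two samples drop out of the sum, a benchmark harness would likely need to sample per-thread totals before worker threads terminate; that bookkeeping is omitted here.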

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622861110 @benwtrent Thanks for the input! I tried what you mentioned above: > I would reduce the number of indexing threads to 1, faiss threads to 1, and merge workers to 1 Lucene: `

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
benwtrent commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622749459 @kaivalnp the force-merge time indicates that during merge to a single segment, the index is being rebuilt from various segments. I would think that the `force-merge` time itself is mo

[I] Add easier segment tracing / verbosity / transparency to `IndexWriter` [lucene]

2025-01-29 Thread via GitHub
mikemccand opened a new issue, #14182: URL: https://github.com/apache/lucene/issues/14182 ### Description When trying to understand why a shard seems to not do a good job merging, it's surprisingly difficult to gain visibility / understanding. E.g. cases like https://github.com/apac

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1934791780 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsWriter.java: ## @@ -0,0 +1,204 @@ +/* + * Licensed to the Apache Software Foundati

Re: [PR] Use github wf to add module labels for PR based on file changes [lucene]

2025-01-29 Thread via GitHub
pseudo-nymous commented on code in PR #14101: URL: https://github.com/apache/lucene/pull/14101#discussion_r1935129051 ## .github/labeler.yml: ## @@ -0,0 +1,134 @@ +# This file defines module label mappings for the Lucene project. +# Each module is associated with a set of file g

Re: [PR] Use github wf to add module labels for PR based on file changes [lucene]

2025-01-29 Thread via GitHub
pseudo-nymous commented on code in PR #14101: URL: https://github.com/apache/lucene/pull/14101#discussion_r1935130586 ## .github/labeler.yml: ## @@ -0,0 +1,134 @@ +# This file defines module label mappings for the Lucene project. +# Each module is associated with a set of file g

Re: [PR] Use github wf to add module labels for PR based on file changes [lucene]

2025-01-29 Thread via GitHub
pseudo-nymous commented on code in PR #14101: URL: https://github.com/apache/lucene/pull/14101#discussion_r1935132629 ## .github/workflows/label-pull-request.yml: ## @@ -0,0 +1,21 @@ +# This file defines the workflow for labeling pull requests with module tags based on the chan

Re: [PR] Do not enable security manager on JDK 24+ [lucene]

2025-01-29 Thread via GitHub
ChrisHegarty merged PR #14179: URL: https://github.com/apache/lucene/pull/14179

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
benwtrent commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622880879 @kaivalnp 😌 I was worried that we had some serious outstanding performance bug that has been missed in Lucene! Conceptually, it makes sense that the performance of buildi

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
benwtrent commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622885089 > number of vector operations that FAISS does during search. By this, I mean the number of vectors it must visit when searching the graph.

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1934885034 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,268 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1935086492 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsFormat.java: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1935099798 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsWriter.java: ## @@ -0,0 +1,204 @@ +/* + * Licensed to the Apache Software Foundatio

[PR] Add updateable random scorer interface for vector index building [lucene]

2025-01-29 Thread via GitHub
benwtrent opened a new pull request, #14181: URL: https://github.com/apache/lucene/pull/14181 As stated by @ChrisHegarty and @msokolov the amount of garbage we create during vector index creation is pretty astounding. This adjusts the interface to allow an "Updateable" random vector

Re: [PR] Add updateable random scorer interface for vector index building [lucene]

2025-01-29 Thread via GitHub
benwtrent commented on code in PR #14181: URL: https://github.com/apache/lucene/pull/14181#discussion_r1934697182 ## lucene/codecs/src/java/org/apache/lucene/codecs/bitvectors/FlatBitVectorsScorer.java: ## @@ -58,7 +59,7 @@ public RandomVectorScorer getRandomVectorScorer( t

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622946390 > FAISS with this vector dimension does seem about 20% faster at search I should add here that Lucene was using vectorized instructions via Panama, but the C_API of Faiss was not.

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1934818584 ## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,268 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

[I] Allow skip_factor to be set dynamically within QueryCache [lucene]

2025-01-29 Thread via GitHub
sgup432 opened a new issue, #14183: URL: https://github.com/apache/lucene/issues/14183 ### Description I see there have been many discussions around finding the right value for skip_factor ([here](https://issues.apache.org/jira/browse/LUCENE-9002) and https://github.com/apache/lucene

Re: [PR] Use github wf to add module labels for PR based on file changes [lucene]

2025-01-29 Thread via GitHub
stefanvodita commented on code in PR #14101: URL: https://github.com/apache/lucene/pull/14101#discussion_r1934819611 ## .github/workflows/label-pull-request.yml: ## @@ -0,0 +1,21 @@ +# This file defines the workflow for labeling pull requests with module tags based on the chang

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
benwtrent commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622798641 > should report total CPU cycles consumed during indexing and searching (summed across all threads)... @mikemccand that would help these higher level multithreaded performance

Re: [I] Document package javadocs needs improving [LUCENE-1386] [lucene]

2025-01-29 Thread via GitHub
stefanvodita commented on issue #2460: URL: https://github.com/apache/lucene/issues/2460#issuecomment-2621442718 A lot of the documentation (and code!) has changed since 2008. The assessment here is great, but no longer holds, e.g. Package.html, FieldSelect, DateTools.Resolution no longer e

Re: [I] Document package javadocs needs improving [LUCENE-1386] [lucene]

2025-01-29 Thread via GitHub
stefanvodita closed issue #2460: Document package javadocs needs improving [LUCENE-1386] URL: https://github.com/apache/lucene/issues/2460

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
benwtrent commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2621638248 Some very interesting numbers @kaivalnp Almost 10x indexing throughput improvement tells me we are doing something silly in Lucene. Especially since the search time is only about

Re: [PR] Add knn result consistency test [lucene]

2025-01-29 Thread via GitHub
benwtrent merged PR #14167: URL: https://github.com/apache/lucene/pull/14167

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2621943260 > Maybe it can be just as fast by not reading the floating point vectors on to heap and doing memory segment stuff Interesting, do we have a Lucene PR that explores it? > D

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
jimczi commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622455427 > Almost 10x indexing throughput improvement tells me we are doing something silly in Lucene. I did not test this specific integration but Faiss is multithreaded on bulk training,

Re: [PR] Adjusts Seeded knn searches to clean up user and internal interfaces [lucene]

2025-01-29 Thread via GitHub
cpoerschke commented on code in PR #14170: URL: https://github.com/apache/lucene/pull/14170#discussion_r1933798309 ## lucene/core/src/java/org/apache/lucene/search/SeededKnnVectorQuery.java: ## @@ -0,0 +1,321 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2621529365 ### Usage The new format can be used by: - "Describing" the index you want, see https://github.com/facebookresearch/faiss/wiki/The-index-factory - Setting index parameters,

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2025-01-29 Thread via GitHub
kaivalnp commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-2621532621 > I'm also tinkering with a Faiss (https://github.com/facebookresearch/faiss) wrapper Opened #14178, would appreciate feedback :)

Re: [PR] Adjusts Seeded knn searches to clean up user and internal interfaces [lucene]

2025-01-29 Thread via GitHub
cpoerschke commented on code in PR #14170: URL: https://github.com/apache/lucene/pull/14170#discussion_r1933811338 ## lucene/core/src/java/org/apache/lucene/search/knn/KnnSearchStrategy.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Adjusts Seeded knn searches to clean up user and internal interfaces [lucene]

2025-01-29 Thread via GitHub
cpoerschke commented on code in PR #14170: URL: https://github.com/apache/lucene/pull/14170#discussion_r1933812258 ## lucene/core/src/java/org/apache/lucene/search/knn/KnnSearchStrategy.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
jimczi commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622578687 > Not as high as 10x anymore, but it is still ~3x faster Not so easy ;) See the force merge time for Faiss (41.44 s). The force merge is the time it took to merge the created segmen

Re: [PR] Use github wf to add module labels for PR based on file changes [lucene]

2025-01-29 Thread via GitHub
stefanvodita commented on PR #14101: URL: https://github.com/apache/lucene/pull/14101#issuecomment-2622114022 @pseudo-nymous, I'm only seeing this now, sorry! At first glance, it matches what I had in mind - thank you for addressing that issue! I'll do an in-depth review soon, but I'd appre

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622613947 Ah I see :) > The force merge is the time it took to merge the created segments into 1 Does it mean that the Faiss benchmark created a larger number of segments initially,

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2025-01-29 Thread via GitHub
navneet1v commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-2622419830 > > I'm also tinkering with a Faiss (https://github.com/facebookresearch/faiss) wrapper > > Opened [#14178](https://github.com/apache/lucene/pull/14178), would appreciate f

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-01-29 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622538569 > Since Faiss uses multithreading by default, we cannot compare with Lucene Ah nice catch, the number of threads used by both may be different.. I'm not sure how many thread