Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub
jpountz commented on code in PR #13989: URL: https://github.com/apache/lucene/pull/13989#discussion_r1839685981 ## lucene/core/src/java/org/apache/lucene/store/BufferedChecksum.java: ## @@ -60,6 +64,37 @@ public void update(byte[] b, int off, int len) { } } + void upd

Re: [PR] Prevent DefaultPassageFormatter from taking shorter overlapping passages [lucene]

2024-11-12 Thread via GitHub
dsmiley commented on code in PR #13384: URL: https://github.com/apache/lucene/pull/13384#discussion_r1839282216 ## lucene/CHANGES.txt: ## @@ -280,6 +280,8 @@ Optimizations Bug Fixes - +* GITHUB#13384: Fix highlighter to use longer passages instead of shor

Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-11-12 Thread via GitHub
vsop-479 commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2472163944 Hello @jpountz , Please take a look when you get a chance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the U

Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-12 Thread via GitHub
msfroh commented on PR #13987: URL: https://github.com/apache/lucene/pull/13987#issuecomment-2472044035 I think the caching worked fine, albeit in a funny way. When you call `collect()` with any valid doc ID, it invalidates the cache, causing scores to be computed. After the score wa

Re: [PR] Add a Multi-Vector Similarity Function [lucene]

2024-11-12 Thread via GitHub
vigyasharma commented on PR #13991: URL: https://github.com/apache/lucene/pull/13991#issuecomment-2472019578 I am thinking we can leverage the `NONE` aggregation (in #13525) for non-ColBERT passage vector use-cases, by making each graph node correspond to a single value in the multi-vector

[PR] Add a Multi-Vector Similarity Function [lucene]

2024-11-12 Thread via GitHub
vigyasharma opened a new pull request, #13991: URL: https://github.com/apache/lucene/pull/13991 This is a small first change towards adding support for multi-vectors. We start with adding a `MultiVectorSimilarityFunction` that can handle (late) interaction across multiple vector values.

Re: [PR] Allow easier verification of the Panama Vectorization provider with newer Java versions [lucene]

2024-11-12 Thread via GitHub
ChrisHegarty commented on PR #13986: URL: https://github.com/apache/lucene/pull/13986#issuecomment-2471543719 > Personally I would prefer a less if/else/default handling using Optional like done in the previous sysprops. I'll make that change before merging. > Greetings from Co

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub
benchaplin commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1838595160 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors( return status;

Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-12 Thread via GitHub
msfroh commented on PR #13987: URL: https://github.com/apache/lucene/pull/13987#issuecomment-2471331326 Luckily, this is a Lucene 10-only bug (from when `docId()` was removed from `Scorable`). I came across it when updating OpenSearch to support Lucene 10 and needed to refactor some

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub
benchaplin commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1838595160 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors( return status;

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub
benchaplin commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1838568093 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors( return status;

Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub
viswanathk commented on code in PR #13990: URL: https://github.com/apache/lucene/pull/13990#discussion_r1838356398 ## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java: ## @@ -154,7 +154,14 @@ protected TopDocs approximateSearch(

Re: [PR] DocValuesSkipper implementation in IndexSortSorted [lucene]

2024-11-12 Thread via GitHub
gsmiller commented on code in PR #13886: URL: https://github.com/apache/lucene/pull/13886#discussion_r1838528845 ## lucene/core/src/java/org/apache/lucene/search/IndexSortSortedNumericDocValuesRangeQuery.java: ## @@ -397,106 +413,80 @@ private boolean matchAll(PointValues points

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub
benchaplin commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1838509594 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -406,6 +411,21 @@ public static final class VectorValuesStatus { public Throwable erro

Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub
jfboeuf commented on PR #13989: URL: https://github.com/apache/lucene/pull/13989#issuecomment-2471157022 @jpountz [I modified the benchmark to make it more realistic by adding a header to the `IndexOutput` ](https://github.com/apache/lucene/commit/8dc6eac23b3a1158ef4c82860d8574c779bad04

Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub
rmuir commented on PR #13989: URL: https://github.com/apache/lucene/pull/13989#issuecomment-2471135059 OK, I see @jfboeuf, thank you for the explanation. My only concern with with the optimization is testing. If there is a bug here, the user will get CorruptIndexException. Could we

Re: [PR] Introduces IndexInput#updateReadAdvice to change the readadvice while [lucene]

2024-11-12 Thread via GitHub
uschindler commented on PR #13985: URL: https://github.com/apache/lucene/pull/13985#issuecomment-2471068437 Hi, I am currently on travel, so I can't review this. Will look into it posisbly later this week. Greetings from Costa Rica! -- This is an automated message from the Apache Git S

Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]

2024-11-12 Thread via GitHub
ShashwatShivam commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2470937060 I conducted a benchmark using Cohere's 768-dimensional data. Here are the steps I followed for reproducibility: 1. **Set up** the [luceneutil repository](https://github.com

Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub
viswanathk commented on code in PR #13990: URL: https://github.com/apache/lucene/pull/13990#discussion_r1838356398 ## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java: ## @@ -154,7 +154,14 @@ protected TopDocs approximateSearch(

Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub
viswanathk commented on code in PR #13990: URL: https://github.com/apache/lucene/pull/13990#discussion_r1838322211 ## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java: ## @@ -154,7 +154,14 @@ protected TopDocs approximateSearch(

Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub
jpountz commented on PR #13989: URL: https://github.com/apache/lucene/pull/13989#issuecomment-2470814754 The change makes sense to me and looks like it could speed up loading live docs. > The benchmark shows the single-long approach performs better on small arrays. [...] It can be im

Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub
benwtrent commented on code in PR #13990: URL: https://github.com/apache/lucene/pull/13990#discussion_r1838315863 ## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java: ## @@ -154,7 +154,14 @@ protected TopDocs approximateSearch(

Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub
viswanathk commented on PR #13990: URL: https://github.com/apache/lucene/pull/13990#issuecomment-2470836662 I hope I got all of them now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the speci

Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub
jpountz commented on PR #13990: URL: https://github.com/apache/lucene/pull/13990#issuecomment-2470768672 I believe that @benwtrent meant `KnnByteVectorQuery`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abo

Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub
viswanathk commented on PR #13990: URL: https://github.com/apache/lucene/pull/13990#issuecomment-2470717378 > Could you update the byte knn query & DiversifyingChildern* knn queries as well? I made the changes fo DiversifyingChildren*, but by byteknn do you mean `ByteVectorSimilarity

Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub
jfboeuf commented on PR #13989: URL: https://github.com/apache/lucene/pull/13989#issuecomment-2470657083 Thank you for your feedback. Perhaps I misunderstood your point, but the implementation I propose only calls `Checksum.update(byte[])`. The change resides in how the buffer is fed to avo

Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub
benwtrent commented on code in PR #13990: URL: https://github.com/apache/lucene/pull/13990#discussion_r1838141983 ## lucene/core/src/test/org/apache/lucene/search/TestKnnFloatVectorQuery.java: ## @@ -29,13 +29,7 @@ import org.apache.lucene.document.Field; import org.apache.luc

[PR] Addresses the issue #13983 by adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub
viswanathk opened a new pull request, #13990: URL: https://github.com/apache/lucene/pull/13990 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe,

Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]

2024-11-12 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2470503074 Quick update, we have been bothered with some of the numbers (for example, models like "gist" perform poorly) and we have some improvements to get done first before flipping back to "r

Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]

2024-11-12 Thread via GitHub
mikemccand commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2470394946 > @ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed their sizes together. Once this

Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-11-12 Thread via GitHub
ChrisHegarty commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2470361869 Late to the party!!! I want to dig a little further on the distinction between the implementation of preload and prefetch, at least with the mmap implementation. The for

Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-12 Thread via GitHub
mikemccand commented on code in PR #13987: URL: https://github.com/apache/lucene/pull/13987#discussion_r1837971840 ## lucene/core/src/test/org/apache/lucene/search/TestTopFieldCollector.java: ## @@ -359,7 +359,7 @@ public void testTotalHitsWithScore() throws Exception { l

Re: [PR] Introduces IndexInput#updateReadAdvice to change the readadvice while [lucene]

2024-11-12 Thread via GitHub
ChrisHegarty commented on PR #13985: URL: https://github.com/apache/lucene/pull/13985#issuecomment-2470311965 @shatejas I'm curious how much this actually helps, and I know that you said that benchmark results would be posted. I do like that we can update the ReadAdvice on an index in

Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-12 Thread via GitHub
mikemccand commented on PR #13987: URL: https://github.com/apache/lucene/pull/13987#issuecomment-2470331708 Wow, good catch @msfroh. Could we maybe add a new test case that explicitly confirms that the wrapped `Scorable`'s `score` method is indeed only called once even if the outer user ca

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub
mikemccand commented on PR #13984: URL: https://github.com/apache/lucene/pull/13984#issuecomment-2470310487 Actually, `CheckIndex` does have some coverage for vectors and KNN graph (it confirms it can enumerate all vectors, and also runs some searches on it if it's not just flat vectors (`c

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub
mikemccand commented on PR #13984: URL: https://github.com/apache/lucene/pull/13984#issuecomment-2470306515 > this might also check for graph connectivity, see #12627 +1 -- this is a tricky thing about these HNSW graphs (that they are not necessarily born connected but rather must be

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub
mikemccand commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1837944575 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -406,6 +411,21 @@ public static final class VectorValuesStatus { public Throwable erro

Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-11-12 Thread via GitHub
RS146BIJAY commented on issue #13387: URL: https://github.com/apache/lucene/issues/13387#issuecomment-2470112408 On some more analysis figured out an approach which addresses all the above comments and obtain same improvement with different IndexWriter for different group as we got with usi

Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub
rmuir commented on PR #13989: URL: https://github.com/apache/lucene/pull/13989#issuecomment-2470068010 This is actually slower, we only want to call `updateBytes(byte[])` or the checksum calculation is very slow (not vectorized). -- This is an automated message from the Apache Git Service

Re: [PR] Simplify codec setup in vector-related tests. [lucene]

2024-11-12 Thread via GitHub
jpountz merged PR #13970: URL: https://github.com/apache/lucene/pull/13970 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

[PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub
jfboeuf opened a new pull request, #13989: URL: https://github.com/apache/lucene/pull/13989 Take advantage of the existing buffer in `BufferedChecksum` to speed up reads for Longs, Ints, Shorts, and Long arrays by avoiding byte-by-byte reads. Use the faster `readLongs()` method to decode

Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub
tteofili commented on PR #13984: URL: https://github.com/apache/lucene/pull/13984#issuecomment-2469922060 this might also check for graph connectivity, see #12627 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the UR