Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]
tteofili commented on PR #13984: URL: https://github.com/apache/lucene/pull/13984#issuecomment-2469922060

this might also check for graph connectivity, see #12627
[PR] Improve checksum calculations [lucene]
jfboeuf opened a new pull request, #13989: URL: https://github.com/apache/lucene/pull/13989

Take advantage of the existing buffer in `BufferedChecksum` to speed up reads for Longs, Ints, Shorts, and Long arrays by avoiding byte-by-byte reads. Use the faster `readLongs()` method to decode live docs and bloom filters.
Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]
msfroh commented on PR #13987: URL: https://github.com/apache/lucene/pull/13987#issuecomment-2471331326

Luckily, this is a Lucene 10-only bug (from when `docID()` was removed from `Scorable`). I came across it when updating OpenSearch to support Lucene 10 and needed to refactor some code that used `ScoreCachingWrappingScorer`.
[PR] Add a Multi-Vector Similarity Function [lucene]
vigyasharma opened a new pull request, #13991: URL: https://github.com/apache/lucene/pull/13991

This is a small first change towards adding support for multi-vectors. We start with adding a `MultiVectorSimilarityFunction` that can handle (late) interaction across multiple vector values.

This is the first of a series of splits for the larger prototype PR #13525
Re: [PR] Add a Multi-Vector Similarity Function [lucene]
vigyasharma commented on PR #13991: URL: https://github.com/apache/lucene/pull/13991#issuecomment-2472019578

I am thinking we can leverage the `NONE` aggregation (in #13525) for non-ColBERT passage vector use cases, by making each graph node correspond to a single value in the multi-vector, i.e. index-time aggregation becomes "none". The resultant graph would be similar to what we construct with parent-child docs today, while flat storage with multi-vectors could allow for aggregated similarity checks at *query* time. This could help with recall while making multi-vectors easier to use (no overquery or index-time joins).

It'll need some work: a mechanism to address each vector value directly, and corresponding changes in VectorValues. I'm thinking maybe an "ordinal" for the multi-vector, and a "sub-ordinal" for values within the multi-vector. Both ints can be packed into a long for the node value?

Since I haven't chalked out all the details yet, I decided to **remove** the `NONE` aggregation for now, and focus first on the ColBERT use case. Will go with the `isMultiVector` flag in FieldInfos to identify multi-vector storage requirements.

cc: @cpoerschke , @benwtrent
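For illustration, the two-ints-in-a-long packing floated above could look like the minimal sketch below (plain Java; the class and method names are illustrative and not part of the PR):

```java
// Pack a multi-vector ordinal and a sub-ordinal into one long graph-node value,
// and unpack them again. Names are illustrative, not from the PR.
final class MultiVectorNode {

  static long pack(int ordinal, int subOrdinal) {
    // high 32 bits: multi-vector ordinal, low 32 bits: sub-ordinal within it
    return (((long) ordinal) << 32) | (subOrdinal & 0xFFFFFFFFL);
  }

  static int ordinal(long node) {
    return (int) (node >>> 32);
  }

  static int subOrdinal(long node) {
    return (int) node;
  }
}
```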
Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]
msfroh commented on PR #13987: URL: https://github.com/apache/lucene/pull/13987#issuecomment-2472044035

I think the caching worked fine, albeit in a funny way. When you call `collect()` with any valid doc ID, it invalidates the cache, causing the score to be computed. After the score was computed, `currDoc` was set to `-1`, so repeated calls to `score()` (until the next `collect()`) will reuse the cached value because `-1 == -1`.

As long as you don't call `collect()` on the same document multiple times (which you probably shouldn't be doing anyway -- those existing unit tests were a little weird), it should be okay.
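For readers unfamiliar with the class, here is a minimal standalone sketch of the caching contract being discussed (plain Java with illustrative names; this is not the Lucene implementation):

```java
// Sketch of the intended contract: compute the underlying score once per collected
// doc and serve repeated score() calls for that doc from the cache.
class ScoreCachingSketch {
  interface Scorer { float score(); }

  private final Scorer in;
  private int currDoc = -1;        // doc currently being collected
  private int lastScoredDoc = -1;  // doc the cached score belongs to
  private float cachedScore;

  ScoreCachingSketch(Scorer in) { this.in = in; }

  void collect(int doc) { currDoc = doc; }  // the collector moves to a new doc

  float score() {
    if (currDoc != lastScoredDoc) {  // first score() for this doc: fill the cache
      cachedScore = in.score();
      lastScoredDoc = currDoc;       // remembering which doc was scored is the key step
    }
    return cachedScore;
  }
}
```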
Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]
benchaplin commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1838568093

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

```diff
@@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors(
     return status;
   }
 
+  private static HnswGraph getHnswGraph(CodecReader reader) throws IOException {
+    KnnVectorsReader vectorsReader = reader.getVectorReader();
+    if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader) {
+      vectorsReader = ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader).getFieldReader("knn");
+      if (vectorsReader instanceof HnswGraphProvider) {
+        return ((HnswGraphProvider) vectorsReader).getGraph("knn");
+      }
+    }
+    return null;
+  }
+
+  /** Test the HNSW graph. */
+  public static Status.HnswGraphStatus testHnswGraph(
+      CodecReader reader, PrintStream infoStream, boolean failFast) throws IOException {
+    if (infoStream != null) {
+      infoStream.print("test: hnsw graph..");
+    }
+    long startNS = System.nanoTime();
+    Status.HnswGraphStatus status = new Status.HnswGraphStatus();
+
+    try {
+      HnswGraph hnswGraph = getHnswGraph(reader);
+      if (hnswGraph != null) {
+        final int numLevels = hnswGraph.numLevels();
+        // Perform tests on each level of the HNSW graph
+        for (int level = numLevels - 1; level >= 0; level--) {
+          HnswGraph.NodesIterator nodesIterator = hnswGraph.getNodesOnLevel(level);
+          while (nodesIterator.hasNext()) {
+            int node = nodesIterator.nextInt();
+            hnswGraph.seek(level, node);
+            int nbr, lastNeighbor = -1, firstNeighbor = -1;
+            while ((nbr = hnswGraph.nextNeighbor()) != NO_MORE_DOCS) {
+              if (firstNeighbor == -1) {
+                firstNeighbor = nbr;
+              }
+              if (nbr < lastNeighbor) {
+                throw new CheckIndexException(
+                    "Neighbors out of order for node " + node + ": " + nbr + "<" + lastNeighbor + " 1st=" + firstNeighbor);
+              } else if (nbr == lastNeighbor) {
+                throw new CheckIndexException(
+                    "There are repeated neighbors of node " + node + " with value " + nbr);
+              }
+              lastNeighbor = nbr;
+            }
+            status.hnswGraphSize++;
+          }
+          status.hsnwGraphNumLevels++;
```

Review Comment:

I could, but wanted to respect the object as a "status" - so if a check fails, `numLevels` will indicate on which level it failed. However we don't print the status if an exception is thrown. What do you think?
Re: [PR] Introduces IndexInput#updateReadAdvice to change the readadvice while [lucene]
uschindler commented on PR #13985: URL: https://github.com/apache/lucene/pull/13985#issuecomment-2471068437

Hi, I am currently on travel, so I can't review this. Will look into it possibly later this week. Greetings from Costa Rica!
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2472163944

Hello @jpountz, please take a look when you get a chance.
Re: [PR] Prevent DefaultPassageFormatter from taking shorter overlapping passages [lucene]
dsmiley commented on code in PR #13384: URL: https://github.com/apache/lucene/pull/13384#discussion_r1839282216

## lucene/CHANGES.txt:

```diff
@@ -280,6 +280,8 @@ Optimizations
 
 Bug Fixes
 -
+* GITHUB#13384: Fix highlighter to use longer passages instead of shorter individual terms. (Zack Kendall)
```

Review Comment:

Next time please indicate *which* highlighter; we have 3-4!
Re: [PR] Allow easier verification of the Panama Vectorization provider with newer Java versions [lucene]
ChrisHegarty commented on PR #13986: URL: https://github.com/apache/lucene/pull/13986#issuecomment-2471543719

> Personally I would prefer a less if/else/default handling using Optional like done in the previous sysprops.

I'll make that change before merging.

> Greetings from Costa Rica!

Enjoy!!!
Re: [PR] Improve checksum calculations [lucene]
jpountz commented on code in PR #13989: URL: https://github.com/apache/lucene/pull/13989#discussion_r1839685981

## lucene/core/src/java/org/apache/lucene/store/BufferedChecksum.java:

```diff
@@ -60,6 +64,37 @@ public void update(byte[] b, int off, int len) {
     }
   }
 
+  void updateShort(short val) {
+    if (upto + Short.BYTES > buffer.length) flush();
+    BitUtil.VH_LE_SHORT.set(buffer, upto, val);
+    upto += Short.BYTES;
+  }
+
+  void updateInt(int val) {
+    if (upto + Integer.BYTES > buffer.length) flush();
+    BitUtil.VH_LE_INT.set(buffer, upto, val);
+    upto += Integer.BYTES;
+  }
+
+  void updateLong(long val) {
+    if (upto + Long.BYTES > buffer.length) flush();
+    BitUtil.VH_LE_LONG.set(buffer, upto, val);
+    upto += Long.BYTES;
+  }
+
+  void updateLongs(long[] vals, int offset, int len) {
+    flush();
```

Review Comment:

All other `updateXXX()` methods only flush if there is no room left or if the data to update is bigger than the buffer size. I'd like to retain this property here (even though your benchmarks suggest that it doesn't hurt performance much).
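For illustration, a sketch of an `updateLongs` that only flushes lazily, in the spirit of `update(byte[], int, int)`. It assumes the `buffer`, `upto` and `flush()` members shown in the diff above and is not the final patch:

```java
// Sketch only: bulk-copy longs into the existing checksum buffer through a
// little-endian view, flushing only when the next chunk would not fit.
// Requires: import java.nio.ByteBuffer; import java.nio.ByteOrder;
void updateLongs(long[] vals, int offset, int len) {
  while (len > 0) {
    int capacity = (buffer.length - upto) / Long.BYTES;  // remaining room, in longs
    if (capacity == 0) {
      flush();
      capacity = buffer.length / Long.BYTES;
    }
    int chunk = Math.min(len, capacity);
    ByteBuffer.wrap(buffer, upto, chunk * Long.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asLongBuffer()
        .put(vals, offset, chunk);
    upto += chunk * Long.BYTES;
    offset += chunk;
    len -= chunk;
  }
}
```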
Re: [PR] DocValuesSkipper implementation in IndexSortSorted [lucene]
gsmiller commented on code in PR #13886: URL: https://github.com/apache/lucene/pull/13886#discussion_r1838528845

## lucene/core/src/java/org/apache/lucene/search/IndexSortSortedNumericDocValuesRangeQuery.java:

```diff
@@ -397,106 +413,80 @@ private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] quer
   }
 
   private IteratorAndCount getDocIdSetIteratorOrNullFromBkd(
-      LeafReaderContext context, DocIdSetIterator delegate) throws IOException {
-    Sort indexSort = context.reader().getMetaData().sort();
+      LeafReader reader, NumericDocValues numericDocValues, DocValuesSkipper skipper)
+      throws IOException {
+    if (skipper.docCount() != reader.maxDoc()) {
+      return null;
+    }
+    final Sort indexSort = reader.getMetaData().sort();
     if (indexSort == null
         || indexSort.getSort().length == 0
         || indexSort.getSort()[0].getField().equals(field) == false) {
       return null;
     }
 
+    final int minDocID;
+    final int maxDocID;
     final boolean reverse = indexSort.getSort()[0].getReverse();
-
-    PointValues points = context.reader().getPointValues(field);
```

Review Comment:

Yeah, I'm not really sure to be honest. Ideally we would measure/benchmark this. I'm OK with either option as a starting point though, as long as we're still utilizing these non-skipper options when a skipper doesn't exist.
Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]
benchaplin commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1838509594

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

```diff
@@ -406,6 +411,21 @@ public static final class VectorValuesStatus {
     public Throwable error;
   }
 
+  /** Status from testing HNSW graph */
+  public static final class HnswGraphStatus {
+
+    HnswGraphStatus() {}
+
+    /** Number of nodes in the graph */
+    public int hnswGraphSize;
```

Review Comment:

Good point - renamed it `totalNumNodes` to be extra clear.
Re: [PR] Improve checksum calculations [lucene]
jfboeuf commented on PR #13989: URL: https://github.com/apache/lucene/pull/13989#issuecomment-2471157022

@jpountz [I modified the benchmark to make it more realistic by adding a header to the `IndexOutput`](https://github.com/apache/lucene/commit/8dc6eac23b3a1158ef4c82860d8574c779bad047) so:

* `updateLongs(long[], int, int)` triggers a real flush before processing (the benchmark was biased because the buffer was empty, which is unlikely in real operations where there is at least the header and the table size before).
* write accesses to the buffer by `updateLong(long)` are not necessarily aligned (they always are with the new `updateLongs(long[], int, int)`).

After testing with different sizes of `long[]`, it seems the loop of `updateLong(long)` is only worth it for **really small** arrays, and this specific case probably doesn't deserve the additional complexity/if-branch in `updateLongs(long[], int, int)`.

```
Benchmark                                              (size)   Mode  Cnt     Score     Error   Units
BufferedChecksumIndexInputBenchmark.decodeLongArray         8  thrpt   15  5992.000 ± 123.429  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArray        16  thrpt   15  5729.068 ± 123.423  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArray        32  thrpt   15  5307.703 ± 149.885  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArray        64  thrpt   15  4960.178 ± 124.810  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs       8  thrpt   15  6182.986 ± 253.181  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs      16  thrpt   15  5472.158 ± 190.560  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs      32  thrpt   15  4739.634 ± 404.963  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs      64  thrpt   15  3858.685 ±  81.880  ops/ms
```

@rmuir I'll add unit tests to thoroughly check the method behaves properly.
Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]
benchaplin commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1838595160

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

```diff
@@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors(
     return status;
   }
 
+  private static HnswGraph getHnswGraph(CodecReader reader) throws IOException {
+    KnnVectorsReader vectorsReader = reader.getVectorReader();
+    if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader) {
+      vectorsReader = ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader).getFieldReader("knn");
```

Review Comment:

Thanks, I didn't quite understand fields when I wrote this - I think I get it now. Alright, I've done what you suggested (as is also done in `testVectors`) and iterated over `FieldInfos`, performing the check only when it applies.

Because we might now parse several HNSW graphs, I've restructured the status object to support per-graph data. Successful output will now look like:

```
test: open reader.OK [took 0.010 sec]
test: check integrity.OK [took 2.216 sec]
test: check live docs.OK [took 0.000 sec]
test: field infos.OK [2 fields] [took 0.000 sec]
test: field norms.OK [0 fields] [took 0.000 sec]
test: terms, freq, prox...
test: stored fields...OK [150 total field count; avg 1.0 fields per doc] [took 0.390 sec]
test: term vectorsOK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET; 0 SKIPPING INDEX] [took 0.000 sec]
test: points..OK [0 fields, 0 points] [took 0.000 sec]
test: vectors.OK [1 fields, 150 vectors] [took 0.496 sec]
test: hnsw graphs.OK [2 fields: (field name: knn1, levels: 4, total nodes: 1547684), (field name: knn2, levels: 4, total nodes: 1547684)] [took 0.979 sec]
```

`testVectors` doesn't do this, it just sums vectors over all fields. I could do that too, but this felt most complete.
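For readers following along, the per-field variant is roughly the shape below. This is a sketch built from the APIs quoted in this thread (it assumes `FieldInfo.hasVectorValues()` as the per-field test); it is not the actual patch:

```java
// Iterate FieldInfos and only run the graph checks for fields that have vectors
// and whose reader actually exposes an HNSW graph (flat vector fields are skipped).
for (FieldInfo fieldInfo : reader.getFieldInfos()) {
  if (fieldInfo.hasVectorValues() == false) {
    continue;
  }
  KnnVectorsReader vectorsReader = reader.getVectorReader();
  if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader perField) {
    vectorsReader = perField.getFieldReader(fieldInfo.name);
  }
  if (vectorsReader instanceof HnswGraphProvider provider) {
    HnswGraph hnswGraph = provider.getGraph(fieldInfo.name);
    // ... run the level/neighbor checks shown above on hnswGraph, per field
  }
}
```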
Re: [PR] Improve checksum calculations [lucene]
rmuir commented on PR #13989: URL: https://github.com/apache/lucene/pull/13989#issuecomment-2471135059

OK, I see @jfboeuf, thank you for the explanation. My only concern with the optimization is testing. If there is a bug here, the user will get CorruptIndexException. Could we add some low-level tests somehow? Especially the readLongs() seems like potential trouble, as I am not sure anything tests that it works correctly with a non-zero `offset`.
Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]
viswanathk commented on PR #13990: URL: https://github.com/apache/lucene/pull/13990#issuecomment-2470836662

I hope I got all of them now.
Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]
benwtrent commented on code in PR #13990: URL: https://github.com/apache/lucene/pull/13990#discussion_r1838315863

## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:

```diff
@@ -154,7 +154,14 @@ protected TopDocs approximateSearch(
 
   @Override
   public String toString(String field) {
-    return getClass().getSimpleName() + ":" + this.field + "[" + query[0] + ",...][" + k + "]";
```

Review Comment:

test updates for this and the float version? The tests are in: TestParentBlockJoinByteKnnVectorQuery and of course the float version as well.
Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]
mikemccand commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2470394946

> @ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed their sizes together.

Once this cool change is merged let's fix `luceneutil`'s KNN benchy tooling (`knnPerfTest.py`, `KnnGraphTester.java`) to compute/report the "memory column" ("hot RAM", "searchable RAM", something)? Basically everything except the original (`float32` or `byte`) vectors. I'll open an upstream `luceneutil` issue...
Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2470503074

Quick update: we have been bothered by some of the numbers (for example, models like "gist" perform poorly) and we have some improvements to get done first before flipping back to "ready for review".

@mikemccand YES! That would be great! "Memory required" would be the quantized file size + HNSW graph file size (if the graph exists).

@ShashwatShivam Sorry for the late reply. There are no "out of the box" rescoring actions directly in Lucene, mainly because the individual tools are (mostly) already available to you. You can ask for more overall vectors with one query, and then rescore the individual documents according to the raw vector comparisons. I admit, this requires some Lucene API know-how. It would be good for a "vector scorer" to indicate if it's an estimation or not, to allow for smarter actions in the knn doc collector...
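One possible shape of that "ask for more, then rescore" flow, as a sketch rather than an out-of-the-box Lucene API. The `vectorForDoc` callback is hypothetical; it stands in for however you read the raw vector back, which differs across Lucene versions:

```java
// Oversampled ANN search followed by exact rescoring on the raw vectors.
// Assumed imports: java.io.IOException, java.util.Arrays, java.util.function.IntFunction,
// org.apache.lucene.index.VectorSimilarityFunction, org.apache.lucene.search.*.
static TopDocs searchThenRescore(
    IndexSearcher searcher, String field, float[] query, int k, int oversample,
    IntFunction<float[]> vectorForDoc) throws IOException {
  int candidates = k * oversample;
  // approximate (possibly quantized) pass: retrieve more candidates than needed
  TopDocs approx = searcher.search(new KnnFloatVectorQuery(field, query, candidates), candidates);
  // exact pass: recompute scores against the raw float vectors
  ScoreDoc[] rescored = new ScoreDoc[approx.scoreDocs.length];
  for (int i = 0; i < rescored.length; i++) {
    int doc = approx.scoreDocs[i].doc;
    float exact = VectorSimilarityFunction.DOT_PRODUCT.compare(query, vectorForDoc.apply(doc));
    rescored[i] = new ScoreDoc(doc, exact);
  }
  Arrays.sort(rescored, (a, b) -> Float.compare(b.score, a.score));
  return new TopDocs(approx.totalHits, Arrays.copyOf(rescored, Math.min(k, rescored.length)));
}
```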
Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]
jpountz commented on PR #13990: URL: https://github.com/apache/lucene/pull/13990#issuecomment-2470768672

I believe that @benwtrent meant `KnnByteVectorQuery`.
Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]
mikemccand commented on code in PR #13984: URL: https://github.com/apache/lucene/pull/13984#discussion_r1837944575

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

```diff
@@ -406,6 +411,21 @@ public static final class VectorValuesStatus {
     public Throwable error;
   }
 
+  /** Status from testing HNSW graph */
+  public static final class HnswGraphStatus {
+
+    HnswGraphStatus() {}
+
+    /** Number of nodes in the graph */
+    public int hnswGraphSize;
+
+    /** Number of levels of the graph */
+    public int hsnwGraphNumLevels;
```

Review Comment:

`graphNumLevels`? Naming is the hardest part!

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

```diff
@@ -406,6 +411,21 @@ public static final class VectorValuesStatus {
     public Throwable error;
   }
 
+  /** Status from testing HNSW graph */
+  public static final class HnswGraphStatus {
+
+    HnswGraphStatus() {}
+
+    /** Number of nodes in the graph */
+    public int hnswGraphSize;
```

Review Comment:

Maybe rename to `graphNumNodes` or so? `size` is a bit ambiguous/tricky, and since `hnsw` is in the class name we can prolly leave it off the member name?

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

```diff
@@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors(
     return status;
   }
 
+  private static HnswGraph getHnswGraph(CodecReader reader) throws IOException {
+    KnnVectorsReader vectorsReader = reader.getVectorReader();
+    if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader) {
+      vectorsReader = ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader).getFieldReader("knn");
```

Review Comment:

Hmm the field can be anything (not necessarily `knn` like `luceneutil` uses)? Can we change this so we iterate `FieldInfos` and for any field that has KNN vectors enabled (`FieldInfo.hasVectorValues`, hmm but also has the HNSW graph (since I think a field can be only "flat" vectors?)) we do this check?

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

```diff
@@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors(
     return status;
   }
 
+  private static HnswGraph getHnswGraph(CodecReader reader) throws IOException {
+    KnnVectorsReader vectorsReader = reader.getVectorReader();
+    if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader) {
+      vectorsReader = ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader).getFieldReader("knn");
+      if (vectorsReader instanceof HnswGraphProvider) {
+        return ((HnswGraphProvider) vectorsReader).getGraph("knn");
+      }
+    }
+    return null;
+  }
+
+  /** Test the HNSW graph. */
+  public static Status.HnswGraphStatus testHnswGraph(
+      CodecReader reader, PrintStream infoStream, boolean failFast) throws IOException {
+    if (infoStream != null) {
+      infoStream.print("test: hnsw graph..");
+    }
+    long startNS = System.nanoTime();
+    Status.HnswGraphStatus status = new Status.HnswGraphStatus();
+
+    try {
+      HnswGraph hnswGraph = getHnswGraph(reader);
+      if (hnswGraph != null) {
+        final int numLevels = hnswGraph.numLevels();
+        // Perform tests on each level of the HNSW graph
+        for (int level = numLevels - 1; level >= 0; level--) {
+          HnswGraph.NodesIterator nodesIterator = hnswGraph.getNodesOnLevel(level);
+          while (nodesIterator.hasNext()) {
+            int node = nodesIterator.nextInt();
+            hnswGraph.seek(level, node);
+            int nbr, lastNeighbor = -1, firstNeighbor = -1;
+            while ((nbr = hnswGraph.nextNeighbor()) != NO_MORE_DOCS) {
+              if (firstNeighbor == -1) {
+                firstNeighbor = nbr;
+              }
+              if (nbr < lastNeighbor) {
+                throw new CheckIndexException(
+                    "Neighbors out of order for node " + node + ": " + nbr + "<" + lastNeighbor + " 1st=" + firstNeighbor);
+              } else if (nbr == lastNeighbor) {
+                throw new CheckIndexException(
+                    "There are repeated neighbors of node " + node + " with value " + nbr);
+              }
+              lastNeighbor = nbr;
+            }
+            status.hnswGraphSize++;
+          }
+          status.hsnwGraphNumLevels++;
```

Review Comment:

Instead of incrementing here, you could just set it to `numLevels` from above?
Re: [PR] Simplify codec setup in vector-related tests. [lucene]
jpountz merged PR #13970: URL: https://github.com/apache/lucene/pull/13970
Re: [PR] Improve checksum calculations [lucene]
rmuir commented on PR #13989: URL: https://github.com/apache/lucene/pull/13989#issuecomment-2470068010

This is actually slower; we only want to call `updateBytes(byte[])`, or the checksum calculation is very slow (not vectorized).
Re: [PR] Improve checksum calculations [lucene]
jpountz commented on PR #13989: URL: https://github.com/apache/lucene/pull/13989#issuecomment-2470814754

The change makes sense to me and looks like it could speed up loading live docs.

> The benchmark shows the single-long approach performs better on small arrays. [...] It can be improved by making updateLongs(long[], int, int) switch to a loop over updateLong(long) when the length to checksum fits in the buffer.

Something like this would make sense to me, let's make it look as similar as possible to `update(byte[], int, int)`?
Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]
viswanathk commented on code in PR #13990: URL: https://github.com/apache/lucene/pull/13990#discussion_r1838322211

## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:

```diff
@@ -154,7 +154,14 @@ protected TopDocs approximateSearch(
 
   @Override
   public String toString(String field) {
-    return getClass().getSimpleName() + ":" + this.field + "[" + query[0] + ",...][" + k + "]";
```

Review Comment:

Yeah, I did check them. They didn't have existing tests that cover `toString`, so I skipped them. Will add that coverage though.
Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]
viswanathk commented on code in PR #13990: URL: https://github.com/apache/lucene/pull/13990#discussion_r1838356398

## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:

```diff
@@ -154,7 +154,14 @@ protected TopDocs approximateSearch(
 
   @Override
   public String toString(String field) {
-    return getClass().getSimpleName() + ":" + this.field + "[" + query[0] + ",...][" + k + "]";
```

Review Comment:

@benwtrent for the Diversifying* does it make sense to capture the `childFilter` and `parentsFilter` in toString() as well?
Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]
mikemccand commented on code in PR #13987: URL: https://github.com/apache/lucene/pull/13987#discussion_r1837971840

## lucene/core/src/test/org/apache/lucene/search/TestTopFieldCollector.java:

```diff
@@ -359,7 +359,7 @@ public void testTotalHitsWithScore() throws Exception {
       leafCollector.collect(1);
 
       scorer.score = 4;
-      leafCollector.collect(1);
+      leafCollector.collect(2);
```

Review Comment:

Another pre-existing test bug fixed?

## lucene/core/src/test/org/apache/lucene/search/TestPositiveScoresOnlyCollector.java:

```diff
@@ -114,8 +114,9 @@ public void testNegativeScores() throws Exception {
     Collector c = new PositiveScoresOnlyCollector(tdc);
     LeafCollector ac = c.getLeafCollector(ir.leaves().get(0));
     ac.setScorer(s);
-    while (s.iterator().nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
-      ac.collect(0);
+    int docId;
+    while ((docId = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      ac.collect(docId);
```

Review Comment:

Wow, good catch! So this was a latent pre-existing test bug, and with your above bug fix this test is now (correctly!) failing, and with your fix to this separate test bug, the test now passes?
Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]
RS146BIJAY commented on issue #13387: URL: https://github.com/apache/lucene/issues/13387#issuecomment-2470112408

On some more analysis, I figured out an approach which addresses all the above comments and obtains the same improvement with a different IndexWriter per group as we got with using different DWPTs per group.

## Using separate IndexWriters for maintaining different tenants with a combined view

### Current Issue

Maintaining a separate IndexWriter for each group (tenant) presents a significant problem, as the writers do not function as a single unified entity. Although distinct IndexWriters and directories for each group ensure that data belonging to different groups is kept in separate segments and that segments within the same group are merged, a unified read-only view is still needed for the client (OpenSearch) to interact with these multiple group-level IndexWriters. Lucene's addIndexes API offers a way to combine group-level IndexWriters into a single parent-level IndexWriter, but this approach has multiple drawbacks:

1. Since writes may continue on group-level IndexWriters, periodic synchronisation with the parent-level IndexWriter is necessary.
2. During synchronisation, an external lock needs to be placed on the group-level IndexWriter directory, causing downtime.
3. Synchronisation also involves copying files from the group-level IndexWriter directory to the parent IndexWriter directory, which is resource-intensive, consuming disk IO and CPU cycles.

### Proposal

To address this issue, we propose introducing a mechanism that combines group-level IndexWriters as a soft reference to a parent IndexWriter. This will be achieved by creating a new variant of the addIndexes API within IndexWriter, which will only combine the SegmentInfos of group-level IndexWriters without requiring an external lock or copying files across directories. Group-level segments will be maintained in separate directories associated with their respective group-level IndexWriters. The client will periodically call this addIndexes API on the parent IndexWriter (for OpenSearch this corresponds to the index refresh interval of 1 sec), passing the SegmentInfos of the child-level IndexWriters as parameters, to sync the latest SegmentInfos with the parent IndexWriter. While combining the SegmentInfos of child-level IndexWriters, the addIndexes API will attach a prefix to the segment names to identify the group each segment belongs to, avoiding name conflicts between segments of different group-level IndexWriters.

The parent IndexWriter will be associated with a filter directory that distinguishes the tenant using the file name prefix, redirecting any read/write operation on a file to the correct group-level directory using the segment file name prefix (see the sketch after this comment).

**Reason for choosing the common view as an IndexWriter:** Most interactions of Lucene with the client (OpenSearch), such as opening a reader, getting the latest commit info, reopening a Lucene index, etc., occur via IndexWriter itself. Thus selecting IndexWriter as the common view made more sense.

### Improvements with multiple IndexWriters with a combined view

We were able to observe around 50%-60% improvement with the multiple-IndexWriters-with-a-combined-view approach, similar to what we observed by having different DWPTs for different tenants (initial proposal).

### Considerations

1. The referencing IndexWriter will be a combined read-only view over the group-level IndexWriters. Since this IndexWriter does not itself have any segments and only references the SegmentInfos of other IndexWriters, write operations like segment merges, flushes, etc. should not be performed on this parent IndexWriter instance.
2. We need to account for the prefix attached to segment names when [parsing segment names](https://github.com/RS146BIJAY/lucene/blob/84811e974f38181b0c1f1e1b5655f674a1584385/lucene/core/src/java/org/apache/lucene/index/IndexFileNames.java#L119).
3. It will be difficult to support update queries with the multi-IndexWriter approach. For example, if we are grouping logs by status code and the user updates the status code field of a log, for Lucene the insert and update operations need to be performed on separate delete queues.
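To make the prefix-routing idea concrete, here is a rough sketch of such a filter directory. It is illustrative only; the `$` separator, class name, and the decision to strip the prefix before delegating are assumptions, not the actual implementation:

```java
// Route reads of prefixed segment files to the owning group-level directory,
// e.g. "group1$_0.cfs" is read from the directory registered for "group1";
// unprefixed, parent-level files (segments_N, ...) stay in the parent directory.
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

class GroupRoutingDirectory extends FilterDirectory {
  private final Map<String, Directory> groupDirs;

  GroupRoutingDirectory(Directory parent, Map<String, Directory> groupDirs) {
    super(parent);
    this.groupDirs = groupDirs;
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    int sep = name.indexOf('$');  // assumed "<group>$<segmentFile>" naming
    if (sep > 0 && groupDirs.containsKey(name.substring(0, sep))) {
      // strip the group prefix and read the file from that group's own directory
      return groupDirs.get(name.substring(0, sep)).openInput(name.substring(sep + 1), context);
    }
    return in.openInput(name, context);
  }

  // A real implementation would route the other file-level operations
  // (fileLength, openChecksumInput, deleteFile, ...) the same way.
}
```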
Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]
ShashwatShivam commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2470937060

I conducted a benchmark using Cohere's 768-dimensional data. Here are the steps I followed for reproducibility:

1. **Set up** the [luceneutil repository](https://github.com/mikemccand/luceneutil/) following the installation instructions provided.
2. **Switch branches** to [this specific branch](https://github.com/mikemccand/luceneutil/compare/main...benwtrent:luceneutil:bbq) since the latest mainline branch is not compatible with the feature needed for this experiment.
3. **Change the branch** of `lucene_candidate` to [benwtrent:feature/adv-binarization-format](https://github.com/benwtrent/lucene/tree/feature/adv-binarization-format) to incorporate advanced binarization formats.
4. **Run** `knnPerfTest.py` after specifying the document and query file paths to the stored Cohere data files.

The runtime parameters were set as follows:

- `nDoc = 500,000`
- `topk = 10`
- `fanout = 100`
- `maxConn = 32`
- `beamWidth = 100`
- `oversample` values tested: `{1, 1.5, 2, 3, 4, 5}`

I used `quantizeBits = 1` for RaBitQ+HNSW and `quantizeBits = 32` for regular HNSW. A comparison was performed between HNSW and RaBitQ, and I observed the recall-latency tradeoff shown in the image attached to the original comment.
Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]
ChrisHegarty commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2470361869

Late to the party!!! I want to dig a little further into the distinction between the implementation of preload and prefetch, at least with the mmap implementation. The former touches every page (effectively forcing it into the cache), while the latter simply advises of the access pattern.

For sequential access, a combination of sequential advice and prefetch works really well, but not so well for random access. Random access, at least for the HNSW graph and the related vector data (be it quantised or not), really needs to be resident in memory. Preload gives us that, at least at the time of opening. Prefetch does not.

For optimized vector search use cases, assuming enough RAM, I was thinking that a separate `IndexInput::load` could be useful. This would be similar to preload (touches every page), but can be called after opening. It may or may not be super helpful; it mostly depends on how things get set up, etc., and whether or not preload is good enough for your use case.

For optimized vector search use cases, assuming enough RAM, I was also thinking that a separate `IndexInput::lock` could be useful. Basically a wrapper around `mlock`. The idea being that vector search perf sucks when there is not enough RAM to keep the graph and vector data resident in memory, so allow the user to load and lock it, then leave the tradeoff to the user to decide if they want this "fail fast" rather than "search slowly" behaviour (at least for pure vector search).
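To illustrate the preload-vs-advice distinction outside of Lucene: with the Java 22+ FFM API, a preload-style "load" amounts to touching every mapped page, whereas prefetch-style advice only hints at the access pattern. A small sketch follows; the proposed `IndexInput::load`/`IndexInput::lock` methods do not exist today, so this only shows the underlying OS-level behaviour:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PreloadDemo {
  public static void main(String[] args) throws Exception {
    try (Arena arena = Arena.ofConfined();
        FileChannel channel = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
      MemorySegment segment =
          channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
      segment.load();  // touch every page now (preload-like), instead of merely advising
      System.out.println("resident: " + segment.isLoaded());
    }
  }
}
```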
Re: [PR] Improve checksum calculations [lucene]
jfboeuf commented on PR #13989: URL: https://github.com/apache/lucene/pull/13989#issuecomment-2470657083

Thank you for your feedback. Perhaps I misunderstood your point, but the implementation I propose only calls `Checksum.update(byte[])`. The change resides in how the buffer is fed, to avoid byte-by-byte reading from the underlying `IndexInput` and byte-by-byte copying to the buffer. I created and pushed [into a different branch a JMH benchmark](https://github.com/apache/lucene/commit/91482728ea1633213bc064ed5362be041101d1d5) comparing the current implementation to the one I propose. The results show a noticeable improvement:

```
Benchmark                                                  (size)   Mode  Cnt     Score     Error   Units
BufferedChecksumIndexInputBenchmark.decodeLongArray            10  thrpt   15  7521.843 ± 167.756  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArray          1000  thrpt   15  1004.213 ±   3.587  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArray        100000  thrpt   15    10.993 ±   0.102  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArrayOrig        10  thrpt   15  3865.018 ±  38.850  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArrayOrig      1000  thrpt   15    46.381 ±   0.293  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArrayOrig    100000  thrpt   15     0.475 ±   0.022  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs          10  thrpt   15  8355.638 ±  52.745  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs        1000  thrpt   15   212.133 ±   4.296  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs      100000  thrpt   15     2.744 ±   0.021  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongsOrig      10  thrpt   15  3938.857 ±  42.751  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongsOrig    1000  thrpt   15    47.246 ±   0.444  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongsOrig  100000  thrpt   15     0.460 ±   0.020  ops/ms
```

For large arrays, there can be an improvement of up to 23 times when reading long arrays and 6 times when reading single long values. Transitioning from reading single long values to long arrays for live documents and Bloom filters (bitsets are commonly large in both scenarios) results in even greater performance enhancements.

The benchmark shows the single-long approach performs better on small arrays. This is likely due to the cost of wrapping the `byte[]` in a `ByteBuffer` and `LongBuffer`, which does not pay off when only a few bytes are copied. It can be improved by making `updateLongs(long[], int, int)` switch to a loop over `updateLong(long)` when the length to checksum fits in the buffer. What do you think?
Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]
viswanathk commented on PR #13990: URL: https://github.com/apache/lucene/pull/13990#issuecomment-2470717378

> Could you update the byte knn query & DiversifyingChildern* knn queries as well?

I made the changes for DiversifyingChildren*, but by byte knn do you mean `ByteVectorSimilarityQuery`?
Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]
benwtrent commented on code in PR #13990: URL: https://github.com/apache/lucene/pull/13990#discussion_r1838141983

## lucene/core/src/test/org/apache/lucene/search/TestKnnFloatVectorQuery.java:

```diff
@@ -29,13 +29,7 @@
 import org.apache.lucene.document.Field;
 import org.apache.lucene.document.KnnFloatVectorField;
 import org.apache.lucene.document.StringField;
-import org.apache.lucene.index.DirectoryReader;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.IndexWriter;
-import org.apache.lucene.index.IndexWriterConfig;
-import org.apache.lucene.index.LeafReaderContext;
-import org.apache.lucene.index.QueryTimeout;
-import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.*;
```

Review Comment:

Let's not collapse to `*`

## lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java:

```diff
@@ -112,7 +112,14 @@ VectorScorer createVectorScorer(LeafReaderContext context, FieldInfo fi) throws
 
   @Override
   public String toString(String field) {
-    return getClass().getSimpleName() + ":" + this.field + "[" + target[0] + ",...][" + k + "]";
+    StringBuilder buffer = new StringBuilder();
+    buffer.append(getClass().getSimpleName() + ":");
+    buffer.append(this.field + "[" + target[0] + ",...]");
+    buffer.append("[" + k + "]");
+    if (this.filter != null) {
+      buffer.append(",filter:[" + this.filter + "]");
+    }
```

Review Comment:

```suggestion
    if (this.filter != null) {
      buffer.append("[" + this.filter + "]");
    }
```

I think that would be better.
Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]
mikemccand commented on PR #13984: URL: https://github.com/apache/lucene/pull/13984#issuecomment-2470306515

> this might also check for graph connectivity, see #12627

+1 -- this is a tricky thing about these HNSW graphs (that they are not necessarily born connected but rather must be stitched up somehow afterwards). But maybe do this separately? I'd love to just get this initial coverage in first...
Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]
mikemccand commented on PR #13984: URL: https://github.com/apache/lucene/pull/13984#issuecomment-2470310487

Actually, `CheckIndex` does have some coverage for vectors and the KNN graph (it confirms it can enumerate all vectors, and also runs some searches on it if it's not just flat vectors (`checkByte/FloatVectorValues`)), so it's not completely blind :)
Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]
mikemccand commented on PR #13987: URL: https://github.com/apache/lucene/pull/13987#issuecomment-2470331708

Wow, good catch @msfroh. Could we maybe add a new test case that explicitly confirms that the wrapped `Scorable`'s `score` method is indeed only called once even if the outer user calls `.score()` multiple times on the same `docID`? This seems to be the primary purpose of this class, and it was failing this, for a very long time!
Re: [PR] Introduces IndexInput#updateReadAdvice to change the readadvice while [lucene]
ChrisHegarty commented on PR #13985: URL: https://github.com/apache/lucene/pull/13985#issuecomment-2470311965

@shatejas I'm curious how much this actually helps, and I know that you said that benchmark results would be posted. I do like that we can update the ReadAdvice on an index input 👍 What is unclear is how much the sequential advice here helps over something like a `load` (similar to `MemorySegment::load`), that touches every page.
[PR] Addresses the issue #13983 by adding filter to the toString() method of KnnFloatVectorQuery [lucene]
viswanathk opened a new pull request, #13990: URL: https://github.com/apache/lucene/pull/13990

(no comment)