Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1384518777

## lucene/core/src/java/org/apache/lucene/util/fst/FSTWriter.java:

```
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+
+/** Abstract class which provides low-level functionality to write to a FST */
+public interface FSTWriter {
+
+  /**
```

Review Comment: These two methods are the same as in DataOutput, but DataOutput is a class, which we want to avoid extending from.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Take advantage of bloom filter when delete terms [lucene]
gf2121 commented on issue #12725: URL: https://github.com/apache/lucene/issues/12725#issuecomment-1798007152

> do we have any numbers if it actually helps applying deletes?

I made a naive benchmark comparing `seekCeil` and `seekExact` when deleting terms on a field with a bloom filter; results show the `seekExact` style is ~9% faster on average. (All terms are 16-byte UUIDs.) https://bytedance.feishu.cn/sheets/Jwjzs3FdXhFbVwtfADJc6L8rnCb

round | seekCeil(ms) | candidate(ms) | diff
-- | -- | -- | --
1 | 46211 | 42783 | -7%
2 | 44641 | 40698 | -9%
3 | 44462 | 43471 | -2%
4 | 47059 | 43842 | -7%
5 | 45049 | 43609 | -3%
6 | 49529 | 40953 | -17%
7 | 44707 | 41498 | -7%
8 | 45732 | 40785 | -11%
9 | 45396 | 41321 | -9%
10 | 48759 | 40760 | -16%
avg | 46154.5 | 41972 | -9%

> wild idea but would it make sense to build an automaton off these terms and intersect it? We could reuse it for multiple segments? I am not sure how big the costs are for that but it would potentially in a codec agnostic way?

This sounds great. I think intersecting with an automaton cannot take advantage of the bloom filter, though. So would it be a competitive approach compared to `seekExact` or `seekCeil`? I'd like to give it a try.
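[Editor's note] For readers unfamiliar with why a `seekExact`-style delete can exploit the bloom filter: a bloom filter answers "definitely absent" with a couple of bit probes, so an exact-match lookup can skip the terms dictionary entirely for terms that were never indexed, while `seekCeil` always has to walk the term index. A minimal toy sketch of that membership test (plain Java, not Lucene's actual bloom postings format; the hash functions are made up for illustration):

```java
import java.util.Arrays;
import java.util.BitSet;

// Toy Bloom filter: membership is a couple of bit probes, so a
// delete-by-term path can reject absent terms without touching the
// terms dictionary. (Illustration only, not Lucene's implementation.)
public class ToyBloomFilter {
  private final BitSet bits;
  private final int numBits;

  public ToyBloomFilter(int numBits) {
    this.numBits = numBits;
    this.bits = new BitSet(numBits);
  }

  private int h1(byte[] term) {
    return Math.floorMod(Arrays.hashCode(term), numBits);
  }

  private int h2(byte[] term) {
    return Math.floorMod(Arrays.hashCode(term) * 31 + 7, numBits);
  }

  public void add(byte[] term) {
    bits.set(h1(term));
    bits.set(h2(term));
  }

  // false => the term is definitely not indexed; true => it may be present
  public boolean mayContain(byte[] term) {
    return bits.get(h1(term)) && bits.get(h2(term));
  }
}
```

A `mayContain` miss is the cheap early exit; only on a hit does the delete path need to fall through to the real terms dictionary lookup.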
Re: [PR] Early terminate visit BKD leaf when current value greater than upper point in sorted dim. [lucene]
vsop-479 commented on PR #12528: URL: https://github.com/apache/lucene/pull/12528#issuecomment-1798153626 @iverase I implemented `visitWithSortedDim` in `LatLonPointDistanceQuery`. Please take a look when you get a chance! BTW, the contents of `getIntersectVisitor` and `getInverseIntersectVisitor` in `LatLonPointDistanceQuery` and `PointRangeQuery` are identical. Should we move them to a `DefaultIntersectVisitor` and make `match`/`relate` abstract in it? I mean, in a new PR, without this change.
Re: [PR] Remove BytesReader#reversed() [lucene]
mikemccand merged PR #12777: URL: https://github.com/apache/lucene/pull/12777
Re: [I] Remove `FST.BytesReader#reversed` method? [lucene]
mikemccand closed issue #12759: Remove `FST.BytesReader#reversed` method? URL: https://github.com/apache/lucene/issues/12759
Re: [I] Should we ban Random#nextInt(int, int)? [lucene]
mikemccand closed issue #12771: Should we ban Random#nextInt(int, int)? URL: https://github.com/apache/lucene/issues/12771
Re: [I] Should we ban Random#nextInt(int, int)? [lucene]
mikemccand commented on issue #12771: URL: https://github.com/apache/lucene/issues/12771#issuecomment-1798212632 Indeed, the javadocs have the same terrifying explanation -- thanks @dungba88!: `If bound is a power of two then limiting is a simple masking operation. Otherwise, the result is re-calculated by invoking nextInt() until the result is greater than or equal zero and less than bound.` Fortunately, thanks to @rmuir digging into OpenJDK sources, it looks like this is a false alarm: these methods first pre-mask to the "next largest power of 2" fully containing the requested range, and iteratively pull random numbers in that range, until finding one within the requested (not necessarily power of 2) range. So the number of iterations of that loop is "typically" very low (1 or 2). I cancelled the PR (#12772) and forgot to close this issue -- closing now.
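[Editor's note] The pre-mask-and-reject technique described above can be sketched in a few lines. This is an illustration of the technique, not the actual OpenJDK code:

```java
import java.util.Random;

// Bounded nextInt via pre-mask + rejection: mask the raw random bits
// down to the smallest power-of-two range covering [0, bound), then
// retry until the value lands inside the requested range. The loop
// runs about 1-2 iterations in expectation.
public class BoundedNextInt {
  static int nextIntBounded(Random rnd, int bound) {
    if (bound <= 0) {
      throw new IllegalArgumentException("bound must be positive");
    }
    if (bound == 1) {
      return 0;
    }
    // all-ones mask covering bound-1, i.e. nextPowerOfTwo(bound) - 1
    int mask = -1 >>> Integer.numberOfLeadingZeros(bound - 1);
    int r;
    do {
      r = rnd.nextInt() & mask; // uniform on [0, mask]
    } while (r >= bound);       // reject values outside [0, bound)
    return r;
  }

  // the two-argument Random#nextInt(origin, bound) reduces to the above
  static int nextIntBounded(Random rnd, int origin, int bound) {
    return origin + nextIntBounded(rnd, bound - origin);
  }
}
```

When `bound` is a power of two, the mask alone suffices and the loop never repeats, which is exactly the "simple masking operation" fast path the javadoc mentions.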
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
mikemccand commented on code in PR #12778: URL: https://github.com/apache/lucene/pull/12778#discussion_r1384692182 ## lucene/core/src/test/org/apache/lucene/util/TestByteBlockPool.java: ## @@ -25,7 +24,34 @@ public class TestByteBlockPool extends LuceneTestCase { - public void testReadAndWrite() throws IOException { + public void testAppendFromOtherPool() { +ByteBlockPool pool = new ByteBlockPool(new ByteBlockPool.DirectAllocator()); +final int numBytes = atLeast(2 << 16); +byte[] bytes = new byte[numBytes]; +for (int i = 0; i < numBytes; i++) { + bytes[i] = (byte) random().nextInt(256); Review Comment: > It seems we are also banning the `random().nextInt()` call in #12771 Sorry, no need to avoid this API -- it is unbanned now -- it was a false alarm! Sorry for the scare :)
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
mikemccand commented on code in PR #12778: URL: https://github.com/apache/lucene/pull/12778#discussion_r1384697358 ## lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java: ## @@ -234,6 +234,44 @@ public void append(final BytesRef bytes) { append(bytes.bytes, bytes.offset, bytes.length); } + /** + * Append the bytes from a source {@link ByteBlockPool} at a given offset and length + * + * @param srcPool the source pool to copy from + * @param offset the source pool offset + * @param length the number of bytes to copy + */ + public void append(ByteBlockPool srcPool, long offset, int length) { +int bytesLeft = length; +while (bytesLeft > 0) { + int bufferLeft = BYTE_BLOCK_SIZE - byteUpto; + if (bytesLeft < bufferLeft) { // fits within current buffer +copyBytes(srcPool, offset, bytesLeft); +break; + } else { // fill up this buffer and move to next one +if (bufferLeft > 0) { + copyBytes(srcPool, offset, bufferLeft); + bytesLeft -= bufferLeft; + offset += bufferLeft; +} +nextBuffer(); + } +} + } + + // copy from source pool until no bytes left. length must be fit within the current head buffer + private void copyBytes(ByteBlockPool srcPool, long offset, int length) { +while (length > 0) { Review Comment: Maybe add comment that we need `while` loop here in case the byte slice to be copied spans multiple `byte[]` buffers in the `srcPool`? 
## lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java (same hunk as above, on the `while (length > 0)` loop in `copyBytes`): Review Comment: Can we add an `assert` confirming `length` does indeed fit within the head buffer?
## lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java (same hunk as above, on `private void copyBytes(ByteBlockPool srcPool, long offset, int length)`): Review Comment: Can we rename to make it clear the length must fit within a single buffer? Also, it's not a generic `copyBytes` but also an `append` operation? Maybe `appendBytesSingleBuffer` or something? ## lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java (same hunk, on `public void append(ByteBlockPool srcPool, long offset, int length)`): Review Comment: Rename to `srcOffset`?
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
dungba88 commented on code in PR #12778: URL: https://github.com/apache/lucene/pull/12778#discussion_r1384749437 ## lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java (same `copyBytes` hunk as above): Review Comment: Thank you, I have incorporated the comment into the new rev
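[Editor's note] The shape of the copy under review, with an outer loop splitting on the destination's head buffer and an inner copy splitting on the source's buffers, can be illustrated with a tiny standalone pool. `MiniBlockPool` is a hypothetical name, not the Lucene class; the block size is shrunk to 8 so the boundary cases actually trigger:

```java
import java.util.ArrayList;
import java.util.List;

// Tiny fixed-size-block pool illustrating the append-across-pools logic:
// every copy must be split at both the destination's current buffer
// boundary and the source's buffer boundaries.
public class MiniBlockPool {
  static final int BLOCK_SIZE = 8; // Lucene's BYTE_BLOCK_SIZE is 32768

  final List<byte[]> blocks = new ArrayList<>();
  int byteUpto = BLOCK_SIZE; // forces a buffer allocation on first write

  // append from a plain array, filling block after block
  void append(byte[] src, int off, int len) {
    while (len > 0) {
      if (byteUpto == BLOCK_SIZE) { // current buffer full -> next buffer
        blocks.add(new byte[BLOCK_SIZE]);
        byteUpto = 0;
      }
      int chunk = Math.min(len, BLOCK_SIZE - byteUpto);
      System.arraycopy(src, off, blocks.get(blocks.size() - 1), byteUpto, chunk);
      byteUpto += chunk;
      off += chunk;
      len -= chunk;
    }
  }

  // append a span from another pool: walk the source blocks, copying at
  // most one source block per iteration (the invariant the review
  // comments suggest asserting on)
  void append(MiniBlockPool srcPool, long srcOffset, int length) {
    while (length > 0) {
      int srcBlock = (int) (srcOffset / BLOCK_SIZE);
      int srcPos = (int) (srcOffset % BLOCK_SIZE);
      int chunk = Math.min(length, BLOCK_SIZE - srcPos); // fits one src block
      append(srcPool.blocks.get(srcBlock), srcPos, chunk);
      srcOffset += chunk;
      length -= chunk;
    }
  }

  byte get(long offset) {
    return blocks.get((int) (offset / BLOCK_SIZE))[(int) (offset % BLOCK_SIZE)];
  }
}
```

The inner `while` loop the reviewers discuss exists precisely because the source span can straddle several `byte[]` buffers even when the destination's head buffer has room for the whole chunk.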
Re: [PR] Enable executing using NFA in RegexpQuery [lucene]
rmuir commented on code in PR #12767: URL: https://github.com/apache/lucene/pull/12767#discussion_r1384795478 ## lucene/core/src/test/org/apache/lucene/search/TestRegexpQuery.java: ## @@ -80,7 +80,10 @@ private long caseInsensitiveRegexQueryNrHits(String regex) throws IOException { newTerm(regex), RegExp.ALL, RegExp.ASCII_CASE_INSENSITIVE, -Operations.DEFAULT_DETERMINIZE_WORK_LIMIT); +RegexpQuery.DEFAULT_PROVIDER, +Operations.DEFAULT_DETERMINIZE_WORK_LIMIT, +MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, +random().nextBoolean()); Review Comment: yeah unfortunately this test is incredibly weak...
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1798357836 @kevindrosendahl this looks really interesting! Thank you for digging in and starting the experimentation! I haven't had a chance to read your branch yet, but hope to soon.
Re: [PR] Stop exploring HNSW graph if scores are not getting better. [lucene]
benwtrent merged PR #12770: URL: https://github.com/apache/lucene/pull/12770
Re: [PR] Speed up BytesRefHash#sort [lucene]
gf2121 commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1798365624 For realistic data, I tried to index `wikimedium10m` with ramBuffer = 1024m, accumulating the time taken by all invocations of `BytesRefHash#sort`. Results show the total time decreased from 27161ms -> 12203ms, which is much better than I expected.
Re: [PR] Speed up BytesRefHash#sort [lucene]
dweiss commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1798550955 Wow. Nice improvement!
Re: [PR] Speed up BytesRefHash#sort [lucene]
mikemccand commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1798603970 This is an incredible speedup :)
[PR] Fix nested link warning from renderSiteJavadoc [lucene]
slow-J opened a new pull request, #12779: URL: https://github.com/apache/lucene/pull/12779 Very minor change. ### Description Without this change, we are getting this warning in `> Task :lucene:core:renderSiteJavadoc`:

```
/local/home/jslowins/upstream_bench/lucene_bench_home_2/lucene_candidate/lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java:32: warning: Tag {@link}: nested link
 * @see IOUtils to obtain {@link Reader} instances
                          ^
```
Re: [PR] Fix nested link warning from renderSiteJavadoc [lucene]
mikemccand commented on code in PR #12779: URL: https://github.com/apache/lucene/pull/12779#discussion_r1385198200 ## lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java: ## @@ -29,7 +29,7 @@ /** * Loader for text files that represent a list of stopwords. * - * @see IOUtils to obtain {@link Reader} instances + * @see IOUtils to obtain Reader instances Review Comment: Ahh this is a problem because the `to obtain Reader instances` is already a link pointing to `IOUtils`? And so we cannot stick another link for `Reader` inside there. Makes sense. I'll merge shortly.
Re: [PR] Fix nested link warning from renderSiteJavadoc [lucene]
slow-J commented on code in PR #12779: URL: https://github.com/apache/lucene/pull/12779#discussion_r1385202002 ## lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java: ## @@ -29,7 +29,7 @@ /** * Loader for text files that represent a list of stopwords. * - * @see IOUtils to obtain {@link Reader} instances + * @see IOUtils to obtain Reader instances Review Comment: Yes exactly, see the `see also` section of https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/analysis/WordlistLoader.html
Re: [PR] Fix nested link warning from renderSiteJavadoc [lucene]
mikemccand merged PR #12779: URL: https://github.com/apache/lucene/pull/12779
Re: [I] Explore partially decoding blocks (within-block skipping) [lucene]
Tony-X commented on issue #12749: URL: https://github.com/apache/lucene/issues/12749#issuecomment-1799460960 > How would it work? Since blocks are delta-coded, you can't know the value at a given index without decoding all previous values and computing their sum? Or you need to store some checkpoints separately, but then it may be easier/better to simply go with smaller blocks (e.g. 64 doc IDs instead of 128)? +1. Delta-encoding here is the blocker, unless we change the encoding scheme.
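[Editor's note] To make the blocker concrete: with delta coding, the i-th doc ID is a prefix sum over the stored gaps, so there is no random access inside a block without extra checkpoints. A small sketch (plain Java, not the Lucene codec):

```java
// Delta-coded block: values are stored as gaps to the previous value, so
// recovering the doc ID at some index forces decoding (summing) every
// earlier delta -- which is why within-block skipping is hard.
public class DeltaBlock {
  static int[] encodeDeltas(int[] docIds) {
    int[] deltas = new int[docIds.length];
    int prev = 0;
    for (int i = 0; i < docIds.length; i++) {
      deltas[i] = docIds[i] - prev; // gap to the previous doc ID
      prev = docIds[i];
    }
    return deltas;
  }

  // "skipping" to index i still costs O(i): a prefix sum over the deltas
  static int docIdAt(int[] deltas, int index) {
    int sum = 0;
    for (int i = 0; i <= index; i++) {
      sum += deltas[i];
    }
    return sum;
  }
}
```

Checkpoints (absolute values stored every k entries) or smaller blocks would cap the scan at k values, at the cost of extra storage, which is the trade-off the thread is weighing.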
[PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]
kevindrosendahl opened a new pull request, #12780: URL: https://github.com/apache/lucene/pull/12780 ### Description When using cosine similarity, the `ScalarQuantizer` normalizes vectors when calculating quantiles and `ScalarQuantizedRandomVectorScorer` normalizes query vectors before scoring them, but `Lucene99ScalarQuantizedVectorsWriter` does not normalize the vectors prior to quantizing them when producing the quantized vectors to write to disk. This PR normalizes vectors prior to quantizing them when writing them to disk. Recall results on my M1 with the `glove-100-angular` data set (all using `maxConn`: 16, `beamWidth`: 100, `numCandidates`: 100, `k`: 10, single segment):

| Configuration | Recall | Average Query Duration |
|---|---|---|
| Pre-patch no quantization | 0.78762 | 0.68 ms |
| Pre-patch with quantization | 8.717E-5 | 0.45 ms |
| Post-patch no quantization | 0.78762 | 0.70 ms |
| Post-patch with quantization | 0.66742 | 0.66 ms |
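[Editor's note] The core of the fix is putting write-time vectors in the same space the query path already uses: unit-normalize before quantization when the similarity is cosine. A minimal standalone sketch of the normalization step (plain Java, not the Lucene writer's code):

```java
// L2-normalize a vector; after this, cosine similarity reduces to a dot
// product, and quantiles computed over normalized indexed vectors match
// the normalized query vectors seen at search time.
public class CosineNormalize {
  static float[] l2normalize(float[] v) {
    double normSq = 0;
    for (float x : v) {
      normSq += (double) x * x;
    }
    double norm = Math.sqrt(normSq);
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / norm);
    }
    return out;
  }
}
```

Skipping this step at write time means the quantization buckets are fit to unnormalized magnitudes while queries are scored against normalized ones, which is consistent with the near-zero pre-patch recall in the table above.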
Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]
kevindrosendahl commented on code in PR #12780: URL: https://github.com/apache/lucene/pull/12780#discussion_r1385606896 ## lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99HnswQuantizedVectorsFormat.java: ## @@ -38,8 +38,10 @@ import org.apache.lucene.store.Directory; import org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase; import org.apache.lucene.util.ScalarQuantizer; +import org.apache.lucene.util.VectorUtil; public class TestLucene99HnswQuantizedVectorsFormat extends BaseKnnVectorsFormatTestCase { Review Comment: This test was always using normalized vectors as input. I changed it to use vectors that aren't guaranteed to be normalized, but kept all existing tests using normalized vectors so as not to change other tests incidentally.
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1800243525 > I haven't had a chance to read your branch yet, but hope to soon. Great, thanks! To save you a bit of time, the tl;dr of going from HNSW to vamana is that it's actually fairly simple in the end. For the most part, I just removed all references to levels and changed the [diversity check](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/util/vamana/VamanaGraphBuilder.java#L376-L404) to incorporate vamana's `alpha` parameter (essentially just adding a second, outer loop). There were a few other small implementation detail changes as well, like [pruning down all the way to `M` when passing the overflow threshold instead of just removing one neighbor](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/util/vamana/VamanaGraphBuilder.java#L360-L369). Then on the storage side, you just [write vectors into the graph instead of into the `.vec` file](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/codecs/vectorsandbox/VectorSandboxVamanaVectorsWriter.java#L703-L734), and [read from the right offsets to get the vector values as appropriate](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/codecs/vectorsandbox/VectorSandboxVamanaVectorsReader.java#L598-L684).
Re: [I] HnwsGraph creates disconnected components [lucene]
nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1800972146 Hi @msokolov ! Thanks for clarifying. But I think it can help to remove the 'less important' edge from both sides, since it frees up a degree of the "other" node to accept a new connection without going through the diversity check. Most of the time, the problem I saw with the diversity check is that the most recently added connection to the node itself turns out to be non-diverse and gets removed immediately. Because of this, the new incoming node does not even get enough connections to begin with in the graph. Even in the case where I removed the diversity check while connecting an incoming node (a) to its neighbours (say b, c) and allowed the full max-conn number of connections, most of the connections from the neighbours (b->a) are removed because of the diversity check, leaving mostly half connections from (a->b). Now when a new node, say `x`, tries to connect to `a`, even that is kicked off by `a` because of the diversity check. So keeping half connections sometimes acts against the connectedness nature of HNSW. I have come up with a new heuristic which seems to work well against disconnectedness, but I am yet to optimize it and conduct performance tests on it. I will update the algorithm here and post numbers later.
Re: [PR] Speed up BytesRefHash#sort [lucene]
gf2121 commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801128536 Something odd: I accidentally ran the benchmark on another mac with an `intel chip` and the result is disappointing (no improvements):

```
use stable sort: false, sort 5169965 terms, took: 5322ms
use stable sort: false, sort 5169965 terms, took: 5605ms
use stable sort: false, sort 5169965 terms, took: 5457ms
use stable sort: false, sort 5169965 terms, took: 5392ms
use stable sort: false, sort 5169965 terms, took: 5379ms
use stable sort: false, sort 5169965 terms, took: 5442ms
use stable sort: false, sort 5169965 terms, took: 5262ms
use stable sort: false, sort 5169965 terms, took: 5100ms
use stable sort: false, sort 5169965 terms, took: 5057ms
use stable sort: false, sort 5169965 terms, took: 5160ms
use stable sort: false, sort 5169965 terms, took: 5207ms
use stable sort: false, sort 3130385 terms, took: 2868ms
use stable sort: true, sort 5169965 terms, took: 5083ms
use stable sort: true, sort 5169965 terms, took: 5332ms
use stable sort: true, sort 5169965 terms, took: 5322ms
use stable sort: true, sort 5169965 terms, took: 5333ms
use stable sort: true, sort 5169965 terms, took: 5295ms
use stable sort: true, sort 5169965 terms, took: 5306ms
use stable sort: true, sort 5169965 terms, took: 5317ms
use stable sort: true, sort 5169965 terms, took: 5336ms
use stable sort: true, sort 5169965 terms, took: 5284ms
use stable sort: true, sort 5169965 terms, took: 5344ms
use stable sort: true, sort 5169965 terms, took: 5289ms
use stable sort: true, sort 3130385 terms, took: 2837ms
```

But I can easily reproduce the speedup on the `M2 chip` mac that produced all benchmark results above:

```
use stable sort: false, sort 5169965 terms, took: 2340ms
use stable sort: false, sort 5169965 terms, took: 2197ms
use stable sort: false, sort 5169965 terms, took: 2288ms
use stable sort: false, sort 5169965 terms, took: 2320ms
use stable sort: false, sort 5169965 terms, took: 2227ms
use stable sort: false, sort 5169965 terms, took: 2238ms
use stable sort: false, sort 5169965 terms, took: 2256ms
use stable sort: false, sort 5169965 terms, took: 2287ms
use stable sort: false, sort 5169965 terms, took: 2244ms
use stable sort: false, sort 5169965 terms, took: 2240ms
use stable sort: false, sort 5169965 terms, took: 2320ms
use stable sort: false, sort 3130385 terms, took: 1082ms
use stable sort: true, sort 5169965 terms, took: 1418ms
use stable sort: true, sort 5169965 terms, took: 1303ms
use stable sort: true, sort 5169965 terms, took: 1290ms
use stable sort: true, sort 5169965 terms, took: 1349ms
use stable sort: true, sort 5169965 terms, took: 1293ms
use stable sort: true, sort 5169965 terms, took: 1329ms
use stable sort: true, sort 5169965 terms, took: 1332ms
use stable sort: true, sort 5169965 terms, took: 1307ms
use stable sort: true, sort 5169965 terms, took: 1280ms
use stable sort: true, sort 5169965 terms, took: 1336ms
use stable sort: true, sort 5169965 terms, took: 1302ms
use stable sort: true, sort 3130385 terms, took: 623ms
```
Re: [PR] Speed up BytesRefHash#sort [lucene]
gf2121 commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801136851

Something odd: I accidentally ran the benchmark on another Mac with an `intel chip` and the result is disappointing (no obvious improvements):

```
use stable sort: false, sort 5169965 terms, took: 5377ms
use stable sort: false, sort 5169965 terms, took: 5293ms
use stable sort: false, sort 5169965 terms, took: 5347ms
use stable sort: false, sort 5169965 terms, took: 5285ms
use stable sort: false, sort 5169965 terms, took: 5307ms
use stable sort: false, sort 5169965 terms, took: 5373ms
use stable sort: false, sort 5169965 terms, took: 5416ms
use stable sort: false, sort 5169965 terms, took: 5321ms
use stable sort: false, sort 5169965 terms, took: 5343ms
use stable sort: false, sort 5169965 terms, took: 5278ms
use stable sort: false, sort 5169965 terms, took: 5323ms
use stable sort: true, sort 5169965 terms, took: 5261ms
use stable sort: true, sort 5169965 terms, took: 5329ms
use stable sort: true, sort 5169965 terms, took: 5304ms
use stable sort: true, sort 5169965 terms, took: 5268ms
use stable sort: true, sort 5169965 terms, took: 5331ms
use stable sort: true, sort 5169965 terms, took: 5306ms
use stable sort: true, sort 5169965 terms, took: 5232ms
use stable sort: true, sort 5169965 terms, took: 5245ms
use stable sort: true, sort 5169965 terms, took: 5248ms
use stable sort: true, sort 5169965 terms, took: 5218ms
use stable sort: true, sort 5169965 terms, took: 5193ms
```

But I can easily reproduce the speedup on the `M2 chip` Mac that produced all the benchmark results above:

```
use stable sort: false, sort 5169965 terms, took: 2340ms
use stable sort: false, sort 5169965 terms, took: 2197ms
use stable sort: false, sort 5169965 terms, took: 2288ms
use stable sort: false, sort 5169965 terms, took: 2320ms
use stable sort: false, sort 5169965 terms, took: 2227ms
use stable sort: false, sort 5169965 terms, took: 2238ms
use stable sort: false, sort 5169965 terms, took: 2256ms
use stable sort: false, sort 5169965 terms, took: 2287ms
use stable sort: false, sort 5169965 terms, took: 2244ms
use stable sort: false, sort 5169965 terms, took: 2240ms
use stable sort: false, sort 5169965 terms, took: 2320ms
use stable sort: true, sort 5169965 terms, took: 1418ms
use stable sort: true, sort 5169965 terms, took: 1303ms
use stable sort: true, sort 5169965 terms, took: 1290ms
use stable sort: true, sort 5169965 terms, took: 1349ms
use stable sort: true, sort 5169965 terms, took: 1293ms
use stable sort: true, sort 5169965 terms, took: 1329ms
use stable sort: true, sort 5169965 terms, took: 1332ms
use stable sort: true, sort 5169965 terms, took: 1307ms
use stable sort: true, sort 5169965 terms, took: 1280ms
use stable sort: true, sort 5169965 terms, took: 1336ms
use stable sort: true, sort 5169965 terms, took: 1302ms
```
Re: [PR] Speed up BytesRefHash#sort [lucene]
gf2121 commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801199102

I checked Linux x86 as well: no obvious speedup there either. Maybe the stable style somehow helps the ARM CPU cache.

```
use stable sort: false, sort 5169965 terms, took: 4900ms
use stable sort: false, sort 5169965 terms, took: 5373ms
use stable sort: false, sort 5169965 terms, took: 4998ms
use stable sort: false, sort 5169965 terms, took: 5345ms
use stable sort: false, sort 5169965 terms, took: 5051ms
use stable sort: false, sort 5169965 terms, took: 4906ms
use stable sort: false, sort 5169965 terms, took: 4869ms
use stable sort: false, sort 5169965 terms, took: 4840ms
use stable sort: false, sort 5169965 terms, took: 4816ms
use stable sort: false, sort 5169965 terms, took: 4760ms
use stable sort: false, sort 5169965 terms, took: 4903ms
use stable sort: true, sort 5169965 terms, took: 4883ms
use stable sort: true, sort 5169965 terms, took: 4852ms
use stable sort: true, sort 5169965 terms, took: 4920ms
use stable sort: true, sort 5169965 terms, took: 4863ms
use stable sort: true, sort 5169965 terms, took: 4836ms
use stable sort: true, sort 5169965 terms, took: 4917ms
use stable sort: true, sort 5169965 terms, took: 4836ms
use stable sort: true, sort 5169965 terms, took: 4900ms
use stable sort: true, sort 5169965 terms, took: 4888ms
use stable sort: true, sort 5169965 terms, took: 4921ms
use stable sort: true, sort 5169965 terms, took: 4835ms
```
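For reference, the per-round timing loop behind output lines like the ones above can be reproduced with a small stand-alone harness along these lines. This is a hypothetical sketch, not the benchmark code used in this PR: it times `java.util.Arrays.sort` (TimSort, a stable merge sort) over random 16-byte UUID-sized terms, whereas the PR swaps sorter implementations inside `BytesRefHash` itself. The class name `SortBench` and the term count are illustrative choices.

```java
import java.util.Arrays;
import java.util.Random;

public class SortBench {

  // Generate n random 16-byte terms (UUID-sized, matching the benchmark setup).
  static byte[][] randomTerms(int n, long seed) {
    Random r = new Random(seed);
    byte[][] terms = new byte[n][];
    for (int i = 0; i < n; i++) {
      terms[i] = new byte[16];
      r.nextBytes(terms[i]);
    }
    return terms;
  }

  public static void main(String[] args) {
    int n = 200_000; // far fewer terms than the 5,169,965 above, to keep the run short
    for (int round = 0; round < 3; round++) {
      byte[][] terms = randomTerms(n, 42);
      long t0 = System.nanoTime();
      // Stable sort in unsigned lexicographic byte order.
      Arrays.sort(terms, Arrays::compareUnsigned);
      long tookMs = (System.nanoTime() - t0) / 1_000_000;
      System.out.println("use stable sort: true, sort " + n + " terms, took: " + tookMs + "ms");
    }
    // Sanity check: the sorted terms really are in unsigned lexicographic order.
    byte[][] terms = randomTerms(n, 42);
    Arrays.sort(terms, Arrays::compareUnsigned);
    boolean sorted = true;
    for (int i = 1; i < n; i++) {
      if (Arrays.compareUnsigned(terms[i - 1], terms[i]) > 0) {
        sorted = false;
        break;
      }
    }
    System.out.println("sorted: " + sorted);
  }
}
```

A few untimed warmup rounds before measuring (as the repeated rounds above effectively provide) matter here, since JIT compilation dominates the first iterations.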
Re: [PR] Speed up BytesRefHash#sort [lucene]
dweiss commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801236300

Maybe Apple chips are tuned for sorting (they need to sort out what to do with all those income bills, after all)? :) And seriously - thank you for checking on different hardware. There may be no single solution that fits them all, as the PR with vectorization benchmarks clearly shows...
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
zacharymorn commented on PR #240: URL: https://github.com/apache/lucene/pull/240#issuecomment-1801269361

Hi @mikemccand @jpountz @javanna @gsmiller, I have updated this PR to pick up the latest from `main`, and reverted some changes to save them for follow-up PRs that address other collectors. This PR is now focused on the following:

* Refactor out TopScoreDocCollectorManager and TopFieldCollectorManager
* Refactor some tests to use the above collector managers

Could you please take a look and let me know if you have any feedback?