Re: [I] `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory [lucene]
dungba88 commented on issue #12543: URL: https://github.com/apache/lucene/issues/12543#issuecomment-1746403380 Thanks @mikemccand ! Let's continue the discuss in this issue instead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add readBytes method to RandomAccessInput [lucene]
iverase commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1345475483 ## lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int len) throws IOException } } + private void readBytesBoundary(long pos, byte[] b, int offset, int len) throws IOException { Review Comment: I have moved the method behind the random access readBytes method -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add readBytes method to RandomAccessInput [lucene]
uschindler commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1345489213 ## lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int len) throws IOException } } + private void readBytesBoundary(long pos, byte[] b, int offset, int len) throws IOException { Review Comment: You removed the absolute/randomaccess readBytesBoundary method, correct? I think for the random access case it is fine. The sequential read case is more optimized and it uses "catch exception program flow", which is there for performance. The separate method is also there for Hotspot to not fall into method-bytecode-size limitations while inlining (because readBytesBoundary is called exceptionally and should not increate methodSize of readBytes). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add readBytes method to RandomAccessInput [lucene]
iverase commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1345506266 ## lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int len) throws IOException } } + private void readBytesBoundary(long pos, byte[] b, int offset, int len) throws IOException { Review Comment: > You removed the absolute/randomaccess readBytesBoundary method, correct? Yes, I left the sequential case unchanged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add readBytes method to RandomAccessInput [lucene]
iverase commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1345506266 ## lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int len) throws IOException } } + private void readBytesBoundary(long pos, byte[] b, int offset, int len) throws IOException { Review Comment: > You removed the absolute/randomaccess readBytesBoundary method, correct? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Make `byte[]` vector comparisons faster! (if possible) [lucene]
benwtrent opened a new issue, #12621: URL: https://github.com/apache/lucene/issues/12621 ### Description While testing and digging around, I noticed that our float comparisons are way faster than byte on my Macbook (M1) and pretty much the same as our byte comparisons on a GCP Intel Sapphire Rapids CPU. This seems counter-intuitive to me. I would expect Panama to be able to do more `byte` operations per cycle than `float`. My guess is the intrinsics are weird? Panama Vector just doesn't support or detect the required operations? Here are two benchmark results using @rmuir's helpful vectorbench project: MacBook (Apple Silicon [128bits], JDK21): ``` FloatDotProductBenchmark.dotProductNew 768 thrpt5 21.781 ± 0.254 ops/us FloatDotProductBenchmark.dotProductNew1024 thrpt5 15.091 ± 0.217 ops/us BinaryDotProductBenchmark.dotProductNew 768 thrpt5 8.041 ± 0.108 ops/us BinaryDotProductBenchmark.dotProductNew1024 thrpt5 6.085 ± 0.133 ops/us ``` GCP (Intel Sapphire Rapids [avx512], JDK21): ``` FloatDotProductBenchmark.dotProductNew 768 thrpt5 20.169 ± 0.385 ops/us FloatDotProductBenchmark.dotProductNew1024 thrpt5 18.334 ± 0.180 ops/us BinaryDotProductBenchmark.dotProductNew 768 thrpt5 19.686 ± 0.050 ops/us BinaryDotProductBenchmark.dotProductNew1024 thrpt5 14.934 ± 0.014 ops/us ``` cpu-flags ``` Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflus h mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good n opl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pc id sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf _lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsba se tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xs avec xgetbv1 xsaves avx512_bf16 arat avx512vbmi umip avx512_vbmi2 gfni vaes vp clmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri mov dir64b fsrm md_clear serialize arch_capabilities ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Write VLong in opposite order for better outputs sharing in the FST [lucene]
mikemccand commented on issue #12620: URL: https://github.com/apache/lucene/issues/12620#issuecomment-1746831073 This might be needle moving on the size of the FSTs created by block tree for the terms index, since it encodes long as `vLong` in its output. We should only try this "reverse vLong" in that context (private to FST encoding) ... e.g. it would not likely help for FST `Outputs` since the encoded output is a true `long` value not a `byte[]` leading with an encoded `vLong` and most of the written `long` values in that case will already be smallish. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]
jpountz opened a new pull request, #12622: URL: https://github.com/apache/lucene/pull/12622 This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders doc IDs on merge using a `BPIndexReorderer`. - Reordering always run on forced merges. - A `minNaturalMergeNumDocs` parameter helps only enable reordering on the larger merged segments. This way, small merges retain all merging optimizations like bulk copying of stored fields, and only the larger segments - which are the most important for search performance - get reordered. - If not enough RAM is available to perform reordering, reordering is skipped. To make this work, I had to add the ability for any merge to reorder doc IDs of the merged segment via `OneMerge#reorder`. `MockRandomMergePolicy` from the test framework randomly reverts the order of documents in a merged segment to make sure this logic is properly exercised. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Reduce FST block size for BlockTreeTermsWriter (#12604) [lucene-solr]
risdenk opened a new pull request, #2677: URL: https://github.com/apache/lucene-solr/pull/2677 Backport of https://github.com/apache/lucene/pull/12604 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] FST#Compiler allocates too much memory [lucene]
risdenk commented on issue #12598: URL: https://github.com/apache/lucene/issues/12598#issuecomment-1746863472 FWIW I was looking into this a bit when I saw this issue come in. Specifically on Solr 8.11, but as far as I can tell the changes in #12604 apply to 8.x as well. In a 30s async profiler memory allocation profile, can see that ~4% of total is this same FST byte array. https://github.com/apache/lucene/assets/3384157/7eefeeed-04f6-442b-b9f3-0ac394581b60";> I put up a backport PR for lucene-solr branch_8_11 - https://github.com/apache/lucene-solr/pull/2677 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add readBytes method to RandomAccessInput [lucene]
iverase merged PR #12600: URL: https://github.com/apache/lucene/pull/12600 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add readBytes method to RandomAccessInput [lucene]
iverase closed issue #12599: Add readBytes method to RandomAccessInput URL: https://github.com/apache/lucene/issues/12599 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747002969 the type conversions are what makes it slow. for float case it is the equiv of: ``` float x = something; float y = something; float z = something; // no conversions float result = x + y * z; ``` for the binary case it is the equivalent of: ``` byte a = something; byte b = something; byte c = something; // multiply b + c avoiding overflow short z1 = (short)b * (short)c; // add a + z1 avoiding overflow int z = (int)a + (int)z1; ``` You can understand the limitations at the hardware level better by reading https://en.wikichip.org/wiki/x86/avx512_vnni `VPDPBUSD` is currently not supported by openjdk either: https://bugs.openjdk.org/browse/JDK-8215891 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747026111 Also their suggested replacement of 3 instructions for the `VPDPBUSD` is: > Likewise, for 8-bit values, three instructions are needed - VPMADDUBSW which is used to multiply two 8-bit pairs and add them together, followed by a VPMADDWD with the value 1 in order to simply up-convert the 16-bit values to 32-bit values, followed by the VPADDD instruction which adds the result to an accumulator. I can tell you this is also not what is happening. We have no ability to write AVX-512-specific code and currently have to support ARM, machines with only AVX-256, etc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]
jpountz commented on PR #12622: URL: https://github.com/apache/lucene/pull/12622#issuecomment-1747029247 The diff is large because I had to introduce a new `SlowCompositeCodecReaderWrapper`, which effectively does the merge (lazily) and can be fed to the reordering logic prior to actually running the merge. The rest of the change is reasonable (at least for now). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add readBytes method to RandomAccessInput [lucene]
iverase commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747031597 @uschindler I merged the change. I tried to backported but it is not possible ByteBuffer#get(int, byte[], int, int) is not available in the java version on line 9.x. I think it is ok to leave this change until 10. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add readBytes method to RandomAccessInput [lucene]
uschindler commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747037797 Hi @iverase, oh yeah. The absolute ByteBuffer gets are not available in older Java versions. If you want to backport, you could create a temporary ByteBuffer slice, but if you're fine to delay this change, I am fine. An alternative is to not implement the method and fallback to the default?! I changed the Policeman Jenkins MMAP job back to Lucene Main branch. The next run would have failed as you deleted the branch! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add readBytes method to RandomAccessInput [lucene]
uschindler commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747042486 P.S.: See [docs](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/ByteBuffer.html#get(int,byte%5B%5D,int,int)) here. The method came with Java 13. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747044386 As far as the ARM goes, the fact it has only 128-bit SIMD is the limiting factor. For e.g. AVX-256, we use 64-bit vector of 8 byte values -> 128 bit vector of 8 short values -> 256 bit vector of 8 int values. For ARM/NEON with only 128-bit, we can't do this as we don't have 256-bit vectors. So instead we use use 64-bit vector of 8 byte values -> 128 bit vector of 8 short values -> 2 128-bit vectors of 4 short values each. It requires splitting the vector in half, it is just all we can do. If you want it to be faster get an ARM with SVE SIMD which has bigger vectors than NEON. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add readBytes method to RandomAccessInput [lucene]
uschindler commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747053947 @iverase, I think you have to move the changes entry to Lucene 10. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747066837 My recommendation: stop messing around with `byte` and start thinking about the new 16-bit half-float support that is present in Java 21. Unfortunately the half-float *vectorization* support is not even in openjdk master branch for experimentation yet. But even my 3-year old crappy $200 mobile phone can do 16-bit dot product in hardware, support is more widespread and it would avoid these issues while still saving memory/space. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add readBytes method to RandomAccessInput [lucene]
iverase commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747072284 >@iverase, I think you have to move the changes entry to Lucene 10. I did it already in ba74da1 >I changed the Policeman Jenkins MMAP job back to Lucene Main branch. The next run would have failed as you deleted the branch! Oops, I delete them mechanically, glad you changed quickly. > An alternative is to not implement the method and fallback to the default?! I think that is the best approach, We can do that if there is need to have it in the 9.x branch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]
uschindler commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747204954 Actually it is worse: Java 20 introduced conversion between short/float, but we got neither a native `float16` datatype nor vector support. In short: completely unuseable. 🤮 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]
uschindler commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747206287 See https://github.com/openjdk/jdk/pull/9422 (Java 20) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747228177 @benwtrent For merges there is "FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search" https://arxiv.org/pdf/2105.09613.pdf DiskANN is known to be slower at indexing than HNSW and the blog post does not compare single threaded index times with Lucene. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747298348 > DiskANN is known to be slower at indexing than HNSW and the blog post does not compare single threaded index times with Lucene. @robertvanwinkle1138 this is just one of my concerns. Additionally, I think we would still have the "segment merging problem". I do think we can get HNSW merges down by keeping connections between graphs, I just haven't had cycles myself to dig into that. There is a Lucene issue around making HNSW merges faster somewhere... https://github.com/apache/lucene/issues/12440 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Improve fallback sorter for BKD [lucene]
gf2121 commented on code in PR #12610: URL: https://github.com/apache/lucene/pull/12610#discussion_r1346202698 ## lucene/core/src/java/org/apache/lucene/util/bkd/MutablePointTreeReaderUtils.java: ## @@ -81,6 +86,40 @@ protected int byteAt(int i, int k) { return (reader.getDocID(i) >>> Math.max(0, shift)) & 0xff; } } + + @Override + protected Sorter getFallbackSorter(int k) { +return new InPlaceMergeSorter() { + @Override + protected int compare(int i, int j) { +if (k < config.packedBytesLength) { + reader.getValue(i, scratch1); + reader.getValue(j, scratch2); + int v = + Arrays.compareUnsigned( + scratch1.bytes, + scratch1.offset + k, + scratch1.offset + scratch1.length, + scratch2.bytes, + scratch2.offset + k, + scratch2.offset + scratch2.length); + if (v != 0) { +return v; + } +} +if (bitsPerDocId == 0) { Review Comment: Yes, see this [line](https://github.com/apache/lucene/blob/eb14a1a0a69757dcbda4829c4a160ecd1a5530ad/lucene/core/src/java/org/apache/lucene/util/bkd/MutablePointTreeReaderUtils.java#L59). We are avoiding doc id comparing with a stable sort if index sorting in not enabled :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Improve fallback sorter for BKD [lucene]
gf2121 commented on code in PR #12610: URL: https://github.com/apache/lucene/pull/12610#discussion_r1346210779 ## lucene/core/src/java/org/apache/lucene/util/bkd/MutablePointTreeReaderUtils.java: ## @@ -81,6 +86,40 @@ protected int byteAt(int i, int k) { return (reader.getDocID(i) >>> Math.max(0, shift)) & 0xff; } } + + @Override + protected Sorter getFallbackSorter(int k) { Review Comment: Thanks for the review and the great advice! As `Arrays#compareUnsigned` need to compare bytes one by one when length < 8, This way should speed things up. However, the newest patch (as well as origin patch that using `Arrays#compareUnsigned`) shows a slower speed than `main` in my benchmark. I'm still digging to understand what happened. I'll change this PR to draft until figuring it out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
jmazanec15 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747329967 A hybrid disk-memory algorithm would have very strong benefits. I did run a few tests recently that confirmed HNSW does not function very well when memory gets constrained (which I think everyone already knew). I wonder though, instead of DiskANN, what about a partitioning based approach such as [SPANN](https://arxiv.org/pdf/2111.08566.pdf)? I think a partitioning based approach for Lucene would make merging, updating, filtering and indexing a lot easier. Also, it seems it would have better disk-access patterns. In the paper, they do show that in a memory constrained environment, they were able to outperfrom DiskANN. I guess the tradeoff might be that partitioning based approaches would struggle to achieve really low latency search when in memory compared to graph-based approaches. Additionally, partitioning approaches would require a potentially expensive "training" or "preprocessing" step such as k-Means and performance could degrade if data distribution drifts and the models are not updated. But, if PQ is implemented, the same considerations would need to be taken as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747350135 @jmazanec15 I agree that SPANN seems more attractive. I would argue though we don't need to do clustering (in the paper they do clustering, but with minimal effectiveness), but could maybe get away with randomly selecting a subset of vectors per segment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter [lucene]
gf2121 opened a new pull request, #12623: URL: https://github.com/apache/lucene/pull/12623 ### Description As `StableMSBRadixSorter` always requires a `O(n)` extra memory. We can use a `MergeSorter` taking advantage of the extra memory instead of `InPlaceMergeSorter`. ### Benchmark This reduces around 15% took of sorting 10,000,000 timestamp values (long) with `StableMSBRadisSorter`. https://bytedance.feishu.cn/sheets/HSetsPqDrhicnet5lWrcOXMtnRc"; data-importRangeRawData-range="'Sheet1'!A1:D2"> | baseline | candidate | took diff -- | -- | -- | -- sort took | 5285 | 4464 | -15.53% -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 opened a new pull request, #12624: URL: https://github.com/apache/lucene/pull/12624 ### Description Refactor the method in `BytesStore` needed for FST construction to an abstract class and allow it to be passed from `FSTCompiler.Builder`. The Builder will still maintain `bytesPageBits(int)` for backward-compatibility, in which it will pass the BytesStore. Issue: https://github.com/apache/lucene/issues/12543 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Reduce FST block size for BlockTreeTermsWriter (#12604) [lucene-solr]
risdenk merged PR #2677: URL: https://github.com/apache/lucene-solr/pull/2677 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747775200 > Actually it is worse: Java 20 introduced conversion between short/float, but we got neither a native `float16` datatype nor vector support. In short: completely unuseable. We should at least fix the field we have in the sandbox to use it and start playing with the possibilities/performance? https://github.com/apache/lucene/blob/main/lucene/sandbox/src/java/org/apache/lucene/sandbox/document/HalfFloatPoint.java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] SOLR-17004: ZkStateReader waitForState should check clusterState before using watchers [lucene-solr]
risdenk opened a new pull request, #2678: URL: https://github.com/apache/lucene-solr/pull/2678 Backport SOLR-17004 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] SOLR-17004: ZkStateReader waitForState should check clusterState before using watchers [lucene-solr]
risdenk merged PR #2678: URL: https://github.com/apache/lucene-solr/pull/2678 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory [lucene]
dungba88 commented on issue #12543: URL: https://github.com/apache/lucene/issues/12543#issuecomment-1748001631 I put together a PR at https://github.com/apache/lucene/pull/12624. I also verified with a custom dictionary (~1MB in size) that position does not go backward to previously written input, so I think it would not be a problem to create a new `FSTWriter` that flush to disk on every add and only use the in-mem buffer for processing the current input. That is assuming we have `shouldShareSuffix` set to false. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org