Re: [I] `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory [lucene]

2023-10-04 Thread via GitHub
dungba88 commented on issue #12543: URL: https://github.com/apache/lucene/issues/12543#issuecomment-1746403380 Thanks @mikemccand ! Let's continue the discussion in this issue instead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to Git

Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub
iverase commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1345475483 ## lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int len)

Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub
uschindler commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1345489213 ## lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int le

Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub
iverase commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1345506266 ## lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int len)

[I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub
benwtrent opened a new issue, #12621: URL: https://github.com/apache/lucene/issues/12621 ### Description While testing and digging around, I noticed that our float comparisons are way faster than byte on my Macbook (M1) and pretty much the same as our byte comparisons on a GCP Intel

Re: [I] Write VLong in opposite order for better outputs sharing in the FST [lucene]

2023-10-04 Thread via GitHub
mikemccand commented on issue #12620: URL: https://github.com/apache/lucene/issues/12620#issuecomment-1746831073 This might be needle moving on the size of the FSTs created by block tree for the terms index, since it encodes long as `vLong` in its output. We should only try this "reverse v
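To make the byte-order idea concrete, here is a minimal sketch of a variable-length long written in two orders. `encodeLsbFirst` follows the usual 7-bits-per-byte scheme with the low-order group first; `encodeMsbFirst` is a hypothetical reversed layout invented for illustration, and whether it matches what the issue actually proposes is an assumption. The class and method names are not Lucene APIs.

```java
import java.io.ByteArrayOutputStream;

final class VLongOrderSketch {

  // Usual variable-length layout: 7 payload bits per byte, least-significant
  // group first, high bit set on every byte except the last one.
  static byte[] encodeLsbFirst(long v) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((v & ~0x7FL) != 0) {
      out.write((int) ((v & 0x7F) | 0x80));
      v >>>= 7;
    }
    out.write((int) v);
    return out.toByteArray();
  }

  // Hypothetical reversed layout (an assumption, not taken from the issue):
  // same 7-bit groups, but the most-significant group is written first.
  static byte[] encodeMsbFirst(long v) {
    int groups = 1;
    for (long t = v >>> 7; t != 0; t >>>= 7) {
      groups++;
    }
    byte[] out = new byte[groups];
    for (int i = groups - 1; i >= 0; i--) {
      int b = (int) (v & 0x7F);
      if (i != groups - 1) {
        b |= 0x80; // continuation bit on all but the final byte
      }
      out[i] = (byte) b;
      v >>>= 7;
    }
    return out;
  }

  // Number of leading bytes the two encodings have in common.
  static int commonPrefix(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }
}
```

With the low-order group first, two numerically close values already differ in their first byte; with the high-order group first they share leading bytes, which is the kind of prefix an FST could share between outputs.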

[PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-04 Thread via GitHub
jpountz opened a new pull request, #12622: URL: https://github.com/apache/lucene/pull/12622 This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders doc IDs on merge using a `BPIndexReorderer`. - Reordering always runs on forced merges. - A `minNaturalMergeNumDocs` pa

[PR] Reduce FST block size for BlockTreeTermsWriter (#12604) [lucene-solr]

2023-10-04 Thread via GitHub
risdenk opened a new pull request, #2677: URL: https://github.com/apache/lucene-solr/pull/2677 Backport of https://github.com/apache/lucene/pull/12604

Re: [I] FST#Compiler allocates too much memory [lucene]

2023-10-04 Thread via GitHub
risdenk commented on issue #12598: URL: https://github.com/apache/lucene/issues/12598#issuecomment-1746863472 FWIW I was looking into this a bit when I saw this issue come in. Specifically on Solr 8.11, but as far as I can tell the changes in #12604 apply to 8.x as well. In a 30s asy

Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub
iverase merged PR #12600: URL: https://github.com/apache/lucene/pull/12600

Re: [I] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub
iverase closed issue #12599: Add readBytes method to RandomAccessInput URL: https://github.com/apache/lucene/issues/12599

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747002969 the type conversions are what makes it slow. for float case it is the equiv of: ``` float x = something; float y = something; float z = something; // no conversions f
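The point about conversions can be illustrated with two scalar dot products. This is only a sketch of the Java language semantics (binary numeric promotion widens each `byte` to `int` before the multiply); it is not Lucene's actual vector code, and the class and method names are made up for the example.

```java
final class DotProductSketch {

  // float * float stays float: no type conversion inside the loop.
  static float dotFloat(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // byte * byte is computed as int * int: each operand is sign-extended
  // to int before the multiply, and the sum is accumulated as int.
  static int dotByte(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

The widening is what keeps `(-128) * (-128)` from overflowing a byte, but it also means a vectorized version has to widen lanes before multiplying, which the float path does not need.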

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747026111 Also their suggested replacement of 3 instructions for the `VPDPBUSD` is: > Likewise, for 8-bit values, three instructions are needed - VPMADDUBSW which is used to multiply two

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-04 Thread via GitHub
jpountz commented on PR #12622: URL: https://github.com/apache/lucene/pull/12622#issuecomment-1747029247 The diff is large because I had to introduce a new `SlowCompositeCodecReaderWrapper`, which effectively does the merge (lazily) and can be fed to the reordering logic prior to actually r

Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub
iverase commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747031597 @uschindler I merged the change. I tried to backport it, but that is not possible: ByteBuffer#get(int, byte[], int, int) is not available in the Java version used on the 9.x line. I think it is

Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub
uschindler commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747037797 Hi @iverase, oh yeah. The absolute ByteBuffer gets are not available in older Java versions. If you want to backport, you could create a temporary ByteBuffer slice, but if y

Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub
uschindler commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747042486 P.S.: See [docs](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/ByteBuffer.html#get(int,byte%5B%5D,int,int)) here. The method came with Java 13.
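For readers following the backport discussion, the sketch below contrasts the absolute bulk get that arrived in Java 13 with one possible workaround for older versions using a temporary view of the buffer. The class and method names are illustrative only, and the workaround uses `duplicate()` as one way to get a temporary view; it is not necessarily the exact approach suggested in the thread.

```java
import java.nio.ByteBuffer;

final class AbsoluteBulkGetSketch {

  // Java 13+: absolute bulk get. Reads len bytes starting at index pos
  // without touching the buffer's own position.
  static void readAbsolute(ByteBuffer buf, int pos, byte[] dst, int off, int len) {
    buf.get(pos, dst, off, len);
  }

  // Older Java versions: take a temporary view, move *its* position, and use
  // the relative bulk get. The original buffer's position stays unchanged.
  static void readViaTemporaryView(ByteBuffer buf, int pos, byte[] dst, int off, int len) {
    ByteBuffer tmp = buf.duplicate();
    tmp.position(pos);
    tmp.get(dst, off, len);
  }
}
```

The workaround allocates a small view object per call, which is the cost the absolute variant avoids.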

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747044386 As far as ARM goes, the fact that it has only 128-bit SIMD is the limiting factor. E.g., for AVX-256, we use 64-bit vector of 8 byte values -> 128 bit vector of 8 short values ->

Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub
uschindler commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747053947 @iverase, I think you have to move the changes entry to Lucene 10.

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747066837 My recommendation: stop messing around with `byte` and start thinking about the new 16-bit half-float support that is present in Java 21. Unfortunately the half-float *vectorization*

Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub
iverase commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747072284 >@iverase, I think you have to move the changes entry to Lucene 10. I did it already in ba74da1 >I changed the Policeman Jenkins MMAP job back to Lucene Main branch. The nex

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub
uschindler commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747204954 Actually it is worse: Java 20 introduced conversion between short/float, but we got neither a native `float16` datatype nor vector support. In short: completely unusable. 🤮

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub
uschindler commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747206287 See https://github.com/openjdk/jdk/pull/9422 (Java 20)

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747228177 @benwtrent For merges there is "FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search" https://arxiv.org/pdf/2105.09613.pdf

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747298348 > DiskANN is known to be slower at indexing than HNSW and the blog post does not compare single threaded index times with Lucene. @robertvanwinkle1138 this is just one of my

Re: [PR] Improve fallback sorter for BKD [lucene]

2023-10-04 Thread via GitHub
gf2121 commented on code in PR #12610: URL: https://github.com/apache/lucene/pull/12610#discussion_r1346202698 ## lucene/core/src/java/org/apache/lucene/util/bkd/MutablePointTreeReaderUtils.java: ## @@ -81,6 +86,40 @@ protected int byteAt(int i, int k) { return (reade

Re: [PR] Improve fallback sorter for BKD [lucene]

2023-10-04 Thread via GitHub
gf2121 commented on code in PR #12610: URL: https://github.com/apache/lucene/pull/12610#discussion_r1346210779 ## lucene/core/src/java/org/apache/lucene/util/bkd/MutablePointTreeReaderUtils.java: ## @@ -81,6 +86,40 @@ protected int byteAt(int i, int k) { return (reade

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub
jmazanec15 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747329967 A hybrid disk-memory algorithm would have very strong benefits. I did run a few tests recently that confirmed HNSW does not function very well when memory gets constrained (which

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747350135 @jmazanec15 I agree that SPANN seems more attractive. I would argue though we don't need to do clustering (in the paper they do clustering, but with minimal effectiveness), but co

[PR] Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter [lucene]

2023-10-04 Thread via GitHub
gf2121 opened a new pull request, #12623: URL: https://github.com/apache/lucene/pull/12623 ### Description Since `StableMSBRadixSorter` always requires `O(n)` extra memory, we can use a `MergeSorter` that takes advantage of the extra memory instead of `InPlaceMergeSorter`. ### Benc
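As a rough illustration of why extra storage helps, the sketch below is a plain stable merge sort over an `int[]` that merges through a scratch array of the same length, so each merge is a single linear pass. It is a generic textbook version under that assumption, not the `MergeSorter` from the PR and not Lucene's `Sorter` API; all names are hypothetical.

```java
final class BufferedMergeSortSketch {

  // Sorts a in place, using one scratch array of the same length (O(n) extra memory).
  static void sort(int[] a) {
    int[] tmp = new int[a.length];
    sort(a, tmp, 0, a.length);
  }

  // Sorts the half-open range [from, to).
  private static void sort(int[] a, int[] tmp, int from, int to) {
    if (to - from < 2) {
      return;
    }
    int mid = (from + to) >>> 1;
    sort(a, tmp, from, mid);
    sort(a, tmp, mid, to);
    if (a[mid - 1] <= a[mid]) {
      return; // halves are already in order, nothing to merge
    }
    // Copy the range to scratch, then merge back in one linear pass.
    System.arraycopy(a, from, tmp, from, to - from);
    int i = from;
    int j = mid;
    int k = from;
    while (i < mid && j < to) {
      // Take from the left half on ties so equal elements keep their order.
      a[k++] = tmp[j] < tmp[i] ? tmp[j++] : tmp[i++];
    }
    while (i < mid) {
      a[k++] = tmp[i++];
    }
    while (j < to) {
      a[k++] = tmp[j++];
    }
  }
}
```

An in-place merge has to rotate or swap elements to avoid the scratch array, which costs more moves per merge; when the caller already pays for `O(n)` extra memory, the buffered merge is the cheaper choice.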

[PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-04 Thread via GitHub
dungba88 opened a new pull request, #12624: URL: https://github.com/apache/lucene/pull/12624 ### Description Refactor the method in `BytesStore` needed for FST construction to an abstract class and allow it to be passed from `FSTCompiler.Builder`. The Builder will still maintain `byt

Re: [PR] Reduce FST block size for BlockTreeTermsWriter (#12604) [lucene-solr]

2023-10-04 Thread via GitHub
risdenk merged PR #2677: URL: https://github.com/apache/lucene-solr/pull/2677

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747775200 > Actually it is worse: Java 20 introduced conversion between short/float, but we got neither a native `float16` datatype nor vector support. In short: completely unusable. We

[PR] SOLR-17004: ZkStateReader waitForState should check clusterState before using watchers [lucene-solr]

2023-10-04 Thread via GitHub
risdenk opened a new pull request, #2678: URL: https://github.com/apache/lucene-solr/pull/2678 Backport SOLR-17004

Re: [PR] SOLR-17004: ZkStateReader waitForState should check clusterState before using watchers [lucene-solr]

2023-10-04 Thread via GitHub
risdenk merged PR #2678: URL: https://github.com/apache/lucene-solr/pull/2678

Re: [I] `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory [lucene]

2023-10-04 Thread via GitHub
dungba88 commented on issue #12543: URL: https://github.com/apache/lucene/issues/12543#issuecomment-1748001631 I put together a PR at https://github.com/apache/lucene/pull/12624. I also verified with a custom dictionary (~1MB in size) that position does not go backward to previously w