Re: [PR] Fix NPE in TestReqOptSumScorer.testFilterRandomRareOpt [lucene]

2024-02-06 Thread via GitHub
easyice commented on PR #13069: URL: https://github.com/apache/lucene/pull/13069#issuecomment-1931077803 This will also fix test failure for TestReqOptSumScorer.testFilterRandomFrequentOpt ``` ./gradlew test --tests TestReqOptSumScorer.testFilterRandomFrequentOpt -Dtests.seed=70A6

Re: [I] Modify getEnWikiRandomLines to fetch and decompress the zstd resource [lucene]

2024-02-06 Thread via GitHub
dweiss closed issue #13083: Modify getEnWikiRandomLines to fetch and decompress the zstd resource URL: https://github.com/apache/lucene/issues/13083 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to t

Re: [I] Modify getEnWikiRandomLines to fetch and decompress the zstd resource [lucene]

2024-02-06 Thread via GitHub
dweiss commented on issue #13083: URL: https://github.com/apache/lucene/issues/13083#issuecomment-1930762464 I used zstd-jni for decompression within the buildscript as command-line zstd may not be installed locally. zstd-jni is still way, way faster than decompressing bz2...

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
uschindler commented on code in PR #13076: URL: https://github.com/apache/lucene/pull/13076#discussion_r1480537766 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -214,4 +214,18 @@ public static float[] checkFinite(float[] v) { } return v; } +

Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-06 Thread via GitHub
dweiss commented on issue #13065: URL: https://github.com/apache/lucene/issues/13065#issuecomment-1930667697 It will be a major headache to maintain native bindings for all major platforms. I think such an analyzer should be a downstream project (then you can restrict the platforms on which

[I] Should we use a SparseFixedBitSet when deletes are sparse? [lucene]

2024-02-06 Thread via GitHub
jpountz opened a new issue, #13084: URL: https://github.com/apache/lucene/issues/13084 ### Description @uschindler asked this question in https://lists.apache.org/thread/6o3hn3x8syfm8lj93kk5rrxb0kx701gp. In this discussion, we were looking into introducing the ability to iterate

[I] Modify getEnWikiRandomLines to fetch and decompress the zstd resource [lucene]

2024-02-06 Thread via GitHub
dweiss opened a new issue, #13083: URL: https://github.com/apache/lucene/issues/13083 ### Description The decompression speed should be significant.

Re: [I] Expose the linedocsfile (enwiki) as a zstd compressed archive [lucene]

2024-02-06 Thread via GitHub
dweiss closed issue #13074: Expose the linedocsfile (enwiki) as a zstd compressed archive URL: https://github.com/apache/lucene/issues/13074

Re: [I] Expose the linedocsfile (enwiki) as a zstd compressed archive [lucene]

2024-02-06 Thread via GitHub
dweiss commented on issue #13074: URL: https://github.com/apache/lucene/issues/13074#issuecomment-1930660076 Thank you, Mike! I'll create a follow-up issue to change the gradle task to download and unpack the zstd-compressed file.

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1930648432 which of the current functions really need to be in core? I guess the problem I see is that there are 6 functions today, 3 float, 3 byte. The byte functions don't perform well and n

Re: [I] Expose the linedocsfile (enwiki) as a zstd compressed archive [lucene]

2024-02-06 Thread via GitHub
mikemccand commented on issue #13074: URL: https://github.com/apache/lucene/issues/13074#issuecomment-1930545328 OK, done! https://home.apache.org/~mikemccand/enwiki.random.lines.txt.zst I downloaded and confirmed the `wc -c` gives the same count as above. Thanks @dweiss

Re: [I] Expose the linedocsfile (enwiki) as a zstd compressed archive [lucene]

2024-02-06 Thread via GitHub
mikemccand commented on issue #13074: URL: https://github.com/apache/lucene/issues/13074#issuecomment-1930535125 Wow, that is an amazingly fast decompression! And also an awesome improvement in compression ratio. Yup, I'll do this shortly.

Re: [PR] Index arbitrary fields in taxonomy docs [lucene]

2024-02-06 Thread via GitHub
stefanvodita commented on PR #12337: URL: https://github.com/apache/lucene/pull/12337#issuecomment-1930494299 Thank you for reviving the PR, Mike; it had been sitting around for a good while. I’ll leave it up for a few more days to see if there are other comments and merge if there aren’t.

Re: [PR] Fix knn vector visit limit fence post error [lucene]

2024-02-06 Thread via GitHub
benwtrent merged PR #13058: URL: https://github.com/apache/lucene/pull/13058 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.a

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1930440518 > IMHO, the VectorSimilarity class should NOT be an ENUM and instead be an SPI with a symbolic name (using NamedSPILoader for the lookup) and the name should be stored in FieldInfo.

Re: [PR] Enable MemorySegment in MMapDirectory for Java 22+ and Vectorization (incubation) for exact Java 22 [lucene]

2024-02-06 Thread via GitHub
ChrisHegarty commented on PR #12706: URL: https://github.com/apache/lucene/pull/12706#issuecomment-1930403065 Thanks @uschindler.

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1930397239 Thanks Uwe, that's exactly what is needed. The problem I see is a very immature field (vector search) that has no way to add new features (distance functions) without permanently impacting

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1930363479 In general, I'd like to rethink the pluggable VectorSimilarities (per field). IMHO, the VectorSimilarity class should NOT be an ENUM and instead be an SPI with a symbolic name (using

Re: [PR] Fix knn vector visit limit fence post error [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on code in PR #13058: URL: https://github.com/apache/lucene/pull/13058#discussion_r1480229819 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -135,7 +135,7 @@ private TopDocs getLeafResults( } // Perform the app

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1930225256 My question about supporting Lucene 9 indices is out of legit ignorance. I think we would still need to support reading and searching segments stored with Cosine in Lucene 10. But we c

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1930202577 > Can we do that for Lucene 10.0 ? Deprecate it and warn of its imminent demise or remove it? Either should be possible. For users, they would have to add code to norma
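
The excerpt above is cut off, but the surrounding comments argue that cosine is redundant because dot_product exists. Assuming the step being described is scaling vectors to unit length (an assumption, since the sentence is truncated), the following is a minimal plain-Java sketch of why that works: the dot product of two unit-length vectors equals their cosine similarity. The class and method names (`CosineViaDotProduct`, `l2normalize`, `dot`, `cosine`) are hypothetical and are not taken from the PR or from Lucene.

```java
/** Illustrative sketch only; not code from PR #13076 or from Lucene. */
final class CosineViaDotProduct {

  /** Returns a new array holding v scaled to unit L2 length; the input is left unchanged. */
  static float[] l2normalize(float[] v) {
    double sumSquares = 0;
    for (float x : v) {
      sumSquares += (double) x * x;
    }
    if (sumSquares == 0) {
      throw new IllegalArgumentException("cannot normalize a zero-length vector");
    }
    double norm = Math.sqrt(sumSquares);
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / norm);
    }
    return out;
  }

  /** Plain dot product, accumulated in double precision. */
  static double dot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += (double) a[i] * b[i];
    }
    return sum;
  }

  /** Cosine similarity computed directly: dot(a, b) / (|a| * |b|). */
  static double cosine(float[] a, float[] b) {
    return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
  }

  public static void main(String[] args) {
    float[] a = {3f, 4f};
    float[] b = {4f, 3f};
    // Normalize once (e.g. before indexing), then use the cheaper dot product.
    double viaDot = dot(l2normalize(a), l2normalize(b));
    System.out.println("cosine=" + cosine(a, b) + " dot(normalized)=" + viaDot);
  }
}
```

The two values agree up to floating-point rounding, so the normalization cost is paid once per vector rather than on every comparison.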

Re: [PR] Enable MemorySegment in MMapDirectory for Java 22+ and Vectorization (incubation) for exact Java 22 [lucene]

2024-02-06 Thread via GitHub
uschindler commented on PR #12706: URL: https://github.com/apache/lucene/pull/12706#issuecomment-1930132446 I targeted it to milestone 9.10.0. I will add the CHANGES.txt entry shortly before merging.

Re: [PR] Enable MemorySegment in MMapDirectory for Java 22+ and Vectorization (incubation) for exact Java 22 [lucene]

2024-02-06 Thread via GitHub
uschindler commented on PR #12706: URL: https://github.com/apache/lucene/pull/12706#issuecomment-1930120998 > > Do you mean another 9.9.3 with bugfix, or do you mean next minor version 9.10? > > Apologies, I mean the next minor - 9.10 (not 9.9.3). Sorry for the confusion. Yes.

Re: [PR] Enable MemorySegment in MMapDirectory for Java 22+ and Vectorization (incubation) for exact Java 22 [lucene]

2024-02-06 Thread via GitHub
ChrisHegarty commented on PR #12706: URL: https://github.com/apache/lucene/pull/12706#issuecomment-1930112461 > Do you mean another 9.9.3 with bugfix, or do you mean next minor version 9.10? Apologies, I mean the next minor - 9.10 (not 9.9.3). Sorry for the confusion.

Re: [PR] Fix knn vector visit limit fence post error [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on PR #13058: URL: https://github.com/apache/lucene/pull/13058#issuecomment-1930102716 @jpountz there you go :). Only for `approximateSearch`

Re: [PR] Enable MemorySegment in MMapDirectory for Java 22+ and Vectorization (incubation) for exact Java 22 [lucene]

2024-02-06 Thread via GitHub
uschindler commented on PR #12706: URL: https://github.com/apache/lucene/pull/12706#issuecomment-1930098353 Yes. To be sure I wanted to wait till Friday this week. But yes in general I am happy to have this in. Do you mean another 9.9.3 with bugfix, or do you mean next minor version

[PR] Fix test failure TestParentBlockJoinFloatKnnVectorQuery.testSkewedIndex [lucene]

2024-02-06 Thread via GitHub
benwtrent opened a new pull request, #13082: URL: https://github.com/apache/lucene/pull/13082 This particular test relies on doc-ids for potential tie breaks. For consistency, removing the random flushing by reverting change from commit: f7cab164501 closes: https://github.com/apache/

Re: [PR] Enable MemorySegment in MMapDirectory for Java 22+ and Vectorization (incubation) for exact Java 22 [lucene]

2024-02-06 Thread via GitHub
ChrisHegarty commented on PR #12706: URL: https://github.com/apache/lucene/pull/12706#issuecomment-1930033556 @uschindler As per our in-person conversation, are you ok to merge this PR so that it can be incorporated into the next Lucene bugfix version?

Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-06 Thread via GitHub
lmessinger commented on issue #13065: URL: https://github.com/apache/lucene/issues/13065#issuecomment-1929981311 hi, in Hebrew and other Semitic languages, lemmas are context-dependent. eg שמן could be interpreted as fat, oil, their name, from all dependent on the context s

Re: [I] Speed up requests for many rows [LUCENE-6828] [lucene]

2024-02-06 Thread via GitHub
mark4z commented on issue #7886: URL: https://github.com/apache/lucene/issues/7886#issuecomment-1929959691 Yeah, I think you are right.

Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on issue #13065: URL: https://github.com/apache/lucene/issues/13065#issuecomment-1929933564 @lmessinger I don't see why text tokenization would need any native code. Word piece is pretty simple and just a dictionary look up. Do y'all not have a Java one? O

[PR] Fix TestTopFieldCollector.testTotalHits #13080 [lucene]

2024-02-06 Thread via GitHub
benwtrent opened a new pull request, #13081: URL: https://github.com/apache/lucene/pull/13081 The failure is due to the randomized flushing and the later assertion that there are only two leaves. When switching from `w.addDocuments` to `w.addDocument` the test infra now has an opportunity to ran

Re: [I] TestTopFieldCollector.testTotalHits test failure [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on issue #13080: URL: https://github.com/apache/lucene/issues/13080#issuecomment-1929888740 Ah, I see the issue, will fix momentarily.

[I] TestTopFieldCollector.testTotalHits test failure [lucene]

2024-02-06 Thread via GitHub
benwtrent opened a new issue, #13080: URL: https://github.com/apache/lucene/issues/13080 ### Description TestTopFieldCollector.testTotalHits fails on branch_9x, git-bisect indicates https://github.com/apache/lucene/commits/0aa88910ca9a1032d288996d14203eac4953f2de I tried reprod

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
pmpailis commented on code in PR #13076: URL: https://github.com/apache/lucene/pull/13076#discussion_r1479909430 ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -94,6 +95,29 @@ public float compare(float[] v1, float[] v2) { public float

Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2024-02-06 Thread via GitHub
mayya-sharipova closed pull request #12794: Speedup concurrent multi-segment HNWS graph search URL: https://github.com/apache/lucene/pull/12794

Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2024-02-06 Thread via GitHub
mayya-sharipova commented on PR #12794: URL: https://github.com/apache/lucene/pull/12794#issuecomment-1929779853 Closed in favour of https://github.com/apache/lucene/pull/12962

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-06 Thread via GitHub
mayya-sharipova merged PR #12962: URL: https://github.com/apache/lucene/pull/12962

Re: [I] port gradle improvements to Lucene [lucene]

2024-02-06 Thread via GitHub
risdenk closed issue #12145: port gradle improvements to Lucene URL: https://github.com/apache/lucene/issues/12145

Re: [I] port gradle improvements to Lucene [lucene]

2024-02-06 Thread via GitHub
risdenk commented on issue #12145: URL: https://github.com/apache/lucene/issues/12145#issuecomment-1929767397 Handled by https://github.com/apache/lucene/pull/12150

Re: [PR] Gradle optimizations [lucene]

2024-02-06 Thread via GitHub
risdenk commented on PR #12150: URL: https://github.com/apache/lucene/pull/12150#issuecomment-1929766994 Closes https://github.com/apache/lucene/issues/12145

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1929761190 So, I did some of my own experiments. I tested Vamana (vectors in-graph) & HNSW, both with `int8` quantization (here is my Lucene branch: https://github.com/apache/lucene/compare/

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
pmpailis commented on code in PR #13076: URL: https://github.com/apache/lucene/pull/13076#discussion_r1479853108 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -214,4 +214,19 @@ public static float[] checkFinite(float[] v) { } return v; } + +

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
uschindler commented on code in PR #13076: URL: https://github.com/apache/lucene/pull/13076#discussion_r1479835140 ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -94,6 +95,29 @@ public float compare(float[] v1, float[] v2) { public floa

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929552474 > I do agree cosine should probably be removed (not because of hamming distance), but because dot_product exists. Can we do that for Lucene 10.0 ?

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929537735 > My question is why add this function when it's not that much faster than integer dot product? Because it provides different scores. Integer dot-product doesn't provide the sam

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929503472 > My question is why add this function when it's not that much faster than integer dot product? I see less than 20 percent improvement, which won't even translate to 20 percent indexi

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929405495 A good way to get in a new function would be to actually improve our support o&m by removing a horribly performing one such as cosine first. That way we are actually improving rather than

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929397814 My question is why add this function when it's not that much faster than integer dot product? I see less than 20 percent improvement, which won't even translate to 20 percent indexing/sear

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
pmpailis commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929201238 Thanks for the suggestion @uschindler - will add the suggested variant to benchmarks! To be honest, the reason I re-ran on x86 was mainly because of the vector performance differences (hence w

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929197473 P.S. the long support for bit count was added recently on x86. We may also compare with the integer one using the integer var handle (that's easy to check). Maybe that performs better

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929179243 About NEON: Robert checked yesterday. There is a lot going on in Hotspot and optimizations are added all the time. If neon is slower on your machine, it might be that there's st

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929165660 Please also add a test like the panama vs scalar one where you compare the results of the varhandle variant with the simple byte-by-byte one from the tail loop. Make sure to use inter
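
The kind of check requested above can be sketched roughly as follows. This is a hedged illustration, not the code under review in PR #13076: the class and method names (`HammingSketch`, `scalar`, `wide`) are hypothetical, and the choice of a little-endian `long` view is an assumption. It computes the Hamming distance of two byte vectors once byte by byte and once eight bytes at a time through a `VarHandle` with `Long.bitCount`, falling back to the byte loop for the tail, then compares the two on random inputs.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.util.Random;

/** Illustrative sketch only; names and layout are hypothetical, not Lucene's. */
final class HammingSketch {

  // Reads 8 consecutive bytes of a byte[] as one long (byte order does not affect bit counts).
  private static final VarHandle LONG_VIEW =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  /** Reference variant: XOR each byte pair and count the differing bits. */
  static int scalar(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  /** Wide variant: 8 bytes per step via the VarHandle, then a byte-by-byte tail loop. */
  static int wide(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    int distance = 0;
    int i = 0;
    int upper = a.length & ~(Long.BYTES - 1); // largest multiple of 8 that fits
    for (; i < upper; i += Long.BYTES) {
      long x = (long) LONG_VIEW.get(a, i);
      long y = (long) LONG_VIEW.get(b, i);
      distance += Long.bitCount(x ^ y);
    }
    for (; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    // Lengths around multiples of 8 exercise both the wide loop and the tail loop.
    for (int length = 0; length <= 67; length++) {
      byte[] a = new byte[length];
      byte[] b = new byte[length];
      random.nextBytes(a);
      random.nextBytes(b);
      if (scalar(a, b) != wide(a, b)) {
        throw new AssertionError("mismatch at length " + length);
      }
    }
    System.out.println("scalar and wide variants agree");
  }
}
```

Sweeping every length from 0 upward, rather than only multiples of eight, is what makes the comparison useful: off-by-one errors in the loop bound show up only when the tail is non-empty.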

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
pmpailis commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929150939 Thank you so much @rmuir & @uschindler for taking such a close look and also running benchmarks. 🙇 The reason I went with the lookup table was because there seemed to be some improvem

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-06 Thread via GitHub
pmpailis commented on code in PR #13076: URL: https://github.com/apache/lucene/pull/13076#discussion_r1479470613 ## lucene/core/src/test/org/apache/lucene/index/KnnGraphTestCase.java: ## @@ -54,35 +55,65 @@ import org.apache.lucene.util.Bits; import org.apache.lucene.util.Byte