Re: [PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]

2024-05-21 Thread via GitHub
Pulkitg64 commented on PR #13401: URL: https://github.com/apache/lucene/pull/13401#issuecomment-2123880947 @benwtrent @uschindler @ChrisHegarty Could you please take a look, if you get a chance? -- This is an automated message from the Apache Git Service. To respond to the message, plea

Re: [I] What does the Lucene community think about dimensionality reduction for vectors, and should it be something the library does internally (at merge time perhaps)? [lucene]

2024-05-21 Thread via GitHub
gautamworah96 commented on issue #13403: URL: https://github.com/apache/lucene/issues/13403#issuecomment-2123614671 > I would expect the first stab at dimension reduction would be PQ not PCA. Hmm. I would've expected the opposite? If the number of dimensions are reduced, you don't eve

Re: [I] What does the Lucene community think about dimensionality reduction for vectors, and should it be something the library does internally (at merge time perhaps)? [lucene]

2024-05-21 Thread via GitHub
benwtrent commented on issue #13403: URL: https://github.com/apache/lucene/issues/13403#issuecomment-2123458172 Is PCA ever preferred for vector information retrieval over Product Quantization? I would expect the first stab at dimension reduction would be PQ not PCA. Maybe a first st

Re: [PR] [9.x] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
ChrisHegarty merged PR #13402: URL: https://github.com/apache/lucene/pull/13402 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucen

Re: [PR] [9.x] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
uschindler commented on code in PR #13402: URL: https://github.com/apache/lucene/pull/13402#discussion_r1608878571 ## lucene/core/src/java21/org/apache/lucene/util/VectorUtilPanamaProvider.txt: ## @@ -1,2 +0,0 @@ -The version of VectorUtilPanamaProvider for Java 21 is identical

[I] What does the Lucene community think about dimensionality reduction for vectors, and should it be something the library does internally (at merge time perhaps)? [lucene]

2024-05-21 Thread via GitHub
gautamworah96 opened a new issue, #13403: URL: https://github.com/apache/lucene/issues/13403 ### Description I opened this issue as a discussion topic. With the advancement in int8, int4 type vector storage, I believe Lucene takes the unquantized vectors as inputs, intelligently calc

Re: [I] Significant drop in recall for int8 scalar quantization using maximum_inner_product [lucene]

2024-05-21 Thread via GitHub
jmazanec15 commented on issue #13350: URL: https://github.com/apache/lucene/issues/13350#issuecomment-2123110105 I am trying to understand one thing: Does the corrective offset for dot product rectify issues with sign shift that is caused by going from signed domain: [-x, +y] to unsigned do

Re: [PR] [9.x] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
uschindler commented on PR #13402: URL: https://github.com/apache/lucene/pull/13402#issuecomment-2123092672 But I tend to not put this into Lucene 9.x. IMHO, it's too risky. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

Re: [PR] [9.x] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
uschindler commented on PR #13402: URL: https://github.com/apache/lucene/pull/13402#issuecomment-2123086712 This is not so easy to do. I think we have to clone the whole vector code to Java 21, but without the memorysegment shortcuts. I'd suggest: - keep the Java 20 code as is -

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
uschindler commented on PR #13339: URL: https://github.com/apache/lucene/pull/13339#issuecomment-2123079691 Looks like first Java 22 build also worked fine, so no API incompatibilities in JDK (foreign preview vs final): https://jenkins.thetaphi.de/job/Lucene-main-Linux/48322/consoleText -

[PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]

2024-05-21 Thread via GitHub
Pulkitg64 opened a new pull request, #13401: URL: https://github.com/apache/lucene/pull/13401 ### Description This PR is to get feedback on the idea and any major changes required in the commit. In this commit we are using Java SPI instead of ENUM to define VectorSimilarityFun

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
ChrisHegarty merged PR #13339: URL: https://github.com/apache/lucene/pull/13339 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucen

[PR] Replace Set<Integer> by IntHashSet and Set<Long> by LongHashSet [lucene]

2024-05-21 Thread via GitHub
bruno-roustant opened a new pull request, #13400: URL: https://github.com/apache/lucene/pull/13400 Add IntHashSet and LongHashSet to the HPPC fork. Use them to replace usages of Set<Integer> and Set<Long>. Refactor a bit the forked HPPC classes, add tests. On the way I discovered a small bug in HPP

Re: [PR] Replace Map<Long, Object> by primitive LongObjectHashMap. [lucene]

2024-05-21 Thread via GitHub
bruno-roustant merged PR #13392: URL: https://github.com/apache/lucene/pull/13392 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@luc

Re: [I] Significant drop in recall for int8 scalar quantization using maximum_inner_product [lucene]

2024-05-21 Thread via GitHub
benwtrent commented on issue #13350: URL: https://github.com/apache/lucene/issues/13350#issuecomment-2122846701 I used int7 for my experiments. While losing one bit of precision isn't the best, it works well. I explored adding an unsigned byte dot product, but that got rejected as too
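
As a rough illustration of the signed versus unsigned point raised in this comment: the sketch below assumes plain Java `byte` arrays and is not Lucene's quantized scorer. The class and method names are hypothetical. It shows that values in the int7 range [0, 127] give the same dot product whether bytes are read as signed or unsigned, while a value above 127 does not survive a signed read.

```java
// Hypothetical illustration only; not Lucene's quantized scorer code.
final class ByteDotProductSketch {

  // Reads each byte as a signed value in [-128, 127], which is Java's default.
  static int dotSigned(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Reads each byte as an unsigned value in [0, 255] by masking off sign extension.
  static int dotUnsigned(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += (a[i] & 0xFF) * (b[i] & 0xFF);
    }
    return sum;
  }

  public static void main(String[] args) {
    byte[] query = {2, 3};
    byte[] int8Doc = {(byte) 200, 100}; // 200 is stored as -56 in a Java byte
    byte[] int7Doc = {127, 100};        // every value fits in [0, 127]

    System.out.println(dotSigned(int8Doc, query));   // 188: the 200 was read as -56
    System.out.println(dotUnsigned(int8Doc, query)); // 700
    System.out.println(dotSigned(int7Doc, query));   // 554
    System.out.println(dotUnsigned(int7Doc, query)); // 554: same result either way
  }
}
```

Restricting quantized values to 7 bits costs one bit of precision, as the comment notes, but it lets a signed byte dot product be used unchanged.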

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1608442177 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentFlatVectorsScorer.java: ## @@ -0,0 +1,93 @@ +/* + * Licensed to the Apache Software

Re: [PR] Use `IndexInput#prefetch` for terms dictionary lookups. [lucene]

2024-05-21 Thread via GitHub
jpountz commented on PR #13359: URL: https://github.com/apache/lucene/pull/13359#issuecomment-2122746505 It creates a 50GB terms dictionary while my machine only has ~28GB of RAM for the page cache, so many terms dictionary lookups result in page faults. -- This is an automated message fr

Re: [PR] Use `IndexInput#prefetch` for terms dictionary lookups. [lucene]

2024-05-21 Thread via GitHub
mikemccand commented on PR #13359: URL: https://github.com/apache/lucene/pull/13359#issuecomment-2122733760 > But I created a benchmark that starts looking like running a Lucene query that is encouraging Was this with a forced-cold index? -- This is an automated message from the Ap

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1608349500 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentFlatVectorsScorer.java: ## @@ -0,0 +1,93 @@ +/* + * Licensed to the Apache Softwar

Re: [I] Significant drop in recall for int8 scalar quantization using maximum_inner_product [lucene]

2024-05-21 Thread via GitHub
mikemccand commented on issue #13350: URL: https://github.com/apache/lucene/issues/13350#issuecomment-2122686859 Thank you @naveentatikonda for the deep dive here and a nice unit test ... I couldn't follow all of the logic you described, but if we are indeed first normalizing a dimension's

Re: [PR] Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter by unordered matches [lucene]

2024-05-21 Thread via GitHub
romseygeek commented on code in PR #13315: URL: https://github.com/apache/lucene/pull/13315#discussion_r1608372668 ## lucene/core/src/java/org/apache/lucene/search/DisjunctionMatchesIterator.java: ## @@ -194,6 +194,15 @@ private DisjunctionMatchesIterator(List<MatchesIterator> matches) throws I

Re: [PR] Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter by unordered matches [lucene]

2024-05-21 Thread via GitHub
mikemccand commented on PR #13315: URL: https://github.com/apache/lucene/pull/13315#issuecomment-2122672757 What a fun and tricky corner case -- thank you @scampi for uncovering this, showing the bug with the added unit tests, and the tentative fix. I think it is actually technically

Re: [PR] Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter by unordered matches [lucene]

2024-05-21 Thread via GitHub
mikemccand commented on code in PR #13315: URL: https://github.com/apache/lucene/pull/13315#discussion_r1608345538 ## lucene/core/src/java/org/apache/lucene/search/DisjunctionMatchesIterator.java: ## @@ -194,6 +194,15 @@ private DisjunctionMatchesIterator(List<MatchesIterator> matches) throws I

Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-05-21 Thread via GitHub
mikemccand commented on issue #13387: URL: https://github.com/apache/lucene/issues/13387#issuecomment-2122642094 I like this idea! I hope we can find a simple enough API exposed through IWC to enable the optional grouping. This also has nice mechanical sympathy / symmetry with the di

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1608169944 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentFlatVectorsScorer.java: ## @@ -0,0 +1,93 @@ +/* + * Licensed to the Apache Softwar

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1608136276 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentFlatVectorsScorer.java: ## @@ -0,0 +1,93 @@ +/* + * Licensed to the Apache Software

Re: [PR] Vectors Format Refactor to improve readability [lucene]

2024-05-21 Thread via GitHub
alessandrobenedetti commented on PR #13399: URL: https://github.com/apache/lucene/pull/13399#issuecomment-2122366365 Obviously no hard opinion in naming sub-packages or how to group classes, but my feeling is that the general audience would benefit -- This is an automated message from the

Re: [PR] Vectors Format Refactor to improve readability [lucene]

2024-05-21 Thread via GitHub
alessandrobenedetti commented on PR #13399: URL: https://github.com/apache/lucene/pull/13399#issuecomment-2122364895 > Does it actually improve readability? I know some Java projects like to be very granular in how they organize packages, but I've come to like Lucene's relatively flat packa

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
uschindler commented on PR #13339: URL: https://github.com/apache/lucene/pull/13339#issuecomment-2122358609 > > We may add a method like getByteBufferSlice(). > > I experimented locally with similar before, and the performance impact when converting to/from MemorySegment was horri

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1608103780 ## lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatVectorScorerUtil.java: ## @@ -35,6 +35,6 @@ private FlatVectorScorerUtil() {} * on certain platforms

Re: [PR] Vectors Format Refactor to improve readability [lucene]

2024-05-21 Thread via GitHub
jpountz commented on PR #13399: URL: https://github.com/apache/lucene/pull/13399#issuecomment-2122338445 Does it actually improve readability? I know some Java projects like to be very granular in how they organize packages, but I've come to like Lucene's relatively flat package structure,

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1608084373 ## lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatVectorScorerUtil.java: ## @@ -35,6 +35,6 @@ private FlatVectorScorerUtil() {} * on certain platforms.

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1608072908 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java: ## @@ -73,4 +75,9 @@ private static T doPrivileged(Privilege

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
ChrisHegarty commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1608070916 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java: ## @@ -91,6 +92,8 @@ public static VectorizationProvider getInstance(

[PR] Vectors Format Refactor first draft to improve readability [lucene]

2024-05-21 Thread via GitHub
alessandrobenedetti opened a new pull request, #13399: URL: https://github.com/apache/lucene/pull/13399 ### Description The code for vector formats in the core codec package grew up quite consistently, impacting readability and maintainability. My main concerns are around duplicate

Re: [I] Test failure in TestBlockMaxConjunction.testRandom. [lucene]

2024-05-21 Thread via GitHub
jpountz closed issue #13396: Test failure in TestBlockMaxConjunction.testRandom. URL: https://github.com/apache/lucene/issues/13396 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific commen

Re: [PR] Fix max score computation in BlockMaxConjunctionBulkScorer. [lucene]

2024-05-21 Thread via GitHub
jpountz merged PR #13397: URL: https://github.com/apache/lucene/pull/13397 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

Re: [I] Reproducible failure org.apache.lucene.search.TestBlockMaxConjunction [lucene]

2024-05-21 Thread via GitHub
jpountz closed issue #13371: Reproducible failure org.apache.lucene.search.TestBlockMaxConjunction URL: https://github.com/apache/lucene/issues/13371 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] Disjunction as CompetitiveIterator for numeric dynamic pruning [lucene]

2024-05-21 Thread via GitHub
jpountz commented on PR #13221: URL: https://github.com/apache/lucene/pull/13221#issuecomment-219842 > I am so excited to see if this (nightly benchmarks auto-regolding) finally works Looks like it worked! https://people.apache.org/~mikemccand/lucenebench/TermDayOfYearSort.html

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-21 Thread via GitHub
uschindler commented on code in PR #13339: URL: https://github.com/apache/lucene/pull/13339#discussion_r1607992706 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java: ## @@ -73,4 +75,9 @@ private static T doPrivileged(PrivilegedA

Re: [PR] Delete all live docs when query matched a whole segment. [lucene]

2024-05-21 Thread via GitHub
vsop-479 commented on PR #13395: URL: https://github.com/apache/lucene/pull/13395#issuecomment-2122131720 @mikemccand Please take a look when you get a chance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the UR

Re: [PR] Break the loop when segment is fully deleted by prior delTerms or delQueries [lucene]

2024-05-21 Thread via GitHub
vsop-479 commented on PR #13398: URL: https://github.com/apache/lucene/pull/13398#issuecomment-2122130138 @mikemccand Please take a look when you get a chance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the UR

[PR] Break the loop when segment is fully deleted by prior delTerms or delQueries [lucene]

2024-05-21 Thread via GitHub
vsop-479 opened a new pull request, #13398: URL: https://github.com/apache/lucene/pull/13398 ### Description When a segment is already fully deleted by prior `delTerms` or `delQueries`, in `FrozenBufferedUpdates.applyQueryDeletes` and `FrozenBufferedUpdates.applyTermDeletes`. We c

Re: [PR] Add Intervals.noIntervals() method [lucene]

2024-05-21 Thread via GitHub
uschindler commented on PR #13389: URL: https://github.com/apache/lucene/pull/13389#issuecomment-2122073560 Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsub

Re: [I] Add method to `Intervals#noIntervals(String reason)` to `Intervals` class [lucene]

2024-05-21 Thread via GitHub
romseygeek closed issue #13388: Add method to `Intervals#noIntervals(String reason)` to `Intervals` class URL: https://github.com/apache/lucene/issues/13388 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] Add Intervals.noIntervals() method [lucene]

2024-05-21 Thread via GitHub
romseygeek merged PR #13389: URL: https://github.com/apache/lucene/pull/13389 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.

[PR] Fix max score computation in BlockMaxConjunctionBulkScorer. [lucene]

2024-05-21 Thread via GitHub
jpountz opened a new pull request, #13397: URL: https://github.com/apache/lucene/pull/13397 It sums up max scores in a float when it should sum them up in a double like we do for `Scorer#score()`. Otherwise, max scores may be returned that are less than actual scores. This bug was in
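
The float versus double point in this description can be reproduced in isolation. The sketch below is a minimal, hypothetical example (the class, method names, and score values are made up for illustration and are not taken from `BlockMaxConjunctionBulkScorer`). It assumes, as the description states, that actual scores are summed in a double and then narrowed to a float, and shows that summing the same per-clause values directly in a float can yield a smaller result.

```java
// Hypothetical illustration only; names and values are not from Lucene.
final class MaxScoreSumSketch {

  // Accumulates in float: each addition rounds to float precision.
  static float sumInFloat(float[] scores) {
    float sum = 0f;
    for (float s : scores) {
      sum += s;
    }
    return sum;
  }

  // Accumulates in double and narrows to float once at the end.
  static float sumInDouble(float[] scores) {
    double sum = 0d;
    for (float s : scores) {
      sum += s;
    }
    return (float) sum;
  }

  public static void main(String[] args) {
    // 0x1p-24f is 2^-24, half the gap between 1.0f and the next float above it.
    float[] scores = {1.0f, 0x1p-24f, 0x1p-24f};

    float asFloat = sumInFloat(scores);   // each tiny addend is rounded away
    float asDouble = sumInDouble(scores); // the two tiny addends add up to one float step

    System.out.println(asFloat < asDouble); // true
  }
}
```

If the float sum were used as an upper bound and the double sum as the actual score, the bound would be lower than the score it is supposed to cap, which is the kind of inconsistency the description refers to.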

Re: [I] Test failure in TestBlockMaxConjunction.testRandom. [lucene]

2024-05-21 Thread via GitHub
jpountz commented on issue #13396: URL: https://github.com/apache/lucene/issues/13396#issuecomment-2121985101 Thanks, I had started looking into #13371 but this one was easier to debug and I could figure out the problem. I'll open a PR shortly. -- This is an automated message from the Apa

Re: [PR] Reduce the overhead of `IndexInput#prefetch` when data is cached in RAM. [lucene]

2024-05-21 Thread via GitHub
jpountz merged PR #13381: URL: https://github.com/apache/lucene/pull/13381 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa