[PR] Make DenseConjunctionBulkScorer align scoring windows with #docIDRunEnd(). [lucene]
jpountz opened a new pull request, #14400: URL: https://github.com/apache/lucene/pull/14400 This improves how `DenseConjunctionBulkScorer` computes scoring windows by aligning the end of the window with the `#docIDRunEnd()` of its clauses, as long as doing so yields a window that is at least half the expected size. This helps reduce the number of clauses to evaluate per window in some cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
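The alignment rule described above can be sketched roughly as follows (hypothetical `WindowAlignment` helper with illustrative names, simplified from the actual `DenseConjunctionBulkScorer` logic):

```java
// Sketch of window alignment with clause run ends. Assumption: runEnds holds
// each clause's #docIDRunEnd() for clauses positioned at `min`.
class WindowAlignment {
  static final int WINDOW_SIZE = 4096;

  // Pick a window end in [min, max): start from the default window size, then
  // truncate to a clause's run end, but only when that clause fully matches at
  // least half the window, so windows never become too small.
  static int windowEnd(int min, int max, int[] runEnds) {
    int end = (int) Math.min(max, (long) min + WINDOW_SIZE);
    for (int runEnd : runEnds) {
      if (runEnd - min >= WINDOW_SIZE / 2) {
        end = Math.min(end, runEnd);
      }
    }
    return end;
  }
}
```

A clause whose run ends at 3,000 docs truncates the window there (one less clause to evaluate over the window), while a clause whose run ends after only 1,000 docs is ignored for alignment purposes.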
Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]
alessandrobenedetti commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2751006045 > > do you confirm that, according to your knowledge, any relevant and active work toward multi-valued vectors in Lucene is effectively aggregated here? > > @alessandrobenedetti I think so. This is the latest stab at it. > > > Main concern is still related to ordinals to become long as far as I can see :) > > Indeed, I just don't see how Lucene can actually support multi-value vectors without switching to long ordinals for the vectors. Otherwise, we enforce some limitation on the number of vectors per segment, or some limitation on the number of vectors per doc (e.g. every doc can only have 256/65535 vectors). > > Making HNSW indexing & merging ~2x (given other constants, it might not be exactly 2x, maybe a little less) more expensive for heap usage is a pretty steep cost. Especially for something I am not sure how many folks will actually use. I agree, I don't think it makes sense to deteriorate single-valued performance at all (I didn't investigate that, but I trust your judgement on the int->long ordinal impact; let me know if you want me to double-check). Another option I was pondering is adding a new field type dedicated to multi-valued vectors. Sure, there will be tons of classes to "duplicate" and make multi-valued compliant, but I believe we'll be able to re-use most of the code, so a huge number of classes but (hopefully) not a massive amount of new code. Before even exploring this, I want to verify that a native multi-valued approach actually brings advantages over the current parent-join approach (mostly being faster at retrieving the top-K 'parent' documents); if not, it won't make much sense to do this huge amount of work.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2751059970 Nice! So... google java format has this option, at least in the cmd line version: If nothing else works, we could just make a multipass and format javadocs using a different tool than the rest of the code... Or fork gjf and implement proper javadoc formatting, which should be a fun project to work on.
Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]
jpountz merged PR #14359: URL: https://github.com/apache/lucene/pull/14359
Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]
vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2751872315 > Another option I was pondering is adding a new field type dedicated to multi-valued vectors. I tried this in my first stab at this issue (https://github.com/apache/lucene/pull/13525). IIRC, one concern with a separate field was that it prevents users from converting their previously single-valued fields to multi-valued vectors later if they need to. And since single-valued is a base case of multi-valued, why would anyone even use the single-valued fields? The idea in this PR was to treat single-valued as an optimization over multi-valued vectors that can be turned on/off by a flag in stored metadata. FWIW, the PR (#13525) has pieces to use the separate field, and shows the extent of duplication across classes (it's not very much). I had only added support for ColBERT-style dependent multi-vectors, but that can be extended with the independent vector pieces in this PR. .. > Before even exploring this, I want to better check the current parent join approach i.e. native multi-valued, needs to bring advantages (mostly being faster in retrieving top-K 'parent' documents), Agreed. The next step for this PR is to benchmark parent-join runs and see if there is an improvement, especially in cases where we need query-time scoring on top of all the vector values.
Re: [PR] Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset [lucene]
thecoop commented on code in PR #14304: URL: https://github.com/apache/lucene/pull/14304#discussion_r2012314179

## lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultVectorUtilSupport.java:

```diff
@@ -234,4 +234,79 @@ public static long int4BitDotProductImpl(byte[] q, byte[] d) {
     }
     return ret;
   }
+
+  @Override
+  public float minMaxScalarQuantize(
+      float[] vector, byte[] dest, float scale, float alpha, float minQuantile, float maxQuantile) {
+    return new ScalarQuantizer(alpha, scale, minQuantile, maxQuantile).quantize(vector, dest, 0);
+  }
+
+  @Override
+  public float recalculateScalarQuantizationOffset(
+      byte[] vector,
+      float oldAlpha,
+      float oldMinQuantile,
+      float scale,
+      float alpha,
+      float minQuantile,
+      float maxQuantile) {
+    return new ScalarQuantizer(alpha, scale, minQuantile, maxQuantile)
+        .recalculateOffset(vector, 0, oldAlpha, oldMinQuantile);
+  }
+
+  static class ScalarQuantizer {
```

Review Comment: It's referenced by `PanamaVectorUtilSupport` to do the tail
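For context, the core of min-max scalar quantization can be sketched like this (hypothetical `MinMaxQuantizer` with simplified semantics; Lucene's actual `ScalarQuantizer` also tracks an `alpha` parameter and a corrective offset, which are omitted here):

```java
// Sketch: clamp each float to [minQuantile, maxQuantile], then map the
// clamped value linearly onto the signed-byte range [0, 127].
class MinMaxQuantizer {
  static void quantize(float[] vector, byte[] dest, float minQuantile, float maxQuantile) {
    float scale = 127f / (maxQuantile - minQuantile);
    for (int i = 0; i < vector.length; i++) {
      // Values outside the quantile range are saturated rather than wrapped.
      float clamped = Math.max(minQuantile, Math.min(maxQuantile, vector[i]));
      dest[i] = (byte) Math.round((clamped - minQuantile) * scale);
    }
  }
}
```

The vectorized version in the PR applies the same per-element transform with Panama vector lanes, with the scalar class above handling the tail elements that don't fill a full lane.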
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
gsmiller commented on PR #14273: URL: https://github.com/apache/lucene/pull/14273#issuecomment-2751652120 +1 to this optimization. Love the idea!
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
jpountz commented on PR #14401: URL: https://github.com/apache/lucene/pull/14401#issuecomment-2751630823 @epotyom You may be interested in this; it allows computing aggregates in sub-linear time with respect to the number of matching docs.
Re: [PR] Make DenseConjunctionBulkScorer align scoring windows with #docIDRunEnd(). [lucene]
gf2121 commented on code in PR #14400: URL: https://github.com/apache/lucene/pull/14400#discussion_r2012455948

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:

```diff
@@ -171,37 +171,36 @@ private int scoreWindow(
       }
     }

-    if (acceptDocs == null) {
-      int minDocIDRunEnd = max;
-      for (DisiWrapper w : iterators) {
-        if (w.docID() > min) {
-          minDocIDRunEnd = min;
-          break;
-        } else {
-          minDocIDRunEnd = Math.min(minDocIDRunEnd, w.docIDRunEnd());
-        }
-      }
-
-      if (minDocIDRunEnd - min >= WINDOW_SIZE / 2) {
-        // We have a large range of doc IDs that all match.
-        rangeDocIdStream.from = min;
-        rangeDocIdStream.to = minDocIDRunEnd;
-        collector.collect(rangeDocIdStream);
-        return minDocIDRunEnd;
-      }
-    }
-
-    int bitsetWindowMax = (int) Math.min(max, (long) min + WINDOW_SIZE);
-
+    // Partition clauses of the conjunction into:
+    // - clauses that don't fully match the first half of the window and get evaluated via
+    //   #loadIntoBitSet or leaf-frog,
+    // - other clauses that are used to compute the greatest possible window size that they fully
+    //   match.
+    // This logic helps align scoring windows with the natural #docIDRunEnd() boundaries of the
+    // data, which helps evaluate fewer clauses per window - without allowing windows to become too
+    // small thanks to the WINDOW_SIZE/2 threshold.
+    int minDocIDRunEnd = max;
     for (DisiWrapper w : iterators) {
-      if (w.docID() > min || w.docIDRunEnd() < bitsetWindowMax) {
+      int docIdRunEnd = w.docIDRunEnd();
+      if (w.docID() > min || (docIdRunEnd - min) < WINDOW_SIZE / 2) {
         windowApproximations.add(w.approximation());
         if (w.twoPhase() != null) {
           windowTwoPhases.add(w.twoPhase());
         }
+      } else {
+        minDocIDRunEnd = Math.min(minDocIDRunEnd, docIdRunEnd);
       }
     }
+
+    if (acceptDocs == null && windowApproximations.isEmpty()) {
```

Review Comment: Out of curiosity and not related to this PR: would it be worth handling `acceptDocs != null` here as well, so that we won't need to call `intoBitset`?
## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:

```diff
@@ -171,37 +171,36 @@ private int scoreWindow(
       }
     }

-    if (acceptDocs == null) {
-      int minDocIDRunEnd = max;
-      for (DisiWrapper w : iterators) {
-        if (w.docID() > min) {
-          minDocIDRunEnd = min;
-          break;
-        } else {
-          minDocIDRunEnd = Math.min(minDocIDRunEnd, w.docIDRunEnd());
-        }
-      }
-
-      if (minDocIDRunEnd - min >= WINDOW_SIZE / 2) {
-        // We have a large range of doc IDs that all match.
-        rangeDocIdStream.from = min;
-        rangeDocIdStream.to = minDocIDRunEnd;
-        collector.collect(rangeDocIdStream);
-        return minDocIDRunEnd;
-      }
-    }
-
-    int bitsetWindowMax = (int) Math.min(max, (long) min + WINDOW_SIZE);
-
+    // Partition clauses of the conjunction into:
+    // - clauses that don't fully match the first half of the window and get evaluated via
+    //   #loadIntoBitSet or leaf-frog,
+    // - other clauses that are used to compute the greatest possible window size that they fully
+    //   match.
+    // This logic helps align scoring windows with the natural #docIDRunEnd() boundaries of the
+    // data, which helps evaluate fewer clauses per window - without allowing windows to become too
+    // small thanks to the WINDOW_SIZE/2 threshold.
+    int minDocIDRunEnd = max;
     for (DisiWrapper w : iterators) {
-      if (w.docID() > min || w.docIDRunEnd() < bitsetWindowMax) {
+      int docIdRunEnd = w.docIDRunEnd();
+      if (w.docID() > min || (docIdRunEnd - min) < WINDOW_SIZE / 2) {
```

Review Comment: > I believe that it only makes a difference when max-min < WINDOW_SIZE, where more clauses would now get evaluated

Was this the line making the difference, and could it be addressed by something like the following code? I'm OK either way :)
```java
int minRunEnd = max;
final int minRunEndThreshold = Math.min(min + WINDOW_SIZE / 2, max);
for (DisiWrapper w : iterators) {
  int docIdRunEnd = w.docIDRunEnd();
  if (w.docID() > min || docIdRunEnd < minRunEndThreshold) {
```
Re: [PR] Disable sort optimization when tracking all docs [lucene]
bugmakerr commented on PR #14395: URL: https://github.com/apache/lucene/pull/14395#issuecomment-2750983066 > The change looks correct to me. With recent changes to allow clauses that match all docs to remove themselves from a conjunction, it should be possible to achieve something similar by implementing `#docIDRunEnd()` on competitive iterators. I need to think a bit more about the pros and cons of these two approaches. @jpountz If I understand correctly, I think both optimizations can be implemented at the same time; there is no conflict between the two. If we know we can't skip any docs before collection, then there's no need to maintain competitiveIterator-related data, which helps implementations that don't benefit from `docIDRunEnd`. Meanwhile, `docIDRunEnd` can implement the skip logic at runtime. Of course, the current implementation only informs the comparator once, but if we could inform each segment separately, we could also disable/enable sort on the fly based on the current total hits and the max doc of the current segment.
Re: [PR] Make DenseConjunctionBulkScorer align scoring windows with #docIDRunEnd(). [lucene]
jpountz commented on code in PR #14400: URL: https://github.com/apache/lucene/pull/14400#discussion_r2012300061

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:

```diff
@@ -171,27 +171,30 @@ private int scoreWindow(
       }
     }

-    if (acceptDocs == null) {
-      int minDocIDRunEnd = max;
-      for (DisiWrapper w : iterators) {
-        if (w.docID() > min) {
-          minDocIDRunEnd = min;
-          break;
-        } else {
-          minDocIDRunEnd = Math.min(minDocIDRunEnd, w.docIDRunEnd());
+    int minDocIDRunEnd = max;
+    int bitsetWindowMax = (int) Math.min(max, (long) min + WINDOW_SIZE);
+    for (DisiWrapper w : iterators) {
+      if (w.docID() > min) {
+        minDocIDRunEnd = min;
+      } else {
+        int docIDRunEnd = w.docIDRunEnd();
+        minDocIDRunEnd = Math.min(minDocIDRunEnd, docIDRunEnd);
+        // If we can find one clause that matches over more than half the window then we truncate
+        // the window to the run end of this clause as the benefits of evaluating one less clause
+        // likely dominate the overhead of using a smaller window.
+        if (docIDRunEnd - min >= WINDOW_SIZE / 2) {
+          bitsetWindowMax = Math.min(bitsetWindowMax, docIDRunEnd);
         }
       }
-
-      if (minDocIDRunEnd - min >= WINDOW_SIZE / 2) {
-        // We have a large range of doc IDs that all match.
-        rangeDocIdStream.from = min;
-        rangeDocIdStream.to = minDocIDRunEnd;
-        collector.collect(rangeDocIdStream);
-        return minDocIDRunEnd;
-      }
     }

-    int bitsetWindowMax = (int) Math.min(max, (long) min + WINDOW_SIZE);
+    if (acceptDocs == null && minDocIDRunEnd >= bitsetWindowMax) {
```

Review Comment: Yes, if all clauses fully match more than the next WINDOW_SIZE docs.
Re: [PR] Make DenseConjunctionBulkScorer align scoring windows with #docIDRunEnd(). [lucene]
jpountz commented on PR #14400: URL: https://github.com/apache/lucene/pull/14400#issuecomment-2751702590 Thank you. I believe that it only makes a difference when `max-min < WINDOW_SIZE`, where more clauses would now get evaluated, but simplicity is more important so I applied your suggestion.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2751070467 I think in the markdown case, the bug I saw was that it didn't treat `///` as javadoc but as an ordinary inline comment. But I can experiment with the option still.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2751317728 https://github.com/google/google-java-format/issues/1193 > Disabling Javadoc formatting doesn't prevent either issue. So it seems it's broken entirely. Argh.
[PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
jpountz opened a new pull request, #14401: URL: https://github.com/apache/lucene/pull/14401 This introduces `LeafCollector#collectRange`, which is typically useful to take advantage of the pre-aggregated data exposed in `DocValuesSkipper`. At the moment, `DocValuesSkipper` only exposes per-block min and max values, but we could easily extend it to record sums and value counts as well. This `collectRange` method would be called if there are no deletions in the segment by: - queries that rewrite to a `MatchAllDocsQuery` (with min=0 and max=maxDoc), - `PointRangeQuery` on segments that fully match the range (typical for time-based data), - doc-value range queries and conjunctions of doc-value range queries on fields that enable sparse indexing and correlate with the index sort.
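As a rough illustration of why a range-collection hook enables sub-linear aggregation: a hit-counting collector can handle a fully-matching range in O(1) instead of visiting each doc. The sketch below uses an illustrative `CountingCollector`; it is not the actual `LeafCollector#collectRange` signature from the PR:

```java
// Hypothetical collector: ranges known to fully match are counted in O(1),
// while individual hits still go through the per-doc path.
class CountingCollector {
  private long count;

  // Range is half-open: [minDocId, maxDocId). Counting a fully-matching
  // range costs a single subtraction regardless of how many docs it spans.
  void collectRange(int minDocId, int maxDocId) {
    count += (long) maxDocId - minDocId;
  }

  void collect(int docId) {
    count++;
  }

  long getCount() {
    return count;
  }
}
```

The same shape extends to sums or value counts once the skipper records per-block aggregates: the collector adds the block's pre-aggregated value instead of iterating its docs.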
Re: [PR] Pack file pointers when merging BKD trees [lucene]
iverase merged PR #14393: URL: https://github.com/apache/lucene/pull/14393
Re: [I] Reduce memory usage when merging bkd trees [lucene]
iverase closed issue #14382: Reduce memory usage when merging bkd trees URL: https://github.com/apache/lucene/issues/14382
Re: [I] Reduce memory usage when merging bkd trees [lucene]
iverase commented on issue #14382: URL: https://github.com/apache/lucene/issues/14382#issuecomment-2750449881 We are using more dense data structures now, in particular for the OneDimensionBKDWriter.
Re: [I] Support modifying segmentInfos.counter in IndexWriter [lucene]
vigyasharma commented on issue #14362: URL: https://github.com/apache/lucene/issues/14362#issuecomment-2752804917 Thanks @guojialiang92 . Is the plan here to support creating an IndexWriter with a supplied value of `counter`, say `N`, so that all its commit generations are `>=N`, i.e. `segments_N, segments_N+1, ...` and so on? To confirm my understanding: when a primary dies, one cannot really guarantee that all replicas were fully caught up. If the winning replica (new primary) was lagging behind, and you simply continue to use its `SegmentInfos#counter`, it might end up overwriting some segment files in other replicas. So you use raft to select the right counter value and start segments from that value. This is specifically a problem for segment replication. Regular document replication works fine because each document is reindexed anyway and segment files are not copied over. Is that more or less correct? Anyway, I don't see any problems with this support and it does have a valid use-case. If you want to raise a PR, I can help review it.
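For reference, Lucene renders index file generations in base 36, so seeding the counter at `N` keeps new file names strictly ahead of anything a lagging replica already holds. This sketch uses an illustrative `CommitNames` helper; the real naming logic lives in `SegmentInfos` and `IndexFileNames`:

```java
// Sketch of generation-based commit file naming: the generation is rendered
// in base 36 (Character.MAX_RADIX), matching Lucene's segments_N convention.
class CommitNames {
  static String segmentsFileName(long generation) {
    return "segments_" + Long.toString(generation, Character.MAX_RADIX);
  }
}
```

So generation 10 yields `segments_a` rather than `segments_10`, which is why seeding the counter numerically keeps later file names from colliding with earlier ones.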
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on PR #14273: URL: https://github.com/apache/lucene/pull/14273#issuecomment-2752610585 Quick update: we now have more queries that collect hits using `collect(DocIdStream)`, which makes this optimization more appealing.
Re: [I] Can we use Panama Vector API for quantizing vectors? [lucene]
benwtrent closed issue #13922: Can we use Panama Vector API for quantizing vectors? URL: https://github.com/apache/lucene/issues/13922
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
github-actions[bot] commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2752825545 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Break the loop when segment is fully deleted by prior delTerms or delQueries [lucene]
github-actions[bot] commented on PR #13398: URL: https://github.com/apache/lucene/pull/13398#issuecomment-2752825874 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Make DenseConjunctionBulkScorer align scoring windows with #docIDRunEnd(). [lucene]
jpountz commented on code in PR #14400: URL: https://github.com/apache/lucene/pull/14400#discussion_r2012975372

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:

```diff
@@ -171,37 +171,36 @@ private int scoreWindow(
       }
     }

-    if (acceptDocs == null) {
-      int minDocIDRunEnd = max;
-      for (DisiWrapper w : iterators) {
-        if (w.docID() > min) {
-          minDocIDRunEnd = min;
-          break;
-        } else {
-          minDocIDRunEnd = Math.min(minDocIDRunEnd, w.docIDRunEnd());
-        }
-      }
-
-      if (minDocIDRunEnd - min >= WINDOW_SIZE / 2) {
-        // We have a large range of doc IDs that all match.
-        rangeDocIdStream.from = min;
-        rangeDocIdStream.to = minDocIDRunEnd;
-        collector.collect(rangeDocIdStream);
-        return minDocIDRunEnd;
-      }
-    }
-
-    int bitsetWindowMax = (int) Math.min(max, (long) min + WINDOW_SIZE);
-
+    // Partition clauses of the conjunction into:
+    // - clauses that don't fully match the first half of the window and get evaluated via
+    //   #loadIntoBitSet or leaf-frog,
+    // - other clauses that are used to compute the greatest possible window size that they fully
+    //   match.
+    // This logic helps align scoring windows with the natural #docIDRunEnd() boundaries of the
+    // data, which helps evaluate fewer clauses per window - without allowing windows to become too
+    // small thanks to the WINDOW_SIZE/2 threshold.
+    int minDocIDRunEnd = max;
     for (DisiWrapper w : iterators) {
-      if (w.docID() > min || w.docIDRunEnd() < bitsetWindowMax) {
+      int docIdRunEnd = w.docIDRunEnd();
+      if (w.docID() > min || (docIdRunEnd - min) < WINDOW_SIZE / 2) {
         windowApproximations.add(w.approximation());
         if (w.twoPhase() != null) {
           windowTwoPhases.add(w.twoPhase());
         }
+      } else {
+        minDocIDRunEnd = Math.min(minDocIDRunEnd, docIdRunEnd);
       }
     }
+
+    if (acceptDocs == null && windowApproximations.isEmpty()) {
```

Review Comment: If accept docs are not null, we shouldn't call `intoBitSet` on any clause. However, we'll stick to a window of size 4,096 and convert the accept docs `Bits` into a bit set using `Bits#applyMask`.
We may be able to do better if deletions are extremely sparse, but I couldn't think of an obvious way of handling it and I'm not sure how much this case is worth optimizing.
Re: [PR] Add a Faiss codec for KNN searches [lucene]
github-actions[bot] commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2752825072 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] knn search - add tests to perform exact search when filtering does not return enough results [lucene]
benwtrent merged PR #14274: URL: https://github.com/apache/lucene/pull/14274
[I] New testMinMaxScalarQuantize tests failing repeatably [lucene]
benwtrent opened a new issue, #14402: URL: https://github.com/apache/lucene/issues/14402

### Description

```
TestVectorUtilSupport > testMinMaxScalarQuantize {p0=4096} FAILED
    java.lang.AssertionError:
    Expected: a numeric value within <0.004096> of <762.170654296875>
         but: <762.1751708984375> differed by <4.20601562502E-4> more than delta <0.004096>
        at __randomizedtesting.SeedInfo.seed([C8353109B2DC21F4:77130CBE78E61421]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
        at org.junit.Assert.assertThat(Assert.java:964)
        at org.junit.Assert.assertThat(Assert.java:930)
        at org.apache.lucene.internal.vectorization.TestVectorUtilSupport.assertFloatReturningProviders(TestVectorUtilSupport.java:213)
        at org.apache.lucene.internal.vectorization.TestVectorUtilSupport.testMinMaxScalarQuantize(TestVectorUtilSupport.java:206)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
```

Seems like a simple epsilon correction.

### Gradle command to reproduce

```
./gradlew :lucene:core:test --tests "org.apache.lucene.internal.vectorization.TestVectorUtilSupport.testMinMaxScalarQuantize {p0=4096}" -Ptests.jvms=5 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1" -Ptests.seed=C8353109B2DC21F4 -Ptests.useSecurityManager=true -Ptests.gui=false -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=512 -Ptests.forceintegervectors=true
```
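The "epsilon correction" here has to absorb the fact that floating-point addition is not associative: a vectorized implementation accumulates across SIMD lanes and combines partial sums at the end, so it can legitimately differ from the scalar loop by an amount that grows with the vector length (4096 in the failing parameterization). A minimal, self-contained Java sketch of that effect (hypothetical class name, unrelated to Lucene's actual test code):

```java
// Demonstrates that reordering float additions changes the result:
// a lane-wise (SIMD-style) accumulation differs from a plain scalar loop,
// so tests comparing the two need a delta that scales with vector length.
public class LaneSumDemo {
  // Straightforward sequential accumulation.
  static float scalarSum(float[] v) {
    float s = 0f;
    for (float x : v) s += x;
    return s;
  }

  // Mimics an 8-lane SIMD reduction: 8 partial accumulators, combined at the end.
  static float laneSum(float[] v) {
    float[] lanes = new float[8];
    for (int i = 0; i < v.length; i++) lanes[i & 7] += v[i];
    float s = 0f;
    for (float l : lanes) s += l;
    return s;
  }

  public static void main(String[] args) {
    java.util.Random r = new java.util.Random(1234);
    float[] v = new float[4096];
    for (int i = 0; i < v.length; i++) v[i] = r.nextFloat();
    // The two sums are mathematically equal but numerically different.
    System.out.println(Math.abs(scalarSum(v) - laneSum(v)));
  }
}
```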
Re: [I] New testMinMaxScalarQuantize tests failing repeatably [lucene]
benwtrent commented on issue #14402: URL: https://github.com/apache/lucene/issues/14402#issuecomment-2752192712 @thecoop ping ;)
Re: [PR] Make PointValues.intersect iterative instead of recursive [lucene]
jpountz commented on PR #14391: URL: https://github.com/apache/lucene/pull/14391#issuecomment-2752592897 Nightly benchmarks report a tiny slowdown for IntNRQ and CountFilteredIntNRQ (https://benchmarks.mikemccandless.com/2025.03.24.18.05.19.html); nevertheless, I agree with your point that it's better to make this logic iterative rather than recursive.
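For readers wondering what "iterative instead of recursive" buys: a recursive tree traversal consumes one Java call-stack frame per tree level, while the iterative form keeps its own worklist on the heap. A generic, self-contained sketch over a toy binary tree (not Lucene's actual BKD `PointValues.intersect` code):

```java
import java.util.ArrayDeque;

public class IterativeTraversal {
  record Node(int value, Node left, Node right) {}

  // Recursive version: call-stack depth equals tree depth,
  // which risks StackOverflowError on very deep trees.
  static int sumRecursive(Node n) {
    if (n == null) return 0;
    return n.value() + sumRecursive(n.left()) + sumRecursive(n.right());
  }

  // Iterative version with an explicit stack: same set of nodes visited,
  // but the worklist lives on the heap instead of the call stack.
  static int sumIterative(Node root) {
    int sum = 0;
    ArrayDeque<Node> stack = new ArrayDeque<>();
    if (root != null) stack.push(root);
    while (!stack.isEmpty()) {
      Node n = stack.pop();
      sum += n.value();
      if (n.left() != null) stack.push(n.left());
      if (n.right() != null) stack.push(n.right());
    }
    return sum;
  }
}
```

The visit order differs slightly between the two, but both touch every node exactly once; the iterative form trades recursion depth for a heap-allocated deque, which is the same transformation the PR applies to the BKD traversal.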
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2751009070 @dweiss thank you so much for that starter commit for evaluation. I will try it tonight, fire up Eclipse, and see what our options are. I finally finished the parser (https://github.com/rmuir/tree-sitter-javadoc) and am now looking at options to (ab)use it to convert our docs to markdown automatically, when I discovered that google-java-format will mess up markdown comments: e.g. a too-long `///` line will wrap onto another line starting with `//` and break everything. So currently there is no way to escape from the formatter, because neither `@snippet` nor markdown can work with it, and thus no way to make a patch that passes the build.
Re: [PR] Make DenseConjunctionBulkScorer align scoring windows with #docIDRunEnd(). [lucene]
gf2121 commented on code in PR #14400: URL: https://github.com/apache/lucene/pull/14400#discussion_r2012151639

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:

@@ -171,27 +171,30 @@ private int scoreWindow(
      }
    }

-    if (acceptDocs == null) {
-      int minDocIDRunEnd = max;
-      for (DisiWrapper w : iterators) {
-        if (w.docID() > min) {
-          minDocIDRunEnd = min;
-          break;
-        } else {
-          minDocIDRunEnd = Math.min(minDocIDRunEnd, w.docIDRunEnd());
+    int minDocIDRunEnd = max;
+    int bitsetWindowMax = (int) Math.min(max, (long) min + WINDOW_SIZE);
+    for (DisiWrapper w : iterators) {
+      if (w.docID() > min) {
+        minDocIDRunEnd = min;
+      } else {
+        int docIDRunEnd = w.docIDRunEnd();
+        minDocIDRunEnd = Math.min(minDocIDRunEnd, docIDRunEnd);
+        // If we can find one clause that matches over more than half the window then we truncate
+        // the window to the run end of this clause as the benefits of evaluating one less clause
+        // likely dominate the overhead of using a smaller window.
+        if (docIDRunEnd - min >= WINDOW_SIZE / 2) {
+          bitsetWindowMax = Math.min(bitsetWindowMax, docIDRunEnd);
        }
      }
-
-      if (minDocIDRunEnd - min >= WINDOW_SIZE / 2) {
-        // We have a large range of doc IDs that all match.
-        rangeDocIdStream.from = min;
-        rangeDocIdStream.to = minDocIDRunEnd;
-        collector.collect(rangeDocIdStream);
-        return minDocIDRunEnd;
-      }
    }

-    int bitsetWindowMax = (int) Math.min(max, (long) min + WINDOW_SIZE);
+    if (acceptDocs == null && minDocIDRunEnd >= bitsetWindowMax) {

Review Comment: Could `minDocIDRunEnd` ever be bigger than `bitsetWindowMax`?
Re: [PR] Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset [lucene]
benwtrent commented on code in PR #14304: URL: https://github.com/apache/lucene/pull/14304#discussion_r201706

## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:

@@ -334,4 +334,45 @@ public static int findNextGEQ(int[] buffer, int target, int from, int to) {
    assert IntStream.range(0, to - 1).noneMatch(i -> buffer[i] > buffer[i + 1]);
    return IMPL.findNextGEQ(buffer, target, from, to);
  }
+
+  /**
+   * Quantizes {@code vector}, putting the result into {@code dest}.
+   *
+   * @param vector the vector to quantize
+   * @param dest the destination vector, can be null
+   * @param scale the scaling factor
+   * @param alpha the alpha value
+   * @param minQuantile the lower quantile of the distribution
+   * @param maxQuantile the upper quantile of the distribution
+   * @return the corrective offset that needs to be applied to the score
+   */
+  public static float quantize(

Review Comment: lets unify the name here with the implementation

## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java (same hunk, continued):

+  public static float quantize(
+      float[] vector, byte[] dest, float scale, float alpha, float minQuantile, float maxQuantile) {
+    assert vector.length == dest.length;

Review Comment: Let's throw an actual error here, illegal argument?
## lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultVectorUtilSupport.java:

@@ -234,4 +234,79 @@ public static long int4BitDotProductImpl(byte[] q, byte[] d) {
    }
    return ret;
  }
+
+  @Override
+  public float minMaxScalarQuantize(
+      float[] vector, byte[] dest, float scale, float alpha, float minQuantile, float maxQuantile) {
+    return new ScalarQuantizer(alpha, scale, minQuantile, maxQuantile).quantize(vector, dest, 0);
+  }
+
+  @Override
+  public float recalculateScalarQuantizationOffset(
+      byte[] vector,
+      float oldAlpha,
+      float oldMinQuantile,
+      float scale,
+      float alpha,
+      float minQuantile,
+      float maxQuantile) {
+    return new ScalarQuantizer(alpha, scale, minQuantile, maxQuantile)
+        .recalculateOffset(vector, 0, oldAlpha, oldMinQuantile);
+  }
+
+  static class ScalarQuantizer {

Review Comment: private?

## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:

+  /**
+   * Quantizes {@code vector}, putting the result into {@code dest}.

Review Comment: Scalar quantizes.

## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:

@@ -907,4 +907,97 @@ public static long int4BitDotProduct128(byte[] q, byte[] d) {
    }
    return subRet0 + (subRet1 << 1) + (subRet2 << 2) + (subRet3 << 3);
  }
+
+  @Override
+  public float minMaxScalarQuantize(
+      float[] vector, byte[] dest, float scale, float alpha, float minQuantile, float maxQuantile) {
+    float correction = 0;

Review Comment: lets add an assert here on vector.length & dest.length. Earlier up stream, we should throw an actual production error.
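For context on what these methods compute: min-max scalar quantization clamps each float component to `[minQuantile, maxQuantile]` and maps that range linearly onto a small integer range. Below is a simplified, self-contained Java sketch with hypothetical names, following the review suggestion to throw `IllegalArgumentException` on a length mismatch rather than relying on an `assert`; it deliberately omits Lucene's alpha-based corrective-offset computation:

```java
// Hypothetical simplified min-max scalar quantizer; not Lucene's actual
// ScalarQuantizer (which also returns a corrective offset for scoring).
public class MinMaxQuantizeSketch {
  // Quantizes vector into dest: each component is clamped to
  // [minQuantile, maxQuantile] and mapped linearly onto [0, 127].
  static void quantize(float[] vector, byte[] dest, float minQuantile, float maxQuantile) {
    // Production error rather than an assert, per the review comment above.
    if (vector.length != dest.length) {
      throw new IllegalArgumentException(
          "vector and dest must have the same length: " + vector.length + " != " + dest.length);
    }
    float scale = 127f / (maxQuantile - minQuantile);
    for (int i = 0; i < vector.length; i++) {
      float clamped = Math.max(minQuantile, Math.min(maxQuantile, vector[i]));
      dest[i] = (byte) Math.round((clamped - minQuantile) * scale);
    }
  }
}
```

An assert only fires when assertions are enabled (`-ea`), which is typically the case in tests but not in production; an `IllegalArgumentException` catches the mismatch unconditionally, which is the point of the review comment.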
Re: [I] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 commented on issue #14362: URL: https://github.com/apache/lucene/issues/14362#issuecomment-2753129754 Thanks @vigyasharma. Your understanding is correct (**This is specifically a problem for segment replication**). From an implementation point of view, similar to the current `IndexWriter#advanceSegmentInfosVersion`, I want to provide an `IndexWriter#advanceSegmentInfosCounter`. It may be more flexible to use; do you think it could work this way? I am willing to submit a PR for your review. Looking forward to your reply.
Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]
vsop-479 commented on PR #14371: URL: https://github.com/apache/lucene/pull/14371#issuecomment-2753155562

> a bench in jmh will be great.

I measured it with `AdvanceSparseDISIBenchmark`:

```
Benchmark                                             Mode  Cnt    Score   Error  Units
AdvanceSparseDISIBenchmark.advance                   thrpt   15  669.502 ± 4.531  ops/ms
AdvanceSparseDISIBenchmark.advanceBinarySearch       thrpt   15  358.620 ± 1.102  ops/ms
AdvanceSparseDISIBenchmark.advanceExact              thrpt   15  752.444 ± 1.810  ops/ms
AdvanceSparseDISIBenchmark.advanceExactBinarySearch  thrpt   15  547.818 ± 2.278  ops/ms
```

Even when I set the target docs' interval to 10, there is still a big performance degradation. Maybe I used too many `disi.slice.seek` calls in this binary search version.

> you may find we are using VectorMask to speed up this, that was what i had in mind - get a MemorySegment slice if it is not null, and play it with VectorMask.

I will try `VectorMask` when I get a chance.
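For reference, the two strategies being benchmarked boil down to how `advance(target)` locates the first stored doc ID that is `>= target` inside a sorted sparse block. A simplified, self-contained Java sketch (an `int[]` stands in for the on-disk block; the real code reads from an `IndexInput` slice, which is where the `disi.slice.seek` cost comes from):

```java
import java.util.Arrays;

// Sketch of advancing within a sparse block: doc IDs are stored sorted, and
// advance(target) must find the index of the first stored doc >= target.
public class SparseAdvance {
  // Linear scan: touches every entry sequentially from the current position.
  static int advanceLinear(int[] docs, int from, int target) {
    int i = from;
    while (i < docs.length && docs[i] < target) i++;
    return i; // index of first doc >= target, or docs.length if none
  }

  // Binary search: O(log n) comparisons, but each probe is a random access.
  static int advanceBinary(int[] docs, int from, int target) {
    int idx = Arrays.binarySearch(docs, from, docs.length, target);
    // binarySearch returns -(insertionPoint) - 1 when target is absent.
    return idx >= 0 ? idx : -idx - 1;
  }
}
```

Binary search wins asymptotically, but each probe is a random access (a `seek` when reading from a slice), while the linear scan reads sequentially; on small blocks or short advances the scan can be cheaper, which is consistent with the throughput numbers above.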
Re: [PR] Optimize slice calculation in IndexSearcher a little [lucene]
github-actions[bot] commented on PR #13860: URL: https://github.com/apache/lucene/pull/13860#issuecomment-2752825465 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Use FixedLengthBytesRefArray in OneDimensionBKDWriter to hold split values [lucene]
iverase merged PR #14383: URL: https://github.com/apache/lucene/pull/14383