Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]
kaivalnp commented on PR #13285: URL: https://github.com/apache/lucene/pull/13285#issuecomment-2268639204 There was a conflict in `CHANGES.txt` after a recent commit, merged from `main` and resolved that @vigyasharma I've tried to address all open comments, please let me know if something is missing -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] CandidateMatcher public matching functions [lucene]
bjacobowitz opened a new pull request, #13632: URL: https://github.com/apache/lucene/pull/13632 ### Description In `CandidateMatcher`, increase visibility of `matchQuery()` (from protected), `finish()` and `reportError()` (from package-private) to public, so that a `CandidateMatcher` can be used effectively from outside the `org.apache.lucene.monitor` package. ### Problem The current access protections on `CandidateMatcher` make it infeasible for a user of the Lucene Monitor library to use an existing `CandidateMatcher` if they are outside the `org.apache.lucene.monitor` package (e.g. as part of the implementation of their own `CandidateMatcher`). For example, a user working outside of the `org.apache.lucene.monitor` package could not build their own version of `ParallelMatcher` by making use of the matcher from `QueryMatch.SIMPLE_MATCHER`, because they cannot access `matchQuery`, `reportError` or `finish` on it. ### Solution This PR takes the simplest solution of increase the visibility of those functions in `CandidateMatcher` to `public`. This also requires modifying some existing `CandidateMatcher` implementations that override `protected matchQuery()` to instead override `public matchQuery()`. I've also added some just-compile tests in a subpackage `org.apache.lucene.monitor.otherpackage`, which call the newly-public functions to verify that they are accessible from outside the `org.apache.lucene.monitor` package. See previous discussion of this approach and alternatives in #13109. ### Merge Request If this PR gets merged, can you please use my `bjacobowi...@bloomberg.net` email address for the squash+merge. Thank you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add WrappedCandidateMatcher for composing matchers [lucene]
bjacobowitz commented on PR #13109: URL: https://github.com/apache/lucene/pull/13109#issuecomment-2269027662 Closing, see PR #13632 with new approach -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add WrappedCandidateMatcher for composing matchers [lucene]
bjacobowitz closed pull request #13109: Add WrappedCandidateMatcher for composing matchers URL: https://github.com/apache/lucene/pull/13109 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Get better cost estimate on MultiTermQuery over few terms [lucene]
gsmiller commented on code in PR #13201: URL: https://github.com/apache/lucene/pull/13201#discussion_r1704183629 ## lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java: ## @@ -240,9 +219,38 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti return null; } - final long cost = estimateCost(terms, q.getTermsCount()); - IOSupplier weightOrIteratorSupplier = rewrite(context, terms); - if (weightOrIteratorSupplier == null) return null; + assert terms != null; + + final int fieldDocCount = terms.getDocCount(); + final TermsEnum termsEnum = q.getTermsEnum(terms); + assert termsEnum != null; + + List collectedTerms = new ArrayList<>(); + boolean collectResult = collectTerms(fieldDocCount, termsEnum, collectedTerms); + + if (collectResult && collectedTerms.isEmpty()) return null; Review Comment: nit: stylistically at least, I think we generally prefer always using braces even for one-liners. Mind including them here? ## lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java: ## @@ -240,9 +219,38 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti return null; } - final long cost = estimateCost(terms, q.getTermsCount()); - IOSupplier weightOrIteratorSupplier = rewrite(context, terms); - if (weightOrIteratorSupplier == null) return null; + assert terms != null; + + final int fieldDocCount = terms.getDocCount(); + final TermsEnum termsEnum = q.getTermsEnum(terms); + assert termsEnum != null; + + List collectedTerms = new ArrayList<>(); + boolean collectResult = collectTerms(fieldDocCount, termsEnum, collectedTerms); + + if (collectResult && collectedTerms.isEmpty()) return null; + + final long cost; + if (collectResult) { +long sumTermCost = 0; +for (TermAndState collectedTerm : collectedTerms) { + sumTermCost += collectedTerm.docFreq; Review Comment: Just sort of thinking out loud here, but I wonder how expensive it is to build the boolean query and grab its weight / scoreSupplier at this point (instead of lazily doing it in `scoreSupplier#get`)? If it's cheap, another option would be to invoke `rewriteAsBooleanQuery`, pull its score supplier and then delegate `#cost` to it. That way if the cost logic in `BooleanQuery` evolves in the future, this benefits as well. Probably not worth messing with this now, but maybe worth a TODO comment in here? That is, unless we know it is prohibitively expensive. ## lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java: ## @@ -240,9 +219,38 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti return null; } - final long cost = estimateCost(terms, q.getTermsCount()); - IOSupplier weightOrIteratorSupplier = rewrite(context, terms); - if (weightOrIteratorSupplier == null) return null; + assert terms != null; + + final int fieldDocCount = terms.getDocCount(); + final TermsEnum termsEnum = q.getTermsEnum(terms); + assert termsEnum != null; + + List collectedTerms = new ArrayList<>(); + boolean collectResult = collectTerms(fieldDocCount, termsEnum, collectedTerms); + + if (collectResult && collectedTerms.isEmpty()) return null; + + final long cost; + if (collectResult) { Review Comment: We end up checking `collectResult` twice in the common case here. We could refactor this slightly to have one conditional block for when `collectResult == true` instead (and pull the `isEmpty` short-circuit check into it). WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix ScalarQuantization when used with COSINE similarity [lucene]
benwtrent merged PR #13615: URL: https://github.com/apache/lucene/pull/13615 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] int4 doesn't handle `cosine` similarity correctly [lucene]
benwtrent closed issue #13614: int4 doesn't handle `cosine` similarity correctly URL: https://github.com/apache/lucene/issues/13614 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Slightly speed up decoding blocks of postings/freqs/positions. [lucene]
gsmiller commented on code in PR #13631: URL: https://github.com/apache/lucene/pull/13631#discussion_r1704396687 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForDeltaUtil.java: ## @@ -41,10 +54,275 @@ private static void prefixSumOfOnes(long[] arr, long base) { } } - private final ForUtil forUtil; + private static long expandMask32(long mask32) { +return mask32 | (mask32 << 32); + } + + private static long expandMask16(long mask16) { +return expandMask32(mask16 | (mask16 << 16)); + } + + private static long mask32(int bitsPerValue) { +return expandMask32((1L << bitsPerValue) - 1); + } + + private static long mask16(int bitsPerValue) { +return expandMask16((1L << bitsPerValue) - 1); + } + + private static void expand16(long[] arr) { +for (int i = 0; i < 32; ++i) { + long l = arr[i]; + arr[i] = (l >>> 48) & 0xL; + arr[32 + i] = (l >>> 32) & 0xL; + arr[64 + i] = (l >>> 16) & 0xL; + arr[96 + i] = l & 0xL; +} + } + + private static void collapse16(long[] arr) { +for (int i = 0; i < 32; ++i) { + arr[i] = (arr[i] << 48) | (arr[32 + i] << 32) | (arr[64 + i] << 16) | arr[96 + i]; +} + } + + private static void expand32(long[] arr) { +for (int i = 0; i < 64; ++i) { + long l = arr[i]; + arr[i] = l >>> 32; + arr[64 + i] = l & 0xL; +} + } + + private static void collapse32(long[] arr) { +for (int i = 0; i < 64; ++i) { + arr[i] = (arr[i] << 32) | arr[64 + i]; +} + } + + private static void prefixSum16(long[] arr, long base) { +// When the number of bits per value is 9 or less, we can sum up all values in a block without +// risking overflowing a 16-bits integer. This allows computing the prefix sum by summing up 4 +// values at once. +innerPrefixSum16(arr); +expand16(arr); +final long l0 = base; +final long l1 = l0 + arr[ONE_BLOCK_SIZE_FOURTH - 1]; +final long l2 = l1 + arr[TWO_BLOCK_SIZE_FOURTHS - 1]; +final long l3 = l2 + arr[THREE_BLOCK_SIZE_FOURTHS - 1]; + +for (int i = 0; i < ONE_BLOCK_SIZE_FOURTH; ++i) { + arr[i] += l0; + arr[ONE_BLOCK_SIZE_FOURTH + i] += l1; + arr[TWO_BLOCK_SIZE_FOURTHS + i] += l2; + arr[THREE_BLOCK_SIZE_FOURTHS + i] += l3; +} + } + + private static void prefixSum32(long[] arr, long base) { +arr[0] += base << 32; Review Comment: nit: I wonder if we should make our handling of "base" consistent between this approach and the new 4-per-long approach at some point? Maybe it doesn't matter? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Add float|byte vector support to memory index [lucene]
benwtrent opened a new pull request, #13633: URL: https://github.com/apache/lucene/pull/13633 This adds the capability to write and read knn vectors from a memory index. It currently enforces the lower level codec restriction of only one knn vector per field and it does not support nearest neighbor search (seemed unnecessary to me as it will always be the same doc...). closes: https://github.com/apache/lucene/issues/13584 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]
vigyasharma commented on code in PR #13285: URL: https://github.com/apache/lucene/pull/13285#discussion_r1704745053 ## lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java: ## @@ -475,6 +479,62 @@ public void testApproximate() throws IOException { } } + /** Test that the query times out correctly. */ + public void testTimeout() throws IOException { +V[] vectors = getRandomVectors(numDocs, dim); +V queryVector = getRandomVector(dim); + +try (Directory indexStore = getIndexStore(vectors); +IndexReader reader = DirectoryReader.open(indexStore)) { + IndexSearcher searcher = newSearcher(reader); + + // This query is cacheable, explicitly prevent it + searcher.setQueryCache(null); + + Query query = + new CountingQuery( + getVectorQuery( + vectorField, + queryVector, + Float.NEGATIVE_INFINITY, + Float.NEGATIVE_INFINITY, + null)); + + assertEquals(numDocs, searcher.count(query)); // Expect some results without timeout + + searcher.setTimeout(() -> true); // Immediately timeout + assertEquals(0, searcher.count(query)); // Expect no results with the timeout + + searcher.setTimeout(new CountingQueryTimeout(numDocs - 1)); // Do not score all docs + int count = searcher.count(query); + assertTrue( + "0 < count=" + count + " < numDocs=" + numDocs, + count > 0 && count < numDocs); // Expect partial results + + // Test timeout with filter + int numFiltered = random().nextInt(numDocs / 2, numDocs); + Query filter = IntField.newSetQuery(idField, getFiltered(numFiltered)); + Query filteredQuery = + new CountingQuery( + getVectorQuery( + vectorField, + queryVector, + Float.NEGATIVE_INFINITY, + Float.NEGATIVE_INFINITY, + filter)); + + searcher.setTimeout(() -> false); // Set a timeout which is never met + assertEquals(numFiltered, searcher.count(filteredQuery)); + + searcher.setTimeout( + new CountingQueryTimeout(numFiltered - 1)); // Timeout before scoring all filtered docs + int filteredCount = searcher.count(filteredQuery); + assertTrue( + "0 < filteredCount=" + filteredCount + " < numFiltered=" + numFiltered, + filteredCount > 0 && filteredCount < numFiltered); // Expect partial results Review Comment: So this tests for cases where we timeout before exhausting the filter, nice! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]
vigyasharma merged PR #13285: URL: https://github.com/apache/lucene/pull/13285 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add reopen method in PerThreadPKLookup [lucene]
vsop-479 commented on code in PR #13596: URL: https://github.com/apache/lucene/pull/13596#discussion_r1701382048 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/PerThreadPKLookup.java: ## @@ -97,5 +111,82 @@ public int lookup(BytesRef id) throws IOException { return -1; } - // TODO: add reopen method to carry over re-used enums...? + /** + * Reuse loaded segments' termsEnum and postingsEnum. + * + * @return null if there are no changes; else, a new DirectoryReader instance which you must + * eventually close. + * @throws IOException if there is a low-level IO error. + */ + public DirectoryReader reopen() throws IOException { Review Comment: Addressed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [KNN] Add comment and remove duplicate code [lucene]
kaivalnp commented on code in PR #13594: URL: https://github.com/apache/lucene/pull/13594#discussion_r1704945107 ## lucene/core/src/java/org/apache/lucene/search/KnnQueryUtils.java: ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import org.apache.lucene.util.BitSet; +import org.apache.lucene.util.BitSetIterator; +import org.apache.lucene.util.Bits; + +/** Common utilities for KNN queries. */ +class KnnQueryUtils { Review Comment: Q: Should we name it something like `Ann*` (Approximate Near Neighbor *) to fit with `*VectorSimilarityQuery` too? (`Knn*` sounds specific to `Knn*VectorQuery`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]
kaivalnp commented on PR #13285: URL: https://github.com/apache/lucene/pull/13285#issuecomment-2270443993 Thank you @vigyasharma! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org