Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]

2024-08-05 Thread via GitHub


kaivalnp commented on PR #13285:
URL: https://github.com/apache/lucene/pull/13285#issuecomment-2268639204

   There was a conflict in `CHANGES.txt` after a recent commit, merged from 
`main` and resolved that
   @vigyasharma I've tried to address all open comments, please let me know if 
something is missing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] CandidateMatcher public matching functions [lucene]

2024-08-05 Thread via GitHub


bjacobowitz opened a new pull request, #13632:
URL: https://github.com/apache/lucene/pull/13632

   ### Description
   
   In `CandidateMatcher`, increase visibility of `matchQuery()` (from 
protected), `finish()` and `reportError()` (from package-private) to public, so 
that a `CandidateMatcher` can be used effectively from outside the 
`org.apache.lucene.monitor` package.
   
   ### Problem
   
   The current access protections on `CandidateMatcher` make it infeasible for 
a user of the Lucene Monitor library to use an existing `CandidateMatcher` if 
they are outside the `org.apache.lucene.monitor` package (e.g. as part of the 
implementation of their own `CandidateMatcher`).
   
   For example, a user working outside of the `org.apache.lucene.monitor` 
package could not build their own version of `ParallelMatcher` by making use of 
the matcher from `QueryMatch.SIMPLE_MATCHER`, because they cannot access 
`matchQuery`, `reportError` or `finish` on it.
   
   ### Solution
   
   This PR takes the simplest solution of increase the visibility of those 
functions in `CandidateMatcher` to `public`. This also requires modifying some 
existing `CandidateMatcher` implementations that override `protected 
matchQuery()` to instead override `public matchQuery()`. 
   
   I've also added some just-compile tests in a subpackage 
`org.apache.lucene.monitor.otherpackage`, which call the newly-public functions 
to verify that they are accessible from outside the `org.apache.lucene.monitor` 
package.
   
   See previous discussion of this approach and alternatives in #13109.
   
   ### Merge Request
   
   If this PR gets merged, can you please use my `bjacobowi...@bloomberg.net` 
email address for the squash+merge. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add WrappedCandidateMatcher for composing matchers [lucene]

2024-08-05 Thread via GitHub


bjacobowitz commented on PR #13109:
URL: https://github.com/apache/lucene/pull/13109#issuecomment-2269027662

   Closing, see PR #13632 with new approach


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add WrappedCandidateMatcher for composing matchers [lucene]

2024-08-05 Thread via GitHub


bjacobowitz closed pull request #13109: Add WrappedCandidateMatcher for 
composing matchers
URL: https://github.com/apache/lucene/pull/13109


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Get better cost estimate on MultiTermQuery over few terms [lucene]

2024-08-05 Thread via GitHub


gsmiller commented on code in PR #13201:
URL: https://github.com/apache/lucene/pull/13201#discussion_r1704183629


##
lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java:
##
@@ -240,9 +219,38 @@ public ScorerSupplier scorerSupplier(LeafReaderContext 
context) throws IOExcepti
 return null;
   }
 
-  final long cost = estimateCost(terms, q.getTermsCount());
-  IOSupplier weightOrIteratorSupplier = 
rewrite(context, terms);
-  if (weightOrIteratorSupplier == null) return null;
+  assert terms != null;
+
+  final int fieldDocCount = terms.getDocCount();
+  final TermsEnum termsEnum = q.getTermsEnum(terms);
+  assert termsEnum != null;
+
+  List collectedTerms = new ArrayList<>();
+  boolean collectResult = collectTerms(fieldDocCount, termsEnum, 
collectedTerms);
+
+  if (collectResult && collectedTerms.isEmpty()) return null;

Review Comment:
   nit: stylistically at least, I think we generally prefer always using braces 
even for one-liners. Mind including them here?



##
lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java:
##
@@ -240,9 +219,38 @@ public ScorerSupplier scorerSupplier(LeafReaderContext 
context) throws IOExcepti
 return null;
   }
 
-  final long cost = estimateCost(terms, q.getTermsCount());
-  IOSupplier weightOrIteratorSupplier = 
rewrite(context, terms);
-  if (weightOrIteratorSupplier == null) return null;
+  assert terms != null;
+
+  final int fieldDocCount = terms.getDocCount();
+  final TermsEnum termsEnum = q.getTermsEnum(terms);
+  assert termsEnum != null;
+
+  List collectedTerms = new ArrayList<>();
+  boolean collectResult = collectTerms(fieldDocCount, termsEnum, 
collectedTerms);
+
+  if (collectResult && collectedTerms.isEmpty()) return null;
+
+  final long cost;
+  if (collectResult) {
+long sumTermCost = 0;
+for (TermAndState collectedTerm : collectedTerms) {
+  sumTermCost += collectedTerm.docFreq;

Review Comment:
   Just sort of thinking out loud here, but I wonder how expensive it is to 
build the boolean query and grab its weight / scoreSupplier at this point 
(instead of lazily doing it in `scoreSupplier#get`)? If it's cheap, another 
option would be to invoke `rewriteAsBooleanQuery`, pull its score supplier and 
then delegate `#cost` to it. That way if the cost logic in `BooleanQuery` 
evolves in the future, this benefits as well. Probably not worth messing with 
this now, but maybe worth a TODO comment in here? That is, unless we know it is 
prohibitively expensive.



##
lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java:
##
@@ -240,9 +219,38 @@ public ScorerSupplier scorerSupplier(LeafReaderContext 
context) throws IOExcepti
 return null;
   }
 
-  final long cost = estimateCost(terms, q.getTermsCount());
-  IOSupplier weightOrIteratorSupplier = 
rewrite(context, terms);
-  if (weightOrIteratorSupplier == null) return null;
+  assert terms != null;
+
+  final int fieldDocCount = terms.getDocCount();
+  final TermsEnum termsEnum = q.getTermsEnum(terms);
+  assert termsEnum != null;
+
+  List collectedTerms = new ArrayList<>();
+  boolean collectResult = collectTerms(fieldDocCount, termsEnum, 
collectedTerms);
+
+  if (collectResult && collectedTerms.isEmpty()) return null;
+
+  final long cost;
+  if (collectResult) {

Review Comment:
   We end up checking `collectResult` twice in the common case here. We could 
refactor this slightly to have one conditional block for when `collectResult == 
true` instead (and pull the `isEmpty` short-circuit check into it). WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix ScalarQuantization when used with COSINE similarity [lucene]

2024-08-05 Thread via GitHub


benwtrent merged PR #13615:
URL: https://github.com/apache/lucene/pull/13615


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] int4 doesn't handle `cosine` similarity correctly [lucene]

2024-08-05 Thread via GitHub


benwtrent closed issue #13614: int4 doesn't handle `cosine` similarity correctly
URL: https://github.com/apache/lucene/issues/13614


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Slightly speed up decoding blocks of postings/freqs/positions. [lucene]

2024-08-05 Thread via GitHub


gsmiller commented on code in PR #13631:
URL: https://github.com/apache/lucene/pull/13631#discussion_r1704396687


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForDeltaUtil.java:
##
@@ -41,10 +54,275 @@ private static void prefixSumOfOnes(long[] arr, long base) 
{
 }
   }
 
-  private final ForUtil forUtil;
+  private static long expandMask32(long mask32) {
+return mask32 | (mask32 << 32);
+  }
+
+  private static long expandMask16(long mask16) {
+return expandMask32(mask16 | (mask16 << 16));
+  }
+
+  private static long mask32(int bitsPerValue) {
+return expandMask32((1L << bitsPerValue) - 1);
+  }
+
+  private static long mask16(int bitsPerValue) {
+return expandMask16((1L << bitsPerValue) - 1);
+  }
+
+  private static void expand16(long[] arr) {
+for (int i = 0; i < 32; ++i) {
+  long l = arr[i];
+  arr[i] = (l >>> 48) & 0xL;
+  arr[32 + i] = (l >>> 32) & 0xL;
+  arr[64 + i] = (l >>> 16) & 0xL;
+  arr[96 + i] = l & 0xL;
+}
+  }
+
+  private static void collapse16(long[] arr) {
+for (int i = 0; i < 32; ++i) {
+  arr[i] = (arr[i] << 48) | (arr[32 + i] << 32) | (arr[64 + i] << 16) | 
arr[96 + i];
+}
+  }
+
+  private static void expand32(long[] arr) {
+for (int i = 0; i < 64; ++i) {
+  long l = arr[i];
+  arr[i] = l >>> 32;
+  arr[64 + i] = l & 0xL;
+}
+  }
+
+  private static void collapse32(long[] arr) {
+for (int i = 0; i < 64; ++i) {
+  arr[i] = (arr[i] << 32) | arr[64 + i];
+}
+  }
+
+  private static void prefixSum16(long[] arr, long base) {
+// When the number of bits per value is 9 or less, we can sum up all 
values in a block without
+// risking overflowing a 16-bits integer. This allows computing the prefix 
sum by summing up 4
+// values at once.
+innerPrefixSum16(arr);
+expand16(arr);
+final long l0 = base;
+final long l1 = l0 + arr[ONE_BLOCK_SIZE_FOURTH - 1];
+final long l2 = l1 + arr[TWO_BLOCK_SIZE_FOURTHS - 1];
+final long l3 = l2 + arr[THREE_BLOCK_SIZE_FOURTHS - 1];
+
+for (int i = 0; i < ONE_BLOCK_SIZE_FOURTH; ++i) {
+  arr[i] += l0;
+  arr[ONE_BLOCK_SIZE_FOURTH + i] += l1;
+  arr[TWO_BLOCK_SIZE_FOURTHS + i] += l2;
+  arr[THREE_BLOCK_SIZE_FOURTHS + i] += l3;
+}
+  }
+
+  private static void prefixSum32(long[] arr, long base) {
+arr[0] += base << 32;

Review Comment:
   nit: I wonder if we should make our handling of "base" consistent between 
this approach and the new 4-per-long approach at some point? Maybe it doesn't 
matter?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Add float|byte vector support to memory index [lucene]

2024-08-05 Thread via GitHub


benwtrent opened a new pull request, #13633:
URL: https://github.com/apache/lucene/pull/13633

   This adds the capability to write and read knn vectors from a memory index. 
   
   It currently enforces the lower level codec restriction of only one knn 
vector per field and it does not support nearest neighbor search (seemed 
unnecessary to me as it will always be the same doc...).
   
   closes: https://github.com/apache/lucene/issues/13584


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]

2024-08-05 Thread via GitHub


vigyasharma commented on code in PR #13285:
URL: https://github.com/apache/lucene/pull/13285#discussion_r1704745053


##
lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java:
##
@@ -475,6 +479,62 @@ public void testApproximate() throws IOException {
 }
   }
 
+  /** Test that the query times out correctly. */
+  public void testTimeout() throws IOException {
+V[] vectors = getRandomVectors(numDocs, dim);
+V queryVector = getRandomVector(dim);
+
+try (Directory indexStore = getIndexStore(vectors);
+IndexReader reader = DirectoryReader.open(indexStore)) {
+  IndexSearcher searcher = newSearcher(reader);
+
+  // This query is cacheable, explicitly prevent it
+  searcher.setQueryCache(null);
+
+  Query query =
+  new CountingQuery(
+  getVectorQuery(
+  vectorField,
+  queryVector,
+  Float.NEGATIVE_INFINITY,
+  Float.NEGATIVE_INFINITY,
+  null));
+
+  assertEquals(numDocs, searcher.count(query)); // Expect some results 
without timeout
+
+  searcher.setTimeout(() -> true); // Immediately timeout
+  assertEquals(0, searcher.count(query)); // Expect no results with the 
timeout
+
+  searcher.setTimeout(new CountingQueryTimeout(numDocs - 1)); // Do not 
score all docs
+  int count = searcher.count(query);
+  assertTrue(
+  "0 < count=" + count + " < numDocs=" + numDocs,
+  count > 0 && count < numDocs); // Expect partial results
+
+  // Test timeout with filter
+  int numFiltered = random().nextInt(numDocs / 2, numDocs);
+  Query filter = IntField.newSetQuery(idField, getFiltered(numFiltered));
+  Query filteredQuery =
+  new CountingQuery(
+  getVectorQuery(
+  vectorField,
+  queryVector,
+  Float.NEGATIVE_INFINITY,
+  Float.NEGATIVE_INFINITY,
+  filter));
+
+  searcher.setTimeout(() -> false); // Set a timeout which is never met
+  assertEquals(numFiltered, searcher.count(filteredQuery));
+
+  searcher.setTimeout(
+  new CountingQueryTimeout(numFiltered - 1)); // Timeout before 
scoring all filtered docs
+  int filteredCount = searcher.count(filteredQuery);
+  assertTrue(
+  "0 < filteredCount=" + filteredCount + " < numFiltered=" + 
numFiltered,
+  filteredCount > 0 && filteredCount < numFiltered); // Expect partial 
results

Review Comment:
   So this tests for cases where we timeout before exhausting the filter, nice!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]

2024-08-05 Thread via GitHub


vigyasharma merged PR #13285:
URL: https://github.com/apache/lucene/pull/13285


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add reopen method in PerThreadPKLookup [lucene]

2024-08-05 Thread via GitHub


vsop-479 commented on code in PR #13596:
URL: https://github.com/apache/lucene/pull/13596#discussion_r1701382048


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/PerThreadPKLookup.java:
##
@@ -97,5 +111,82 @@ public int lookup(BytesRef id) throws IOException {
 return -1;
   }
 
-  // TODO: add reopen method to carry over re-used enums...?
+  /**
+   * Reuse loaded segments' termsEnum and postingsEnum.
+   *
+   * @return null if there are no changes; else, a new DirectoryReader 
instance which you must
+   * eventually close.
+   * @throws IOException if there is a low-level IO error.
+   */
+  public DirectoryReader reopen() throws IOException {

Review Comment:
   Addressed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [KNN] Add comment and remove duplicate code [lucene]

2024-08-05 Thread via GitHub


kaivalnp commented on code in PR #13594:
URL: https://github.com/apache/lucene/pull/13594#discussion_r1704945107


##
lucene/core/src/java/org/apache/lucene/search/KnnQueryUtils.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.BitSet;
+import org.apache.lucene.util.BitSetIterator;
+import org.apache.lucene.util.Bits;
+
+/** Common utilities for KNN queries. */
+class KnnQueryUtils {

Review Comment:
   Q: Should we name it something like `Ann*` (Approximate Near Neighbor *) to 
fit with `*VectorSimilarityQuery` too? (`Knn*` sounds specific to 
`Knn*VectorQuery`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]

2024-08-05 Thread via GitHub


kaivalnp commented on PR #13285:
URL: https://github.com/apache/lucene/pull/13285#issuecomment-2270443993

   Thank you @vigyasharma!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org