[PR] Fix resource leak in loadMainDataFromFile [lucene]

2025-05-28 Thread via GitHub


xcx1r3 opened a new pull request, #14727:
URL: https://github.com/apache/lucene/pull/14727

   ### Description
   Use try-with-resources to auto-close DataInputStream
   ```
   try (DataInputStream dctFile =
       new DataInputStream(Files.newInputStream(Paths.get(dctFilePath)))) {
     ...
   }
   ```
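As a rough, self-contained illustration of the pattern described above (not the actual `loadMainDataFromFile` code), the sketch below opens a `DataInputStream` in a try-with-resources statement so it is closed even if reading throws. The class name, the helper methods, and the one-int file layout are all hypothetical.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class TryWithResourcesSketch {

  // Hypothetical reader: returns the first int stored in the file.
  // The stream is closed automatically when the try block exits,
  // whether normally or because readInt() threw.
  static int readFirstInt(Path file) throws IOException {
    try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
      return in.readInt();
    }
  }

  // Writes one int to a temporary file, reads it back, and deletes the file.
  static int roundTrip(int value) {
    try {
      Path tmp = Files.createTempFile("dct-sketch", ".bin");
      try {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(tmp))) {
          out.writeInt(value);
        }
        return readFirstInt(tmp);
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(42));
  }
}
```

Without the try-with-resources form, an exception thrown between opening and closing the stream would leave the file handle open until garbage collection.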
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix resource leak in loadMainDataFromFile [lucene]

2025-05-28 Thread via GitHub


github-actions[bot] commented on PR #14727:
URL: https://github.com/apache/lucene/pull/14727#issuecomment-2915210212

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog-check 
label to it and you will stop receiving this reminder on future updates to the 
PR.





Re: [PR] Add a DoubleValuesSource for scoring full precision vector similarity [lucene]

2025-05-28 Thread via GitHub


vigyasharma commented on PR #14708:
URL: https://github.com/apache/lucene/pull/14708#issuecomment-2917306575

   Thanks for the review folks! I like the idea of a separate class and a 
custom vector comparator, will make these changes.





Re: [PR] Add a DoubleValuesSource for scoring full precision vector similarity [lucene]

2025-05-28 Thread via GitHub


vigyasharma commented on PR #14708:
URL: https://github.com/apache/lucene/pull/14708#issuecomment-2917325435

   I'm not sure about the byte vector case myself. Do we see a viable need for 
it in FullPrecisionVectorSimilaritySource?





Re: [PR] Add a DoubleValuesSource for scoring full precision vector similarity [lucene]

2025-05-28 Thread via GitHub


benwtrent commented on PR #14708:
URL: https://github.com/apache/lucene/pull/14708#issuecomment-2917328559

   > I'm not sure about the byte vector case myself. Do we see a viable need 
for it in FullPrecisionVectorSimilaritySource ?
   
   I am not sure. As of right now, none of the quantization schemes support 
vectors that are already bytes. 
   
   But, for future-proofing, I think the name of the new similarity source 
should indicate the appropriate vector type, right? Maybe we will support byte 
vectors in the future.





Re: [PR] Fix resource leak in loadMainDataFromFile [lucene]

2025-05-28 Thread via GitHub


jpountz commented on PR #14727:
URL: https://github.com/apache/lucene/pull/14727#issuecomment-2916354220

   Looks good! Can you add an entry in lucene/CHANGES.txt under version 10.3?





Re: [PR] Only run the labeller on the main branch of the lucene repository [lucene]

2025-05-28 Thread via GitHub


dweiss commented on PR #14721:
URL: https://github.com/apache/lucene/pull/14721#issuecomment-2916470586

   @pseudo-nymous - if you can shed some light on this before I merge, it'd be 
great. I'll wait a bit for your feedback.





Re: [PR] Only run the labeller on the main branch of the lucene repository [lucene]

2025-05-28 Thread via GitHub


dweiss commented on PR #14721:
URL: https://github.com/apache/lucene/pull/14721#issuecomment-2916469412

   I've no idea. All I know is I get failures when I work on a self-fork PR, 
see here -
   https://github.com/dweiss/lucene/actions/workflows/label-pull-request.yml
   
   these recent "skipped" runs are after I installed the patch above on my 
"main" branch. I don't fully understand what's going on with permissions here, 
sorry.





Re: [PR] deps(java): bump org.apache.groovy:groovy-all from 4.0.26 to 4.0.27 [lucene]

2025-05-28 Thread via GitHub


dweiss merged PR #14722:
URL: https://github.com/apache/lucene/pull/14722





Re: [PR] deps(java): bump com.diffplug.spotless from 7.0.3 to 7.0.4 [lucene]

2025-05-28 Thread via GitHub


dweiss merged PR #14723:
URL: https://github.com/apache/lucene/pull/14723





Re: [PR] Add a DoubleValuesSource for scoring full precision vector similarity [lucene]

2025-05-28 Thread via GitHub


msokolov commented on PR #14708:
URL: https://github.com/apache/lucene/pull/14708#issuecomment-2916875650

   +1 to add support for full precision re-ranking. Have you considered writing 
a FullPrecisionVectorSimilaritySource as a separate class?  We like to avoid 
conditional logic on boolean parameters where possible. I don't know if there 
is really a need for a byte-flavored version either? For the Query support we 
can expect the query to be supplied with high precision. During indexing 
wouldn't we want to re-rank using non-quantized query vectors as well as 
full-precision document vectors? I'm not sure that can be solved using a 
DoubleValuesSource; we would probably need to bake that into the 
codec/hnswsearcher etc.





Re: [PR] Add a DoubleValuesSource for scoring full precision vector similarity [lucene]

2025-05-28 Thread via GitHub


benwtrent commented on PR #14708:
URL: https://github.com/apache/lucene/pull/14708#issuecomment-2916888768

   > Have you considered writing a FullPrecisionVectorSimilaritySource as a 
separate class?
   
   A separate class would allow users to provide a custom vector comparator, 
which might be beneficial.
   
   But I agree, a new similarity source is a good idea here!





[PR] [BlockJoin] Add ParentsChildrenBlockJoinQuery to support parent and c… [lucene]

2025-05-28 Thread via GitHub


Jinny-Wang opened a new pull request, #14728:
URL: https://github.com/apache/lucene/pull/14728

   #14565 
   
   
   





Re: [PR] [BlockJoin] Add ParentsChildrenBlockJoinQuery to support parent and c… [lucene]

2025-05-28 Thread via GitHub


github-actions[bot] commented on PR #14728:
URL: https://github.com/apache/lucene/pull/14728#issuecomment-2917468073

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog-check 
label to it and you will stop receiving this reminder on future updates to the 
PR.





Re: [PR] [BlockJoin] Add ParentsChildrenBlockJoinQuery to support parent and c… [lucene]

2025-05-28 Thread via GitHub


Jinny-Wang closed pull request #14728: [BlockJoin] Add 
ParentsChildrenBlockJoinQuery to support parent and c…
URL: https://github.com/apache/lucene/pull/14728





Re: [PR] [BlockJoin] Add ParentsChildrenBlockJoinQuery to support parent and c… [lucene]

2025-05-28 Thread via GitHub


github-actions[bot] commented on PR #14728:
URL: https://github.com/apache/lucene/pull/14728#issuecomment-2917477162

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog-check 
label to it and you will stop receiving this reminder on future updates to the 
PR.





Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-28 Thread via GitHub


jainankitk commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2917542707

   Thanks @benwtrent and @weizijun for seeing this through!





Re: [PR] Fix Method declared 'final' in 'final' class in LongHeap. [lucene]

2025-05-28 Thread via GitHub


msokolov merged PR #14712:
URL: https://github.com/apache/lucene/pull/14712





Re: [PR] Optimize AbstractKnnVectorQuery#createBitSet with intoBitset [lucene]

2025-05-28 Thread via GitHub


msokolov commented on code in PR #14674:
URL: https://github.com/apache/lucene/pull/14674#discussion_r2112296471


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -226,15 +227,25 @@ private BitSet createBitSet(DocIdSetIterator iterator, 
Bits liveDocs, int maxDoc
   // If we already have a BitSet and no deletions, reuse the BitSet
   return bitSetIterator.getBitSet();
 } else {
-  // Create a new BitSet from matching and live docs
-  FilteredDocIdSetIterator filterIterator =
-  new FilteredDocIdSetIterator(iterator) {
-@Override
-protected boolean match(int doc) {
-  return liveDocs == null || liveDocs.get(doc);
-}
-  };
-  return BitSet.of(filterIterator, maxDoc);
+  int threshold = maxDoc >> 7; // same as BitSet#of
+  if (iterator.cost() >= threshold) {
+// take advantage of Disi#intoBitset and Bits#applyMask
+FixedBitSet bitSet = new FixedBitSet(maxDoc);
+bitSet.or(iterator);
+if (liveDocs != null) {
+  liveDocs.applyMask(bitSet, 0);
+}
+return bitSet;
+  } else {
+FilteredDocIdSetIterator filterIterator =
+new FilteredDocIdSetIterator(iterator) {
+  @Override
+  protected boolean match(int doc) {
+return liveDocs == null || liveDocs.get(doc);

Review Comment:
   If `liveDocs == null`, we could return `BitSet.of(iterator, maxDoc)`, no?






Re: [PR] Optimize AbstractKnnVectorQuery#createBitSet with intoBitset [lucene]

2025-05-28 Thread via GitHub


msokolov commented on code in PR #14674:
URL: https://github.com/apache/lucene/pull/14674#discussion_r2112296471


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -226,15 +227,25 @@ private BitSet createBitSet(DocIdSetIterator iterator, 
Bits liveDocs, int maxDoc
   // If we already have a BitSet and no deletions, reuse the BitSet
   return bitSetIterator.getBitSet();
 } else {
-  // Create a new BitSet from matching and live docs
-  FilteredDocIdSetIterator filterIterator =
-  new FilteredDocIdSetIterator(iterator) {
-@Override
-protected boolean match(int doc) {
-  return liveDocs == null || liveDocs.get(doc);
-}
-  };
-  return BitSet.of(filterIterator, maxDoc);
+  int threshold = maxDoc >> 7; // same as BitSet#of
+  if (iterator.cost() >= threshold) {
+// take advantage of Disi#intoBitset and Bits#applyMask
+FixedBitSet bitSet = new FixedBitSet(maxDoc);
+bitSet.or(iterator);
+if (liveDocs != null) {
+  liveDocs.applyMask(bitSet, 0);
+}
+return bitSet;
+  } else {
+FilteredDocIdSetIterator filterIterator =
+new FilteredDocIdSetIterator(iterator) {
+  @Override
+  protected boolean match(int doc) {
+return liveDocs == null || liveDocs.get(doc);

Review Comment:
   If `liveDocs == null`, couldn't we return `BitSet.of(iterator, maxDoc)` 
directly? That would avoid the internal logic in the filtered iterator, although 
its cost probably goes away with branch prediction anyway :shrug:
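To make the suggested branching concrete, here is a simplified sketch that uses `java.util.BitSet` and a sorted `int[]` as stand-ins for Lucene's bit set and `DocIdSetIterator`. It is not the code in the PR: the class and method are hypothetical, and only the `maxDoc >> 7` threshold is taken from the diff above.

```java
import java.util.BitSet;

final class CreateBitSetSketch {

  // docs: matching doc IDs in increasing order (stand-in for a DocIdSetIterator).
  // liveDocs: bits of non-deleted docs, or null when the segment has no deletions.
  static BitSet createBitSet(int[] docs, BitSet liveDocs, int maxDoc) {
    BitSet result = new BitSet(maxDoc);
    int threshold = maxDoc >> 7; // same threshold as in the diff
    if (liveDocs == null) {
      // No deletions: nothing to filter, so copy the matches directly.
      for (int doc : docs) {
        result.set(doc);
      }
    } else if (docs.length >= threshold) {
      // Dense case: set every match, then clear deleted docs with one bulk mask.
      for (int doc : docs) {
        result.set(doc);
      }
      result.and(liveDocs);
    } else {
      // Sparse case: check each matching doc against liveDocs individually.
      for (int doc : docs) {
        if (liveDocs.get(doc)) {
          result.set(doc);
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    BitSet live = new BitSet();
    live.set(1);
    live.set(5);
    System.out.println(createBitSet(new int[] {1, 3, 5}, live, 8));
    System.out.println(createBitSet(new int[] {1, 3, 5}, null, 8));
  }
}
```

All three branches must produce the same set of live matching docs; they differ only in how much per-document work they do.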






Re: [PR] Move HitQueue in TopScoreDocCollector to a LongHeap [lucene]

2025-05-28 Thread via GitHub


jpountz commented on code in PR #14714:
URL: https://github.com/apache/lucene/pull/14714#discussion_r2111945886


##
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java:
##
@@ -73,23 +65,22 @@ public ScoreMode scoreMode() {
   public LeafCollector getLeafCollector(LeafReaderContext context) throws 
IOException {
 final int docBase = context.docBase;
 final ScoreDoc after = this.after;
-final float afterScore;
+final int afterScore;

Review Comment:
   Does it actually help to track scores as sortable ints rather than floats? I 
had assumed we'd only encode them if they're between the after score and the 
top score?



##
lucene/core/src/java/org/apache/lucene/search/DocScoreEncoder.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import org.apache.lucene.util.NumericUtils;
+
+/**
+ * An encoder do encode (doc, score) pair as a long whose sort order is same 
as {@code (o1, o2) ->
+ * Float.compare(o1.score, 
o2.score)).thenComparing(Comparator.comparingInt((ScoreDoc o) ->
+ * o.doc).reversed())}
+ *
+ * Note that negative score is allowed but relationship between two codes 
encoded by negative
+ * scores is undefined. The only thing guaranteed is codes encoded from 
negative scores are smaller
+ * than codes encoded from non-negative scores.
+ */
+class DocScoreEncoder {
+
+  static final long LEAST_COMPETITIVE_CODE = encode(Integer.MAX_VALUE, 
Float.NEGATIVE_INFINITY);
+  private static final int POS_INF_TO_SORTABLE_INT = 
scoreToSortableInt(Float.POSITIVE_INFINITY);
+
+  static long encode(int docId, float score) {
+return encodeIntScore(docId, scoreToSortableInt(score));
+  }
+
+  static long encodeIntScore(int docId, int score) {
+return (((long) score) << 32) | (~docId & 0xFFFFFFFFL);
+  }
+
+  static float toScore(long value) {
+return sortableIntToScore(toIntScore(value));
+  }
+
+  static int toIntScore(long value) {
+return (int) (value >>> 32);
+  }
+
+  static int docId(long value) {
+return (int) ~value;
+  }
+
+  static int nextUp(int intScore) {
+assert intScore <= POS_INF_TO_SORTABLE_INT;
+int nextUp = Math.min(POS_INF_TO_SORTABLE_INT, intScore + 1);
+assert nextUp == 
scoreToSortableInt(Math.nextUp(sortableIntToScore(intScore)));
+return nextUp;
+  }
+
+  /**
+   * Score is non-negative float so wo use floatToRawIntBits instead of {@link
+   * NumericUtils#floatToSortableInt}. We do not assert score >= 0 here to 
allow pass negative float
+   * to indicate totally non-competitive, e.g. {@link #LEAST_COMPETITIVE_CODE}.
+   */

Review Comment:
   This is a bit too subtle for my taste. Could we either not deal with 
negative scores at all, or use NumericUtils#floatToSortableInt? FWIW, I believe 
that `LEAST_COMPETITIVE_CODE` could use a score of 0, since Integer.MAX_VALUE 
is not an allowed doc ID.
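A minimal sketch of the alternative raised here: packing (doc, score) into a long with a sortable-int transform, so that negative scores are also ordered correctly. The class is hypothetical, and the local `floatToSortableInt` is a reimplementation under the assumption that it matches the transform behind `NumericUtils#floatToSortableInt`; it is not the PR's `DocScoreEncoder`.

```java
final class DocScoreCodecSketch {

  // Assumed equivalent of NumericUtils#floatToSortableInt: for negative floats,
  // flip the low 31 bits so that signed int order matches float order.
  static int floatToSortableInt(float value) {
    int bits = Float.floatToIntBits(value);
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  // The transform is its own inverse because the sign bit is left untouched.
  static float sortableIntToFloat(int sortable) {
    return Float.intBitsToFloat(sortable ^ ((sortable >> 31) & 0x7fffffff));
  }

  // High 32 bits: sortable score. Low 32 bits: inverted doc ID, so that for
  // equal scores the smaller doc ID yields the larger code.
  static long encode(int docId, float score) {
    return (((long) floatToSortableInt(score)) << 32) | (~docId & 0xFFFFFFFFL);
  }

  static int docId(long code) {
    return (int) ~code;
  }

  static float score(long code) {
    return sortableIntToFloat((int) (code >>> 32));
  }

  public static void main(String[] args) {
    long code = encode(42, 1.5f);
    System.out.println(docId(code) + " " + score(code));
  }
}
```

With this transform, comparing two codes as plain longs orders them by score first and by descending doc ID second, without a special case for negative scores.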






Re: [PR] Only run the labeller on the main branch of the lucene repository [lucene]

2025-05-28 Thread via GitHub


dweiss merged PR #14721:
URL: https://github.com/apache/lucene/pull/14721





Re: [PR] Only run the labeller on the main branch of the lucene repository [lucene]

2025-05-28 Thread via GitHub


pseudo-nymous commented on PR #14721:
URL: https://github.com/apache/lucene/pull/14721#issuecomment-2916739701

   I'm also not sure about the permission issue. There have been past 
successful runs in forks.
   
   
   The [documentation](https://docs.github.com/en/rest/issues/labels?apiVersion=2022-11-28#create-a-label) 
also states that only one of the permission sets below is required, and we 
already have the pull_request write permission.
   ```
   The fine-grained token must have at least one of the following permission 
sets:
   
   "Issues" repository permissions (write)
   "Pull requests" repository permissions (write)
   ```





Re: [PR] [BlockJoin] Add ParentsChildrenBlockJoinQuery to support parent and c… [lucene]

2025-05-28 Thread via GitHub


msfroh commented on code in PR #14728:
URL: https://github.com/apache/lucene/pull/14728#discussion_r2112783861


##
lucene/join/src/test/org/apache/lucene/search/join/TestParentsChildrenBlockJoinQuery.java:
##
@@ -0,0 +1,186 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import com.carrotsearch.randomizedtesting.annotations.ParametersFactory;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.List;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.document.StringField;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.*;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.tests.index.RandomIndexWriter;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.junit.Test;
+
+public class TestParentsChildrenBlockJoinQuery extends LuceneTestCase {
+
+  private final TestCase testCase;
+
+  public TestParentsChildrenBlockJoinQuery(TestCase testCase) {
+this.testCase = testCase;
+  }
+
+  @ParametersFactory
+  public static Collection testCases() {
+return Arrays.asList(
+new Object[] {new TestCase("EmptyIndex", 10, 0, new int[0], new 
TestDoc[0][0])},
+new Object[] {
+  new TestCase(
+  "OnlyParentDocs",
+  10,
+  0,
+  new int[0],
+  new TestDoc[][] {
+{
+  new TestDoc("parent", true),
+},
+{
+  new TestDoc("parent", true),
+},
+{
+  new TestDoc("parent", true),
+}
+  })
+},
+new Object[] {
+  new TestCase(
+  "FirstParentWithoutChild",
+  10,
+  2,
+  new int[] {1, 2},
+  new TestDoc[][] {
+{
+  new TestDoc("parent", true),
+},
+{
+  new TestDoc("child", true),
+},
+{
+  new TestDoc("child", true),
+},
+{
+  new TestDoc("parent", true),
+}

Review Comment:
   Shouldn't this be:
   
   ```
   new TestDoc[][] {
 {
   new TestDoc("parent", true)
 },
 {
   new TestDoc("child", true),
   new TestDoc("child", true),
   new TestDoc("parent", true)
 }
   }
   ```
   
   That is, shouldn't the children be part of the second parent's block?
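To illustrate the block layout this comment relies on: in Lucene's block join convention, the children of a block are indexed first and the parent is the last document of the block. The sketch below (a hypothetical helper, unrelated to the test class in the PR) maps each doc in a flat doc ID sequence to the parent that closes its block.

```java
import java.util.Arrays;

final class BlockLayoutSketch {

  // isParent[i] is true when doc i is a parent document. Returns, for every doc,
  // the doc ID of the parent that closes its block (a parent maps to itself),
  // or -1 for docs that come after the last parent.
  static int[] parentOf(boolean[] isParent) {
    int[] parent = new int[isParent.length];
    int next = -1;
    // Walk backwards so each child sees the nearest parent at a higher doc ID.
    for (int i = isParent.length - 1; i >= 0; i--) {
      if (isParent[i]) {
        next = i;
      }
      parent[i] = next;
    }
    return parent;
  }

  public static void main(String[] args) {
    // Layout from the test case: parent, child, child, parent.
    boolean[] layout = {true, false, false, true};
    System.out.println(Arrays.toString(parentOf(layout)));
  }
}
```

For the layout parent, child, child, parent, both children map to doc 3, which is why they belong in the second parent's block rather than in blocks of their own.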



##
lucene/join/src/test/org/apache/lucene/search/join/TestParentsChildrenBlockJoinQuery.java:
##
@@ -0,0 +1,186 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import com.carrotsearch.randomizedtesting.annotations.ParametersFactory;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.List;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.document.StringField;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.*;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.tests.index.RandomIndexWriter;
+import org

Re: [PR] Avoid unnecessary comparison for CELL_CROSSES_QUERY cases [lucene]

2025-05-28 Thread via GitHub


jainankitk merged PR #14626:
URL: https://github.com/apache/lucene/pull/14626





[PR] Support for Re-Ranking Queries using Late Interaction Model Multi-Vectors. [lucene]

2025-05-28 Thread via GitHub


vigyasharma opened a new pull request, #14729:
URL: https://github.com/apache/lucene/pull/14729

   Late Interaction models, like [ColBERT](https://arxiv.org/abs/2004.12832) 
and [ColPali](https://arxiv.org/html/2407.01449v2), capture rich semantic 
interaction between documents and queries, and have been shown to outperform 
single-vector (no-interaction) models on search relevance. These models operate 
by using multi-vector representations for query (and document) embeddings. 
   
   One challenge with including late interaction models in search has been 
working with multi-vectors at scale. This change provides an efficient 
workaround by adding support for reranking the results of a query using late 
interaction multi-vectors.
   
   The typical envisioned use case is to run the full-corpus search using ANN 
search on single-valued vectors, followed by a second pass that reranks results 
using late-interaction multi-vector scores. This PR creates:
   1. A LateInteractionField that stores multi-vectors in BinaryDocValues.
   2. A DoubleValuesSource that scores query and document multi-vectors.
   3. A FunctionScore query that wraps a provided query and reranks its results 
with late-interaction model scores.
   
   Note: This first approach does not add additional metadata to `FieldInfo`. 
As a result, we are unable to ensure a consistent shape for the multi-vectors 
indexed in the same field across documents.
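For readers unfamiliar with late-interaction scoring, here is a minimal sketch of the sum-of-max-similarity ("MaxSim") operator described in the ColBERT paper. The description above does not spell out the scoring function, so treating it as MaxSim with dot-product similarity is an assumption; the class and method names are hypothetical and do not come from the PR.

```java
final class LateInteractionSketch {

  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // For each query vector, take its best (maximum) similarity against any
  // document vector, then sum those maxima. Assumes the document has at least
  // one vector and that all vectors share the same dimension.
  static float sumMaxSim(float[][] query, float[][] doc) {
    float total = 0f;
    for (float[] q : query) {
      float best = Float.NEGATIVE_INFINITY;
      for (float[] d : doc) {
        best = Math.max(best, dot(q, d));
      }
      total += best;
    }
    return total;
  }

  public static void main(String[] args) {
    float[][] query = {{1f, 0f}, {0f, 1f}};
    float[][] doc = {{1f, 0f}, {0.5f, 0.5f}};
    System.out.println(sumMaxSim(query, doc));
  }
}
```

The cost is proportional to the number of query vectors times the number of document vectors per candidate, which is why applying it only to the top results of a cheaper first-pass query is attractive.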





Re: [PR] Support for Re-Ranking Queries using Late Interaction Model Multi-Vectors. [lucene]

2025-05-28 Thread via GitHub


github-actions[bot] commented on PR #14729:
URL: https://github.com/apache/lucene/pull/14729#issuecomment-2917887297

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog-check 
label to it and you will stop receiving this reminder on future updates to the 
PR.





Re: [PR] Support for Re-Ranking Queries using Late Interaction Model Multi-Vectors. [lucene]

2025-05-28 Thread via GitHub


vigyasharma commented on PR #14729:
URL: https://github.com/apache/lucene/pull/14729#issuecomment-2917890456

   This change builds on the work shared 
[here](https://github.com/apache/lucene/pull/13525#issuecomment-2445295372) by 
@jimczi, thanks Jim! 





Re: [PR] Fix documentation regarding benchmark running [lucene]

2025-05-28 Thread via GitHub


github-actions[bot] commented on PR #14667:
URL: https://github.com/apache/lucene/pull/14667#issuecomment-2917902454

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Updating skip-changelog label [lucene]

2025-05-28 Thread via GitHub


github-actions[bot] commented on PR #14661:
URL: https://github.com/apache/lucene/pull/14661#issuecomment-2917902497

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] No ruff violation [lucene]

2025-05-28 Thread via GitHub


github-actions[bot] commented on PR #14725:
URL: https://github.com/apache/lucene/pull/14725#issuecomment-2917935696

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog-check 
label to it and you will stop receiving this reminder on future updates to the 
PR.





Re: [PR] Fix documentation regarding benchmark running [lucene]

2025-05-28 Thread via GitHub


viliam-durina commented on PR #14667:
URL: https://github.com/apache/lucene/pull/14667#issuecomment-2918487383

   @jainankitk Do you really want changelog entries for this kind of change? 
Nevertheless, I added one.





Re: [PR] Fix resource leak in loadMainDataFromFile [lucene]

2025-05-28 Thread via GitHub


xcx1r3 commented on PR #14727:
URL: https://github.com/apache/lucene/pull/14727#issuecomment-2918148345

   It seems a few more places had the same problem, so I've added a few more 
commits to this PR covering:
   1. `org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary#loadFromFile`: 
`DataInputStream dctFile`
   2. `org.apache.lucene.benchmark.byTask.feeds.DirContentSource#getNextDocData`: 
`BufferedReader reader`
   3. `org.apache.lucene.benchmark.quality.trec.QueryDriver#main`: `FSDirectory 
dir` and `IndexReader reader`
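
   For reference, the pattern applied in each of these places can be sketched 
with plain JDK classes as below. This is an illustration under assumptions, not 
the code from the commits: `TryWithResourcesSketch`, `Tracked`, and the method 
names are hypothetical, and `Tracked` merely stands in for a pair of dependent 
resources such as a directory and a reader opened on it. It shows a single 
stream, a reader, and two resources declared in one `try` header, which are 
closed in reverse order of declaration.

```java
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the code changed in the PR. */
final class TryWithResourcesSketch {

  /** Reads the first int of a file; the stream is closed even if readInt() throws. */
  static int readFirstInt(Path file) throws IOException {
    try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
      return in.readInt();
    }
  }

  /** Reads the first line of a text file; the reader is closed on every exit path. */
  static String readFirstLine(Path file) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
      return reader.readLine();
    }
  }

  /** Stand-in resource that records when it is closed. */
  static final class Tracked implements AutoCloseable {
    private final String name;
    private final List<String> log;

    Tracked(String name, List<String> log) {
      this.name = name;
      this.log = log;
    }

    @Override
    public void close() {
      log.add(name);
    }
  }

  /** Two resources in one header: the one declared last is closed first. */
  static List<String> closeOrder() {
    List<String> log = new ArrayList<>();
    try (Tracked dir = new Tracked("dir", log);
        Tracked reader = new Tracked("reader", log)) {
      // Work with reader (which depends on dir) here.
    }
    return log;
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("sketch", ".bin");
    try {
      Files.write(tmp, new byte[] {0, 0, 0, 42});
      System.out.println(readFirstInt(tmp));
    } finally {
      Files.deleteIfExists(tmp);
    }
    System.out.println(closeOrder());
  }
}
```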

