Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-03 Thread via GitHub


ChrisHegarty commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325951431

   💙 





Re: [I] TestBoolean2.testRandomQueries fails in CI due to eating up heap space [lucene]

2024-09-03 Thread via GitHub


javanna commented on issue #11754:
URL: https://github.com/apache/lucene/issues/11754#issuecomment-2326111901

   Still relevant. I think the driving factor is perhaps the number of segments and how they get searched concurrently (i.e. the number of slices): each slice gets its own collector with its own queue, and that seems to blow up heap space under certain conditions. I am observing the same problem in #13542 .
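   
   Roughly, as a simplification of that hypothesis: `heap ≈ numSlices × topN × sizeof(queue entry)`, so the same query needs proportionally more heap as the slice count grows.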





Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-03 Thread via GitHub


javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1741835108


##
lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java:
##
@@ -28,17 +31,77 @@
  */
 public class TotalHitCountCollectorManager
 implements CollectorManager {
+
+  /**
+   * Internal state shared across the different collectors that this collector manager creates. It
+   * tracks leaves seen as an argument of {@link Collector#getLeafCollector(LeafReaderContext)}
+   * calls, to ensure correctness: if the first partition of a segment early terminates, the count
+   * has already been retrieved for the entire segment, hence subsequent partitions of the same
+   * segment should also early terminate. If the first partition of a segment computes hit counts,
+   * subsequent partitions of the same segment should do the same, to prevent their counts from
+   * being retrieved from {@link LRUQueryCache} (which returns counts for the entire segment)
+   */
+  private final Map seenContexts = new HashMap<>();

Review Comment:
   Scratch that: @original-brownbear suggested a different approach (now included in the PR) that does not require a `synchronized` block. With this, it feels like we may get away without a user option to disable the overhead when it's not needed. The flag would be odd anyway, because the searcher knows whether there are partitions or not and could act accordingly, while the collector manager currently has no way to know which searcher uses it, or to get info from it.
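   
   For illustration only (not necessarily the approach now in the PR), one lock-free way to record what the first partition of a segment decided, so that later partitions of the same segment follow suit, is `putIfAbsent` on a concurrent map keyed by the leaf context:
   
   ```java
   // Sketch under assumptions: the class and method names here are hypothetical, not the PR's API.
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import org.apache.lucene.index.LeafReaderContext;
   
   final class SeenSegments {
     // TRUE  = first partition early-terminated (count for the whole segment is already known)
     // FALSE = first partition computed hit counts, so later partitions must count as well
     private final Map<LeafReaderContext, Boolean> seen = new ConcurrentHashMap<>();
   
     /** Records this partition's decision if it is the first for the segment, otherwise returns the first one's. */
     boolean firstDecision(LeafReaderContext ctx, boolean earlyTerminated) {
       Boolean previous = seen.putIfAbsent(ctx, earlyTerminated);
       return previous != null ? previous : earlyTerminated;
     }
   }
   ```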






Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-03 Thread via GitHub


javanna commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2326200273

   I have lowered the maximum number of partitions we may end up creating per segment in tests to `5`, as we could otherwise end up with thousands of them. That improves the GC issues with `TestBoolean2`, but one remains, and it has been observed before even without intra-segment concurrency support, see #11754 . I tried different things to lower the number of slices generated, but I was not able to eliminate the problem entirely. That makes sense given the problem existed before, and I believe it is a function of the number of collectors performing the search (one per slice, hence a function of the number of slices). This PR will end up increasing the chance that `TestBoolean2#testRandom` fails when extreme slicing is tested.





[I] Operations#sameLanguage has true negatives? [lucene]

2024-09-03 Thread via GitHub


jpountz opened a new issue, #13709:
URL: https://github.com/apache/lucene/issues/13709

   ### Description
   
   The following test fails, even though both automata accept the same language 
and are determinized. Looking at the implementation, it looks like it may 
assume that both automata are not only determinized but also minimal?
   
   ```java
   public void testSameLanguage() {
     // automaton that accepts [ab]* with 2 states
     Automaton a = new Automaton();
     int state0 = a.createState();
     int state1 = a.createState();
     a.setAccept(state0, true);
     a.setAccept(state1, true);
     a.addTransition(state0, state1, 'a');
     a.addTransition(state1, state0, 'b');

     // automaton that accepts [ab]* with a single state
     Automaton b = new Automaton();
     state0 = b.createState();
     b.setAccept(state0, true);
     b.addTransition(state0, state0, 'a');
     b.addTransition(state0, state0, 'b');

     assertTrue(Operations.sameLanguage(a, b));
   }
   ```
   
   Here is what these two automata look like:
   a
   
![automaton](https://github.com/user-attachments/assets/a150ae74-7393-4516-90da-355c9cc8a472)
   b
   
![automaton2](https://github.com/user-attachments/assets/159d018a-052d-435f-bcef-96650272c714)
   
   
   ### Version and environment details
   
   _No response_





Re: [I] Operations#sameLanguage has true negatives? [lucene]

2024-09-03 Thread via GitHub


rmuir commented on issue #13709:
URL: https://github.com/apache/lucene/issues/13709#issuecomment-2326646660

   It doesn't look like they accept the same language to me.
   For example, automaton "b" accepts the string `aa` but automaton "a" does not.
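   
   A quick way to check (sketch only: same construction as the test above, with `finishState()` added so the automata can be run directly):
   
   ```java
   // Sketch: find a string that distinguishes the two automata from the issue.
   Automaton a = new Automaton();
   int s0 = a.createState();
   int s1 = a.createState();
   a.setAccept(s0, true);
   a.setAccept(s1, true);
   a.addTransition(s0, s1, 'a');
   a.addTransition(s1, s0, 'b');
   a.finishState();
   
   Automaton b = new Automaton();
   int t0 = b.createState();
   b.setAccept(t0, true);
   b.addTransition(t0, t0, 'a');
   b.addTransition(t0, t0, 'b');
   b.finishState();
   
   // "a" only accepts prefixes of (ab)*, while "b" accepts any string over {a, b}:
   System.out.println(Operations.run(a, "aa")); // false
   System.out.println(Operations.run(b, "aa")); // true
   ```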





Re: [I] Operations#sameLanguage has true negatives? [lucene]

2024-09-03 Thread via GitHub


rmuir commented on issue #13709:
URL: https://github.com/apache/lucene/issues/13709#issuecomment-2326656149

   No worries. Somewhat related to this: you may indeed find bugs if you have assertions disabled, due to the smelliness of sameLanguage; please see https://github.com/apache/lucene/pull/13708





Re: [I] Operations#sameLanguage has true negatives? [lucene]

2024-09-03 Thread via GitHub


jpountz commented on issue #13709:
URL: https://github.com/apache/lucene/issues/13709#issuecomment-2326652907

   Lol, this is embarrassing. Sorry for the noise.





Re: [I] Operations#sameLanguage has true negatives? [lucene]

2024-09-03 Thread via GitHub


jpountz closed issue #13709: Operations#sameLanguage has true negatives?
URL: https://github.com/apache/lucene/issues/13709





[I] Move anonymous Weight implementation in PointRangeQuery to named class [lucene]

2024-09-03 Thread via GitHub


jainankitk opened a new issue, #13710:
URL: https://github.com/apache/lucene/issues/13710

   ### Description
   
   The `Weight` implementation in `PointRangeQuery#createWeight` is an anonymous class, which makes it difficult to extend or reuse specific logic from it. Not to mention the class is more than ~400 lines of code, which makes a good case for a named class anyway, IMO.





[PR] Move anonymous Weight implementation in PointRangeQuery to named class [lucene]

2024-09-03 Thread via GitHub


jainankitk opened a new pull request, #13711:
URL: https://github.com/apache/lucene/pull/13711

   ### Description
   
   Moves the anonymous `Weight` implementation in `PointRangeQuery#createWeight` to a named class for better extensibility and reusability.
   
   





Re: [PR] Move anonymous Weight implementation in PointRangeQuery to named class [lucene]

2024-09-03 Thread via GitHub


jpountz commented on PR #13711:
URL: https://github.com/apache/lucene/pull/13711#issuecomment-2326913351

   I see from the linked issue that you would like to extend `PointRangeQuery`, 
but in general we don't like to think of our queries as being extensible. I 
wonder if you could do what you need through composition rather than extension?
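   
   For instance (a rough sketch rather than a concrete proposal; `MyRangeQuery` and everything in it are hypothetical), composition could look something like wrapping the point query and reusing or decorating its `Weight`:
   
   ```java
   // Sketch only: names are illustrative, and the decorated behavior is left as a placeholder.
   import java.io.IOException;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.QueryVisitor;
   import org.apache.lucene.search.ScoreMode;
   import org.apache.lucene.search.Weight;
   
   class MyRangeQuery extends Query {
     private final Query delegate; // e.g. a query created via IntPoint.newRangeQuery(...)
   
     MyRangeQuery(Query delegate) {
       this.delegate = delegate;
     }
   
     @Override
     public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
         throws IOException {
       Weight inner = delegate.createWeight(searcher, scoreMode, boost);
       // Reuse the point query's Weight as-is, or wrap it (e.g. in a FilterWeight subclass)
       // to decorate only the methods that need to change.
       return inner;
     }
   
     @Override
     public String toString(String field) {
       return "MyRangeQuery(" + delegate.toString(field) + ")";
     }
   
     @Override
     public void visit(QueryVisitor visitor) {
       delegate.visit(visitor);
     }
   
     @Override
     public boolean equals(Object other) {
       return sameClassAs(other) && delegate.equals(((MyRangeQuery) other).delegate);
     }
   
     @Override
     public int hashCode() {
       return classHash() ^ delegate.hashCode();
     }
   }
   ```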





Re: [PR] Speed up advancing within a block. [lucene]

2024-09-03 Thread via GitHub


gsmiller commented on PR #13692:
URL: https://github.com/apache/lucene/pull/13692#issuecomment-2326975500

   Interesting. I'm not quite clear on what the difference is between the analysis you posted [here](https://github.com/apache/lucene/pull/13692#issuecomment-2324897665) and what got merged?
   
   Clearly a tricky balance here getting something that works well for both 
dense and sparse advancing cases.





Re: [PR] Move anonymous Weight implementation in PointRangeQuery to named class [lucene]

2024-09-03 Thread via GitHub


jainankitk commented on PR #13711:
URL: https://github.com/apache/lucene/pull/13711#issuecomment-2327098073

   > I see from the linked issue that you would like to extend 
`PointRangeQuery`, but in general we don't like to think of our queries as 
being extensible. I wonder if you could do what you need through composition 
rather than extension?
   
   @jpountz - Thanks for your feedback. While I agree with using composition rather than extension for `PointRangeQuery` itself, the `Weight` implementation within `PointRangeQuery` would be more reusable (via composition instead of extension) if it were a non-anonymous class.





Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-09-03 Thread via GitHub


benwtrent commented on code in PR #13651:
URL: https://github.com/apache/lucene/pull/13651#discussion_r1742505038


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/BinarizedByteVectorValues.java:
##
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene912;
+
+import java.io.IOException;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.VectorScorer;
+
+/**
+ * A version of {@link ByteVectorValues}, but additionally retrieving score 
correction values offset
+ * for binarization quantization scores.
+ *
+ * @lucene.experimental
+ */
+public abstract class BinarizedByteVectorValues extends DocIdSetIterator {
+  public abstract float getDistanceToCentroid() throws IOException;
+
+  /** Returns the cluster ID for the vector in the range [0, 255] */
+  public abstract short clusterId() throws IOException;
+
+  public static byte encodeClusterIdToByte(short clusterId) {
+    byte bClusterId = clusterId <= 127 ? (byte) clusterId : (byte) (clusterId - 256);
+    return bClusterId;
+  }
+
+  public static short decodeClusterIdFromByte(byte bClusterId) {
+    short clusterId = bClusterId >= 0 ? (short) bClusterId : (short) (bClusterId + 256);
+    return clusterId;
+  }
+
+  public abstract float getMagnitude() throws IOException;
+
+  public abstract float getOOQ() throws IOException;
+
+  public abstract float getNormOC() throws IOException;
+
+  public abstract float getODotC() throws IOException;

Review Comment:
   I would really like it if we just did `float[]` and the callers (which are scorers) knew how to use it. The downside now is that if something calls `getODotC()` but it wasn't available, this would blow up, right?
   
   @john-wagster 
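   
   Something like this is what I have in mind (sketch only; the method name and the per-similarity layout convention are made up for illustration):
   
   ```java
   // Hypothetical replacement for the individual getters above.
   /**
    * Per-vector corrective terms. The number and meaning of the entries is a convention shared
    * with the scorer for the similarity in use (e.g. {distanceToCentroid, magnitude} for
    * Euclidean, {ooq, normOC, oDotC} for MIP), so a scorer never asks for a value that was not
    * stored.
    */
   public abstract float[] getCorrectiveTerms() throws IOException;
   ```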






Re: [PR] move Operations.sameLanguage/subsetOf to AutomatonTestUtil in test-framework [lucene]

2024-09-03 Thread via GitHub


dweiss commented on code in PR #13708:
URL: https://github.com/apache/lucene/pull/13708#discussion_r1742548065


##
lucene/core/src/java/org/apache/lucene/util/automaton/StatePair.java:
##
@@ -35,9 +35,14 @@
  * @lucene.experimental
  */
 public class StatePair {
+  // only mike knows what it does (do not expose)

Review Comment:
   lol.






Re: [PR] Add Bulk Scorer For ToParentBlockJoinQuery [lucene]

2024-09-03 Thread via GitHub


jpountz commented on code in PR #13697:
URL: https://github.com/apache/lucene/pull/13697#discussion_r1742643399


##
lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java:
##
@@ -440,6 +477,99 @@ private String formatScoreExplanation(int matches, int start, int end, ScoreMode
 }
   }
 
+  private abstract static class BatchAwareLeafCollector extends FilterLeafCollector {
+    public BatchAwareLeafCollector(LeafCollector in) {
+      super(in);
+    }
+
+    public void endBatch() throws IOException {}
+  }
+
+  private static class BlockJoinBulkScorer extends BulkScorer {
+    private final BulkScorer childBulkScorer;
+    private final ScoreMode scoreMode;
+    private final BitSet parents;
+    private final int parentsLength;
+
+    public BlockJoinBulkScorer(BulkScorer childBulkScorer, ScoreMode scoreMode, BitSet parents) {
+      this.childBulkScorer = childBulkScorer;
+      this.scoreMode = scoreMode;
+      this.parents = parents;
+      this.parentsLength = parents.length();
+    }
+
+    @Override
+    public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
+        throws IOException {
+      // Subtract one because max is exclusive w.r.t. score but inclusive w.r.t prevSetBit
+      int lastParent = parents.prevSetBit(Math.min(parentsLength, max) - 1);
+      int prevParent = min == 0 ? -1 : parents.prevSetBit(min - 1);
+      if (lastParent == prevParent) {
+        // No parent docs in this range
+        return max;
+      }
+
+      BatchAwareLeafCollector wrappedCollector = wrapCollector(collector);
+      childBulkScorer.score(wrappedCollector, acceptDocs, prevParent + 1, lastParent + 1);
+      wrappedCollector.endBatch();
+
+      // If we've scored the last parent in the bit set, return NO_MORE_DOCS to indicate we are
+      // done scoring
+      return lastParent + 1 >= parentsLength ? NO_MORE_DOCS : max;
+    }
+
+    @Override
+    public long cost() {
+      return childBulkScorer.cost();
+    }
+
+    // TODO: Need to resolve parent doc IDs in multi-reader space?
+    private BatchAwareLeafCollector wrapCollector(LeafCollector collector) {
+      return new BatchAwareLeafCollector(collector) {
+        private final Score currentParentScore = new Score(scoreMode);

Review Comment:
   You need to extend it to override `setMinCompetitiveScore` and delegate to 
`scorer` when appropriate, if you want dynamic pruning to work.



##
lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java:
##
@@ -440,6 +477,99 @@ private String formatScoreExplanation(int matches, int start, int end, ScoreMode
 }
   }
 
+  private abstract static class BatchAwareLeafCollector extends FilterLeafCollector {
+    public BatchAwareLeafCollector(LeafCollector in) {
+      super(in);
+    }
+
+    public void endBatch() throws IOException {}
+  }
+
+  private static class BlockJoinBulkScorer extends BulkScorer {
+    private final BulkScorer childBulkScorer;
+    private final ScoreMode scoreMode;
+    private final BitSet parents;
+    private final int parentsLength;
+
+    public BlockJoinBulkScorer(BulkScorer childBulkScorer, ScoreMode scoreMode, BitSet parents) {
+      this.childBulkScorer = childBulkScorer;
+      this.scoreMode = scoreMode;
+      this.parents = parents;
+      this.parentsLength = parents.length();
+    }
+
+    @Override
+    public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
+        throws IOException {
+      // Subtract one because max is exclusive w.r.t. score but inclusive w.r.t prevSetBit
+      int lastParent = parents.prevSetBit(Math.min(parentsLength, max) - 1);
+      int prevParent = min == 0 ? -1 : parents.prevSetBit(min - 1);
+      if (lastParent == prevParent) {
+        // No parent docs in this range
+        return max;
+      }
+
+      BatchAwareLeafCollector wrappedCollector = wrapCollector(collector);
+      childBulkScorer.score(wrappedCollector, acceptDocs, prevParent + 1, lastParent + 1);
+      wrappedCollector.endBatch();
+
+      // If we've scored the last parent in the bit set, return NO_MORE_DOCS to indicate we are
+      // done scoring
+      return lastParent + 1 >= parentsLength ? NO_MORE_DOCS : max;
+    }
+
+    @Override
+    public long cost() {
+      return childBulkScorer.cost();
+    }
+
+    // TODO: Need to resolve parent doc IDs in multi-reader space?
+    private BatchAwareLeafCollector wrapCollector(LeafCollector collector) {
+      return new BatchAwareLeafCollector(collector) {
+        private final Score currentParentScore = new Score(scoreMode);

Review Comment:
   By the way, can you add a test that dynamic pruning works with your new bulk 
scorer? See e.g. `TestMaxScoreBulkScorer#testBasicsWithTwoDisjunctionClauses` 
for an example of a similar test.



##
lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java:
##
@@ -275,6 +286,53 @@ public 

Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

2024-09-03 Thread via GitHub


john-wagster commented on code in PR #13651:
URL: https://github.com/apache/lucene/pull/13651#discussion_r1742666476


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/BinarizedByteVectorValues.java:
##
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene912;
+
+import java.io.IOException;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.VectorScorer;
+
+/**
+ * A version of {@link ByteVectorValues}, but additionally retrieving score 
correction values offset
+ * for binarization quantization scores.
+ *
+ * @lucene.experimental
+ */
+public abstract class BinarizedByteVectorValues extends DocIdSetIterator {
+  public abstract float getDistanceToCentroid() throws IOException;
+
+  /** Returns the cluster ID for the vector in the range [0, 255] */
+  public abstract short clusterId() throws IOException;
+
+  public static byte encodeClusterIdToByte(short clusterId) {
+    byte bClusterId = clusterId <= 127 ? (byte) clusterId : (byte) (clusterId - 256);
+    return bClusterId;
+  }
+
+  public static short decodeClusterIdFromByte(byte bClusterId) {
+    short clusterId = bClusterId >= 0 ? (short) bClusterId : (short) (bClusterId + 256);
+    return clusterId;
+  }
+
+  public abstract float getMagnitude() throws IOException;
+
+  public abstract float getOOQ() throws IOException;
+
+  public abstract float getNormOC() throws IOException;
+
+  public abstract float getODotC() throws IOException;

Review Comment:
   I was naively following the pattern to get things working. This had getMagnitude and getDistanceToCentroid in it prior to adding getOOQ, getNormOC, and getODotC. For MIP it only needs OOQ, NormOC, and ODotC, and essentially ignores the rest (they actually overlap with the other values or are 0f). So I don't think it blows up as-is, but it is storing an extra unnecessary float for Euclidean now. But I could be missing something.
   
   Two thoughts I had and hadn't quite gotten to yet. I think I agree with your thoughts here that this should deal with arbitrary float (or byte) values. I think where we had last landed was some way to decode those values on both read and write, where the Scorer? or Format? (via reflection? to avoid versioning problems?) would supply to the Reader and Writer a set of bytes describing how to serialize and deserialize appropriately based on the similarity function, which would eliminate some overhead there (1 byte for Euclidean). I would buy that we could update these interfaces to deal with an array of floats in the meantime, if quantization of these corrective factors is not a high priority; I was operating right now under the assumption that we do want to refactor to store minimum overhead for each vector (2 bytes / vector in the "Euclidean" and 3 bytes in the "MIP" use-case).
   
   My second thought was that I think we want to swap all of these values out for 1-byte quantized forms, and without having started that work yet it seems pretty straightforward in my head to do so, which is probably why I'm seeing it as part of this first pass.
   
   I was thinking to do both of these things together and refactor this to return a getCorrectiveBytes instead of these individual corrective factors, putting the burden on the caller to decode them.
   
   There is a separate place, specifically for query (not indexing), where instead of pulling through a set of corrective float[] factors we are now pulling through a QueryFactors object, which may be worth discussing further (but it is a different place in the code)? I did this because we could optimize one of the floats to a short, and with the belief that we could / want to optimize the other values as well (as well as greatly improving readability). Although subsequent conversations with the team suggest that we don't care about quantizing these values (at least not as much), because the query quantization that gets serialized is only written to a temporary file. I digress.




Re: [I] TestBoolean2.testRandomQueries fails in CI due to eating up heap space [lucene]

2024-09-03 Thread via GitHub


rmuir commented on issue #11754:
URL: https://github.com/apache/lucene/issues/11754#issuecomment-2327418150

   @javanna if you can reproduce it again, can you try 
`System.out.println(mulFactor)` at the end of the BeforeClass method?
   
   OOM in the stacktrace comes from creating priorityqueue of `1000 * 
mulFactor`. Then if we look at `@BeforeClass` where the `mulFactor` gets set, 
it isn't even constant, it is set dynamically, with exponential (!) growth like 
this:
   
   ```java
   do {
 addIndexes();
 ...
 mulFactor *= 2;
   } while (docCount < 3000 * NUM_FILLER_DOCS);
   ```
   
   I think the fact that it grows the index and pq size (mulFactor) exponentially may be the reason for the trouble. Half of the time (NUM_FILLER_DOCS=4k) it creates a 12M document index.
   
   In some rare situations (PRE_FILLER_DOCS small or zero) it looks like it may need to call addIndexes() 12 times to do this, so I think `mulFactor` can get as high as `4K`. And I think that leads to asking for millions of results at once (1000 * 4096), which causes the OOM.
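   
   Back-of-envelope (the per-entry size is an assumption, not measured):
   
   ```java
   // Rough sketch of the suspected blow-up when mulFactor reaches its worst case.
   int mulFactor = 4096;                          // worst case from the exponential addIndexes() loop above
   long pqEntries = 1000L * mulFactor;            // ~4.1M entries requested from the priority queue
   long bytesPerEntry = 32;                       // assumed ScoreDoc + queue overhead per entry
   long perCollector = pqEntries * bytesPerEntry; // ~130 MB for a single collector's queue
   // With one collector per slice, a handful of slices is enough to exhaust the test's heap.
   ```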
   





Re: [PR] Move anonymous Weight implementation in PointRangeQuery to named class [lucene]

2024-09-03 Thread via GitHub


jpountz commented on PR #13711:
URL: https://github.com/apache/lucene/pull/13711#issuecomment-2327425891

   Sorry, I don't think we should make Lucene's `Weight` implementations public.
   
   I looked up the OpenSearch issue. If I understand correctly, the problem you're trying to solve is that it's wasteful for `PointRangeQuery` to evaluate the whole range when it's only asked for the first 10 doc IDs that match the range query. I agree it's wasteful. We have the same problem in the nightly benchmarks with the IntNRQ task. I wonder if there are better ways to do what you're after, e.g. adding a `TopDocs Weight#topk(int n, int totalHitsThreshold)` API that would default to collecting hits and that some classes such as `PointRangeQuery` could override. As I'm writing this, I'm not convinced that it's actually a good idea. Using sparse indexing would likely be a better approach, especially if the index can be sorted, as this would produce good iterators that don't have this huge up-front cost of evaluating the query against the entire segment.
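   
   For reference, the index-sorting route would look roughly like this (sketch only; `"field"` stands for whatever field backs the range query, `analyzer`/`directory` are assumed to be in scope, and the sort field must also be indexed with doc values):
   
   ```java
   // Illustrative only: sort the index on the range field so matching doc IDs cluster together.
   IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
   iwc.setIndexSort(new Sort(new SortedNumericSortField("field", SortField.Type.LONG)));
   try (IndexWriter writer = new IndexWriter(directory, iwc)) {
     // index documents that add both a LongPoint and a SortedNumericDocValuesField for "field",
     // so the index sort and the range query line up
   }
   ```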





Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-03 Thread via GitHub


gsmiller commented on code in PR #13702:
URL: https://github.com/apache/lucene/pull/13702#discussion_r1742791755


##
lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java:
##
@@ -480,59 +462,61 @@ private void searchSequentially(
 }
 Query[] drillDownQueries = query.getDrillDownQueries();
 
-DrillSidewaysQuery dsq =
-new DrillSidewaysQuery(
-baseQuery,
-// drillDownCollectorOwner,
-// Don't pass drill down collector because drill down is collected 
by IndexSearcher
-// itself.
-// TODO: deprecate drillDown collection in DrillSidewaysQuery?
-null,
-drillSidewaysCollectorOwners,
-drillDownQueries,
-scoreSubDocsAtOnce());
-
-searcher.search(dsq, drillDownCollectorOwner);
-// This method doesn't return results as each dimension might have its own 
result type.
-// But we call getResult to trigger results reducing, so that users don't 
have to worry about
-// it.
-drillDownCollectorOwner.getResult();
-if (drillSidewaysCollectorOwners != null) {
-  for (CollectorOwner sidewaysOwner : drillSidewaysCollectorOwners) {
-sidewaysOwner.getResult();
+DrillSidewaysQuery dsq =
+new DrillSidewaysQuery<>(
+baseQuery, drillSidewaysCollectorManagers, drillDownQueries, 
scoreSubDocsAtOnce());
+
+T collectorResult = searcher.search(dsq, drillDownCollectorManager);
+List drillSidewaysResults = new ArrayList<>(drillDownDims.size());
+assert drillSidewaysCollectorManagers != null
+: "Case without drill sideways dimensions is handled above";
+int numSlices = dsq.managedDrillSidewaysCollectors.size();
+for (int dim = 0; dim < drillDownDims.size(); dim++) {
+  List collectorsForDim = new ArrayList<>(numSlices);
+  for (int slice = 0; slice < numSlices; slice++) {
+
collectorsForDim.add(dsq.managedDrillSidewaysCollectors.get(slice).get(dim));
   }
+  drillSidewaysResults.add(
+  dim, 
drillSidewaysCollectorManagers.get(dim).reduce(collectorsForDim));
 }
+return new Result<>(collectorResult, drillSidewaysResults);
   }
 
-  private void searchConcurrently(
+  private  Result 
searchConcurrently(
   final DrillDownQuery query,
-  final CollectorOwner drillDownCollectorOwner,
-  final List> drillSidewaysCollectorOwners)
+  final CollectorManager drillDownCollectorManager,
+  final List> drillSidewaysCollectorManagers)
   throws IOException {

Review Comment:
   Yeah, we could do that, but I think I'd be more inclined to leave it as-is 
for now and explore the idea of moving the "sideways" functionality into 
IndexSearcher as a supported pattern there, and fully deprecate `DrillSideways` 
as its own entry point. If we did that, we could leave out the concurrent 
version of this algorithm in IndexSearcher initially and it could be ported 
later if a valid use-case comes up?






Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-03 Thread via GitHub


gsmiller commented on code in PR #13702:
URL: https://github.com/apache/lucene/pull/13702#discussion_r1742821648


##
lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java:
##
@@ -414,62 +399,59 @@ public  ConcurrentDrillSidewaysResult search(
 
   /**
* Search using DrillDownQuery with custom collectors. This method can be 
used with any {@link
-   * CollectorOwner}s. It doesn't return anything because it is expected that 
you read results from
-   * provided {@link CollectorOwner}s.
-   *
-   * To read the results, run {@link CollectorOwner#getResult()} for drill 
down and all drill
-   * sideways dimensions.
-   *
-   * Note: use {@link Collections#unmodifiableList(List)} to wrap {@code
-   * drillSidewaysCollectorOwners} to convince compiler that it is safe to use 
List here.
-   *
-   * Use {@link MultiCollectorManager} wrapped by {@link CollectorOwner} to 
collect both hits and
-   * facets for the entire query and/or for drill-sideways dimensions.
+   * CollectorManager}s.
*
-   * TODO: Class CollectorOwner was created so that we can ignore 
CollectorManager type C,
-   * because we want each dimensions to be able to use their own types. 
Alternatively, we can use
-   * typesafe heterogeneous container and provide CollectorManager type for 
each dimension to this
-   * method? I do like CollectorOwner approach as it seems more intuitive?
+   * Note: Use {@link MultiCollectorManager} to collect both hits and 
facets for the entire query
+   * and/or for drill-sideways dimensions. You can also use it to wrap 
different types of {@link
+   * CollectorManager} for drill-sideways dimensions.
*/
-  public void search(
+  public  Result search(
   final DrillDownQuery query,
-  CollectorOwner drillDownCollectorOwner,
-  List> drillSidewaysCollectorOwners)
+  CollectorManager drillDownCollectorManager,
+  List> drillSidewaysCollectorManagers)
   throws IOException {
-if (drillDownCollectorOwner == null) {
+if (drillDownCollectorManager == null) {
   throw new IllegalArgumentException(
   "This search method requires client to provide drill down collector 
manager");
 }
-if (drillSidewaysCollectorOwners == null) {
+if (drillSidewaysCollectorManagers == null) {

Review Comment:
   Thanks! I personally think it might be a little aggressive to deprecate all 
these methods, especially so close to the planned 10.0 release. I'm worried 
we're overlooking legitimate needs/use-cases. Let's follow up with a general 
`DrillSideways` cleanup in 10.0? Whether that's moving everything into 
IndexSearcher or simplifying `DrillSideways`, I think we've got some options, 
but I'd prefer to take our time a little more.
   
   To this end, I reverted the latest commit that adds the deprecation on your 
branch (thanks for keeping it separate!) and am merging/back-porting. If you 
feel strongly that we should move for deprecation in 9.12 please let me know 
and we can still consider it (but as a separate commit). Does that sound 
reasonable?






Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-09-03 Thread via GitHub


github-actions[bot] commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2327674839

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Add Facets#getBulkSpecificValues method [lucene]

2024-09-03 Thread via GitHub


github-actions[bot] commented on PR #12862:
URL: https://github.com/apache/lucene/pull/12862#issuecomment-2327675527

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-03 Thread via GitHub


gsmiller merged PR #13702:
URL: https://github.com/apache/lucene/pull/13702





Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-03 Thread via GitHub


gsmiller commented on PR #13702:
URL: https://github.com/apache/lucene/pull/13702#issuecomment-2327689510

   Got this merged on main but will work on the back port in the morning (need 
to work through some conflicts and running out of time today).

