[PR] Determinize automata used by IntervalsSource.regexp [lucene]

2024-09-05 Thread via GitHub


ChrisHegarty opened a new pull request, #13718:
URL: https://github.com/apache/lucene/pull/13718

   This commit determinizes internal automata used in the construction of the 
IntervalsSource created by the `regexp` factory.
   
   relates #13715


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] move Operations.sameLanguage/subsetOf to AutomatonTestUtil in test-framework [lucene]

2024-09-05 Thread via GitHub


rmuir merged PR #13708:
URL: https://github.com/apache/lucene/pull/13708





Re: [I] simplify checkWorkingCopyClean to make backporting easier? [lucene]

2024-09-05 Thread via GitHub


rmuir commented on issue #13719:
URL: https://github.com/apache/lucene/issues/13719#issuecomment-2331298246

   I do this on the build side rather than locally, so you might want to tweak 
it if you want to ignore "explicitly git-added files", which I think is our 
use-case. `git status` has a lot of options to do just that.





Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-05 Thread via GitHub


rmuir merged PR #13707:
URL: https://github.com/apache/lucene/pull/13707





Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-05 Thread via GitHub


mikemccand commented on code in PR #13707:
URL: https://github.com/apache/lucene/pull/13707#discussion_r1745429485


##
lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java:
##
@@ -857,22 +857,38 @@ public static boolean isEmpty(Automaton a) {
     return true;
   }
 
-  /** Returns true if the given automaton accepts all strings. The automaton must be minimized. */
+  /** Returns true if the given automaton accepts all strings. */
   public static boolean isTotal(Automaton a) {
     return isTotal(a, Character.MIN_CODE_POINT, Character.MAX_CODE_POINT);
   }
 
   /**
    * Returns true if the given automaton accepts all strings for the specified min/max range of the
-   * alphabet. The automaton must be minimized.
+   * alphabet.
    */
   public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
-    if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
-      Transition t = new Transition();
-      a.getTransition(0, 0, t);
-      return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
+    BitSet states = getLiveStates(a);
+    Transition spare = new Transition();
+    int seenStates = 0;
+    for (int state = states.nextSetBit(0); state >= 0; state = states.nextSetBit(state + 1)) {
+      // all reachable states must be accept states
+      if (a.isAccept(state) == false) return false;

Review Comment:
   It can return a false `false`!  (When the automaton is non-deterministic).
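
To make the false `false` concrete, here is a minimal, self-contained sketch. It deliberately does not use Lucene's `Automaton` API: `ToyNfaTotality`, `DELTA`, and `ACCEPT` are hypothetical names for a hand-rolled two-state NFA over a binary alphabet. State 0 accepts and loops on every symbol, so every string is accepted, yet state 1 is live and non-accepting, so a per-state "all live states accept" check reports `false`. The brute-force check only covers strings up to a bounded length.

```java
import java.util.BitSet;

public final class ToyNfaTotality {
  // Hypothetical toy encoding: DELTA[state][symbol] lists the destination states.
  static final int[][][] DELTA = {
    {{0, 1}, {0, 1}}, // state 0: on symbol 0 or 1, go to state 0 or state 1
    {{0}, {0}} // state 1: on either symbol, go back to state 0
  };
  static final boolean[] ACCEPT = {true, false};

  /** Per-state check: both states are reachable and live, so both would have to accept. */
  public static boolean everyLiveStateAccepts() {
    for (boolean accepting : ACCEPT) {
      if (!accepting) return false;
    }
    return true;
  }

  /** Subset simulation: true if every string of length <= maxLen is accepted. */
  public static boolean acceptsAllStrings(int maxLen) {
    BitSet start = new BitSet();
    start.set(0);
    return explore(start, maxLen);
  }

  private static boolean explore(BitSet current, int remaining) {
    // The string read so far is accepted if any current state accepts.
    boolean accepted = false;
    for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
      accepted |= ACCEPT[s];
    }
    if (!accepted) return false;
    if (remaining == 0) return true;
    for (int symbol = 0; symbol < 2; symbol++) {
      BitSet next = new BitSet();
      for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
        for (int dest : DELTA[s][symbol]) next.set(dest);
      }
      if (!explore(next, remaining - 1)) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println("per-state check: " + everyLiveStateAccepts());
    System.out.println("accepts all strings up to length 8: " + acceptsAllStrings(8));
  }
}
```

Determinizing this toy automaton would merge the reachable subsets into a single accepting state, at which point the per-state check and the language agree.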






Re: [I] Reproducible test failure in TestTaxonomyFacetAssociations.testFloatSumAssociation -- ULP float issue? [lucene]

2024-09-05 Thread via GitHub


stefanvodita commented on issue #13720:
URL: https://github.com/apache/lucene/issues/13720#issuecomment-2331754886

   It's one of those float-summation-is-not-associative errors.
   First ordering:
   ```
 1> 0.0 + 575310.1 = 575310.1
 1> 575310.1 + 701147.2 = 1276457.2
 1> 1276457.2 + 555620.8 = 1832078.0
   ```
   Second ordering:
   ```
 1> 0.0 + 575310.1 = 575310.1
 1> 575310.1 + 555620.8 = 1130931.0
 1> 1130931.0 + 701147.2 = 1832078.2
   ```
   The results are compared with an epsilon of 0.2. The short-term solution can 
be to increase that even more. The long-term solution can be #13011.
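
The two orderings can be reproduced with a few lines of plain Java. This is a minimal sketch, assuming the logged values are the exact `float` inputs; `FloatSumOrder` and `sum` are hypothetical names, not part of the test.

```java
public final class FloatSumOrder {
  /** Sums the values left to right in float32 precision. */
  public static float sum(float... values) {
    float total = 0.0f;
    for (float v : values) {
      total += v;
    }
    return total;
  }

  public static void main(String[] args) {
    float first = sum(575310.1f, 701147.2f, 555620.8f);
    float second = sum(575310.1f, 555620.8f, 701147.2f);
    System.out.println(first + " vs " + second);
    System.out.println("difference = " + Math.abs(first - second));
    System.out.println("ulp at this magnitude = " + Math.ulp(first));
  }
}
```

The two sums differ by 0.25, which is two ULPs at this magnitude and larger than the 0.2 epsilon.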





[PR] Follow-up to GH#13702 [lucene]

2024-09-05 Thread via GitHub


gsmiller opened a new pull request, #13722:
URL: https://github.com/apache/lucene/pull/13722

   Ensures we retain pre-existing (but strange) inconsistency in 
DrillSideways#search(DrillDownQuery, Collector). This is a deprecated method so 
I propose we retain this inconsistency since the method will go away with 10.0.
   





Re: [PR] Add dynamic range facets [lucene]

2024-09-05 Thread via GitHub


mikemccand commented on code in PR #13689:
URL: https://github.com/apache/lucene/pull/13689#discussion_r1745595485


##
lucene/demo/src/java/org/apache/lucene/demo/facet/package-info.java:
##
@@ -385,6 +385,12 @@
  * Sampling support is implemented in {@link
  * org.apache.lucene.facet.RandomSamplingFacetsCollector}.
  *
+ * Dynamic Range Facets

Review Comment:
   Thank you for updating package javadocs!



##
lucene/CHANGES.txt:
##
@@ -303,6 +303,9 @@ New Features
 
 * GITHUB#13678: Add support JDK 23 to the Panama Vectorization Provider. 
(Chris Hegarty)
 
+* GITHUB#13689: Dynamic range facets - create weighted ranges over numeric 
fields with counts per range.

Review Comment:
   Maybe make this a bit more verbose?  E.g. something like:
   
   ```
   Add a new faceting feature, dynamic range facets, which automatically picks 
a balanced set of numeric ranges based on the distribution of values that occur 
across all hits.  For use cases that have a highly variable numeric doc values 
field, such as "price" in an e-commerce application, this facet method is 
powerful as it allows the presented ranges to adapt depending on what hits the 
query actually matches.  This is in contrast to existing range faceting that 
requires the application to provide the specific fixed ranges up front.
   ```



##
lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java:
##
@@ -0,0 +1,276 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.range;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.LongValues;
+import org.apache.lucene.search.LongValuesSource;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.InPlaceMergeSorter;
+
+/**
+ * Methods to create dynamic ranges for numeric fields.
+ *
+ * @lucene.experimental
+ */
+public final class DynamicRangeUtil {
+
+  private DynamicRangeUtil() {}
+
+  /**
+   * Construct dynamic ranges using the specified weight field to generate 
equi-weight range for the
+   * specified numeric bin field
+   *
+   * @param weightFieldName Name of the specified weight field
+   * @param weightValueSource Value source of the weight field
+   * @param fieldValueSource Value source of the value field
+   * @param facetsCollector FacetsCollector
+   * @param topN Number of requested ranges
+   * @param exec An executor service that is used to do the computation
+   * @return A list of DynamicRangeInfo that contains count, relevance, min, 
max, and centroid for
+   * each range
+   */
+  public static List computeDynamicRanges(
+  String weightFieldName,
+  LongValuesSource weightValueSource,
+  LongValuesSource fieldValueSource,
+  FacetsCollector facetsCollector,
+  int topN,
+  ExecutorService exec)
+  throws IOException {
+
+List matchingDocsList = 
facetsCollector.getMatchingDocs();
+int totalDoc = matchingDocsList.stream().mapToInt(matchingDoc -> 
matchingDoc.totalHits).sum();
+long[] values = new long[totalDoc];
+long[] weights = new long[totalDoc];
+long totalWeight = 0;
+int overallLength = 0;
+
+List> futures = new ArrayList<>();
+List tasks = new ArrayList<>();
+for (FacetsCollector.MatchingDocs matchingDocs : matchingDocsList) {
+  if (matchingDocs.totalHits > 0) {
+SegmentOutput segmentOutput = new 
SegmentOutput(matchingDocs.totalHits);
+
+// [1] retrieve values and associated weights concurrently
+SegmentTask task =
+new SegmentTask(matchingDocs, fieldValueSource, weightValueSource, 
segmentOutput);
+tasks.add(task);
+futures.add(exec.submit(task));
+  }
+}
+
+// [2] wait for all segment runs to finish
+for (Future future : futures) {
+  try {
+future.get();
+  } catch (InterruptedException ie) {
+throw new RuntimeExcept

Re: [I] Reproducible test failure in TestTaxonomyFacetAssociations.testFloatSumAssociation -- ULP float issue? [lucene]

2024-09-05 Thread via GitHub


mikemccand commented on issue #13720:
URL: https://github.com/apache/lucene/issues/13720#issuecomment-2331878119

   I wish the test assert APIs allowed us to express the allowed epsilon in 
ULPs (1 or 2 or so), not a fixed float.
   
   The expected/allowed absolute error varies with how large the float is.  In 
this case 0.25 is 2 ULPs at 1832078.25 for `float32` I think.
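
A ULP-based comparison could look roughly like the sketch below. It is an illustration only: `UlpFloatCompare`, `orderedBits`, and `equalsWithinUlps` are hypothetical names, and this is neither an existing Lucene test API nor the Apache Commons Numbers implementation. It maps each float's bit pattern onto a monotonically ordered integer and compares the distance between the two integers.

```java
public final class UlpFloatCompare {
  /** Maps float bits to an int that increases monotonically with the float value. */
  static int orderedBits(float f) {
    int bits = Float.floatToIntBits(f);
    // Negative floats have the sign bit set; flip them so that -0.0f maps to 0
    // and more negative values map to more negative ints.
    return bits < 0 ? Integer.MIN_VALUE - bits : bits;
  }

  /** Returns true if x and y are at most maxUlps representable floats apart. */
  public static boolean equalsWithinUlps(float x, float y, int maxUlps) {
    if (Float.isNaN(x) || Float.isNaN(y)) {
      return false;
    }
    // Widen to long so the subtraction cannot overflow.
    long distance = Math.abs((long) orderedBits(x) - (long) orderedBits(y));
    return distance <= maxUlps;
  }

  public static void main(String[] args) {
    System.out.println(equalsWithinUlps(1832078.0f, 1832078.25f, 2));
    System.out.println(equalsWithinUlps(1832078.0f, 1832078.25f, 1));
  }
}
```

With this shape, the allowed error scales with the magnitude of the values instead of being a fixed float.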





Re: [PR] Early exit from Operations#removeDeadStates when an automaton doesn't have dead states. [lucene]

2024-09-05 Thread via GitHub


jpountz merged PR #13721:
URL: https://github.com/apache/lucene/pull/13721





Re: [PR] Add Bulk Scorer For ToParentBlockJoinQuery [lucene]

2024-09-05 Thread via GitHub


jpountz commented on code in PR #13697:
URL: https://github.com/apache/lucene/pull/13697#discussion_r1745619666


##
lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java:
##
@@ -440,6 +478,101 @@ private String formatScoreExplanation(int matches, int start, int end, ScoreMode
     }
   }
 
+  private abstract static class BatchAwareLeafCollector extends FilterLeafCollector {
+    public BatchAwareLeafCollector(LeafCollector in) {
+      super(in);
+    }
+
+    public void endBatch() throws IOException {}
+  }
+
+  private static class BlockJoinBulkScorer extends BulkScorer {
+    private final BulkScorer childBulkScorer;
+    private final ScoreMode scoreMode;
+    private final BitSet parents;
+    private final int parentsLength;
+
+    public BlockJoinBulkScorer(BulkScorer childBulkScorer, ScoreMode scoreMode, BitSet parents) {
+      this.childBulkScorer = childBulkScorer;
+      this.scoreMode = scoreMode;
+      this.parents = parents;
+      this.parentsLength = parents.length();
+    }
+
+    @Override
+    public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
+        throws IOException {
+      // Subtract one because max is exclusive w.r.t. score but inclusive w.r.t prevSetBit
+      int lastParent = parents.prevSetBit(Math.min(parentsLength, max) - 1);

Review Comment:
   I believe it would be legal to call BulkScorer.score with min=max=0, and 
this would give an out-of-bounds exception. We may need to check for the case 
when max == 0 explicitly like we do below for `min`?
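
One possible shape for the guard, as a sketch only: `LastParentGuard` and `lastParentBefore` are hypothetical names, and `java.util.BitSet` stands in for Lucene's `BitSet` so the example is self-contained. Note that `java.util.BitSet.previousSetBit` itself tolerates an index of -1, so the stand-in does not reproduce the failure; the sketch only shows the explicit check mirroring the one already used for `min`.

```java
import java.util.BitSet;

public final class LastParentGuard {
  /**
   * Returns the last parent doc strictly before max (max is exclusive), or -1 if there is none.
   * The explicit upper == 0 branch avoids asking for the bit before index 0.
   */
  public static int lastParentBefore(BitSet parents, int parentsLength, int max) {
    int upper = Math.min(parentsLength, max);
    return upper == 0 ? -1 : parents.previousSetBit(upper - 1);
  }

  public static void main(String[] args) {
    BitSet parents = new BitSet();
    parents.set(3);
    parents.set(7);
    System.out.println(lastParentBefore(parents, 8, 0)); // empty range at the start
    System.out.println(lastParentBefore(parents, 8, 4));
    System.out.println(lastParentBefore(parents, 8, 100)); // max clamped to parentsLength
  }
}
```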






Re: [PR] Add Bulk Scorer For ToParentBlockJoinQuery [lucene]

2024-09-05 Thread via GitHub


jpountz commented on code in PR #13697:
URL: https://github.com/apache/lucene/pull/13697#discussion_r1745752639


##
lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java:
##
@@ -440,6 +478,114 @@ private String formatScoreExplanation(int matches, int 
start, int end, ScoreMode
 }
   }
 
+  private abstract static class BatchAwareLeafCollector extends 
FilterLeafCollector {
+public BatchAwareLeafCollector(LeafCollector in) {
+  super(in);
+}
+
+public void endBatch() throws IOException {}
+  }
+
+  private static class BlockJoinBulkScorer extends BulkScorer {
+private final BulkScorer childBulkScorer;
+private final ScoreMode scoreMode;
+private final BitSet parents;
+private final int parentsLength;
+
+public BlockJoinBulkScorer(BulkScorer childBulkScorer, ScoreMode 
scoreMode, BitSet parents) {
+  this.childBulkScorer = childBulkScorer;
+  this.scoreMode = scoreMode;
+  this.parents = parents;
+  this.parentsLength = parents.length();
+}
+
+@Override
+public int score(LeafCollector collector, Bits acceptDocs, int min, int 
max)
+throws IOException {
+  // Subtract one because max is exclusive w.r.t. score but inclusive 
w.r.t prevSetBit
+  int lastParent = parents.prevSetBit(Math.min(parentsLength, max) - 1);
+  int prevParent = min == 0 ? -1 : parents.prevSetBit(min - 1);
+  if (lastParent == prevParent) {
+// No parent docs in this range.
+// If we've scored the last parent in the bit set, return NO_MORE_DOCS 
to indicate we are
+// done scoring.
+return max >= parentsLength ? NO_MORE_DOCS : max;
+  }
+
+  BatchAwareLeafCollector wrappedCollector = wrapCollector(collector);
+  childBulkScorer.score(wrappedCollector, acceptDocs, prevParent + 1, 
lastParent + 1);
+  wrappedCollector.endBatch();
+
+  // If we've scored the last parent in the bit set, return NO_MORE_DOCS 
to indicate we are done
+  // scoring
+  return lastParent + 1 >= parentsLength ? NO_MORE_DOCS : max;
+}
+
+@Override
+public long cost() {
+  return childBulkScorer.cost();
+}
+
+private BatchAwareLeafCollector wrapCollector(LeafCollector collector) {
+  return new BatchAwareLeafCollector(collector) {
+private final Score currentParentScore = new Score(scoreMode);
+private int currentParent = -1;
+private Scorable scorer = null;
+
+@Override
+public void setScorer(Scorable scorer) throws IOException {
+  assert scorer != null;
+  this.scorer = scorer;
+
+  super.setScorer(
+  new Scorable() {
+@Override
+public float score() {
+  return currentParentScore.score();
+}
+
+@Override
+public void setMinCompetitiveScore(float minScore) throws 
IOException {
+  if (scoreMode == ScoreMode.None || scoreMode == 
ScoreMode.Max) {
+scorer.setMinCompetitiveScore(minScore);
+  }
+}
+  });

Review Comment:
   It looks good to me this way.






Re: [PR] Add Bulk Scorer For ToParentBlockJoinQuery [lucene]

2024-09-05 Thread via GitHub


Mikep86 commented on code in PR #13697:
URL: https://github.com/apache/lucene/pull/13697#discussion_r1745751201


##
lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java:
##
@@ -440,6 +478,114 @@ private String formatScoreExplanation(int matches, int 
start, int end, ScoreMode
 }
   }
 
+  private abstract static class BatchAwareLeafCollector extends 
FilterLeafCollector {
+public BatchAwareLeafCollector(LeafCollector in) {
+  super(in);
+}
+
+public void endBatch() throws IOException {}
+  }
+
+  private static class BlockJoinBulkScorer extends BulkScorer {
+private final BulkScorer childBulkScorer;
+private final ScoreMode scoreMode;
+private final BitSet parents;
+private final int parentsLength;
+
+public BlockJoinBulkScorer(BulkScorer childBulkScorer, ScoreMode 
scoreMode, BitSet parents) {
+  this.childBulkScorer = childBulkScorer;
+  this.scoreMode = scoreMode;
+  this.parents = parents;
+  this.parentsLength = parents.length();
+}
+
+@Override
+public int score(LeafCollector collector, Bits acceptDocs, int min, int 
max)
+throws IOException {
+  // Subtract one because max is exclusive w.r.t. score but inclusive 
w.r.t prevSetBit
+  int lastParent = parents.prevSetBit(Math.min(parentsLength, max) - 1);
+  int prevParent = min == 0 ? -1 : parents.prevSetBit(min - 1);
+  if (lastParent == prevParent) {
+// No parent docs in this range.
+// If we've scored the last parent in the bit set, return NO_MORE_DOCS 
to indicate we are
+// done scoring.
+return max >= parentsLength ? NO_MORE_DOCS : max;
+  }
+
+  BatchAwareLeafCollector wrappedCollector = wrapCollector(collector);
+  childBulkScorer.score(wrappedCollector, acceptDocs, prevParent + 1, 
lastParent + 1);
+  wrappedCollector.endBatch();
+
+  // If we've scored the last parent in the bit set, return NO_MORE_DOCS 
to indicate we are done
+  // scoring
+  return lastParent + 1 >= parentsLength ? NO_MORE_DOCS : max;
+}
+
+@Override
+public long cost() {
+  return childBulkScorer.cost();
+}
+
+private BatchAwareLeafCollector wrapCollector(LeafCollector collector) {
+  return new BatchAwareLeafCollector(collector) {
+private final Score currentParentScore = new Score(scoreMode);
+private int currentParent = -1;
+private Scorable scorer = null;
+
+@Override
+public void setScorer(Scorable scorer) throws IOException {
+  assert scorer != null;
+  this.scorer = scorer;
+
+  super.setScorer(
+  new Scorable() {
+@Override
+public float score() {
+  return currentParentScore.score();
+}
+
+@Override
+public void setMinCompetitiveScore(float minScore) throws 
IOException {
+  if (scoreMode == ScoreMode.None || scoreMode == 
ScoreMode.Max) {
+scorer.setMinCompetitiveScore(minScore);
+  }
+}
+  });

Review Comment:
   LMKWYT of this approach. I originally added the child scorer to 
`currentParentScore`'s constructor and created `currentParentScore` in 
`setScorer`, but that didn't work because `setScorer` is potentially called 
multiple times.
   
   I then tried creating `currentParentScore` once, on the first call to 
`setScorer`, and checking that the same child scorer is passed in subsequent 
calls in an assertion. At least in the tests, this didn't work because the 
child scorer is wrapped in a different `AssertingScorable` in each call.
   
   This approach seemed like the simplest that avoids the above problems. I 
could go back to them though if we can assume that subsequent calls to 
`setScorer` will pass the same child scorer without checking in an assertion.






Re: [I] Reproducible test failure in TestTaxonomyFacetAssociations.testFloatSumAssociation -- ULP float issue? [lucene]

2024-09-05 Thread via GitHub


stefanvodita commented on issue #13720:
URL: https://github.com/apache/lucene/issues/13720#issuecomment-2332088174

   I like the idea of comparing based on ULP! I'll poach some code for [float 
comparison](https://github.com/apache/commons-numbers/blob/master/commons-numbers-core/src/main/java/org/apache/commons/numbers/core/Precision.java#L224)
 from Apache Commons Numbers.





Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-09-05 Thread via GitHub


mikemccand commented on PR #13572:
URL: https://github.com/apache/lucene/pull/13572#issuecomment-2332099352

   I'm trying to understand the status of this PR... so far it's a standalone 
JMH benchy showing that using [FFM](https://openjdk.org/jeps/454) to invoke our 
own native C implementation of `dotProduct` on ints (e.g. needed for `int7`, 
`int4` HNSW scalar quantization in Lucene), carefully optimized to use the 
right SIMD instructions (switching depending on the capabilities of the CPU), 
yields sizable speedups (up to 10X) over our current "pure java" Panama vector 
API implementation?
   
   This is a micro-benchmark of just `dotProduct`, so the overall gains to 
Lucene's HNSW indexing and searching will be less than 10X.  But, `dotProduct` 
is [very much the hotspot of Lucene's 
HNSW](https://github.com/mikemccand/luceneutil/issues/256#issuecomment-215955)
 indexing and searching, so the end gains will likely be significant, 
especially on CPUs that perform poorly today via Panama (e.g. ARM).
   
   To make this actually accessible to users I think we would still need to:
   
 * Add this `nativeDotProduct` somewhere, likely `lucene/misc` module, and 
get gradle to compile it all
 * Make an alternative `NativeFlatVectorScorer` (in `misc`) that uses the 
`nativeDotProduct`
 * Make a `NativeLucene99HnswVectorsFormat` that is just like the default 
for `core` (`Lucene99HnswVectorsFormat` now), but it invokes 
`NativeFlatVectorScorer` instead.  Hopefully this can be done without too much 
code duplication.
 * Show examples of how one could use Lucene's default `Codec` but swap in 
this `NativeLucene99HnswVectorsFormat` for KNN fields
   
   I think this all can be done after 10.0 -- there's no particular reason why 
we need a major release to add this?  It's an entirely new (optimized) 
feature.
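
For reference, the operation under discussion is an integer dot product over byte vectors. The scalar Java sketch below is only a baseline illustration: `ScalarDotProduct` is a hypothetical helper, not the PR's native code and not Lucene's Panama implementation.

```java
public final class ScalarDotProduct {
  /** Computes the dot product of two equal-length byte vectors, accumulating in an int. */
  public static int dotProduct(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ: " + a.length + " != " + b.length);
    }
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      // Bytes are promoted to int before multiplying, so each product cannot overflow.
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(dotProduct(new byte[] {1, 2, 3}, new byte[] {4, 5, 6}));
  }
}
```

A SIMD or native version computes the same sum, just many lanes at a time.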
   





Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-05 Thread via GitHub


javanna commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2332114836

   Hey all, I have done some benchmarking with two main goals: 
   
   1) ensure there are no regressions introduced by the proposed change 
   2) ensure there is some performance gain when intra-segment is activated, as 
basic as its support is in this initial proposed step.
   
   
   I ran `wikimediumall` benchmarks with the default parameters and manually 
added count queries to the tasks executed. The default search concurrency is 
automatic, meaning it will create an executor based on the number of CPUs 
available. The index is not force merged, so there are multiple segments.
   
   The first run is main (baseline) against my current branch 
(my_modified_version):
   
   Task                         QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                   p-value
   CountTerm                         4252.98   (6.3%)                  3928.20   (5.2%)     -7.6%  ( -18% -    4%)    0.000
   HighIntervalsOrdered                18.28   (5.7%)                    17.80   (5.4%)     -2.7%  ( -12% -    8%)    0.129
   CountOrHighMed                     319.35   (5.3%)                   311.55   (4.5%)     -2.4%  ( -11% -    7%)    0.118
   HighTermMonthSort                 1225.80   (3.8%)                  1200.00   (5.3%)     -2.1%  ( -10% -    7%)    0.148
   LowSpanNear                         17.17   (5.3%)                    16.89   (4.4%)     -1.7%  ( -10% -    8%)    0.280
   MedIntervalsOrdered                 62.72   (4.4%)                    61.69   (5.7%)     -1.6%  ( -11% -    8%)    0.310
   Fuzzy1                              65.29   (6.7%)                    64.36   (4.6%)     -1.4%  ( -11% -   10%)    0.431
   MedTermDayTaxoFacets                17.35   (6.9%)                    17.11   (7.0%)     -1.4%  ( -14% -   13%)    0.537
   HighTermTitleBDVSort                19.73   (5.3%)                    19.50   (4.1%)     -1.2%  ( -10% -    8%)    0.441
   AndHighLow                         954.69   (4.2%)                   943.83   (4.1%)     -1.1%  (  -9% -    7%)    0.384
   BrowseDayOfYearTaxoFacets            3.87   (8.4%)                     3.83   (4.9%)     -1.1%  ( -13% -   13%)    0.603
   BrowseDateTaxoFacets                 3.83   (8.8%)                     3.79   (5.9%)     -1.1%  ( -14% -   14%)    0.641
   AndHighHighDayTaxoFacets            11.99   (4.9%)                    11.86   (6.5%)     -1.1%  ( -11% -   10%)    0.556
   LowPhrase                          176.29   (3.3%)                   174.85   (3.7%)     -0.8%  (  -7% -    6%)    0.460
   BrowseRandomLabelTaxoFacets          3.20   (3.9%)                     3.18   (4.3%)     -0.8%  (  -8% -    7%)    0.547
   HighTermTitleSort                   97.01   (5.0%)                    96.30   (4.6%)     -0.7%  (  -9% -    9%)    0.628
   HighTerm                           410.73   (5.6%)                   407.91   (7.1%)     -0.7%  ( -12% -   12%)    0.736
   HighPhrase                          68.29   (4.3%)                    67.87   (4.0%)     -0.6%  (  -8% -    7%)    0.641
   CountOrHighHigh                     34.41  (25.3%)                    34.22  (20.2%)     -0.5%  ( -36% -   60%)    0.942
   OrHighNotHigh                      326.55   (6.4%)                   324.92   (5.6%)     -0.5%  ( -11% -   12%)    0.793
   AndHighMed                         232.64   (3.5%)                   231.61   (4.8%)     -0.4%  (  -8% -    8%)    0.739
   PKLookup                           163.46   (8.6%)                   162.75   (6.9%)     -0.4%  ( -14% -   16%)    0.860
   OrNotHighLow                      1042.34   (3.9%)                  1039.01   (4.2%)     -0.3%  (  -8% -    8%)    0.803
   HighSloppyPhrase                    19.17   (4.4%)                    19.12   (5.7%)     -0.3%  (  -9% -   10%)    0.855
   BrowseMonthTaxoFacets                4.08   (5.0%)                     4.07   (7.4%)     -0.2%  ( -11% -   12%)    0.908
   HighSpanNear                        17.99   (5.9%)                    18.00   (6.0%)      0.0%  ( -11% -   12%)    0.984
   OrHighMedDayTaxoFacets               2.49   (7.3%)                     2.50   (6.7%)      0.2%  ( -12% -   15%)    0.936
   OrNotHighMed                       296.13   (5.0%)                   297.20   (5.9%)      0.4%  (  -9% -   11%)    0.833
   OrHighMed                          350.64   (4.2%)                   352.22   (4.6%)      0.5%  (  -8% -    9%)    0.748
   MedPhrase                           60.88   (3.8%)                    61.18   (4.4%)      0.5%  (  -7% -    8%)    0.695
   CountAndHighMed                    272.12   (3.4%)                   273.55   (4.8%)      0.5%  (  -7% -    9%)    0.691
   Respell                             35.73   (4.9%)                    35.93   (6.7%)      0.6%  ( -10% -   12%)    0.763
   MedSloppyPhrase                     19.80   (6.6%)                    19.92   (7.3%)      0.6%  ( -12% -   15%)    0.778
   LowSloppyPhrase                     17.75   (4.3%)                    17.86   (3.7%)      0.6%  (  -7% -    9%)    0.622
   Prefix3                           1050.17   (3.9%)                  1057.01   (4.8%)      0.7%  (  -7% -    9%)    0.636
   Wildcard                           143.63   (4.1

[PR] Add unit-of-least-precision float comparison [lucene]

2024-09-05 Thread via GitHub


stefanvodita opened a new pull request, #13723:
URL: https://github.com/apache/lucene/pull/13723

   Comparing floats with a fixed epsilon doesn't really work. We add comparison 
based on unit-of-least-precision (ULP) and use it to fix a failing test.
   
   Closes #13720





Re: [PR] Dry up TestScorerPerf [lucene]

2024-09-05 Thread via GitHub


javanna merged PR #13712:
URL: https://github.com/apache/lucene/pull/13712





[I] Dimensionality reduction in Lucene [lucene]

2024-09-05 Thread via GitHub


tanyaroosta opened a new issue, #13727:
URL: https://github.com/apache/lucene/issues/13727

   ### Description
   
   Hi. I am opening a new issue to follow up on a [discussion in this 
issue](https://github.com/apache/lucene/issues/13403#issuecomment-2132043000) 
regarding using the segment size to decide if we should do vector quantization 
or not. There are two things to figure out with this, 1) if it would be 
helpful, 2) how to build/tune?
   
   I'd like to start the conversation and get some ideas and feedback from the 
community. Thanks.





Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-05 Thread via GitHub


msokolov commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2332479198

   Thanks for the testing, @javanna! Indeed it is clear that this change does 
*something* and could be useful as-is for some query loads. I'm also encouraged 
by Adrien's comments. Although ideally I would have preferred to see us get to 
a really good place, I get the PNP argument, and appreciate that we don't want 
to do everything all at once.
   
   Regarding the small regression due to the collector overhead, I wonder if it 
would be possible to create some kind of coupling between Collector and 
IndexSearcher such that the Collector can assert that it is to be used with an 
IndexSearcher that "enables cross-segment concurrency". Then it would be up to 
the user to select the proper Collector (or collector option) to use with the 
IndexSearcher they have configured. Or perhaps IndexSearcher could be a 
Collector factory?





Re: [I] simplify checkWorkingCopyClean to make backporting easier? [lucene]

2024-09-05 Thread via GitHub


rmuir commented on issue #13719:
URL: https://github.com/apache/lucene/issues/13719#issuecomment-2332541693

   @dweiss yes that's the case i hit, it is just the "switching branches from 
main" use-case and it seems to trip on things such as buildSrc and 
benchmark-jmh, I end up manually rm -rf'ing them after switching.
   
   in my uses of git-status checks, I actually run the git-status in CI after 
all the build logic runs, to make sure no tasks/tests/etc modify the tree. It 
is a different concern maybe than what we might be doing with it (I'm not 
sure). e.g. I'm not trying to detect "you forgot to git-add"
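
A rough sketch of the CI-side check described above, under stated assumptions: it runs `git status --porcelain` after the build and fails if anything is reported. The class `CleanTreeCheck` and the method `changedPaths` are hypothetical names for illustration and are not part of the Lucene build; `git status --porcelain` is a real git option.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public final class CleanTreeCheck {
  private CleanTreeCheck() {}

  /** Extracts the path portion of each entry in `git status --porcelain` (v1) output. */
  public static List<String> changedPaths(String porcelainOutput) {
    List<String> paths = new ArrayList<>();
    for (String line : porcelainOutput.split("\\R")) {
      // Each v1 entry is two status characters, a space, then the path
      // (renames appear as "old -> new" and are kept verbatim here).
      if (line.length() > 3) {
        paths.add(line.substring(3));
      }
    }
    return paths;
  }

  public static void main(String[] args) throws IOException, InterruptedException {
    // Intended to run after all build and test tasks have finished.
    Process process = new ProcessBuilder("git", "status", "--porcelain")
        .redirectErrorStream(true)
        .start();
    String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
    int exitCode = process.waitFor();
    List<String> changed = changedPaths(output);
    if (exitCode != 0 || !changed.isEmpty()) {
      System.err.println("Working tree was modified during the build: " + changed);
      System.exit(1);
    }
  }
}
```

Because the check runs on a fresh CI checkout, any reported entry was produced by the build itself, which is a different question from whether a developer forgot to `git add` a file locally.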





Re: [I] simplify checkWorkingCopyClean to make backporting easier? [lucene]

2024-09-05 Thread via GitHub


dweiss commented on issue #13719:
URL: https://github.com/apache/lucene/issues/13719#issuecomment-2332544468

   I think this logic is flawed:
   ```
   // git ignores any folders which are empty (this includes folders
   // with recursively empty sub-folders).
   def untrackedNonEmptyFolders = status.untrackedFolders.findAll { path ->
     File location = file("${rootProject.projectDir}/${path}")
     boolean hasFiles = false
     Files.walkFileTree(location.toPath(), new SimpleFileVisitor<Path>() {
       @Override
       FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
         hasFiles = true
         // Terminate early.
         return FileVisitResult.TERMINATE
       }
     })
     return hasFiles
   }
   ```
   
   we shouldn't do anything about untrackedFolders. Looking at jgit's 
implementation, I don't even see the reason it's there and I can't remember why 
I (or somebody else) added it.





Re: [I] simplify checkWorkingCopyClean to make backporting easier? [lucene]

2024-09-05 Thread via GitHub


dweiss commented on issue #13719:
URL: https://github.com/apache/lucene/issues/13719#issuecomment-2332550998

   https://github.com/apache/lucene/pull/13728 ?





Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-05 Thread via GitHub


javanna commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2332571178

   >  I think we could keep it simple and provide a separate collector manager 
perhaps that supports intra-segment concurrency for now
   
   Having thought a little more, I am not sure this is a good idea. The first 
problem I have with it is naming, what would we call this other collector 
manager? :)
   
   We could otherwise add a constructor to the existing collector manager that 
takes the slices as an argument. Not very clean, but would allow the manager to 
go through the slices and determine what collector impl to return once 
`newCollector` is called later. We could make this new constructor optional, 
and leave the default constructor untouched for now (when created through it 
the manager won't support intra-segment concurrency), so we don't break 
existing usages.
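
To make the constructor idea concrete, here is a self-contained sketch. Every type in it (`Slice`, `Collector`, `Manager`, and the `coversPartialSegment` flag) is a hypothetical stand-in invented for illustration; none of them is Lucene's actual `LeafSlice`, `Collector`, or `CollectorManager`, and the real change may look quite different.

```java
import java.util.List;

public final class SliceAwareManagerSketch {
  private SliceAwareManagerSketch() {}

  /** Hypothetical stand-in for a search slice; not Lucene's LeafSlice. */
  public static final class Slice {
    final boolean coversPartialSegment;

    public Slice(boolean coversPartialSegment) {
      this.coversPartialSegment = coversPartialSegment;
    }
  }

  /** Hypothetical stand-in for a collector. */
  public interface Collector {
    String describe();
  }

  /** Hypothetical manager with the optional slices constructor discussed above. */
  public static final class Manager {
    private final boolean intraSegment;

    /** Existing-style constructor: no intra-segment support, existing usages unchanged. */
    public Manager() {
      this.intraSegment = false;
    }

    /** New optional constructor: inspects the slices once, up front. */
    public Manager(List<Slice> slices) {
      boolean partial = false;
      for (Slice slice : slices) {
        partial |= slice.coversPartialSegment;
      }
      this.intraSegment = partial;
    }

    /** Picks the collector implementation based on what the slices looked like. */
    public Collector newCollector() {
      if (intraSegment) {
        return () -> "partition-aware collector";
      }
      return () -> "per-segment collector";
    }
  }
}
```

In this sketch the decision is made once at construction time, so `newCollector` stays cheap and callers of the default constructor see no behavior change.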





Re: [I] Should the static search methods in FacetsCollector take a FacetsCollector as last argument? [lucene]

2024-09-05 Thread via GitHub


gsmiller commented on issue #13725:
URL: https://github.com/apache/lucene/issues/13725#issuecomment-2332818574

   My vote would be to be more restrictive with these signatures and specify 
`FacetsCollectorManager` given that these are sugar methods meant to make it 
easier to do faceting while also the "main" search. Today's faceting module all 
works against `FacetsCollector` explicitly in the various APIs, so having a 
generic `Collector` or `CollectorManager` here seems confusing IMO.





Re: [PR] Break the loop when segment is fully deleted by prior delTerms or delQueries [lucene]

2024-09-05 Thread via GitHub


github-actions[bot] commented on PR #13398:
URL: https://github.com/apache/lucene/pull/13398#issuecomment-2332950383

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





[I] Gradle builds slow to start [lucene]

2024-09-05 Thread via GitHub


dweiss opened a new issue, #13730:
URL: https://github.com/apache/lucene/issues/13730

   ### Description
   
   This has been mentioned by Mike Sokolov, I think. Gradle builds have become 
slow to start as we upgraded from version to version. Interestingly, I've 
come across this hint:
   
   
https://docs.gradle.org/current/userguide/sharing_build_logic_between_subprojects.html#sec:convention_plugins_vs_cross_configuration
   
   which is the exact opposite to what we do. I do have some background in 
aspect-oriented programming so, to me, this kind of cross-configuration and 
separation of concerns is a way to clean up the build configuration and 
separate different parts of it. Well, gradle folks clearly think otherwise.
   
   I have not debugged this intensively but when you run even the smallest task 
with the ```--debug``` option, you'll see a lot of this:
   ```
   2024-09-06T08:31:11.679+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 11: acquired lock on worker lease
   2024-09-06T08:31:11.679+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 11: released lock on worker lease
   2024-09-06T08:31:11.679+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 10: acquired lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 10: released lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 9: acquired lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 9: released lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 8: acquired lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 8: released lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 2: acquired lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 2: released lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 3: acquired lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 3: released lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 5: acquired lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 5: released lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 4: acquired lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 4: released lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 6: acquired lock on worker lease
   2024-09-06T08:31:11.680+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 6: released lock on worker lease
   2024-09-06T08:31:11.681+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 7: acquired lock on worker lease
   2024-09-06T08:31:11.681+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 7: released lock on worker lease
   2024-09-06T08:31:11.681+0200 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationRunner] Build operation 'Cross-configure project :lucene:analysis:opennlp' started
   2024-09-06T08:31:11.681+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 6: acquired lock on worker lease
   2024-09-06T08:31:11.681+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 6: released lock on worker lease
   2024-09-06T08:31:11.681+0200 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationRunner] Completing Build operation 'Cross-configure project :lucene:analysis:opennlp'
   2024-09-06T08:31:11.681+0200 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Execution worker Thread 4: acquired lock on worker lease
   2024-09-06T08:31:11.681+0200 [DEBUG] 
[org.gradle.internal.resources.AbstractTrackedReso