Re: [PR] Deprecate FacetsCollector#search utility methods [lucene]

2024-09-07 Thread via GitHub


javanna commented on code in PR #13737:
URL: https://github.com/apache/lucene/pull/13737#discussion_r1748020468


##
lucene/facet/src/java/org/apache/lucene/facet/FacetsCollectorManager.java:
##
@@ -54,4 +79,167 @@ public ReducedFacetsCollector(final 
Collection facetsCollectors
   facetsCollector -> 
matchingDocs.addAll(facetsCollector.getMatchingDocs()));
 }
   }
+
+  /** Utility method, to search and also collect all hits into the provided 
{@link Collector}. */

Review Comment:
   good catch, I missed that. Will backport your commit!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Deprecate FacetsCollector#search utility methods [lucene]

2024-09-07 Thread via GitHub


javanna merged PR #13737:
URL: https://github.com/apache/lucene/pull/13737





Re: [PR] Add Bulk Scorer For ToParentBlockJoinQuery [lucene]

2024-09-07 Thread via GitHub


jpountz commented on code in PR #13697:
URL: https://github.com/apache/lucene/pull/13697#discussion_r1748062164


##
lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java:
##
@@ -156,6 +162,11 @@ public Scorer get(long leadCost) throws IOException {
   return new BlockJoinScorer(childScorerSupplier.get(leadCost), 
parents, scoreMode);
 }
 
+@Override
+public BulkScorer bulkScorer() throws IOException {
+  return new BlockJoinBulkScorer(childScorerSupplier.bulkScorer(), 
parents, scoreMode);

Review Comment:
   I see @gsmiller suggested optimizing the ScoreMode.NONE case, which doesn't 
require scoring all children of a given parent. In that case we should probably 
use the default bulk scorer here, by returning `super.bulkScorer()` if the score 
mode is NONE?
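
   A minimal sketch of the fallback suggested above. It uses hypothetical stand-in types (`BaseSupplier`, `BlockJoinSupplier`, a local `ScoreMode` enum, and strings in place of real scorers) rather than Lucene's actual classes, so it only illustrates the control flow and is not the PR's code:

```java
public class BulkScorerFallbackSketch {
  // Hypothetical stand-in for the join module's score modes.
  enum ScoreMode { NONE, AVG, MAX, TOTAL, MIN }

  // Hypothetical stand-in for a supplier that already provides a default bulk scorer.
  static class BaseSupplier {
    String bulkScorer() {
      return "default";
    }
  }

  // Hypothetical stand-in for the block-join supplier discussed in the review.
  static class BlockJoinSupplier extends BaseSupplier {
    private final ScoreMode scoreMode;

    BlockJoinSupplier(ScoreMode scoreMode) {
      this.scoreMode = scoreMode;
    }

    @Override
    String bulkScorer() {
      // With NONE, child scores are never combined, so the specialized
      // bulk scorer brings no benefit and the default one is used instead.
      if (scoreMode == ScoreMode.NONE) {
        return super.bulkScorer();
      }
      return "block-join";
    }
  }

  public static void main(String[] args) {
    System.out.println(new BlockJoinSupplier(ScoreMode.NONE).bulkScorer());
    System.out.println(new BlockJoinSupplier(ScoreMode.MAX).bulkScorer());
  }
}
```

   Keeping the check inside the override means callers do not need to know which bulk scorer they get.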



##
lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoinBulkScorer.java:
##
@@ -0,0 +1,450 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import com.carrotsearch.randomizedtesting.generators.RandomPicks;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.BooleanClause;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.BoostQuery;
+import org.apache.lucene.search.BulkScorer;
+import org.apache.lucene.search.ConstantScoreQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.tests.index.RandomIndexWriter;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.TestUtil;
+
+public class TestBlockJoinBulkScorer extends LuceneTestCase {
+  private static final String TYPE_FIELD_NAME = "type";
+  private static final String VALUE_FIELD_NAME = "value";
+  private static final String PARENT_FILTER_VALUE = "parent";
+  private static final String CHILD_FILTER_VALUE = "child";
+
+  private enum MatchValue {
+MATCH_A("A", 1),
+MATCH_B("B", 2),
+MATCH_C("C", 3),
+MATCH_D("D", 4);
+
+private static final List<MatchValue> VALUES = List.of(values());
+
+private final String text;
+private final int score;
+
+MatchValue(String text, int score) {
+  this.text = text;
+  this.score = score;
+}
+
+public String getText() {
+  return text;
+}
+
+public int getScore() {
+  return score;
+}
+
+@Override
+public String toString() {
+  return text;
+}
+
+public static MatchValue random() {
+  return RandomPicks.randomFrom(LuceneTestCase.random(), VALUES);
+}
+  }
+
+  private record ChildDocMatch(int docId, List matches) {
+public ChildDocMatch(int docId, List matches) {
+  this.docId = docId;
+  this.matches = Collections.unmodifiableList(matches);
+}
+  }
+
+  private static Map> populateRandomIndex(
+  RandomIndexWriter writer, int maxParentDocCount, int maxChildDocCount, 
int maxChildDocMatches)
+  throws IOException {
+Map> expectedMatches = new HashMap<>();
+
+final int parentDocCount = random().nextInt(1, maxParentDocCount + 1);
+int currentDocId = 0;
+for (int i = 0; i < parentDocCount; i++) {
+  final int childDocCount = random().nextInt(maxChildDocCount + 1);
+  List docs = new ArrayList<>(childDocCount);
+  List childDocMatches = new ArrayList<>(childDocCount);
+
+  for (int j = 0; j < childDocCount; j++) {
+// Build a child doc
+Document childDoc = new Document();
+childDoc.add(newStringField(TYPE_FIELD_NAME, CHILD_F

Re: [PR] Add dynamic range facets [lucene]

2024-09-07 Thread via GitHub


stefanvodita commented on code in PR #13689:
URL: https://github.com/apache/lucene/pull/13689#discussion_r1748213889


##
lucene/demo/src/java/org/apache/lucene/demo/facet/DynamicRangeFacetsExample.java:
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.demo.facet;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Locale;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.document.NumericDocValuesField;
+import org.apache.lucene.document.StringField;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.facet.range.DynamicRangeUtil;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.LongValuesSource;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.store.ByteBuffersDirectory;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.NamedThreadFactory;
+
+/**
+ * Demo dynamic range faceting.
+ *
+ * The results look like so: min: 63 max: 75 centroid: 69.00 count: 2 
weight: 137 min: 79
+ * max: 96 centroid: 86.00 count: 3 weight: 83
+ *
+ * We've computed dynamic ranges over popularity weighted by number of 
books. We can read the
+ * results as so: There are 137 books written by authors in the 63 to 75 
popularity range.
+ *
+ * How it works: We collect all the values (popularity) and their weights 
(book counts). We sort
+ * the values and find the approximate weight per range. In this case the 
total weight is 220 (total
+ * books by all authors) and we want 2 ranges, so we're aiming for 110 books 
in each range. We add
+ * Chesterton to the first range, since he is the least popular author. He's 
written a lot of books,
+ * the range's weight is 90. We add Tolstoy to the first range, since he is 
next in line of
+ * popularity. He's written another 47 books, which brings the total weight to 
137. We're over the
+ * 110 target weight, so we stop and add everyone left to the second range.
+ */
+public class DynamicRangeFacetsExample {

Review Comment:
   Thanks! I added links to some faceting examples and the guide in the demo 
overview.
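
   The arithmetic in the javadoc quoted above can be reproduced with a small standalone sketch. This is a simplified greedy split written for illustration, not the `DynamicRangeUtil` implementation. The class and record names are hypothetical. The middle popularity value (83) is inferred from the stated centroid of 86.00, and the individual book counts for the three most popular authors (30, 28, 25) are made up so that they sum to the stated weight of 83:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WeightedRangeSketch {
  // One collected value (popularity) with its weight (book count).
  record Entry(long value, long weight) {}

  // Summary of one range, mirroring the fields printed in the javadoc.
  record Range(long min, long max, double centroid, int count, long weight) {}

  // Sort by value, then close a range as soon as its weight reaches the
  // per-range target; the last range takes everything that is left.
  static List<Range> split(List<Entry> entries, int rangeCount) {
    if (rangeCount < 1) {
      throw new IllegalArgumentException("rangeCount must be >= 1");
    }
    List<Entry> sorted = new ArrayList<>(entries);
    sorted.sort(Comparator.comparingLong(Entry::value));

    long totalWeight = 0;
    for (Entry e : sorted) {
      totalWeight = Math.addExact(totalWeight, e.weight());
    }
    double target = (double) totalWeight / rangeCount;

    List<Range> ranges = new ArrayList<>();
    int start = 0;
    long weight = 0;
    for (int i = 0; i < sorted.size(); i++) {
      weight += sorted.get(i).weight();
      boolean fillingLastRange = ranges.size() == rangeCount - 1;
      boolean lastEntry = i == sorted.size() - 1;
      if ((!fillingLastRange && weight >= target) || lastEntry) {
        ranges.add(toRange(sorted.subList(start, i + 1), weight));
        start = i + 1;
        weight = 0;
      }
    }
    return ranges;
  }

  // The centroid here is the unweighted mean of the values in the slice.
  private static Range toRange(List<Entry> slice, long weight) {
    long sum = 0;
    for (Entry e : slice) {
      sum += e.value();
    }
    return new Range(
        slice.get(0).value(),
        slice.get(slice.size() - 1).value(),
        (double) sum / slice.size(),
        slice.size(),
        weight);
  }

  public static void main(String[] args) {
    List<Entry> authors =
        List.of(
            new Entry(63, 90), // Chesterton, from the javadoc
            new Entry(75, 47), // Tolstoy, from the javadoc
            new Entry(79, 30), // hypothetical weight
            new Entry(83, 28), // inferred value, hypothetical weight
            new Entry(96, 25)); // hypothetical weight
    for (Range r : split(authors, 2)) {
      System.out.println(r);
    }
  }
}
```

   With a total weight of 220 and two ranges the target is 110, so the first range closes after Tolstoy at weight 137 and the remaining three authors form the second range at weight 83.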



##
lucene/CHANGES.txt:
##
@@ -303,6 +303,9 @@ New Features
 
 * GITHUB#13678: Add support JDK 23 to the Panama Vectorization Provider. 
(Chris Hegarty)
 
+* GITHUB#13689: Dynamic range facets - create weighted ranges over numeric 
fields with counts per range.

Review Comment:
   Thank you, that's much better than what I wrote!



##
lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java:
##
@@ -0,0 +1,276 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.range;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.search.Doc

Re: [PR] Add dynamic range facets [lucene]

2024-09-07 Thread via GitHub


stefanvodita commented on code in PR #13689:
URL: https://github.com/apache/lucene/pull/13689#discussion_r1748216227


##
lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java:
##
@@ -0,0 +1,276 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.range;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.LongValues;
+import org.apache.lucene.search.LongValuesSource;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.InPlaceMergeSorter;
+
+/**
+ * Methods to create dynamic ranges for numeric fields.
+ *
+ * @lucene.experimental
+ */
+public final class DynamicRangeUtil {
+
+  private DynamicRangeUtil() {}
+
+  /**
+   * Construct dynamic ranges using the specified weight field to generate 
equi-weight range for the
+   * specified numeric bin field
+   *
+   * @param weightFieldName Name of the specified weight field
+   * @param weightValueSource Value source of the weight field
+   * @param fieldValueSource Value source of the value field
+   * @param facetsCollector FacetsCollector
+   * @param topN Number of requested ranges
+   * @param exec An executor service that is used to do the computation
+   * @return A list of DynamicRangeInfo that contains count, relevance, min, 
max, and centroid for
+   * each range
+   */
+  public static List computeDynamicRanges(
+  String weightFieldName,
+  LongValuesSource weightValueSource,
+  LongValuesSource fieldValueSource,
+  FacetsCollector facetsCollector,
+  int topN,
+  ExecutorService exec)
+  throws IOException {
+
+List<FacetsCollector.MatchingDocs> matchingDocsList = 
facetsCollector.getMatchingDocs();
+int totalDoc = matchingDocsList.stream().mapToInt(matchingDoc -> 
matchingDoc.totalHits).sum();
+long[] values = new long[totalDoc];
+long[] weights = new long[totalDoc];
+long totalWeight = 0;
+int overallLength = 0;
+
+List> futures = new ArrayList<>();
+List<SegmentTask> tasks = new ArrayList<>();
+for (FacetsCollector.MatchingDocs matchingDocs : matchingDocsList) {
+  if (matchingDocs.totalHits > 0) {
+SegmentOutput segmentOutput = new 
SegmentOutput(matchingDocs.totalHits);
+
+// [1] retrieve values and associated weights concurrently
+SegmentTask task =
+new SegmentTask(matchingDocs, fieldValueSource, weightValueSource, 
segmentOutput);
+tasks.add(task);
+futures.add(exec.submit(task));
+  }
+}
+
+// [2] wait for all segment runs to finish
+for (Future future : futures) {
+  try {
+future.get();
+  } catch (InterruptedException ie) {
+throw new RuntimeException(ie);
+  } catch (ExecutionException ee) {
+IOUtils.rethrowAlways(ee.getCause());
+  }
+}
+
+// [3] merge the segment value and weight arrays into one array 
respectively and update the
+// total weights
+// and valid value length
+for (SegmentTask task : tasks) {
+  SegmentOutput curSegmentOutput = task.segmentOutput;
+  // if segment total weight overflows, return null
+  if (curSegmentOutput == null) {
+return null;
+  }
+
+  assert curSegmentOutput.values.length == curSegmentOutput.weights.length;
+
+  try {
+totalWeight = Math.addExact(curSegmentOutput.segmentTotalWeight, 
totalWeight);
+  } catch (ArithmeticException ae) {
+throw new IllegalArgumentException(
+"weight field \"" + weightFieldName + "\": long totalWeight value 
out of bounds", ae);
+  }
+
+  int currSegmentLen = curSegmentOutput.segmentIdx;
+  System.arraycopy(curSegmentOutput.values, 0, values, overallLength, 
currSegmentLen);
+  System.arraycopy(curSegmentOutput.weights, 0, weights, overallLength, 
currSegmentLen);
+  overallLength += currSegmentLen;
+}
+return computeDynamicNumericRanges(values, weights

Re: [PR] Add deprecated complement (~) operator to RegExp [lucene]

2024-09-07 Thread via GitHub


rmuir commented on PR #13739:
URL: https://github.com/apache/lucene/pull/13739#issuecomment-2336480505

   Let's update your migration PR with upgrade notes. Not many people want to 
mess with the automaton regex parser, so I see some value in keeping the commit 
revertable. In general, I'm trying to minimize surface area here to dodge any 
future issues with 10.x backporting too.





[PR] Implement Accountable for NFARunAutomaton [lucene]

2024-09-07 Thread via GitHub


zhaih opened a new pull request, #13741:
URL: https://github.com/apache/lucene/pull/13741

   ### Description
   
   As discussed in #13715 this PR fixes: 
   1. `hashCode` of `CompiledAutomaton` forgot to consider `nfaRunAutomaton`
   2. `ramBytesUsed` of `CompiledAutomaton` forgot to consider `nfaRunAutomaton`
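
   To illustrate the two fixes listed above, here is a minimal sketch with hypothetical stand-in types. `Sized` plays the role of Lucene's `Accountable`, and `NfaRunner` and `Compiled` are simplified placeholders, not the real `NFARunAutomaton` or `CompiledAutomaton`. The byte constants are rough assumptions chosen for the example:

```java
import java.util.Arrays;
import java.util.Objects;

public class CompiledSketch {
  // Hypothetical stand-in for an interface that reports memory usage.
  interface Sized {
    long ramBytesUsed();
  }

  // Hypothetical stand-in for the NFA-based runner.
  static final class NfaRunner implements Sized {
    private static final long HEADER_BYTES = 16L; // assumed object overhead
    private final int[] transitions;

    NfaRunner(int[] transitions) {
      this.transitions = transitions.clone();
    }

    @Override
    public long ramBytesUsed() {
      return HEADER_BYTES + (long) Integer.BYTES * transitions.length;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof NfaRunner other && Arrays.equals(transitions, other.transitions);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(transitions);
    }
  }

  // Hypothetical stand-in for the compiled wrapper; the runner may be null.
  static final class Compiled implements Sized {
    private static final long BASE_BYTES = 32L; // assumed shallow size
    private final String type;
    private final NfaRunner nfaRunner;

    Compiled(String type, NfaRunner nfaRunner) {
      this.type = type;
      this.nfaRunner = nfaRunner;
    }

    @Override
    public long ramBytesUsed() {
      // Fix 2: count the optional runner instead of only the base size.
      return BASE_BYTES + (nfaRunner == null ? 0L : nfaRunner.ramBytesUsed());
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof Compiled other
          && type.equals(other.type)
          && Objects.equals(nfaRunner, other.nfaRunner);
    }

    @Override
    public int hashCode() {
      // Fix 1: include the optional runner so hashCode covers the same
      // fields as equals; Objects.hash handles null.
      return Objects.hash(type, nfaRunner);
    }
  }

  public static void main(String[] args) {
    Compiled withRunner = new Compiled("NORMAL", new NfaRunner(new int[] {1, 2, 3}));
    Compiled withoutRunner = new Compiled("NORMAL", null);
    System.out.println(withRunner.ramBytesUsed() > withoutRunner.ramBytesUsed());
    System.out.println(withRunner.equals(withoutRunner));
  }
}
```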
   
   
   

