Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]

2025-04-05 Thread via GitHub


jainankitk commented on code in PR #14267:
URL: https://github.com/apache/lucene/pull/14267#discussion_r2026298155


##
lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
##
@@ -517,6 +623,11 @@ public byte[] getUpperPoint() {
     return upperPoint.clone();
   }
 
+  // for test
+  public boolean isEqualValues() {

Review Comment:
   Good catch @gsmiller! Can we make this package-private?






Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-05 Thread via GitHub


mayya-sharipova commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2027061812


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
    * The default implementation is empty
    */
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";
+
+  /** A string representing the off-heap category for the HNSW graph. */
+  public static final String HNSW_GRAPH = "HNSW_GRAPH";
+
+  /** A string representing the off-heap category for raw vectors. */
+  public static final String RAW = "RAW";
+
+  /**
+   * Returns the desired size of off-heap memory for the given field. This size can be used to
+   * help determine the memory requirements for optimal search performance, which can be greatly
+   * affected by page faults when not enough memory is available.
+   *
+   * For reporting purposes, the backing off-heap index structures are broken into three
+   * categories: 1. {@link #RAW}, 2. {@link #HNSW_GRAPH}, and 3. {@link #QUANTIZED}. The returned
+   * map will have zero or one entry for each of these categories.
+   *
+   * The long value is the size in bytes of the off-heap space needed if the associated index
+   * structure were to be fully loaded in memory. While somewhat analogous to {@link
+   * Accountable#ramBytesUsed()} (which reports actual on-heap memory usage), the metrics reported
+   * by this method are not actual usage but rather the amount of available memory needed to
+   * fully load the index, not an actual RAM usage requirement.
+   *
+   * To determine the total desired off-heap memory size for the given field:
+   *
+   * {@code
+   * getOffHeapByteSize(field).values().stream().mapToLong(Long::longValue).sum();
+   * }
+   *
+   * @param fieldInfo the fieldInfo
+   * @return a map of the desired off-heap memory requirements by category
+   * @lucene.experimental
+   */
+  public abstract Map<String, Long> getOffHeapByteSize(FieldInfo fieldInfo);

Review Comment:
   Very nice API! +1
   But I am also thinking about other use cases, where we need to know the total size across all vector fields and we don't know/don't remember the field names. Do you think it would be worth having an API for that: maybe one that returns a map of maps?
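
   As a sketch, one hypothetical shape for such an API (illustrative only, not part of this PR): a reader-level method keyed by field name, reusing the per-field API.

   ```java
   // Hypothetical API shape (names are illustrative): one entry per vector
   // field, each mapping an off-heap category (RAW / HNSW_GRAPH / QUANTIZED)
   // to its size in bytes.
   public Map<String, Map<String, Long>> getOffHeapByteSizes() {
     Map<String, Map<String, Long>> byField = new HashMap<>();
     for (FieldInfo fi : fieldInfos) { // assumed reader-level field list
       if (fi.hasVectorValues()) {
         byField.put(fi.name, getOffHeapByteSize(fi));
       }
     }
     return byField;
   }
   ```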






Re: [PR] PointInSetQuery clips segments by lower and upper [lucene]

2025-04-05 Thread via GitHub


hanbj commented on code in PR #14268:
URL: https://github.com/apache/lucene/pull/14268#discussion_r2020444502


##
lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java:
##
@@ -122,6 +126,11 @@ protected PointInSetQuery(String field, int numDims, int bytesPerDim, Stream packedPoints) {
     }
     sortedPackedPoints = builder.finish();
     sortedPackedPointsHashCode = sortedPackedPoints.hashCode();
+    if (previous != null) {
+      BytesRef max = previous.get();
+      upperPoint = new byte[bytesPerDim * numDims];
+      System.arraycopy(max.bytes, max.offset, upperPoint, 0, upperPoint.length);

Review Comment:
   You're right, usually the length of the copied array is used. I used the 
length of max here.






Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-04-05 Thread via GitHub


jpountz merged PR #14273:
URL: https://github.com/apache/lucene/pull/14273





Re: [PR] Adds github action to verify changelog entry and set milestone to PRs [lucene]

2025-04-05 Thread via GitHub


javanna commented on PR #14279:
URL: https://github.com/apache/lucene/pull/14279#issuecomment-2737697207

   Hey @stefanvodita the changelog entry for this was filed under 10.2, but I 
don't believe the change itself was backported. Can you double check and either 
backport or move the changelog entry? Thanks!





Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-05 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r202743


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
    * The default implementation is empty
    */
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";
+
+  /** A string representing the off-heap category for the HNSW graph. */
+  public static final String HNSW_GRAPH = "HNSW_GRAPH";
+
+  /** A string representing the off-heap category for raw vectors. */
+  public static final String RAW = "RAW";
+
+  /**
+   * Returns the desired size of off-heap memory for the given field. This size can be used to
+   * help determine the memory requirements for optimal search performance, which can be greatly
+   * affected by page faults when not enough memory is available.
+   *
+   * For reporting purposes, the backing off-heap index structures are broken into three
+   * categories: 1. {@link #RAW}, 2. {@link #HNSW_GRAPH}, and 3. {@link #QUANTIZED}. The returned
+   * map will have zero or one entry for each of these categories.
+   *
+   * The long value is the size in bytes of the off-heap space needed if the associated index
+   * structure were to be fully loaded in memory. While somewhat analogous to {@link
+   * Accountable#ramBytesUsed()} (which reports actual on-heap memory usage), the metrics reported
+   * by this method are not actual usage but rather the amount of available memory needed to
+   * fully load the index, not an actual RAM usage requirement.
+   *
+   * To determine the total desired off-heap memory size for the given field:
+   *
+   * {@code
+   * getOffHeapByteSize(field).values().stream().mapToLong(Long::longValue).sum();
+   * }
+   *
+   * @param fieldInfo the fieldInfo
+   * @return a map of the desired off-heap memory requirements by category
+   * @lucene.experimental
+   */
+  public abstract Map<String, Long> getOffHeapByteSize(FieldInfo fieldInfo);

Review Comment:
   @mayya-sharipova the expected usage is that the caller would get all the fieldInfos from the reader (`LeafReader::getFieldInfos`) and then iterate over them, checking for vector info, which should be straightforward to do without an additional API point.
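
   For example, a minimal sketch of that pattern (assuming a `CodecReader`, whose `getVectorReader()` exposes the segment's `KnnVectorsReader`):

   ```java
   import org.apache.lucene.codecs.KnnVectorsReader;
   import org.apache.lucene.index.CodecReader;
   import org.apache.lucene.index.FieldInfo;

   static long totalOffHeapBytes(CodecReader reader) {
     KnnVectorsReader vectors = reader.getVectorReader();
     if (vectors == null) {
       return 0L; // segment has no vector fields
     }
     long total = 0;
     for (FieldInfo fi : reader.getFieldInfos()) {
       if (fi.hasVectorValues()) { // only vector fields report off-heap sizes
         total += vectors.getOffHeapByteSize(fi).values().stream()
             .mapToLong(Long::longValue).sum();
       }
     }
     return total;
   }
   ```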






[PR] fix TestIndexWriterWithThreads#testIOExceptionDuringAbortWithThreadsOnlyOnce [lucene]

2025-04-05 Thread via GitHub


guojialiang92 opened a new pull request, #14424:
URL: https://github.com/apache/lucene/pull/14424

   ### Description
   
   
   This PR aims to address issue 
[14423](https://github.com/apache/lucene/issues/14423).
   
   ### Tests
   
   1. In order to reliably reproduce the problem, I added a test `TestIndexWriterWithThreads#testIOExceptionWithMergeNotEndLongTime`. For details, please refer to [14423](https://github.com/apache/lucene/issues/14423).
   2. I also fixed `TestIndexWriterWithThreads#testIOExceptionDuringAbortWithThreadsOnlyOnce`.
   
   ### Checklist
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [x] I have given Lucene maintainers 
[access](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the main branch.
   - [x] I have run ./gradlew check.
   - [x] I have added tests for my changes.





Re: [PR] MultiRange query for SortedNumeric DocValues [lucene]

2025-04-05 Thread via GitHub


mkhludnev commented on code in PR #14404:
URL: https://github.com/apache/lucene/pull/14404#discussion_r2013967382


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedNumericDocValuesMultiRangeQuery.java:
##
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.search;
+
+import java.io.IOException;
+import java.util.*;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.search.ConstantScoreScorerSupplier;
+import org.apache.lucene.search.ConstantScoreWeight;
+import org.apache.lucene.search.DocValuesRangeIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TwoPhaseIterator;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.util.PriorityQueue;
+
+/**
+ * A union of multiple ranges over SortedNumericDocValuesField
+ *
+ * @lucene.experimental
+ */
+public class SortedNumericDocValuesMultiRangeQuery extends Query {
+
+  protected final String fieldName;
+  protected final NavigableSet<DocValuesMultiRangeQuery.LongRange> sortedClauses;
+
+  protected SortedNumericDocValuesMultiRangeQuery(
+      String fieldName, List<DocValuesMultiRangeQuery.LongRange> clauses) {
+    this.fieldName = fieldName;
+    sortedClauses = resolveOverlaps(clauses);
+  }
+
+  private static final class Edge {
+    private final DocValuesMultiRangeQuery.LongRange range;
+    private final boolean point;
+    private final boolean upper;
+
+    private static Edge createPoint(DocValuesMultiRangeQuery.LongRange r) {
+      return new Edge(r);
+    }
+
+    long getValue() {
+      return upper ? range.upper : range.lower;
+    }
+
+    private Edge(DocValuesMultiRangeQuery.LongRange range, boolean upper) {
+      this.range = range;
+      this.upper = upper;
+      this.point = false;
+    }
+
+    /** expecting Arrays.equals(lower.bytes,upper.bytes) i.e. point */
+    private Edge(DocValuesMultiRangeQuery.LongRange range) {
+      this.range = range;
+      this.upper = false;
+      this.point = true;
+    }
+  }
+
+  /** Merges overlapping ranges. map.floor() doesn't work with overlaps */
+  private static NavigableSet<DocValuesMultiRangeQuery.LongRange> resolveOverlaps(
+      Collection<DocValuesMultiRangeQuery.LongRange> clauses) {
+    NavigableSet<DocValuesMultiRangeQuery.LongRange> sortedClauses =
+        new TreeSet<>(
+            Comparator.comparing(r -> r.lower)
+            //.thenComparing(r -> r.upper)
+            );
+    PriorityQueue<Edge> heap =
+        new PriorityQueue<>(clauses.size() * 2) {
+          @Override
+          protected boolean lessThan(Edge a, Edge b) {
+            return a.getValue() - b.getValue() < 0;
+          }
+        };
+    for (DocValuesMultiRangeQuery.LongRange r : clauses) {
+      long cmp = r.lower - r.upper;
+      if (cmp == 0) {
+        heap.add(Edge.createPoint(r));
+      } else {
+        if (cmp < 0) {
+          heap.add(new Edge(r, false));
+          heap.add(new Edge(r, true));
+        } // else drop reverse ranges
+      }
+    }
+    int totalEdges = heap.size();
+    int depth = 0;
+    Edge started = null;
+    for (int i = 0; i < totalEdges; i++) {
+      Edge smallest = heap.pop();
+      if (depth == 0 && smallest.point) {
+        if (i < totalEdges - 1 && heap.top().point) { // repeating same points
+          if (smallest.getValue() == heap.top().getValue()) {
+            continue;
+          }
+        }
+        sortedClauses.add(smallest.range);
+      }
+      if (!smallest.point) {
+        if (!smallest.upper) {
+          depth++;
+          if (depth == 1) { // just started
+            started = smallest;
+          }
+        } else {
+          depth--;
+          if (depth == 0) {
+            sortedClauses.add(
+                started.range == smallest.range // no overlap case, the most often
+                    ? smallest.range
+                    : new DocValuesMultiRangeQuery.LongRange(
+                        started.getValue(), smallest.getValue()));
+            started = null;
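
   For intuition, the edge sweep in `resolveOverlaps` amounts to classic interval merging: process bounds in sorted order, track how many ranges are open, and emit a merged range whenever the depth returns to zero. A tiny standalone sketch of that behavior (not the PR's code):

   ```java
   import java.util.ArrayList;
   import java.util.Comparator;
   import java.util.List;

   class MergeRanges {
     record Range(long lower, long upper) {}

     // Merge overlapping ranges given in any order.
     static List<Range> merge(List<Range> in) {
       List<Range> sorted = new ArrayList<>(in);
       sorted.sort(Comparator.comparingLong(Range::lower));
       List<Range> out = new ArrayList<>();
       Range open = null;
       for (Range r : sorted) {
         if (open == null || r.lower() > open.upper()) {
           if (open != null) {
             out.add(open);
           }
           open = r; // start a new merged range
         } else if (r.upper() > open.upper()) {
           open = new Range(open.lower(), r.upper()); // extend the open range
         }
       }
       if (open != null) {
         out.add(open);
       }
       return out; // e.g. [1,5],[3,8],[10,10] -> [1,8],[10,10]
     }
   }
   ```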

Re: [PR] quick exit on filter query matching no docs when rewriting knn query [lucene]

2025-04-05 Thread via GitHub


jpountz commented on PR #14418:
URL: https://github.com/apache/lucene/pull/14418#issuecomment-2762525239

   Can you help me understand what work this change helps save?





Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-04-05 Thread via GitHub


sgup432 commented on PR #14412:
URL: https://github.com/apache/lucene/pull/14412#issuecomment-2763186278

   @jpountz Added a CHANGES entry. 





Re: [PR] Pack file pointers when merging BKD trees [lucene]

2025-04-05 Thread via GitHub


benwtrent commented on code in PR #14393:
URL: https://github.com/apache/lucene/pull/14393#discussion_r2010085479


##
lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java:
##
@@ -1961,7 +1989,7 @@ private void build(
       int leafCardinality = heapSource.computeCardinality(from, to, commonPrefixLengths);
 
       // Save the block file pointer:
-      leafBlockFPs[leavesOffset] = out.getFilePointer();
+      leafBlockFPs.add(out.getFilePointer());

Review Comment:
   Ah, since file pointers are monotonic, we can make them compact. NICE!
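
   For reference, a minimal sketch (assumed usage, not the PR's exact code) of how a monotonic sequence packs with `PackedLongValues`:

   ```java
   import org.apache.lucene.util.packed.PackedInts;
   import org.apache.lucene.util.packed.PackedLongValues;

   // Monotonically increasing file pointers delta-encode far more compactly
   // than a plain long[]: the builder stores per-block slopes plus small deltas.
   PackedLongValues.Builder builder = PackedLongValues.monotonicBuilder(PackedInts.COMPACT);
   builder.add(1024L); // block file pointers only ever grow while writing
   builder.add(2048L);
   builder.add(3072L);
   PackedLongValues fps = builder.build();
   long second = fps.get(1); // 2048
   ```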






Re: [I] Examine the effects of MADV_RANDOM when MGLRU is enabled in Linux kernel [lucene]

2025-04-05 Thread via GitHub


jimczi commented on issue #14408:
URL: https://github.com/apache/lucene/issues/14408#issuecomment-2755375551

   I believe the question is whether we need to reconsider our assumptions when 
defaulting to random read advice in the current code. With the linked change, 
using `MADV_RANDOM` will exclude pages from the LRU list, but the original 
intent was simply to reduce the read-ahead size. 
   
   We explicitly use random advice in the vector format and FST files, which 
should, by default, benefit from LRU. Users should not be required to make 
changes to achieve the correct behavior.





Re: [PR] PointInSetQuery early exit on non-matching segments [lucene]

2025-04-05 Thread via GitHub


hanbj commented on code in PR #14268:
URL: https://github.com/apache/lucene/pull/14268#discussion_r2022086841


##
lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java:
##
@@ -248,6 +255,33 @@ public long cost() {
     }
   }
 
+  private boolean checkValidPointValues(PointValues values) throws IOException {

Review Comment:
   Already rolled back.






Re: [PR] KeywordField.newSetQuery() to use prefixed terms for IndexOrDocValuesQuery [lucene]

2025-04-05 Thread via GitHub


jainankitk commented on code in PR #14435:
URL: https://github.com/apache/lucene/pull/14435#discussion_r2027694440


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String value) {
   public static Query newSetQuery(String field, Collection<BytesRef> values) {
     Objects.requireNonNull(field, "field must not be null");
     Objects.requireNonNull(values, "values must not be null");
-    Query indexQuery = new TermInSetQuery(field, values);
-    Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, field, values);
-    return new IndexOrDocValuesQuery(indexQuery, dvQuery);
+    return TermInSetQuery.newIndexOrDocValuesQuery(
+        MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values);

Review Comment:
   Probably we can use `TermInSetQuery.newIndexOrDocValuesQuery(field, values)` 
here?






Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-05 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2006876856


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix points to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block consists entirely of
+   *     pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing
+   *     a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+    BUILDING,
+    SAVED,
+    DESTROYED
+  }
+
+  private static class Node {
+
+    // The utf8 digit that leads to this Node, 0 for root node
+    private final int label;
+    // The children listed in order by their utf8 label
+    private final LinkedList<Node> children;
+    // The output of this node.
+    private Output output;
+
+    // Vars used during saving:
+
+    // The file pointer pointing to where the node is saved. -1 means the node has not been saved.
+    private long fp = -1;
+    // The iterator whose next() points to the first child that has not been saved.
+    private Iterator<Node> childrenIterator;
+
+    Node(int label, Output output, LinkedList<Node> children) {
+      this.label = label;
+      this.output = output;
+      this.children = children;
+    }
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+    return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+    if (k.length == 0) {
+      root.output = v;
+      return;
+    }
+    Node parent = root;
+    for (int i = 0; i < k.length; i++) {
+      int b = k.bytes[i + k.offset] & 0xFF;
+      Output output = i == k.length - 1 ? v : null;
+      Node node = new Node(b, output, new LinkedList<>());
+      parent.children.add(node);
+      parent = node;
+    }
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not
+   * have a key that already exists in this one, otherwise an {@link IllegalArgumentException} will
+   * be thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+    if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+      throw new IllegalStateException("tries should be unsaved");
+    }
+    // Use a simple stack to avoid recursion.
+    Deque<Runnable> stack = new ArrayDeque<>();
+    stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+    while (!stack.isEmpty()) {
+      stack.pop().run();
+    }
+    trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque<Runnable> stack) {
+    assert n.label == add.label;
+    if (add.output != null) {
+      if (n.output != null) {
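
   The `Deque<Runnable>` in `absorb` is a general trick for turning recursion into an explicit work stack, so very deep tries cannot overflow the call stack. A self-contained sketch of the same pattern (not the PR's code):

   ```java
   import java.util.ArrayDeque;
   import java.util.Deque;
   import java.util.List;

   class IterativeTraversal {
     record Node(int label, List<Node> children) {}

     // Count nodes without recursion: each popped task may push more tasks.
     static int countNodes(Node root) {
       int[] count = {0};
       Deque<Runnable> stack = new ArrayDeque<>();
       stack.push(() -> visit(root, stack, count));
       while (!stack.isEmpty()) {
         stack.pop().run();
       }
       return count[0];
     }

     private static void visit(Node n, Deque<Runnable> stack, int[] count) {
       count[0]++;
       for (Node child : n.children()) {
         stack.push(() -> visit(child, stack, count));
       }
     }
   }
   ```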

[PR] New IndexReaderFunctions.positionLength from the norm [lucene]

2025-04-05 Thread via GitHub


dsmiley opened a new pull request, #14433:
URL: https://github.com/apache/lucene/pull/14433

   ### Description
   
   Introduces 
`org.apache.lucene.queries.function.IndexReaderFunctions#positionLength`
   
   Javadocs:
   > Creates a value source that returns the position length (number of terms) 
of a field, approximated from the "norm".
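
   Presumably this leans on the same lossy `SmallFloat` encoding that the default similarities use to store field length in the norm; a sketch of that round trip (an assumption about the approach, not code from this PR):

   ```java
   import org.apache.lucene.util.SmallFloat;

   // BM25Similarity.computeNorm() stores the field's numTerms as
   // SmallFloat.intToByte4(numTerms); decoding the norm therefore yields an
   // approximate position length.
   int numTerms = 1234;
   byte norm = SmallFloat.intToByte4(numTerms);
   int approxLength = SmallFloat.byte4ToInt(norm); // ~1234, lossy for large values
   ```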





Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-04-05 Thread via GitHub


mayya-sharipova commented on code in PR #14331:
URL: https://github.com/apache/lucene/pull/14331#discussion_r2005462586


##
lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java:
##
@@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues mergedVectorValues, int maxOrd
     OnHeapHnswGraph graph;
     BitSet initializedNodes = null;
 
-    if (initReader == null) {
+    if (graphReaders.size() == 0) {
       graph = new OnHeapHnswGraph(M, maxOrd);
     } else {
+      graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed());
+      GraphReader initGraphReader = graphReaders.get(0);
+      KnnVectorsReader initReader = initGraphReader.reader();
+      MergeState.DocMap initDocMap = initGraphReader.initDocMap();
+      int initGraphSize = initGraphReader.graphSize();
       HnswGraph initializerGraph = ((HnswGraphProvider) initReader).getGraph(fieldInfo.name);
+
       if (initializerGraph.size() == 0) {
         graph = new OnHeapHnswGraph(M, maxOrd);
       } else {
         initializedNodes = new FixedBitSet(maxOrd);
-        int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, initializedNodes);
+        int[] oldToNewOrdinalMap =
+            getNewOrdMapping(
+                fieldInfo,
+                initReader,
+                initDocMap,
+                initGraphSize,
+                mergedVectorValues,
+                initializedNodes);
         graph = InitializedHnswGraphBuilder.initGraph(initializerGraph, oldToNewOrdinalMap, maxOrd);
       }
     }
     return new HnswConcurrentMergeBuilder(
         taskExecutor, numWorker, scorerSupplier, beamWidth, graph, initializedNodes);
   }
+
+  /**
+   * Creates a new mapping from old ordinals to new ordinals and returns the total number of
+   * vectors in the newly merged segment.
+   *
+   * @param mergedVectorValues vector values in the merged segment
+   * @param initializedNodes track what nodes have been initialized
+   * @return the mapping from old ordinals to new ordinals
+   * @throws IOException If an error occurs while reading from the merge state
+   */
+  private static final int[] getNewOrdMapping(
+      FieldInfo fieldInfo,
+      KnnVectorsReader initReader,
+      MergeState.DocMap initDocMap,
+      int initGraphSize,
+      KnnVectorValues mergedVectorValues,
+      BitSet initializedNodes)
+      throws IOException {
+    KnnVectorValues.DocIndexIterator initializerIterator = null;
+
+    switch (fieldInfo.getVectorEncoding()) {
+      case BYTE -> initializerIterator = initReader.getByteVectorValues(fieldInfo.name).iterator();
+      case FLOAT32 ->
+          initializerIterator = initReader.getFloatVectorValues(fieldInfo.name).iterator();
+    }
+
+    IntIntHashMap newIdToOldOrdinal = new IntIntHashMap(initGraphSize);
+    int maxNewDocID = -1;
+    for (int docId = initializerIterator.nextDoc();
+        docId != NO_MORE_DOCS;
+        docId = initializerIterator.nextDoc()) {
+      int newId = initDocMap.get(docId);
+      maxNewDocID = Math.max(newId, maxNewDocID);
+      newIdToOldOrdinal.put(newId, initializerIterator.index());

Review Comment:
   Addressed in cb852a6387a09ba43049b8a24f1e026c309b368b




Re: [I] Address gradle temp file pollution insanity [lucene]

2025-04-05 Thread via GitHub


dweiss commented on issue #14385:
URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743732858

   I think the hack we had in 
https://github.com/apache/lucene-solr/pull/1767/files used to work but gradle 
must have relocated those temp files... 
   
   The fix is simple, but I'd like to do some analysis of what exactly happened, and when, first.





Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-04-05 Thread via GitHub


rmuir commented on PR #14381:
URL: https://github.com/apache/lucene/pull/14381#issuecomment-2743822277

   @dweiss thanks for the suggestion there, gazillions of array creations 
avoided. so now this thing will only spike cpu during parsing at worst. I 
honestly forget you can pass functions to functions in java now :)





Re: [PR] upgrade icu dependency from 74.2 -> 77.1 [lucene]

2025-04-05 Thread via GitHub


rmuir merged PR #14386:
URL: https://github.com/apache/lucene/pull/14386





Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-05 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2006940286



Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-04-05 Thread via GitHub


rmuir commented on code in PR #14381:
URL: https://github.com/apache/lucene/pull/14381#discussion_r2007499003


##
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##
@@ -778,6 +786,53 @@ private int[] toCaseInsensitiveChar(int codepoint) {
     }
   }
 
+  /**
+   * Expands range to include case-insensitive matches.
+   *
+   * This is expensive: a case-insensitive range involves iterating over the range space, adding
+   * alternatives. Jump on the grenade here: contain the CPU and memory explosion in just this
+   * method, activated by an optional flag.
+   */
+  private void expandCaseInsensitiveRange(
+      int start, int end, List<Integer> rangeStarts, List<Integer> rangeEnds) {
+    if (start > end)
+      throw new IllegalArgumentException(
+          "invalid range: from (" + start + ") cannot be > to (" + end + ")");
+
+    // contain the explosion of transitions by using a throwaway state
+    Automaton scratch = new Automaton();
+    int state = scratch.createState();
+
+    // iterate over range, adding codepoint and any alternatives as transitions
+    for (int i = start; i <= end; i++) {
+      scratch.addTransition(state, state, i);
+      int[] altCodePoints = CaseFolding.lookupAlternates(i);
+      if (altCodePoints != null) {
+        for (int alt : altCodePoints) {
+          scratch.addTransition(state, state, alt);
+        }
+      } else {
+        int altCase =
+            Character.isLowerCase(i) ? Character.toUpperCase(i) : Character.toLowerCase(i);
+        if (altCase != i) {
+          scratch.addTransition(state, state, altCase);
+        }
+      }
+    }

Review Comment:
   this one is best as a separate PR. I will work on it today.






Re: [PR] Completion FSTs to be loaded off-heap by default [lucene]

2025-04-05 Thread via GitHub


javanna commented on code in PR #14364:
URL: https://github.com/apache/lucene/pull/14364#discussion_r2000872434


##
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java:
##
@@ -951,7 +951,16 @@ static IndexWriterConfig iwcWithSuggestField(Analyzer analyzer, final Set<String> suggestFields) {

Re: [I] TestIndexSortBackwardsCompatibility.testSortedIndexAddDocBlocks fails reproducibly [lucene]

2025-04-05 Thread via GitHub


dweiss closed issue #14344: 
TestIndexSortBackwardsCompatibility.testSortedIndexAddDocBlocks fails 
reproducibly
URL: https://github.com/apache/lucene/issues/14344





Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]

2025-04-05 Thread via GitHub


jainankitk commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2762048902

   > You just need to replace ctx with _.
   
   Ah, my bad! I tried `.`, but we can't use that as part of a variable name. Thanks for the suggestion @jpountz.
   
   At a high level, I have unified the concurrent/non-concurrent profiling paths as suggested. The `QueryProfilerTree` is shared across slices, and we recursively build the ProfilerTree for each slice for the response. There are a few kinks that we still need to iron out. For example:
   
   * `Weight` creation is global across slices. How do we account for its time? Should we have a separate global tree with just the weight times? We can't just get away with having the weight count at the top, as `Weight` is shared for child queries as well, right?
   * The new in-memory structure for profiled queries is a bit like below (notice the additional list for slices):
   ```
   "query": [ <-- for list of slices
       [ <-- for list of root queries
           {
             "type": "TermQuery",
             "description": "foo:bar",
             "time_in_nanos" : 11972972,
             "breakdown" :
             {
   ```
   We can probably have a map of slices, with the key being the `sliceId`:
   
   ```
   "query": {
       "some global information":
       "slices": {
           "slice1": [ <-- for list of root queries
               {
                 "type": "TermQuery",
                 "description": "foo:bar",
                 "time_in_nanos" : 11972972,
                 "breakdown" :
                 {...}}],
           "slice2": [],
           "slice3": []}
   }
   ```





Re: [I] ParallelLeafReader.getTermVectors can indirectly load TVs multiple times [LUCENE-6868] [lucene]

2025-04-05 Thread via GitHub


vigyasharma closed issue #7926: ParallelLeafReader.getTermVectors can 
indirectly load TVs multiple times [LUCENE-6868]
URL: https://github.com/apache/lucene/issues/7926





Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-05 Thread via GitHub


jimczi commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2027392059


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
* The default implementation is empty
*/
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
   nit: I wonder if we should rather reflect the underlying format here. 
Something like flat_vector_float, flat_vector_byte, flat_vector_bbq?
   



##
lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:
##
@@ -257,6 +259,19 @@ public long ramBytesUsed() {
     return size;
   }
 
+  @Override
+  public Map<String, Long> getOffHeapByteSize(FieldInfo fieldInfo) {
+    Objects.requireNonNull(fieldInfo);
+    var raw = rawVectorsReader.getOffHeapByteSize(fieldInfo);
+    var fieldEntry = fields.get(fieldInfo.name);
+    if (fieldEntry == null) {
+      assert fieldInfo.getVectorEncoding() == VectorEncoding.BYTE;

Review Comment:
   This is not possible; this format doesn't accept raw vectors in the byte format.






Re: [PR] New IndexReaderFunctions.positionLength from the norm [lucene]

2025-04-05 Thread via GitHub


bruno-roustant commented on PR #14433:
URL: https://github.com/apache/lucene/pull/14433#issuecomment-2777888670

   Why not numTerms() instead of positionLength()?
   Inside Similarity.computeNorm(), the value is named numTerms.





[PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]

2025-04-05 Thread via GitHub


rmuir opened a new pull request, #14389:
URL: https://github.com/apache/lucene/pull/14389

   Regexp has the ability to erase case differences at query time (the slow way), but there's no corresponding ability to do it the fast way: at index time.
   
   There's LowerCaseFilter, but LowerCaseFilter normalizes text for display purposes, which is different from case folding, which eliminates case differences and is appropriate for search.
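   
   A quick illustration of the semantic difference, using ICU4J's existing folding API (just to show the behavior this PR targets; the PR itself generates Lucene's own fold() tables):
   
   ```java
   import com.ibm.icu.lang.UCharacter;
   import java.util.Locale;

   // Lowercasing normalizes for display; folding erases case for matching.
   String s = "Straße";
   String lowered = s.toLowerCase(Locale.ROOT);  // "straße"  (ß preserved)
   String folded = UCharacter.foldCase(s, true); // "strasse" (ß folded to ss)
   ```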
   
   Generate fold() data in a similar way as expand() data. Expose via 
UnicodeUtil and tableize basic latin for performance. Add CaseFoldingFilter.
   
   No Analyzer chains have been modified yet, but we should be able to improve 
Unicode support by swapping out LowerCaseFilter as a followup. Some filters 
such as GreekLowerCaseFilter can probably be eliminated.
   
   





Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]

2025-04-05 Thread via GitHub


jainankitk commented on PR #14267:
URL: https://github.com/apache/lucene/pull/14267#issuecomment-2773131906

   @hanbj - Thanks for patiently addressing the review comments. While I don't 
see any performance regression risk myself, I am wondering if we can do one 
quick performance benchmark run, just to ensure we are not missing anything 
obvious?





Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-04-05 Thread via GitHub


guojialiang92 commented on PR #14417:
URL: https://github.com/apache/lucene/pull/14417#issuecomment-2766116736

   Thanks, @vigyasharma 
   I also looked at Lucene's native segment replication, just sharing my 
personal opinion.
   
   > Also, IIUC `IndexWriter#advanceSegmentInfosVersion()` was added to handle 
similar scenarios for NRT replication (Lucene's native segment replication 
implementation). I'm curious why we didn't run into the need to advance 
`SegmentInfos#counter` at that time. Do you remember, @mikemccand (I know it's 
been a while! (: )?
   
   In the code comments of Lucene's native segment replication, the risk of file conflicts is also mentioned, but no additional processing is done. From a robustness perspective, perhaps it should be handled there as well. The relevant code is as follows:
   ReplicaNode#fileIsIdentical (**Segment name was reused!  This is rare but possible and otherwise devastating**)
   ```
  private boolean fileIsIdentical(String fileName, FileMetaData srcMetaData) throws IOException {

    FileMetaData destMetaData = readLocalFileMetaData(fileName);
    if (destMetaData == null) {
      // Something went wrong in reading the file (it's corrupt, truncated, does not exist, etc.):
      return false;
    }

    if (Arrays.equals(destMetaData.header(), srcMetaData.header()) == false
        || Arrays.equals(destMetaData.footer(), srcMetaData.footer()) == false) {
      // Segment name was reused!  This is rare but possible and otherwise devastating:
      if (isVerboseFiles()) {
        message("file " + fileName + ": will copy [header/footer is different]");
      }
      return false;
    } else {
      return true;
    }
  }
   ```





Re: [I] Use @snippet javadoc tag for snippets [lucene]

2025-04-05 Thread via GitHub


dweiss commented on issue #14257:
URL: https://github.com/apache/lucene/issues/14257#issuecomment-2755082414

   I've toyed with it a bit but I don't see a way for it to not break those /// 
comments. An alternative is to fork it, fix what we need and then use the 
forked version from spotless. This is a doable alternative to using Eclipse's 
formatter - I really don't mind either.





Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-05 Thread via GitHub


gf2121 commented on PR #14333:
URL: https://github.com/apache/lucene/pull/14333#issuecomment-2771814782

   I roughly implemented the idea. This is my first time forking a new codec; hopefully I have not made too many mistakes :)
   
   A few thoughts during my refactoring:
   
   * I thought I only needed to fork a `Lucene103BlockTreeTerms` to intersect with `Lucene101Postings`, but that seems challenging based on the current API design: I have to fork the new `Lucene103Postings` as well. Maybe this is a point that can be improved?
   
   * I'm not sure it matters whether we make it a default codec in main, as main will not get released anyway? Defaulting in main without backporting sounds good enough to me. If we stick with not making it the default, maybe this codec should be moved into the test / sandbox / codec module? It would be weird to see multiple codecs in the core module for contributors who don't have the context of this PR.
   
   





Re: [I] Incorrect use of fsync [lucene]

2025-04-05 Thread via GitHub


rmuir commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2772221194

   Nobody needs to fsync any temporary files, ever. They are temporary: we 
don't need them durable. Look at how lucene uses  temporary files to understand 
this. 
   
   We don't need such files to persist to any storage device, ever. Personally I use tmpfs for temp files; they only go to memory.
   
   If your operating system doesn't give you any error when using temporary files then your operating system is broken: get a new one. If your computer doesn't detect memory corruption then buy ECC memory. Lucene has checksums and other safeguards that might indicate it, but that's no guarantee; it is just best-effort. IMO you read too far into a stackoverflow comment here without understanding how some of this works.





Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]

2025-04-05 Thread via GitHub


rmuir commented on code in PR #14388:
URL: https://github.com/apache/lucene/pull/14388#discussion_r2008139072


##
lucene/expressions/src/generated/checksums/generateAntlr.json:
##
@@ -1,7 +1,8 @@
 {
 "lucene/expressions/src/java/org/apache/lucene/expressions/js/Javascript.g4": "818e89aae0b6c7601051802013898c128fe7c1ba",
 "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptBaseVisitor.java": "6965abdb8b069aaceac1ce4f32ed965b194f3a25",
-"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "b8d6b259ebbfce09a5379a1a2aa4c1ddd4e378eb",
-"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "7a3a7b9de17f4a8d41ef342312eae5c55e483e08",
-"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e"
+"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "6508dc5008e96a1ad28c967a3401407ba83f140b",
+"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "ba6d0c00af113f115fc7a1f165da7726afb2e8c5",
+"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e",
+"property:antlr-version": "4.13.2"

Review Comment:
   Thanks for this! Yes, this is what it looks like in the ICU json file, which 
works perfectly:
   
   e.g. in `./lucene/analysis/icu/src/generated/checksums/genRbbi.json`:
   
   ```json
   {
   ...
   "property:icuConfig": "com.ibm.icu:icu4j:77.1"
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] New IndexReaderFunctions.positionLength from the norm [lucene]

2025-04-05 Thread via GitHub


dsmiley commented on PR #14433:
URL: https://github.com/apache/lucene/pull/14433#issuecomment-2780732429

   `fieldLength` works for me. I'd like `fieldPositionLength` more, as it 
characterizes the basis of the length (it's not characters). BTW, some other 
methods on this class don't have "field" in the name yet take a field arg, and 
so are a statistic about a field.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]

2025-04-05 Thread via GitHub


mkhludnev commented on code in PR #14435:
URL: https://github.com/apache/lucene/pull/14435#discussion_r2029801915


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String 
value) {
   public static Query newSetQuery(String field, Collection<BytesRef> values) {
 Objects.requireNonNull(field, "field must not be null");
 Objects.requireNonNull(values, "values must not be null");
-Query indexQuery = new TermInSetQuery(field, values);
-Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, 
field, values);
-return new IndexOrDocValuesQuery(indexQuery, dvQuery);
+return TermInSetQuery.newIndexOrDocValuesQuery(
+MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values);

Review Comment:
   OK, got it. Since we are adding something, let's add as little as possible. 
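   
   For illustration, a minimal sketch of what callers get out of this change, 
assuming the `TermInSetQuery.newIndexOrDocValuesQuery` factory from the diff 
above; the field and values are made up:
   
   ```java
   // Both legs of the IndexOrDocValuesQuery now share one prefix-encoded term
   // set, instead of each TermInSetQuery packing its own copy of the terms.
   Collection<BytesRef> values = List.of(new BytesRef("red"), new BytesRef("blue"));
   Query q = KeywordField.newSetQuery("color", values);
   ```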



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]

2025-04-05 Thread via GitHub


jpountz commented on issue #14431:
URL: https://github.com/apache/lucene/issues/14431#issuecomment-2780641860

   > and some deletes being addressed is better than none.
   
   This part of your message suggests that deletes get reclaimed progressively 
over time, which is often not true. So waiting for 50% of the time it takes to 
run merges may not result in an index that has significantly fewer deletes than 
not waiting at all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] New IndexReaderFunctions.positionLength from the norm [lucene]

2025-04-05 Thread via GitHub


jpountz commented on PR #14433:
URL: https://github.com/apache/lucene/pull/14433#issuecomment-2780644329

   What about calling it just "field length", since this is the length as 
computed for the purpose of length normalization?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-04-05 Thread via GitHub


sgup432 commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2019109527


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -122,12 +123,30 @@ public LRUQueryCache(
   long maxRamBytesUsed,
   Predicate<LeafReaderContext> leavesToCache,
   float skipCacheFactor) {
+this(maxSize, maxRamBytesUsed, leavesToCache, new 
AtomicReference<>(skipCacheFactor));
+  }
+
+  /**
+   * Additionally allows passing skipCacheFactor as an AtomicReference, so the
+   * caller can dynamically update its value (in a thread-safe way) by calling
+   * skipCacheFactor.set() on their end.
+   */
+  public LRUQueryCache(
+  int maxSize,
+  long maxRamBytesUsed,
+  Predicate<LeafReaderContext> leavesToCache,
+  AtomicReference<Float> skipCacheFactor) {

Review Comment:
   Made the changes.
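   
   A minimal usage sketch of the dynamic update this constructor enables, 
assuming the `AtomicReference` overload from the diff above; the sizes are 
arbitrary:
   
   ```java
   AtomicReference<Float> skipCacheFactor = new AtomicReference<>(1.0f);
   LRUQueryCache cache =
       new LRUQueryCache(1_000, 64 * 1024 * 1024, leaf -> true, skipCacheFactor);
   // Later, from any thread, without rebuilding the cache:
   skipCacheFactor.set(10.0f);
   ```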



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Let Decompressor implement the Closeable interface. [lucene]

2025-04-05 Thread via GitHub


jpountz commented on PR #14438:
URL: https://github.com/apache/lucene/pull/14438#issuecomment-2778028781

   Unfortunately, you can't easily use close() to release resources from a 
Decompressor, because `StoredFieldsReader` is cloneable, and close() is never 
called on the clones. The only workaround that comes to mind would consist of 
using thread-locals, but I don't think we want to support that.
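   
   A self-contained toy model of that lifecycle problem (not Lucene's actual 
classes): the original reader is closed, but its clones never are, so any 
resource tied to close() leaks once per clone.
   
   ```java
   import java.util.concurrent.atomic.AtomicInteger;
   
   public class CloneLeakDemo {
     static final AtomicInteger OPEN = new AtomicInteger();
   
     static class Reader implements Cloneable, AutoCloseable {
       Reader() { OPEN.incrementAndGet(); }
       @Override public Reader clone() { return new Reader(); } // each clone "opens" again
       @Override public void close() { OPEN.decrementAndGet(); } // only called on the original
     }
   
     public static void main(String[] args) throws Exception {
       try (Reader original = new Reader()) {
         Reader perThread = original.clone(); // never closed, like StoredFieldsReader clones
       }
       System.out.println("still open: " + OPEN.get()); // prints 1
     }
   }
   ```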


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] KeywordField.newSetQuery() to uses prefixed terms for IndexOrDocValuesQuery [lucene]

2025-04-05 Thread via GitHub


mkhludnev opened a new pull request, #14435:
URL: https://github.com/apache/lucene/pull/14435

   fix #14425
   
   ### Description
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] QueryParser parsing a phrase with a wildcard [lucene]

2025-04-05 Thread via GitHub


viliam-durina opened a new issue, #14440:
URL: https://github.com/apache/lucene/issues/14440

   ### Description
   
   Hi all,
   
   I have tried to parse this query using the classic QueryParser:
   
 String sQuery = "\"foo bar*\"";
   
   The query was parsed into a PhraseQuery with two terms: "foo" and "bar". 
That is, the wildcard was lost and the query doesn't handle the "bar" term as a 
prefix.
   
   I think this is an issue: Lucene should either produce an error, if wildcard 
search isn't supported within phrases, or it should produce a correct query.
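   
   A minimal repro sketch of the report, using the classic QueryParser (the 
analyzer choice here is my assumption):
   
   ```java
   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.queryparser.classic.QueryParser;
   import org.apache.lucene.search.Query;
   
   public class PhraseWildcardRepro {
     public static void main(String[] args) throws Exception {
       QueryParser parser = new QueryParser("f", new StandardAnalyzer());
       Query q = parser.parse("\"foo bar*\"");
       System.out.println(q); // prints f:"foo bar" -- the trailing '*' is silently dropped
     }
   }
   ```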
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Use FixedLengthBytesRefArray in OneDimensionBKDWriter to hold split values [lucene]

2025-04-05 Thread via GitHub


iverase opened a new pull request, #14383:
URL: https://github.com/apache/lucene/pull/14383

   We are currently using a list, which feels wasteful. For example, looking 
into a heap dump of an IP field, we were using almost double the heap necessary 
to hold the split values:
   
   https://github.com/user-attachments/assets/d839b0f4-ed6b-43bf-8060-47560b68be2a
   
   Using FixedLengthBytesRefArray should reduce memory usage and avoid 
humongous allocations.
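   
   For illustration, a rough sketch of the swap (`FixedLengthBytesRefArray` is 
package-private in `org.apache.lucene.util`, so this is conceptual rather than 
something external code can compile, and the 16-byte width is just an example 
for IPv6-sized values):
   
   ```java
   byte[] packedValue = new byte[16];
   
   // Before: one heap object (plus header and padding) per split value.
   List<BytesRef> splitValues = new ArrayList<>();
   splitValues.add(new BytesRef(packedValue));
   
   // After: fixed-length values appended into a single packed buffer.
   FixedLengthBytesRefArray packed = new FixedLengthBytesRefArray(16);
   packed.append(new BytesRef(packedValue));
   ```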
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-04-05 Thread via GitHub


benwtrent commented on PR #14173:
URL: https://github.com/apache/lucene/pull/14173#issuecomment-2744242199

   > do you confirm that, according to your knowledge, any relevant and active 
work toward multi-valued vectors in Lucene is effectively aggregated here?
   
   @alessandrobenedetti I think so. This is the latest stab at it. 
   
   > Main concern is still related to ordinals to become long as far as I can 
see :)
   
   Indeed, I just don't see how Lucene can actually support multi-value vectors 
without switching to long ordinals for the vectors. Otherwise, we enforce some 
limitation on the number of vectors per segment, or some limitation on the 
number of vectors per doc (e.g. every doc can only have 256/65535 vectors).
   
   Making HNSW indexing & merging ~2x (given other constants, it might not be 
exactly 2x, maybe a little less) more expensive in heap usage is a pretty steep 
cost, especially for something I am not sure how many folks will actually use.
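   
   A back-of-envelope for that heap concern (numbers illustrative; the real 
constants depend on the graph layout):
   
   ```java
   long vectors = 10_000_000L, maxConn = 32;
   long intOrds  = vectors * maxConn * Integer.BYTES; // ~1.28 GB of neighbor ordinals
   long longOrds = vectors * maxConn * Long.BYTES;    // ~2.56 GB -- roughly 2x
   ```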


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Speedup merging of HNSW graphs (#14331) [lucene]

2025-04-05 Thread via GitHub


mayya-sharipova opened a new pull request, #14380:
URL: https://github.com/apache/lucene/pull/14380

   Backport for #14331
   
   Currently, when merging HNSW graphs incrementally, we first initialize a 
graph from the biggest segment; for the other segments, we rebuild the graphs 
completely by going through a segment's vector values one by one, searching for 
each in the new graph to find the best neighbours to connect it with.
   
   This PR proposes more efficient merging, based on the idea that if we know 
where we want to insert a node, we have a good idea of where we want to insert 
its neighbours. As in the current approach, we initialize a new graph from the 
biggest segment. For each other segment, we find a smaller set of nodes that 
"covers" its graph and insert that set as usual. For the remaining nodes, 
outside the join set, we do lighter searches with pre-calculated entry points 
(`eps`); a toy illustration of how `eps` is formed follows the steps below.
   
   This allows substantial speedups in merging (up to 2x in force-merge).
   
   The algorithm is based on the following steps:
   
   1. Get all graphs that don't have deletions and sort them by size 
(descending).
   2. Copy the largest graph to the new graph (`gL`).
   3. For each remaining small graph (`gS`):
  - Find the nodes that best cover `gS` (join set `j`). These nodes will be 
inserted into `gL` as usual: by searching `gL` to find the best candidates 
(`w`) to which to connect each node.
  - For each remaining node in `gS`, do "lighter" searches:
- We provide `eps` to search in `gL`. We form `eps` from the union of the 
node's neighbors in `gS` and those neighbors' neighbors in `gL`. We also 
limit `beamWidth` (`efConstruction`) to `M * 3`.
   
   Algorithm designed by Thomas Veasey
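   
   A toy, runnable illustration of how the seed entry points (`eps`) for a 
"lighter" insert are formed; plain adjacency maps stand in for the graphs, and 
the node ids are arbitrary:
   
   ```java
   import java.util.*;
   
   public class EpsUnionDemo {
     public static void main(String[] args) {
       // gS: the small graph being merged in; gL: the growing merged graph.
       Map<Integer, List<Integer>> gS = Map.of(7, List.of(3, 5));
       Map<Integer, List<Integer>> gL = Map.of(3, List.of(9, 11), 5, List.of(11, 13));
       // eps for node 7 = its neighbors in gS, plus those neighbors' neighbors in gL.
       Set<Integer> eps = new LinkedHashSet<>();
       for (int n : gS.get(7)) {
         eps.add(n);
         eps.addAll(gL.getOrDefault(n, List.of()));
       }
       System.out.println(eps); // [3, 9, 11, 5, 13]
     }
   }
   ```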
   
   ### Description
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]

2025-04-05 Thread via GitHub


benwtrent commented on code in PR #14419:
URL: https://github.com/apache/lucene/pull/14419#discussion_r2018509188


##
lucene/core/src/test/org/apache/lucene/internal/vectorization/TestVectorUtilSupport.java:
##
@@ -210,9 +210,13 @@ public void testMinMaxScalarQuantize() {
   }
 
+  private void assertFloatReturningProviders(ToDoubleFunction<VectorUtilSupport> func) {
-assertThat(
-func.applyAsDouble(PANAMA_PROVIDER.getVectorUtilSupport()),
-closeTo(func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()), 
delta));
+double luceneValue = 
func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport());

Review Comment:
   Using `assertEquals` is fine with the delta. I don't know of any special 
reason to use `closeTo` here. @thecoop what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]

2025-04-05 Thread via GitHub


jainankitk commented on code in PR #14435:
URL: https://github.com/apache/lucene/pull/14435#discussion_r2029926829


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String 
value) {
   public static Query newSetQuery(String field, Collection values) {
 Objects.requireNonNull(field, "field must not be null");
 Objects.requireNonNull(values, "values must not be null");
-Query indexQuery = new TermInSetQuery(field, values);
-Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, 
field, values);
-return new IndexOrDocValuesQuery(indexQuery, dvQuery);
+return TermInSetQuery.newIndexOrDocValuesQuery(
+MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values);

Review Comment:
   Thanks for making the change. I know it's minor, but important for keeping 
things clean!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-04-05 Thread via GitHub


uschindler commented on code in PR #14384:
URL: https://github.com/apache/lucene/pull/14384#discussion_r2008497905


##
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##
@@ -759,23 +759,14 @@ private Automaton toAutomaton(
* @return the original codepoint and the set of alternates
*/
   private int[] toCaseInsensitiveChar(int codepoint) {
-int[] altCodepoints = CaseFolding.lookupAlternates(codepoint);
-if (altCodepoints != null) {
-  int[] concat = new int[altCodepoints.length + 1];
-  System.arraycopy(altCodepoints, 0, concat, 0, altCodepoints.length);
-  concat[altCodepoints.length] = codepoint;
-  return concat;
-} else {
-  int altCase =
-  Character.isLowerCase(codepoint)
-  ? Character.toUpperCase(codepoint)
-  : Character.toLowerCase(codepoint);
-  if (altCase != codepoint) {
-return new int[] {altCase, codepoint};
-  } else {
-return new int[] {codepoint};
-  }
-}
+List<Integer> list = new ArrayList<>();
+CaseFolding.expand(
+codepoint,
+(int variant) -> {

Review Comment:
   Wouldn't `list::add` as a method reference have worked?
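   
   It should, assuming `CaseFolding.expand` takes an `IntConsumer` (my reading 
of the diff); boxing makes the method reference fit:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.function.IntConsumer;
   
   List<Integer> list = new ArrayList<>();
   IntConsumer asLambda = (int variant) -> list.add(variant);
   IntConsumer asRef = list::add; // equivalent: each accepted int is boxed and appended
   ```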



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-05 Thread via GitHub


mikemccand commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2005470873


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:
##
@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class Trie {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+UNSAVED,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+private final int label;
+private final LinkedList<Node> children;
+private Output output;
+private long fp = -1;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.UNSAVED;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  Trie(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  void putAll(Trie trie) {
+if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+trie.status = Status.DESTROYED;
+putAll(this.root, trie.root);
+  }
+
+  private static void putAll(Node n, Node add) {
+assert n.label == add.label;
+if (add.output != null) {
+  n.output = add.output;
+}
+ListIterator<Node> iter = n.children.listIterator();
+// TODO we can be more efficient if there is no intersection; block tree always does that
+outer:
+for (Node addChild : add.children) {
+  while (iter.hasNext()) {
+Node nChild = iter.next();
+if (nChild.label == addChild.label) {
+  putAll(nChild, addChild);
+  continue outer;
+}
+if (nChild.label > addChild.label) {
+  iter.previous(); // move back
+  iter.add(addChild);
+  continue outer;
+}
+  }
+  iter.add(addChild);
+}
+  }
+
+  Output getEmptyOutput() {
+return root.output;
+  }
+
+  void forEach(BiConsumer<BytesRef, Output> consumer) {
+if (root.output != null) {
+  consumer.accept(new BytesRef(), root.output);
+}
+intersect(root.children, new BytesRefBuilder(), consumer);
+  }
+
+  private void intersect(
+  List<Node> nodes, BytesRefBuilder key, BiConsumer<BytesRef, Output> consumer) {
+for (Node node : nodes) {
+  key.append((byte) node.label);
+  if (node.output != null) consumer.accept(key.toBytesRef(), node.output);
+  intersect(node.children, key, consumer);
+  key.setLength(key.length() - 1);
+}
+  }
+
+  void save(DataOutput meta, IndexOutput index) throws IOException {
+if (status != Status.UNSAVED) {
+  throw new IllegalStateException("only unsaved trie can be saved");
+}
+status = Status.SAVED;
+meta.writeVLong(index.getFilePointer());
+saveNodes(index);
+meta.writeVLong(root.fp);
+index.writeLong(0L); // additional 8 bytes for over-reading
+meta.writeVLong(index.getFilePointer());
+  }
+
+  void saveNodes(IndexOutput index) throws IOException {
+final long startFP = index.getFilePointer();
+Deque<Node> stack = new ArrayDeque<>();
+sta

Re: [I] IndexReader#leaves method is slightly confusing [lucene]

2025-04-05 Thread via GitHub


jpountz commented on issue #14367:
URL: https://github.com/apache/lucene/issues/14367#issuecomment-2748919960

   Hmm, maybe I closed this a bit too quickly, as the issue only pointed out 
confusion with `IndexReader#leaves`; it did not suggest a particular approach.
   
   That said, I'm aligned with the last paragraph: "Really minor at this point, 
and probably not worth going through the pain of deprecating IndexReader#leaves 
and changing at few hundred places", it's not too confusing to me so I'm not 
sure it actually warrants a change. I'll leave it closed for now but happy to 
reopen if there is traction for improving this API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Adding TestSpanWithinQuery with basic test cases for SpanWithinQuery [lucene]

2025-04-05 Thread via GitHub


slow-J opened a new pull request, #14405:
URL: https://github.com/apache/lucene/pull/14405

   TEST: ./gradlew check
   
   ### Description
   
   I was looking at an old issue, https://github.com/apache/lucene/issues/7145, 
which talks about unit tests for SpanWithinQuery. I noticed that there was no 
class with basic unit tests for SpanWithinQuery, while we have one for many 
other SpanQueries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Use @snippet javadoc tag for snippets [lucene]

2025-04-05 Thread via GitHub


rmuir commented on issue #14257:
URL: https://github.com/apache/lucene/issues/14257#issuecomment-2754255056

   @dweiss I also wonder, with an "autoformat" workflow, if we even care so 
much.
   
   I don't understand what is so sacrosanct about Google's format: to me it is 
ugly. The snippet tag is from Java 18 (six releases back) and Google doesn't 
care; they are a big corporation and probably the type to keep code on e.g. 
Java 8. I don't think we should weigh their opinions very heavily on anything.
   
   All autoformatters lead to ugliness at times; it is just the tradeoff you 
make to avoid hassles while still reaping the benefits of avoiding formatting 
bikesheds, noise in PRs, etc.
   
   I just think we should autoformat the code in a consistent way and call it a 
day.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-04-05 Thread via GitHub


vigyasharma commented on PR #14417:
URL: https://github.com/apache/lucene/pull/14417#issuecomment-2764418906

   I think we can add a couple more tests to make it robust.
   1. Some tests around concurrency – index with multiple threads, then 
advance the counter in one of the threads, and validate behavior. You can look 
at `ThreadedIndexingAndSearchingTestCase` and its derived tests for motivation.
   2. A test for the crash-recovery scenario, which I suppose is the primary 
use case. We could make the writer index a bunch of docs, then kill it, start a 
new writer on the same index, and advance its counter. A sketch follows.
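   
   A hedged sketch of that crash-recovery test; `advanceSegmentInfosCounter` is 
a stand-in name for whatever setter this PR exposes, and `newDirectory()` / 
`newIndexWriterConfig()` are the usual `LuceneTestCase` helpers:
   
   ```java
   try (Directory dir = newDirectory()) {
     IndexWriter w1 = new IndexWriter(dir, newIndexWriterConfig());
     for (int i = 0; i < 100; i++) w1.addDocument(new Document());
     w1.commit();
     w1.rollback(); // simulate a crash: discard state without a graceful close
   
     try (IndexWriter w2 = new IndexWriter(dir, newIndexWriterConfig())) {
       w2.advanceSegmentInfosCounter(1_000); // hypothetical API under test
       w2.addDocument(new Document());
       w2.commit();
     }
   }
   ```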


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-04-05 Thread via GitHub


gf2121 commented on code in PR #14401:
URL: https://github.com/apache/lucene/pull/14401#discussion_r2019735302


##
lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingLeafCollector.java:
##
@@ -50,6 +50,14 @@ public void collect(DocIdStream stream) throws IOException {
 in.collect(new AssertingDocIdStream(stream));
   }
 
+  @Override
+  public void collectRange(int min, int max) throws IOException {
+assert min > lastCollected;
+assert max > min;

Review Comment:
   Maybe assert `min >= this.min` and `max <= this.max` as well :)
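   
   If I read that right, the override would become something like this 
(assuming the asserting collector tracks `this.min`/`this.max` window bounds 
and `lastCollected`, and that the range is half-open; both are assumptions on 
my part):
   
   ```java
   @Override
   public void collectRange(int min, int max) throws IOException {
     assert min > lastCollected;
     assert max > min;
     // gf2121's addition: the range must stay inside this collector's window
     assert min >= this.min;
     assert max <= this.max;
     in.collectRange(min, max);
     lastCollected = max - 1; // last doc id in the half-open range [min, max)
   }
   ```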



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Adds github action to verify changelog entry and set milestone to PRs [lucene]

2025-04-05 Thread via GitHub


stefanvodita commented on PR #14279:
URL: https://github.com/apache/lucene/pull/14279#issuecomment-2743574250

   Thanks for pointing that out @javanna! Funny how that happened on a PR 
that's specifically about the changelog. We should only push this to main. I'll 
actually delete the entry for now since we're still iterating on this workflow 
to make it work. See #13898.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-05 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2006885578


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieReader.java:
##
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.RandomAccessInput;
+
+class TrieReader {
+
+  private static final long NO_OUTPUT = -1;
+  private static final long NO_FLOOR_DATA = -1;
+  private static final long[] BYTES_MINUS_1_MASK =
+  new long[] {
+0xFFL,
+0xFFFFL,
+0xFFFFFFL,
+0xFFFFFFFFL,
+0xFFFFFFFFFFL,
+0xFFFFFFFFFFFFL,
+0xFFFFFFFFFFFFFFL,
+0xFFFFFFFFFFFFFFFFL
+  };
+
+  static class Node {

Review Comment:
   Yeah, `TrieBuilder.Node` and `TrieReader.Node`. I think the class prefix has 
made it clear :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValue… [lucene]

2025-04-05 Thread via GitHub


mkhludnev opened a new pull request, #14442:
URL: https://github.com/apache/lucene/pull/14442

   …sQuery (#14435)
   
   * KeywordField.newSetQuery() reuses prefixed terms.
   
   fix #14425
   
   ### Description
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-05 Thread via GitHub


mikemccand commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2022727361


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix point to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block 
consists entirely of
+   * pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large 
block of terms sharing
+   * a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+BUILDING,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+
+// The utf8 digit that leads to this Node, 0 for root node
+private final int label;
+// The children listed in order by their utf8 label
+private final LinkedList<Node> children;
+// The output of this node.
+private Output output;
+
+// Vars used during saving:
+
+// The file pointer pointing to where the node is saved. -1 means the node has not been saved.
+private long fp = -1;
+// The iterator whose next() points to the first child that has not been saved.
+private Iterator<Node> childrenIterator;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie 
builder should not
+   * have a key that already exists in this one; otherwise an {@link 
IllegalArgumentException} will be
+   * thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+// Use a simple stack to avoid recursion.
+Deque<Runnable> stack = new ArrayDeque<>();
+stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+while (!stack.isEmpty()) {
+  stack.pop().run();
+}
+trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque<Runnable> stack) {
+assert n.label == add.label;
+if (add.output != null) {
+  if (n.output != null) {
+ 

Re: [I] Reuse packedTerms between two TermInSetQuery which are combined by IndexOrDocValuesQuery [lucene]

2025-04-05 Thread via GitHub


mkhludnev closed issue #14425: Reuse packedTerms between two TermInSetQuery 
which are combined by IndexOrDocValuesQuery
URL: https://github.com/apache/lucene/issues/14425


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Reuse packedTerms between two TermInSetQuery which are combined by IndexOrDocValuesQuery [lucene]

2025-04-05 Thread via GitHub


mkhludnev closed issue #14425: Reuse packedTerms between two TermInSetQuery 
which are combined by IndexOrDocValuesQuery
URL: https://github.com/apache/lucene/issues/14425


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]

2025-04-05 Thread via GitHub


mkhludnev merged PR #14435:
URL: https://github.com/apache/lucene/pull/14435


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]

2025-04-05 Thread via GitHub


vigyasharma commented on PR #14325:
URL: https://github.com/apache/lucene/pull/14325#issuecomment-2781125749

   This PR changes the existing `KeepLastCommitDeletionPolicy`, which is not 
what we want. I've created a new beginner issue, #1, that specifies the 
requirements for this task.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]

2025-04-05 Thread via GitHub


vigyasharma closed pull request #14325: Optimize commit retention policy to 
maintain only the last 5 commits
URL: https://github.com/apache/lucene/pull/14325


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Revert "Add UnwrappingReuseStrategy for AnalyzerWrapper (#14154)" [lucene]

2025-04-05 Thread via GitHub


mayya-sharipova merged PR #14437:
URL: https://github.com/apache/lucene/pull/14437


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-04-05 Thread via GitHub


rmuir commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2000536590


##
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java:
##
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {
 
 return alts;
   }
+
+  /**
+   * Folds the case of the given character according to {@link 
Character#toLowerCase(int)}, but with
+   * exceptions if the turkic flag is set.
+   *
+   * @param codepoint to code point for the character to fold
+   * @param turkic if true, then apply tr/az folding rules
+   * @return the folded character
+   */
+  static int foldCase(int codepoint, boolean turkic) {
+if (turkic) {
+  if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE]
+return 0x00069; // i [LATIN SMALL LETTER I]
+  } else if (codepoint == 0x49) { //  I [LATIN CAPITAL LETTER I]
+return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
+  }
+}
+return Character.toLowerCase(codepoint);

Review Comment:
   For real case folding we have to do more than this. It is a simple 1-1 
mapping, but e.g. `Σ`, `σ`, and `ς` will all fold to `σ`, whereas toLowerCase(ς) 
= ς, because it is already lower-case, just in final form. This is just one 
example. To see more, compare your function against ICU's 
UCharacter.foldCase(int, boolean) across all of Unicode.
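   
   A hedged sketch of that comparison, assuming icu4j on the classpath; it 
compares ICU's default folding against the plain toLowerCase mapping the patch 
currently uses (the patch's package-private CaseFolding is not reachable from 
outside, so Character.toLowerCase stands in for it):
   
   ```java
   import com.ibm.icu.lang.UCharacter;
   
   public class FoldCaseAudit {
     public static void main(String[] args) {
       for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
         int icu = UCharacter.foldCase(cp, true); // default (non-Turkic) folding
         int simple = Character.toLowerCase(cp);  // what the current patch does
         if (icu != simple) {
           System.out.printf("U+%04X: icu=U+%04X toLowerCase=U+%04X%n", cp, icu, simple);
         }
       }
     }
   }
   ```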



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-05 Thread via GitHub


mikemccand commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2022767256


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,632 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Arrays;
+import java.util.Deque;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/**
+ * A builder to build prefix tree (trie) as the index of block tree, and can 
be saved to disk.
+ *
+ * TODO make it a more memory efficient structure
+ */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix point to.
+   *
+   * @param fp the file pointer to the on-disk terms block which a trie node 
points to.
+   * @param hasTerms false if this on-disk block consists entirely of pointers 
to child blocks.
+   * @param floorData will be non-null when a large block of terms sharing a 
single trie prefix is
+   * split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+BUILDING,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+
+// The utf8 digit that leads to this Node, 0 for root node
+private final int label;
+// The output of this node.
+private Output output;
+// The number of children of this node.
+private int childrenNum;
+// Pointers to related nodes
+private Node next;
+private Node firstChild;
+private Node lastChild;
+
+// Vars used during saving:
+
+// The file pointer pointing to where the node is saved. -1 means the node has not been saved.
+private long fp = -1;
+// The latest child that has been saved. null means no child has been saved.
+private Node savedTo;
+
+Node(int label, Output output) {
+  this.label = label;
+  this.output = output;
+}
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null);
+  private final BytesRef minKey;
+  private BytesRef maxKey;
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+minKey = maxKey = BytesRef.deepCopyOf(k);
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output);
+  parent.firstChild = parent.lastChild = node;
+  parent.childrenNum = 1;
+  parent = node;
+}
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie 
builder needs to
+   * ensure its keys are greater than or equal to the max key of this one.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {

Review Comment:
   Maybe rename to `append`? The two tries are strictly orthogonal, and the 
incoming trie is > `this` one?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,632 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, V

Re: [I] Reuse packedTerms between two TermInSetQuery which are combined by IndexOrDocValuesQuery [lucene]

2025-04-05 Thread via GitHub


mkhludnev commented on issue #14425:
URL: https://github.com/apache/lucene/issues/14425#issuecomment-2781083660

   To be released in 10.3


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValue… [lucene]

2025-04-05 Thread via GitHub


mkhludnev merged PR #14442:
URL: https://github.com/apache/lucene/pull/14442


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Support incremental refresh in Searcher Managers. [lucene]

2025-04-05 Thread via GitHub


vigyasharma opened a new pull request, #14443:
URL: https://github.com/apache/lucene/pull/14443

   In segment-based replication systems, a large replication payload 
(checkpoint) can induce heavy page faults, cause thrashing for in-flight search 
requests, and affect overall search performance. 
   
   A potential way to handle these bursts is to leverage multiple commit 
points in the Lucene index. Instead of refreshing to the latest commit for a 
large replication payload, searchers can intelligently select a commit point 
that they can safely absorb. By stepping through multiple such points, 
searchers eventually get to the latest commit without incurring too many 
page faults.
   
   This change lets users define a commit selection strategy, controlling which 
commit the searcher manager refreshes on. Addresses #14219 
   
   
   **Usage:**
   To incrementally refresh through multiple commit points until the searcher is 
current with its directory (a minimal sketch follows the list):
   - Define a commit selection strategy using the `RefreshCommitSupplier` 
interface.
   - Update searcher managers with this strategy via 
`setRefreshCommitSupplier()`.
   - Invoke `maybeRefresh()` or `maybeRefreshBlocking()` in a loop until 
`isSearcherCurrent()` returns true.
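   
   A minimal sketch of that loop, assuming the `RefreshCommitSupplier` / 
`setRefreshCommitSupplier()` / `isSearcherCurrent()` names from this PR and a 
user-supplied commit-selection policy:
   
   ```java
   SearcherManager manager = new SearcherManager(directory, new SearcherFactory());
   manager.setRefreshCommitSupplier(myCommitSelectionStrategy); // a RefreshCommitSupplier
   do {
     // Steps to the next commit the strategy selects, not necessarily the latest.
     manager.maybeRefreshBlocking();
   } while (manager.isSearcherCurrent() == false);
   ```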
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org