Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]

2025-03-20 Thread via GitHub


gf2121 commented on PR #14371:
URL: https://github.com/apache/lucene/pull/14371#issuecomment-2739828325

   Thanks @vsop-479, have you been able to measure the performance of your 
patch?
   
   I had a similar idea recently. If you look at the newest code in 
`Lucene101PostingsReader`, you may find we are using `VectorMask` to speed 
this up; that was what I had in mind - get a `MemorySegment` slice if it is 
not null, and play with it using a `VectorMask`.
   
   
https://github.com/apache/lucene/blob/b75602428f78bd79296eb3b6be27465961b63d09/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java#L781-L787
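
   For readers following along, here is a minimal sketch of that idea, assuming 
a plain int[] block rather than a `MemorySegment`: compare a whole lane of 
block values against the advance target at once and jump straight to the first 
hit. The class and method names are illustrative, not the code behind the link 
above (requires `--add-modules jdk.incubator.vector`).
   
   ```java
   import jdk.incubator.vector.IntVector;
   import jdk.incubator.vector.VectorMask;
   import jdk.incubator.vector.VectorOperators;
   import jdk.incubator.vector.VectorSpecies;
   
   final class FindNextGEQSketch {
     private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
   
     /** Returns the index of the first value >= target in buffer[from..to), or to if none. */
     static int findNextGEQ(int[] buffer, int target, int from, int to) {
       int i = from;
       int upper = from + SPECIES.loopBound(to - from);
       for (; i < upper; i += SPECIES.length()) {
         IntVector vec = IntVector.fromArray(SPECIES, buffer, i);
         VectorMask<Integer> mask = vec.compare(VectorOperators.GE, target);
         if (mask.anyTrue()) {
           return i + mask.firstTrue(); // first lane where buffer[i + lane] >= target
         }
       }
       for (; i < to; i++) { // scalar tail for the last few values
         if (buffer[i] >= target) {
           return i;
         }
       }
       return to;
     }
   }
   ```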





Re: [I] Handling concurrent search in QueryProfiler [lucene]

2025-03-20 Thread via GitHub


jainankitk commented on issue #14375:
URL: https://github.com/apache/lucene/issues/14375#issuecomment-2739471091

   > Maybe it tries to do too much by providing min/avg/max aggregates and it 
should just provide per-slice breakdowns, leaving whether and how to compile 
aggregates to the application?
   
   I agree with you, especially when the number of slices is < 5. In those 
cases, we are outputting almost the same number of metrics across all the 
slices, without this lossy reduction.
   
   > Out of curiosity, how do you know if two calls to advance() come from the 
same slice or different slices?
   
   The timers are built for each leaf, and we aggregate these timers while 
building the profiler output for each slice using `Map<Collector, 
List<LeafReaderContext>> sliceCollectorsToLeaves`. This 
`sliceCollectorsToLeaves` is incrementally built as part of the `searchLeaf` 
method. This relies on the fact that we need this slice->leaf mapping only 
when we build the final response, by which point `searchLeaf` has been invoked 
for all the leaves, and we know which leaves share a collector and are thus 
part of the same slice.
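
   To make the mechanics above concrete, here is a hypothetical sketch of that 
bookkeeping; the field and method names mirror the description but are 
illustrative, not Lucene's actual profiler API:
   
   ```java
   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import org.apache.lucene.index.LeafReaderContext;
   import org.apache.lucene.search.Collector;
   
   final class SliceProfileAggregator {
     // One collector per slice; each collector sees one or more leaves.
     private final Map<Collector, List<LeafReaderContext>> sliceCollectorsToLeaves =
         new ConcurrentHashMap<>();
   
     // Called from searchLeaf(...): incrementally record which leaves a slice saw.
     void onSearchLeaf(Collector sliceCollector, LeafReaderContext leaf) {
       sliceCollectorsToLeaves
           .computeIfAbsent(sliceCollector, c -> Collections.synchronizedList(new ArrayList<>()))
           .add(leaf);
     }
   
     // Called only at response-build time, after searchLeaf has run for every leaf,
     // so the slice->leaf mapping is complete.
     Map<Collector, Long> perSliceNanos(Map<LeafReaderContext, Long> perLeafNanos) {
       Map<Collector, Long> result = new HashMap<>();
       sliceCollectorsToLeaves.forEach(
           (collector, leaves) ->
               result.put(
                   collector,
                   leaves.stream().mapToLong(l -> perLeafNanos.getOrDefault(l, 0L)).sum()));
       return result;
     }
   }
   ```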





Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-20 Thread via GitHub


mayya-sharipova commented on code in PR #14331:
URL: https://github.com/apache/lucene/pull/14331#discussion_r2005416489


##
lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java:
##
@@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues 
mergedVectorValues, int maxO
 OnHeapHnswGraph graph;
 BitSet initializedNodes = null;
 
-if (initReader == null) {
+if (graphReaders.size() == 0) {
   graph = new OnHeapHnswGraph(M, maxOrd);
 } else {
+  
graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed());
+  GraphReader initGraphReader = graphReaders.get(0);
+  KnnVectorsReader initReader = initGraphReader.reader();
+  MergeState.DocMap initDocMap = initGraphReader.initDocMap();
+  int initGraphSize = initGraphReader.graphSize();
   HnswGraph initializerGraph = ((HnswGraphProvider) 
initReader).getGraph(fieldInfo.name);
+
   if (initializerGraph.size() == 0) {
 graph = new OnHeapHnswGraph(M, maxOrd);
   } else {
 initializedNodes = new FixedBitSet(maxOrd);
-int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, 
initializedNodes);
+int[] oldToNewOrdinalMap =
+getNewOrdMapping(
+fieldInfo,
+initReader,
+initDocMap,
+initGraphSize,
+mergedVectorValues,
+initializedNodes);
 graph = InitializedHnswGraphBuilder.initGraph(initializerGraph, 
oldToNewOrdinalMap, maxOrd);
   }
 }
 return new HnswConcurrentMergeBuilder(
 taskExecutor, numWorker, scorerSupplier, beamWidth, graph, 
initializedNodes);
   }
+
+  /**
+   * Creates a new mapping from old ordinals to new ordinals and returns the 
total number of vectors
+   * in the newly merged segment.
+   *
+   * @param mergedVectorValues vector values in the merged segment
+   * @param initializedNodes track what nodes have been initialized
+   * @return the mapping from old ordinals to new ordinals
+   * @throws IOException If an error occurs while reading from the merge state
+   */
+  private static final int[] getNewOrdMapping(
+  FieldInfo fieldInfo,
+  KnnVectorsReader initReader,
+  MergeState.DocMap initDocMap,
+  int initGraphSize,
+  KnnVectorValues mergedVectorValues,
+  BitSet initializedNodes)
+  throws IOException {
+KnnVectorValues.DocIndexIterator initializerIterator = null;
+
+switch (fieldInfo.getVectorEncoding()) {
+  case BYTE -> initializerIterator = 
initReader.getByteVectorValues(fieldInfo.name).iterator();
+  case FLOAT32 ->
+  initializerIterator = 
initReader.getFloatVectorValues(fieldInfo.name).iterator();
+}
+
+IntIntHashMap newIdToOldOrdinal = new IntIntHashMap(initGraphSize);
+int maxNewDocID = -1;
+for (int docId = initializerIterator.nextDoc();
+docId != NO_MORE_DOCS;
+docId = initializerIterator.nextDoc()) {
+  int newId = initDocMap.get(docId);
+  maxNewDocID = Math.max(newId, maxNewDocID);
+  newIdToOldOrdinal.put(newId, initializerIterator.index());
+}
+
+if (maxNewDocID == -1) {
+  return new int[0];
+}
+final int[] oldToNewOrdinalMap = new int[initGraphSize];
+KnnVectorValues.DocIndexIterator mergedVectorIterator = 
mergedVectorValues.iterator();
+for (int newDocId = mergedVectorIterator.nextDoc();
+newDocId <= maxNewDocID;
+newDocId = mergedVectorIterator.nextDoc()) {
+  int hashDocIndex = newIdToOldOrdinal.indexOf(newDocId);
+  if (newIdToOldOrdinal.indexExists(hashDocIndex)) {

Review Comment:
   @benwtrent Thanks for the comment. This piece was moved from 
[IncrementalHNSWGraphMerger](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/IncrementalHnswGraphMerger.java#L154)
 without modifications.
   
   > Is this stuff around indexOf indexExists, etc. just performance 
improvements over a simple newIdToOldOrdinal.get(...)
   
   Agree about `IntIntHashMap`, and maybe this piece can be optimized. But as 
it is not part of this PR, I will leave a TODO item for that as a follow-up.
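
   For context, the difference being discussed is roughly the following; this 
is a sketch against the HPPC-style `IntIntHashMap` API (Lucene ships an 
internal fork under `org.apache.lucene.internal.hppc`), not code from this PR:
   
   ```java
   import org.apache.lucene.internal.hppc.IntIntHashMap;
   
   final class OrdinalLookupSketch {
     // Resolve the hash slot once, then test and read it: one probe total.
     static int lookupSingleProbe(IntIntHashMap newIdToOldOrdinal, int newDocId) {
       int slot = newIdToOldOrdinal.indexOf(newDocId);
       return newIdToOldOrdinal.indexExists(slot) ? newIdToOldOrdinal.indexGet(slot) : -1;
     }
   
     // A plain get() is ambiguous here because it returns 0 for missing keys and 0
     // is a valid ordinal, so the simple form needs containsKey() + get(): two probes.
     static int lookupTwoProbes(IntIntHashMap newIdToOldOrdinal, int newDocId) {
       return newIdToOldOrdinal.containsKey(newDocId) ? newIdToOldOrdinal.get(newDocId) : -1;
     }
   }
   ```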






Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2005519750


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:
##
@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class Trie {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+UNSAVED,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+private final int label;
+private final LinkedList<Node> children;
+private Output output;
+private long fp = -1;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.UNSAVED;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  Trie(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  void putAll(Trie trie) {
+if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+trie.status = Status.DESTROYED;
+putAll(this.root, trie.root);
+  }
+
+  private static void putAll(Node n, Node add) {
+assert n.label == add.label;
+if (add.output != null) {
+  n.output = add.output;
+}
+ListIterator<Node> iter = n.children.listIterator();
+// TODO we can be more efficient if there is no intersection; block tree 
always does that
+outer:
+for (Node addChild : add.children) {
+  while (iter.hasNext()) {
+Node nChild = iter.next();
+if (nChild.label == addChild.label) {
+  putAll(nChild, addChild);
+  continue outer;
+}
+if (nChild.label > addChild.label) {
+  iter.previous(); // move back
+  iter.add(addChild);
+  continue outer;
+}
+  }
+  iter.add(addChild);
+}
+  }
+
+  Output getEmptyOutput() {
+return root.output;
+  }
+
+  void forEach(BiConsumer<BytesRef, Output> consumer) {
+if (root.output != null) {
+  consumer.accept(new BytesRef(), root.output);
+}
+intersect(root.children, new BytesRefBuilder(), consumer);
+  }
+
+  private void intersect(
+  List<Node> nodes, BytesRefBuilder key, BiConsumer<BytesRef, Output> 
consumer) {
+for (Node node : nodes) {
+  key.append((byte) node.label);
+  if (node.output != null) consumer.accept(key.toBytesRef(), node.output);
+  intersect(node.children, key, consumer);
+  key.setLength(key.length() - 1);
+}
+  }
+
+  void save(DataOutput meta, IndexOutput index) throws IOException {
+if (status != Status.UNSAVED) {
+  throw new IllegalStateException("only unsaved trie can be saved");
+}
+status = Status.SAVED;
+meta.writeVLong(index.getFilePointer());
+saveNodes(index);
+meta.writeVLong(root.fp);
+index.writeLong(0L); // additional 8 bytes for over-reading
+meta.writeVLong(index.getFilePointer());
+  }
+
+  void saveNodes(IndexOutput index) throws IOException {
+final long startFP = index.getFilePointer();
+Deque<Node> stack = new ArrayDeque<>();
+stack.p

Re: [PR] Adjust equivalent min similarity HNSW exploration logic [lucene]

2025-03-20 Thread via GitHub


mayya-sharipova commented on code in PR #14366:
URL: https://github.com/apache/lucene/pull/14366#discussion_r2005708592


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -266,11 +266,21 @@ void searchLevel(
 // A bound that holds the minimum similarity to the query vector that a 
candidate vector must
 // have to be considered.
 float minAcceptedSimilarity = 
Math.nextUp(results.minCompetitiveSimilarity());
+// We should allow exploring equivalent minAcceptedSimilarity values at 
least once
+boolean shouldExploreMinSim = true;
 while (candidates.size() > 0 && results.earlyTerminated() == false) {
   // get the best candidate (closest or best scoring)
   float topCandidateSimilarity = candidates.topScore();
   if (topCandidateSimilarity < minAcceptedSimilarity) {
-break;
+// if the similarity is equivalent to the minAcceptedSimilarity, we 
should explore one

Review Comment:
   nit: format the comment a little better (auto format seems to break it 
strangely)






Re: [PR] Adjust equivalent min similarity HNSW exploration logic [lucene]

2025-03-20 Thread via GitHub


mayya-sharipova commented on code in PR #14366:
URL: https://github.com/apache/lucene/pull/14366#discussion_r2005709557


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -266,11 +266,21 @@ void searchLevel(
 // A bound that holds the minimum similarity to the query vector that a 
candidate vector must
 // have to be considered.
 float minAcceptedSimilarity = 
Math.nextUp(results.minCompetitiveSimilarity());
+// We should allow exploring equivalent minAcceptedSimilarity values at 
least once
+boolean shouldExploreMinSim = true;
 while (candidates.size() > 0 && results.earlyTerminated() == false) {
   // get the best candidate (closest or best scoring)
   float topCandidateSimilarity = candidates.topScore();
   if (topCandidateSimilarity < minAcceptedSimilarity) {
-break;
+// if the similarity is equivalent to the minAcceptedSimilarity, we 
should explore one
+// candidate
+// however, running into many duplicates can be expensive, so we 
should stop exploring if
+// equivalent minimum scores are found
+if (shouldExploreMinSim && Math.nextUp(candidates.topScore()) == 
minAcceptedSimilarity) {

Review Comment:
   if (shouldExploreMinSim && Math.nextUp(candidates.topScore()) == 
minAcceptedSimilarity) => 
   if (shouldExploreMinSim && Math.nextUp(topCandidateSimilarity) == 
minAcceptedSimilarity)






Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-20 Thread via GitHub


mayya-sharipova merged PR #14331:
URL: https://github.com/apache/lucene/pull/14331





Re: [I] Case insensitive regex query with character range [lucene]

2025-03-20 Thread via GitHub


rmuir commented on issue #14378:
URL: https://github.com/apache/lucene/issues/14378#issuecomment-2740658343

   This isn't a bug; the regex parser just does not have this feature.
   
   We can add it, but it must be an additional opt-in flag due to performance 
tradeoffs involved.





[I] build support: java 24 [lucene]

2025-03-20 Thread via GitHub


rmuir opened a new issue, #14379:
URL: https://github.com/apache/lucene/issues/14379

   ### Description
   
   java 23 has disappeared and has been replaced with java 24. 
   
   the build currently requires 23 exactly, which creates a hurdle for users, 
since it is difficult to get: it does not exist in operating system package 
managers, must be dug out of a graveyard, etc.
   
   yes, i know the problem here is gradle, but that doesn't excuse their 
behavior. they shouldn't be parsing classfiles of the jvm anyway.





Re: [I] build support: java 24 [lucene]

2025-03-20 Thread via GitHub


dweiss commented on issue #14379:
URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740968727

   Really nice indeed! Sadly, I think it'll take just about a million years 
before it propagates through all the layers until it can hit gradle (but I'd 
love to be proven wrong). ;)





Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2006940286


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix points to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block 
consists entirely of
+   * pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large 
block of terms sharing
+   * a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+BUILDING,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+
+// The utf8 digit that leads to this Node, 0 for root node
+private final int label;
+// The children listed in order by their utf8 label
+private final LinkedList<Node> children;
+// The output of this node.
+private Output output;
+
+// Vars used during saving:
+
+// The file pointer pointing to where the node is saved. -1 means the node 
has not been saved.
+private long fp = -1;
+// The iterator whose next() points to the first child that has not been 
saved.
+private Iterator<Node> childrenIterator;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given 
trie builder should not
+   * have a key that already exists in this one; otherwise an {@link 
IllegalArgumentException} will be
+   * thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+// Use a simple stack to avoid recursion.
+Deque<Runnable> stack = new ArrayDeque<>();
+stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+while (!stack.isEmpty()) {
+  stack.pop().run();
+}
+trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque<Runnable> stack) {
+assert n.label == add.label;
+if (add.output != null) {
+  if (n.output != null) {
+ 

Re: [PR] Speedup merging of HNSW graphs (#14331) [lucene]

2025-03-20 Thread via GitHub


mayya-sharipova merged PR #14380:
URL: https://github.com/apache/lucene/pull/14380





Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2006901371


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix points to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block 
consists entirely of
+   * pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large 
block of terms sharing
+   * a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+BUILDING,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+
+// The utf8 digit that leads to this Node, 0 for root node
+private final int label;
+// The children listed in order by their utf8 label
+private final LinkedList<Node> children;
+// The output of this node.
+private Output output;
+
+// Vars used during saving:
+
+// The file pointer pointing to where the node is saved. -1 means the node 
has not been saved.
+private long fp = -1;
+// The iterator whose next() points to the first child that has not been 
saved.
+private Iterator<Node> childrenIterator;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given 
trie builder should not
+   * have a key that already exists in this one; otherwise an {@link 
IllegalArgumentException} will be
+   * thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+// Use a simple stack to avoid recursion.
+Deque<Runnable> stack = new ArrayDeque<>();
+stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+while (!stack.isEmpty()) {
+  stack.pop().run();
+}
+trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque<Runnable> stack) {
+assert n.label == add.label;
+if (add.output != null) {
+  if (n.output != null) {
+ 

Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2006904572


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnumFrame.java:
##
@@ -89,8 +89,6 @@ final class IntersectTermsEnumFrame {
 
   final ByteArrayDataInput bytesReader = new ByteArrayDataInput();
 
-  int outputNum;

Review Comment:
   This was previously used in IntersectTermsEnum.






Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-03-20 Thread via GitHub


dweiss commented on code in PR #14381:
URL: https://github.com/apache/lucene/pull/14381#discussion_r2006930601


##
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##
@@ -778,6 +786,53 @@ private int[] toCaseInsensitiveChar(int codepoint) {
 }
   }
 
+  /**
+   * Expands range to include case-insensitive matches.
+   *
+   * This is expensive: a case-insensitive range involves iterating over the 
range space, adding
+   * alternatives. We jump on the grenade here, containing the CPU and memory 
explosion within this
+   * method, which is activated by an optional flag.
+   */
+  private void expandCaseInsensitiveRange(
+  int start, int end, List<Integer> rangeStarts, List<Integer> rangeEnds) {
+if (start > end)
+  throw new IllegalArgumentException(
+  "invalid range: from (" + start + ") cannot be > to (" + end + ")");
+
+// contain the explosion of transitions by using a throwaway state
+Automaton scratch = new Automaton();
+int state = scratch.createState();
+
+// iterate over range, adding codepoint and any alternatives as transitions
+for (int i = start; i <= end; i++) {
+  scratch.addTransition(state, state, i);
+  int[] altCodePoints = CaseFolding.lookupAlternates(i);
+  if (altCodePoints != null) {
+for (int alt : altCodePoints) {
+  scratch.addTransition(state, state, alt);
+}
+  } else {
+int altCase =
+Character.isLowerCase(i) ? Character.toUpperCase(i) : 
Character.toLowerCase(i);
+if (altCase != i) {
+  scratch.addTransition(state, state, altCase);
+}
+  }
+}

Review Comment:
   I think this pattern is used in more than one place. Maybe it'd be nicer to 
make CaseFolding.forAllAlternatives(altCase -> scratch.addTransition(state, 
state, altCase)) and have all that logic for checking for alternatives embedded 
in CaseFolding?
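
   For illustration, the suggested helper could look roughly like this; 
`CaseFolding.lookupAlternates` is the method from this PR, while the helper 
name and shape are assumptions:
   
   ```java
   import java.util.function.IntConsumer;
   
   final class CaseFoldingSketch {
     /** Feeds every case-insensitive alternative of codePoint to the consumer. */
     static void forAllAlternatives(int codePoint, IntConsumer consumer) {
       int[] altCodePoints = CaseFolding.lookupAlternates(codePoint);
       if (altCodePoints != null) {
         for (int alt : altCodePoints) {
           consumer.accept(alt);
         }
       } else {
         int altCase =
             Character.isLowerCase(codePoint)
                 ? Character.toUpperCase(codePoint)
                 : Character.toLowerCase(codePoint);
         if (altCase != codePoint) {
           consumer.accept(altCase);
         }
       }
     }
   }
   ```
   
   The range-expansion loop above would then collapse to a single call per 
codepoint: `CaseFolding.forAllAlternatives(i, alt -> 
scratch.addTransition(state, state, alt))`.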






Re: [I] NRT replication should make it possible/easy to use bite-sized commits [lucene]

2025-03-20 Thread via GitHub


vigyasharma commented on issue #14219:
URL: https://github.com/apache/lucene/issues/14219#issuecomment-2742476154

   > The searchers can then carefully pick and choose which commit points they 
want to switch too, in a bite sized / stepping stone manner
   
   The key here is making searchers refresh on a point-in-time commit instead 
of the latest one. 
   
   Searcher refresh today, for classes like `SearcherManager`, 
`SearcherTaxonomyManager`, or `ReaderManager`, is implemented in 
`refreshIfNeeded()`, which in turn calls 
`DirectoryReader.openIfChanged(oldReader)`. This opens a new reader on the 
latest commit available in the directory. For bite-sized commits, we would want 
to use the `DirectoryReader.openIfChanged(oldReader, commit)` method and open 
readers on the specific commit we select.
   
   `refreshIfNeeded` is invoked via the base class method 
`ReferenceManager#maybeRefresh()`, which takes care of reference management 
tasks like acquiring exclusive locks, invoking refreshIfNeeded, doing an atomic 
swap if needed, releasing the old reference, and notifying listeners. This API, 
however, does not accept an index commit. It also doesn't feel right to modify 
this API or add new ones, since that would break the reference-agnostic nature 
of ReferenceManager.
   
   One way to achieve commit-specific refresh is for applications to 
extend `SearcherManager` (or `SearcherTaxonomyManager`) and override 
`refreshIfNeeded`. However, this can be a bit involved, as applications may 
have to re-implement some logic (like the searcher and taxonomy coupling in 
SearcherTaxonomyManager).
   
   To support this out of the box within Lucene, we could modify the existing 
`SearcherManager` / `SearcherTaxonomyManager` to use the commit-specific 
DirectoryReader refresh, and provide hooks within these classes to select the 
ideal commit - for example, a `RefreshCommitSelector` policy that users can 
implement to supply the refresh commit. The policy could default to returning 
the latest directory commit, to stay in line with current behavior. Expert 
users could supply more nuanced logic, like picking the newest commit that has 
< N bytes worth of new segment files.
   Alternatively, we could also add a new `NRTSearcherManager` with the 
above-mentioned support. 
   
   I think these hooks can be useful for segment replication. Would like to 
hear from the community on whether this seems like a useful thing to support 
from within Lucene.
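   
   For discussion, here is a minimal sketch of what the commit-selecting 
refresh could look like; `RefreshCommitSelector` is the hypothetical hook 
proposed above, while `DirectoryReader.listCommits` and 
`openIfChanged(reader, commit)` are existing Lucene APIs:
   
   ```java
   import java.io.IOException;
   import java.util.List;
   import org.apache.lucene.index.DirectoryReader;
   import org.apache.lucene.index.IndexCommit;
   import org.apache.lucene.store.Directory;
   
   interface RefreshCommitSelector {
     /** Picks the commit to refresh onto; returning the last element keeps today's behavior. */
     IndexCommit select(List<IndexCommit> commits);
   }
   
   final class CommitSelectingRefresher {
     private final RefreshCommitSelector selector;
   
     CommitSelectingRefresher(RefreshCommitSelector selector) {
       this.selector = selector;
     }
   
     /** Mirrors SearcherManager#refreshIfNeeded: returns null when nothing changed. */
     DirectoryReader refreshIfNeeded(DirectoryReader oldReader, Directory dir) throws IOException {
       IndexCommit commit = selector.select(DirectoryReader.listCommits(dir));
       return DirectoryReader.openIfChanged(oldReader, commit);
     }
   }
   ```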
   





Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


mikemccand commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2005462386


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:
##
@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class Trie {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+UNSAVED,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+private final int label;
+private final LinkedList<Node> children;
+private Output output;
+private long fp = -1;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.UNSAVED;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  Trie(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  void putAll(Trie trie) {
+if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+trie.status = Status.DESTROYED;
+putAll(this.root, trie.root);
+  }
+
+  private static void putAll(Node n, Node add) {
+assert n.label == add.label;
+if (add.output != null) {
+  n.output = add.output;
+}
+ListIterator<Node> iter = n.children.listIterator();
+// TODO we can be more efficient if there is no intersection; block tree 
always does that
+outer:
+for (Node addChild : add.children) {
+  while (iter.hasNext()) {
+Node nChild = iter.next();
+if (nChild.label == addChild.label) {
+  putAll(nChild, addChild);
+  continue outer;
+}
+if (nChild.label > addChild.label) {

Review Comment:
   Maybe document and `assert` this somewhere?  I think it's non-obvious ... 
it's the merge sorting in `absorb` that makes it so I think?






Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-20 Thread via GitHub


mayya-sharipova commented on code in PR #14331:
URL: https://github.com/apache/lucene/pull/14331#discussion_r2005464935


##
lucene/core/src/java/org/apache/lucene/util/hnsw/MergingHnswGraphBuilder.java:
##
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.lucene.util.BitSet;
+
+/**
+ * A graph builder that is used during segments' merging.
+ *
+ * This builder uses a smart algorithm to merge multiple graphs into a 
single graph. The
+ * algorithm is based on the idea that if we know where we want to insert a 
node, we have a good
+ * idea of where we want to insert its neighbours.
+ *
+ * The algorithm is based on the following steps:
+ * 1. Get all graphs that don't have deletions and sort them by size.
+ * 2. Copy the largest graph to the new graph (gL).
+ * 3. For each remaining small graph (gS):
+ *    - Find the nodes that best cover gS: join set `j`. These nodes will be 
inserted into gL as
+ *      usual, by searching gL to find the best candidates `w` to which to 
connect the nodes.
+ *    - For each remaining node in gS, we do NOT search in gL. Instead, we 
form `w` by the union
+ *      of the node's neighbors in gS and the node's neighbors' neighbors in 
gL.
+ *
+ * We expect the size of the join set `j` to be small, 1/5-1/2 of the size of 
gS. And for the
+ * rest of the nodes of gS, we expect savings by not doing an extra search in 
gL.
+ *
+ * @lucene.experimental
+ */

Review Comment:
   Addressed in cb852a6387a09ba43049b8a24f1e026c309b368b






Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


mikemccand commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2005463552


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:
##
@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class Trie {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+UNSAVED,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+private final int label;
+private final LinkedList<Node> children;
+private Output output;
+private long fp = -1;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.UNSAVED;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  Trie(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  void putAll(Trie trie) {
+if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+trie.status = Status.DESTROYED;
+putAll(this.root, trie.root);
+  }
+
+  private static void putAll(Node n, Node add) {
+assert n.label == add.label;
+if (add.output != null) {
+  n.output = add.output;
+}
+ListIterator<Node> iter = n.children.listIterator();
+// TODO we can be more efficient if there is no intersection; block tree 
always does that
+outer:
+for (Node addChild : add.children) {
+  while (iter.hasNext()) {
+Node nChild = iter.next();
+if (nChild.label == addChild.label) {
+  putAll(nChild, addChild);
+  continue outer;
+}
+if (nChild.label > addChild.label) {
+  iter.previous(); // move back
+  iter.add(addChild);
+  continue outer;
+}
+  }
+  iter.add(addChild);
+}
+  }
+
+  Output getEmptyOutput() {
+return root.output;
+  }
+
+  void forEach(BiConsumer<BytesRef, Output> consumer) {
+if (root.output != null) {
+  consumer.accept(new BytesRef(), root.output);
+}
+intersect(root.children, new BytesRefBuilder(), consumer);
+  }
+
+  private void intersect(
+  List<Node> nodes, BytesRefBuilder key, BiConsumer<BytesRef, Output> 
consumer) {
+for (Node node : nodes) {
+  key.append((byte) node.label);
+  if (node.output != null) consumer.accept(key.toBytesRef(), node.output);
+  intersect(node.children, key, consumer);
+  key.setLength(key.length() - 1);
+}
+  }
+
+  void save(DataOutput meta, IndexOutput index) throws IOException {
+if (status != Status.UNSAVED) {
+  throw new IllegalStateException("only unsaved trie can be saved");
+}
+status = Status.SAVED;

Review Comment:
   +1 to design for today!




Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


mikemccand commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2005466812


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:
##
@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class Trie {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+UNSAVED,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+private final int label;
+private final LinkedList<Node> children;
+private Output output;
+private long fp = -1;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.UNSAVED;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  Trie(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  void putAll(Trie trie) {
+if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+trie.status = Status.DESTROYED;
+putAll(this.root, trie.root);
+  }
+
+  private static void putAll(Node n, Node add) {
+assert n.label == add.label;
+if (add.output != null) {
+  n.output = add.output;
+}
+ListIterator<Node> iter = n.children.listIterator();
+// TODO we can be more efficient if there is no intersection; block tree 
always does that
+outer:
+for (Node addChild : add.children) {
+  while (iter.hasNext()) {
+Node nChild = iter.next();
+if (nChild.label == addChild.label) {
+  putAll(nChild, addChild);

Review Comment:
   Let's explicitly test massive terms sent to `TrieBuilder`?






Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


mikemccand commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2005473348


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:
##
@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class Trie {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+UNSAVED,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+private final int label;
+private final LinkedList<Node> children;
+private Output output;
+private long fp = -1;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.UNSAVED;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  Trie(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  void putAll(Trie trie) {
+if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+trie.status = Status.DESTROYED;
+putAll(this.root, trie.root);
+  }
+
+  private static void putAll(Node n, Node add) {
+assert n.label == add.label;
+if (add.output != null) {
+  n.output = add.output;
+}
+ListIterator<Node> iter = n.children.listIterator();
+// TODO we could be more efficient if there is no intersection; block tree 
always does that
+outer:
+for (Node addChild : add.children) {
+  while (iter.hasNext()) {
+Node nChild = iter.next();
+if (nChild.label == addChild.label) {
+  putAll(nChild, addChild);
+  continue outer;
+}
+if (nChild.label > addChild.label) {
+  iter.previous(); // move back
+  iter.add(addChild);
+  continue outer;
+}
+  }
+  iter.add(addChild);
+}
+  }
+
+  Output getEmptyOutput() {
+return root.output;
+  }
+
+  void forEach(BiConsumer<BytesRef, Output> consumer) {
+if (root.output != null) {
+  consumer.accept(new BytesRef(), root.output);
+}
+intersect(root.children, new BytesRefBuilder(), consumer);
+  }
+
+  private void intersect(
+  List<Node> nodes, BytesRefBuilder key, BiConsumer<BytesRef, Output> 
consumer) {
+for (Node node : nodes) {
+  key.append((byte) node.label);
+  if (node.output != null) consumer.accept(key.toBytesRef(), node.output);
+  intersect(node.children, key, consumer);
+  key.setLength(key.length() - 1);
+}
+  }
+
+  void save(DataOutput meta, IndexOutput index) throws IOException {
+if (status != Status.UNSAVED) {
+  throw new IllegalStateException("only unsaved trie can be saved");
+}
+status = Status.SAVED;
+meta.writeVLong(index.getFilePointer());
+saveNodes(index);
+meta.writeVLong(root.fp);
+index.writeLong(0L); // additional 8 bytes for over-reading
+meta.writeVLong(index.getFilePointer());
+  }
+
+  void saveNodes(IndexOutput index) throws IOException {
+final long startFP = index.getFilePointer();
+Deque<Node> stack = new ArrayDeque<>();
+sta

[I] Case insensitive regex query with character range [lucene]

2025-03-20 Thread via GitHub


petrsimon opened a new issue, #14378:
URL: https://github.com/apache/lucene/issues/14378

   ### Description
   
   Hi,
   I'm implementing regex search in a Java app with Elasticsearch 8.* and I've 
noticed unexpected behaviour with the `CASE_INSENSITIVE` flag.
   
   It seems that Lucene ignores the flag for character ranges, and I'm 
getting the following results when searching fields analyzed by KeywordAnalyzer:
   
   ```
   "[abc]+" -> "abc", "ABC"
   "[ABC]+" ->  "abc", "ABC"
   "[a-c]+" -> "abc"
   "[A-C]+" -> "ABC"
   ```
   
   Reproduced in https://github.com/petrsimon/lucene-regex
   
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]

2025-03-20 Thread via GitHub


vsop-479 commented on PR #14371:
URL: https://github.com/apache/lucene/pull/14371#issuecomment-2740387593

   Thanks for your feedback @gf2121. This patch is still in progress and has 
not been measured yet.
   
   > If you look at newest code in Lucene101PostingsReader, you may find we are 
using VectorMask to speed up this
   
   Thanks for the reminder. I noticed the vectorization approach when I found 
https://github.com/apache/lucene/pull/13692 had been reverted. But I am not 
sure we can use vectorization for `IndexedDISI.slice`.
   
   > that was what i had in mind - get a MemorySegment slice if it is not null, 
and play it with VectorMask.
   
   That would be nice; I just noticed `ShortVector#fromMemorySegment`.
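
   For reference, a minimal sketch (not from this patch) of scanning a 
`MemorySegment` slice of a sparse block with a `VectorMask`, assuming the 
block holds sorted unsigned 16-bit doc deltas; the class and method names are 
made up:
   
   ```
   import java.lang.foreign.MemorySegment;
   import java.lang.foreign.ValueLayout;
   import java.nio.ByteOrder;
   import jdk.incubator.vector.ShortVector;
   import jdk.incubator.vector.VectorMask;
   import jdk.incubator.vector.VectorOperators;
   import jdk.incubator.vector.VectorSpecies;
   
   final class SparseBlockScan {
 private static final VectorSpecies<Short> SPECIES = ShortVector.SPECIES_PREFERRED;
   
 // Index of the first value >= target among the sorted, unsigned 16-bit
 // values in segment[0..count), or count if there is none.
 static int firstGeq(MemorySegment segment, int count, short target) {
   int i = 0;
   for (int bound = SPECIES.loopBound(count); i < bound; i += SPECIES.length()) {
 ShortVector vec = ShortVector.fromMemorySegment(
 SPECIES, segment, (long) i * Short.BYTES, ByteOrder.LITTLE_ENDIAN);
 VectorMask<Short> m = vec.compare(VectorOperators.UNSIGNED_GE, target);
 if (m.anyTrue()) {
   return i + m.firstTrue();
 }
   }
   for (; i < count; i++) { // scalar tail
 int v = segment.get(ValueLayout.JAVA_SHORT_UNALIGNED, (long) i * Short.BYTES) & 0xFFFF;
 if (v >= (target & 0xFFFF)) {
   return i;
 }
   }
   return count;
 }
   }
   ```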


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Case insensitive regex query with character range [lucene]

2025-03-20 Thread via GitHub


petrsimon commented on issue #14378:
URL: https://github.com/apache/lucene/issues/14378#issuecomment-2740838289

   I see, thanks a lot!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Case insensitive regex query with character range [lucene]

2025-03-20 Thread via GitHub


rmuir commented on issue #14378:
URL: https://github.com/apache/lucene/issues/14378#issuecomment-2740916833

   If the regexp parser doesn't document that it has the feature, then it doesn't 
support it: 
https://lucene.apache.org/core/10_1_0/core/org/apache/lucene/util/automaton/RegExp.html
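
   In the meantime, a workaround is to spell out both cases inside the 
character class instead of relying on the flag to fold the range, e.g. (field 
name made up):
   
   ```
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.RegexpQuery;
   
   // Matches "abc" and "ABC" without needing case folding of the range.
   Query q = new RegexpQuery(new Term("field", "[a-cA-C]+"));
   ```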


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


mikemccand commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2005458945


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:
##
@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class Trie {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+UNSAVED,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+private final int label;
+private final LinkedList<Node> children;
+private Output output;
+private long fp = -1;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.UNSAVED;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  Trie(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  void putAll(Trie trie) {
+if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+trie.status = Status.DESTROYED;
+putAll(this.root, trie.root);
+  }
+
+  private static void putAll(Node n, Node add) {
+assert n.label == add.label;
+if (add.output != null) {
+  n.output = add.output;
+}
+ListIterator<Node> iter = n.children.listIterator();
+// TODO we could be more efficient if there is no intersection; block tree 
always does that

Review Comment:
   Ahhh, OK I see, it is like a merge sort ("zipper" two iterators together).  
I agree it's O(M+N).
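
   For anyone skimming, the same zipper pattern over two sorted lists of ints 
(a standalone illustration, not code from the PR):
   
   ```
   // Merge sorted `add` into sorted `base` in O(M + N): one ListIterator
   // advances monotonically instead of re-scanning from the head per element.
   static void zipperMerge(LinkedList<Integer> base, LinkedList<Integer> add) {
 ListIterator<Integer> iter = base.listIterator();
 outer:
 for (int value : add) {
   while (iter.hasNext()) {
 int cur = iter.next();
 if (cur == value) {
   continue outer; // already present, like the recursive putAll case
 }
 if (cur > value) {
   iter.previous(); // move back so we insert before `cur`
   iter.add(value);
   continue outer;
 }
   }
   iter.add(value); // reached the end: append
 }
   }
   ```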



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2005498999


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:
##
@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class Trie {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+UNSAVED,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+private final int label;
+private final LinkedList<Node> children;
+private Output output;
+private long fp = -1;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.UNSAVED;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  Trie(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  void putAll(Trie trie) {
+if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+trie.status = Status.DESTROYED;
+putAll(this.root, trie.root);
+  }
+
+  private static void putAll(Node n, Node add) {
+assert n.label == add.label;
+if (add.output != null) {
+  n.output = add.output;
+}
+ListIterator<Node> iter = n.children.listIterator();
+// TODO we could be more efficient if there is no intersection; block tree 
always does that
+outer:
+for (Node addChild : add.children) {
+  while (iter.hasNext()) {
+Node nChild = iter.next();
+if (nChild.label == addChild.label) {
+  putAll(nChild, addChild);
+  continue outer;
+}
+if (nChild.label > addChild.label) {
+  iter.previous(); // move back
+  iter.add(addChild);
+  continue outer;
+}
+  }
+  iter.add(addChild);
+}
+  }
+
+  Output getEmptyOutput() {
+return root.output;
+  }
+
+  void forEach(BiConsumer<BytesRef, Output> consumer) {
+if (root.output != null) {
+  consumer.accept(new BytesRef(), root.output);
+}
+intersect(root.children, new BytesRefBuilder(), consumer);
+  }
+
+  private void intersect(
+  List<Node> nodes, BytesRefBuilder key, BiConsumer<BytesRef, Output> 
consumer) {
+for (Node node : nodes) {
+  key.append((byte) node.label);
+  if (node.output != null) consumer.accept(key.toBytesRef(), node.output);
+  intersect(node.children, key, consumer);
+  key.setLength(key.length() - 1);
+}
+  }
+
+  void save(DataOutput meta, IndexOutput index) throws IOException {
+if (status != Status.UNSAVED) {
+  throw new IllegalStateException("only unsaved trie can be saved");
+}
+status = Status.SAVED;
+meta.writeVLong(index.getFilePointer());
+saveNodes(index);
+meta.writeVLong(root.fp);
+index.writeLong(0L); // additional 8 bytes for over-reading
+meta.writeVLong(index.getFilePointer());
+  }
+
+  void saveNodes(IndexOutput index) throws IOException {
+final long startFP = index.getFilePointer();
+Deque<Node> stack = new ArrayDeque<>();
+stack.p

Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-03-20 Thread via GitHub


tteofili commented on PR #14094:
URL: https://github.com/apache/lucene/pull/14094#issuecomment-2740827581

   additional experiments with different quantization levels and filtering:
   
   ## No-filtering
   
   ### Baseline
   ```
   recall  latency(ms)nDoc  topK  fanout  maxConn  beamWidth  quantized  
index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB) 
 indexType
0.9854.620  20   100  50   64250 no
106.46   1878.64 3  600.08   585.938  585.938   
HNSW
0.8993.657  20   100  50   64250 7 bits 
67.74   2952.47 5  746.34   733.185  147.247
   HNSW
0.5852.328  20   100  50   64250 4 bits 
46.86   4268.03 3  675.33   659.943   74.005
   HNSW
0.9839.212  50   100  50   64250 no
235.68   2121.56 8 1501.44  1464.844 1464.844   
HNSW
0.9007.562  50   100  50   64250 7 bits
165.99   3012.30 9 1867.29  1832.962  368.118   
HNSW
0.5804.934  50   100  50   64250 4 bits
130.65   3826.96 8 1689.29  1649.857  185.013   
HNSW
   ```
   ### Candidate
   ```
   recall  latency(ms)nDoc  topK  fanout  maxConn  beamWidth  quantized  
visited  index(s)  index_docs/s  num_segments  index_size(MB)  selectivity  
vec_disk(MB)  vec_RAM(MB)  indexType
0.9803.744  20   100  50   64250 no
10690106.82   1872.29 3  600.10 1.00   
585.938  585.938   HNSW
0.8963.473  20   100  50   64250 7 bits
11878 68.83   2905.54 5  746.39 1.00   
733.185  147.247   HNSW
0.5852.032  20   100  50   64250 4 bits
13279 51.32   3897.12 3  675.32 1.00   
659.943   74.005   HNSW
0.9828.549  50   100  50   64250 no
23079248.29   2013.81 8 1501.32 1.00  
1464.844 1464.844   HNSW
0.8986.733  50   100  50   64250 7 bits
23629167.17   2991.02 9 1867.31 1.00  
1832.962  368.118   HNSW
0.5813.776  50   100  50   64250 4 bits
21179152.43   3280.24 5 1690.38 1.00  
1649.857  185.013   HNSW
   ```
   
   ## Filtering 
   
   ### Baseline
   ```
   recall  latency(ms)nDoc  topK  fanout  maxConn  beamWidth  quantized  
visited  index(s)  index_docs/s  num_segments  index_size(MB)  selectivity  
vec_disk(MB)  vec_RAM(MB)  indexType
1.0000.642  20   100  50   64250 no 
1965109.81   1821.26 3  600.16 0.01   
585.938  585.938   HNSW
0.9644.947  20   100  50   64250 no 
9504110.91   1803.33 3  600.11 0.10   
585.938  585.938   HNSW
0.9838.417  20   100  50   64250 no
22193103.13   1939.28 3  600.09 0.50   
585.938  585.938   HNSW
0.9180.762  20   100  50   64250 7 bits 
1981 64.33   3108.82 5  746.33 0.01   
733.185  147.247   HNSW
0.8924.310  20   100  50   64250 7 bits
10302 66.23   3019.87 5  746.34 0.10   
733.185  147.247   HNSW
0.8986.900  20   100  50   64250 7 bits
23394 69.09   2894.82 4  746.51 0.50   
733.185  147.247   HNSW
0.6601.137  20   100  50   64250 4 bits 
1695 50.01   3999.44 3  675.40 0.01   
659.943   74.005   HNSW
0.6192.852  20   100  50   64250 4 bits
11021 49.88   4010.03 3  675.31 0.10   
659.943   74.005   HNSW
0.5924.429  20   100  50   64250 4 bits
27121 48.72   4104.75 3  675.30 0.50   
659.943   74.005   HNSW
1.0002.371  50   100  50   64250 no 
5017244.18   2047.64 8 1501.36 0.01  
1464.844 1464.844   HNSW
0.968   11.976  50  

Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2006876856


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix points to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block 
consists entirely of
+   * pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large 
block of terms sharing
+   * a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+BUILDING,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+
+// The utf8 digit that leads to this Node, 0 for root node
+private final int label;
+// The children listed in order by their utf8 label
+private final LinkedList<Node> children;
+// The output of this node.
+private Output output;
+
+// Vars used during saving:
+
+// The file pointer pointing to where the node was saved. -1 means the node has 
not been saved.
+private long fp = -1;
+// The iterator whose next() points to the first child that has not been saved.
+private Iterator<Node> childrenIterator;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie 
builder should not
+   * contain a key that already exists in this one, otherwise an {@link 
IllegalArgumentException} will be
+   * thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+// Use a simple stack to avoid recursion.
+Deque<Runnable> stack = new ArrayDeque<>();
+stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+while (!stack.isEmpty()) {
+  stack.pop().run();
+}
+trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque<Runnable> stack) {
+assert n.label == add.label;
+if (add.output != null) {
+  if (n.output != null) {
+ 

Re: [PR] Optimize ParallelLeafReader to improve term vector fetching efficiency [lucene]

2025-03-20 Thread via GitHub


DivyanshIITB commented on PR #14373:
URL: https://github.com/apache/lucene/pull/14373#issuecomment-2741117231

   Just a gentle reminder, @vigyasharma.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2005865922


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix points to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block 
consists entirely of
+   * pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large 
block of terms sharing
+   * a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+BUILDING,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+
+// The utf8 digit that leads to this Node, 0 for root node
+private final int label;
+// The children listed in order by their utf8 label
+private final LinkedList<Node> children;
+// The output of this node.
+private Output output;
+
+// Vars used during saving:
+
+// The file pointer pointing to where the node was saved. -1 means the node has 
not been saved.
+private long fp = -1;
+// The iterator whose next() points to the first child that has not been saved.
+private Iterator<Node> childrenIterator;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie 
builder should not
+   * contain a key that already exists in this one, otherwise an {@link 
IllegalArgumentException} will be
+   * thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+// Use a simple stack to avoid recursion.
+Deque<Runnable> stack = new ArrayDeque<>();
+stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+while (!stack.isEmpty()) {
+  stack.pop().run();
+}
+trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque<Runnable> stack) {
+assert n.label == add.label;
+if (add.output != null) {
+  if (n.output != null) {
+ 

Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-20 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2005852050


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix point to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block 
consists entirely of
+   * pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large 
block of terms sharing
+   * a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+BUILDING,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {

Review Comment:
   Double yes on read / write side!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] build support: java 24 [lucene]

2025-03-20 Thread via GitHub


rmuir commented on issue #14379:
URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740938412

   I tried allowing 24 and Gradle only failed in the usual way (incompatible 
classfile): we have to wait for them to issue a Gradle release that "supports 
24" so they can parse the classfiles of the JDK :(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] build support: java 24 [lucene]

2025-03-20 Thread via GitHub


dweiss commented on issue #14379:
URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740939418

   https://github.com/gradle/gradle/issues/32290


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] build support: java 24 [lucene]

2025-03-20 Thread via GitHub


dweiss commented on issue #14379:
URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740930456

   > build currently requires 23 exactly
   
   This is terrible. I'll take a look. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] build support: java 24 [lucene]

2025-03-20 Thread via GitHub


dweiss commented on issue #14379:
URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740943620

   They do tons of weird stuff these days that requires bytecode manipulation 
and touching everything upon loading. I don't think there is any way around it 
other than waiting for that issue to be solved. I'll keep an eye on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org