Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]

2025-03-21 Thread via GitHub


msfroh commented on PR #14389:
URL: https://github.com/apache/lucene/pull/14389#issuecomment-2744397587

   Awesome! Can I go ahead and use this for 
https://github.com/apache/lucene/pull/14350 once it's merged?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-03-21 Thread via GitHub


rmuir opened a new pull request, #14381:
URL: https://github.com/apache/lucene/pull/14381

   Add optional flag to support case-insensitive ranges. A minimal DFA is 
always created. This works with Unicode but may have a performance cost.
   
   Each codepoint in the range must be iterated, and any alternatives added to 
a set. This can be large if the range spans much of Unicode.
   
   CPU and memory costs are contained within a single function enabled by the 
optional flag. For example when matching a caseless `/[a-z]/`, 56 codepoints 
will be accumulated into an `int[]`, which is then compressed to 5 ranges 
before adding to the parse tree.
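
The compression step described above (many individual codepoints collapsed into a handful of ranges) can be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical and are not taken from the PR, and the input is a hand-picked set rather than the 56 codepoints mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: collapses code points into ranges.
final class RangeCompressionSketch {

  /** Returns sorted, non-overlapping inclusive {start, end} ranges covering the input. */
  static List<int[]> toRanges(int[] codePoints) {
    int[] sorted = codePoints.clone();
    Arrays.sort(sorted);
    List<int[]> ranges = new ArrayList<>();
    for (int cp : sorted) {
      int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
      if (last != null && cp <= last[1] + 1) {
        // Duplicate or adjacent code point: extend the current range.
        last[1] = Math.max(last[1], cp);
      } else {
        // Gap found: start a new single-element range.
        ranges.add(new int[] {cp, cp});
      }
    }
    return ranges;
  }
}
```

With the input `a, b, c, A, B, C, U+212A` this sketch produces three ranges: `A-C`, `a-c`, and the single codepoint `U+212A`.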
   
   Closes #14378
   
   Here's what the resulting `/[a-z]/` automaton looks like in case you are curious:
   ![graphviz 
(5)](https://github.com/user-attachments/assets/dfbc25cd-4a32-4ffc-aee3-ab8dd43a63ec)
   





Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-03-21 Thread via GitHub


rmuir commented on code in PR #14381:
URL: https://github.com/apache/lucene/pull/14381#discussion_r2007006500


##
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##
@@ -778,6 +786,53 @@ private int[] toCaseInsensitiveChar(int codepoint) {
 }
   }
 
+  /**
+   * Expands range to include case-insensitive matches.
+   *
+   * This is expensive: case-insensitive range involves iterating over the 
range space, adding
+   * alternatives. Jump on the grenade here, contain CPU and memory explosion 
just to this method
+   * activated by optional flag.
+   */
+  private void expandCaseInsensitiveRange(
+  int start, int end, List<Integer> rangeStarts, List<Integer> rangeEnds) {
+if (start > end)
+  throw new IllegalArgumentException(
+  "invalid range: from (" + start + ") cannot be > to (" + end + ")");
+
+// contain the explosion of transitions by using a throwaway state
+Automaton scratch = new Automaton();
+int state = scratch.createState();
+
+// iterate over range, adding codepoint and any alternatives as transitions
+for (int i = start; i <= end; i++) {
+  scratch.addTransition(state, state, i);
+  int[] altCodePoints = CaseFolding.lookupAlternates(i);
+  if (altCodePoints != null) {
+for (int alt : altCodePoints) {
+  scratch.addTransition(state, state, alt);
+}
+  } else {
+int altCase =
+Character.isLowerCase(i) ? Character.toUpperCase(i) : 
Character.toLowerCase(i);
+if (altCase != i) {
+  scratch.addTransition(state, state, altCase);
+}
+  }
+}

Review Comment:
   good call. this is better than returning mutable arrays that could get 
messed up by bugs, or creating gazillions of arrays.






Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub


alessandrobenedetti commented on code in PR #14173:
URL: https://github.com/apache/lucene/pull/14173#discussion_r2007476642


##
lucene/core/src/java/org/apache/lucene/util/hnsw/UpdatableScoreHeap.java:
##


Review Comment:
   For example, what are the benefits of this in comparison to the changes I 
proposed: lucene/core/src/java/org/apache/lucene/util/LongHeap.java in 
https://github.com/apache/lucene/pull/12314/files?






Re: [PR] Adjust equivalent min similarity HNSW exploration logic [lucene]

2025-03-21 Thread via GitHub


benwtrent merged PR #14366:
URL: https://github.com/apache/lucene/pull/14366





Re: [I] TestKnnGraph.testMultiThreadedSearch random test failure [lucene]

2025-03-21 Thread via GitHub


benwtrent closed issue #14327: TestKnnGraph.testMultiThreadedSearch random test 
failure
URL: https://github.com/apache/lucene/issues/14327






[I] Reduce memory usage when merging bkd trees [lucene]

2025-03-21 Thread via GitHub


iverase opened a new issue, #14382:
URL: https://github.com/apache/lucene/issues/14382

   When building BKD trees, we hold two arrays in memory whose sizes grow 
linearly with the number of leaf nodes. One array contains the pointer to the 
start of each leaf node, and the other contains the split values. The number 
of leaf nodes grows not with the number of documents but with the number of 
values, so for multi-valued fields these arrays can grow quite large. 
   
   The situation is particularly inefficient for the `OneDimensionBKDWriter`, 
where we use a List to hold the split values. I wonder if we can use more 
efficient data structures to lower the heap usage.
   
   For example, maybe we can use the `FixedLengthBytesRefArray` to hold split 
values, or use some packing algorithm to hold the leaf pointers.
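
To make the "packing algorithm" idea concrete, here is a minimal sketch that stores leaf pointers as fixed-width deltas from the first pointer instead of one `long` each. It is hand-rolled for illustration only: the class name is hypothetical, it does not use any Lucene packing utility, and it assumes no pointer is smaller than the first one.

```java
// Hypothetical illustration, not Lucene code: bit-packs (pointer - base) values.
final class PackedLeafPointersSketch {
  private final long base;
  private final int bitsPerValue;
  private final long[] blocks;

  PackedLeafPointersSketch(long[] pointers) {
    base = pointers.length == 0 ? 0 : pointers[0];
    long maxDelta = 0;
    for (long p : pointers) {
      if (p < base) {
        throw new IllegalArgumentException("pointers must be >= the first pointer");
      }
      maxDelta = Math.max(maxDelta, p - base);
    }
    // Number of bits needed for the largest delta (at least 1, at most 63).
    bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(maxDelta));
    blocks = new long[(int) (((long) pointers.length * bitsPerValue + 63) / 64)];
    for (int i = 0; i < pointers.length; i++) {
      set(i, pointers[i] - base);
    }
  }

  private void set(int index, long value) {
    long bitPos = (long) index * bitsPerValue;
    int word = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    blocks[word] |= value << shift;
    int spill = shift + bitsPerValue - 64; // bits that do not fit in this word
    if (spill > 0) {
      blocks[word + 1] |= value >>> (bitsPerValue - spill);
    }
  }

  long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int word = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long value = blocks[word] >>> shift;
    int spill = shift + bitsPerValue - 64;
    if (spill > 0) {
      value |= blocks[word + 1] << (bitsPerValue - spill);
    }
    return base + (value & ((1L << bitsPerValue) - 1));
  }

  /** Number of 64-bit words used, for comparison with one long per pointer. */
  int wordCount() {
    return blocks.length;
  }
}
```

Whether something like this pays off depends on how spread out the leaf pointers are; a block-wise or monotonic encoding would likely do better than a single global bit width.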





Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-03-21 Thread via GitHub


benwtrent commented on code in PR #14094:
URL: https://github.com/apache/lucene/pull/14094#discussion_r2007365351


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java:
##
@@ -50,4 +51,11 @@ public TopDocs topDocs() {
 : TotalHits.Relation.EQUAL_TO),
 td.scoreDocs);
   }
+
+  @Override
+  public void nextCandidate() {
+if (this.collector instanceof HnswKnnCollector) {
+  ((HnswKnnCollector) this.collector).nextCandidate();
+}

Review Comment:
   ```suggestion
   if (this.collector instanceof HnswKnnCollector hnswCollector) {
 hnswCollector.nextCandidate();
   }
   ```



##
lucene/core/src/java/org/apache/lucene/search/HnswQueueSaturationCollector.java:
##
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * A {@link HnswKnnCollector} that early exits when nearest neighbor queue 
keeps saturating beyond a
+ * 'patience' parameter. This records the rate of collection of new nearest 
neighbors in the {@code
+ * delegate} KnnCollector queue, at each HNSW node candidate visit. Once it 
saturates for a number
+ * of consecutive node visits (e.g., the patience parameter), this early 
terminates.
+ *
+ * @lucene.experimental
+ */
+public class HnswQueueSaturationCollector extends HnswKnnCollector {
+
+  private final KnnCollector delegate;
+  private final double saturationThreshold;
+  private final int patience;
+  private boolean patienceFinished;
+  private int countSaturated;
+  private int previousQueueSize;
+  private int currentQueueSize;
+
+  HnswQueueSaturationCollector(KnnCollector delegate, double 
saturationThreshold, int patience) {
+super(delegate);
+this.delegate = delegate;
+this.previousQueueSize = 0;
+this.currentQueueSize = 0;
+this.countSaturated = 0;
+this.patienceFinished = false;
+this.saturationThreshold = saturationThreshold;
+this.patience = patience;
+  }
+
+  @Override
+  public boolean earlyTerminated() {
+return delegate.earlyTerminated() || patienceFinished;
+  }
+
+  @Override
+  public boolean collect(int docId, float similarity) {
+boolean collect = delegate.collect(docId, similarity);
+if (collect) {
+  currentQueueSize++;
+}
+return collect;
+  }
+
+  @Override
+  public float minCompetitiveSimilarity() {
+return delegate.minCompetitiveSimilarity();
+  }

Review Comment:
   since we are a decorator, do we need this?
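
For readers following along, the patience rule described in the javadoc above could look roughly like the sketch below. The field names mirror the ones in the diff, but the body of `nextCandidate()` is not shown in this excerpt, so the saturation formula here (the share of the current queue that was already present at the previous visit) is an assumption and may differ from the PR.

```java
// Hypothetical standalone sketch of a patience-based early exit; not the PR's class.
final class SaturationPatienceSketch {
  private final double saturationThreshold;
  private final int patience;
  private int previousQueueSize;
  private int currentQueueSize;
  private int countSaturated;
  private boolean patienceFinished;

  SaturationPatienceSketch(double saturationThreshold, int patience) {
    this.saturationThreshold = saturationThreshold;
    this.patience = patience;
  }

  /** Call whenever a new neighbor was accepted into the queue. */
  void onCollected() {
    currentQueueSize++;
  }

  /** Call once per visited graph node candidate. */
  void nextCandidate() {
    // Assumed definition: fraction of the queue that is unchanged since the last visit.
    double saturation =
        currentQueueSize == 0 ? 0.0 : (double) previousQueueSize / currentQueueSize;
    previousQueueSize = currentQueueSize;
    if (saturation >= saturationThreshold) {
      countSaturated++;
    } else {
      countSaturated = 0; // the streak must be consecutive
    }
    if (countSaturated > patience) {
      patienceFinished = true;
    }
  }

  boolean earlyTerminated() {
    return patienceFinished;
  }
}
```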



##
lucene/core/src/java/org/apache/lucene/search/HnswKnnCollector.java:
##
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+/**
+ * {@link KnnCollector} that exposes methods to hook into specific parts of 
the HNSW algorithm.
+ *
+ * @lucene.experimental
+ */
+public abstract class HnswKnnCollector extends KnnCollector.Decorator {

Review Comment:
   Ah, it is a little frustrating as we already have an "HNSWStrategy" and now 
we have an "HNSWCollector". 
   
   Could we utilize an HNSWStrategy? Or make `nextCandidate` a more general API?
   
   My thought on the strategy would be for the graph searcher to indicate 
through the strategy object when the next group of vectors will be searched; 
the strategy would hold a reference to the collector, to which it can forward 
the request. 
   
   Of course, this still requires a new `HnswQueueSaturati

Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub


rmuir commented on PR #14384:
URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743662885

   This one is pretty easy to understand: the `CaseFolding` class now just 
gives you `UnicodeSet(ch).closeOver(UnicodeSet.SIMPLE_CASE_INSENSITIVE)` 
without requiring that you have ICU. 
   
   The generation depends strictly upon the ICU version (which I will 
separately upgrade for Unicode 16 now that Java 24 has it).





Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-21 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2007802395


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix point to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block 
consists entirely of
+   * pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large 
block of terms sharing
+   * a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+BUILDING,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+
+// The utf8 digit that leads to this Node, 0 for root node
+private final int label;
+// The children listed in order by their utf8 label
+private final LinkedList<Node> children;
+// The output of this node.
+private Output output;
+
+// Vars used during saving:
+
+// The file pointer point to where the node saved. -1 means the node has 
not been saved.
+private long fp = -1;
+// The iterator whose next() point to the first child has not been saved.
+private Iterator<Node> childrenIterator;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie 
builder should not
+   * have key that already exists in this one, otherwise a {@link 
IllegalArgumentException } will be
+   * thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+// Use a simple stack to avoid recursion.
+Deque<Runnable> stack = new ArrayDeque<>();
+stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+while (!stack.isEmpty()) {
+  stack.pop().run();
+}
+trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque<Runnable> stack) {
+assert n.label == add.label;
+if (add.output != null) {
+  if (n.output != null) {
+ 

Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub


rmuir commented on PR #14384:
URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743717079

   It was easy because @uschindler already created a similar groovy script 
before.





Re: [I] Case insensitive regex query with character range [lucene]

2025-03-21 Thread via GitHub


rmuir closed issue #14378: Case insensitive regex query with character range
URL: https://github.com/apache/lucene/issues/14378





Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-21 Thread via GitHub


rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2744342290

   Maybe this one helps the issue: https://github.com/apache/lucene/pull/14389





Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]

2025-03-21 Thread via GitHub


john-wagster commented on PR #14389:
URL: https://github.com/apache/lucene/pull/14389#issuecomment-2744360276

   This is great; it helps me progress some of the regex work in ES, which is 
why I started that CaseFolding work.  Thanks for iterating on this @rmuir. 
   
   





Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub


alessandrobenedetti commented on PR #14173:
URL: https://github.com/apache/lucene/pull/14173#issuecomment-2743148001

   Catching up on this and trying to understand how far we are now from my 
original idea and implementation:
   https://github.com/apache/lucene/pull/12314
   
   Obviously, my code is completely outdated, but reading across this PR and 
https://github.com/apache/lucene/pull/13525, it seems we are converging again 
to what I originally proposed.
   
   I'll work on this for the next couple of weeks, so I should be able to add 
some comments and additional opinions.
   
   My main concern is still about ordinals becoming long, as far as I can see 
:)
   
   





Re: [I] Handling concurrent search in QueryProfiler [lucene]

2025-03-21 Thread via GitHub


jpountz commented on issue #14375:
URL: https://github.com/apache/lucene/issues/14375#issuecomment-271819

   Done!





Re: [I] Handle degenerate case where all HNSW search candidates are filtered [lucene]

2025-03-21 Thread via GitHub


benwtrent commented on issue #11787:
URL: https://github.com/apache/lucene/issues/11787#issuecomment-2743830018

   I think this has been fixed with all our HNSW filtering fixes:
   
- we drop to brute force if we explore too much
- we bypass the graph if the filter passes <= `k` docs
- We have implemented improved filtering search logic to aid with speed. 
   
   We can comfortably close this issue.





Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-21 Thread via GitHub


jpountz commented on PR #14365:
URL: https://github.com/apache/lucene/pull/14365#issuecomment-273990

   Hurray!
- https://benchmarks.mikemccandless.com/TermDayOfYearSort.html
- https://benchmarks.mikemccandless.com/TermDTSort.html





Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub


rmuir commented on PR #14384:
URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743831323

   There was something about gradle itself that was upset about dependencies 
wrt generation tasks, if I recall... cycle detection or something was 
complaining about it.





Re: [I] Handle degenerate case where all HNSW search candidates are filtered [lucene]

2025-03-21 Thread via GitHub


benwtrent closed issue #11787: Handle degenerate case where all HNSW search 
candidates are filtered
URL: https://github.com/apache/lucene/issues/11787





Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub


rmuir commented on PR #14384:
URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743736828

   I will follow up with an ICU upgrade PR to this one. I don't expect that this 
file will change except for the version in the comment though.





[PR] Implement #docIDRunEnd() on PostingsEnum. [lucene]

2025-03-21 Thread via GitHub


jpountz opened a new pull request, #14390:
URL: https://github.com/apache/lucene/pull/14390

   This implements `BlockPostingsEnum#docIDRunEnd()` by comparing the delta 
between doc IDs and between doc counts on the various skip levels.
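
The delta comparison can be illustrated with a small sketch. The class, method, and parameter names here are hypothetical and do not come from the PR: the assumption is that if the gap between the current doc ID and the last doc ID of a block (or skip level) equals the number of docs remaining in it, then every doc ID in between must be present.

```java
// Hypothetical illustration of the "compare doc ID delta with doc count delta" idea.
final class DocIdRunEndSketch {

  /**
   * @param doc current doc ID
   * @param lastDocInBlock last doc ID covered by the block or skip level
   * @param docsAfterCurrent number of docs after {@code doc}, up to and including
   *     {@code lastDocInBlock}
   * @return exclusive end of a run of consecutive doc IDs starting at {@code doc}
   */
  static int docIDRunEnd(int doc, int lastDocInBlock, int docsAfterCurrent) {
    if (lastDocInBlock - doc == docsAfterCurrent) {
      // No gaps are possible: the remaining docs are exactly doc+1 .. lastDocInBlock.
      return lastDocInBlock + 1;
    }
    // Nothing is known beyond the current doc.
    return doc + 1;
  }
}
```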
   





Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub


vigyasharma commented on PR #14173:
URL: https://github.com/apache/lucene/pull/14173#issuecomment-2744562872

   Thanks for looking into this PR, @alessandrobenedetti. This is the latest 
iteration on multi-vector support.
   
   It does build on the same central idea of assigning a unique ordinal to each 
vector and mapping multiple ordinals to a single doc. I tried a few other 
approaches, but this one seemed cleanest.
   
   I think the key difference from #12314 is the change to store metadata that 
lets us map multiple ordinals to a single doc. This is implemented in 
`MultiVectorOrdConfiguration` using `DirectMonotonicWriter/Reader`. For every 
doc, I maintain the ordinal of its first vector (`baseOrdinal`) along with the 
number of vectors in the doc, and use these to do the `ordToDoc` mapping for 
vectors. I didn't fully understand how this was done in your original PR, 
specifically how it mapped an ordinal back to its docId, given that we can have 
a variable number of vectors per doc. Maybe I missed something. If you had a 
simpler implementation, I'm happy to circle back to it.
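
As a rough illustration of the `baseOrdinal` idea, the sketch below maps an ordinal back to its doc with a binary search over the base ordinals. It is not the PR's `MultiVectorOrdConfiguration`: the class name is hypothetical, it uses a plain `int[]` rather than `DirectMonotonicWriter/Reader`, and it assumes docs are densely numbered and each has at least one vector.

```java
import java.util.Arrays;

// Hypothetical sketch: baseOrdinals[doc] is the ordinal of the doc's first vector.
final class OrdToDocSketch {
  private final int[] baseOrdinals;
  private final int totalVectors;

  OrdToDocSketch(int[] vectorCounts) {
    baseOrdinals = new int[vectorCounts.length];
    int next = 0;
    for (int doc = 0; doc < vectorCounts.length; doc++) {
      if (vectorCounts[doc] < 1) {
        throw new IllegalArgumentException("every doc needs at least one vector");
      }
      baseOrdinals[doc] = next;
      next += vectorCounts[doc];
    }
    totalVectors = next;
  }

  /** Returns the doc whose ordinal range contains {@code ord}. */
  int ordToDoc(int ord) {
    if (ord < 0 || ord >= totalVectors) {
      throw new IndexOutOfBoundsException("ordinal " + ord);
    }
    int idx = Arrays.binarySearch(baseOrdinals, ord);
    // Exact hit: ord is a first vector. Otherwise take the entry before the insertion point.
    return idx >= 0 ? idx : -idx - 2;
  }

  /** Number of vectors in the doc, derived from neighboring base ordinals. */
  int vectorCount(int doc) {
    int end = doc + 1 < baseOrdinals.length ? baseOrdinals[doc + 1] : totalVectors;
    return end - baseOrdinals[doc];
  }
}
```

With vector counts `{2, 1, 3}`, ordinals 0 and 1 map to doc 0, ordinal 2 to doc 1, and ordinals 3 through 5 to doc 2.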
   
   I also added an `allVectorValues()` API to `Byte|FloatVectorValues`, which I 
think will help during query time. Other than this, the changes are mostly 
around integrating multi-vector support and will likely have a lot of overlap.





Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-03-21 Thread via GitHub


vigyasharma commented on code in PR #14173:
URL: https://github.com/apache/lucene/pull/14173#discussion_r2008411867


##
lucene/core/src/java/org/apache/lucene/util/hnsw/UpdatableScoreHeap.java:
##


Review Comment:
   I'd like to keep the logic to update scores for already ingested docs 
encapsulated within the heap. By returning the array index within the heap (the 
LongHeap changes in #12314), we shift this responsibility to consumers, like 
the [NeighborQueue 
changes](https://github.com/apache/lucene/blob/1523ee796a6d35a7d92532590458b2a2d8dd9e4b/lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java#L99-L113),
 which can be trappy and cause repeated code.
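
As an illustration of keeping the update logic inside the heap, here is a minimal sketch of a min-heap that tracks each doc's slot internally, so callers never see array indices. This is not the PR's `UpdatableScoreHeap`; the class name `UpdatableMinHeapSketch` and its methods are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a score min-heap that updates existing docs in place. */
final class UpdatableMinHeapSketch {
  private int[] docs = new int[16];
  private float[] scores = new float[16];
  private int size;
  // docId -> current heap slot; maintained privately so consumers never handle indices.
  private final Map<Integer, Integer> position = new HashMap<>();

  /** Adds docId, or replaces its score if already present. Returns true if newly added. */
  boolean insertOrUpdate(int docId, float score) {
    Integer slot = position.get(docId);
    if (slot == null) {
      if (size == docs.length) {
        docs = Arrays.copyOf(docs, size * 2);
        scores = Arrays.copyOf(scores, size * 2);
      }
      docs[size] = docId;
      scores[size] = score;
      position.put(docId, size);
      size++;
      siftUp(size - 1);
      return true;
    }
    float old = scores[slot];
    scores[slot] = score;
    if (score < old) {
      siftUp(slot); // smaller score moves toward the root
    } else {
      siftDown(slot); // larger (or equal) score moves toward the leaves
    }
    return false;
  }

  int size() {
    return size;
  }

  /** Doc with the lowest score; the heap must be non-empty. */
  int topDoc() {
    if (size == 0) throw new IllegalStateException("empty heap");
    return docs[0];
  }

  private void siftUp(int i) {
    while (i > 0) {
      int parent = (i - 1) >>> 1;
      if (scores[parent] <= scores[i]) break;
      swap(i, parent);
      i = parent;
    }
  }

  private void siftDown(int i) {
    while (true) {
      int left = 2 * i + 1, right = left + 1, smallest = i;
      if (left < size && scores[left] < scores[smallest]) smallest = left;
      if (right < size && scores[right] < scores[smallest]) smallest = right;
      if (smallest == i) return;
      swap(i, smallest);
      i = smallest;
    }
  }

  private void swap(int a, int b) {
    int d = docs[a]; docs[a] = docs[b]; docs[b] = d;
    float s = scores[a]; scores[a] = scores[b]; scores[b] = s;
    position.put(docs[a], a);
    position.put(docs[b], b);
  }
}
```

The point of the sketch is the shape of the API: `insertOrUpdate` is the only entry point, so no consumer has to remember to re-heapify after changing a score.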
   






Re: [I] TestKnnGraph.testMultiThreadedSearch random test failure [lucene]

2025-03-21 Thread via GitHub


rmuir commented on issue #14327:
URL: https://github.com/apache/lucene/issues/14327#issuecomment-2743546449

   thank you @benwtrent 





Re: [I] Address gradle temp file pollution insanity [lucene]

2025-03-21 Thread via GitHub


dweiss commented on issue #14385:
URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743927366

   Ok, I've added gradle's "user home" tmp cleaning as well. Anything older 
than 3 hours is removed. This folder may be shared across builds so the time 
limit is there to prevent accidental cross-build issues.
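
For illustration only, an age-based cleanup like the one described above could look roughly like this in plain Java. This is a sketch under stated assumptions, not the actual build logic (which lives in the Gradle build): the class `StaleTempCleanerSketch` and method `deleteOlderThan` are hypothetical, and it only looks at regular files directly inside the given directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: removes stale files from a temp directory shared across builds. */
final class StaleTempCleanerSketch {

  /**
   * Deletes regular files directly under {@code dir} that were last modified more than
   * {@code maxAge} before {@code now}, and returns the paths that were deleted.
   */
  static List<Path> deleteOlderThan(Path dir, Duration maxAge, Instant now) throws IOException {
    List<Path> deleted = new ArrayList<>();
    if (!Files.isDirectory(dir)) {
      return deleted;
    }
    Instant cutoff = now.minus(maxAge);

    // Snapshot the listing first so we do not delete while the directory stream is open.
    List<Path> entries;
    try (Stream<Path> listing = Files.list(dir)) {
      entries = listing.collect(Collectors.toList());
    }

    for (Path p : entries) {
      if (Files.isRegularFile(p)
          && Files.getLastModifiedTime(p).toInstant().isBefore(cutoff)
          && Files.deleteIfExists(p)) {
        deleted.add(p);
      }
    }
    return deleted;
  }
}
```

Calling it with `Duration.ofHours(3)` mirrors the three-hour limit mentioned above: anything newer is left alone, since it may belong to another build that is still running.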
   





[PR] bump antlr 4.11.1 -> 4.13.2 [lucene]

2025-03-21 Thread via GitHub


rmuir opened a new pull request, #14388:
URL: https://github.com/apache/lucene/pull/14388

   The dependency is outdated; the main changes to the generated code avoid 
warnings on Java 21+.
   
   This one didn't magically work like ICU, so I simply force-regenerated. I 
tried messing around with the gradle dependsOn logic to get it to trigger on 
the antlr version bump, but was unsuccessful.





Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-21 Thread via GitHub


jpountz commented on PR #14365:
URL: https://github.com/apache/lucene/pull/14365#issuecomment-2744460409

   I pushed an annotation





Re: [PR] Clean up junk from gradle's user home (~/.gradle/.tmp). #14385 [lucene]

2025-03-21 Thread via GitHub


dweiss merged PR #14387:
URL: https://github.com/apache/lucene/pull/14387





[I] Address gradle temp file pollution insanity [lucene]

2025-03-21 Thread via GitHub


dweiss opened a new issue, #14385:
URL: https://github.com/apache/lucene/issues/14385

   ### Description
   
   Gradle creates temp files it never cleans up. Until this is resolved, let's 
try to do some housekeeping ourselves.
   
   Related issues:
   * #10215 
   * #10510
   * https://github.com/gradle/gradle/issues/15367
   * https://github.com/gradle/gradle/issues/12020





Re: [I] Address gradle temp file pollution insanity [lucene]

2025-03-21 Thread via GitHub


dweiss commented on issue #14385:
URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743998381

   There are also *.log files to wipe clean.





Re: [PR] Clean up junk from gradle's user home (~/.gradle/.tmp). #14385 [lucene]

2025-03-21 Thread via GitHub


dweiss commented on PR #14387:
URL: https://github.com/apache/lucene/pull/14387#issuecomment-2743985270

   I'll merge this in. Low risk and we can always revert if needed.





Re: [I] Address gradle temp file pollution insanity [lucene]

2025-03-21 Thread via GitHub


dweiss closed issue #14385: Address gradle temp file pollution insanity
URL: https://github.com/apache/lucene/issues/14385





Re: [I] gradle build leaks tons of gradle-worker-classpath* files in tmpdir [LUCENE-9175] [lucene]

2025-03-21 Thread via GitHub


dweiss closed issue #10215: gradle build leaks tons of gradle-worker-classpath* 
files in tmpdir [LUCENE-9175]
URL: https://github.com/apache/lucene/issues/10215





Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-03-21 Thread via GitHub


rmuir merged PR #14381:
URL: https://github.com/apache/lucene/pull/14381





Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]

2025-03-21 Thread via GitHub


dweiss commented on code in PR #14388:
URL: https://github.com/apache/lucene/pull/14388#discussion_r2008140343


##
lucene/expressions/src/generated/checksums/generateAntlr.json:
##
@@ -1,7 +1,8 @@
 {
     "lucene/expressions/src/java/org/apache/lucene/expressions/js/Javascript.g4": "818e89aae0b6c7601051802013898c128fe7c1ba",
     "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptBaseVisitor.java": "6965abdb8b069aaceac1ce4f32ed965b194f3a25",
-    "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "b8d6b259ebbfce09a5379a1a2aa4c1ddd4e378eb",
-    "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "7a3a7b9de17f4a8d41ef342312eae5c55e483e08",
-    "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e"
+    "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "6508dc5008e96a1ad28c967a3401407ba83f140b",
+    "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "ba6d0c00af113f115fc7a1f165da7726afb2e8c5",
+    "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e",
+    "property:antlr-version": "4.13.2"

Review Comment:
   Yeah - we could add full coordinates but I don't think this matters. The 
version should be fine. I'll try to do an overhaul of the build anyway and 
maybe consolidate this.






Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub


dweiss commented on PR #14384:
URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743758629

   I mean the entire structure of tasks that are used in regenerate. It's 
complex. I remember I couldn't do it in any easier way before - maybe something 
has changed that would allow it to be simpler (I doubt it, though).





Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]

2025-03-21 Thread via GitHub


rmuir commented on PR #14381:
URL: https://github.com/apache/lucene/pull/14381#issuecomment-2743798573

   After fixing the Turkish handling, here's the (correct) automaton for 
`/[a-z]/`. The only special cases are long-s and the Kelvin sign, as you'd expect:
   
   ![graphviz 
(6)](https://github.com/user-attachments/assets/06fcd30b-c578-4979-bf59-7b55d5a2560d)
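
To see where long-s (U+017F) and the Kelvin sign (U+212A) come from, here is a rough sketch that uses only the JDK's simple case mappings (`Character.toLowerCase`/`Character.toUpperCase`) and a brute-force scan over all code points. This is an assumption-laden illustration, not how the PR does it: the class `CaselessRangeSketch` and its methods are hypothetical, and the JDK mappings are not identical to Unicode simple case folding, so the scan may also pick up code points such as the dotted and dotless Turkish i.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: finds case variants of a range via the JDK's simple case mappings. */
final class CaselessRangeSketch {

  /** Returns, in ascending order, every code point that case-maps into [start, end]. */
  static int[] caselessVariants(int start, int end) {
    List<Integer> out = new ArrayList<>();
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      int lower = Character.toLowerCase(cp);
      int viaUpper = Character.toLowerCase(Character.toUpperCase(cp));
      if ((lower >= start && lower <= end) || (viaUpper >= start && viaUpper <= end)) {
        out.add(cp);
      }
    }
    return out.stream().mapToInt(Integer::intValue).toArray();
  }

  /** Compresses a sorted array of code points into inclusive {lo, hi} ranges. */
  static List<int[]> toRanges(int[] sorted) {
    List<int[]> ranges = new ArrayList<>();
    for (int cp : sorted) {
      int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
      if (last != null && cp == last[1] + 1) {
        last[1] = cp; // extend the current run
      } else {
        ranges.add(new int[] {cp, cp}); // start a new run
      }
    }
    return ranges;
  }
}
```

Running `toRanges(caselessVariants('a', 'z'))` shows why the accumulated code points compress so well: `A-Z` and `a-z` each collapse into one range, leaving only a handful of isolated extras.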
   





Re: [I] Init HNSW merge with graph containing deleted documents [lucene]

2025-03-21 Thread via GitHub


benwtrent commented on issue #12533:
URL: https://github.com/apache/lucene/issues/12533#issuecomment-2743826644

   I think, in addition to the recent merge improvements 
(https://github.com/apache/lucene/pull/14331), being able to "fix up" the 
individual graphs that have deletions and THEN doing the merge might yield 
significant speed improvements. 
   
   Additionally, folks might actually only want to expunge deletes. In this 
case, rewriting the entire graph is incredibly wasteful, and we should instead 
"fix up" the graphs by adjusting the deleted nodes directly.





Re: [PR] upgrade icu dependency from 74.2 -> 77.1 [lucene]

2025-03-21 Thread via GitHub


rmuir commented on PR #14386:
URL: https://github.com/apache/lucene/pull/14386#issuecomment-2743954863

   @dweiss I know you dislike the complexity, but `gradlew regenerate` 
really saves a metric ton of human time and prevents mistakes for updates like 
these.





Re: [I] Case insensitive regex query with character range [lucene]

2025-03-21 Thread via GitHub


rmuir closed issue #14378: Case insensitive regex query with character range
URL: https://github.com/apache/lucene/issues/14378





Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]

2025-03-21 Thread via GitHub


rmuir merged PR #14384:
URL: https://github.com/apache/lucene/pull/14384





Re: [I] Address gradle temp file pollution insanity [lucene]

2025-03-21 Thread via GitHub


dweiss commented on issue #14385:
URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743756362

   It's this commit that moved the temp folder from java.io.tmpdir, which we 
redirected and cleaned up.
   
   
https://github.com/gradle/gradle/commit/8c2f6b7db50ab071a289fb5c4cbb9b2125609105#diff-a89e26b86bb25dd2df7ef61416478f3b9034cc4625633830a5413a5c5d7124f6
   





Re: [PR] BlockJoinBulkScorer could check for parent deletions (not children) [lucene]

2025-03-21 Thread via GitHub


jimczi closed pull request #14067: BlockJoinBulkScorer could check for parent 
deletions (not children)
URL: https://github.com/apache/lucene/pull/14067





Re: [PR] upgrade icu dependency from 74.2 -> 77.1 [lucene]

2025-03-21 Thread via GitHub


dweiss commented on PR #14386:
URL: https://github.com/apache/lucene/pull/14386#issuecomment-2743977198

   I know, I know. I don't think we should remove it - I just hope it can be 
implemented in a less hairy way. 





Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-03-21 Thread via GitHub


tteofili commented on code in PR #14094:
URL: https://github.com/apache/lucene/pull/14094#discussion_r2007923461


##
lucene/core/src/java/org/apache/lucene/search/HnswQueueSaturationCollector.java:
##
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * A {@link HnswKnnCollector} that early exits when nearest neighbor queue keeps saturating beyond a
+ * 'patience' parameter. This records the rate of collection of new nearest neighbors in the {@code
+ * delegate} KnnCollector queue, at each HNSW node candidate visit. Once it saturates for a number
+ * of consecutive node visits (e.g., the patience parameter), this early terminates.
+ *
+ * @lucene.experimental
+ */
+public class HnswQueueSaturationCollector extends HnswKnnCollector {
+
+  private final KnnCollector delegate;
+  private final double saturationThreshold;
+  private final int patience;
+  private boolean patienceFinished;
+  private int countSaturated;
+  private int previousQueueSize;
+  private int currentQueueSize;
+
+  HnswQueueSaturationCollector(KnnCollector delegate, double saturationThreshold, int patience) {
+    super(delegate);
+    this.delegate = delegate;
+    this.previousQueueSize = 0;
+    this.currentQueueSize = 0;
+    this.countSaturated = 0;
+    this.patienceFinished = false;
+    this.saturationThreshold = saturationThreshold;
+    this.patience = patience;
+  }
+
+  @Override
+  public boolean earlyTerminated() {
+    return delegate.earlyTerminated() || patienceFinished;
+  }
+
+  @Override
+  public boolean collect(int docId, float similarity) {
+    boolean collect = delegate.collect(docId, similarity);
+    if (collect) {
+      currentQueueSize++;
+    }
+    return collect;
+  }
+
+  @Override
+  public float minCompetitiveSimilarity() {
+    return delegate.minCompetitiveSimilarity();
+  }
+
+  @Override
+  public TopDocs topDocs() {
+    TopDocs topDocs;
+    if (patienceFinished && delegate.earlyTerminated() == false) {
+      TopDocs delegateDocs = delegate.topDocs();
+      TotalHits totalHits =
+          new TotalHits(delegateDocs.totalHits.value(), TotalHits.Relation.EQUAL_TO);
+      topDocs = new TopDocs(totalHits, delegateDocs.scoreDocs);
+    } else {
+      topDocs = delegate.topDocs();
+    }
+    return topDocs;
+  }
+
+  @Override
+  public void nextCandidate() {

Review Comment:
   I really like this idea Ben, I'll see if I can make up something reasonable 
for that ;) 
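
Since the quoted hunk stops at `nextCandidate()`, here is a standalone sketch of how patience-based saturation tracking could work, going only by the Javadoc above. It is an assumption, not the PR's actual method body: the class `SaturationTrackerSketch`, the method `onCollected`, and the particular saturation ratio (previous queue size divided by current queue size) are all hypothetical choices for illustration.

```java
/** Hypothetical sketch of patience-based early termination; not the PR's implementation. */
final class SaturationTrackerSketch {
  private final double saturationThreshold;
  private final int patience;
  private int previousQueueSize;
  private int currentQueueSize;
  private int countSaturated;
  private boolean patienceFinished;

  SaturationTrackerSketch(double saturationThreshold, int patience) {
    this.saturationThreshold = saturationThreshold;
    this.patience = patience;
  }

  /** Call whenever a hit is accepted into the result queue. */
  void onCollected() {
    currentQueueSize++;
  }

  /** Call once per visited candidate node. */
  void nextCandidate() {
    // Assumed measure: the share of collected hits that already existed at the
    // previous visit. A value of 1.0 means this visit contributed nothing new.
    double saturation =
        currentQueueSize == 0 ? 0.0 : (double) previousQueueSize / currentQueueSize;
    previousQueueSize = currentQueueSize;

    if (saturation >= saturationThreshold) {
      countSaturated++;
    } else {
      countSaturated = 0; // only consecutive saturated visits count
    }
    if (countSaturated > patience) {
      patienceFinished = true;
    }
  }

  boolean earlyTerminated() {
    return patienceFinished;
  }
}
```

Resetting `countSaturated` on any productive visit is what makes the patience parameter apply to consecutive saturated visits, as the Javadoc describes.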






Re: [PR] Clean up junk from gradle's user home (~/.gradle/.tmp). #14385 [lucene]

2025-03-21 Thread via GitHub


rmuir commented on PR #14387:
URL: https://github.com/apache/lucene/pull/14387#issuecomment-2743932179

   `./gradlew -XX:UseDweissTempFileGC`





Re: [I] Handling concurrent search in QueryProfiler [lucene]

2025-03-21 Thread via GitHub


jainankitk commented on issue #14375:
URL: https://github.com/apache/lucene/issues/14375#issuecomment-2744045551

   @jpountz - Can you assign this issue to me? I don't have permissions to do 
that myself





Re: [PR] Optimize ParallelLeafReader to improve term vector fetching efficienc [lucene]

2025-03-21 Thread via GitHub


vigyasharma commented on code in PR #14373:
URL: https://github.com/apache/lucene/pull/14373#discussion_r2008678599


##
lucene/core/src/java/org/apache/lucene/index/ParallelLeafReader.java:
##
@@ -348,15 +348,24 @@ public void prefetch(int docID) throws IOException {
   @Override
   public Fields get(int docID) throws IOException {
 ParallelFields fields = null;
-for (Map.Entry ent : tvFieldToReader.entrySet()) {
-  String fieldName = ent.getKey();
-  TermVectors termVectors = readerToTermVectors.get(ent.getValue());
-  Terms vector = termVectors.get(docID, fieldName);
-  if (vector != null) {
-if (fields == null) {
-  fields = new ParallelFields();
-}
-fields.addField(fieldName, vector);
+
+// Step 2: Fetch all term vectors once per reader
+for (Map.Entry entry : 
readerToTermVectors.entrySet()) {
+  TermVectors termVectors = entry.getValue();
+  Fields docFields = termVectors.get(docID); // Fetch all fields at 
once
+
+  if (docFields != null) {
+  if (fields == null) {
+  fields = new ParallelFields();
+  }
+
+  // Step 3: Aggregate only required fields
+  for (String fieldName : docFields) {
+  Terms vector = docFields.terms(fieldName);
+  if (vector != null) {

Review Comment:
   When would this be null? Since we're going through fields returned by 
`termVectors.get(docId)`, the field should exist and have terms.






Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]

2025-03-21 Thread via GitHub


dweiss commented on PR #14388:
URL: https://github.com/apache/lucene/pull/14388#issuecomment-2744155828

   > This one didn't magically work like ICU
   
   I've pushed a commit that should do the trick. The antlr version wasn't in 
the inputs so the build didn't know it'd been updated.





Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]

2025-03-21 Thread via GitHub


dweiss commented on code in PR #14388:
URL: https://github.com/apache/lucene/pull/14388#discussion_r2008129712


##
lucene/expressions/src/generated/checksums/generateAntlr.json:
##
@@ -1,7 +1,13 @@
 {
+    "../../../../../.gradle/caches/modules-2/files-2.1/com.ibm.icu/icu4j/72.1/bc9057df4b5efddf7f6d1880bf7f3399f4ce5633/icu4j-72.1.jar": "bc9057df4b5efddf7f6d1880bf7f3399f4ce5633",
+    "../../../../../.gradle/caches/modules-2/files-2.1/org.abego.treelayout/org.abego.treelayout.core/1.0.3/457216e8e6578099ae63667bb1e4439235892028/org.abego.treelayout.core-1.0.3.jar": "457216e8e6578099ae63667bb1e4439235892028",
+    "../../../../../.gradle/caches/modules-2/files-2.1/org.antlr/ST4/4.3.4/bf68d049dd4e6e104055a79ac3bf9e6307d29258/ST4-4.3.4.jar": "bf68d049dd4e6e104055a79ac3bf9e6307d29258",
+    "../../../../../.gradle/caches/modules-2/files-2.1/org.antlr/antlr-runtime/3.5.3/9011fb189c5ed6d99e5f3322514848d1ec1e1416/antlr-runtime-3.5.3.jar": "9011fb189c5ed6d99e5f3322514848d1ec1e1416",
+    "../../../../../.gradle/caches/modules-2/files-2.1/org.antlr/antlr4-runtime/4.13.2/fc3db6d844df652a3d5db31c87fa12757f13691d/antlr4-runtime-4.13.2.jar": "fc3db6d844df652a3d5db31c87fa12757f13691d",
+    "../../../../../.gradle/caches/modules-2/files-2.1/org.antlr/antlr4/4.13.2/a2bc0d399506a7297568baee188b481727d45d3b/antlr4-4.13.2.jar": "a2bc0d399506a7297568baee188b481727d45d3b",

Review Comment:
   ok, we can't have it done this way, sorry. I'll revert.






Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]

2025-03-21 Thread via GitHub


rmuir commented on PR #14389:
URL: https://github.com/apache/lucene/pull/14389#issuecomment-2744704384

   I will straighten out the build. This one is kinda draftish as it needs more 
tests etc.; I just wanted to toss out the idea. 
   
   If it is autogenerated we can easily maintain some cohesive story rather 
than crazy Unicode puzzles. 
   
   It is tempting to want full case folding as that's a benefit to e.g. German, 
but we need to take this step by step. Perf gets more complex, etc. Simple case 
folding is already an improvement over lowercasing. 
   
   The goal here is to not regress indexing performance if users switch from 
lowercase to simple case folding.

