Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]
msfroh commented on PR #14389: URL: https://github.com/apache/lucene/pull/14389#issuecomment-2744397587 Awesome! Can I go ahead and use this for https://github.com/apache/lucene/pull/14350 once it's merged? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]
rmuir opened a new pull request, #14381: URL: https://github.com/apache/lucene/pull/14381 Add optional flag to support case-insensitive ranges. A minimal DFA is always created. This works with Unicode but may have a performance cost. Each codepoint in the range must be iterated, and any alternatives added to a set. This can be large if the range spans much of Unicode. CPU and memory costs are contained within a single function enabled by the optional flag. For example, when matching a caseless `/[a-z]/`, 56 codepoints will be accumulated into an `int[]`, which is then compressed to 5 ranges before adding to the parse tree. Closes #14378 Here's what the resulting `/[a-z]/` automaton looks like, in case you are curious: [image not preserved in the archive]
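The accumulate-then-compress step described above can be sketched with the JDK alone. This is only an illustration, not the code from the PR: the class and method names are hypothetical, and it relies on `Character.toUpperCase`/`Character.toLowerCase` rather than Lucene's case-folding table, so for `[a-z]` it finds fewer alternatives than the 56 codepoints mentioned in the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: expand a codepoint range with its simple case variants,
// then compress the sorted codepoints back into contiguous [start, end] ranges.
final class CaselessRangeSketch {

  static List<int[]> expand(int start, int end) {
    if (start > end) {
      throw new IllegalArgumentException("invalid range: " + start + " > " + end);
    }
    // Accumulate every codepoint in the range plus its JDK case variants, sorted and deduplicated.
    TreeSet<Integer> codepoints = new TreeSet<>();
    for (int cp = start; cp <= end; cp++) {
      codepoints.add(cp);
      codepoints.add(Character.toUpperCase(cp));
      codepoints.add(Character.toLowerCase(cp));
    }
    // Compress: extend the last range while codepoints stay adjacent, else open a new range.
    List<int[]> ranges = new ArrayList<>();
    for (int cp : codepoints) {
      int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
      if (last != null && last[1] + 1 == cp) {
        last[1] = cp;
      } else {
        ranges.add(new int[] {cp, cp});
      }
    }
    return ranges;
  }

  public static void main(String[] args) {
    for (int[] r : expand('a', 'z')) {
      System.out.println(Character.toString(r[0]) + "-" + Character.toString(r[1]));
    }
  }
}
```

The cost is proportional to the width of the range, which is why the description warns about ranges that span much of Unicode.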
Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]
rmuir commented on code in PR #14381: URL: https://github.com/apache/lucene/pull/14381#discussion_r2007006500

## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ##

@@ -778,6 +786,53 @@ private int[] toCaseInsensitiveChar(int codepoint) {
     }
   }

+  /**
+   * Expands range to include case-insensitive matches.
+   *
+   * This is expensive: case-insensitive range involves iterating over the range space, adding
+   * alternatives. Jump on the grenade here, contain CPU and memory explosion just to this method
+   * activated by optional flag.
+   */
+  private void expandCaseInsensitiveRange(
+      int start, int end, List rangeStarts, List rangeEnds) {
+    if (start > end)
+      throw new IllegalArgumentException(
+          "invalid range: from (" + start + ") cannot be > to (" + end + ")");
+
+    // contain the explosion of transitions by using a throwaway state
+    Automaton scratch = new Automaton();
+    int state = scratch.createState();
+
+    // iterate over range, adding codepoint and any alternatives as transitions
+    for (int i = start; i <= end; i++) {
+      scratch.addTransition(state, state, i);
+      int[] altCodePoints = CaseFolding.lookupAlternates(i);
+      if (altCodePoints != null) {
+        for (int alt : altCodePoints) {
+          scratch.addTransition(state, state, alt);
+        }
+      } else {
+        int altCase =
+            Character.isLowerCase(i) ? Character.toUpperCase(i) : Character.toLowerCase(i);
+        if (altCase != i) {
+          scratch.addTransition(state, state, altCase);
+        }
+      }
+    }

Review Comment: good call. this is better than returning mutable arrays that could get messed up by bugs, or creating gazillions of arrays.
Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]
alessandrobenedetti commented on code in PR #14173: URL: https://github.com/apache/lucene/pull/14173#discussion_r2007476642 ## lucene/core/src/java/org/apache/lucene/util/hnsw/UpdatableScoreHeap.java: ## Review Comment: For example, what are the benefits of this in comparison to the changes I proposed: lucene/core/src/java/org/apache/lucene/util/LongHeap.java in https://github.com/apache/lucene/pull/12314/files?
Re: [PR] Adjust equivalent min similarity HNSW exploration logic [lucene]
benwtrent merged PR #14366: URL: https://github.com/apache/lucene/pull/14366
Re: [I] TestKnnGraph.testMultiThreadedSearch random test failure [lucene]
benwtrent closed issue #14327: TestKnnGraph.testMultiThreadedSearch random test failure URL: https://github.com/apache/lucene/issues/14327
[I] Reduce memory usage when merging bkd trees [lucene]
iverase opened a new issue, #14382: URL: https://github.com/apache/lucene/issues/14382 When building BKD trees, we hold two arrays in memory whose sizes grow linearly with the number of leaf nodes. One of the arrays contains the pointer to the start of each leaf node, and the other contains the split values. The number of leaf nodes does not grow with the number of documents but with the number of values; therefore, in the case of multi-values, those arrays can grow quite big. The situation is particularly inefficient for the `OneDimensionBKDWriter`, where we are using a List to hold the split values. I wonder if we can use more efficient data structures to lower the heap usage. For example, maybe we can use the `FixedLengthBytesRefArray` to hold split values, or use some packing algorithm to hold the leaf pointers.
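To make the first suggestion concrete, here is a minimal sketch of storing fixed-length split values back to back in one growable `byte[]` instead of one heap object per value. The class name and API are hypothetical and written only for illustration; this is not Lucene's `FixedLengthBytesRefArray`, and it ignores arrays larger than 2 GB.

```java
import java.util.Arrays;

// Hypothetical sketch: fixed-length values packed into a single byte[].
// Compared with a List of byte[] this avoids one object header and one
// reference per split value.
final class FlatSplitValuesSketch {
  private final int bytesPerValue;
  private byte[] data;
  private int count;

  FlatSplitValuesSketch(int bytesPerValue) {
    if (bytesPerValue <= 0) {
      throw new IllegalArgumentException("bytesPerValue must be positive");
    }
    this.bytesPerValue = bytesPerValue;
    this.data = new byte[bytesPerValue * 16];
  }

  /** Appends one value; every value must have exactly bytesPerValue bytes. */
  void add(byte[] value) {
    if (value.length != bytesPerValue) {
      throw new IllegalArgumentException("expected " + bytesPerValue + " bytes");
    }
    int offset = count * bytesPerValue;
    if (offset + bytesPerValue > data.length) {
      data = Arrays.copyOf(data, data.length * 2); // amortized doubling
    }
    System.arraycopy(value, 0, data, offset, bytesPerValue);
    count++;
  }

  /** Returns a copy of the value stored at the given index. */
  byte[] get(int index) {
    if (index < 0 || index >= count) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + count);
    }
    int offset = index * bytesPerValue;
    return Arrays.copyOfRange(data, offset, offset + bytesPerValue);
  }

  int size() {
    return count;
  }
}
```

The leaf pointers are a separate matter: since they increase monotonically, some form of delta or monotonic packing could apply, but that is not shown here.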
Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]
benwtrent commented on code in PR #14094: URL: https://github.com/apache/lucene/pull/14094#discussion_r2007365351

## lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java: ##

@@ -50,4 +51,11 @@ public TopDocs topDocs() {
             : TotalHits.Relation.EQUAL_TO),
         td.scoreDocs);
   }
+
+  @Override
+  public void nextCandidate() {
+    if (this.collector instanceof HnswKnnCollector) {
+      ((HnswKnnCollector) this.collector).nextCandidate();
+    }

Review Comment:
```suggestion
    if (this.collector instanceof HnswKnnCollector hnswCollector) {
      hnswCollector.nextCandidate();
    }
```

## lucene/core/src/java/org/apache/lucene/search/HnswQueueSaturationCollector.java: ##

@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * A {@link HnswKnnCollector} that early exits when nearest neighbor queue keeps saturating beyond a
+ * 'patience' parameter. This records the rate of collection of new nearest neighbors in the {@code
+ * delegate} KnnCollector queue, at each HNSW node candidate visit. Once it saturates for a number
+ * of consecutive node visits (e.g., the patience parameter), this early terminates.
+ *
+ * @lucene.experimental
+ */
+public class HnswQueueSaturationCollector extends HnswKnnCollector {
+
+  private final KnnCollector delegate;
+  private final double saturationThreshold;
+  private final int patience;
+  private boolean patienceFinished;
+  private int countSaturated;
+  private int previousQueueSize;
+  private int currentQueueSize;
+
+  HnswQueueSaturationCollector(KnnCollector delegate, double saturationThreshold, int patience) {
+    super(delegate);
+    this.delegate = delegate;
+    this.previousQueueSize = 0;
+    this.currentQueueSize = 0;
+    this.countSaturated = 0;
+    this.patienceFinished = false;
+    this.saturationThreshold = saturationThreshold;
+    this.patience = patience;
+  }
+
+  @Override
+  public boolean earlyTerminated() {
+    return delegate.earlyTerminated() || patienceFinished;
+  }
+
+  @Override
+  public boolean collect(int docId, float similarity) {
+    boolean collect = delegate.collect(docId, similarity);
+    if (collect) {
+      currentQueueSize++;
+    }
+    return collect;
+  }
+
+  @Override
+  public float minCompetitiveSimilarity() {
+    return delegate.minCompetitiveSimilarity();
+  }

Review Comment: since we are a decorator, do we need this?

## lucene/core/src/java/org/apache/lucene/search/HnswKnnCollector.java: ##

@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+/**
+ * {@link KnnCollector} that exposes methods to hook into specific parts of the HNSW algorithm.
+ *
+ * @lucene.experimental
+ */
+public abstract class HnswKnnCollector extends KnnCollector.Decorator {

Review Comment: Ah, it is a little frustrating as we already have an "HNSWStrategy" and now we have an "HNSWCollector". Could we utilize an HNSWStrategy? Or make `nextCandidate` a more general API? My thought on the strategy would be that the graph searcher to indicate through the strategy object when the next group of vectors will be searched and the strategy would have a reference to the collector to which it can forward the request. Of course, this still requires a new `HnswQueueSaturati
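The quoted diff does not include the body of `nextCandidate()` in `HnswQueueSaturationCollector`, so the following is only a guess at one plausible patience rule, written as a standalone class with no Lucene dependencies. The field names follow the diff; the class name, `onCollected()`, the saturation formula, and the `>=` comparison against `patience` are all assumptions.

```java
// Hypothetical sketch of patience-based early termination. After each candidate
// visit, "saturation" is taken to be the share of collected results that already
// existed at the previous visit; enough consecutive saturated visits stop the search.
final class SaturationPatienceSketch {
  private final double saturationThreshold;
  private final int patience;
  private int previousQueueSize;
  private int currentQueueSize;
  private int countSaturated;
  private boolean patienceFinished;

  SaturationPatienceSketch(double saturationThreshold, int patience) {
    this.saturationThreshold = saturationThreshold;
    this.patience = patience;
  }

  /** Call once for every result the delegate queue accepts. */
  void onCollected() {
    currentQueueSize++;
  }

  /** Call once per HNSW candidate visit. */
  void nextCandidate() {
    double saturation =
        currentQueueSize == 0 ? 0.0 : (double) previousQueueSize / currentQueueSize;
    previousQueueSize = currentQueueSize;
    if (saturation >= saturationThreshold) {
      countSaturated++; // almost nothing new was collected during this visit
    } else {
      countSaturated = 0; // progress was made, so the streak restarts
    }
    if (countSaturated >= patience) {
      patienceFinished = true;
    }
  }

  boolean earlyTerminated() {
    return patienceFinished;
  }
}
```

With a threshold of 0.95 and a patience of 2, the sketch terminates after two consecutive visits that each add fewer than roughly 5% new results.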
Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]
rmuir commented on PR #14384: URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743662885 This one is pretty easy to understand, the `CaseFolding` class now just gives you `UnicodeSet(ch).closeOver(UnicodeSet.SIMPLE_CASE_INSENSITIVE)` without requiring that you have ICU. The generation depends strictly upon ICU version (which I will separately upgrade for unicode 16 now that java 24 has it).
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2007802395

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ##

@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix point to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block consists entirely of
+   *     pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing
+   *     a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+    BUILDING,
+    SAVED,
+    DESTROYED
+  }
+
+  private static class Node {
+
+    // The utf8 digit that leads to this Node, 0 for root node
+    private final int label;
+    // The children listed in order by their utf8 label
+    private final LinkedList children;
+    // The output of this node.
+    private Output output;
+
+    // Vars used during saving:
+
+    // The file pointer point to where the node saved. -1 means the node has not been saved.
+    private long fp = -1;
+    // The iterator whose next() point to the first child has not been saved.
+    private Iterator childrenIterator;
+
+    Node(int label, Output output, LinkedList children) {
+      this.label = label;
+      this.output = output;
+      this.children = children;
+    }
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+    return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+    if (k.length == 0) {
+      root.output = v;
+      return;
+    }
+    Node parent = root;
+    for (int i = 0; i < k.length; i++) {
+      int b = k.bytes[i + k.offset] & 0xFF;
+      Output output = i == k.length - 1 ? v : null;
+      Node node = new Node(b, output, new LinkedList<>());
+      parent.children.add(node);
+      parent = node;
+    }
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not
+   * have key that already exists in this one, otherwise a {@link IllegalArgumentException } will be
+   * thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+    if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+      throw new IllegalStateException("tries should be unsaved");
+    }
+    // Use a simple stack to avoid recursion.
+    Deque stack = new ArrayDeque<>();
+    stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+    while (!stack.isEmpty()) {
+      stack.pop().run();
+    }
+    trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque stack) {
+    assert n.label == add.label;
+    if (add.output != null) {
+      if (n.output != null) {
+
Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]
rmuir commented on PR #14384: URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743717079 It was easy because @uschindler already created a similar groovy script before.
Re: [I] Case insensitive regex query with character range [lucene]
rmuir closed issue #14378: Case insensitive regex query with character range URL: https://github.com/apache/lucene/issues/14378
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2744342290 Maybe this one helps the issue: https://github.com/apache/lucene/pull/14389
Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]
john-wagster commented on PR #14389: URL: https://github.com/apache/lucene/pull/14389#issuecomment-2744360276 This is great; helps me progress some of the regex work in ES for why I started that CaseFolding work. Thanks for iterating on this @rmuir.
Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]
alessandrobenedetti commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2743148001 Catching up on this and trying to understand how far we are now from my original idea and implementation: https://github.com/apache/lucene/pull/12314 Obviously, my code is completely outdated, but reading across this PR and https://github.com/apache/lucene/pull/13525, it seems we are converging again to what I originally proposed. I'll work on this for the next couple of weeks, so I should be able to add some comments and additional opinions. Main concern is still related to ordinals to become long as far as I can see :)
Re: [I] Handling concurrent search in QueryProfiler [lucene]
jpountz commented on issue #14375: URL: https://github.com/apache/lucene/issues/14375#issuecomment-271819 Done!
Re: [I] Handle degenerate case where all HNSW search candidates are filtered [lucene]
benwtrent commented on issue #11787: URL: https://github.com/apache/lucene/issues/11787#issuecomment-2743830018 I think this has been fixed with all our HNSW filtering fixes:
- we drop to brute force if we explore too much
- we bypass the graph if the filter passes <= `k` docs
- We have implemented improved filtering search logic to aid with speed.

We can comfortably close this issue.
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
jpountz commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-273990 Hurray!
- https://benchmarks.mikemccandless.com/TermDayOfYearSort.html
- https://benchmarks.mikemccandless.com/TermDTSort.html
Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]
rmuir commented on PR #14384: URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743831323 There was something about gradle itself that was upset about dependencies wrt generation tasks, if i recall... cycle detection or something was complaining about it.
Re: [I] Handle degenerate case where all HNSW search candidates are filtered [lucene]
benwtrent closed issue #11787: Handle degenerate case where all HNSW search candidates are filtered URL: https://github.com/apache/lucene/issues/11787
Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]
rmuir commented on PR #14384: URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743736828 I will followup with an ICU upgrade PR to this one. I don't expect that this file will change except for the version in the comment though.
[PR] Implement #docIDRunEnd() on PostingsEnum. [lucene]
jpountz opened a new pull request, #14390: URL: https://github.com/apache/lucene/pull/14390 This implements `BlockPostingsEnum#docIDRunEnd()` by comparing the delta between doc IDs and between doc counts on the various skip levels.
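The delta comparison can be illustrated as follows. This is a sketch under assumptions, not the code in the PR: the class, method signature, and array layout are hypothetical, and each skip level is assumed to expose the doc ID and ordinal (position in the postings list) of the last doc it covers. Because doc IDs are strictly increasing, the doc ID delta to a skip point equals the ordinal delta only when every doc in between is consecutive.

```java
// Hypothetical sketch: find the exclusive end of the run of consecutive doc IDs
// starting at the current doc by comparing doc ID deltas with doc count deltas.
final class DocIdRunSketch {

  /**
   * @param currentDoc doc ID the iterator is positioned on
   * @param currentOrd ordinal of currentDoc within the postings list
   * @param skipLastDoc per level, doc ID of the last doc covered (higher level = wider span)
   * @param skipLastOrd per level, ordinal of that last doc
   */
  static int docIDRunEnd(int currentDoc, int currentOrd, int[] skipLastDoc, int[] skipLastOrd) {
    // Try the widest skip level first so the longest provable run wins.
    for (int level = skipLastDoc.length - 1; level >= 0; level--) {
      int docDelta = skipLastDoc[level] - currentDoc;
      int ordDelta = skipLastOrd[level] - currentOrd;
      if (ordDelta >= 0 && docDelta == ordDelta) {
        return skipLastDoc[level] + 1; // every doc up to the skip point is consecutive
      }
    }
    return currentDoc + 1; // nothing proven beyond the current doc
  }
}
```

For postings `5, 6, 7, 8, 20` with a level covering up to doc 8 (ordinal 3) and a wider level covering up to doc 20 (ordinal 4), a call positioned on doc 5 returns 9.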
Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]
vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2744562872 Thanks for looking into this PR @alessandrobenedetti, this is the latest iteration on multi-vector support. It does build on the same central idea of assigning a unique ordinal to each vector and mapping multiple ordinals to a single doc. I tried a few other approaches, but this one seemed cleanest. I think the key difference from #12314 is the changes to store metadata that let us map multiple ordinals to a single doc. This is implemented in `MultiVectorOrdConfiguration` using `DirectMonotonicWriter/Reader`. For every doc, I maintain the ordinal of its first vector (`baseOrdinal`) along with the number of vectors in the doc, and use these to do the `ordToDoc` mapping for vectors. I didn't fully understand how this was done in your original PR, specifically how it mapped an ordinal back to its docId, given we can have a variable number of vectors per doc. Maybe I missed something. If you had a simpler implementation, I'm happy to circle back to it. I also added an `allVectorValues()` API to `Byte|FloatVectorValues`, which I think will help during query time. Other than this, the changes are mostly around integrating multi-vector support and will likely have a lot of overlap.
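As a rough illustration of the `baseOrdinal` idea, the mapping from a vector ordinal back to its doc can be a binary search over the per-doc base ordinals. The class below is hypothetical and uses plain `int[]` arrays; the PR describes storing this metadata with `DirectMonotonicWriter/Reader`, which is not modeled here, and every doc is assumed to have at least one vector.

```java
import java.util.Arrays;

// Hypothetical sketch: each doc owns a contiguous block of vector ordinals that
// starts at its base ordinal; ordToDoc finds the block containing an ordinal.
final class OrdToDocSketch {
  private final int[] docIds;      // doc IDs in increasing order
  private final int[] baseOrdinal; // baseOrdinal[i] = ordinal of the first vector of docIds[i]
  private final int totalVectors;

  OrdToDocSketch(int[] docIds, int[] vectorCounts) {
    if (docIds.length != vectorCounts.length) {
      throw new IllegalArgumentException("docIds and vectorCounts must have equal length");
    }
    this.docIds = docIds.clone();
    this.baseOrdinal = new int[docIds.length];
    int next = 0;
    for (int i = 0; i < docIds.length; i++) {
      if (vectorCounts[i] < 1) {
        throw new IllegalArgumentException("every doc needs at least one vector");
      }
      baseOrdinal[i] = next; // running prefix sum of vector counts
      next += vectorCounts[i];
    }
    this.totalVectors = next;
  }

  int ordToDoc(int ord) {
    if (ord < 0 || ord >= totalVectors) {
      throw new IndexOutOfBoundsException("ordinal " + ord);
    }
    int idx = Arrays.binarySearch(baseOrdinal, ord);
    if (idx < 0) {
      idx = -idx - 2; // insertion point minus one: the last base ordinal <= ord
    }
    return docIds[idx];
  }
}
```

Lookup is logarithmic in the number of docs, and the number of vectors in a doc can be recovered as the difference between consecutive base ordinals.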
Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]
vigyasharma commented on code in PR #14173: URL: https://github.com/apache/lucene/pull/14173#discussion_r2008411867 ## lucene/core/src/java/org/apache/lucene/util/hnsw/UpdatableScoreHeap.java: ## Review Comment: I'd like to keep the logic to update scores for already ingested docs encapsulated within the heap. By returning the array index within the heap (the LongHeap changes in #12314), we shift this responsibility to consumers, like the [NeighborQueue changes](https://github.com/apache/lucene/blob/1523ee796a6d35a7d92532590458b2a2d8dd9e4b/lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java#L99-L113), which can be trappy and cause repeated code.
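The encapsulation argument can be shown with a small min-heap that tracks each id's position internally, so callers never handle heap indices. This is not the `UpdatableScoreHeap` from the PR: the class name, the `insertOrUpdate` method, and the policy of keeping only the higher score for a repeated id are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a min-heap on score with an internal id -> position map.
// Re-adding a known id updates its score in place instead of adding a duplicate.
final class UpdatableMinHeapSketch {
  private int[] ids = new int[16];
  private float[] scores = new float[16];
  private int size;
  private final Map<Integer, Integer> position = new HashMap<>();

  /** Adds the id, or raises its score if it is already present. Returns true if the heap changed. */
  boolean insertOrUpdate(int id, float score) {
    Integer pos = position.get(id);
    if (pos == null) {
      if (size == ids.length) {
        ids = Arrays.copyOf(ids, size * 2);
        scores = Arrays.copyOf(scores, size * 2);
      }
      ids[size] = id;
      scores[size] = score;
      position.put(id, size);
      size++;
      siftUp(size - 1);
      return true;
    }
    if (score <= scores[pos]) {
      return false; // assumed policy: keep the best score seen for this id
    }
    scores[pos] = score;
    siftDown(pos); // a larger score can only move down in a min-heap
    return true;
  }

  int topId() { return ids[0]; }
  float topScore() { return scores[0]; }
  int size() { return size; }

  private void siftUp(int i) {
    while (i > 0) {
      int parent = (i - 1) >>> 1;
      if (scores[parent] <= scores[i]) break;
      swap(i, parent);
      i = parent;
    }
  }

  private void siftDown(int i) {
    while (true) {
      int left = 2 * i + 1, right = left + 1, smallest = i;
      if (left < size && scores[left] < scores[smallest]) smallest = left;
      if (right < size && scores[right] < scores[smallest]) smallest = right;
      if (smallest == i) return;
      swap(i, smallest);
      i = smallest;
    }
  }

  private void swap(int a, int b) {
    int id = ids[a]; ids[a] = ids[b]; ids[b] = id;
    float s = scores[a]; scores[a] = scores[b]; scores[b] = s;
    position.put(ids[a], a); // keep the position map in sync with every move
    position.put(ids[b], b);
  }
}
```

The boxed `HashMap` is used only to keep the sketch short; a primitive map would be the natural choice where allocation matters.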
Re: [I] TestKnnGraph.testMultiThreadedSearch random test failure [lucene]
rmuir commented on issue #14327: URL: https://github.com/apache/lucene/issues/14327#issuecomment-2743546449 thank you @benwtrent
Re: [I] Address gradle temp file pollution insanity [lucene]
dweiss commented on issue #14385: URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743927366 Ok, I've added gradle's "user home" tmp cleaning as well. Anything older than 3 hours is removed. This folder may be shared across builds so the time limit is there to prevent accidental cross-build issues.
[PR] bump antlr 4.11.1 -> 4.13.2 [lucene]
rmuir opened a new pull request, #14388: URL: https://github.com/apache/lucene/pull/14388 Dependency is outdated; the main changes to generated code avoid warnings in java21+. This one didn't magically work like ICU, I simply force-regenerated. I tried messing around with the gradle dependsOn logic to get it to trigger on the antlr version bump, but was unsuccessful.
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
jpountz commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2744460409 I pushed an annotation
Re: [PR] Clean up junk from gradle's user home (~/.gradle/.tmp). #14385 [lucene]
dweiss merged PR #14387: URL: https://github.com/apache/lucene/pull/14387
[I] Address gradle temp file pollution insanity [lucene]
dweiss opened a new issue, #14385: URL: https://github.com/apache/lucene/issues/14385 ### Description Gradle creates temp files it never cleans up. Until this is resolved, let's try to keep some housekeeping ourselves. Related issues: * #10215 * #10510 * https://github.com/gradle/gradle/issues/15367 * https://github.com/gradle/gradle/issues/12020
Re: [I] Address gradle temp file pollution insanity [lucene]
dweiss commented on issue #14385: URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743998381 There are also *.log files to wipe clean.
Re: [PR] Clean up junk from gradle's user home (~/.gradle/.tmp). #14385 [lucene]
dweiss commented on PR #14387: URL: https://github.com/apache/lucene/pull/14387#issuecomment-2743985270 I'll merge this in. Low risk and we can always revert if needed.
Re: [I] Address gradle temp file pollution insanity [lucene]
dweiss closed issue #14385: Address gradle temp file pollution insanity URL: https://github.com/apache/lucene/issues/14385
Re: [I] gradle build leaks tons of gradle-worker-classpath* files in tmpdir [LUCENE-9175] [lucene]
dweiss closed issue #10215: gradle build leaks tons of gradle-worker-classpath* files in tmpdir [LUCENE-9175] URL: https://github.com/apache/lucene/issues/10215
Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]
rmuir merged PR #14381: URL: https://github.com/apache/lucene/pull/14381
Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]
dweiss commented on code in PR #14388: URL: https://github.com/apache/lucene/pull/14388#discussion_r2008140343 ## lucene/expressions/src/generated/checksums/generateAntlr.json: ## @@ -1,7 +1,8 @@ { "lucene/expressions/src/java/org/apache/lucene/expressions/js/Javascript.g4": "818e89aae0b6c7601051802013898c128fe7c1ba", "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptBaseVisitor.java": "6965abdb8b069aaceac1ce4f32ed965b194f3a25", - "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "b8d6b259ebbfce09a5379a1a2aa4c1ddd4e378eb", - "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "7a3a7b9de17f4a8d41ef342312eae5c55e483e08", - "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e" + "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "6508dc5008e96a1ad28c967a3401407ba83f140b", + "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "ba6d0c00af113f115fc7a1f165da7726afb2e8c5", + "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e", +"property:antlr-version": "4.13.2" Review Comment: Yeah - we could add full coordinates but I don't think this matters. The version should be fine. I'll try to do an overhaul of the build anyway and maybe consolidate this.
Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]
dweiss commented on PR #14384: URL: https://github.com/apache/lucene/pull/14384#issuecomment-2743758629 I mean the entire structure of tasks that are used in regenerate. It's complex. I remember I couldn't do it in any easier way before - maybe something has changed that would allow it to be simpler (I doubt though).
Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]
rmuir commented on PR #14381: URL: https://github.com/apache/lucene/pull/14381#issuecomment-2743798573 After fixing the Turkish handling, here's the (correct) automaton for `/[a-z]/`. The only special cases are long-s and Kelvin sign, as you'd expect. [image: automaton diagram, not preserved]
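To see why long-s and the Kelvin sign end up in a caseless `/[a-z]/`, here is a small sketch using only the JDK's `Character` methods. It is not Lucene's CaseFolding code: the class `CaselessRangeDemo` and its methods are made up for illustration, and the uppercase-then-lowercase trick is only an approximation of simple case folding that may disagree with Lucene's generated tables for some code points (the Turkish i variants, for instance).

```java
/** Illustrative only; this is not Lucene's CaseFolding implementation. */
final class CaselessRangeDemo {
  private CaselessRangeDemo() {}

  /** Approximates simple case folding by uppercasing, then lowercasing, one code point. */
  static int approxFold(int codePoint) {
    return Character.toLowerCase(Character.toUpperCase(codePoint));
  }

  /** True if the code point lands in a-z after the approximate fold. */
  static boolean matchesCaselessAtoZ(int codePoint) {
    int folded = approxFold(codePoint);
    return folded >= 'a' && folded <= 'z';
  }

  public static void main(String[] args) {
    // U+017F is LATIN SMALL LETTER LONG S, U+212A is KELVIN SIGN.
    int[] samples = {'q', 'Q', 0x017F, 0x212A, '1'};
    for (int cp : samples) {
      System.out.printf("U+%04X -> %b%n", cp, matchesCaselessAtoZ(cp));
    }
  }
}
```

Long-s uppercases to `S` and the Kelvin sign lowercases to `k`, so both are pulled into the range even though neither is an ASCII letter.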
Re: [I] Init HNSW merge with graph containing deleted documents [lucene]
benwtrent commented on issue #12533: URL: https://github.com/apache/lucene/issues/12533#issuecomment-2743826644 I think in addition to the recent merge improvements (https://github.com/apache/lucene/pull/14331), the ability to "fix up" the individual graphs that have deletions and THEN doing the merge might gain significant speed improvements. Additionally, folks might actually only want to expunge deletes, in this case, rewriting the entire graphs is incredibly wasteful, and we should instead "fix up" the graphs by adjusting the deleted nodes directly.
Re: [PR] upgrade icu dependency from 74.2 -> 77.1 [lucene]
rmuir commented on PR #14386: URL: https://github.com/apache/lucene/pull/14386#issuecomment-2743954863 @dweiss i know you dislike the complexity, but the `gradlew regenerate` really saves a metric ton of human time and prevents mistakes for updates like these.
Re: [I] Case insensitive regex query with character range [lucene]
rmuir closed issue #14378: Case insensitive regex query with character range URL: https://github.com/apache/lucene/issues/14378
Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]
rmuir merged PR #14384: URL: https://github.com/apache/lucene/pull/14384
Re: [I] Address gradle temp file pollution insanity [lucene]
dweiss commented on issue #14385: URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743756362 It's this commit that moved the temp folder from java.io.tmpdir, which we redirected and cleaned up. https://github.com/gradle/gradle/commit/8c2f6b7db50ab071a289fb5c4cbb9b2125609105#diff-a89e26b86bb25dd2df7ef61416478f3b9034cc4625633830a5413a5c5d7124f6
Re: [PR] BlockJoinBulkScorer could check for parent deletions (not children) [lucene]
jimczi closed pull request #14067: BlockJoinBulkScorer could check for parent deletions (not children) URL: https://github.com/apache/lucene/pull/14067
Re: [PR] upgrade icu dependency from 74.2 -> 77.1 [lucene]
dweiss commented on PR #14386: URL: https://github.com/apache/lucene/pull/14386#issuecomment-2743977198 I know, I know. I don't think we should remove it - I just hope it can be implemented in a less hairy way.
Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]
tteofili commented on code in PR #14094: URL: https://github.com/apache/lucene/pull/14094#discussion_r2007923461 ## lucene/core/src/java/org/apache/lucene/search/HnswQueueSaturationCollector.java: ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * A {@link HnswKnnCollector} that early exits when nearest neighbor queue keeps saturating beyond a + * 'patience' parameter. This records the rate of collection of new nearest neighbors in the {@code + * delegate} KnnCollector queue, at each HNSW node candidate visit. Once it saturates for a number + * of consecutive node visits (e.g., the patience parameter), this early terminates. 
+ * + * @lucene.experimental + */ +public class HnswQueueSaturationCollector extends HnswKnnCollector { + + private final KnnCollector delegate; + private final double saturationThreshold; + private final int patience; + private boolean patienceFinished; + private int countSaturated; + private int previousQueueSize; + private int currentQueueSize; + + HnswQueueSaturationCollector(KnnCollector delegate, double saturationThreshold, int patience) { +super(delegate); +this.delegate = delegate; +this.previousQueueSize = 0; +this.currentQueueSize = 0; +this.countSaturated = 0; +this.patienceFinished = false; +this.saturationThreshold = saturationThreshold; +this.patience = patience; + } + + @Override + public boolean earlyTerminated() { +return delegate.earlyTerminated() || patienceFinished; + } + + @Override + public boolean collect(int docId, float similarity) { +boolean collect = delegate.collect(docId, similarity); +if (collect) { + currentQueueSize++; +} +return collect; + } + + @Override + public float minCompetitiveSimilarity() { +return delegate.minCompetitiveSimilarity(); + } + + @Override + public TopDocs topDocs() { +TopDocs topDocs; +if (patienceFinished && delegate.earlyTerminated() == false) { + TopDocs delegateDocs = delegate.topDocs(); + TotalHits totalHits = + new TotalHits(delegateDocs.totalHits.value(), TotalHits.Relation.EQUAL_TO); + topDocs = new TopDocs(totalHits, delegateDocs.scoreDocs); +} else { + topDocs = delegate.topDocs(); +} +return topDocs; + } + + @Override + public void nextCandidate() { Review Comment: I really like this idea Ben, I'll see if I can make up something reasonable for that ;)
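The quoted diff stops at `nextCandidate()`, so the actual body is not shown here. As a rough sketch of the patience idea described in the Javadoc, and not the PR's implementation, a standalone tracker might look like the following. The class `SaturationTracker`, the `onCollected` method and the exact saturation formula (previous queue size divided by current queue size) are assumptions made for illustration.

```java
/** Hypothetical standalone sketch of patience-based early exit; not the code from the PR. */
final class SaturationTracker {
  private final double saturationThreshold;
  private final int patience;
  private int previousQueueSize;
  private int currentQueueSize;
  private int countSaturated;
  private boolean patienceFinished;

  SaturationTracker(double saturationThreshold, int patience) {
    this.saturationThreshold = saturationThreshold;
    this.patience = patience;
  }

  /** Call whenever the underlying queue accepts a new neighbor. */
  void onCollected() {
    currentQueueSize++;
  }

  /** Call once per visited graph candidate. */
  void nextCandidate() {
    // Assumed measure: share of the queue that was already present at the previous visit.
    double saturation =
        currentQueueSize == 0 ? 0.0 : (double) previousQueueSize / currentQueueSize;
    previousQueueSize = currentQueueSize;
    if (saturation >= saturationThreshold) {
      countSaturated++; // little or nothing new since the last candidate
    } else {
      countSaturated = 0; // progress was made, so the streak starts over
    }
    if (countSaturated > patience) {
      patienceFinished = true;
    }
  }

  boolean patienceFinished() {
    return patienceFinished;
  }
}
```

With a threshold of 0.9 and a patience of 2, this sketch gives up on the third consecutive candidate visit that adds no new neighbors. Resetting the counter on any progress is what makes the visits "consecutive" in the sense the Javadoc describes.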
Re: [PR] Clean up junk from gradle's user home (~/.gradle/.tmp). #14385 [lucene]
rmuir commented on PR #14387: URL: https://github.com/apache/lucene/pull/14387#issuecomment-2743932179 `./gradlew -XX:UseDweissTempFileGC`
Re: [I] Handling concurrent search in QueryProfiler [lucene]
jainankitk commented on issue #14375: URL: https://github.com/apache/lucene/issues/14375#issuecomment-2744045551 @jpountz - Can you assign this issue to me? I don't have permissions to do that myself
Re: [PR] Optimize ParallelLeafReader to improve term vector fetching efficienc [lucene]
vigyasharma commented on code in PR #14373: URL: https://github.com/apache/lucene/pull/14373#discussion_r2008678599 ## lucene/core/src/java/org/apache/lucene/index/ParallelLeafReader.java: ## @@ -348,15 +348,24 @@ public void prefetch(int docID) throws IOException { @Override public Fields get(int docID) throws IOException { ParallelFields fields = null; -for (Map.Entry ent : tvFieldToReader.entrySet()) { - String fieldName = ent.getKey(); - TermVectors termVectors = readerToTermVectors.get(ent.getValue()); - Terms vector = termVectors.get(docID, fieldName); - if (vector != null) { -if (fields == null) { - fields = new ParallelFields(); -} -fields.addField(fieldName, vector); + +// Step 2: Fetch all term vectors once per reader +for (Map.Entry entry : readerToTermVectors.entrySet()) { + TermVectors termVectors = entry.getValue(); + Fields docFields = termVectors.get(docID); // Fetch all fields at once + + if (docFields != null) { + if (fields == null) { + fields = new ParallelFields(); + } + + // Step 3: Aggregate only required fields + for (String fieldName : docFields) { + Terms vector = docFields.terms(fieldName); + if (vector != null) { Review Comment: When would this be null? Since we're going through fields returned by `termVectors.get(docId)`, the field should exist and have terms.
Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]
dweiss commented on PR #14388: URL: https://github.com/apache/lucene/pull/14388#issuecomment-2744155828 > This one didn't magically work like ICU I've pushed a commit that should do the trick. The antlr version wasn't in the inputs so the build didn't know it'd been updated.
Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]
dweiss commented on code in PR #14388: URL: https://github.com/apache/lucene/pull/14388#discussion_r2008129712 ## lucene/expressions/src/generated/checksums/generateAntlr.json: ## @@ -1,7 +1,13 @@ { + "../../../../../.gradle/caches/modules-2/files-2.1/com.ibm.icu/icu4j/72.1/bc9057df4b5efddf7f6d1880bf7f3399f4ce5633/icu4j-72.1.jar": "bc9057df4b5efddf7f6d1880bf7f3399f4ce5633", + "../../../../../.gradle/caches/modules-2/files-2.1/org.abego.treelayout/org.abego.treelayout.core/1.0.3/457216e8e6578099ae63667bb1e4439235892028/org.abego.treelayout.core-1.0.3.jar": "457216e8e6578099ae63667bb1e4439235892028", + "../../../../../.gradle/caches/modules-2/files-2.1/org.antlr/ST4/4.3.4/bf68d049dd4e6e104055a79ac3bf9e6307d29258/ST4-4.3.4.jar": "bf68d049dd4e6e104055a79ac3bf9e6307d29258", + "../../../../../.gradle/caches/modules-2/files-2.1/org.antlr/antlr-runtime/3.5.3/9011fb189c5ed6d99e5f3322514848d1ec1e1416/antlr-runtime-3.5.3.jar": "9011fb189c5ed6d99e5f3322514848d1ec1e1416", + "../../../../../.gradle/caches/modules-2/files-2.1/org.antlr/antlr4-runtime/4.13.2/fc3db6d844df652a3d5db31c87fa12757f13691d/antlr4-runtime-4.13.2.jar": "fc3db6d844df652a3d5db31c87fa12757f13691d", + "../../../../../.gradle/caches/modules-2/files-2.1/org.antlr/antlr4/4.13.2/a2bc0d399506a7297568baee188b481727d45d3b/antlr4-4.13.2.jar": "a2bc0d399506a7297568baee188b481727d45d3b", Review Comment: ok, we can't have it done this way, sorry. I'll revert.
Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]
rmuir commented on PR #14389: URL: https://github.com/apache/lucene/pull/14389#issuecomment-2744704384 I will straighten out the build; this one is kinda draftish as it needs more tests etc. Just wanted to toss out the idea. If it is autogenerated we can easily maintain some cohesive story rather than crazy Unicode puzzles. It is tempting to want full case folding as that's a benefit to e.g. German, but we need to take it one step at a time. Perf gets more complex, etc. Simple case folding is an improvement over lowercasing. The goal here is to not regress indexing performance if users switch from lowercase to simple case folding.
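The simple-versus-full distinction mentioned here can be illustrated with the JDK alone. The sketch below uses uppercasing rather than case folding, since the JDK has no case-folding API, so treat it as an analogy and not as what `CaseFolding.fold()` does; the class `SimpleVsFullCaseDemo` and its methods are made up for this example.

```java
import java.util.Locale;

/** Illustration of one-to-one ("simple") versus string-level ("full") case mapping. */
final class SimpleVsFullCaseDemo {
  private SimpleVsFullCaseDemo() {}

  /** Maps each code point on its own, so the number of code points never changes. */
  static String simpleUpper(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    s.codePoints().map(Character::toUpperCase).forEach(sb::appendCodePoint);
    return sb.toString();
  }

  /** Maps the whole string, so one character may expand into several. */
  static String fullUpper(String s) {
    return s.toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String word = "Stra\u00DFe"; // German "Strasse" spelled with sharp s (U+00DF)
    System.out.println(simpleUpper(word).length()); // same length as the input
    System.out.println(fullUpper(word).length()); // one longer: sharp s became "SS"
  }
}
```

The one-to-one variant can be applied per code point without resizing buffers, which is one reason a simple mapping is easier to keep fast than a full one.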