Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]
jainankitk commented on code in PR #14267: URL: https://github.com/apache/lucene/pull/14267#discussion_r2026298155 ## lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java: ## @@ -517,6 +623,11 @@ public byte[] getUpperPoint() { return upperPoint.clone(); } + // for test + public boolean isEqualValues() { Review Comment: Good catch @gsmiller! Can we make this package-private?
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
mayya-sharipova commented on code in PR #14426: URL: https://github.com/apache/lucene/pull/14426#discussion_r2027061812 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() { * The default implementation is empty */ public void finishMerge() throws IOException {} + + /** A string representing the off-heap category for quantized vectors. */ + public static final String QUANTIZED = "QUANTIZED"; + + /** A string representing the off-heap category for the HNSW graph. */ + public static final String HNSW_GRAPH = "HNSW_GRAPH"; + + /** A string representing the off-heap category for raw vectors. */ + public static final String RAW = "RAW"; + + /** + * Returns the desired size of off-heap memory for the given field. This size can be used to help + * determine the memory requirements for optimal search performance, which can be greatly affected + * by page faults when not enough memory is available. + * + * For reporting purposes, the backing off-heap index structures are broken into three + * categories: 1. {@link #RAW}, 2. {@link #HNSW_GRAPH}, and 3. {@link #QUANTIZED}. The returned + * map will have zero or one entry for each of these categories. + * + * The long value is the size in bytes of the off-heap space needed if the associated index + * structure were to be fully loaded in memory. While somewhat analogous to {@link + * Accountable#ramBytesUsed()} (which reports actual on-heap memory usage), the metrics reported + * by this method are not actual usage but rather the amount of available memory needed to fully + * load the index structure into memory. + * + * To determine the total desired off-heap memory size for the given field: + + {@code + * getOffHeapByteSize(field).values().stream().mapToLong(Long::longValue).sum(); + * } + * + * @param fieldInfo the fieldInfo + * @return a map of the desired off-heap memory requirements by category + * @lucene.experimental + */ + public abstract Map<String, Long> getOffHeapByteSize(FieldInfo fieldInfo); Review Comment: Very nice API! +1 But I am also thinking about other use cases, where we need to know the total size across all vector fields and we don't know/don't remember field names. Do you think it is worth having an API for that, maybe one that returns a map of maps?
Re: [PR] PointInSetQuery clips segments by lower and upper [lucene]
hanbj commented on code in PR #14268: URL: https://github.com/apache/lucene/pull/14268#discussion_r2020444502 ## lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java: ## @@ -122,6 +126,11 @@ protected PointInSetQuery(String field, int numDims, int bytesPerDim, Stream pac } sortedPackedPoints = builder.finish(); sortedPackedPointsHashCode = sortedPackedPoints.hashCode(); +if (previous != null) { + BytesRef max = previous.get(); + upperPoint = new byte[bytesPerDim * numDims]; + System.arraycopy(max.bytes, max.offset, upperPoint, 0, upperPoint.length); Review Comment: You're right: usually the length of the destination array is used for the copy. I had used the length of max here.
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz merged PR #14273: URL: https://github.com/apache/lucene/pull/14273
Re: [PR] Adds github action to verify changelog entry and set milestone to PRs [lucene]
javanna commented on PR #14279: URL: https://github.com/apache/lucene/pull/14279#issuecomment-2737697207 Hey @stefanvodita the changelog entry for this was filed under 10.2, but I don't believe the change itself was backported. Can you double check and either backport or move the changelog entry? Thanks!
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426: URL: https://github.com/apache/lucene/pull/14426#discussion_r202743 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() { * The default implementation is empty */ public void finishMerge() throws IOException {} + + /** A string representing the off-heap category for quantized vectors. */ + public static final String QUANTIZED = "QUANTIZED"; + + /** A string representing the off-heap category for the HNSW graph. */ + public static final String HNSW_GRAPH = "HNSW_GRAPH"; + + /** A string representing the off-heap category for raw vectors. */ + public static final String RAW = "RAW"; + + /** + * Returns the desired size of off-heap memory for the given field. This size can be used to help + * determine the memory requirements for optimal search performance, which can be greatly affected + * by page faults when not enough memory is available. + * + * For reporting purposes, the backing off-heap index structures are broken into three + * categories: 1. {@link #RAW}, 2. {@link #HNSW_GRAPH}, and 3. {@link #QUANTIZED}. The returned + * map will have zero or one entry for each of these categories. + * + * The long value is the size in bytes of the off-heap space needed if the associated index + * structure were to be fully loaded in memory. While somewhat analogous to {@link + * Accountable#ramBytesUsed()} (which reports actual on-heap memory usage), the metrics reported + * by this method are not actual usage but rather the amount of available memory needed to fully + * load the index structure into memory. + * + * To determine the total desired off-heap memory size for the given field: + + {@code + * getOffHeapByteSize(field).values().stream().mapToLong(Long::longValue).sum(); + * } + * + * @param fieldInfo the fieldInfo + * @return a map of the desired off-heap memory requirements by category + * @lucene.experimental + */ + public abstract Map<String, Long> getOffHeapByteSize(FieldInfo fieldInfo); Review Comment: @mayya-sharipova the expected usage is that the caller would get all the fieldInfos from the reader (`LeafReader::getFieldInfos`) and then iterate over them, checking for vector info. That should be straightforward to do without an additional API point.
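A minimal sketch of the usage described above, assuming the `getOffHeapByteSize` API proposed in this PR; `CodecReader`, `FieldInfo`, and `FieldInfos` are existing Lucene APIs, and the helper name is illustrative:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.FieldInfo;

class OffHeapSizeExample {
  // Sums the desired off-heap bytes over every vector field in one segment,
  // without needing to know field names up front.
  static long totalOffHeapBytes(CodecReader reader) throws IOException {
    KnnVectorsReader vectorsReader = reader.getVectorReader();
    if (vectorsReader == null) {
      return 0L;
    }
    long total = 0L;
    for (FieldInfo fi : reader.getFieldInfos()) {
      if (fi.hasVectorValues()) {
        Map<String, Long> sizes = vectorsReader.getOffHeapByteSize(fi); // API from this PR
        for (long bytes : sizes.values()) {
          total += bytes;
        }
      }
    }
    return total;
  }
}
```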
[PR] fix TestIndexWriterWithThreads#testIOExceptionDuringAbortWithThreadsOnlyOnce [lucene]
guojialiang92 opened a new pull request, #14424: URL: https://github.com/apache/lucene/pull/14424 ### Description This PR aims to address issue [14423](https://github.com/apache/lucene/issues/14423). ### Tests 1. In order to reproduce the problem reliably, I added a test `TestIndexWriterWithThreads#testIOExceptionWithMergeNotEndLongTime`. For details, please refer to [14423](https://github.com/apache/lucene/issues/14423). 2. I also fixed `TestIndexWriterWithThreads#testIOExceptionDuringAbortWithThreadsOnlyOnce`. ### Checklist - [x] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability. - [x] I have given Lucene maintainers [access](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [x] I have developed this patch against the main branch. - [x] I have run ./gradlew check. - [x] I have added tests for my changes.
Re: [PR] MultiRange query for SortedNumeric DocValues [lucene]
mkhludnev commented on code in PR #14404: URL: https://github.com/apache/lucene/pull/14404#discussion_r2013967382 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedNumericDocValuesMultiRangeQuery.java: ## @@ -0,0 +1,249 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.sandbox.search; + +import java.io.IOException; +import java.util.*; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.index.DocValuesSkipper; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.SortedNumericDocValues; +import org.apache.lucene.search.ConstantScoreScorerSupplier; +import org.apache.lucene.search.ConstantScoreWeight; +import org.apache.lucene.search.DocValuesRangeIterator; +import org.apache.lucene.search.IndexSearcher; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.QueryVisitor; +import org.apache.lucene.search.ScoreMode; +import org.apache.lucene.search.ScorerSupplier; +import org.apache.lucene.search.TwoPhaseIterator; +import org.apache.lucene.search.Weight; +import org.apache.lucene.util.PriorityQueue; + +/** + * A union multiple ranges over SortedNumericDocValuesField + * + * @lucene.experimental + */ +public class SortedNumericDocValuesMultiRangeQuery extends Query { + + protected final String fieldName; + protected final NavigableSet sortedClauses; + + protected SortedNumericDocValuesMultiRangeQuery( + String fieldName, List clauses) { +this.fieldName = fieldName; +sortedClauses = resolveOverlaps(clauses); + } + + private static final class Edge { +private final DocValuesMultiRangeQuery.LongRange range; +private final boolean point; +private final boolean upper; + +private static Edge createPoint(DocValuesMultiRangeQuery.LongRange r) { + return new Edge(r); +} + +long getValue() { + return upper ? range.upper : range.lower; +} + +private Edge(DocValuesMultiRangeQuery.LongRange range, boolean upper) { + this.range = range; + this.upper = upper; + this.point = false; +} + +/** expecting Arrays.equals(lower.bytes,upper.bytes) i.e. point */ +private Edge(DocValuesMultiRangeQuery.LongRange range) { + this.range = range; + this.upper = false; + this.point = true; +} + } + + /** Merges overlapping ranges. 
map.floor() doesn't work with overlaps */ + private static NavigableSet resolveOverlaps( + Collection clauses) { +NavigableSet sortedClauses = +new TreeSet<>( +Comparator.comparing(r -> r.lower) +//.thenComparing(r -> r.upper) +); +PriorityQueue heap = +new PriorityQueue<>(clauses.size() * 2) { + @Override + protected boolean lessThan(Edge a, Edge b) { +return a.getValue() - b.getValue() < 0; + } +}; +for (DocValuesMultiRangeQuery.LongRange r : clauses) { + long cmp = r.lower - r.upper; + if (cmp == 0) { +heap.add(Edge.createPoint(r)); + } else { +if (cmp < 0) { + heap.add(new Edge(r, false)); + heap.add(new Edge(r, true)); +} // else drop reverse ranges + } +} +int totalEdges = heap.size(); +int depth = 0; +Edge started = null; +for (int i = 0; i < totalEdges; i++) { + Edge smallest = heap.pop(); + if (depth == 0 && smallest.point) { +if (i < totalEdges - 1 && heap.top().point) { // repeating same points + if (smallest.getValue() == heap.top().getValue()) { +continue; + } +} +sortedClauses.add(smallest.range); + } + if (!smallest.point) { +if (!smallest.upper) { + depth++; + if (depth == 1) { // just started +started = smallest; + } +} else { + depth--; + if (depth == 0) { +sortedClauses.add( +started.range == smallest.range // no overlap case, the most often +? smallest.range +: new DocValuesMultiRangeQuery.LongRange( +started.getValue(), smallest.getValue())); +started = n
Re: [PR] quick exit on filter query matching no docs when rewriting knn query [lucene]
jpountz commented on PR #14418: URL: https://github.com/apache/lucene/pull/14418#issuecomment-2762525239 Can you help me understand what work this change helps save?
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on PR #14412: URL: https://github.com/apache/lucene/pull/14412#issuecomment-2763186278 @jpountz Added a CHANGES entry.
Re: [PR] Pack file pointers when merging BKD trees [lucene]
benwtrent commented on code in PR #14393: URL: https://github.com/apache/lucene/pull/14393#discussion_r2010085479 ## lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java: ## @@ -1961,7 +1989,7 @@ private void build( int leafCardinality = heapSource.computeCardinality(from, to, commonPrefixLengths); // Save the block file pointer: - leafBlockFPs[leavesOffset] = out.getFilePointer(); + leafBlockFPs.add(out.getFilePointer()); Review Comment: Ah, since file pointers are monotonic, we can make this compact. NICE!
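For context, a self-contained sketch of the monotonic-packing idea using Lucene's existing `PackedLongValues` (not necessarily the exact structure this PR uses):

```java
import org.apache.lucene.util.packed.PackedInts;
import org.apache.lucene.util.packed.PackedLongValues;

class LeafPointerPackingExample {
  public static void main(String[] args) {
    // Leaf block file pointers are written in increasing order, so a
    // monotonic encoding stores only small deltas from a linear slope.
    PackedLongValues.Builder builder = PackedLongValues.monotonicBuilder(PackedInts.COMPACT);
    for (long fp : new long[] {0, 1024, 2048, 3072, 4100}) {
      builder.add(fp);
    }
    PackedLongValues pointers = builder.build();
    System.out.println(pointers.get(3)); // prints 3072
  }
}
```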
Re: [I] Examine the effects of MADV_RANDOM when MGLRU is enabled in Linux kernel [lucene]
jimczi commented on issue #14408: URL: https://github.com/apache/lucene/issues/14408#issuecomment-2755375551 I believe the question is whether we need to reconsider our assumptions when defaulting to random read advice in the current code. With the linked change, using `MADV_RANDOM` will exclude pages from the LRU list, but the original intent was simply to reduce the read-ahead size. We explicitly use random advice in the vector format and FST files, which should, by default, benefit from LRU. Users should not be required to make changes to achieve the correct behavior.
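For reference, this is roughly how a format opts into random read advice today (a sketch against the Lucene 10 `IOContext`/`ReadAdvice` API, assuming `withReadAdvice` is available on the context in use):

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.ReadAdvice;

class RandomAdviceExample {
  // Vector and FST files are opened with RANDOM advice on the assumption
  // that it merely shrinks read-ahead; the linked kernel change would also
  // exclude those pages from the MGLRU working set.
  static IndexInput openForRandomAccess(Directory dir, String name) throws IOException {
    return dir.openInput(name, IOContext.DEFAULT.withReadAdvice(ReadAdvice.RANDOM));
  }
}
```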
Re: [PR] PointInSetQuery early exit on non-matching segments [lucene]
hanbj commented on code in PR #14268: URL: https://github.com/apache/lucene/pull/14268#discussion_r2022086841 ## lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java: ## @@ -248,6 +255,33 @@ public long cost() { } } + private boolean checkValidPointValues(PointValues values) throws IOException { Review Comment: Already rolled back.
Re: [PR] KeywordField.newSetQuery() to uses prefixed terms for IndexOrDocValuesQuery [lucene]
jainankitk commented on code in PR #14435: URL: https://github.com/apache/lucene/pull/14435#discussion_r2027694440 ## lucene/core/src/java/org/apache/lucene/document/KeywordField.java: ## @@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String value) { public static Query newSetQuery(String field, Collection values) { Objects.requireNonNull(field, "field must not be null"); Objects.requireNonNull(values, "values must not be null"); -Query indexQuery = new TermInSetQuery(field, values); -Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, field, values); -return new IndexOrDocValuesQuery(indexQuery, dvQuery); +return TermInSetQuery.newIndexOrDocValuesQuery( +MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values); Review Comment: Probably we can use `TermInSetQuery.newIndexOrDocValuesQuery(field, values)` here?
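For reference, the composition being factored out here can be written against the pre-existing public API as follows (a sketch; the `newIndexOrDocValuesQuery` factory itself is what this PR introduces):

```java
import java.util.Collection;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

class SetQueryCompositionExample {
  // The index-backed query is fast when the term set is selective; the
  // doc-values-backed query wins when it is not. The wrapper lets the
  // scorer pick the cheaper side per segment.
  static Query newSetQuery(String field, Collection<BytesRef> values) {
    Query indexQuery = new TermInSetQuery(field, values);
    Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, field, values);
    return new IndexOrDocValuesQuery(indexQuery, dvQuery);
  }
}
```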
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2006876856 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,552 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class TrieBuilder { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + static final int LEAF_NODE_HAS_TERMS = 1 << 5; + static final int LEAF_NODE_HAS_FLOOR = 1 << 6; + static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1; + static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0; + + /** + * The output describing the term block the prefix point to. + * + * @param fp describes the on-disk terms block which a trie node points to. + * @param hasTerms A boolean which will be false if this on-disk block consists entirely of + * pointers to child blocks. + * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing + * a single trie prefix is split into multiple on-disk blocks. + */ + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +BUILDING, +SAVED, +DESTROYED + } + + private static class Node { + +// The utf8 digit that leads to this Node, 0 for root node +private final int label; +// The children listed in order by their utf8 label +private final LinkedList children; +// The output of this node. +private Output output; + +// Vars used during saving: + +// The file pointer point to where the node saved. -1 means the node has not been saved. +private long fp = -1; +// The iterator whose next() point to the first child has not been saved. 
+private Iterator childrenIterator; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.BUILDING; + final Node root = new Node(0, null, new LinkedList<>()); + + static TrieBuilder bytesRefToTrie(BytesRef k, Output v) { +return new TrieBuilder(k, v); + } + + private TrieBuilder(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + /** + * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not + * have key that already exists in this one, otherwise a {@link IllegalArgumentException } will be + * thrown and this trie will get destroyed. + * + * Note: the given trie will be destroyed after absorbing. + */ + void absorb(TrieBuilder trieBuilder) { +if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) { + throw new IllegalStateException("tries should be unsaved"); +} +// Use a simple stack to avoid recursion. +Deque stack = new ArrayDeque<>(); +stack.add(() -> absorb(this.root, trieBuilder.root, stack)); +while (!stack.isEmpty()) { + stack.pop().run(); +} +trieBuilder.status = Status.DESTROYED; + } + + private void absorb(Node n, Node add, Deque stack) { +assert n.label == add.label; +if (add.output != null) { + if (n.output != null) { +
[PR] New IndexReaderFunctions.positionLength from the norm [lucene]
dsmiley opened a new pull request, #14433: URL: https://github.com/apache/lucene/pull/14433 ### Description Introduces `org.apache.lucene.queries.function.IndexReaderFunctions#positionLength` Javadocs: > Creates a value source that returns the position length (number of terms) of a field, approximated from the "norm".
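A sketch of the approximation involved, assuming the default BM25 norm encoding (`SmallFloat.intToByte4` of the position count); the helper name is illustrative:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.util.SmallFloat;

class PositionLengthExample {
  // Decodes a document's norm back into the (lossy) number of indexed
  // positions for the field; values are bucketed, hence "approximated".
  static int approximatePositionLength(LeafReader reader, String field, int docId)
      throws IOException {
    NumericDocValues norms = reader.getNormValues(field);
    if (norms == null || norms.advanceExact(docId) == false) {
      return 0;
    }
    return SmallFloat.byte4ToInt((byte) norms.longValue());
  }
}
```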
Re: [PR] Speedup merging of HNSW graphs [lucene]
mayya-sharipova commented on code in PR #14331: URL: https://github.com/apache/lucene/pull/14331#discussion_r2005462586 ## lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java: ## @@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues mergedVectorValues, int maxO OnHeapHnswGraph graph; BitSet initializedNodes = null; -if (initReader == null) { +if (graphReaders.size() == 0) { graph = new OnHeapHnswGraph(M, maxOrd); } else { + graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed()); + GraphReader initGraphReader = graphReaders.get(0); + KnnVectorsReader initReader = initGraphReader.reader(); + MergeState.DocMap initDocMap = initGraphReader.initDocMap(); + int initGraphSize = initGraphReader.graphSize(); HnswGraph initializerGraph = ((HnswGraphProvider) initReader).getGraph(fieldInfo.name); + if (initializerGraph.size() == 0) { graph = new OnHeapHnswGraph(M, maxOrd); } else { initializedNodes = new FixedBitSet(maxOrd); -int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, initializedNodes); +int[] oldToNewOrdinalMap = +getNewOrdMapping( +fieldInfo, +initReader, +initDocMap, +initGraphSize, +mergedVectorValues, +initializedNodes); graph = InitializedHnswGraphBuilder.initGraph(initializerGraph, oldToNewOrdinalMap, maxOrd); } } return new HnswConcurrentMergeBuilder( taskExecutor, numWorker, scorerSupplier, beamWidth, graph, initializedNodes); } + + /** + * Creates a new mapping from old ordinals to new ordinals and returns the total number of vectors + * in the newly merged segment. + * + * @param mergedVectorValues vector values in the merged segment + * @param initializedNodes track what nodes have been initialized + * @return the mapping from old ordinals to new ordinals + * @throws IOException If an error occurs while reading from the merge state + */ + private static final int[] getNewOrdMapping( + FieldInfo fieldInfo, + KnnVectorsReader initReader, + MergeState.DocMap initDocMap, + int initGraphSize, + KnnVectorValues mergedVectorValues, + BitSet initializedNodes) + throws IOException { +KnnVectorValues.DocIndexIterator initializerIterator = null; + +switch (fieldInfo.getVectorEncoding()) { + case BYTE -> initializerIterator = initReader.getByteVectorValues(fieldInfo.name).iterator(); + case FLOAT32 -> + initializerIterator = initReader.getFloatVectorValues(fieldInfo.name).iterator(); +} + +IntIntHashMap newIdToOldOrdinal = new IntIntHashMap(initGraphSize); +int maxNewDocID = -1; +for (int docId = initializerIterator.nextDoc(); +docId != NO_MORE_DOCS; +docId = initializerIterator.nextDoc()) { + int newId = initDocMap.get(docId); + maxNewDocID = Math.max(newId, maxNewDocID); + newIdToOldOrdinal.put(newId, initializerIterator.index()); Review Comment: Addressed in cb852a6387a09ba43049b8a24f1e026c309b368b ## lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java: ## @@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues mergedVectorValues, int maxO OnHeapHnswGraph graph; BitSet initializedNodes = null; -if (initReader == null) { +if (graphReaders.size() == 0) { graph = new OnHeapHnswGraph(M, maxOrd); } else { + graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed()); + GraphReader initGraphReader = graphReaders.get(0); + KnnVectorsReader initReader = initGraphReader.reader(); + MergeState.DocMap initDocMap = initGraphReader.initDocMap(); + int initGraphSize = initGraphReader.graphSize(); HnswGraph initializerGraph = ((HnswGraphProvider) 
initReader).getGraph(fieldInfo.name); + if (initializerGraph.size() == 0) { graph = new OnHeapHnswGraph(M, maxOrd); } else { initializedNodes = new FixedBitSet(maxOrd); -int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, initializedNodes); +int[] oldToNewOrdinalMap = +getNewOrdMapping( +fieldInfo, +initReader, +initDocMap, +initGraphSize, +mergedVectorValues, +initializedNodes); graph = InitializedHnswGraphBuilder.initGraph(initializerGraph, oldToNewOrdinalMap, maxOrd); } } return new HnswConcurrentMergeBuilder( taskExecutor, numWorker, scorerSupplier, beamWidth, graph, initializedNodes); } + + /** + * Creates a new mapping from old ordinals to new ordinals and returns the total number of vectors + * in the newly merged segment. + * + *
Re: [I] Address gradle temp file pollution insanity [lucene]
dweiss commented on issue #14385: URL: https://github.com/apache/lucene/issues/14385#issuecomment-2743732858 I think the hack we had in https://github.com/apache/lucene-solr/pull/1767/files used to work, but gradle must have relocated those temp files... The fix is simple, but I'd like to do some analysis of what exactly happened, and when, first.
Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]
rmuir commented on PR #14381: URL: https://github.com/apache/lucene/pull/14381#issuecomment-2743822277 @dweiss thanks for the suggestion there, gazillions of array creations avoided. So now this thing will only spike CPU during parsing, at worst. I honestly forget you can pass functions to functions in java now :)
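A small usage sketch of the flag under discussion (assuming the `CASE_INSENSITIVE` match flag present in recent Lucene versions; this PR extends it to character ranges):

```java
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

class CaseInsensitiveRangeExample {
  public static void main(String[] args) {
    // The third constructor argument carries match flags; with this PR,
    // character ranges like [a-z] also expand to their case alternatives.
    RegExp re = new RegExp("[a-z]+", RegExp.ALL, RegExp.CASE_INSENSITIVE);
    Automaton a =
        Operations.determinize(re.toAutomaton(), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
    System.out.println(a.getNumStates());
  }
}
```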
Re: [PR] upgrade icu dependency from 74.2 -> 77.1 [lucene]
rmuir merged PR #14386: URL: https://github.com/apache/lucene/pull/14386
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2006940286 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,552 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class TrieBuilder { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + static final int LEAF_NODE_HAS_TERMS = 1 << 5; + static final int LEAF_NODE_HAS_FLOOR = 1 << 6; + static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1; + static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0; + + /** + * The output describing the term block the prefix point to. + * + * @param fp describes the on-disk terms block which a trie node points to. + * @param hasTerms A boolean which will be false if this on-disk block consists entirely of + * pointers to child blocks. + * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing + * a single trie prefix is split into multiple on-disk blocks. + */ + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +BUILDING, +SAVED, +DESTROYED + } + + private static class Node { + +// The utf8 digit that leads to this Node, 0 for root node +private final int label; +// The children listed in order by their utf8 label +private final LinkedList children; +// The output of this node. +private Output output; + +// Vars used during saving: + +// The file pointer point to where the node saved. -1 means the node has not been saved. +private long fp = -1; +// The iterator whose next() point to the first child has not been saved. 
+private Iterator childrenIterator; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.BUILDING; + final Node root = new Node(0, null, new LinkedList<>()); + + static TrieBuilder bytesRefToTrie(BytesRef k, Output v) { +return new TrieBuilder(k, v); + } + + private TrieBuilder(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + /** + * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not + * have key that already exists in this one, otherwise a {@link IllegalArgumentException } will be + * thrown and this trie will get destroyed. + * + * Note: the given trie will be destroyed after absorbing. + */ + void absorb(TrieBuilder trieBuilder) { +if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) { + throw new IllegalStateException("tries should be unsaved"); +} +// Use a simple stack to avoid recursion. +Deque stack = new ArrayDeque<>(); +stack.add(() -> absorb(this.root, trieBuilder.root, stack)); +while (!stack.isEmpty()) { + stack.pop().run(); +} +trieBuilder.status = Status.DESTROYED; + } + + private void absorb(Node n, Node add, Deque stack) { +assert n.label == add.label; +if (add.output != null) { + if (n.output != null) { +
Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]
rmuir commented on code in PR #14381: URL: https://github.com/apache/lucene/pull/14381#discussion_r2007499003 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -778,6 +786,53 @@ private int[] toCaseInsensitiveChar(int codepoint) { } } + /** + * Expands range to include case-insensitive matches. + * + * This is expensive: a case-insensitive range involves iterating over the range space, adding + * alternatives. Jump on the grenade here: contain the CPU and memory explosion to just this method, + * activated by an optional flag. + */ + private void expandCaseInsensitiveRange( + int start, int end, List<Integer> rangeStarts, List<Integer> rangeEnds) { +if (start > end) + throw new IllegalArgumentException( + "invalid range: from (" + start + ") cannot be > to (" + end + ")"); + +// contain the explosion of transitions by using a throwaway state +Automaton scratch = new Automaton(); +int state = scratch.createState(); + +// iterate over range, adding codepoint and any alternatives as transitions +for (int i = start; i <= end; i++) { + scratch.addTransition(state, state, i); + int[] altCodePoints = CaseFolding.lookupAlternates(i); + if (altCodePoints != null) { +for (int alt : altCodePoints) { + scratch.addTransition(state, state, alt); +} + } else { +int altCase = +Character.isLowerCase(i) ? Character.toUpperCase(i) : Character.toLowerCase(i); +if (altCase != i) { + scratch.addTransition(state, state, altCase); +} + } +} Review Comment: this one is best as a separate PR. I will work on it today.
Re: [PR] Completion FSTs to be loaded off-heap by default [lucene]
javanna commented on code in PR #14364: URL: https://github.com/apache/lucene/pull/14364#discussion_r2000872434 ## lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java: ## @@ -951,7 +951,16 @@ static IndexWriterConfig iwcWithSuggestField(Analyzer analyzer, final Set
Re: [I] TestIndexSortBackwardsCompatibility.testSortedIndexAddDocBlocks fails reproducibly [lucene]
dweiss closed issue #14344: TestIndexSortBackwardsCompatibility.testSortedIndexAddDocBlocks fails reproducibly URL: https://github.com/apache/lucene/issues/14344
Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2762048902 > You just need to replace ctx with _. Ah, my bad! I tried `.`, but we can't use that as part of a variable name. Thanks for the suggestion @jpountz. At a high level, I have unified the concurrent/non-concurrent profiling paths as suggested. The `QueryProfilerTree` is shared across slices, and we recursively build the ProfilerTree for each slice for the response. There are a few kinks that we still need to iron out. For example: * `Weight` creation is global across slices. How do we account for its time? Should we have a separate global tree with just the weight times? We can't just get away with having the weight count at the top, as `Weight` is shared for child queries as well, right? * The new in-memory structure for profiled queries is a bit like below (notice the additional list for slices): ``` "query": [ <-- for list of slices [ <-- for list of root queries { "type": "TermQuery", "description": "foo:bar", "time_in_nanos" : 11972972, "breakdown" : { ``` We can probably have a map of slices, with the key being the `sliceId`: ``` "query": { "some global information": "slices": { "slice1": [ <-- for list of root queries { "type": "TermQuery", "description": "foo:bar", "time_in_nanos" : 11972972, "breakdown" : {...}}], "slice2": [], "slice3": []} } ```
Re: [I] ParallelLeafReader.getTermVectors can indirectly load TVs multiple times [LUCENE-6868] [lucene]
vigyasharma closed issue #7926: ParallelLeafReader.getTermVectors can indirectly load TVs multiple times [LUCENE-6868] URL: https://github.com/apache/lucene/issues/7926
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
jimczi commented on code in PR #14426: URL: https://github.com/apache/lucene/pull/14426#discussion_r2027392059 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() { * The default implementation is empty */ public void finishMerge() throws IOException {} + + /** A string representing the off-heap category for quantized vectors. */ + public static final String QUANTIZED = "QUANTIZED"; Review Comment: nit: I wonder if we should rather reflect the underlying format here. Something like flat_vector_float, flat_vector_byte, flat_vector_bbq? ## lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java: ## @@ -257,6 +259,19 @@ public long ramBytesUsed() { return size; } + @Override + public Map<String, Long> getOffHeapByteSize(FieldInfo fieldInfo) { +Objects.requireNonNull(fieldInfo); +var raw = rawVectorsReader.getOffHeapByteSize(fieldInfo); +var fieldEntry = fields.get(fieldInfo.name); +if (fieldEntry == null) { + assert fieldInfo.getVectorEncoding() == VectorEncoding.BYTE; Review Comment: This is not possible; this format doesn't accept raw vectors in the byte format.
Re: [PR] New IndexReaderFunctions.positionLength from the norm [lucene]
bruno-roustant commented on PR #14433: URL: https://github.com/apache/lucene/pull/14433#issuecomment-2777888670 Why not numTerms() instead of positionLength()? Inside Similarity.computeNorm(), the value is named numTerms.
[PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]
rmuir opened a new pull request, #14389: URL: https://github.com/apache/lucene/pull/14389 Regexp has the ability to erase case differences at query time (the slow way), but there's no corresponding ability to do it the fast way: at index time. There's LowerCaseFilter, but LowerCaseFilter normalizes text for display purposes, which is different from case folding, which eliminates case differences and is appropriate for search. Generate fold() data in a similar way as expand() data. Expose via UnicodeUtil and tableize basic latin for performance. Add CaseFoldingFilter. No Analyzer chains have been modified yet, but we should be able to improve Unicode support by swapping out LowerCaseFilter as a followup. Some filters such as GreekLowerCaseFilter can probably be eliminated.
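A hypothetical analyzer chain using the new filter; `CaseFoldingFilter`'s package and constructor here are assumptions for illustration, not the final API:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

class FoldingAnalyzerExample extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // Hypothetical: fold case differences for matching, as opposed to
    // lowercasing text for display (LowerCaseFilter's job).
    TokenStream sink = new CaseFoldingFilter(source);
    return new TokenStreamComponents(source, sink);
  }
}
```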
Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]
jainankitk commented on PR #14267: URL: https://github.com/apache/lucene/pull/14267#issuecomment-2773131906 @hanbj - Thanks for patiently addressing the review comments. While I don't see any performance regression risk myself, I am wondering if we can do one quick performance benchmark run, just to ensure we are not missing anything obvious?
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 commented on PR #14417: URL: https://github.com/apache/lucene/pull/14417#issuecomment-2766116736 Thanks, @vigyasharma I also looked at Lucene's native segment replication; just sharing my personal opinion. > Also, IIUC `IndexWriter#advanceSegmentInfosVersion()` was added to handle similar scenarios for NRT replication (Lucene's native segment replication implementation). I'm curious why we didn't run into the need to advance `SegmentInfos#counter` at that time. Do you remember, @mikemccand (I know it's been a while! (: )? In the code comments of Lucene's native segment replication, the risk of file conflicts is also mentioned, but no additional handling is done. From a robustness perspective, perhaps it should be handled here as well. The relevant code is as follows: ReplicaNode#fileIsIdentical (**Segment name was reused! This is rare but possible and otherwise devastating**) ``` private boolean fileIsIdentical(String fileName, FileMetaData srcMetaData) throws IOException { FileMetaData destMetaData = readLocalFileMetaData(fileName); if (destMetaData == null) { // Something went wrong in reading the file (it's corrupt, truncated, does not exist, etc.): return false; } if (Arrays.equals(destMetaData.header(), srcMetaData.header()) == false || Arrays.equals(destMetaData.footer(), srcMetaData.footer()) == false) { // Segment name was reused! This is rare but possible and otherwise devastating: if (isVerboseFiles()) { message("file " + fileName + ": will copy [header/footer is different]"); } return false; } else { return true; } } ```
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2755082414 I've toyed with it a bit but I don't see a way for it to not break those /// comments. An alternative is to fork it, fix what we need and then use the forked version from spotless. This is a doable alternative to using Eclipse's formatter - I really don't mind either.
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on PR #14333: URL: https://github.com/apache/lucene/pull/14333#issuecomment-2771814782 I roughly implemented the idea. This is my first time forking a new codec, so hopefully I have not made too many mistakes :) A few thoughts during my refactoring: * I thought I only needed to fork a `Lucene103BlockTreeTerms` to intersect with `Lucene101Postings`, but that seems challenging based on the current API design. I have to fork the new `Lucene103Postings` as well. Maybe this is a point that can be improved? * I'm not sure it matters whether we make it a default codec or not in main, as main will not get released anyway. Defaulting it in main without backporting sounds good enough to me. If we decide not to make it the default, maybe this codec should be moved into the test / sandbox / codec module? It would be weird to see multiple codecs in the core module for contributors who don't have the context of this PR.
Re: [I] Incorrect use of fsync [lucene]
rmuir commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2772221194 Nobody needs to fsync any temporary files, ever. They are temporary: we don't need them to be durable. Look at how lucene uses temporary files to understand this. We don't need such files to persist to any storage device, ever. Personally I use tmpfs for temp files; they only go to memory. If your operating system doesn't give you any error when using temporary files, then your operating system is broken: get a new one. If your computer doesn't detect memory corruption, then buy ECC memory. Lucene has checksums and other safeguards that might indicate it, but that's no guarantee; it is just best-effort. IMO you read too far into a stackoverflow comment here without understanding how some of this works.
Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]
rmuir commented on code in PR #14388: URL: https://github.com/apache/lucene/pull/14388#discussion_r2008139072 ## lucene/expressions/src/generated/checksums/generateAntlr.json: ## @@ -1,7 +1,8 @@ { "lucene/expressions/src/java/org/apache/lucene/expressions/js/Javascript.g4": "818e89aae0b6c7601051802013898c128fe7c1ba", "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptBaseVisitor.java": "6965abdb8b069aaceac1ce4f32ed965b194f3a25", - "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "b8d6b259ebbfce09a5379a1a2aa4c1ddd4e378eb", - "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "7a3a7b9de17f4a8d41ef342312eae5c55e483e08", - "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e" + "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "6508dc5008e96a1ad28c967a3401407ba83f140b", + "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "ba6d0c00af113f115fc7a1f165da7726afb2e8c5", + "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e", +"property:antlr-version": "4.13.2" Review Comment: Thanks for this! Yes, this is what it looks like in the ICU json file, which works perfectly: e.g. in `./lucene/analysis/icu/src/generated/checksums/genRbbi.json`: ```json { ... "property:icuConfig": "com.ibm.icu:icu4j:77.1" } ```
Re: [PR] New IndexReaderFunctions.positionLength from the norm [lucene]
dsmiley commented on PR #14433: URL: https://github.com/apache/lucene/pull/14433#issuecomment-2780732429 `fieldLength` works for me. I'd like `fieldPositionLength` more as it characterizes the basis of the length (it's not characters). BTW some other methods on this class don't have "field" in the name yet take a field arg and so are a statistic about a field.
Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]
mkhludnev commented on code in PR #14435: URL: https://github.com/apache/lucene/pull/14435#discussion_r2029801915 ## lucene/core/src/java/org/apache/lucene/document/KeywordField.java: ## @@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String value) { public static Query newSetQuery(String field, Collection<String> values) { Objects.requireNonNull(field, "field must not be null"); Objects.requireNonNull(values, "values must not be null"); -Query indexQuery = new TermInSetQuery(field, values); -Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, field, values); -return new IndexOrDocValuesQuery(indexQuery, dvQuery); +return TermInSetQuery.newIndexOrDocValuesQuery( +MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values); Review Comment: ok. Got it. Since we add something, let's add as little as possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]
jpountz commented on issue #14431: URL: https://github.com/apache/lucene/issues/14431#issuecomment-2780641860 > and some deletes being addressed is better than none. This part of your message suggests that deletes get reclaimed progressively over time, which is often not true. So waiting for 50% of the time it takes to run merges may not result in an index that has significantly fewer deletes than if not waiting at all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] New IndexReaderFunctions.positionLength from the norm [lucene]
jpountz commented on PR #14433: URL: https://github.com/apache/lucene/pull/14433#issuecomment-2780644329 What about calling it just "field length", since this is the length as computed for the purpose of length normalization? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2019109527 ## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -122,12 +123,30 @@ public LRUQueryCache( long maxRamBytesUsed, Predicate<LeafReaderContext> leavesToCache, float skipCacheFactor) { +this(maxSize, maxRamBytesUsed, leavesToCache, new AtomicReference<>(skipCacheFactor)); + } + + /** + * Additionally allows passing skipCacheFactor as an AtomicReference, so the caller can + * dynamically update its value (in a thread-safe way) by calling skipCacheFactor.set() + * on their end. + */ + public LRUQueryCache( + int maxSize, + long maxRamBytesUsed, + Predicate<LeafReaderContext> leavesToCache, + AtomicReference<Float> skipCacheFactor) { Review Comment: Made the changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
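A sketch of the dynamic-update usage this constructor enables; the `AtomicReference`-based overload is the change under review in this PR, not an existing release API, and the numbers below are illustrative.

```java
AtomicReference<Float> skipCacheFactor = new AtomicReference<>(10f);
LRUQueryCache cache = new LRUQueryCache(
    1_000,                // maxSize: max number of cached queries
    64 * 1024 * 1024,     // maxRamBytesUsed
    ctx -> true,          // leavesToCache predicate
    skipCacheFactor);
// Later, from any thread, without rebuilding the cache:
skipCacheFactor.set(5f);
```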
Re: [PR] Let Decompressor implement the Closeable interface. [lucene]
jpountz commented on PR #14438: URL: https://github.com/apache/lucene/pull/14438#issuecomment-2778028781 Unfortunately, you can't easily use close() to release resources from a Decompressor, because `StoredFieldsReader` is cloneable, and close() is never called on the clones. The only workaround that comes to mind would consist of using thread-locals, but I don't think we want to support that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] KeywordField.newSetQuery() to use prefixed terms for IndexOrDocValuesQuery [lucene]
mkhludnev opened a new pull request, #14435: URL: https://github.com/apache/lucene/pull/14435 fix #14425 ### Description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] QueryParser parsing a phrase with a wildcard [lucene]
viliam-durina opened a new issue, #14440: URL: https://github.com/apache/lucene/issues/14440 ### Description Hi all, I have tried to parse this query using the classic QueryParser: String sQuery = "\"foo bar*\""; The query was parsed into a PhraseQuery with two terms: "foo" and "bar". That is, the wildcard was lost and the query doesn't handle the "bar" term as a prefix. I think this is an issue: Lucene should either produce an error if wildcard search isn't supported within phrases, or it should produce a correct query. ### Version and environment details _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
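A minimal reproduction sketch for the report above (assuming `StandardAnalyzer`; the printed query form is illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

QueryParser parser = new QueryParser("body", new StandardAnalyzer());
Query q = parser.parse("\"foo bar*\"");
System.out.println(q); // body:"foo bar" -- a plain PhraseQuery; the trailing * is silently dropped
```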
[PR] Use FixedLengthBytesRefArray in OneDimensionBKDWriter to hold split values [lucene]
iverase opened a new pull request, #14383: URL: https://github.com/apache/lucene/pull/14383 We are currently using a list, which feels wasteful. For example, looking into a heap dump on an IP field, we were using almost double the heap necessary to hold the split values (heap dump screenshot: https://github.com/user-attachments/assets/d839b0f4-ed6b-43bf-8060-47560b68be2a). Using FixedLengthBytesRefArray should reduce memory usage and avoid humongous allocations. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
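A hedged sketch of the idea: store fixed-width values in one packed structure instead of a `List<byte[]>` carrying one object plus header per value. `FixedLengthBytesRefArray` is a real Lucene utility (its constructor takes the fixed value length and `append` copies bytes into shared blocks), but the variable names and exact call sites in `OneDimensionBKDWriter` below are assumptions.

```java
// One packed buffer for all split values, instead of N separate byte[] objects:
FixedLengthBytesRefArray splitValues = new FixedLengthBytesRefArray(bytesPerDim);
for (BytesRef splitValue : collectedSplitValues) { // hypothetical source of values
  splitValues.append(splitValue); // copies into a shared block, no per-value object header
}
```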
Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]
benwtrent commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2744242199 > do you confirm that, according to your knowledge, any relevant and active work toward multi-valued vectors in Lucene is effectively aggregated here? @alessandrobenedetti I think so. This is the latest stab at it. > Main concern is still related to ordinals to become long as far as I can see :) Indeed, I just don't see how Lucene can actually support multi-value vectors without switching to long ordinals for the vectors. Otherwise, we enforce some limitation on the number of vectors per segment, or some limitation on the number of vectors per doc (e.g. every doc can only have 256/65535 vectors). Making HNSW indexing & merging ~2x (given other constants, it might not be exactly 2x, maybe a little less) more expensive for heap usage is a pretty steep cost. Especially for something I am not sure how many folks will actually use. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Speedup merging of HNSW graphs (#14331) [lucene]
mayya-sharipova opened a new pull request, #14380: URL: https://github.com/apache/lucene/pull/14380 Backport for #14331 Currently, when merging HNSW graphs incrementally, we first initialize a graph from the biggest segment; for the other segments, we rebuild the graphs completely by going through a segment's vector values one by one, searching for each in the new graph to find the best neighbours to connect it with. This PR proposes more efficient merging based on the idea that if we know where we want to insert a node, we have a good idea of where we want to insert its neighbours. Similarly to the current approach, we initialize a new graph from the biggest segment. For all other segments, we find a smaller set of nodes that "covers" their graph, and we insert that set as usual. For the other nodes, outside of the join sets, we do lighter searches with pre-calculated entry points (`eps`). This allows substantial speedups in merging (up to 2x in force-merge). The algorithm is based on the following steps: 1. Get all graphs that don't have deletions and sort them by size (descending). 2. Copy the largest graph to the new graph (`gL`). 3. For each remaining small graph (`gS`): - Find the nodes that best cover `gS` (join set `j`). These nodes will be inserted into `gL` as usual: by searching `gL` to find the best candidates (`w`) to which to connect the nodes. - For each remaining node in `gS`, do "lighter" searches: - We provide `eps` to search in `gL`. We form `eps` from the union of the node's neighbors in `gS` and the node's neighbors' neighbors in `gL`. We also limit `beamWidth` (`efConstruction`) to `M * 3`. Algorithm designed by Thomas Veasey ### Description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
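An illustrative pseudocode sketch of the merge flow described above; all helper names (`computeJoinSet`, `insertFromEntryPoints`, etc.) are hypothetical stand-ins for the steps in the PR text, not actual Lucene methods.

```java
List<Graph> graphs = sortBySizeDescending(graphsWithoutDeletions(segments)); // step 1
Graph gL = copyOf(graphs.get(0));                     // step 2: seed with the largest graph
for (Graph gS : graphs.subList(1, graphs.size())) {   // step 3: remaining small graphs
  Set<Integer> j = computeJoinSet(gS);                // nodes that best cover gS
  for (int node : j) {
    insertBySearch(gL, node, efConstruction);         // usual insertion via a full search
  }
  for (int node : nodesOutside(gS, j)) {              // remaining nodes: lighter searches
    Set<Integer> eps = union(neighborsIn(gS, node),
                             neighborsOfNeighborsIn(gL, gS, node));
    insertFromEntryPoints(gL, node, eps, 3 * M);      // beamWidth limited to M * 3
  }
}
```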
Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]
benwtrent commented on code in PR #14419: URL: https://github.com/apache/lucene/pull/14419#discussion_r2018509188 ## lucene/core/src/test/org/apache/lucene/internal/vectorization/TestVectorUtilSupport.java: ## @@ -210,9 +210,13 @@ public void testMinMaxScalarQuantize() { } private void assertFloatReturningProviders(ToDoubleFunction<VectorUtilSupport> func) { -assertThat( -func.applyAsDouble(PANAMA_PROVIDER.getVectorUtilSupport()), -closeTo(func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()), delta)); +double luceneValue = func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()); Review Comment: Using `assertEquals` is fine with the delta. I don't know of any special reason to use `closeTo` here. @thecoop what do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
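For reference, the two assertion idioms discussed here check the same condition, so the choice is stylistic:

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.closeTo;
import static org.junit.Assert.assertEquals;

double expected = 0.5, actual = 0.5000001, delta = 1e-5;
assertEquals(expected, actual, delta);        // JUnit: fails unless |expected - actual| <= delta
assertThat(actual, closeTo(expected, delta)); // Hamcrest: the same check expressed as a matcher
```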
Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]
jainankitk commented on code in PR #14435: URL: https://github.com/apache/lucene/pull/14435#discussion_r2029926829 ## lucene/core/src/java/org/apache/lucene/document/KeywordField.java: ## @@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String value) { public static Query newSetQuery(String field, Collection<String> values) { Objects.requireNonNull(field, "field must not be null"); Objects.requireNonNull(values, "values must not be null"); -Query indexQuery = new TermInSetQuery(field, values); -Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, field, values); -return new IndexOrDocValuesQuery(indexQuery, dvQuery); +return TermInSetQuery.newIndexOrDocValuesQuery( +MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values); Review Comment: Thanks for making the change. I know it's minor, but important for keeping it clean! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]
uschindler commented on code in PR #14384: URL: https://github.com/apache/lucene/pull/14384#discussion_r2008497905 ## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ## @@ -759,23 +759,14 @@ private Automaton toAutomaton( * @return the original codepoint and the set of alternates */ private int[] toCaseInsensitiveChar(int codepoint) { -int[] altCodepoints = CaseFolding.lookupAlternates(codepoint); -if (altCodepoints != null) { - int[] concat = new int[altCodepoints.length + 1]; - System.arraycopy(altCodepoints, 0, concat, 0, altCodepoints.length); - concat[altCodepoints.length] = codepoint; - return concat; -} else { - int altCase = - Character.isLowerCase(codepoint) - ? Character.toUpperCase(codepoint) - : Character.toLowerCase(codepoint); - if (altCase != codepoint) { -return new int[] {altCase, codepoint}; - } else { -return new int[] {codepoint}; - } -} +List<Integer> list = new ArrayList<>(); +CaseFolding.expand( +codepoint, +(int variant) -> { Review Comment: wouldn't `list::add` as a method reference have worked? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
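For clarity, the method reference the reviewer suggests (assuming `CaseFolding.expand` takes an int-consuming callback, as the diff indicates):

```java
List<Integer> list = new ArrayList<>();
CaseFolding.expand(codepoint, list::add); // equivalent to (int variant) -> list.add(variant)
```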
Re: [PR] A specialized Trie for Block Tree Index [lucene]
mikemccand commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2005470873 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java: ## @@ -0,0 +1,486 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class Trie { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +UNSAVED, +SAVED, +DESTROYED + } + + private static class Node { +private final int label; +private final LinkedList children; +private Output output; +private long fp = -1; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.UNSAVED; + final Node root = new Node(0, null, new LinkedList<>()); + + Trie(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? 
v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + void putAll(Trie trie) { +if (status != Status.UNSAVED || trie.status != Status.UNSAVED) { + throw new IllegalStateException("tries should be unsaved"); +} +trie.status = Status.DESTROYED; +putAll(this.root, trie.root); + } + + private static void putAll(Node n, Node add) { +assert n.label == add.label; +if (add.output != null) { + n.output = add.output; +} +ListIterator iter = n.children.listIterator(); +// TODO we can do more efficient if there is no intersection, block tree always do that +outer: +for (Node addChild : add.children) { + while (iter.hasNext()) { +Node nChild = iter.next(); +if (nChild.label == addChild.label) { + putAll(nChild, addChild); + continue outer; +} +if (nChild.label > addChild.label) { + iter.previous(); // move back + iter.add(addChild); + continue outer; +} + } + iter.add(addChild); +} + } + + Output getEmptyOutput() { +return root.output; + } + + void forEach(BiConsumer consumer) { +if (root.output != null) { + consumer.accept(new BytesRef(), root.output); +} +intersect(root.children, new BytesRefBuilder(), consumer); + } + + private void intersect( + List nodes, BytesRefBuilder key, BiConsumer consumer) { +for (Node node : nodes) { + key.append((byte) node.label); + if (node.output != null) consumer.accept(key.toBytesRef(), node.output); + intersect(node.children, key, consumer); + key.setLength(key.length() - 1); +} + } + + void save(DataOutput meta, IndexOutput index) throws IOException { +if (status != Status.UNSAVED) { + throw new IllegalStateException("only unsaved trie can be saved"); +} +status = Status.SAVED; +meta.writeVLong(index.getFilePointer()); +saveNodes(index); +meta.writeVLong(root.fp); +index.writeLong(0L); // additional 8 bytes for over-reading +meta.writeVLong(index.getFilePointer()); + } + + void saveNodes(IndexOutput index) throws IOException { +final long startFP = index.getFilePointer(); +Deque stack = new ArrayDeque<>(); +sta
Re: [I] IndexReader#leaves method is slightly confusing [lucene]
jpountz commented on issue #14367: URL: https://github.com/apache/lucene/issues/14367#issuecomment-2748919960 Hmm, maybe I closed a bit too quickly as this issue only pointed out confusion with `IndexReader#leaves`, it did not suggest a particular approach. That said, I'm aligned with the last paragraph: "Really minor at this point, and probably not worth going through the pain of deprecating IndexReader#leaves and changing at few hundred places", it's not too confusing to me so I'm not sure it actually warrants a change. I'll leave it closed for now but happy to reopen if there is traction for improving this API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Adding TestSpanWithinQuery with basic test cases for SpanWithinQuery [lucene]
slow-J opened a new pull request, #14405: URL: https://github.com/apache/lucene/pull/14405 TEST: ./gradlew check ### Description I was looking at an old issue https://github.com/apache/lucene/issues/7145 which talks about unit tests for SpanWithinQuery. I noticed that there was no class for basic unit tests for SpanWithinQuery, while we do this for many other SpanQueries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2754255056 @dweiss I also wonder, with an "autoformat" workflow, if we even care so much. I don't understand what is so sacrosanct about google's format: to me it is ugly. Snippet tag is from java 18 (6 releases back) and google doesn't care, they are a big corporation and probably the type to keep code on e.g. java 8. I don't think we should weigh their opinions very much on anything. All autoformatters lead to ugliness at times, it is just the tradeoff you make to avoid hassles, and still reap the benefits of avoiding formatting bikesheds, noise in PRs, etc. I just think autoformat the code in a consistent way, call it a day. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
vigyasharma commented on PR #14417: URL: https://github.com/apache/lucene/pull/14417#issuecomment-2764418906 I think we can add a couple more tests to make it robust. 1. Some tests around concurrency – index with multiple threads, then advance the counter in one of the threads, and validate behavior. You can look at `ThreadedIndexingAndSearchingTestCase` and its derived tests for motivation. 2. A test for the crash-recovery scenario, which I suppose is the primary use case. We could make the writer index a bunch of docs, then kill it, start a new writer on the same index, and advance its counter. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
gf2121 commented on code in PR #14401: URL: https://github.com/apache/lucene/pull/14401#discussion_r2019735302 ## lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingLeafCollector.java: ## @@ -50,6 +50,14 @@ public void collect(DocIdStream stream) throws IOException { in.collect(new AssertingDocIdStream(stream)); } + @Override + public void collectRange(int min, int max) throws IOException { +assert min > lastCollected; +assert max > min; Review Comment: Maybe assert `min >= this.min` and `max <= this.max` as well :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
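The suggested checks, sketched into the wrapper (assuming `this.min`/`this.max` are the doc-id window `AssertingLeafCollector` is constructed with; bookkeeping of `lastCollected` is elided, as in the diff):

```java
@Override
public void collectRange(int min, int max) throws IOException {
  assert min > lastCollected;
  assert max > min;
  assert min >= this.min; // suggested: the range must start inside the wrapper's window...
  assert max <= this.max; // ...and must end inside it
  in.collectRange(min, max);
}
```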
Re: [PR] Adds github action to verify changelog entry and set milestone to PRs [lucene]
stefanvodita commented on PR #14279: URL: https://github.com/apache/lucene/pull/14279#issuecomment-2743574250 Thanks for pointing that out @javanna! Funny how that happened on a PR that's specifically about the changelog. We should only push this to main. I'll actually delete the entry for now since we're still iterating on this workflow to make it work. See #13898. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2006885578 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieReader.java: ## @@ -0,0 +1,228 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.RandomAccessInput; + +class TrieReader { + + private static final long NO_OUTPUT = -1; + private static final long NO_FLOOR_DATA = -1; + private static final long[] BYTES_MINUS_1_MASK = + new long[] { +0xFFL, +0xL, +0xFFL, +0xL, +0xFFL, +0xL, +0xFFL, +0xL + }; + + static class Node { Review Comment: Yeah, `TrieBuilder.Node` and `TrieReader.Node`. I think the class prefix has made it clear :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValue… [lucene]
mkhludnev opened a new pull request, #14442: URL: https://github.com/apache/lucene/pull/14442 …sQuery (#14435) * KeywordField.newSetQuery() reuses prefixed terms. fix #14425 ### Description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] A specialized Trie for Block Tree Index [lucene]
mikemccand commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2022727361 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,552 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class TrieBuilder { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + static final int LEAF_NODE_HAS_TERMS = 1 << 5; + static final int LEAF_NODE_HAS_FLOOR = 1 << 6; + static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1; + static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0; + + /** + * The output describing the term block the prefix point to. + * + * @param fp describes the on-disk terms block which a trie node points to. + * @param hasTerms A boolean which will be false if this on-disk block consists entirely of + * pointers to child blocks. + * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing + * a single trie prefix is split into multiple on-disk blocks. + */ + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +BUILDING, +SAVED, +DESTROYED + } + + private static class Node { + +// The utf8 digit that leads to this Node, 0 for root node +private final int label; +// The children listed in order by their utf8 label +private final LinkedList children; +// The output of this node. +private Output output; + +// Vars used during saving: + +// The file pointer point to where the node saved. -1 means the node has not been saved. +private long fp = -1; +// The iterator whose next() point to the first child has not been saved. 
+private Iterator childrenIterator; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.BUILDING; + final Node root = new Node(0, null, new LinkedList<>()); + + static TrieBuilder bytesRefToTrie(BytesRef k, Output v) { +return new TrieBuilder(k, v); + } + + private TrieBuilder(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + /** + * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not + * have key that already exists in this one, otherwise a {@link IllegalArgumentException } will be + * thrown and this trie will get destroyed. + * + * Note: the given trie will be destroyed after absorbing. + */ + void absorb(TrieBuilder trieBuilder) { +if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) { + throw new IllegalStateException("tries should be unsaved"); +} +// Use a simple stack to avoid recursion. +Deque stack = new ArrayDeque<>(); +stack.add(() -> absorb(this.root, trieBuilder.root, stack)); +while (!stack.isEmpty()) { + stack.pop().run(); +} +trieBuilder.status = Status.DESTROYED; + } + + private void absorb(Node n, Node add, Deque stack) { +assert n.label == add.label; +if (add.output != null) { + if (n.output != null) { +
Re: [I] Reuse packedTerms between two TermInSetQuery which are combined by IndexOrDocValuesQuery [lucene]
mkhludnev closed issue #14425: Reuse packedTerms between two TermInSetQuery which are combined by IndexOrDocValuesQuery URL: https://github.com/apache/lucene/issues/14425 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]
mkhludnev merged PR #14435: URL: https://github.com/apache/lucene/pull/14435 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]
vigyasharma commented on PR #14325: URL: https://github.com/apache/lucene/pull/14325#issuecomment-2781125749 This PR changes the existing `KeepLastCommitDeletionPolicy`, which is not what we want. I've created a new beginner issue, #1, that specifies the requirements for this task. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]
vigyasharma closed pull request #14325: Optimize commit retention policy to maintain only the last 5 commits URL: https://github.com/apache/lucene/pull/14325 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Revert "Add UnwrappingReuseStrategy for AnalyzerWrapper (#14154)" [lucene]
mayya-sharipova merged PR #14437: URL: https://github.com/apache/lucene/pull/14437 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
rmuir commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2000536590 ## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java: ## @@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) { return alts; } + + /** + * Folds the case of the given character according to {@link Character#toLowerCase(int)}, but with + * exceptions if the turkic flag is set. + * + * @param codepoint the code point for the character to fold + * @param turkic if true, then apply tr/az folding rules + * @return the folded character + */ + static int foldCase(int codepoint, boolean turkic) { +if (turkic) { + if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE] +return 0x00069; // i [LATIN SMALL LETTER I] + } else if (codepoint == 0x49) { // I [LATIN CAPITAL LETTER I] +return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I] + } +} +return Character.toLowerCase(codepoint); Review Comment: For real case folding we have to do more than this. It is a simple 1-1 mapping, but e.g. `Σ`, `σ`, and `ς` will all fold to σ, whereas toLowerCase(ς) = ς, because it is already lower-case, just in final form. This is just an example. To see more, compare your function against ICU UCharacter.foldCase(int, bool) across all of unicode. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
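The sigma example is easy to verify against the JDK, and `UCharacter.foldCase(int, boolean)` is the real ICU4J API referenced above:

```java
System.out.println((char) Character.toLowerCase('Σ')); // σ
System.out.println((char) Character.toLowerCase('ς')); // ς: final sigma is already lower case
// Unicode case folding instead maps Σ, σ and ς all to σ, e.g. with ICU4J:
//   com.ibm.icu.lang.UCharacter.foldCase('Σ', true) == 'σ'
//   com.ibm.icu.lang.UCharacter.foldCase('ς', true) == 'σ'
```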
Re: [PR] A specialized Trie for Block Tree Index [lucene]
mikemccand commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2022767256 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,632 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Arrays; +import java.util.Deque; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** + * A builder to build prefix tree (trie) as the index of block tree, and can be saved to disk. + * + * TODO make it a more memory efficient structure + */ +class TrieBuilder { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + static final int LEAF_NODE_HAS_TERMS = 1 << 5; + static final int LEAF_NODE_HAS_FLOOR = 1 << 6; + static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1; + static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0; + + /** + * The output describing the term block the prefix point to. + * + * @param fp the file pointer to the on-disk terms block which a trie node points to. + * @param hasTerms false if this on-disk block consists entirely of pointers to child blocks. + * @param floorData will be non-null when a large block of terms sharing a single trie prefix is + * split into multiple on-disk blocks. + */ + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +BUILDING, +SAVED, +DESTROYED + } + + private static class Node { + +// The utf8 digit that leads to this Node, 0 for root node +private final int label; +// The output of this node. +private Output output; +// The number of children of this node. +private int childrenNum; +// Pointers to relative nodes +private Node next; +private Node firstChild; +private Node lastChild; + +// Vars used during saving: + +// The file pointer point to where the node saved. -1 means the node has not been saved. +private long fp = -1; +// The latest child that have been saved. null means no child has been saved. 
+private Node savedTo; + +Node(int label, Output output) { + this.label = label; + this.output = output; +} + } + + private Status status = Status.BUILDING; + final Node root = new Node(0, null); + private final BytesRef minKey; + private BytesRef maxKey; + + static TrieBuilder bytesRefToTrie(BytesRef k, Output v) { +return new TrieBuilder(k, v); + } + + private TrieBuilder(BytesRef k, Output v) { +minKey = maxKey = BytesRef.deepCopyOf(k); +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? v : null; + Node node = new Node(b, output); + parent.firstChild = parent.lastChild = node; + parent.childrenNum = 1; + parent = node; +} + } + + /** + * Absorb all (K, V) pairs from the given trie into this one. The given trie builder need to + * ensure its keys greater or equals than max key of this one. + * + * Note: the given trie will be destroyed after absorbing. + */ + void absorb(TrieBuilder trieBuilder) { Review Comment: Maybe rename to `append`? The two tries are strictly orthogonal, and, the incoming trie is > `this` one? ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,632 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, V
Re: [I] Reuse packedTerms between two TermInSetQuery which are combined by IndexOrDocValuesQuery [lucene]
mkhludnev commented on issue #14425: URL: https://github.com/apache/lucene/issues/14425#issuecomment-2781083660 To be released in 10.3 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValue… [lucene]
mkhludnev merged PR #14442: URL: https://github.com/apache/lucene/pull/14442 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Support incremental refresh in Searcher Managers. [lucene]
vigyasharma opened a new pull request, #14443: URL: https://github.com/apache/lucene/pull/14443 In segment based replication systems, a large replication payload (checkpoint) can induce heavy page faults, cause thrashing for in-flight search requests, and affect overall search performance. A potential way to handle these bursts, is to leverage multiple commit points in the Lucene index. Instead of refreshing to the latest commit for a large replication payload, searchers can intelligently select the commit point that they can safely absorb. By processing through multiple such points, searchers can eventually get to the latest commit, without incurring too many page faults. This change lets users define a commit selection strategy, controlling which commit the searcher manager refreshes on. Addresses #14219 **Usage:** To incrementally refresh through multiple commit points until searcher is current with its directory: - Define a commit selection strategy using the `RefreshCommitSupplier` interface. - Update searcher managers with this strategy via `setRefreshCommitSupplier()` - Invoke `maybeRefresh()` or `maybeRefreshBlocking` in a loop until `isSearcherCurrent()` returns true. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
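A hedged usage sketch based on the steps above; `RefreshCommitSupplier`, `setRefreshCommitSupplier()`, and `isSearcherCurrent()` come from the PR description, but the exact interface shape and the commit-picking helper below are assumptions for illustration.

```java
SearcherManager mgr = new SearcherManager(directory, new SearcherFactory());
mgr.setRefreshCommitSupplier(commits -> pickNextSafeCommit(commits)); // hypothetical strategy
while (mgr.isSearcherCurrent() == false) {
  mgr.maybeRefreshBlocking(); // absorbs one selected commit per iteration
}
```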