Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]
gf2121 commented on PR #14371: URL: https://github.com/apache/lucene/pull/14371#issuecomment-2739828325 Thanks @vsop-479, have you been able to measure the performance of your patch? I had a similar idea recently. If you look at the newest code in `Lucene101PostingsReader`, you may find we are using `VectorMask` to speed this up; that was what I had in mind: get a MemorySegment slice if it is not null, and play with it using a VectorMask. https://github.com/apache/lucene/blob/b75602428f78bd79296eb3b6be27465961b63d09/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java#L781-L787 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
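For readers following along, here is a hedged scalar sketch of the operation under discussion (the class and method names are illustrative, not Lucene's actual code): an IndexedDISI sparse block stores its doc IDs as a sorted array of unsigned shorts, and advancing means finding the first entry greater than or equal to the target. The `VectorMask` approach gf2121 links to compares a whole lane of shorts against the target at once and reads the first matching lane out of the mask, instead of looping one element at a time as below.

```java
// Illustrative scalar equivalent of the masked-compare "advance":
// find the first doc in a sorted sparse block that is >= target.
class SparseAdvanceSketch {
  static int advanceInBlock(short[] docs, int from, int length, int target) {
    for (int i = from; i < length; i++) {
      // doc IDs within a sparse block are stored as unsigned shorts
      if ((docs[i] & 0xFFFF) >= target) {
        return i;
      }
    }
    return -1; // target lies beyond this block
  }
}
```

The vectorized version replaces the loop body with one compare over a full lane of shorts, so the cost per block scan drops roughly by the lane width.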
Re: [I] Handling concurrent search in QueryProfiler [lucene]
jainankitk commented on issue #14375: URL: https://github.com/apache/lucene/issues/14375#issuecomment-2739471091

> Maybe it tries to do too much by providing min/avg/max aggregates and it should just provide per-slice breakdowns, leaving whether and how to compile aggregates to the application?

I agree with you, especially when the number of slices is < 5. In those cases, we are outputting almost the same number of metrics across all the slices, without this lossy reduction.

> Out of curiosity, how do you know if two calls to advance() come from the same slice or different slices?

The timers are built for each leaf, and we aggregate these timers while building the profiler output for each slice using `Map> sliceCollectorsToLeaves`. This `sliceCollectorsToLeaves` is incrementally built as part of the `searchLeaf` method. It relies on the fact that we only need this slice->leaf mapping when building the final response, by which point `searchLeaf` has been invoked for all the leaves and we know which leaves share a collector and are thus part of the same slice.
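The aggregation scheme described above can be sketched as follows. This is a hypothetical, simplified model, not the actual QueryProfiler API: leaves and collectors are reduced to strings, and timers to nanosecond counters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: per-leaf timers are grouped into per-slice totals via a
// collector -> leaves map filled in incrementally as each leaf is searched.
class SliceTimers {
  final Map<String, List<String>> sliceCollectorsToLeaves = new HashMap<>();
  final Map<String, Long> leafTimerNanos = new HashMap<>();

  // called once per leaf; records which collector (i.e. slice) saw it
  void searchLeaf(String leaf, String collector, long elapsedNanos) {
    sliceCollectorsToLeaves.computeIfAbsent(collector, c -> new ArrayList<>()).add(leaf);
    leafTimerNanos.merge(leaf, elapsedNanos, Long::sum);
  }

  // only valid once every leaf has been searched: leaves sharing a
  // collector belong to the same slice, so sum their timers per slice
  Map<String, Long> perSliceTotals() {
    Map<String, Long> totals = new HashMap<>();
    for (Map.Entry<String, List<String>> e : sliceCollectorsToLeaves.entrySet()) {
      long sum = 0;
      for (String leaf : e.getValue()) {
        sum += leafTimerNanos.getOrDefault(leaf, 0L);
      }
      totals.put(e.getKey(), sum);
    }
    return totals;
  }
}
```

The key property, as the comment notes, is that `perSliceTotals()` is only called after every `searchLeaf` invocation has completed, so the collector-to-leaves map is guaranteed to be fully populated.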
Re: [PR] Speedup merging of HNSW graphs [lucene]
mayya-sharipova commented on code in PR #14331: URL: https://github.com/apache/lucene/pull/14331#discussion_r2005416489 ## lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java: ## @@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues mergedVectorValues, int maxO OnHeapHnswGraph graph; BitSet initializedNodes = null; -if (initReader == null) { +if (graphReaders.size() == 0) { graph = new OnHeapHnswGraph(M, maxOrd); } else { + graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed()); + GraphReader initGraphReader = graphReaders.get(0); + KnnVectorsReader initReader = initGraphReader.reader(); + MergeState.DocMap initDocMap = initGraphReader.initDocMap(); + int initGraphSize = initGraphReader.graphSize(); HnswGraph initializerGraph = ((HnswGraphProvider) initReader).getGraph(fieldInfo.name); + if (initializerGraph.size() == 0) { graph = new OnHeapHnswGraph(M, maxOrd); } else { initializedNodes = new FixedBitSet(maxOrd); -int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, initializedNodes); +int[] oldToNewOrdinalMap = +getNewOrdMapping( +fieldInfo, +initReader, +initDocMap, +initGraphSize, +mergedVectorValues, +initializedNodes); graph = InitializedHnswGraphBuilder.initGraph(initializerGraph, oldToNewOrdinalMap, maxOrd); } } return new HnswConcurrentMergeBuilder( taskExecutor, numWorker, scorerSupplier, beamWidth, graph, initializedNodes); } + + /** + * Creates a new mapping from old ordinals to new ordinals and returns the total number of vectors + * in the newly merged segment. 
+ * + * @param mergedVectorValues vector values in the merged segment + * @param initializedNodes track what nodes have been initialized + * @return the mapping from old ordinals to new ordinals + * @throws IOException If an error occurs while reading from the merge state + */ + private static final int[] getNewOrdMapping( + FieldInfo fieldInfo, + KnnVectorsReader initReader, + MergeState.DocMap initDocMap, + int initGraphSize, + KnnVectorValues mergedVectorValues, + BitSet initializedNodes) + throws IOException { +KnnVectorValues.DocIndexIterator initializerIterator = null; + +switch (fieldInfo.getVectorEncoding()) { + case BYTE -> initializerIterator = initReader.getByteVectorValues(fieldInfo.name).iterator(); + case FLOAT32 -> + initializerIterator = initReader.getFloatVectorValues(fieldInfo.name).iterator(); +} + +IntIntHashMap newIdToOldOrdinal = new IntIntHashMap(initGraphSize); +int maxNewDocID = -1; +for (int docId = initializerIterator.nextDoc(); +docId != NO_MORE_DOCS; +docId = initializerIterator.nextDoc()) { + int newId = initDocMap.get(docId); + maxNewDocID = Math.max(newId, maxNewDocID); + newIdToOldOrdinal.put(newId, initializerIterator.index()); +} + +if (maxNewDocID == -1) { + return new int[0]; +} +final int[] oldToNewOrdinalMap = new int[initGraphSize]; +KnnVectorValues.DocIndexIterator mergedVectorIterator = mergedVectorValues.iterator(); +for (int newDocId = mergedVectorIterator.nextDoc(); +newDocId <= maxNewDocID; +newDocId = mergedVectorIterator.nextDoc()) { + int hashDocIndex = newIdToOldOrdinal.indexOf(newDocId); + if (newIdToOldOrdinal.indexExists(hashDocIndex)) { Review Comment: @benwtrent Thanks for the comment. This piece is moved from [IncrementalHNSWGraphMerger](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/IncrementalHnswGraphMerger.java#L154) without modifications. > Is this stuff around indexOf indexExists, etc. just performance improvements over a simple newIdToOldOrdinal.get(...) 
Agreed about `IntIntHashMap`, and maybe this piece can be optimized. But as it is not part of this PR, I will leave a TODO item for a follow-up.
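For context on the `indexOf`/`indexExists` question raised in the review, here is a minimal open-addressing sketch of that hash-map access pattern. This is a simplified illustration (no resizing, no removal), not HPPC's actual implementation: one hash probe yields a slot index that can be tested and then read directly, instead of hashing the key twice with a `containsKey(...)` followed by a `get(...)`.

```java
// Minimal linear-probing int->int map illustrating the
// indexOf / indexExists / indexGet access pattern.
class TinyIntIntMap {
  private final int[] keys;
  private final int[] values;
  private final boolean[] used;

  TinyIntIntMap(int capacity) {
    keys = new int[capacity];
    values = new int[capacity];
    used = new boolean[capacity];
  }

  // one hash probe: returns the slot holding key, or the empty slot where it would go
  int indexOf(int key) {
    int slot = Math.floorMod(key, keys.length);
    while (used[slot] && keys[slot] != key) {
      slot = (slot + 1) % keys.length;
    }
    return slot;
  }

  boolean indexExists(int slot) {
    return used[slot];
  }

  int indexGet(int slot) {
    return values[slot]; // no second probe needed
  }

  void put(int key, int value) {
    int slot = indexOf(key);
    used[slot] = true;
    keys[slot] = key;
    values[slot] = value;
  }
}
```

Besides saving the second probe, this pattern also disambiguates "key absent" from "key mapped to the default value", which a plain `get(...)` returning a sentinel cannot do.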
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2005519750 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java: ## @@ -0,0 +1,486 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class Trie { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +UNSAVED, +SAVED, +DESTROYED + } + + private static class Node { +private final int label; +private final LinkedList children; +private Output output; +private long fp = -1; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.UNSAVED; + final Node root = new Node(0, null, new LinkedList<>()); + + Trie(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? 
v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + void putAll(Trie trie) { +if (status != Status.UNSAVED || trie.status != Status.UNSAVED) { + throw new IllegalStateException("tries should be unsaved"); +} +trie.status = Status.DESTROYED; +putAll(this.root, trie.root); + } + + private static void putAll(Node n, Node add) { +assert n.label == add.label; +if (add.output != null) { + n.output = add.output; +} +ListIterator iter = n.children.listIterator(); +// TODO we can do more efficient if there is no intersection, block tree always do that +outer: +for (Node addChild : add.children) { + while (iter.hasNext()) { +Node nChild = iter.next(); +if (nChild.label == addChild.label) { + putAll(nChild, addChild); + continue outer; +} +if (nChild.label > addChild.label) { + iter.previous(); // move back + iter.add(addChild); + continue outer; +} + } + iter.add(addChild); +} + } + + Output getEmptyOutput() { +return root.output; + } + + void forEach(BiConsumer consumer) { +if (root.output != null) { + consumer.accept(new BytesRef(), root.output); +} +intersect(root.children, new BytesRefBuilder(), consumer); + } + + private void intersect( + List nodes, BytesRefBuilder key, BiConsumer consumer) { +for (Node node : nodes) { + key.append((byte) node.label); + if (node.output != null) consumer.accept(key.toBytesRef(), node.output); + intersect(node.children, key, consumer); + key.setLength(key.length() - 1); +} + } + + void save(DataOutput meta, IndexOutput index) throws IOException { +if (status != Status.UNSAVED) { + throw new IllegalStateException("only unsaved trie can be saved"); +} +status = Status.SAVED; +meta.writeVLong(index.getFilePointer()); +saveNodes(index); +meta.writeVLong(root.fp); +index.writeLong(0L); // additional 8 bytes for over-reading +meta.writeVLong(index.getFilePointer()); + } + + void saveNodes(IndexOutput index) throws IOException { +final long startFP = index.getFilePointer(); 
+Deque stack = new ArrayDeque<>(); +stack.p
Re: [PR] Adjust equivalent min similarity HNSW exploration logic [lucene]
mayya-sharipova commented on code in PR #14366: URL: https://github.com/apache/lucene/pull/14366#discussion_r2005708592 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java: ## @@ -266,11 +266,21 @@ void searchLevel( // A bound that holds the minimum similarity to the query vector that a candidate vector must // have to be considered. float minAcceptedSimilarity = Math.nextUp(results.minCompetitiveSimilarity()); +// We should allow exploring equivalent minAcceptedSimilarity values at least once +boolean shouldExploreMinSim = true; while (candidates.size() > 0 && results.earlyTerminated() == false) { // get the best candidate (closest or best scoring) float topCandidateSimilarity = candidates.topScore(); if (topCandidateSimilarity < minAcceptedSimilarity) { -break; +// if the similarity is equivalent to the minAcceptedSimilarity, we should explore one Review Comment: nit: format the comment a little better (auto format seems to break it strangely)
Re: [PR] Adjust equivalent min similarity HNSW exploration logic [lucene]
mayya-sharipova commented on code in PR #14366: URL: https://github.com/apache/lucene/pull/14366#discussion_r2005709557 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java: ## @@ -266,11 +266,21 @@ void searchLevel( // A bound that holds the minimum similarity to the query vector that a candidate vector must // have to be considered. float minAcceptedSimilarity = Math.nextUp(results.minCompetitiveSimilarity()); +// We should allow exploring equivalent minAcceptedSimilarity values at least once +boolean shouldExploreMinSim = true; while (candidates.size() > 0 && results.earlyTerminated() == false) { // get the best candidate (closest or best scoring) float topCandidateSimilarity = candidates.topScore(); if (topCandidateSimilarity < minAcceptedSimilarity) { -break; +// if the similarity is equivalent to the minAcceptedSimilarity, we should explore one +// candidate +// however, running into many duplicates can be expensive, so we should stop exploring if +// equivalent minimum scores are found +if (shouldExploreMinSim && Math.nextUp(candidates.topScore()) == minAcceptedSimilarity) { Review Comment: if (shouldExploreMinSim && Math.nextUp(candidates.topScore()) == minAcceptedSimilarity) => if (shouldExploreMinSim && Math.nextUp(topCandidateSimilarity) == minAcceptedSimilarity)
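The control flow being reviewed can be sketched in isolation as below. This is a hedged, self-contained model (names follow the diff, but the candidate queue is reduced to a best-first array): a candidate whose score sits exactly one ulp below the accepted bound, i.e. a tie with the minimum competitive similarity, is explored once, after which further ties stop the loop to avoid scanning many duplicates.

```java
// Sketch of "explore equivalent minimum scores at most once".
class TieExplorationSketch {
  static int countExplored(float[] sortedCandidates, float minAcceptedSimilarity) {
    boolean shouldExploreMinSim = true;
    int explored = 0;
    for (float topCandidateSimilarity : sortedCandidates) { // best-first order
      if (topCandidateSimilarity < minAcceptedSimilarity) {
        // a score one ulp below the bound is an "equivalent" minimum score
        if (shouldExploreMinSim
            && Math.nextUp(topCandidateSimilarity) == minAcceptedSimilarity) {
          shouldExploreMinSim = false; // explore this tie, but no further ones
        } else {
          break;
        }
      }
      explored++;
    }
    return explored;
  }
}
```

Mayya's suggested change is visible here too: reusing the already-read `topCandidateSimilarity` avoids popping the queue's top score a second time.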
Re: [PR] Speedup merging of HNSW graphs [lucene]
mayya-sharipova merged PR #14331: URL: https://github.com/apache/lucene/pull/14331
Re: [I] Case insensitive regex query with character range [lucene]
rmuir commented on issue #14378: URL: https://github.com/apache/lucene/issues/14378#issuecomment-2740658343 This isn't a bug; the regex parser simply does not have this feature. We can add it, but it must be an additional opt-in flag due to the performance tradeoffs involved.
[I] build support: java 24 [lucene]
rmuir opened a new issue, #14379: URL: https://github.com/apache/lucene/issues/14379 ### Description Java 23 has disappeared and has been replaced with Java 24. The build currently requires 23 exactly, which creates a hurdle for users, since it is difficult to get: it does not exist in operating system package managers, must be dug out of a graveyard, etc. Yes, I know the problem is Gradle here, but that doesn't excuse their behavior; they shouldn't be parsing class files of the JVM anyway.
Re: [I] build support: java 24 [lucene]
dweiss commented on issue #14379: URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740968727 Really nice indeed! Sadly, I think it'll take just about a million years before it propagates through all the layers until it can hit gradle (but I'd love to be proven wrong). ;)
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2006940286 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,552 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class TrieBuilder { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + static final int LEAF_NODE_HAS_TERMS = 1 << 5; + static final int LEAF_NODE_HAS_FLOOR = 1 << 6; + static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1; + static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0; + + /** + * The output describing the term block the prefix point to. + * + * @param fp describes the on-disk terms block which a trie node points to. + * @param hasTerms A boolean which will be false if this on-disk block consists entirely of + * pointers to child blocks. + * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing + * a single trie prefix is split into multiple on-disk blocks. + */ + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +BUILDING, +SAVED, +DESTROYED + } + + private static class Node { + +// The utf8 digit that leads to this Node, 0 for root node +private final int label; +// The children listed in order by their utf8 label +private final LinkedList children; +// The output of this node. +private Output output; + +// Vars used during saving: + +// The file pointer point to where the node saved. -1 means the node has not been saved. 
+private long fp = -1; +// The iterator whose next() point to the first child has not been saved. +private Iterator childrenIterator; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.BUILDING; + final Node root = new Node(0, null, new LinkedList<>()); + + static TrieBuilder bytesRefToTrie(BytesRef k, Output v) { +return new TrieBuilder(k, v); + } + + private TrieBuilder(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + /** + * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not + * have key that already exists in this one, otherwise a {@link IllegalArgumentException } will be + * thrown and this trie will get destroyed. + * + * Note: the given trie will be destroyed after absorbing. + */ + void absorb(TrieBuilder trieBuilder) { +if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) { + throw new IllegalStateException("tries should be unsaved"); +} +// Use a simple stack to avoid recursion. +Deque stack = new ArrayDeque<>(); +stack.add(() -> absorb(this.root, trieBuilder.root, stack)); +while (!stack.isEmpty()) { + stack.pop().run(); +} +trieBuilder.status = Status.DESTROYED; + } + + private void absorb(Node n, Node add, Deque stack) { +assert n.label == add.label; +if (add.output != null) { + if (n.output != null) { +
Re: [PR] Speedup merging of HNSW graphs (#14331) [lucene]
mayya-sharipova merged PR #14380: URL: https://github.com/apache/lucene/pull/14380
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2006901371 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,552 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class TrieBuilder { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + static final int LEAF_NODE_HAS_TERMS = 1 << 5; + static final int LEAF_NODE_HAS_FLOOR = 1 << 6; + static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1; + static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0; + + /** + * The output describing the term block the prefix point to. + * + * @param fp describes the on-disk terms block which a trie node points to. + * @param hasTerms A boolean which will be false if this on-disk block consists entirely of + * pointers to child blocks. + * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing + * a single trie prefix is split into multiple on-disk blocks. + */ + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +BUILDING, +SAVED, +DESTROYED + } + + private static class Node { + +// The utf8 digit that leads to this Node, 0 for root node +private final int label; +// The children listed in order by their utf8 label +private final LinkedList children; +// The output of this node. +private Output output; + +// Vars used during saving: + +// The file pointer point to where the node saved. -1 means the node has not been saved. 
+private long fp = -1; +// The iterator whose next() point to the first child has not been saved. +private Iterator childrenIterator; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.BUILDING; + final Node root = new Node(0, null, new LinkedList<>()); + + static TrieBuilder bytesRefToTrie(BytesRef k, Output v) { +return new TrieBuilder(k, v); + } + + private TrieBuilder(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + /** + * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not + * have key that already exists in this one, otherwise a {@link IllegalArgumentException } will be + * thrown and this trie will get destroyed. + * + * Note: the given trie will be destroyed after absorbing. + */ + void absorb(TrieBuilder trieBuilder) { +if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) { + throw new IllegalStateException("tries should be unsaved"); +} +// Use a simple stack to avoid recursion. +Deque stack = new ArrayDeque<>(); +stack.add(() -> absorb(this.root, trieBuilder.root, stack)); +while (!stack.isEmpty()) { + stack.pop().run(); +} +trieBuilder.status = Status.DESTROYED; + } + + private void absorb(Node n, Node add, Deque stack) { +assert n.label == add.label; +if (add.output != null) { + if (n.output != null) { +
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2006904572

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnumFrame.java: ##

@@ -89,8 +89,6 @@ final class IntersectTermsEnumFrame {

   final ByteArrayDataInput bytesReader = new ByteArrayDataInput();

-  int outputNum;

Review Comment: This was previously used in IntersectTermsEnum.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] RegExp: add CASE_INSENSITIVE_RANGE support [lucene]
dweiss commented on code in PR #14381: URL: https://github.com/apache/lucene/pull/14381#discussion_r2006930601

## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java: ##

@@ -778,6 +786,53 @@ private int[] toCaseInsensitiveChar(int codepoint) {
     }
   }

+  /**
+   * Expands range to include case-insensitive matches.
+   *
+   * <p>This is expensive: case-insensitive range involves iterating over the range space, adding
+   * alternatives. Jump on the grenade here, contain CPU and memory explosion just to this method
+   * activated by optional flag.
+   */
+  private void expandCaseInsensitiveRange(
+      int start, int end, List<Integer> rangeStarts, List<Integer> rangeEnds) {
+    if (start > end)
+      throw new IllegalArgumentException(
+          "invalid range: from (" + start + ") cannot be > to (" + end + ")");
+
+    // contain the explosion of transitions by using a throwaway state
+    Automaton scratch = new Automaton();
+    int state = scratch.createState();
+
+    // iterate over range, adding codepoint and any alternatives as transitions
+    for (int i = start; i <= end; i++) {
+      scratch.addTransition(state, state, i);
+      int[] altCodePoints = CaseFolding.lookupAlternates(i);
+      if (altCodePoints != null) {
+        for (int alt : altCodePoints) {
+          scratch.addTransition(state, state, alt);
+        }
+      } else {
+        int altCase =
+            Character.isLowerCase(i) ? Character.toUpperCase(i) : Character.toLowerCase(i);
+        if (altCase != i) {
+          scratch.addTransition(state, state, altCase);
+        }
+      }
+    }

Review Comment: I think this pattern is used in more than one place. Maybe it'd be nicer to make CaseFolding.forAllAlternatives(altCase -> scratch.addTransition(state, state, altCase)) and have all that logic for checking for alternatives embedded in CaseFolding?
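The refactor suggested in the review might look roughly like this. This is a hypothetical sketch: the real `CaseFolding.lookupAlternates` table from the PR is replaced here by plain `Character` case mapping, so only the shape of the proposed callback API is shown.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch of the suggested callback-style helper. The PR's
// CaseFolding consults a lookup table of special alternates first (e.g. the
// Kelvin sign); here we only keep the simple upper/lower fallback.
final class CaseFoldingSketch {

  // Invokes the consumer once per case-insensitive alternative of the codepoint,
  // never for the codepoint itself.
  static void forAllAlternatives(int codepoint, IntConsumer consumer) {
    int alt =
        Character.isLowerCase(codepoint)
            ? Character.toUpperCase(codepoint)
            : Character.toLowerCase(codepoint);
    if (alt != codepoint) {
      consumer.accept(alt);
    }
  }
}
```

A call site would then shrink to something like `forAllAlternatives(i, alt -> scratch.addTransition(state, state, alt));`, keeping the table-versus-`Character` fallback logic in one place, which is the duplication the review is pointing at.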
Re: [I] NRT replication should make it possible/easy to use bite-sized commits [lucene]
vigyasharma commented on issue #14219: URL: https://github.com/apache/lucene/issues/14219#issuecomment-2742476154 > The searchers can then carefully pick and choose which commit points they want to switch too, in a bite sized / stepping stone manner The key here is making searchers refresh on a point-in-time commit instead of the latest one. Searcher refresh today, for classes like `SearcherManager`, `SearcherTaxonomyManager`, or `ReaderManager` is implemented in `refreshIfNeeded()`, which in turn, calls `DirectoryReader.openIfChanged(oldReader)`. This opens a new reader on the latest commit available in the directory. For bite-sized commits, we would want to use the `DirectoryReader.openIfChanged(oldReader, commit)` method, and open readers on the specific commit we select. `refreshIfNeeded` is invoked via the base class method – `ReferenceManager#maybeRefresh()`, which takes care of reference management tasks like acquiring exclusive locks, invoking refreshIfNeeded, doing an atomic swap if needed, releasing the old reference, and notifying listeners. This API however, does not accept an index commit. It also doesn't feel right to modify this API or add new ones, since that would break the reference agnostic nature of ReferenceManager. One way to achieve specific commit refresh, is for the applications to extend `SearcherManager` (or `SearcherTaxonomyManager`) and override `refreshIfNeeded`. However, this can be a bit involved as applications may have to re-implement some logic (like searcher and taxonomy coupling in SearcherTaxonomyManager). To support this out of box within Lucene, we could modify the existing `SearcherManager` / `SearcherTaxonomyManager` to use the commit specific DirectoryReader refresh, and provide hooks within these classes to select the ideal commit. Like a `RefreshCommitSelector` policy that users can implement to supply the refresh commit. 
The policy could default to returning the latest directory commit, to stay in line with current behavior. Expert users could supply more nuanced logic, like picking the newest commit that has < N bytes worth of new segment files. Alternatively, we could add a new `NRTSearcherManager` with the above-mentioned support. I think these hooks can be useful for segment replication. Would like to hear from the community on whether this seems like a useful thing to support from within Lucene.
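A rough sketch of what the proposed hook could look like. All names here are hypothetical (only `RefreshCommitSelector` is taken from the comment above), and a plain `CommitInfo` record stands in for Lucene's `IndexCommit` so the sketch stays self-contained:

```java
import java.util.List;

// Hypothetical shape of the proposed refresh-commit hook.
interface RefreshCommitSelector {

  // Stand-in for org.apache.lucene.index.IndexCommit; assumes the caller can
  // compute how many new segment-file bytes a commit would pull in.
  record CommitInfo(long generation, long newBytesSinceLastRefresh) {}

  // Pick the commit the searcher should refresh to; commits are oldest-first.
  CommitInfo select(List<CommitInfo> commits);

  // Default policy: latest commit, matching current SearcherManager behavior.
  static RefreshCommitSelector latest() {
    return commits -> commits.get(commits.size() - 1);
  }

  // Example "bite-sized" policy: newest commit with < maxBytes of new files.
  static RefreshCommitSelector biteSized(long maxBytes) {
    return commits -> {
      for (int i = commits.size() - 1; i >= 0; i--) {
        if (commits.get(i).newBytesSinceLastRefresh() < maxBytes) {
          return commits.get(i);
        }
      }
      return commits.get(0); // fall back to the oldest commit
    };
  }
}
```

In the design described above, `SearcherManager#refreshIfNeeded` would consult such a selector and pass the chosen commit to `DirectoryReader.openIfChanged(oldReader, commit)` instead of always opening the latest commit.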
Re: [PR] A specialized Trie for Block Tree Index [lucene]
mikemccand commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2005462386

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java: ##

@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class Trie {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+    UNSAVED,
+    SAVED,
+    DESTROYED
+  }
+
+  private static class Node {
+    private final int label;
+    private final LinkedList<Node> children;
+    private Output output;
+    private long fp = -1;
+
+    Node(int label, Output output, LinkedList<Node> children) {
+      this.label = label;
+      this.output = output;
+      this.children = children;
+    }
+  }
+
+  private Status status = Status.UNSAVED;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  Trie(BytesRef k, Output v) {
+    if (k.length == 0) {
+      root.output = v;
+      return;
+    }
+    Node parent = root;
+    for (int i = 0; i < k.length; i++) {
+      int b = k.bytes[i + k.offset] & 0xFF;
+      Output output = i == k.length - 1 ? v : null;
+      Node node = new Node(b, output, new LinkedList<>());
+      parent.children.add(node);
+      parent = node;
+    }
+  }
+
+  void putAll(Trie trie) {
+    if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
+      throw new IllegalStateException("tries should be unsaved");
+    }
+    trie.status = Status.DESTROYED;
+    putAll(this.root, trie.root);
+  }
+
+  private static void putAll(Node n, Node add) {
+    assert n.label == add.label;
+    if (add.output != null) {
+      n.output = add.output;
+    }
+    ListIterator<Node> iter = n.children.listIterator();
+    // TODO we can do more efficient if there is no intersection, block tree always do that
+    outer:
+    for (Node addChild : add.children) {
+      while (iter.hasNext()) {
+        Node nChild = iter.next();
+        if (nChild.label == addChild.label) {
+          putAll(nChild, addChild);
+          continue outer;
+        }
+        if (nChild.label > addChild.label) {

Review Comment: Maybe document and `assert` this somewhere? I think it's non-obvious ... it's the merge sorting in `absorb` that makes it so I think?
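The sorted-children invariant the review is asking about can be seen in a standalone sketch of the same zipper merge, simplified to bare integer labels (illustrative only, not the PR's code). Both lists must be ascending; that is what makes the early `>` exit safe and the whole merge O(M + N):

```java
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

class ZipperMerge {

  // Merge sorted `add` into sorted `n` in place, mirroring Trie.putAll.
  static void mergeInto(LinkedList<Integer> n, List<Integer> add) {
    ListIterator<Integer> iter = n.listIterator();
    outer:
    for (int addChild : add) {
      while (iter.hasNext()) {
        int nChild = iter.next();
        if (nChild == addChild) {
          continue outer; // label already present; the trie recurses here
        }
        if (nChild > addChild) {
          iter.previous(); // step back so the insert lands before nChild
          iter.add(addChild);
          continue outer;
        }
      }
      iter.add(addChild); // ran past the end: append
    }
  }
}
```

Because the iterator never restarts from the head, each element of either list is visited at most once, which is the O(M + N) "zipper" behavior discussed in this thread.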
Re: [PR] Speedup merging of HNSW graphs [lucene]
mayya-sharipova commented on code in PR #14331: URL: https://github.com/apache/lucene/pull/14331#discussion_r2005464935

## lucene/core/src/java/org/apache/lucene/util/hnsw/MergingHnswGraphBuilder.java: ##

@@ -0,0 +1,198 @@
+package org.apache.lucene.util.hnsw;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.lucene.util.BitSet;
+
+/**
+ * A graph builder that is used during segments' merging.
+ *
+ * <p>This builder uses a smart algorithm to merge multiple graphs into a single graph. The
+ * algorithm is based on the idea that if we know where we want to insert a node, we have a good
+ * idea of where we want to insert its neighbours.
+ *
+ * <p>The algorithm is based on the following steps:
+ *
+ * 1. Get all graphs that don't have deletions and sort them by size.
+ * 2. Copy the largest graph to the new graph (gL).
+ * 3. For each remaining small graph (gS):
+ *    - Find the nodes that best cover gS: join set `j`. These nodes will be inserted into gL as
+ *      usual: by searching gL to find the best candidates `w` to which to connect the nodes.
+ *    - For each remaining node in gS, we do NOT search in gL. Instead, we form `w` by the union
+ *      of the node's neighbors in gS and the node's neighbors' neighbors in gL.
+ *
+ * <p>We expect the size of join set `j` to be small, 1/5-1/2 of the size of gS. And for the rest
+ * of the nodes of gS, we expect savings by not doing an extra search in gL.
+ *
+ * @lucene.experimental
+ */

Review Comment: Addressed in cb852a6387a09ba43049b8a24f1e026c309b368b
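The "union of neighbors" step for nodes outside the join set `j` can be sketched with plain adjacency maps standing in for the two HNSW graphs. This is illustrative only; the real builder works against Lucene's `HnswGraph` API and diversity-scores the candidates rather than taking a raw union:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CandidateUnion {

  // For a gS node outside the join set, form the candidate set `w` without
  // searching gL: union of the node's neighbors in gS with those neighbors'
  // own neighborhoods in gL (hypothetical int-keyed adjacency maps).
  static Set<Integer> candidates(
      int node, Map<Integer, Set<Integer>> gS, Map<Integer, Set<Integer>> gL) {
    Set<Integer> w = new HashSet<>();
    for (int neighbor : gS.getOrDefault(node, Set.of())) {
      w.add(neighbor);
      // neighbors already inserted into gL contribute their gL neighborhood
      for (int second : gL.getOrDefault(neighbor, Set.of())) {
        if (second != node) {
          w.add(second);
        }
      }
    }
    return w;
  }
}
```

The saving comes from replacing a full beam search over gL (the expensive part of a normal HNSW insert) with this cheap local lookup, which the PR expects to apply to the majority of gS nodes.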
Re: [PR] A specialized Trie for Block Tree Index [lucene]
mikemccand commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2005463552

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java: ##

+  void save(DataOutput meta, IndexOutput index) throws IOException {
+    if (status != Status.UNSAVED) {
+      throw new IllegalStateException("only unsaved trie can be saved");
+    }
+    status = Status.SAVED;

Review Comment: +1 to design for today!
Re: [PR] A specialized Trie for Block Tree Index [lucene]
mikemccand commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2005466812

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java: ##

+  private static void putAll(Node n, Node add) {
+    assert n.label == add.label;
+    if (add.output != null) {
+      n.output = add.output;
+    }
+    ListIterator<Node> iter = n.children.listIterator();
+    // TODO we can do more efficient if there is no intersection, block tree always do that
+    outer:
+    for (Node addChild : add.children) {
+      while (iter.hasNext()) {
+        Node nChild = iter.next();
+        if (nChild.label == addChild.label) {
+          putAll(nChild, addChild);

Review Comment: Let's explicitly test massive terms sent to `TrieBuilder`?
Re: [PR] A specialized Trie for Block Tree Index [lucene]
mikemccand commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2005473348

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java: ##

+  void save(DataOutput meta, IndexOutput index) throws IOException {
+    if (status != Status.UNSAVED) {
+      throw new IllegalStateException("only unsaved trie can be saved");
+    }
+    status = Status.SAVED;
+    meta.writeVLong(index.getFilePointer());
+    saveNodes(index);
+    meta.writeVLong(root.fp);
+    index.writeLong(0L); // additional 8 bytes for over-reading
+    meta.writeVLong(index.getFilePointer());
+  }
+
+  void saveNodes(IndexOutput index) throws IOException {
+    final long startFP = index.getFilePointer();
+    Deque<Node> stack = new ArrayDeque<>();
+    sta
[I] Case insensitive regex query with character range [lucene]
petrsimon opened a new issue, #14378: URL: https://github.com/apache/lucene/issues/14378

### Description

Hi, I'm implementing regex search in a Java app with Elasticsearch 8.* and I've noticed unexpected behaviour with the `CASE_INSENSITIVE` flag. It seems that Lucene ignores the flag in the case of character ranges, and I'm getting the following results when searching fields analyzed by KeywordAnalyzer:

```
"[abc]+" -> "abc", "ABC"
"[ABC]+" -> "abc", "ABC"
"[a-c]+" -> "abc"
"[A-C]+" -> "ABC"
```

Reproduced in https://github.com/petrsimon/lucene-regex

### Version and environment details

_No response_
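The asymmetry reported above is what you would expect if case folding is applied per explicitly listed class member but never to a range's interior: `[abc]` stores three codepoints that can each be folded, while `[a-c]` stores only its endpoints. A stdlib-only illustration of that distinction (not Lucene's implementation):

```java
import java.util.HashSet;
import java.util.Set;

class ClassVsRangeFolding {

  // Folding an explicit character class is cheap: fold each listed member.
  // A range like [a-c] only stores its endpoints, so per-member folding
  // never touches its interior - which is why the related Lucene PR has to
  // iterate the whole range space to expand it case-insensitively.
  static Set<Integer> foldMembers(Set<Integer> members) {
    Set<Integer> folded = new HashSet<>(members);
    for (int cp : members) {
      folded.add(Character.toUpperCase(cp));
      folded.add(Character.toLowerCase(cp));
    }
    return folded;
  }
}
```

Folding the members of `[abc]` yields `{a, b, c, A, B, C}`, matching both "abc" and "ABC"; with a range there are no members to fold, so `[a-c]` matches only "abc".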
Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]
vsop-479 commented on PR #14371: URL: https://github.com/apache/lucene/pull/14371#issuecomment-2740387593

Thanks for your feedback @gf2121. This patch is still in progress, and has not been measured yet.

> If you look at newest code in Lucene101PostingsReader, you may find we are using VectorMask to speed up this

Thanks for the reminder, I noticed the vectorization approach when I found https://github.com/apache/lucene/pull/13692 had been reverted. But I am not sure we can use vectors for `IndexedDISI.slice`.

> that was what i had in mind - get a MemorySegment slice if it is not null, and play it with VectorMask.

That would be nice, I just noticed `ShortVector#fromMemorySegment`.
Re: [I] Case insensitive regex query with character range [lucene]
petrsimon commented on issue #14378: URL: https://github.com/apache/lucene/issues/14378#issuecomment-2740838289

I see, thanks a lot!
Re: [I] Case insensitive regex query with character range [lucene]
rmuir commented on issue #14378: URL: https://github.com/apache/lucene/issues/14378#issuecomment-2740916833

If the regexp parser doesn't document that it has the feature, then it doesn't support it: https://lucene.apache.org/core/10_1_0/core/org/apache/lucene/util/automaton/RegExp.html
Re: [PR] A specialized Trie for Block Tree Index [lucene]
mikemccand commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2005458945 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java: ## @@ -0,0 +1,486 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class Trie { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +UNSAVED, +SAVED, +DESTROYED + } + + private static class Node { +private final int label; +private final LinkedList children; +private Output output; +private long fp = -1; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.UNSAVED; + final Node root = new Node(0, null, new LinkedList<>()); + + Trie(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? 
v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + void putAll(Trie trie) { +if (status != Status.UNSAVED || trie.status != Status.UNSAVED) { + throw new IllegalStateException("tries should be unsaved"); +} +trie.status = Status.DESTROYED; +putAll(this.root, trie.root); + } + + private static void putAll(Node n, Node add) { +assert n.label == add.label; +if (add.output != null) { + n.output = add.output; +} +ListIterator iter = n.children.listIterator(); +// TODO we can do more efficient if there is no intersection, block tree always do that Review Comment: Ahhh, OK I see, it is like a merge sort ("zipper" two iterators together). I agree it's O(M+N). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
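mikemccand's "zipper" reading of `putAll` (merging two child lists that are each sorted by label, in one pass, O(M+N)) can be illustrated with a standalone sketch. The class and method names below are illustrative only, not the PR's code; in the real trie, equal labels trigger a recursive merge of children rather than a plain add:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "zipper" merge: two lists that are each sorted ascending are
// merged in a single pass, doing O(M+N) total work, like the merge step of
// merge sort. Illustrative names, not the PR's code.
public class ZipperMerge {
  static List<Integer> merge(List<Integer> a, List<Integer> b) {
    List<Integer> out = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
      int x = a.get(i), y = b.get(j);
      if (x == y) {            // same label: the real trie recursively merges children here
        out.add(x); i++; j++;
      } else if (x < y) {
        out.add(x); i++;
      } else {
        out.add(y); j++;
      }
    }
    while (i < a.size()) out.add(a.get(i++));  // drain the leftover tail
    while (j < b.size()) out.add(b.get(j++));
    return out;
  }

  public static void main(String[] args) {
    System.out.println(merge(List.of(1, 3, 5), List.of(2, 3, 8))); // [1, 2, 3, 5, 8]
  }
}
```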
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2005498999 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java: ## @@ -0,0 +1,486 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class Trie { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +UNSAVED, +SAVED, +DESTROYED + } + + private static class Node { +private final int label; +private final LinkedList children; +private Output output; +private long fp = -1; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.UNSAVED; + final Node root = new Node(0, null, new LinkedList<>()); + + Trie(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? 
v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + void putAll(Trie trie) { +if (status != Status.UNSAVED || trie.status != Status.UNSAVED) { + throw new IllegalStateException("tries should be unsaved"); +} +trie.status = Status.DESTROYED; +putAll(this.root, trie.root); + } + + private static void putAll(Node n, Node add) { +assert n.label == add.label; +if (add.output != null) { + n.output = add.output; +} +ListIterator iter = n.children.listIterator(); +// TODO we can do more efficient if there is no intersection, block tree always do that +outer: +for (Node addChild : add.children) { + while (iter.hasNext()) { +Node nChild = iter.next(); +if (nChild.label == addChild.label) { + putAll(nChild, addChild); + continue outer; +} +if (nChild.label > addChild.label) { + iter.previous(); // move back + iter.add(addChild); + continue outer; +} + } + iter.add(addChild); +} + } + + Output getEmptyOutput() { +return root.output; + } + + void forEach(BiConsumer consumer) { +if (root.output != null) { + consumer.accept(new BytesRef(), root.output); +} +intersect(root.children, new BytesRefBuilder(), consumer); + } + + private void intersect( + List nodes, BytesRefBuilder key, BiConsumer consumer) { +for (Node node : nodes) { + key.append((byte) node.label); + if (node.output != null) consumer.accept(key.toBytesRef(), node.output); + intersect(node.children, key, consumer); + key.setLength(key.length() - 1); +} + } + + void save(DataOutput meta, IndexOutput index) throws IOException { +if (status != Status.UNSAVED) { + throw new IllegalStateException("only unsaved trie can be saved"); +} +status = Status.SAVED; +meta.writeVLong(index.getFilePointer()); +saveNodes(index); +meta.writeVLong(root.fp); +index.writeLong(0L); // additional 8 bytes for over-reading +meta.writeVLong(index.getFilePointer()); + } + + void saveNodes(IndexOutput index) throws IOException { +final long startFP = index.getFilePointer(); 
+Deque stack = new ArrayDeque<>(); +stack.p
Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]
tteofili commented on PR #14094: URL: https://github.com/apache/lucene/pull/14094#issuecomment-2740827581 additional experiments with different quantization levels and filtering:

## No-filtering

### Baseline
```
recall  latency(ms)  nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.985   4.620        20    100   50      64       250        no         106.46    1878.64       3             600.08          585.938       585.938      HNSW
0.899   3.657        20    100   50      64       250        7 bits     67.74     2952.47       5             746.34          733.185       147.247      HNSW
0.585   2.328        20    100   50      64       250        4 bits     46.86     4268.03       3             675.33          659.943       74.005       HNSW
0.983   9.212        50    100   50      64       250        no         235.68    2121.56       8             1501.44         1464.844      1464.844     HNSW
0.900   7.562        50    100   50      64       250        7 bits     165.99    3012.30       9             1867.29         1832.962      368.118      HNSW
0.580   4.934        50    100   50      64       250        4 bits     130.65    3826.96       8             1689.29         1649.857      185.013      HNSW
```

### Candidate
```
recall  latency(ms)  nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  selectivity  vec_disk(MB)  vec_RAM(MB)  indexType
0.980   3.744        20    100   50      64       250        no         10690    106.82    1872.29       3             600.10          1.00         585.938       585.938      HNSW
0.896   3.473        20    100   50      64       250        7 bits     11878    68.83     2905.54       5             746.39          1.00         733.185       147.247      HNSW
0.585   2.032        20    100   50      64       250        4 bits     13279    51.32     3897.12       3             675.32          1.00         659.943       74.005       HNSW
0.982   8.549        50    100   50      64       250        no         23079    248.29    2013.81       8             1501.32         1.00         1464.844      1464.844     HNSW
0.898   6.733        50    100   50      64       250        7 bits     23629    167.17    2991.02       9             1867.31         1.00         1832.962      368.118      HNSW
0.581   3.776        50    100   50      64       250        4 bits     21179    152.43    3280.24       5             1690.38         1.00         1649.857      185.013      HNSW
```

## Filtering

### Baseline
```
recall  latency(ms)  nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  selectivity  vec_disk(MB)  vec_RAM(MB)  indexType
1.000   0.642        20    100   50      64       250        no         1965     109.81    1821.26       3             600.16          0.01         585.938       585.938      HNSW
0.964   4.947        20    100   50      64       250        no         9504     110.91    1803.33       3             600.11          0.10         585.938       585.938      HNSW
0.983   8.417        20    100   50      64       250        no         22193    103.13    1939.28       3             600.09          0.50         585.938       585.938      HNSW
0.918   0.762        20    100   50      64       250        7 bits     1981     64.33     3108.82       5             746.33          0.01         733.185       147.247      HNSW
0.892   4.310        20    100   50      64       250        7 bits     10302    66.23     3019.87       5             746.34          0.10         733.185       147.247      HNSW
0.898   6.900        20    100   50      64       250        7 bits     23394    69.09     2894.82       4             746.51          0.50         733.185       147.247      HNSW
0.660   1.137        20    100   50      64       250        4 bits     1695     50.01     3999.44       3             675.40          0.01         659.943       74.005       HNSW
0.619   2.852        20    100   50      64       250        4 bits     11021    49.88     4010.03       3             675.31          0.10         659.943       74.005       HNSW
0.592   4.429        20    100   50      64       250        4 bits     27121    48.72     4104.75       3             675.30          0.50         659.943       74.005       HNSW
1.000   2.371        50    100   50      64       250        no         5017     244.18    2047.64       8             1501.36         0.01         1464.844      1464.844     HNSW
0.968   11.976       50
```
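The recall column in these tables is the standard overlap metric for approximate nearest-neighbor search: the fraction of the exact top-k neighbors that the HNSW search also returned. A minimal sketch of how such a number is computed (illustrative only, not luceneutil's actual code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative recall@k: fraction of the exact (brute-force) top-k neighbor
// ids that the approximate search also returned. Not luceneutil's code.
public class Recall {
  static double recallAtK(List<Integer> exact, List<Integer> approx) {
    Set<Integer> truth = new HashSet<>(exact);
    long hits = approx.stream().filter(truth::contains).count();
    return (double) hits / exact.size();
  }

  public static void main(String[] args) {
    // 8 of the 10 true neighbors were found -> recall 0.8
    List<Integer> exact = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
    List<Integer> approx = List.of(0, 1, 2, 3, 4, 5, 6, 7, 20, 21);
    System.out.println(recallAtK(exact, approx)); // 0.8
  }
}
```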
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2006876856 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,552 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class TrieBuilder { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + static final int LEAF_NODE_HAS_TERMS = 1 << 5; + static final int LEAF_NODE_HAS_FLOOR = 1 << 6; + static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1; + static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0; + + /** + * The output describing the term block the prefix point to. + * + * @param fp describes the on-disk terms block which a trie node points to. + * @param hasTerms A boolean which will be false if this on-disk block consists entirely of + * pointers to child blocks. + * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing + * a single trie prefix is split into multiple on-disk blocks. + */ + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +BUILDING, +SAVED, +DESTROYED + } + + private static class Node { + +// The utf8 digit that leads to this Node, 0 for root node +private final int label; +// The children listed in order by their utf8 label +private final LinkedList children; +// The output of this node. +private Output output; + +// Vars used during saving: + +// The file pointer point to where the node saved. -1 means the node has not been saved. 
+private long fp = -1; +// The iterator whose next() point to the first child has not been saved. +private Iterator childrenIterator; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.BUILDING; + final Node root = new Node(0, null, new LinkedList<>()); + + static TrieBuilder bytesRefToTrie(BytesRef k, Output v) { +return new TrieBuilder(k, v); + } + + private TrieBuilder(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + /** + * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not + * have key that already exists in this one, otherwise a {@link IllegalArgumentException } will be + * thrown and this trie will get destroyed. + * + * Note: the given trie will be destroyed after absorbing. + */ + void absorb(TrieBuilder trieBuilder) { +if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) { + throw new IllegalStateException("tries should be unsaved"); +} +// Use a simple stack to avoid recursion. +Deque stack = new ArrayDeque<>(); +stack.add(() -> absorb(this.root, trieBuilder.root, stack)); +while (!stack.isEmpty()) { + stack.pop().run(); +} +trieBuilder.status = Status.DESTROYED; + } + + private void absorb(Node n, Node add, Deque stack) { +assert n.label == add.label; +if (add.output != null) { + if (n.output != null) { +
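The "Use a simple stack to avoid recursion" comment in the quoted `absorb()` reflects a general pattern: each pending subtree becomes a unit of work pushed onto an explicit `Deque<Runnable>`, drained by a loop, so arbitrarily deep tries cannot overflow the Java call stack. A self-contained sketch of the pattern (the `Node` shape is a simplified stand-in for the PR's private class):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Recursion-to-stack pattern: push one Runnable per pending subtree and drain
// the deque in a loop. Each unit of work may push more units. Illustrative
// stand-in types, not the PR's code.
public class StackTraversal {
  record Node(int label, List<Node> children) {}

  static List<Integer> labels(Node root) {
    List<Integer> out = new ArrayList<>();
    Deque<Runnable> stack = new ArrayDeque<>();
    stack.push(() -> visit(root, out, stack));
    while (!stack.isEmpty()) {
      stack.pop().run();
    }
    return out;
  }

  private static void visit(Node n, List<Integer> out, Deque<Runnable> stack) {
    out.add(n.label);
    // Push children in reverse so they are popped (and thus visited) in order.
    for (int i = n.children.size() - 1; i >= 0; i--) {
      Node c = n.children.get(i);
      stack.push(() -> visit(c, out, stack));
    }
  }

  public static void main(String[] args) {
    Node root = new Node(1, List.of(new Node(2, List.of()), new Node(3, List.of())));
    System.out.println(labels(root)); // pre-order: [1, 2, 3]
  }
}
```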
Re: [PR] Optimize ParallelLeafReader to improve term vector fetching efficiency [lucene]
DivyanshIITB commented on PR #14373: URL: https://github.com/apache/lucene/pull/14373#issuecomment-2741117231 Just a gentle reminder @vigyasharma
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2005865922 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,552 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class TrieBuilder { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + static final int LEAF_NODE_HAS_TERMS = 1 << 5; + static final int LEAF_NODE_HAS_FLOOR = 1 << 6; + static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1; + static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0; + + /** + * The output describing the term block the prefix point to. + * + * @param fp describes the on-disk terms block which a trie node points to. + * @param hasTerms A boolean which will be false if this on-disk block consists entirely of + * pointers to child blocks. + * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing + * a single trie prefix is split into multiple on-disk blocks. + */ + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +BUILDING, +SAVED, +DESTROYED + } + + private static class Node { + +// The utf8 digit that leads to this Node, 0 for root node +private final int label; +// The children listed in order by their utf8 label +private final LinkedList children; +// The output of this node. +private Output output; + +// Vars used during saving: + +// The file pointer point to where the node saved. -1 means the node has not been saved. 
+private long fp = -1; +// The iterator whose next() point to the first child has not been saved. +private Iterator childrenIterator; + +Node(int label, Output output, LinkedList children) { + this.label = label; + this.output = output; + this.children = children; +} + } + + private Status status = Status.BUILDING; + final Node root = new Node(0, null, new LinkedList<>()); + + static TrieBuilder bytesRefToTrie(BytesRef k, Output v) { +return new TrieBuilder(k, v); + } + + private TrieBuilder(BytesRef k, Output v) { +if (k.length == 0) { + root.output = v; + return; +} +Node parent = root; +for (int i = 0; i < k.length; i++) { + int b = k.bytes[i + k.offset] & 0xFF; + Output output = i == k.length - 1 ? v : null; + Node node = new Node(b, output, new LinkedList<>()); + parent.children.add(node); + parent = node; +} + } + + /** + * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not + * have key that already exists in this one, otherwise a {@link IllegalArgumentException } will be + * thrown and this trie will get destroyed. + * + * Note: the given trie will be destroyed after absorbing. + */ + void absorb(TrieBuilder trieBuilder) { +if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) { + throw new IllegalStateException("tries should be unsaved"); +} +// Use a simple stack to avoid recursion. +Deque stack = new ArrayDeque<>(); +stack.add(() -> absorb(this.root, trieBuilder.root, stack)); +while (!stack.isEmpty()) { + stack.pop().run(); +} +trieBuilder.status = Status.DESTROYED; + } + + private void absorb(Node n, Node add, Deque stack) { +assert n.label == add.label; +if (add.output != null) { + if (n.output != null) { +
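The quoted `TrieBuilder(BytesRef k, Output v)` constructor builds a single-key trie as a chain: one node per byte of the key (labels are unsigned bytes, `b & 0xFF`), with the output attached only to the node of the final byte, or to the root for an empty key. A simplified sketch with stand-in types (not the PR's `Node`/`Output`):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of single-key trie construction: one node per key byte, output on
// the last node only (or on the root for an empty key). Simplified stand-in
// types, not the PR's code.
public class SingleKeyTrie {
  static class Node {
    final int label;                               // unsigned byte, 0 for root
    final List<Node> children = new ArrayList<>();
    String output;                                 // stand-in for the PR's Output record
    Node(int label) { this.label = label; }
  }

  static Node fromKey(byte[] key, String output) {
    Node root = new Node(0);
    if (key.length == 0) {
      root.output = output;                        // empty key: output lives on the root
      return root;
    }
    Node parent = root;
    for (int i = 0; i < key.length; i++) {
      Node node = new Node(key[i] & 0xFF);         // unsigned byte label
      node.output = (i == key.length - 1) ? output : null;
      parent.children.add(node);
      parent = node;
    }
    return root;
  }

  static int depth(Node n) {
    int d = 0;
    while (!n.children.isEmpty()) { n = n.children.get(0); d++; }
    return d;
  }

  public static void main(String[] args) {
    Node root = fromKey(new byte[] {'a', 'b'}, "block@fp");
    System.out.println(depth(root)); // 2: one node per key byte
  }
}
```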
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2005852050 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java: ## @@ -0,0 +1,552 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.lucene90.blocktree; + +import java.io.IOException; +import java.util.ArrayDeque; +import java.util.Deque; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.ListIterator; +import java.util.function.BiConsumer; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.RandomAccessInput; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; + +/** TODO make it a more memory efficient structure */ +class TrieBuilder { + + static final int SIGN_NO_CHILDREN = 0x00; + static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01; + static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02; + static final int SIGN_MULTI_CHILDREN = 0x03; + + static final int LEAF_NODE_HAS_TERMS = 1 << 5; + static final int LEAF_NODE_HAS_FLOOR = 1 << 6; + static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1; + static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0; + + /** + * The output describing the term block the prefix point to. + * + * @param fp describes the on-disk terms block which a trie node points to. + * @param hasTerms A boolean which will be false if this on-disk block consists entirely of + * pointers to child blocks. + * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing + * a single trie prefix is split into multiple on-disk blocks. + */ + record Output(long fp, boolean hasTerms, BytesRef floorData) {} + + private enum Status { +BUILDING, +SAVED, +DESTROYED + } + + private static class Node { Review Comment: Double yes on read / write side! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [I] build support: java 24 [lucene]
rmuir commented on issue #14379: URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740938412 I tried allowing 24 and gradle only failed in the usual way (incompatible classfile): we have to wait for them to issue a gradle release that "supports 24" so they can parse the classfiles of the jdk :(
Re: [I] build support: java 24 [lucene]
dweiss commented on issue #14379: URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740939418 https://github.com/gradle/gradle/issues/32290
Re: [I] build support: java 24 [lucene]
dweiss commented on issue #14379: URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740930456

> build currently requires 23 exactly

This is terrible. I'll take a look.
Re: [I] build support: java 24 [lucene]
dweiss commented on issue #14379: URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740943620 They do tons of weird stuff these days that requires bytecode manipulation and touching everything upon loading. I don't think there is a way around it other than waiting for that issue to be solved. I'll keep an eye on it.