Re: [I] A little optimization about LZ4 [lucene]
kkewwei commented on issue #14347:
URL: https://github.com/apache/lucene/issues/14347#issuecomment-2743128817

> I am wondering if the default for BEST_SPEED should be using preset dict as that compromises speed for compression

It appears that certain scenarios might not be entirely appropriate for LZ4WithPresetDict. In principle, the read I/O does indeed increase several times over for small documents: for instance, if a small document is 1 KB and the dictionary is 4 KB, the read I/O increases fourfold.

I'm attempting to implement dictionary reuse from the current buffer; it's lossless. It works well when reading several documents from one chunk (8 * 10 * 1024 bytes). However, if documents are read from different chunks, we have to rebuild the dictionary.
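To make the arithmetic in that example concrete, here is a minimal sketch (an editorial illustration, not Lucene code; the 1 KB / 4 KB figures are just the hypothetical sizes from the comment above):

```java
// Rough read-amplification arithmetic for a preset-dictionary scheme, using the
// hypothetical sizes from the comment above.
public class PresetDictReadAmplification {
  public static void main(String[] args) {
    long docBytes = 1024;      // a small 1 KB document
    long dictBytes = 4 * 1024; // a 4 KB preset dictionary

    // To decompress the document we must also read the dictionary it was compressed against.
    long totalRead = docBytes + dictBytes;
    long extraRead = totalRead - docBytes;

    // extraRead / docBytes == 4: the read I/O grows by four times the document size.
    System.out.println("extra read: " + extraRead + " bytes (" + (extraRead / docBytes) + "x the document)");
  }
}
```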
Re: [PR] Optimize ParallelLeafReader to improve term vector fetching efficiency [lucene]
vigyasharma commented on code in PR #14373:
URL: https://github.com/apache/lucene/pull/14373#discussion_r2008971160

## lucene/core/src/java/org/apache/lucene/index/ParallelLeafReader.java:

```
@@ -348,15 +348,24 @@ public void prefetch(int docID) throws IOException {
   @Override
   public Fields get(int docID) throws IOException {
     ParallelFields fields = null;
-    for (Map.Entry ent : tvFieldToReader.entrySet()) {
-      String fieldName = ent.getKey();
-      TermVectors termVectors = readerToTermVectors.get(ent.getValue());
-      Terms vector = termVectors.get(docID, fieldName);
-      if (vector != null) {
-        if (fields == null) {
-          fields = new ParallelFields();
-        }
-        fields.addField(fieldName, vector);
+
+    // Step 2: Fetch all term vectors once per reader
+    for (Map.Entry entry : readerToTermVectors.entrySet()) {
+      TermVectors termVectors = entry.getValue();
+      Fields docFields = termVectors.get(docID); // Fetch all fields at once
+
+      if (docFields != null) {
+        if (fields == null) {
+          fields = new ParallelFields();
+        }
+
+        // Step 3: Aggregate only required fields
+        for (String fieldName : docFields) {
+          Terms vector = docFields.terms(fieldName);
+          if (vector != null) {
```

Review Comment:
> `docFields.terms(fieldName)` can be null if the field exists in `docFields` but does not have term vectors stored.

Okay. I thought `termVectors.get(docID)` only returns Fields that have terms.
Re: [PR] Clean up junk from gradle's user home (~/.gradle/.tmp). #14385 [lucene]
uschindler commented on code in PR #14387:
URL: https://github.com/apache/lucene/pull/14387#discussion_r2008716613

## gradle/hacks/wipe-temp.gradle:

```
@@ -14,15 +14,35 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
+import java.nio.file.Files
+import java.time.temporal.ChronoUnit
+import java.time.Instant
+
 // See LUCENE-9471. We redirect temporary location for gradle
 // so that it doesn't pollute user's tmp. Wipe it during a clean though.
 configure(rootProject) {
   gradle.buildFinished {
+    // we clean up the java.io.tmpdir we've redirected gradle to use (LUCENE-9471).
+    // these are still used and populated with junk.
     rootProject.delete fileTree(".gradle/tmp").matching {
```

Review Comment:
Maybe add a comment above that this is in the project's directory. This caused confusion in the original discussion!
Re: [PR] Implement #docIDRunEnd() on PostingsEnum. [lucene]
gf2121 commented on code in PR #14390:
URL: https://github.com/apache/lucene/pull/14390#discussion_r2008706856

## lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java:

```
@@ -1059,6 +1059,29 @@ private void bufferIntoBitSet(int start, int end, FixedBitSet bitSet, int offset
     }
   }
 
+  @Override
+  public int docIDRunEnd() throws IOException {
+    // Note: this assumes that BLOCK_SIZE == 128, this bit of the code would need to be changed if
```

Review Comment:
Nit: Can we assert this assumption here? As I tried to change `BLOCK_SIZE` before, I personally prefer an explicit AssertionError rather than comments :)

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

```
@@ -2458,6 +2469,31 @@ private static void checkTermsIntersect(Terms terms, Automaton automaton, BytesR
     }
   }
 
+  private static void checkDocIDRuns(DocIdSetIterator iterator) throws IOException {
+    int prevDoc = -1;
+    int runEnd = 0;
+    for (int doc = iterator.nextDoc();
```

Review Comment:
Do we also need to test that `docIDRunEnd` works properly against operations other than `nextDoc`? E.g. `advance`, `intoBitset`, `advanceShallow`.
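For readers following along: the invariant at stake is that `docIDRunEnd()` must not claim doc IDs the iterator does not actually contain, i.e. every doc in `[docID(), docIDRunEnd())` has to be present. Since the quoted `checkDocIDRuns` is cut off above, the following is only a hypothetical sketch of such a `nextDoc()`-driven check, not the actual CheckIndex code:

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

class DocIDRunCheckSketch {
  // Hypothetical sketch: verify that docIDRunEnd() never over-claims a run of
  // consecutive doc IDs. Not the actual CheckIndex implementation.
  static void checkDocIDRuns(DocIdSetIterator iterator) throws IOException {
    int doc = iterator.nextDoc();
    while (doc != DocIdSetIterator.NO_MORE_DOCS) {
      int runEnd = iterator.docIDRunEnd(); // exclusive end of the run starting at doc
      if (runEnd <= doc) {
        throw new AssertionError("docIDRunEnd() " + runEnd + " must be greater than doc " + doc);
      }
      int prev = doc;
      doc = iterator.nextDoc();
      // Every doc the iterator returns inside the claimed run must be exactly prev + 1,
      // otherwise the run contains a gap.
      while (doc < runEnd) {
        if (doc != prev + 1) {
          throw new AssertionError("doc " + (prev + 1) + " is missing from the run ending at " + runEnd);
        }
        prev = doc;
        doc = iterator.nextDoc();
      }
    }
  }
}
```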
Re: [PR] build: generate CaseFolding.java from "gradle regenerate" [lucene]
rmuir commented on code in PR #14384:
URL: https://github.com/apache/lucene/pull/14384#discussion_r2008648506

## lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:

```
@@ -759,23 +759,14 @@ private Automaton toAutomaton(
    * @return the original codepoint and the set of alternates
    */
   private int[] toCaseInsensitiveChar(int codepoint) {
-    int[] altCodepoints = CaseFolding.lookupAlternates(codepoint);
-    if (altCodepoints != null) {
-      int[] concat = new int[altCodepoints.length + 1];
-      System.arraycopy(altCodepoints, 0, concat, 0, altCodepoints.length);
-      concat[altCodepoints.length] = codepoint;
-      return concat;
-    } else {
-      int altCase =
-          Character.isLowerCase(codepoint)
-              ? Character.toUpperCase(codepoint)
-              : Character.toLowerCase(codepoint);
-      if (altCase != codepoint) {
-        return new int[] {altCase, codepoint};
-      } else {
-        return new int[] {codepoint};
-      }
-    }
+    List list = new ArrayList<>();
+    CaseFolding.expand(
+        codepoint,
+        (int variant) -> {
```

Review Comment:
I thought about it: I am not sure due to boxing; I can try again. It is 2025 and they still have not fixed the boxes :)
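For context on the boxing concern: the callback above receives a primitive `int`, but collecting variants into a `java.util.List` boxes each value into an `Integer`. A boxing-free alternative is to collect into a plain growable `int[]`. The sketch below only illustrates that idea and assumes a hypothetical `CaseFolding.expand(int, IntConsumer)`-style callback like the one in the diff; it is not the PR's actual API.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;
import org.apache.lucene.util.ArrayUtil;

// Sketch only: accumulate case-folding variants without boxing.
final class CaseVariantCollector implements IntConsumer {
  private int[] variants = new int[4];
  private int size = 0;

  @Override
  public void accept(int variant) {
    variants = ArrayUtil.grow(variants, size + 1); // grows geometrically, no Integer boxing
    variants[size++] = variant;
  }

  /** Returns the collected variants plus the original codepoint, as toCaseInsensitiveChar does. */
  int[] withOriginal(int codepoint) {
    int[] result = Arrays.copyOf(variants, size + 1);
    result[size] = codepoint;
    return result;
  }
}
```

A `toCaseInsensitiveChar`-style caller could then pass the collector to the expansion callback and return `collector.withOriginal(codepoint)`, avoiding any `Integer` allocation.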
Re: [PR] Optimize ParallelLeafReader to improve term vector fetching efficiency [lucene]
DivyanshIITB commented on PR #14373:
URL: https://github.com/apache/lucene/pull/14373#issuecomment-2745182601

> Changes look good to me. Can you run `./gradlew tidy` to fix formatting issues, and add a changes entry before we merge this?

I successfully ran `./gradlew tidy` and the build was successful. I also added the changes entry in `CHANGES.txt`.
Re: [PR] Implement #docIDRunEnd() on PostingsEnum. [lucene]
jpountz commented on code in PR #14390:
URL: https://github.com/apache/lucene/pull/14390#discussion_r2008757738

## lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java:

```
@@ -1059,6 +1059,29 @@ private void bufferIntoBitSet(int start, int end, FixedBitSet bitSet, int offset
     }
   }
 
+  @Override
+  public int docIDRunEnd() throws IOException {
+    // Note: this assumes that BLOCK_SIZE == 128, this bit of the code would need to be changed if
```

Review Comment:
I wanted to do this, but it gives a compile error that fails the build, and I'm unsure how to do it otherwise:

```
1. ERROR in lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java (at line 1066)
	assert BLOCK_SIZE == 2 * Long.SIZE;
	       ^^^
	Comparing identical expressions
```
Re: [PR] Implement #docIDRunEnd() on PostingsEnum. [lucene]
jpountz commented on code in PR #14390:
URL: https://github.com/apache/lucene/pull/14390#discussion_r2008757161

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

```
@@ -2458,6 +2469,31 @@ private static void checkTermsIntersect(Terms terms, Automaton automaton, BytesR
     }
   }
 
+  private static void checkDocIDRuns(DocIdSetIterator iterator) throws IOException {
+    int prevDoc = -1;
+    int runEnd = 0;
+    for (int doc = iterator.nextDoc();
```

Review Comment:
I wondered about it. We already have tests that test nextDoc vs. advance or nextDoc vs. intoBitSet, and "sane" impls of docIDRunEnd should be free of side effects, so I thought that it would be enough this way.
Re: [PR] Implement #docIDRunEnd() on PostingsEnum. [lucene]
gf2121 commented on code in PR #14390:
URL: https://github.com/apache/lucene/pull/14390#discussion_r2008767598

## lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java:

```
@@ -1059,6 +1059,29 @@ private void bufferIntoBitSet(int start, int end, FixedBitSet bitSet, int offset
     }
   }
 
+  @Override
+  public int docIDRunEnd() throws IOException {
+    // Note: this assumes that BLOCK_SIZE == 128, this bit of the code would need to be changed if
```

Review Comment:
This trick seems to work:

```java
boolean level0IsDense =
    encoding == DeltaEncoding.UNARY
        && docBitSet.getBits()[0] == -1L
        && docBitSet.getBits()[1] == -1L;
if (level0IsDense) {
  assert Long.bitCount(docBitSet.getBits()[0]) + Long.bitCount(docBitSet.getBits()[1]) == BLOCK_SIZE
      : "We are assuming BLOCK_SIZE == 128 here.";
  ...
}
```
Re: [PR] Implement #docIDRunEnd() on PostingsEnum. [lucene]
gf2121 commented on code in PR #14390:
URL: https://github.com/apache/lucene/pull/14390#discussion_r2008772414

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

```
@@ -2458,6 +2469,31 @@ private static void checkTermsIntersect(Terms terms, Automaton automaton, BytesR
     }
   }
 
+  private static void checkDocIDRuns(DocIdSetIterator iterator) throws IOException {
+    int prevDoc = -1;
+    int runEnd = 0;
+    for (int doc = iterator.nextDoc();
```

Review Comment:
Thank you for the explanation, that sounds good enough for today.
Re: [PR] Optimize ParallelLeafReader to improve term vector fetching efficiency [lucene]
DivyanshIITB commented on code in PR #14373:
URL: https://github.com/apache/lucene/pull/14373#discussion_r2008721150

## lucene/core/src/java/org/apache/lucene/index/ParallelLeafReader.java:

```
@@ -348,15 +348,24 @@ public void prefetch(int docID) throws IOException {
   @Override
   public Fields get(int docID) throws IOException {
     ParallelFields fields = null;
-    for (Map.Entry ent : tvFieldToReader.entrySet()) {
-      String fieldName = ent.getKey();
-      TermVectors termVectors = readerToTermVectors.get(ent.getValue());
-      Terms vector = termVectors.get(docID, fieldName);
-      if (vector != null) {
-        if (fields == null) {
-          fields = new ParallelFields();
-        }
-        fields.addField(fieldName, vector);
+
+    // Step 2: Fetch all term vectors once per reader
+    for (Map.Entry entry : readerToTermVectors.entrySet()) {
+      TermVectors termVectors = entry.getValue();
+      Fields docFields = termVectors.get(docID); // Fetch all fields at once
+
+      if (docFields != null) {
+        if (fields == null) {
+          fields = new ParallelFields();
+        }
+
+        // Step 3: Aggregate only required fields
+        for (String fieldName : docFields) {
+          Terms vector = docFields.terms(fieldName);
+          if (vector != null) {
```

Review Comment:
`docFields.terms(fieldName)` can be null if the field exists in `docFields` but does not have term vectors stored. This check ensures we only add fields that actually have term vectors.
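One way to check the assumption discussed above is a small standalone sketch (not part of the PR) that indexes one field with term vectors and one without, then inspects what `LeafReader#termVectors()` returns for that document:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.store.ByteBuffersDirectory;

public class TermVectorFieldsCheck {
  public static void main(String[] args) throws Exception {
    try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      // One field that stores term vectors, one that does not.
      FieldType withVectors = new FieldType(TextField.TYPE_NOT_STORED);
      withVectors.setStoreTermVectors(true);
      withVectors.freeze();

      Document doc = new Document();
      doc.add(new Field("vectored", "foo bar", withVectors));
      doc.add(new TextField("plain", "baz qux", Field.Store.NO)); // indexed without term vectors
      writer.addDocument(doc);
      writer.commit();

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        LeafReader leaf = reader.leaves().get(0).reader();
        Fields fields = leaf.termVectors().get(0); // term vectors for docID 0
        if (fields == null) {
          System.out.println("no term vectors at all for this doc");
          return;
        }
        for (String field : fields) {
          // If only "vectored" is listed, get(docID) already filters to fields with term
          // vectors; either way, the null check on terms(field) remains harmless.
          System.out.println(field + " -> terms present: " + (fields.terms(field) != null));
        }
      }
    }
  }
}
```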
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2006876856

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:

```
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix point to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block consists entirely of
+   *     pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing
+   *     a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+    BUILDING,
+    SAVED,
+    DESTROYED
+  }
+
+  private static class Node {
+
+    // The utf8 digit that leads to this Node, 0 for root node
+    private final int label;
+    // The children listed in order by their utf8 label
+    private final LinkedList<Node> children;
+    // The output of this node.
+    private Output output;
+
+    // Vars used during saving:
+
+    // The file pointer point to where the node saved. -1 means the node has not been saved.
+    private long fp = -1;
+    // The iterator whose next() point to the first child has not been saved.
+    private Iterator<Node> childrenIterator;
+
+    Node(int label, Output output, LinkedList<Node> children) {
+      this.label = label;
+      this.output = output;
+      this.children = children;
+    }
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+    return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+    if (k.length == 0) {
+      root.output = v;
+      return;
+    }
+    Node parent = root;
+    for (int i = 0; i < k.length; i++) {
+      int b = k.bytes[i + k.offset] & 0xFF;
+      Output output = i == k.length - 1 ? v : null;
+      Node node = new Node(b, output, new LinkedList<>());
+      parent.children.add(node);
+      parent = node;
+    }
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not
+   * have key that already exists in this one, otherwise a {@link IllegalArgumentException} will be
+   * thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+    if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+      throw new IllegalStateException("tries should be unsaved");
+    }
+    // Use a simple stack to avoid recursion.
+    Deque<Runnable> stack = new ArrayDeque<>();
+    stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+    while (!stack.isEmpty()) {
+      stack.pop().run();
+    }
+    trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque<Runnable> stack) {
+    assert n.label == add.label;
+    if (add.output != null) {
+      if (n.output != null) {
```
Re: [PR] Optimize ParallelLeafReader to improve term vector fetching efficiency [lucene]
vigyasharma commented on code in PR #14373:
URL: https://github.com/apache/lucene/pull/14373#discussion_r2008971507

## lucene/CHANGES.txt:

```
@@ -35,6 +35,10 @@ Optimizations
 -
 * GITHUB#14011: Reduce allocation rate in HNSW concurrent merge. (Viliam Durina)
 * GITHUB#14022: Optimize DFS marking of connected components in HNSW by reducing stack depth, improving performance and reducing allocations. (Viswanath Kuchibhotla)
+* GITHUB#14373: Optimized `ParallelLeafReader` to improve term vector fetching efficiency.
+- Fetches all term vectors once per reader instead of per field.
+- Reduces complexity from **O(n²) to O(n)**.
+- Enhances performance for documents with many fields. (Divyansh Agrawal)
```

Review Comment:
We generally keep a single bullet per changes entry. Details are already available in the pull request that the entry points to.
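As an illustration of that convention, the entry could be collapsed into a single bullet along these lines (the wording below is only a suggestion, not the final CHANGES text):

```
* GITHUB#14373: Optimize ParallelLeafReader to fetch term vectors once per reader
  instead of once per field, speeding up documents with many fields. (Divyansh Agrawal)
```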
Re: [PR] Cover all DataType [lucene]
LuXugang commented on PR #14091:
URL: https://github.com/apache/lucene/pull/14091#issuecomment-2745976269

Sorry for missing the backport of this issue, I will fix it soon.