Re: [I] TestKnnGraph.testMultiThreadedSearch random test failure [lucene]
benwtrent commented on issue #14327: URL: https://github.com/apache/lucene/issues/14327#issuecomment-2730201462 Git bisect blames: a6a96cde1c65fddb65363f0090a0202fd6db329c Which, if the scores are the same between docs, makes sense to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
jpountz commented on code in PR #14365: URL: https://github.com/apache/lucene/pull/14365#discussion_r1999318407 ## lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java: ## @@ -251,6 +252,30 @@
```java
@Override
public void visit(IntsRef ref) {
  final int[] docs = ref.ints;
  for (int i = ref.offset, to = ref.offset + ref.length; i < to; i++) {
    int docID = docs[i];
    if (docID > maxDocVisited) {
      adder.add(docID);
    }
  }
}

@Override
public void visit(DocIdSetIterator iterator) throws IOException {
  assert iterator.docID() == -1;
  int docID = iterator.nextDoc();
  if (docID <= maxDocVisited) {
    docID = iterator.advance(maxDocVisited + 1);
  }
  while (docID != DocIdSetIterator.NO_MORE_DOCS) {
    adder.add(docID);
    docID = iterator.nextDoc();
  }
}
```
Review Comment: Could it just be a simple loop such as:
```java
for (int doc = iterator.advance(maxDocVisited + 1); doc != NO_MORE_DOCS; doc = iterator.nextDoc()) {
  adder.add(doc);
}
```
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730390727 Instead of a boolean flag, what if we define an interface that specifies the folding rules? It could have two methods: one that folds input characters to a canonical representation (before sorting) and one that expands from the canonical representation to the characters that should be matched. We could ship ASCII and Turkish implementations to start, say. If someone has a Romanian corpus that has a mix of characters with and without diacritics, they might strip diacritics on input and expand them for matching. (That would effectively combine lowercase and ASCII folding.)
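A rough sketch of what such a fold/expand interface could look like — all names here and the ASCII implementation are illustrative only, not existing Lucene API:

```java
// Hypothetical interface, illustrating the proposal above. Names are made up.
interface CaseFolder {
  // Fold a code point to its canonical (sortable) representation.
  int fold(int codePoint);

  // Expand a canonical code point to all code points that should match it.
  int[] expand(int foldedCodePoint);
}

// Minimal ASCII implementation: canonical form is lowercase.
final class AsciiCaseFolder implements CaseFolder {
  @Override
  public int fold(int cp) {
    return (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
  }

  @Override
  public int[] expand(int cp) {
    // A lowercase letter matches itself and its uppercase counterpart.
    return (cp >= 'a' && cp <= 'z')
        ? new int[] {cp, cp - ('a' - 'A')}
        : new int[] {cp};
  }
}
```

A Turkish implementation would differ only in how `i`, `İ`, `ı`, and `I` fold and expand.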
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730397885 I think my ask is misunderstood; it is just to follow the Unicode standard. There are two mappings for simple case folding:
* Default
* Alternate (Turkish/Azeri)
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730410784 If you want to do fancy Romanian accent removal, use an analyzer and normalize your data. That's what a search engine is all about. But if we want to provide some limited runtime expansion (which I'm still very unsure about), let's just stick with the standards and not invent something else. No need for interfaces, abstractions, or lambdas. `boolean` is the correct solution.
Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]
rmuir commented on PR #14360: URL: https://github.com/apache/lucene/pull/14360#issuecomment-2730119438 I'll keep the PR up here. Actually, as a first step, I'd rather improve the existing toDot() and regex toString(). It would help the logic here, too. There's no need to escape codepoints as `\u0010`; we could do `U+10` instead. And it is very conservative about what is "printable". If I want to make them look pretty, I edit the .dot file and fix some of these things. We can definitely try to help remove that step. For mermaid, it does allow accessible descriptions via "hover" with the mouse. It would be nice to be "aggressive" about displaying characters but at the same time provide the ability to see the Unicode range if you want. I'll see what the possibilities are with dot here too.
Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2729902081 I raised a PR for the annotation: https://github.com/mikemccand/luceneutil/pull/354.
Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]
gf2121 commented on PR #14361: URL: https://github.com/apache/lucene/pull/14361#issuecomment-2729280384 OK, I get the expected results (multiple of 16 faster than multiple of 8) when I force `-XX:UseAVX=3`. It can be seen that AVX3 is slower on this chip, which may be why Java disables it by default.
```
Benchmark                                                 Mode  Cnt   Score   Error  Units
Decode21Benchmark.decode21Scalar                         thrpt    5  28.375 ± 0.064  ops/ms
Decode21Benchmark.decode21Scalar:asm                     thrpt       NaN               ---
Decode21Benchmark.decode21Vector                         thrpt    5  41.844 ± 0.182  ops/ms
Decode21Benchmark.decode21Vector:asm                     thrpt       NaN               ---
Decode21Benchmark.decode21VectorFloorToMultipleOf16      thrpt    5  64.471 ± 0.218  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf16:asm  thrpt       NaN               ---
Decode21Benchmark.decode21VectorFloorToMultipleOf8       thrpt    5  39.665 ± 0.120  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8:asm   thrpt       NaN               ---
```
Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2729806648 The nightly benchmark confirmed the speedup: https://benchmarks.mikemccandless.com/2025.03.16.18.04.58.html. Thanks again for the profiling guidance and for helping figure out simpler and faster code!
Re: [I] Multi-threaded vector search over multiple segments can lead to inconsistent results [lucene]
tteofili commented on issue #14180: URL: https://github.com/apache/lucene/issues/14180#issuecomment-2730194375

> Could you please let me know which future version of Elasticsearch will resolve the vector search consistency problem?

We are investigating a proper solution to this issue that addresses determinism while retaining good efficiency (e.g., see some early experiments in this [draft PR](https://github.com/apache/lucene/pull/14191)).
Re: [PR] Completion FSTs to be loaded off-heap by default [lucene]
javanna commented on code in PR #14364: URL: https://github.com/apache/lucene/pull/14364#discussion_r1999422163 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/document/CompletionPostingsFormat.java: ## @@ -122,11 +122,6 @@
```java
   private final FSTLoadMode fstLoadMode;

-  /** Used only by core Lucene at read-time via Service Provider instantiation */
-  public CompletionPostingsFormat(String name) {
```
Review Comment: This was unused and redundant.
Re: [PR] Speedup merging of HNSW graphs [lucene]
mayya-sharipova commented on PR #14331: URL: https://github.com/apache/lucene/pull/14331#issuecomment-2730506974 @msokolov Thanks for the comment. I've experimented with setting beamCandidates to `M * 3` when building merged graphs, increasing it from the previous `M * 2`. Graphs look better, and there are still significant speedups. Evaluation was done with luceneutil on these datasets:

1. **quora-E5-small**; 522931 docs; 384 dims; 7 bits quantized; cosine metric
   - baseline: index time: **112.41s**, force merge: **113.81s**
   - candidate: index time: **81.55s**, force merge: **70.87s**
2. **cohere-wikipedia-v2**; 1M docs; 768 dims; 7 bits quantized; cosine metric
   - baseline: index time: **158.1s**, force merge: **425.20s**
   - candidate: index time: **122.95s**, force merge: **239.28s**
3. **gist**; 1M docs; 960 dims; 7 bits quantized; euclidean metric
   - baseline: index time: **141.82s**, force merge: **536.07s**
   - candidate: index time: **119.26s**, force merge: **279.05s**
4. **cohere-wikipedia-v3**; 1M docs; 1024 dims; 7 bits quantized; dot_product metric
   - baseline: index time: **211.86s**, force merge: **654.97s**
   - candidate: index time: **168.22s**, force merge: **414.12s**
Re: [PR] Integrating GPU based Vector Search using cuVS [lucene]
kaivalnp commented on PR #14131: URL: https://github.com/apache/lucene/pull/14131#issuecomment-2730545360 Exciting change! Since this PR adds a new codec for vector search, I wanted to point to #14178 along similar lines -- adding a new Faiss-based KNN format to index and query vectors. Faiss (https://github.com/facebookresearch/faiss) is _"a library for efficient similarity search and clustering of dense vectors"_. It supports various features like vector transforms (eg PCA), indexing algorithms (eg IVF, HNSW, etc), quantization techniques (eg PQ), search strategies (eg 2-step refinement), and different hardware (including [GPUs](https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU) -- it also has support for cuVS) -- and adding this codec would allow users to make use of (most of) these features! Internally, the format calls the [C API](https://github.com/facebookresearch/faiss/blob/main/c_api/INSTALL.md) of Faiss using Panama (https://openjdk.org/projects/panama) FFI. The codec is present in the sandbox module, and does _not_ add Faiss as a dependency of Lucene -- it only relies on the shared library (along with all dependencies) being present at runtime (on `$LD_LIBRARY_PATH` or `-Djava.library.path`). Would appreciate feedback on the PR!
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730578138 Okay, got it! That's the piece that I was misunderstanding. I didn't realize that Turkish/Azeri is the **only** other valid folding. I kept thinking of it as just an example where the naïve folding wouldn't work. I'll go ahead and update this PR with that in mind.
Re: [PR] Completion FSTs to be loaded off-heap by default [lucene]
javanna commented on code in PR #14364: URL: https://github.com/apache/lucene/pull/14364#discussion_r1999421774 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/document/Completion101PostingsFormat.java: ## @@ -25,17 +25,9 @@
```java
 * @lucene.experimental
 */
public class Completion101PostingsFormat extends CompletionPostingsFormat {
-  /** Creates a {@link Completion101PostingsFormat} that will load the completion FST on-heap. */
+  /** Creates a {@link Completion101PostingsFormat} that will load the completion FST off-heap. */
   public Completion101PostingsFormat() {
-    this(FSTLoadMode.ON_HEAP);
-  }
-
-  /**
-   * Creates a {@link Completion101PostingsFormat} that will use the provided fstLoadMode
-   * to determine if the completion FST should be loaded on or off heap.
-   */
-  public Completion101PostingsFormat(FSTLoadMode fstLoadMode) {
```
Review Comment: I find these constructors taking the FST load mode trappy, in that they make readers think that calling them allows overriding the load mode. In practice, the load mode is only ever used at read time, and that codepath goes through SPI loading, so via the default constructor. I think removing these is the best path forward to avoid confusion.
Re: [I] A little optimization about LZ4 [lucene]
jainankitk commented on issue #14347: URL: https://github.com/apache/lucene/issues/14347#issuecomment-2730860749 I am not sure it is a good idea to have this as a user parameter. But I am wondering if the default for `BEST_SPEED` should be using a preset dict, as that compromises speed for compression. Users can always use `BEST_COMPRESSION` if they are willing to compromise on speed in favor of compression.
[PR] Completion FSTs to be loaded off-heap by default [lucene]
javanna opened a new pull request, #14364: URL: https://github.com/apache/lucene/pull/14364 All the existing completion postings formats load their FSTs on-heap. It is possible to customize that behaviour by maintaining a custom postings format that overrides the FST load mode. TestSuggestField attempted to test all modes, but in practice it can only test the default load mode, because the read path relies on the default constructor called via SPI, which does not allow providing the FST load mode. This commit changes the default value to OFF_HEAP and addresses the testing gap in TestSuggestField, so that all load modes are tested. This unveiled a bug when loading an FST off-heap where the index input is a MemorySegmentIndexInput, which is fixed by #14271.
Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]
javanna commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2729969364 I opened #14364 to make the suggested change to the completion postings format, let me know what you think.
Re: [PR] Address completion fields testing gap and truly allow loading FST off heap [lucene]
javanna commented on PR #14270: URL: https://github.com/apache/lucene/pull/14270#issuecomment-2729966315 I opened #14364 to still address the testing gap, but also change the default load mode to off heap.
Re: [PR] Address completion fields testing gap and truly allow loading FST off heap [lucene]
javanna closed pull request #14270: Address completion fields testing gap and truly allow loading FST off heap URL: https://github.com/apache/lucene/pull/14270
Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2729965328 An aside: `org.apache.lucene.search.DisjunctionScorer.TwoPhase#matches` looks kind of sad, in that each matches() call is going to build a priority queue of "unverified matches" (DisiWrapper holding TwoPhaseIterator). It seems strange to populate one when visiting each doc instead of maintaining a fixed pre-sorted array of them, since we know which clauses have TPIs. The DisiWrapper could have a TPI index (by match cost) into an array of TwoPhaseIterators. The selected unverifiedMatches per matches() call might be noted via a bitmask/bitset that is cheap to set, clear, and iterate set bits. Or we could just use an array of DisiWrapper that is cleared and filled. No matchCost comparisons or heap manipulation. Not sure if I'm over-optimizing here. The use case bringing me here has only one TPI, and its approximation is all docs.
Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]
jpountz commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2729975580 Fantastic speedup. Nice to see tasks like `TermDayOfYearSort` also take advantage of this change.
Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2729988278 The current approach is indeed probably not the fastest. We should add a task to the nightly benchmarks if we want to optimize this. Something like a disjunction of phrase queries (possibly both for top-n collection and counting)?
Re: [I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]
jainankitk commented on issue #14348: URL: https://github.com/apache/lucene/issues/14348#issuecomment-2730902349

> Lucene currently uses ReadAdvice.RANDOM when opening these files. I think it would be better to use RANDOM_PRELOAD.

As per the documentation for RANDOM_PRELOAD: _This should only be used on very small files that can be expected to fit in RAM with very high confidence._ Hence, I am wondering if it can have an adverse impact in cases with a small page cache. While https://github.com/apache/lucene/pull/13264 adds a way to configure ReadAdvice, that applies to all file formats, I guess. It would be great if users could tune the ReadAdvice for specific formats, depending on the server page cache and data size.
Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2730909459 BTW I don't have plans to explore this further. Anyone should feel free to take over. Or abandon if nobody cares -- I admit it's very unusual to even have a top-level disjunction, let alone one with an expensive clause, as implied by the TPI. I found this while writing a simple test, so simple that it wasn't real-world representative. The real-world representative has the disjunction deeper than the top level, which doesn't show this performance problem because the bulk scorer is then not used.
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
mccullocht commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1999677796 ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,72 @@
```java
package org.apache.lucene.index;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.io.IOException;

/**
 * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters.
 */
public class MultiTenantMergeScheduler extends MergeScheduler {

  // Shared global thread pool with lazy initialization
  private static class LazyHolder {
    static final ExecutorService MERGE_THREAD_POOL =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
  }

  private static ExecutorService getMergeThreadPool() {
    return LazyHolder.MERGE_THREAD_POOL;
  }

  // Use getMergeThreadPool() instead of direct access

  @Override
  public void merge(MergeScheduler.MergeSource mergeSource, MergeTrigger trigger) throws IOException {
    while (mergeSource.hasPendingMerges()) { // Use hasPendingMerges() instead of relying on null check
      MergePolicy.OneMerge merge = mergeSource.getNextMerge();
      if (merge == null) {
        break; // Explicitly exit if no merge is available
      }

      // Submit merge task to the shared thread pool
      MERGE_THREAD_POOL.submit(() -> {
        try {
          mergeSource.merge(merge);
        } catch (IOException e) {
          throw new RuntimeException("Merge operation failed", e);
        }
      });

      // Cleanup completed merges
      activeMerges.removeIf(Future::isDone);
    }
  }

  @Override
  public void close() throws IOException {
```
Review Comment: IIRC `IndexWriter.close()` will call `MergeScheduler.close()`, which means that you would wait for all merges across all indexes using this scheduler to complete. The `IndexWriter` that is currently calling `close()` will stop scheduling new merges, but the other writers may not, which means this may not terminate. To fix this you may need to map active merges back to the `IndexWriter` or `Directory` they belong to.
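One possible shape for that fix, sketched with plain JDK types (the class and method names here are hypothetical, not Lucene API): key the submitted merge futures by their owning writer or directory, so that close() only waits on its own merges instead of every tenant's.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not Lucene code: track merge futures per source so a
// closing writer waits only for merges it scheduled itself.
class PerSourceMergeTracker {
  private final ExecutorService pool = Executors.newFixedThreadPool(2);
  private final Map<Object, List<Future<?>>> bySource = new ConcurrentHashMap<>();

  // Submit a merge on behalf of a given source (e.g., an IndexWriter or Directory).
  void submit(Object source, Runnable merge) {
    bySource.computeIfAbsent(source, k -> new CopyOnWriteArrayList<>())
        .add(pool.submit(merge));
  }

  // Wait only for this source's merges; other sources keep running.
  void close(Object source) {
    for (Future<?> f : bySource.getOrDefault(source, List.of())) {
      try {
        f.get();
      } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
      }
    }
    bySource.remove(source);
  }

  // Shut down the shared pool itself (e.g., at JVM shutdown).
  void shutdown() {
    pool.shutdown();
  }
}
```

The same idea could live inside the scheduler by keying on the `MergeSource` passed to `merge(...)`.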
Re: [PR] Implement #docIDRunEnd() on `DisjunctionDISIApproximation`. [lucene]
jpountz merged PR #14363: URL: https://github.com/apache/lucene/pull/14363
Re: [PR] Completion FSTs to be loaded off-heap by default [lucene]
jpountz commented on code in PR #14364: URL: https://github.com/apache/lucene/pull/14364#discussion_r1999714466 ## lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java: ## @@ -951,7 +951,16 @@ static IndexWriterConfig iwcWithSuggestField(Analyzer analyzer, final Set
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2731329809 it is confusing, because the Unicode case folding algorithm is supposed to work for everyone. But here's the problem:

for most of the world:
* lowercase i has a dot, uppercase I has no dot.

for the Turkish/Azeri world:
* lowercase i corresponds to İ with a dot, and lowercase ı without a dot corresponds to I

that's why they need their own separate mappings.
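For illustration only (this uses the JDK's locale-sensitive case conversion, not Lucene code or the exact Unicode case-folding tables), the two mappings show up directly, with the boolean selecting the Turkish/Azeri alternate as proposed in the thread:

```java
import java.util.Locale;

public class CaseFoldingDemo {
  private static final Locale TURKISH = new Locale("tr");

  // Default simple mapping vs. the Turkish/Azeri alternate, selected by a boolean.
  static String lower(String s, boolean turkic) {
    return s.toLowerCase(turkic ? TURKISH : Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(lower("I", false));        // "i": default, I gains a dot
    System.out.println(lower("I", true));         // "ı": Turkish, I stays dotless
    System.out.println("i".toUpperCase(TURKISH)); // "İ": Turkish, i keeps its dot
  }
}
```

Every other letter folds identically under both mappings; only the dotted/dotless i family differs.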
[PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]
DivyanshIITB opened a new pull request, #78: URL: https://github.com/apache/lucene-site/pull/78 This PR adds a direct link to the [Lucene Issue Tracker](https://issues.apache.org/jira/projects/LUCENE) under the "Editing Content on the Lucene™ sites" section in site-instructions.md. Changes Made: - Updated site-instructions.md to include a reference to the issue tracker. - Placed the link right after the mention of the GitHub and Gitbox repositories for better visibility. Reason for the Change: - Helps contributors quickly find where to report website-related issues. - Improves documentation clarity and accessibility. Fixes #72
[PR] Implement bulk adding methods for dynamic pruning. [lucene]
gf2121 opened a new pull request, #14365: URL: https://github.com/apache/lucene/pull/14365

(no comment)
Re: [I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]
jpountz commented on issue #14348: URL: https://github.com/apache/lucene/issues/14348#issuecomment-2730966937

For what it's worth, it's possible to override the read advice of vectors with something like that:

```java
Path path = ...;
Directory dir = new FilterDirectory(FSDirectory.open(path)) {
  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    if (name.endsWith(".vex") || name.endsWith(".vec") || name.endsWith(".veb")) {
      context = context.withReadAdvice(ReadAdvice.RANDOM_PRELOAD);
    }
    return super.openInput(name, context);
  }
};
```
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
DivyanshIITB commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998645110

## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:

```diff
@@ -0,0 +1,44 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+    // Shared global thread pool
+    private static final ExecutorService MERGE_THREAD_POOL = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
```

Review Comment: Thanks for the suggestion! I've updated the implementation to use lazy static initialization by wrapping the thread pool in a nested static class. This ensures the thread pool is only created when first accessed, avoiding unnecessary resource allocation. Now, instead of directly initializing MERGE_THREAD_POOL, I'm using a LazyHolder class with a getMergeThreadPool() method to ensure thread-safe lazy initialization. This should prevent the thread pool from being created when the policy is unused. Let me know if you have any further feedback!
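The holder idiom referred to in the reply can be sketched like this (class and method names here are illustrative, not necessarily the PR's exact code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LazyPoolSketch {
    // Initialization-on-demand holder idiom: the JVM initializes LazyHolder
    // (and thus the pool) only on first call to getMergeThreadPool(), and
    // class-initialization rules make this thread-safe without explicit locking.
    private static final class LazyHolder {
        static final ExecutorService MERGE_THREAD_POOL =
            Executors.newFixedThreadPool(
                // Math.max(1, ...) guards single-core machines, where
                // availableProcessors() / 2 would be 0 and
                // newFixedThreadPool(0) throws IllegalArgumentException
                Math.max(1, Runtime.getRuntime().availableProcessors() / 2));
    }

    public static ExecutorService getMergeThreadPool() {
        return LazyHolder.MERGE_THREAD_POOL;
    }
}
```

Note that the original `availableProcessors() / 2` sizing, without the `Math.max` guard, would fail on a single-core machine.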
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
DivyanshIITB commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998646386

## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:

```diff
@@ -0,0 +1,44 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+    // Shared global thread pool
+    private static final ExecutorService MERGE_THREAD_POOL = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
+
+    @Override
+    public void merge(MergeScheduler.MergeSource mergeSource, MergeTrigger trigger) throws IOException {
+        while (true) {
+            MergePolicy.OneMerge merge = mergeSource.getNextMerge();
+            if (merge == null) {
+                break;
+            }
+            // Submit merge task to the shared thread pool
+            Future future = MERGE_THREAD_POOL.submit(() -> {
+                try {
+                    mergeSource.merge(merge);
+                } catch (IOException e) {
+                    throw new RuntimeException(e);
+                }
+            });
+
+            try {
+                future.get(); // Ensure the task completes
+            } catch (Exception e) {
+                throw new IOException("Merge operation failed", e);
+            }
```

Review Comment: Thanks for the feedback! You're absolutely right: blocking on `future.get()` was making the merges sequential. I've removed it so that merges can now proceed asynchronously in the background. Let me know if this aligns with the intended behavior!
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
DivyanshIITB commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998645593

## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:

```diff
@@ -0,0 +1,44 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+    // Shared global thread pool
+    private static final ExecutorService MERGE_THREAD_POOL = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
+
+    @Override
+    public void merge(MergeScheduler.MergeSource mergeSource, MergeTrigger trigger) throws IOException {
+        while (true) {
+            MergePolicy.OneMerge merge = mergeSource.getNextMerge();
+            if (merge == null) {
```

Review Comment: Thanks for pointing this out! I've updated the loop condition to use hasPendingMerges() instead of relying on null. This ensures better adherence to the documented API behavior. Let me know if this aligns with what you were suggesting!
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
DivyanshIITB commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998646829

## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:

```diff
@@ -0,0 +1,44 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+    // Shared global thread pool
+    private static final ExecutorService MERGE_THREAD_POOL = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
+
+    @Override
+    public void merge(MergeScheduler.MergeSource mergeSource, MergeTrigger trigger) throws IOException {
+        while (true) {
+            MergePolicy.OneMerge merge = mergeSource.getNextMerge();
+            if (merge == null) {
+                break;
+            }
+            // Submit merge task to the shared thread pool
+            Future future = MERGE_THREAD_POOL.submit(() -> {
+                try {
+                    mergeSource.merge(merge);
+                } catch (IOException e) {
+                    throw new RuntimeException(e);
+                }
+            });
+
+            try {
+                future.get(); // Ensure the task completes
+            } catch (Exception e) {
+                throw new IOException("Merge operation failed", e);
+            }
+        }
+    }
+
+    @Override
+    public void close() {
```

Review Comment: Thanks for pointing this out! I've updated close() to properly wait for all running merges, similar to sync() in ConcurrentMergeScheduler. Now, merge() tracks active tasks, and I've added cleanup for completed futures. Also, I introduced shutdownThreadPool() to allow safe global shutdown. Let me know if this works!
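A minimal sketch of the behavior described in these replies (non-blocking submission plus a close() that drains in-flight work) could look like this; the names and structure are hypothetical simplifications, not the PR's actual code:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncMergeSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final List<Future<?>> pending = new CopyOnWriteArrayList<>();

    // Submit the merge without calling get() on its future, so merges run
    // concurrently in the background instead of being serialized one at a time.
    public void submit(Runnable mergeTask) {
        pending.removeIf(Future::isDone); // prune completed futures
        pending.add(pool.submit(mergeTask));
    }

    // Similar in spirit to ConcurrentMergeScheduler#sync(): wait for all
    // in-flight merges before shutting the pool down.
    public void close() throws Exception {
        for (Future<?> f : pending) {
            f.get();
        }
        pool.shutdown();
    }
}
```

The key difference from the original code is that `get()` moves from the submit path to the close path, so only shutdown blocks on merge completion.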
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
DivyanshIITB commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998648834

## lucene/core/src/test/org/apache/lucene/index/TestMultiTenantMergeScheduler.java:

```diff
@@ -0,0 +1,30 @@
+package org.apache.lucene.index;
+
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.RAMDirectory;
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.TextField;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestMultiTenantMergeScheduler extends LuceneTestCase {
+
+    public void testMultiTenantMergeScheduler() throws Exception {
```

Review Comment: Thanks for the feedback! I've expanded the test suite to include multiple writers consuming the scheduler, different merge scheduling times, and staggered start/close behavior. Also, I've referred to TestConcurrentMergeScheduler for structuring test cases to verify concurrency. Let me know if you have additional suggestions!
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
DivyanshIITB commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2729358357

I have a request: kindly ignore the following two deleted files in the "Files Changed" section:
- "KeepOnlyLastCommitDeletionPolicy.java"
- "ConcurrentMergeScheduler.java"
Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]
gf2121 commented on PR #14361: URL: https://github.com/apache/lucene/pull/14361#issuecomment-2729094239

Results on `wikimediumall`:

```
                Task    QPS baseline  StdDev    QPS my_modified_version  StdDev    Pct diff    p-value
              IntSet    94.25 (1.4%)           93.50 (1.1%)              -0.8% ( -3% -  1%)    0.069
 CountFilteredIntNRQ    43.06 (2.1%)           43.27 (2.2%)               0.5% ( -3% -  4%)    0.529
      FilteredIntNRQ    79.65 (3.1%)           80.63 (3.4%)               1.2% ( -5% -  8%)    0.285
              IntNRQ    81.27 (3.6%)           82.37 (3.8%)               1.3% ( -5% -  9%)    0.311
```

| file | baseline | candidate | diff |
| --- | --- | --- | --- |
| _32.kdd | 58829701 | 58713421 | -0.20% |
| _65.kdd | 63532709 | 63171629 | -0.57% |
| _98.kdd | 64475136 | 63986556 | -0.76% |
| _cb.kdd | 64765401 | 64210351 | -0.86% |
| _fe.kdd | 63945666 | 63346246 | -0.94% |
| _fp.kdd | 6619883 | 6094047 | -7.94% |
| _g0.kdd | 6422847 | 5915855 | -7.89% |
| _gb.kdd | 6483860 | 5981968 | -7.74% |
| _gm.kdd | 6494743 | 6005943 | -7.53% |
| _gx.kdd | 6253202 | 5811974 | -7.06% |
| _gy.kdd | 522267 | 522267 | 0.00% |
| _gz.kdd | 521444 | 521444 | 0.00% |
| _h0.kdd | 522729 | 522729 | 0.00% |
| _h1.kdd | 504805 | 504805 | 0.00% |
| _h2.kdd | 463669 | 463669 | 0.00% |
Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]
jpountz commented on PR #14361: URL: https://github.com/apache/lucene/pull/14361#issuecomment-2729147376

Should we floor to a multiple of 16 instead of 8 so that we have a perfect second loop with AVX-512 as well? (By the way, which of your machines produced the above benchmark results?) Otherwise, the change makes sense to me.
Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]
jpountz commented on code in PR #14359: URL: https://github.com/apache/lucene/pull/14359#discussion_r1998546217

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:

```diff
@@ -238,9 +296,77 @@ private void scoreWindowUsingBitSet(
     windowMatches.clear();
   }
 
+  private void scoreWindowUsingLeapFrog(
```

Review Comment: The contract is that `TwoPhaseIterator#matches()` verifies if the current doc ID of the approximation matches. So we cannot load matches from the approximation into a bit set and then call `matches()`, as it wouldn't check the correct doc ID. We could change the API, but I'm not sure if that would buy anything, as our most typical two-phase iterator, PhraseQuery, is similar: `PostingsEnum#nextPosition()` reads the next position of the current doc ID.
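To make the contract concrete, here is a toy model (deliberately not the real Lucene classes, though the names mirror them) of how a two-phase iterator must be consumed: advance the approximation, then confirm each candidate with matches(), which is only meaningful for the approximation's current doc:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class TwoPhaseToy {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Toy approximation: iterates candidate doc IDs in order, tracking
    // a "current doc" the way DocIdSetIterator does.
    static class Approximation {
        private final Iterator<Integer> it;
        int doc = -1;
        Approximation(List<Integer> docs) { this.it = docs.iterator(); }
        int nextDoc() { return doc = it.hasNext() ? it.next() : NO_MORE_DOCS; }
    }

    // matches() verifies only the approximation's *current* doc, which is why
    // approximated docs cannot be buffered into a bit set and confirmed later.
    static boolean matches(Approximation approx, List<Integer> trueMatches) {
        return trueMatches.contains(approx.doc);
    }

    static List<Integer> collect(List<Integer> approxDocs, List<Integer> trueMatches) {
        List<Integer> out = new ArrayList<>();
        Approximation approx = new Approximation(approxDocs);
        for (int doc = approx.nextDoc(); doc != NO_MORE_DOCS; doc = approx.nextDoc()) {
            if (matches(approx, trueMatches)) { // confirm immediately, at this doc
                out.add(doc);
            }
        }
        return out;
    }
}
```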
Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]
jpountz commented on code in PR #14359: URL: https://github.com/apache/lucene/pull/14359#discussion_r1998546819

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:

```diff
@@ -238,9 +296,77 @@ private void scoreWindowUsingBitSet(
     windowMatches.clear();
   }
 
+  private void scoreWindowUsingLeapFrog(
+      LeafCollector collector,
+      Bits acceptDocs,
+      List approximations,
+      List twoPhases,
+      int min,
+      int max)
+      throws IOException {
+    if (approximations.size() == 1) {
+      // Since a TwoPhaseIterator comes with its approximation, the number of approximations is gte
+      // the number of TwoPhaseIterators
+      assert twoPhases.size() <= 1;
+      TwoPhaseIterator twoPhase = twoPhases.get(0);
+      DocIdSetIterator approximation = approximations.isEmpty() ? null : approximations.get(0);
```

Review Comment: Hmm, you're right. This means I'm missing a test case. I'll look into it.
Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]
gf2121 commented on code in PR #14359: URL: https://github.com/apache/lucene/pull/14359#discussion_r1998596143

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:

```diff
@@ -238,9 +296,77 @@ private void scoreWindowUsingBitSet(
     windowMatches.clear();
   }
 
+  private void scoreWindowUsingLeapFrog(
```

Review Comment:
> The contract is that TwoPhaseIterator#matches() verifies if the current doc ID of the approximation matches.

Oh yes, you are right. Thanks for the explanation!
Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]
gf2121 commented on PR #14361: URL: https://github.com/apache/lucene/pull/14361#issuecomment-2729250628

Thanks for the feedback,

> Should we floor to a multiple of 16 instead of 8 so that we have a perfect second loop with AVX-512 as well?

That is what I thought initially. But my AVX-512 machine (hopefully it is) somehow only processes 256 bits at a time for its `vpand` and `vpslld` instructions, so flooring to a multiple of 8 works on it. Flooring to a multiple of 16 makes the benchmark a bit slower on my machine, as more remaining longs need to be read, but I'm fine with flooring to a multiple of 16 if there are machines requiring that.

```
Benchmark                                                 Mode  Cnt   Score   Error   Units
Decode21Benchmark.decode21Scalar                         thrpt    5  28.114 ? 0.013  ops/ms
Decode21Benchmark.decode21Scalar:asm                     thrpt         NaN            ---
Decode21Benchmark.decode21Vector                         thrpt    5  49.160 ? 1.661  ops/ms
Decode21Benchmark.decode21Vector:asm                     thrpt         NaN            ---
Decode21Benchmark.decode21VectorFloorToMultipleOf16      thrpt    5  74.828 ? 0.463  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf16:asm  thrpt         NaN            ---
Decode21Benchmark.decode21VectorFloorToMultipleOf8       thrpt    5  81.078 ? 0.397  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8:asm   thrpt         NaN            ---
```

cpu flags:

```
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni arch_capabilities
```

> By the way, which of your machines produced the above benchmark results?

The luceneutil results were obtained on the Intel chip (Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (AVX 512)).
[PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]
gf2121 opened a new pull request, #14361: URL: https://github.com/apache/lucene/pull/14361

This PR tries another way to implement the idea of https://github.com/apache/lucene/pull/13521, taking advantage of an auto-vectorized loop to decode ints like we did for bpv24 in https://github.com/apache/lucene/pull/14203. One thing that needs to be pointed out is that the remainder loop does not get vectorized (again!) since `512 / 3 = 170` is not a multiple of 8, hence the `floorToMultipleOf8` trick.

**Mac M2**

```
Benchmark                                            Mode  Cnt    Score   Error   Units
Decode21Benchmark.decode21Scalar                    thrpt    5   92.405 ± 0.521  ops/ms
Decode21Benchmark.decode21Vector                    thrpt    5  108.325 ± 1.517  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8  thrpt    5  141.691 ± 3.948  ops/ms
```

**Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (AVX 512)**

```
Benchmark                                                Mode  Cnt   Score   Error   Units
Decode21Benchmark.decode21Scalar                        thrpt    5  29.134 ? 0.087  ops/ms
Decode21Benchmark.decode21Scalar:asm                    thrpt         NaN            ---
Decode21Benchmark.decode21Vector                        thrpt    5  45.180 ? 0.479  ops/ms
Decode21Benchmark.decode21Vector:asm                    thrpt         NaN            ---
Decode21Benchmark.decode21VectorFloorToMultipleOf8      thrpt    5  76.330 ? 2.858  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8:asm  thrpt         NaN            ---
```

cc @expani who raised this neat idea.
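The `floorToMultipleOf8` trick amounts to masking off the low bits so the main loop's trip count is a multiple of the vector width, leaving a small scalar tail (a sketch of the idea, not the PR's exact code):

```java
public class FloorSketch {
    // Floor a non-negative count down to a multiple of 8: the main
    // (auto-vectorizable) loop handles the first floorToMultipleOf8(count)
    // elements, and a scalar remainder loop handles the rest.
    public static int floorToMultipleOf8(int count) {
        return count & ~7; // same as count - (count % 8) for count >= 0
    }

    public static void main(String[] args) {
        // 512 / 3 = 170 values per block at BPV 21; 170 is not a multiple of 8,
        // so the vectorized loop covers 168 values and the tail loop covers 2.
        System.out.println(floorToMultipleOf8(170)); // 168
        System.out.println(floorToMultipleOf8(16));  // 16
    }
}
```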
Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]
gf2121 commented on code in PR #14359: URL: https://github.com/apache/lucene/pull/14359#discussion_r1998258104

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:

```diff
@@ -238,9 +296,77 @@ private void scoreWindowUsingBitSet(
     windowMatches.clear();
   }
 
+  private void scoreWindowUsingLeapFrog(
```

Review Comment: Out of curiosity: why are we choosing such a per-doc conjunction algorithm instead of scoring approximations into a bitset and checking `matches` for existing bits? Actually I also thought about something like `IndexedDISI#intoBitset` to speed up range queries on `Match.IF_DOC_HAS_VALUE` :)
Re: [I] TestKnnGraph.testMultiThreadedSearch random test failure [lucene]
gf2121 commented on issue #14327: URL: https://github.com/apache/lucene/issues/14327#issuecomment-2729377397

Seeing a similar failure as well:

```
> java.lang.AssertionError: [doc=0 score=0.990099 shardIndex=-1, doc=3 score=0.49751243 shardIndex=-1, doc=5 score=0.21691975 shardIndex=-1, doc=6 score=0.19960079 shardIndex=-1, doc=8 score=0.17825313 shardIndex=-1] expected:<5> but was:<8>
>   at __randomizedtesting.SeedInfo.seed([7E836993445B6326]:0)
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at org.apache.lucene.index.TestKnnGraph.assertResults(TestKnnGraph.java:356)
>   at org.apache.lucene.index.TestKnnGraph.lambda$testMultiThreadedSearch$0(TestKnnGraph.java:308)
>   at java.base/java.lang.Thread.run(Thread.java:1575)
>
> java.lang.AssertionError: [doc=0 score=0.990099 shardIndex=-1, doc=3 score=0.49751243 shardIndex=-1, doc=5 score=0.21691975 shardIndex=-1, doc=6 score=0.19960079 shardIndex=-1, doc=8 score=0.17825313 shardIndex=-1] expected:<5> but was:<8>
>   at __randomizedtesting.SeedInfo.seed([7E836993445B6326]:0)
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at org.apache.lucene.index.TestKnnGraph.assertResults(TestKnnGraph.java:356)
>   at org.apache.lucene.index.TestKnnGraph.lambda$testMultiThreadedSearch$0(TestKnnGraph.java:308)
>   at java.base/java.lang.Thread.run(Thread.java:1575)
>
> java.lang.AssertionError: [doc=0 score=0.990099 shardIndex=-1, doc=3 score=0.49751243 shardIndex=-1, doc=5 score=0.21691975 shardIndex=-1, doc=6 score=0.19960079 shardIndex=-1, doc=8 score=0.17825313 shardIndex=-1] expected:<5> but was:<8>
>   at __randomizedtesting.SeedInfo.seed([7E836993445B6326]:0)
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at org.apache.lucene.index.TestKnnGraph.assertResults(TestKnnGraph.java:356)
>   at org.apache.lucene.index.TestKnnGraph.lambda$testMultiThreadedSearch$0(TestKnnGraph.java:308)
>   at java.base/java.lang.Thread.run(Thread.java:1575)

2> NOTE: reproduce with: gradlew test --tests TestKnnGraph.testMultiThreadedSearch -Dtests.seed=7E836993445B6326 -Dtests.locale=kam -Dtests.timezone=America/Guyana -Dtests.asserts=true -Dtests.file.encoding=UTF-8
```
Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]
dweiss commented on PR #14360: URL: https://github.com/apache/lucene/pull/14360#issuecomment-2728443809

I've looked at the docs of mermaid and toyed around a bit. I agree that infinite loops are so ugly that one's eyes start to bleed. Maybe we should stick to graphviz.
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
vigyasharma commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r1998041333

## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:

```diff
@@ -0,0 +1,44 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+    // Shared global thread pool
+    private static final ExecutorService MERGE_THREAD_POOL = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
+
+    @Override
+    public void merge(MergeScheduler.MergeSource mergeSource, MergeTrigger trigger) throws IOException {
+        while (true) {
+            MergePolicy.OneMerge merge = mergeSource.getNextMerge();
+            if (merge == null) {
```

Review Comment: Can we use `hasPendingMerges()` instead? I know existing schedulers assume `null` means there are no more merges, but it's not the documented API behavior.

## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:

```diff
@@ -0,0 +1,44 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+    // Shared global thread pool
+    private static final ExecutorService MERGE_THREAD_POOL = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
```

Review Comment: This will create a threadpool even when this policy is not used. Let's use lazy static initialization.

## lucene/core/src/test/org/apache/lucene/index/TestMultiTenantMergeScheduler.java:

```diff
@@ -0,0 +1,30 @@
+package org.apache.lucene.index;
+
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.RAMDirectory;
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.TextField;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestMultiTenantMergeScheduler extends LuceneTestCase {
+
+    public void testMultiTenantMergeScheduler() throws Exception {
```

Review Comment: This test is not sufficient. We need to test (and also implement) a lot more scenarios, like multiple writers consuming this scheduler, scheduling different merges and starting/closing at different times. You can refer to the tests in `TestConcurrentMergeScheduler` as a starting point.

## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:

```diff
@@ -0,0 +1,44 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters.
+ */
+public class MultiTenantMergeScheduler extends MergeScheduler {
+
+    // Shared global thread pool
+    private static final ExecutorService MERGE_THREAD_POOL = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2);
+
+    @Override
+    public void merge(MergeScheduler.MergeSource mergeSource, MergeTrigger trigger) throws IOException {
+        while (true) {
+            MergePolicy.OneMerge merge = mergeSource.getNextMerge();
+            if (merge == null) {
+                break;
+            }
+            // Submit merge task to the shared thread pool
+            Future future = MERGE_THREAD_POOL.submit(() -> {
+                try {
+                    mergeSource.merge(merge);
+                } catch (IOException e) {
+                    throw new RuntimeException(e);
+                }
+            });
+
+            try {
+                future.get(); // Ensure the task completes
+            } catch (Exception e) {
+                throw new IOException("Merge operation failed", e);
+            }
```

Review Comment: I don't think this achieves concurrent background merging like we'd want. I think this code will submit a merge request and block on its future for completion, effectively making all merges sequential.

## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java:

```diff
@@ -0,0 +1,44 @@
+package org.apache.lucene.index;
+
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.io.IOException;
+
+/**
+ * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriter
```
Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]
gf2121 commented on code in PR #14359: URL: https://github.com/apache/lucene/pull/14359#discussion_r1998219212

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java: ## @@ -238,9 +296,77 @@

```java
  private void scoreWindowUsingBitSet(
    // ...
    windowMatches.clear();
  }

  private void scoreWindowUsingLeapFrog(
      LeafCollector collector,
      Bits acceptDocs,
      List<DocIdSetIterator> approximations,
      List<TwoPhaseIterator> twoPhases,
      int min,
      int max)
      throws IOException {
    if (approximations.size() == 1) {
      // Since a TwoPhaseIterator comes with its approximation, the number of approximations is gte
      // the number of TwoPhaseIterators
      assert twoPhases.size() <= 1;
      TwoPhaseIterator twoPhase = twoPhases.get(0);
      DocIdSetIterator approximation = approximations.isEmpty() ? null : approximations.get(0);
```

Review Comment:
```suggestion
      DocIdSetIterator approximation = approximations.get(0);
      TwoPhaseIterator twoPhase = twoPhases.isEmpty() ? null : twoPhases.get(0);
```

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java: ## @@ -238,9 +296,77 @@

```java
  private void scoreWindowUsingLeapFrog(
```

Review Comment: Out of curiosity: why are we choosing such a per-doc conjunction algorithm instead of scoring approximations into a bitset and checking `matches` for existing bits? Actually I also thought about something like `IndexedDISI#intoBitset` to speed up range queries on sparse doc values :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
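The leap-frog conjunction being discussed advances each sorted doc-ID stream past the other until they agree on a document. A minimal stand-alone sketch of that pattern is below, using sorted `int[]` posting lists in place of Lucene's `DocIdSetIterator`; the class is hypothetical and not the PR's implementation, which also window-bounds the scan and checks `TwoPhaseIterator#matches` on candidates.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of leap-frog intersection over two sorted, duplicate-free doc-ID lists.
class LeapFrogSketch {
  /** Returns the doc IDs present in both sorted lists. */
  static List<Integer> intersect(int[] lead, int[] follow) {
    List<Integer> matches = new ArrayList<>();
    int i = 0, j = 0;
    while (i < lead.length && j < follow.length) {
      if (lead[i] == follow[j]) {
        // Both streams agree on this doc: it is a conjunction match.
        matches.add(lead[i]);
        i++;
        j++;
      } else if (lead[i] < follow[j]) {
        i++; // advance the lagging stream
      } else {
        j++;
      }
    }
    return matches;
  }
}
```

The bitset alternative raised in the comment instead materializes one clause's matches for a window into a bitset and tests the other clauses' candidates against it, trading per-doc advances for bulk set operations.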
[I] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 opened a new issue, #14362: URL: https://github.com/apache/lucene/issues/14362

### Description

Can we support modifying `segmentInfos.counter` in `IndexWriter`? This can be used to skip some segment names when writing. In the scenario of enabling `segment replication` in an Elasticsearch cluster, this control can be implemented during primary shard peer recovery and replica shard promotion to avoid segment counter conflicts. Please help evaluate; if possible, I will submit a Pull Request.
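The control being requested amounts to letting a caller bump the segment name counter forward (never backward), so names already used by another copy of the index are skipped. A hypothetical sketch of that semantics is below; the class and method names are illustrative only, not existing Lucene API.

```java
// Hypothetical sketch of a monotonic counter advance, as the issue proposes.
// SegmentInfos.counter is what IndexWriter uses to derive segment names ("_0", "_1", ...).
class SegmentCounterSketch {
  private long counter;

  /** Advances the counter to at least {@code minValue}; never moves it backward. */
  synchronized long advanceTo(long minValue) {
    counter = Math.max(counter, minValue);
    return counter;
  }

  synchronized long getCounter() {
    return counter;
  }
}
```

Making the operation monotonic is what avoids conflicts during replica promotion: a newly promoted primary can advance past the highest segment name it has seen, without ever reusing a name.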
Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]
iverase closed pull request #14358: Specialise DirectMonotonicReader when it only contains one block URL: https://github.com/apache/lucene/pull/14358