Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]
jpountz commented on PR #14359: URL: https://github.com/apache/lucene/pull/14359#issuecomment-2727649234

This doesn't slow down existing tasks significantly, including `CountFilteredPhrase`, which now runs with `DenseConjunctionBulkScorer` vs. a `DefaultBulkScorer` on top of a `ConjunctionScorer` before.

```
Task  QPS baseline (StdDev)  QPS my_modified_version (StdDev)  Pct diff  p-value
TermMonthSort  3433.89 (3.5%)  3372.95 (2.8%)  -1.8% ( -7% - 4%)  0.078
CountFilteredPhrase  25.76 (2.6%)  25.34 (2.6%)  -1.6% ( -6% - 3%)  0.048
Wildcard  76.02 (3.6%)  75.31 (3.9%)  -0.9% ( -8% - 6%)  0.435
FilteredPrefix3  128.10 (3.3%)  126.96 (2.4%)  -0.9% ( -6% - 4%)  0.327
Prefix3  133.76 (3.2%)  132.58 (2.6%)  -0.9% ( -6% - 5%)  0.339
OrHighHigh  51.94 (3.4%)  51.49 (3.9%)  -0.9% ( -7% - 6%)  0.457
Or3Terms  162.21 (5.8%)  160.85 (5.6%)  -0.8% ( -11% - 11%)  0.642
TermTitleSort  141.36 (1.7%)  140.17 (2.3%)  -0.8% ( -4% - 3%)  0.201
FilteredIntNRQ  297.20 (2.2%)  295.03 (3.0%)  -0.7% ( -5% - 4%)  0.380
Fuzzy2  74.81 (2.9%)  74.30 (2.8%)  -0.7% ( -6% - 5%)  0.438
Fuzzy1  79.34 (3.1%)  78.87 (2.9%)  -0.6% ( -6% - 5%)  0.541
And3Terms  169.32 (5.3%)  168.32 (4.8%)  -0.6% ( -10% - 10%)  0.714
IntNRQ  304.31 (2.2%)  302.66 (2.6%)  -0.5% ( -5% - 4%)  0.481
Or2Terms2StopWords  155.77 (5.5%)  154.94 (5.5%)  -0.5% ( -10% - 10%)  0.758
TermDayOfYearSort  696.95 (1.9%)  693.32 (2.2%)  -0.5% ( -4% - 3%)  0.426
CountAndHighMed  289.64 (1.8%)  288.19 (1.7%)  -0.5% ( -3% - 3%)  0.363
OrStopWords  31.68 (9.6%)  31.52 (9.2%)  -0.5% ( -17% - 20%)  0.869
AndStopWords  30.55 (8.3%)  30.41 (7.6%)  -0.5% ( -15% - 16%)  0.857
OrHighMed  196.20 (2.8%)  195.33 (3.3%)  -0.4% ( -6% - 5%)  0.643
And2Terms2StopWords  158.92 (5.3%)  158.23 (4.8%)  -0.4% ( -10% - 10%)  0.787
FilteredAndStopWords  55.81 (2.8%)  55.58 (2.2%)  -0.4% ( -5% - 4%)  0.606
AndHighMed  131.29 (1.4%)  130.78 (1.2%)  -0.4% ( -2% - 2%)  0.349
CountOrHighMed  357.52 (1.7%)  356.16 (1.9%)  -0.4% ( -3% - 3%)  0.498
FilteredAnd2Terms2StopWords  200.10 (2.1%)  199.46 (1.5%)  -0.3% ( -3% - 3%)  0.585
CountTerm  8915.10 (4.2%)  8887.69 (4.5%)  -0.3% ( -8% - 8%)  0.822
CombinedOrHighMed  72.13 (2.5%)  71.95 (2.3%)  -0.2% ( -4% - 4%)  0.749
CombinedOrHighHigh  18.78 (2.3%)  18.74 (2.1%)  -0.2% ( -4% - 4%)  0.731
DismaxOrHighMed  170.26 (2.1%)  169.93 (1.9%)  -0.2% ( -4% - 3%)  0.760
FilteredOrHighMed  154.92 (1.4%)  154.63 (1.3%)  -0.2% ( -2% - 2%)  0.649
FilteredOrHighHigh  67.99 (1.9%)  67.86 (1.5%)  -0.2% ( -3% - 3%)  0.729
FilteredAndHighMed  130.42 (3.9%)  130.24 (3.4%)  -0.1% ( -7% - 7%)  0.902
FilteredTerm  159.19 (2.3%)  158.98 (1.9%)  -0.1% ( -4% - 4%)  0.843
FilteredOr2Terms2StopWords  147.78 (1.6%)  147.61 (1.4%)  -0.1% ( -3% - 2%)  0.808
FilteredAnd3Terms  188.72 (3.1%)  188.51 (2.5%)  -0.1% ( -5% - 5%)  0.898
CountFilteredOrHighMed  116.44 (0.8%)  116.33 (0.9%)  -0.1% ( -1% - 1%)  0.747
DismaxOrHighHigh  116.55 (2.9%)  116.47 (2.3%)  -0.1% ( -5% - 5%)  0.934
AndHighOrMedMed  46.65 (1.7%)  46.62 (1.7%)  -0.1% ( -3% - 3%)  0.899
CountFilteredOrHighHigh  105.35 (1.1%)  105.29 (1.1%)  -0.1% ( -2% - 2%)  0.865
FilteredOr3Terms  162.57 (1.6%)  162.48 (1.7%)  -0.1% ( -3% - 3%)  0.911
PKLookup  281.03 (1.8%)
```
[PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]
jpountz opened a new pull request, #14359: URL: https://github.com/apache/lucene/pull/14359

The main motivation is to efficiently evaluate range queries on fields that have a doc-value index enabled. These range queries produce two-phase iterators that should match large contiguous ranges of doc IDs. Using `DenseConjunctionBulkScorer` helps skip these clauses from the conjunction on doc ID ranges that they fully match.
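For readers unfamiliar with the two-phase contract, here is a minimal sketch of the kind of iterator such a range query produces: an approximation (every doc that has a value) plus a `matches()` check. The class and field names (`RangeTwoPhaseIterator`, `min`, `max`) are illustrative, not the actual Lucene implementation; only `TwoPhaseIterator` and `NumericDocValues` are real APIs.

```java
import java.io.IOException;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.TwoPhaseIterator;

final class RangeTwoPhaseIterator extends TwoPhaseIterator {
  private final NumericDocValues values;
  private final long min, max;

  RangeTwoPhaseIterator(NumericDocValues values, long min, long max) {
    super(values); // the approximation: every doc that has a value
    this.values = values;
    this.min = min;
    this.max = max;
  }

  @Override
  public boolean matches() throws IOException {
    // second phase: confirm the candidate doc actually falls in the range
    long v = values.longValue();
    return v >= min && v <= max;
  }

  @Override
  public float matchCost() {
    return 2f; // rough per-doc cost of the matches() check
  }
}
```

On a doc ID range where such a clause matches every doc, `DenseConjunctionBulkScorer` can drop it from the conjunction entirely instead of calling `matches()` per doc.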
Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]
iverase commented on code in PR #14358: URL: https://github.com/apache/lucene/pull/14358#discussion_r1997607493

## lucene/core/src/java/org/apache/lucene/util/packed/DirectMonotonicReader.java:

```diff
@@ -90,102 +140,142 @@ public static DirectMonotonicReader getInstance(Meta meta, RandomAccessInput data)
   /** Retrieves an instance from the specified slice. */
   public static DirectMonotonicReader getInstance(
-      Meta meta, RandomAccessInput data, boolean merging) throws IOException {
-    final LongValues[] readers = new LongValues[meta.numBlocks];
-    for (int i = 0; i < meta.numBlocks; ++i) {
-      if (meta.bpvs[i] == 0) {
-        readers[i] = LongValues.ZEROES;
-      } else if (merging
-          && i < meta.numBlocks - 1 // we only know the number of values for the last block
-          && meta.blockShift >= DirectReader.MERGE_BUFFER_SHIFT) {
-        readers[i] =
-            DirectReader.getMergeInstance(
-                data, meta.bpvs[i], meta.offsets[i], 1L << meta.blockShift);
+      Meta meta, RandomAccessInput data, boolean merging) {
+    return meta.getInstance(data, merging);
+  }
+
+  private static final class SingleBlockMeta extends Meta {
+    private final long min;
+    private final float avg;
+    private final byte bpv;
+    private final long offset;
+
+    private SingleBlockMeta(long min, float avg, byte bpv, long offset) {
+      this.min = min;
+      this.avg = avg;
+      this.bpv = bpv;
+      this.offset = offset;
```

Review Comment: This offset seems to always be zero, is that correct? Then we can get rid of it.
[PR] Specialise DirectMonotonicReader when it only contains one block [lucene]
iverase opened a new pull request, #14358: URL: https://github.com/apache/lucene/pull/14358

While looking into some heap dumps, I noticed in the DirectMonotonicReader.Meta objects held by segments that the case of a single-value block is actually common. I wondered if we could specialize that case so we can create a faster, lightweight single-block version.

This commit makes DirectMonotonicReader a sealed abstraction with two possible implementations, one for multi-blocks and the other for single blocks. The multi-block implementation is very similar to the current implementation, while the single-block one is more lightweight as it does not need to hold arrays. The only side effect is that DirectMonotonicReader is bimorphic now, but that should be ok.

I ran luceneutil and didn't see much change:

```
Task  QPS baseline (StdDev)  QPS my_modified_version (StdDev)  Pct diff  p-value
OrHighNotMed  168.35 (4.3%)  163.80 (4.9%)  -2.7% ( -11% - 6%)  0.064
HighTermTitleSort  36.96 (2.4%)  36.04 (4.2%)  -2.5% ( -8% - 4%)  0.023
PKLookup  105.78 (6.1%)  103.28 (8.4%)  -2.4% ( -15% - 12%)  0.306
HighTerm  228.90 (4.3%)  223.90 (5.3%)  -2.2% ( -11% - 7%)  0.154
OrHighLow  330.01 (3.9%)  324.79 (3.5%)  -1.6% ( -8% - 6%)  0.181
LowSpanNear  32.10 (6.5%)  31.61 (6.9%)  -1.5% ( -13% - 12%)  0.469
Wildcard  103.26 (5.3%)  101.68 (7.1%)  -1.5% ( -13% - 11%)  0.443
HighSpanNear  4.70 (4.7%)  4.63 (4.6%)  -1.5% ( -10% - 8%)  0.309
OrNotHighLow  451.15 (3.5%)  444.47 (4.0%)  -1.5% ( -8% - 6%)  0.215
HighSloppyPhrase  4.71 (4.8%)  4.64 (5.7%)  -1.4% ( -11% - 9%)  0.415
Fuzzy2  36.81 (4.8%)  36.31 (5.9%)  -1.4% ( -11% - 9%)  0.423
Fuzzy1  40.34 (2.9%)  39.81 (4.1%)  -1.3% ( -8% - 5%)  0.233
OrHighHigh  75.21 (5.2%)  74.22 (5.1%)  -1.3% ( -11% - 9%)  0.418
AndHighHigh  59.33 (5.0%)  58.57 (5.1%)  -1.3% ( -10% - 9%)  0.426
AndHighMed  131.70 (3.2%)  130.05 (4.1%)  -1.2% ( -8% - 6%)  0.286
range  2006.77 (7.0%)  1983.01 (8.5%)  -1.2% ( -15% - 15%)  0.630
MedTerm  333.14 (4.2%)  329.23 (5.2%)  -1.2% ( -10% - 8%)  0.429
LowIntervalsOrdered  6.16 (5.8%)  6.09 (6.5%)  -1.1% ( -12% - 11%)  0.564
OrHighNotHigh  206.93 (3.4%)  204.61 (4.4%)  -1.1% ( -8% - 6%)  0.366
MedSpanNear  15.36 (2.9%)  15.20 (3.7%)  -1.0% ( -7% - 5%)  0.323
OrHighMed  135.41 (3.2%)  134.25 (2.8%)  -0.9% ( -6% - 5%)  0.373
LowTerm  528.68 (4.1%)  524.30 (4.8%)  -0.8% ( -9% - 8%)  0.556
OrHighNotLow  238.88 (4.2%)  237.11 (4.7%)  -0.7% ( -9% - 8%)  0.599
MedSloppyPhrase  38.80 (2.7%)  38.55 (3.0%)  -0.6% ( -6% - 5%)  0.483
MedIntervalsOrdered  43.96 (9.3%)  43.68 (9.2%)  -0.6% ( -17% - 19%)  0.831
OrNotHighHigh  301.25 (2.5%)  299.46 (3.2%)  -0.6% ( -6% - 5%)  0.520
HighIntervalsOrdered  7.35 (6.1%)  7.31 (6.8%)  -0.5% ( -12% - 13%)  0.792
BrowseDayOfYearTaxoFacets  2.03 (13.2%)  2.02 (14.7%)  -0.4% ( -25% - 31%)  0.920
OrNotHighMed  229.85 (2.8%)  228.86 (4.2%)  -0.4% ( -7% - 6%)  0.701
BrowseDateTaxoFacets  2.01 (12.5%)  2.00 (15.0%)  -0.4% ( -24% - 30%)  0.922
MedPhrase  77.28 (2.9%)  77.10 (3.1%)  -0.2% ( -6% - 5%)  0.799
BrowseRandomLabelTaxoFacets  1.49 (12.2%)  1.49 (16.4%)  -0.2% ( -25% - 32%)  0.960
Respell  22.59 (3.9%)  22.55 (4.1%)  -0.1% ( -7% - 8%)  0.908
LowSloppyPhrase  76.40 (2.7%)  76.35 (3.2%)  -0.1% ( -5% - 6%)  0.938
AndHighLow  478.69 (5.3%)  479.27 (7.1%)  0.1% ( -11% - 13%)  0.951
```
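For intuition, here is a hedged sketch of what the single-block fast path can look like: one `LongValues` of deltas plus a few scalars, applying the same `min + (long) (avg * index) + delta` decoding the multi-block reader uses, minus the block lookup and per-block arrays. Class and field names are illustrative, not the PR's actual code; only `org.apache.lucene.util.LongValues` is a real Lucene type.

```java
import org.apache.lucene.util.LongValues;

final class SingleBlockReader extends LongValues {
  private final long min;
  private final float avg;
  private final LongValues deltas; // e.g. a DirectReader over the block's packed bits

  SingleBlockReader(long min, float avg, LongValues deltas) {
    this.min = min;
    this.avg = avg;
    this.deltas = deltas;
  }

  @Override
  public long get(long index) {
    // same decoding as the multi-block reader, without the block dispatch
    return min + (long) (avg * index) + deltas.get(index);
  }
}
```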
Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]
rmuir commented on PR #14360: URL: https://github.com/apache/lucene/pull/14360#issuecomment-2727698038

Here's the same Automaton, but via `toDot()` tossed into https://dreampuf.github.io/GraphvizOnline with all defaults. I guess I'm still a fan of that output style, I feel it is more readable. But maybe we can improve the styling here to make this more useful and save you that step.

[attached image: Graphviz rendering of the automaton]
Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]
rmuir commented on PR #14360: URL: https://github.com/apache/lucene/pull/14360#issuecomment-2727699099

Mermaid definitely doesn't handle infinite automata very well at all:

```mermaid
stateDiagram
  direction LR
  classDef accept border-width:5px,stroke-width:5px,stroke:#0000ff
  class 0 accept
  0 --> 0:K
```
Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]
gf2121 merged PR #14203: URL: https://github.com/apache/lucene/pull/14203
Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]
jpountz commented on PR #14358: URL: https://github.com/apache/lucene/pull/14358#issuecomment-2727467298

I'm wary about adding all these micro-optimizations to reduce the per-segment per-field overhead. They hurt readability and may easily get lost over time when codecs get replaced with new codecs.

Presumably, the motivation is to reduce heap usage when working with lots of shards / segments / fields? Could this be solved differently so that heap usage due to open indexes remains low by design rather than as a result of many micro-optimizations? E.g. by pooling index readers using a least-recently-used policy, using a single Lucene field for multiple logical fields (like Elasticsearch's `flattened` does), merging segments more aggressively (e.g. by bumping the floor/min segment size), or a combination of all these things?
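To make the first alternative concrete, here is a hedged sketch of an LRU reader pool. Everything here (the class name, the `maxOpen` knob, eviction-by-close) is illustrative and not anything jpountz specified; a production version would also need reference counting (`IndexReader#incRef`/`decRef`) so that an evicted reader isn't closed while a search is still using it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

final class ReaderPool {
  private final Map<Path, DirectoryReader> lru;

  ReaderPool(int maxOpen) {
    // accessOrder=true turns LinkedHashMap into an LRU map
    this.lru =
        new LinkedHashMap<>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<Path, DirectoryReader> eldest) {
            if (size() > maxOpen) {
              try {
                // evict: release the heap held by the least-recently-used reader
                eldest.getValue().close();
              } catch (IOException e) {
                throw new UncheckedIOException(e);
              }
              return true;
            }
            return false;
          }
        };
  }

  synchronized DirectoryReader get(Path indexPath) throws IOException {
    DirectoryReader reader = lru.get(indexPath);
    if (reader == null) {
      reader = DirectoryReader.open(FSDirectory.open(indexPath));
      lru.put(indexPath, reader);
    }
    return reader;
  }
}
```

With a pool like this, only the hottest `maxOpen` indexes keep their per-segment per-field metadata on the heap, regardless of how many indexes exist on disk.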
Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727461724

+1, let's use `DisjunctionSumScorer` (which already supports two-phase iteration) when one of the clauses exposes a non-null two-phase iterator?
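As a sketch of the dispatch being proposed: check the clauses up front and only take the bulk-scoring path when none of them needs a second phase. The helper name is hypothetical; only `Scorer#twoPhaseIterator()` is real Lucene API.

```java
import java.util.List;
import org.apache.lucene.search.Scorer;

final class DisjunctionDispatch {
  /** True if any clause needs its second phase (matches()) to be honored. */
  static boolean needsTwoPhaseAwareScorer(List<Scorer> clauses) {
    for (Scorer clause : clauses) {
      if (clause.twoPhaseIterator() != null) {
        // bit-set style bulk scoring would treat the approximation as matches
        // and silently skip matches(); fall back to a two-phase-aware scorer
        return true;
      }
    }
    return false;
  }
}
```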
Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727528269

If one or more DISIs have a high cost (irrespective of TPIs), thus matching many docs, I could see avoiding BS1 as well.

As an aside: if we are going to refer to these as BS1 vs. BS2, they should have names more clearly reflecting this.
Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727629320

In case you missed it, `BooleanScorer` had optimizations recently that make it hard to beat by `DisjunctionScorer` when clauses are `PostingsEnum`s:
- `DocIdSetIterator#intoBitSet` helps impls optimize loading matches into bit sets, in the case of `PostingsEnum` through (auto) vectorization (annotation [`HR`](https://benchmarks.mikemccandless.com/CountOrHighHigh.html)); see the sketch after this list
- Dense blocks of postings are stored as bit sets, further speeding up OR-ing postings into a bit set ([annotations `HW` and `HX`](https://benchmarks.mikemccandless.com/CountOrHighHigh.html))
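A hedged sketch of how this window-and-bit-set ("BS1") style collection fits together, using the real `DocIdSetIterator#intoBitSet` entry point. The window size and class name are illustrative, and this is a simplification of what `BooleanScorer` actually does, not a copy of it.

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;

final class WindowOr {
  static final int WINDOW_SIZE = 4096;

  /** OR all clauses into one bit set for docs in [windowMin, windowMin + WINDOW_SIZE). */
  static FixedBitSet collectWindow(List<DocIdSetIterator> clauses, int windowMin)
      throws IOException {
    FixedBitSet window = new FixedBitSet(WINDOW_SIZE);
    int windowMax = windowMin + WINDOW_SIZE;
    for (DocIdSetIterator clause : clauses) {
      if (clause.docID() < windowMin) {
        clause.advance(windowMin);
      }
      // Implementations like PostingsEnum override intoBitSet with
      // (auto-)vectorized code; the default impl just loops nextDoc().
      clause.intoBitSet(windowMax, window, windowMin);
    }
    return window; // caller iterates set bits and collects windowMin + bit
  }
}
```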
Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727625240

> If one or more DISI has a high cost (irrespective of TPIs), thus matching many docs, I could see avoiding BS1 as well.

I imagine that your idea is that if most of the cost comes from a single DISI, then the heap doesn't need to be reordered every time, so the heap-reordering overhead may be less than the bit-set overhead. `BooleanScorer` has an optimization for a similar situation: if a single clause matches on a window of 4,096 docs, then it will skip the bit set as an intermediate representation of matches and feed this clause directly into the collector.

> An aside, if we are going to refer to these as BS1 vs BS2, they should have names more clearly reflecting this.

Agreed to stop using `BS1`/`BS2`. It's how Lucene's disjunction scorers were historically referred to, and these names came naturally after reading your description message that abbreviated `BooleanScorer` to `BS`, but it doesn't properly reflect how Lucene scores disjunctions nowadays.
Re: [I] Multi-HNSW graphs per segment? [lucene]
navneet1v commented on issue #14341: URL: https://github.com/apache/lucene/issues/14341#issuecomment-2727640413

> What you primarily want in the referenced GH issue is the ability to filter on more metadata during traversal vs doing a pre filter on the candidate documents themselves. As Adrien pointed out, this is better solved with more efficient filtering approaches (I wonder if the recent work on ACORN would help?)
>
> This issue is more towards the niche use cases where multiple graphs for a given segment would work. Multi threaded queries and disjoint cluster representation seem to be the immediate choices.

So what I have been thinking here is that if, at a segment level, we add a capability of having multiple graphs and make the process of generating those graphs generic, it will solve the problem I mentioned in the GitHub issue. Overall, how you split your docs into multiple clusters can be a choice for the user who is configuring the codec. If the user wants to use the k-means algorithm to do clustering then they can use it, but if there are, let's say, other parameters in the document, then those can be used for clustering/building different HNSW graphs.
Re: [I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]
navneet1v commented on issue #14348: URL: https://github.com/apache/lucene/issues/14348#issuecomment-2727641512

@viliam-durina if you have benchmarks showing that the performance is better, it would be good to raise a PR. Once the PR is there, maintainers can also do more tests to see if it is really the case.

cc'ing a few maintainers: @mikemccand, @benwtrent, @msokolov
Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727499162

Thanks for your confirmation of the problem. The collect-per-clause is surprising to me; what would benefit from that algorithm? Wouldn't that _only_ in fact be _needed_ if scores are _needed_, since they need to be aggregated across the clauses on a matching document? If so, this isn't really about TPIs; it's about the need for scores.
[PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]
rmuir opened a new pull request, #14360: URL: https://github.com/apache/lucene/pull/14360

Mermaid is a state-chart syntax supported within fenced code blocks by GitHub. For some reason GitHub doesn't support dotty but instead the latest js tool; I'm sure in 2 months it will be a different tool. Beautification of any sort can happen at any time, this just makes it work correctly.

Mermaid's state diagram really isn't the best for this, just look at how start/end states work there. So we just define a "class" that makes accept states blue. I'd rather have double-circle but I'm not a CSS guy.

Closes #14351

```mermaid
stateDiagram
  direction LR
  classDef accept border-width:5px,stroke-width:5px,stroke:#0000ff
  0
  0 --> 13:\\U-k
  0 --> 1:l
  0 --> 13:m-t
  0 --> 19:u
  0 --> 13:v-\\U0010
  1
  1 --> 14:\\U-b
  1 --> 20:c
  1 --> 14:d-t
  1 --> 2:u
  1 --> 14:v-\\U0010
  2
  2 --> 15:\\U-b
  2 --> 3:c
  2 --> 15:d
  2 --> 21:e
  2 --> 15:f-\\U0010
  3
  3 --> 16:\\U-d
  3 --> 4:e
  3 --> 16:f-m
  3 --> 22:n
  3 --> 16:o-\\U0010
  4
  4 --> 17:\\U-d
  4 --> 23:e
  4 --> 17:f-m
  4 --> 5:n
  4 --> 17:o-\\U0010
  5
  class 5 accept
  5 --> 18:\\U-d
  5 --> 6:e
  5 --> 18:f-\\U0010
  6
  class 6 accept
  6 --> 12:\\U-\\U0010
  7
  7 --> 8:u
  8
  8 --> 9:c
  9
  9 --> 10:e
  10
  10 --> 11:n
  11
  11 --> 12:e
  12
  class 12 accept
  13
  13 --> 7:l
  13 --> 8:u
  14
  14 --> 9:c
  14 --> 8:u
  15
  15 --> 9:c
  15 --> 10:e
  16
  16 --> 10:e
  16 --> 11:n
  17
  17 --> 12:e
  17 --> 11:n
  18
  class 18 accept
  18 --> 12:e
  19
  19 --> 9:c
  19 --> 7:l
  19 --> 8:u
  20
  20 --> 9:c
  20 --> 10:e
  20 --> 8:u
  21
  21 --> 9:c
  21 --> 10:e
  21 --> 11:n
  22
  22 --> 24:e
  22 --> 11:n
  23
  class 23 accept
  23 --> 12:e
  23 --> 11:n
  24
  class 24 accept
  24 --> 11:n
```
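For reference, a hedged sketch of what such an emitter can look like against the real `Automaton`/`Transition` API. The PR's actual implementation may differ; in particular this sketch omits the label escaping (e.g. the `\\U` ranges above) and the accept-class styling details.

```java
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Transition;

final class MermaidEmitter {
  static String toMermaid(Automaton a) {
    StringBuilder b = new StringBuilder("stateDiagram\n  direction LR\n");
    b.append("  classDef accept stroke-width:5px\n");
    Transition t = new Transition();
    for (int state = 0; state < a.getNumStates(); state++) {
      b.append("  ").append(state).append('\n');
      if (a.isAccept(state)) {
        b.append("  class ").append(state).append(" accept\n");
      }
      // initTransition must be called before iterating a state's transitions
      int numTransitions = a.initTransition(state, t);
      for (int i = 0; i < numTransitions; i++) {
        a.getNextTransition(t);
        String label =
            t.min == t.max
                ? Character.toString(t.min)
                : Character.toString(t.min) + "-" + Character.toString(t.max);
        b.append("  ").append(state).append(" --> ").append(t.dest)
            .append(':').append(label).append('\n');
      }
    }
    return b.toString();
  }
}
```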
Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727502419

BS2 uses a heap to merge multiple `DocIdSetIterator`s. Unfortunately, reordering this heap on every call to `nextDoc()` or `advance(int)` is not completely free, and BS1's approach of loading all clauses into a shared bit set and then iterating the bits of this bit set proves to be faster. Presumably, when two-phase iterators are involved, the cost of reordering the heap wouldn't be a big deal compared to the cost of calling `TwoPhaseIterator#matches()`.
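A stripped-down illustration of where that per-call heap-reordering cost comes from. This uses `java.util.PriorityQueue` as a stand-in for Lucene's `DisiPriorityQueue`; it is a sketch of the "BS2" idea, not the real scorer.

```java
import java.io.IOException;
import java.util.PriorityQueue;
import org.apache.lucene.search.DocIdSetIterator;

final class HeapDisjunction {
  private final PriorityQueue<DocIdSetIterator> heap =
      new PriorityQueue<>((a, b) -> Integer.compare(a.docID(), b.docID()));

  HeapDisjunction(Iterable<DocIdSetIterator> clauses) {
    for (DocIdSetIterator c : clauses) {
      heap.add(c); // fresh iterators are positioned at docID() == -1
    }
  }

  /** Advance past the current doc; every call pays heap-reordering cost. */
  int nextDoc() throws IOException {
    int doc = heap.peek().docID();
    if (doc == DocIdSetIterator.NO_MORE_DOCS) {
      return doc;
    }
    // advance every clause positioned on `doc` and sift it back into place
    while (heap.peek().docID() == doc) {
      DocIdSetIterator top = heap.poll();
      top.nextDoc();
      heap.add(top); // O(log n) reordering: the overhead described above
    }
    return heap.peek().docID();
  }
}
```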
Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]
iverase commented on PR #14358: URL: https://github.com/apache/lucene/pull/14358#issuecomment-2728327145

I understand what you say, I will close this then.
Re: [PR] Reduce Lucene90DocValuesProducer memory footprint [lucene]
iverase commented on PR #14340: URL: https://github.com/apache/lucene/pull/14340#issuecomment-2728329058

See here https://github.com/apache/lucene/pull/14358#issuecomment-2727467298, I will close this.
Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2728182572

I could imagine improving BooleanScorer so that the TPI clauses are separated out and converted to a filter around the collector, to try to match docs *not* collected (i.e. test the docs in between those collected). All this logic could be separated into another class such that BooleanScorer stays nearly the same, or even identical if the caller were to compose it. DisjunctionDISIApproximation would be instrumental.
Re: [I] Multi-threaded vector search over multiple segments can lead to inconsistent results [lucene]
Zona-hu commented on issue #14180: URL: https://github.com/apache/lucene/issues/14180#issuecomment-2728315199

> ### Description
> Related: [#14167](https://github.com/apache/lucene/pull/14167)
>
> However, beyond multi-leaf collection (e.g. information sharing), multi-threaded search over multiple segments should also give consistent results at low `k`.
>
> It is possible to get more consistent results, and most inconsistencies can probably be eliminated by simply collecting more neighbors (`k` in the query, `fanout` in the lucene tools, or `efSearch` if you prefer...). However, even though HNSW search is approximate, we should strive for consistency.
>
> ### Version and environment details
> _No response_

Hello, I filed this issue: https://github.com/apache/lucene/issues/14180. Could you please let me know which future version of Elasticsearch will resolve the vector search consistency problem?
Re: [PR] Reduce Lucene90DocValuesProducer memory footprint [lucene]
iverase closed pull request #14340: Reduce Lucene90DocValuesProducer memory footprint URL: https://github.com/apache/lucene/pull/14340
Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]
vigyasharma commented on PR #14325: URL: https://github.com/apache/lucene/pull/14325#issuecomment-2728289419

+1 to Adrien's comment, IndexDeletionPolicy can quite easily be implemented and configured by users in IndexWriterConfig. It is often configured outside of Lucene too, like the [CombinedDeletionPolicy](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/engine/CombinedDeletionPolicy.java) in OpenSearch and Elasticsearch.

@DivyanshIITB Incidentally, I've recently been working on a segment replication optimization (#14219) that could benefit from keeping the last N commits in the index. Ah, I see you already linked this PR to that issue. I would suggest you create a separate IndexDeletionPolicy for this. You could also allow users to configure the `N` commits they want to retain instead of fixing it to 5.
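A minimal sketch of such a policy, assuming the standard `IndexDeletionPolicy` contract (the commit list is passed oldest-first); the class name and the `numToKeep` knob are illustrative:

```java
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

public final class KeepLastNCommitsDeletionPolicy extends IndexDeletionPolicy {
  private final int numToKeep;

  public KeepLastNCommitsDeletionPolicy(int numToKeep) {
    if (numToKeep < 1) {
      throw new IllegalArgumentException("numToKeep must be >= 1");
    }
    this.numToKeep = numToKeep;
  }

  @Override
  public void onInit(List<? extends IndexCommit> commits) {
    onCommit(commits);
  }

  @Override
  public void onCommit(List<? extends IndexCommit> commits) {
    // commits are sorted by age, oldest first; delete all but the newest N
    for (int i = 0; i < commits.size() - numToKeep; i++) {
      commits.get(i).delete();
    }
  }
}
```

Wiring it in would then be a one-liner on the config, e.g. `indexWriterConfig.setIndexDeletionPolicy(new KeepLastNCommitsDeletionPolicy(5))`.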
Re: [PR] Use Vector API to decode BKD docIds [lucene]
jpountz commented on code in PR #14203: URL: https://github.com/apache/lucene/pull/14203#discussion_r1997535617

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PointsWriter.java:

```diff
@@ -105,15 +107,22 @@ public Lucene90PointsWriter(
     }
   }
 
+  public Lucene90PointsWriter(
+      SegmentWriteState writeState, int maxPointsInLeafNode, double maxMBSortInHeap)
+      throws IOException {
+    this(writeState, maxPointsInLeafNode, maxMBSortInHeap, Lucene90PointsFormat.VERSION_CURRENT);
+  }
```

Review Comment: let's make all constructors that take a version pkg-private?

## lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java:

```diff
@@ -139,6 +141,28 @@ public BKDWriter(
       BKDConfig config,
       double maxMBSortInHeap,
       long totalPointCount) {
+    this(
+        maxDoc,
+        tempDir,
+        tempFileNamePrefix,
+        config,
+        maxMBSortInHeap,
+        totalPointCount,
+        BKDWriter.VERSION_CURRENT);
+  }
+
+  public BKDWriter(
```

Review Comment: Can you add javadocs that this ctor should be only used for testing with older versions?

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PointsFormat.java:

```diff
@@ -59,18 +61,39 @@ public final class Lucene90PointsFormat extends PointsFormat {
   public static final String META_EXTENSION = "kdm";
 
   static final int VERSION_START = 0;
-  static final int VERSION_CURRENT = VERSION_START;
+  static final int VERSION_BKD_VECTORIZED_BPV24 = 1;
+  static final int VERSION_CURRENT = VERSION_BKD_VECTORIZED_BPV24;
+
+  private static final Map<Integer, Integer> VERSION_TO_BKD_VERSION =
+      Map.of(
+          VERSION_START, BKDWriter.VERSION_META_FILE,
+          VERSION_BKD_VECTORIZED_BPV24, BKDWriter.VERSION_VECTORIZED_DOCID);
+
+  private final int version;
 
   /** Sole constructor */
-  public Lucene90PointsFormat() {}
+  public Lucene90PointsFormat() {
+    this(VERSION_CURRENT);
+  }
+
+  public Lucene90PointsFormat(int version) {
```

Review Comment: Could it be pkg-private? I think we only need it for testing?

## lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java:

```diff
@@ -114,31 +117,28 @@ void writeDocIds(int[] docIds, int start, int count, DataOutput out) throws IOException
     } else {
       if (max <= 0xFFFFFF) {
         out.writeByte(BPV_24);
-        // write them the same way we are reading them.
-        int i;
-        for (i = 0; i < count - 7; i += 8) {
-          int doc1 = docIds[start + i];
-          int doc2 = docIds[start + i + 1];
-          int doc3 = docIds[start + i + 2];
-          int doc4 = docIds[start + i + 3];
-          int doc5 = docIds[start + i + 4];
-          int doc6 = docIds[start + i + 5];
-          int doc7 = docIds[start + i + 6];
-          int doc8 = docIds[start + i + 7];
-          long l1 = (doc1 & 0xffffffL) << 40 | (doc2 & 0xffffffL) << 16 | ((doc3 >>> 8) & 0xffffL);
-          long l2 =
-              (doc3 & 0xffL) << 56
-                  | (doc4 & 0xffffffL) << 32
-                  | (doc5 & 0xffffffL) << 8
-                  | ((doc6 >> 16) & 0xffL);
-          long l3 = (doc6 & 0xffffL) << 48 | (doc7 & 0xffffffL) << 24 | (doc8 & 0xffffffL);
-          out.writeLong(l1);
-          out.writeLong(l2);
-          out.writeLong(l3);
-        }
-        for (; i < count; ++i) {
-          out.writeShort((short) (docIds[start + i] >>> 8));
-          out.writeByte((byte) docIds[start + i]);
+        if (version < BKDWriter.VERSION_VECTORIZED_DOCID) {
+          writeScalarInts24(docIds, start, count, out);
+        } else {
+          // encode the docs in the format that can be vectorized decoded.
+          final int quarter = count >> 2;
+          final int numInts = quarter * 3;
+          for (int i = 0; i < numInts; i++) {
+            scratch[i] = docIds[i + start] << 8;
+          }
+          for (int i = 0; i < quarter; i++) {
+            final int longIdx = i + numInts + start;
+            scratch[i] |= docIds[longIdx] >>> 16;
+            scratch[i + quarter] |= (docIds[longIdx] >>> 8) & 0xFF;
+            scratch[i + quarter * 2] |= docIds[longIdx] & 0xFF;
```

Review Comment: nit: maybe write bytes in little-endian order for consistency, so `scratch[i] |= docIds[longIdx] & 0xFF; scratch[i + quarter] |= (docIds[longIdx] >>> 8) & 0xFF; scratch[i + quarter * 2] |= docIds[longIdx] >>> 16;`, etc.?

## lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java:

```diff
@@ -248,21 +281,68 @@ private void readBitSet(IndexInput in, int count, int[] docIDs) throws IOException
     assert pos == count : "pos: " + pos + ", count: " + count;
   }
 
-  private static void readDelta16(IndexInput in, int count, int[] docIDs) throws IOException {
+  private static void readDelta16(IndexInput in, int count, int[] docIds) throws IOException {
     final int min = in.readVInt();
-    final int halfLen = count >>> 1;
-    in.readInts(docIDs, 0, h
```
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r1997503104

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.codecs.lucene90.blocktree;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;
import java.util.function.BiConsumer;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RandomAccessInput;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;

/** TODO make it a more memory efficient structure */
class Trie {

  static final int SIGN_NO_CHILDREN = 0x00;
  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
  static final int SIGN_MULTI_CHILDREN = 0x03;

  record Output(long fp, boolean hasTerms, BytesRef floorData) {}

  private enum Status {
    UNSAVED,
    SAVED,
    DESTROYED
  }

  private static class Node {
    private final int label;
    private final LinkedList<Node> children;
    private Output output;
    private long fp = -1;

    Node(int label, Output output, LinkedList<Node> children) {
      this.label = label;
      this.output = output;
      this.children = children;
    }
  }

  private Status status = Status.UNSAVED;
  final Node root = new Node(0, null, new LinkedList<>());

  Trie(BytesRef k, Output v) {
    if (k.length == 0) {
      root.output = v;
      return;
    }
    Node parent = root;
    for (int i = 0; i < k.length; i++) {
      int b = k.bytes[i + k.offset] & 0xFF;
      Output output = i == k.length - 1 ? v : null;
      Node node = new Node(b, output, new LinkedList<>());
      parent.children.add(node);
      parent = node;
    }
  }

  void putAll(Trie trie) {
    if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
      throw new IllegalStateException("tries should be unsaved");
    }
    trie.status = Status.DESTROYED;
    putAll(this.root, trie.root);
  }

  private static void putAll(Node n, Node add) {
    assert n.label == add.label;
    if (add.output != null) {
      n.output = add.output;
```

Review Comment: I rethought about this and I agree we should have this check; added :)
Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]
gf2121 commented on code in PR #14203: URL: https://github.com/apache/lucene/pull/14203#discussion_r1997571999

## lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java:

```diff
@@ -248,21 +281,68 @@ private void readBitSet(IndexInput in, int count, int[] docIDs) throws IOException
     assert pos == count : "pos: " + pos + ", count: " + count;
   }
 
-  private static void readDelta16(IndexInput in, int count, int[] docIDs) throws IOException {
+  private static void readDelta16(IndexInput in, int count, int[] docIds) throws IOException {
     final int min = in.readVInt();
-    final int halfLen = count >>> 1;
-    in.readInts(docIDs, 0, halfLen);
-    for (int i = 0; i < halfLen; ++i) {
-      int l = docIDs[i];
+    final int half = count >> 1;
+    in.readInts(docIds, 0, half);
+    if (count == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE) {
+      // Same format, but enabling the JVM to specialize the decoding logic for the default number
+      // of points per node proved to help on benchmarks
+      decode16(docIds, BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE / 2, min);
+      assert half * 2 == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE
+          : "we are assuming DEFAULT_MAX_POINTS_IN_LEAF_NODE is a multiple of 2 here.";
+    } else {
+      decode16(docIds, half, min);
+      for (int i = half << 1; i < count; i++) {
+        docIds[i] = Short.toUnsignedInt(in.readShort()) + min;
+      }
+    }
+  }
+
+  private static void decode16(int[] docIDs, int half, int min) {
+    for (int i = 0; i < half; ++i) {
+      final int l = docIDs[i];
       docIDs[i] = (l >>> 16) + min;
-      docIDs[halfLen + i] = (l & 0xFFFF) + min;
+      docIDs[i + half] = (l & 0xFFFF) + min;
     }
-    if ((count & 1) == 1) {
-      docIDs[count - 1] = Short.toUnsignedInt(in.readShort()) + min;
+  }
+
+  private void readInts24(IndexInput in, int count, int[] docIDs) throws IOException {
+    int quarter = count >> 2;
+    int numInts = quarter * 3;
+    in.readInts(scratch, 0, numInts);
+    if (count == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE) {
+      // Same format, but enabling the JVM to specialize the decoding logic for the default number
+      // of points per node proved to help on benchmarks
+      decode24(
+          docIDs,
+          scratch,
+          BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE / 4,
+          BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE / 4 * 3);
+      assert quarter * 4 == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE
+          : " we are assuming DEFAULT_MAX_POINTS_IN_LEAF_NODE is a multiple of 4 here.";
+    } else {
+      decode24(docIDs, scratch, quarter, numInts);
+      // Now read the remaining 0, 1, 2 or 3 values
+      for (int i = quarter << 2; i < count; ++i) {
+        docIDs[i] = (in.readShort() & 0xFFFF) | (in.readByte() & 0xFF) << 16;
+      }
```

Review Comment: I want to keep `decode24` small, so I put it under the `if/else` block to save the assertion. luceneutil and JMH proved it has similar performance.