Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]

2025-03-16 Thread via GitHub


jpountz commented on PR #14359:
URL: https://github.com/apache/lucene/pull/14359#issuecomment-2727649234

   This doesn't slow down existing tasks significantly, including `CountFilteredPhrase`, which now runs with `DenseConjunctionBulkScorer` vs. a `DefaultBulkScorer` on top of a `ConjunctionScorer` before.
   
   ```
   Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
   TermMonthSort  3433.89  (3.5%)  3372.95  (2.8%)  -1.8%  ( -7% - 4%)  0.078
   CountFilteredPhrase  25.76  (2.6%)  25.34  (2.6%)  -1.6%  ( -6% - 3%)  0.048
   Wildcard  76.02  (3.6%)  75.31  (3.9%)  -0.9%  ( -8% - 6%)  0.435
   FilteredPrefix3  128.10  (3.3%)  126.96  (2.4%)  -0.9%  ( -6% - 4%)  0.327
   Prefix3  133.76  (3.2%)  132.58  (2.6%)  -0.9%  ( -6% - 5%)  0.339
   OrHighHigh  51.94  (3.4%)  51.49  (3.9%)  -0.9%  ( -7% - 6%)  0.457
   Or3Terms  162.21  (5.8%)  160.85  (5.6%)  -0.8%  ( -11% - 11%)  0.642
   TermTitleSort  141.36  (1.7%)  140.17  (2.3%)  -0.8%  ( -4% - 3%)  0.201
   FilteredIntNRQ  297.20  (2.2%)  295.03  (3.0%)  -0.7%  ( -5% - 4%)  0.380
   Fuzzy2  74.81  (2.9%)  74.30  (2.8%)  -0.7%  ( -6% - 5%)  0.438
   Fuzzy1  79.34  (3.1%)  78.87  (2.9%)  -0.6%  ( -6% - 5%)  0.541
   And3Terms  169.32  (5.3%)  168.32  (4.8%)  -0.6%  ( -10% - 10%)  0.714
   IntNRQ  304.31  (2.2%)  302.66  (2.6%)  -0.5%  ( -5% - 4%)  0.481
   Or2Terms2StopWords  155.77  (5.5%)  154.94  (5.5%)  -0.5%  ( -10% - 10%)  0.758
   TermDayOfYearSort  696.95  (1.9%)  693.32  (2.2%)  -0.5%  ( -4% - 3%)  0.426
   CountAndHighMed  289.64  (1.8%)  288.19  (1.7%)  -0.5%  ( -3% - 3%)  0.363
   OrStopWords  31.68  (9.6%)  31.52  (9.2%)  -0.5%  ( -17% - 20%)  0.869
   AndStopWords  30.55  (8.3%)  30.41  (7.6%)  -0.5%  ( -15% - 16%)  0.857
   OrHighMed  196.20  (2.8%)  195.33  (3.3%)  -0.4%  ( -6% - 5%)  0.643
   And2Terms2StopWords  158.92  (5.3%)  158.23  (4.8%)  -0.4%  ( -10% - 10%)  0.787
   FilteredAndStopWords  55.81  (2.8%)  55.58  (2.2%)  -0.4%  ( -5% - 4%)  0.606
   AndHighMed  131.29  (1.4%)  130.78  (1.2%)  -0.4%  ( -2% - 2%)  0.349
   CountOrHighMed  357.52  (1.7%)  356.16  (1.9%)  -0.4%  ( -3% - 3%)  0.498
   FilteredAnd2Terms2StopWords  200.10  (2.1%)  199.46  (1.5%)  -0.3%  ( -3% - 3%)  0.585
   CountTerm  8915.10  (4.2%)  8887.69  (4.5%)  -0.3%  ( -8% - 8%)  0.822
   CombinedOrHighMed  72.13  (2.5%)  71.95  (2.3%)  -0.2%  ( -4% - 4%)  0.749
   CombinedOrHighHigh  18.78  (2.3%)  18.74  (2.1%)  -0.2%  ( -4% - 4%)  0.731
   DismaxOrHighMed  170.26  (2.1%)  169.93  (1.9%)  -0.2%  ( -4% - 3%)  0.760
   FilteredOrHighMed  154.92  (1.4%)  154.63  (1.3%)  -0.2%  ( -2% - 2%)  0.649
   FilteredOrHighHigh  67.99  (1.9%)  67.86  (1.5%)  -0.2%  ( -3% - 3%)  0.729
   FilteredAndHighMed  130.42  (3.9%)  130.24  (3.4%)  -0.1%  ( -7% - 7%)  0.902
   FilteredTerm  159.19  (2.3%)  158.98  (1.9%)  -0.1%  ( -4% - 4%)  0.843
   FilteredOr2Terms2StopWords  147.78  (1.6%)  147.61  (1.4%)  -0.1%  ( -3% - 2%)  0.808
   FilteredAnd3Terms  188.72  (3.1%)  188.51  (2.5%)  -0.1%  ( -5% - 5%)  0.898
   CountFilteredOrHighMed  116.44  (0.8%)  116.33  (0.9%)  -0.1%  ( -1% - 1%)  0.747
   DismaxOrHighHigh  116.55  (2.9%)  116.47  (2.3%)  -0.1%  ( -5% - 5%)  0.934
   AndHighOrMedMed  46.65  (1.7%)  46.62  (1.7%)  -0.1%  ( -3% - 3%)  0.899
   CountFilteredOrHighHigh  105.35  (1.1%)  105.29  (1.1%)  -0.1%  ( -2% - 2%)  0.865
   FilteredOr3Terms  162.57  (1.6%)  162.48  (1.7%)  -0.1%  ( -3% - 3%)  0.911
   PKLookup  281.03  (1.8%)
   ```

[PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]

2025-03-16 Thread via GitHub


jpountz opened a new pull request, #14359:
URL: https://github.com/apache/lucene/pull/14359

   The main motivation is to efficiently evaluate range queries on fields that have a doc-value index enabled. These range queries produce two-phase iterators that should match large contiguous ranges of doc IDs. Using `DenseConjunctionBulkScorer` helps skip these clauses from the conjunction on doc ID ranges that they fully match.
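
   For context, a two-phase iterator splits matching into a cheap approximation phase and a costlier per-document confirmation phase. A minimal self-contained sketch of that contract (the names here are hypothetical, not Lucene's actual API):

   ```java
   import java.util.function.IntPredicate;

   // Illustrative sketch of two-phase iteration: a cheap approximation
   // enumerates candidate doc IDs, and a costlier matches() check confirms
   // each candidate. Names are hypothetical, not Lucene's API surface.
   public class TwoPhaseSketch {
       // Counts candidates that survive the second phase.
       static int count(int[] approximation, IntPredicate matches) {
           int hits = 0;
           for (int doc : approximation) {
               if (matches.test(doc)) { // second, more expensive phase
                   hits++;
               }
           }
           return hits;
       }

       public static void main(String[] args) {
           int[] candidates = {1, 4, 7, 10};
           // Pretend the confirmation phase only accepts even doc IDs.
           System.out.println(count(candidates, doc -> doc % 2 == 0)); // prints 2
       }
   }
   ```

   The point of this PR, as described above, is that when such a clause fully matches a whole doc ID range, the confirmation check can be skipped across that range instead of being run per document.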


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]

2025-03-16 Thread via GitHub


iverase commented on code in PR #14358:
URL: https://github.com/apache/lucene/pull/14358#discussion_r1997607493


##
lucene/core/src/java/org/apache/lucene/util/packed/DirectMonotonicReader.java:
##
@@ -90,102 +140,142 @@ public static DirectMonotonicReader getInstance(Meta 
meta, RandomAccessInput dat
 
   /** Retrieves an instance from the specified slice. */
   public static DirectMonotonicReader getInstance(
-  Meta meta, RandomAccessInput data, boolean merging) throws IOException {
-final LongValues[] readers = new LongValues[meta.numBlocks];
-for (int i = 0; i < meta.numBlocks; ++i) {
-  if (meta.bpvs[i] == 0) {
-readers[i] = LongValues.ZEROES;
-  } else if (merging
-  && i < meta.numBlocks - 1 // we only know the number of values for 
the last block
-  && meta.blockShift >= DirectReader.MERGE_BUFFER_SHIFT) {
-readers[i] =
-DirectReader.getMergeInstance(
-data, meta.bpvs[i], meta.offsets[i], 1L << meta.blockShift);
+  Meta meta, RandomAccessInput data, boolean merging) {
+return meta.getInstance(data, merging);
+  }
+
+  private static final class SingleBlockMeta extends Meta {
+private final long min;
+private final float avg;
+private final byte bpv;
+private final long offset;
+
+private SingleBlockMeta(long min, float avg, byte bpv, long offset) {
+  this.min = min;
+  this.avg = avg;
+  this.bpv = bpv;
+  this.offset = offset;

Review Comment:
   This offset seems to always be zero, is that correct? Then we can get rid of 
it. 






[PR] Specialise DirectMonotonicReader when it only contains one block [lucene]

2025-03-16 Thread via GitHub


iverase opened a new pull request, #14358:
URL: https://github.com/apache/lucene/pull/14358

   While looking into some heap dumps, I noticed that among the DirectMonotonicReader.Meta objects held by segments, the single-value-block case is actually common. I wondered if we could specialize that case so we can create a faster, more lightweight single-block version.
   
   This commit makes DirectMonotonicReader a sealed abstraction with two possible implementations, one for multiple blocks and the other for a single block. The multi-block implementation is very similar to the current implementation, while the single-block one is more lightweight as it does not need to hold arrays. The only side effect is that DirectMonotonicReader is now bimorphic, but that should be OK.
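   
   For readers unfamiliar with the format: DirectMonotonicReader decodes a value as `min + (long) (avg * index) + delta[index]`, where each block stores a base, an average slope, and small per-index deltas. A self-contained sketch of that encoding, assuming a plain `long[]` for the deltas (the real code bit-packs them):

   ```java
   // Sketch of the monotonic encoding scheme: value(i) is reconstructed as
   // min + (long) (avg * i) + delta[i]. Illustrative only; the real reader
   // stores deltas bit-packed and splits long sequences into blocks.
   public class MonotonicBlockSketch {
       final long min;
       final float avg;
       final long[] deltas; // bit-packed in the real implementation

       MonotonicBlockSketch(long[] values) {
           int n = values.length;
           this.avg = n == 1 ? 0 : (float) (values[n - 1] - values[0]) / (n - 1);
           long mn = Long.MAX_VALUE;
           for (int i = 0; i < n; i++) {
               mn = Math.min(mn, values[i] - (long) (avg * i));
           }
           this.min = mn;
           this.deltas = new long[n];
           for (int i = 0; i < n; i++) {
               deltas[i] = values[i] - mn - (long) (avg * i);
           }
       }

       long get(int i) {
           return min + (long) (avg * i) + deltas[i];
       }

       public static void main(String[] args) {
           long[] values = {0, 10, 21, 35};
           MonotonicBlockSketch block = new MonotonicBlockSketch(values);
           for (int i = 0; i < values.length; i++) {
               System.out.println(block.get(i)); // round-trips the input values
           }
       }
   }
   ```

   The single-block specialization proposed here avoids holding per-block arrays of these parameters when there is only one block.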
   
   I ran luceneutil and didn't see much change:
   ```
   Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
   OrHighNotMed  168.35  (4.3%)  163.80  (4.9%)  -2.7%  ( -11% - 6%)  0.064
   HighTermTitleSort  36.96  (2.4%)  36.04  (4.2%)  -2.5%  ( -8% - 4%)  0.023
   PKLookup  105.78  (6.1%)  103.28  (8.4%)  -2.4%  ( -15% - 12%)  0.306
   HighTerm  228.90  (4.3%)  223.90  (5.3%)  -2.2%  ( -11% - 7%)  0.154
   OrHighLow  330.01  (3.9%)  324.79  (3.5%)  -1.6%  ( -8% - 6%)  0.181
   LowSpanNear  32.10  (6.5%)  31.61  (6.9%)  -1.5%  ( -13% - 12%)  0.469
   Wildcard  103.26  (5.3%)  101.68  (7.1%)  -1.5%  ( -13% - 11%)  0.443
   HighSpanNear  4.70  (4.7%)  4.63  (4.6%)  -1.5%  ( -10% - 8%)  0.309
   OrNotHighLow  451.15  (3.5%)  444.47  (4.0%)  -1.5%  ( -8% - 6%)  0.215
   HighSloppyPhrase  4.71  (4.8%)  4.64  (5.7%)  -1.4%  ( -11% - 9%)  0.415
   Fuzzy2  36.81  (4.8%)  36.31  (5.9%)  -1.4%  ( -11% - 9%)  0.423
   Fuzzy1  40.34  (2.9%)  39.81  (4.1%)  -1.3%  ( -8% - 5%)  0.233
   OrHighHigh  75.21  (5.2%)  74.22  (5.1%)  -1.3%  ( -11% - 9%)  0.418
   AndHighHigh  59.33  (5.0%)  58.57  (5.1%)  -1.3%  ( -10% - 9%)  0.426
   AndHighMed  131.70  (3.2%)  130.05  (4.1%)  -1.2%  ( -8% - 6%)  0.286
   range  2006.77  (7.0%)  1983.01  (8.5%)  -1.2%  ( -15% - 15%)  0.630
   MedTerm  333.14  (4.2%)  329.23  (5.2%)  -1.2%  ( -10% - 8%)  0.429
   LowIntervalsOrdered  6.16  (5.8%)  6.09  (6.5%)  -1.1%  ( -12% - 11%)  0.564
   OrHighNotHigh  206.93  (3.4%)  204.61  (4.4%)  -1.1%  ( -8% - 6%)  0.366
   MedSpanNear  15.36  (2.9%)  15.20  (3.7%)  -1.0%  ( -7% - 5%)  0.323
   OrHighMed  135.41  (3.2%)  134.25  (2.8%)  -0.9%  ( -6% - 5%)  0.373
   LowTerm  528.68  (4.1%)  524.30  (4.8%)  -0.8%  ( -9% - 8%)  0.556
   OrHighNotLow  238.88  (4.2%)  237.11  (4.7%)  -0.7%  ( -9% - 8%)  0.599
   MedSloppyPhrase  38.80  (2.7%)  38.55  (3.0%)  -0.6%  ( -6% - 5%)  0.483
   MedIntervalsOrdered  43.96  (9.3%)  43.68  (9.2%)  -0.6%  ( -17% - 19%)  0.831
   OrNotHighHigh  301.25  (2.5%)  299.46  (3.2%)  -0.6%  ( -6% - 5%)  0.520
   HighIntervalsOrdered  7.35  (6.1%)  7.31  (6.8%)  -0.5%  ( -12% - 13%)  0.792
   BrowseDayOfYearTaxoFacets  2.03  (13.2%)  2.02  (14.7%)  -0.4%  ( -25% - 31%)  0.920
   OrNotHighMed  229.85  (2.8%)  228.86  (4.2%)  -0.4%  ( -7% - 6%)  0.701
   BrowseDateTaxoFacets  2.01  (12.5%)  2.00  (15.0%)  -0.4%  ( -24% - 30%)  0.922
   MedPhrase  77.28  (2.9%)  77.10  (3.1%)  -0.2%  ( -6% - 5%)  0.799
   BrowseRandomLabelTaxoFacets  1.49  (12.2%)  1.49  (16.4%)  -0.2%  ( -25% - 32%)  0.960
   Respell  22.59  (3.9%)  22.55  (4.1%)  -0.1%  ( -7% - 8%)  0.908
   LowSloppyPhrase  76.40  (2.7%)  76.35  (3.2%)  -0.1%  ( -5% - 6%)  0.938
   AndHighLow  478.69  (5.3%)  479.27  (7.1%)  0.1%  ( -11% - 13%)  0.951
   ```

Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]

2025-03-16 Thread via GitHub


rmuir commented on PR #14360:
URL: https://github.com/apache/lucene/pull/14360#issuecomment-2727698038

   Here's the same Automaton, but via `toDot()` tossed into https://dreampuf.github.io/GraphvizOnline with all defaults. I guess I'm still a fan of that output style; I feel it is more readable.
   
   But maybe we can improve the styling here to make this more useful and save 
you that step.
   
   ![graphviz 
(4)](https://github.com/user-attachments/assets/2adf0a3a-99f4-4bee-b2f5-281e00ea1aa4)
   





Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]

2025-03-16 Thread via GitHub


rmuir commented on PR #14360:
URL: https://github.com/apache/lucene/pull/14360#issuecomment-2727699099

   Mermaid definitely doesn't handle infinite automata very well at all:
   ```mermaid
   stateDiagram
 direction LR
 classDef accept border-width:5px;stroke-width:5px,stroke:#00
 class 0 accept
 0 --> 0:K
   ```





Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]

2025-03-16 Thread via GitHub


gf2121 merged PR #14203:
URL: https://github.com/apache/lucene/pull/14203





Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]

2025-03-16 Thread via GitHub


jpountz commented on PR #14358:
URL: https://github.com/apache/lucene/pull/14358#issuecomment-2727467298

   I'm wary about adding all these micro-optimizations to reduce the 
per-segment per-field overhead. They hurt readability and may easily get lost 
over time when codecs get replaced with new codecs. Presumably, the motivation 
is to reduce heap usage when working with lots of shards / segments / fields? 
Could this be solved differently so that heap usage due to open indexes remains 
low by design rather than as a result of many micro-optimizations? E.g. by 
pooling index readers using a least-recently used policy, using a single Lucene 
field for multiple logical fields (like Elasticsearch's `flattened` does), 
merging segments more aggressively (e.g. by bumping the floor/min segment 
size), or a combination of all these things?





Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub


jpountz commented on PR #14357:
URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727461724

   +1, let's use `DisjunctionSumScorer` (which already supports two-phase iteration) when one of the clauses exposes a non-null two-phase iterator?





Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub


dsmiley commented on PR #14357:
URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727528269

   If one or more DISI has a high cost (irrespective of TPIs), thus matching 
many docs, I could see avoiding BS1 as well.
   
   An aside, if we are going to refer to these as BS1 vs BS2, they should have 
names more clearly reflecting this.





Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub


jpountz commented on PR #14357:
URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727629320

   In case you missed it, `BooleanScorer` recently got optimizations that make it hard for `DisjunctionScorer` to beat when clauses are `PostingsEnum`s:
- `DocIdSetIterator#intoBitSet` helps impls optimize loading matches into bit sets, in the case of `PostingsEnum` through (auto) vectorization (annotation [`HR`](https://benchmarks.mikemccandless.com/CountOrHighHigh.html))
- Dense blocks of postings are stored as bit sets, further speeding up OR-ing postings into a bit set ([annotations `HW` and `HX`](https://benchmarks.mikemccandless.com/CountOrHighHigh.html))





Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub


jpountz commented on PR #14357:
URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727625240

   > If one or more DISI has a high cost (irrespective of TPIs), thus matching 
many docs, I could see avoiding BS1 as well.
   
   I imagine that your idea is that if most of the cost comes from a single DISI, then the heap doesn't need to be reordered every time, so the heap-reordering overhead may be less than the bit-set overhead. `BooleanScorer` has an optimization for a similar situation: if a single clause matches on a window of 4,096 docs, it will skip the bit set as an intermediate representation of matches and feed this clause directly into the collector.
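   
   A toy sketch of that windowed fast path (illustrative only, not Lucene's actual code):

   ```java
   import java.util.BitSet;
   import java.util.function.IntConsumer;

   // Toy sketch of the window optimization described above: when exactly one
   // clause has candidates in the current window, feed its docs straight to
   // the collector and skip the shared bit set. Names are illustrative.
   public class WindowedDisjunction {
       static final int WINDOW = 4096; // window size mentioned above

       static void scoreWindow(int[][] clausesInWindow, IntConsumer collector) {
           if (clausesInWindow.length == 1) {
               // Fast path: a single matching clause, no intermediate bit set.
               for (int doc : clausesInWindow[0]) {
                   collector.accept(doc);
               }
               return;
           }
           // General path: OR every clause into a bit set, then iterate it.
           BitSet bits = new BitSet(WINDOW);
           for (int[] clause : clausesInWindow) {
               for (int doc : clause) {
                   bits.set(doc);
               }
           }
           bits.stream().forEach(collector);
       }
   }
   ```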
   
   > An aside, if we are going to refer to these as BS1 vs BS2, they should 
have names more clearly reflecting this.
   
   Agreed to stop using `BS1`/`BS2`. It's how Lucene's disjunction scorers were 
historically referred to and these names came naturally after reading your 
description message that abbreviated `BooleanScorer` to `BS`, but it doesn't 
properly reflect how Lucene scores disjunctions nowadays.





Re: [I] Multi-HNSW graphs per segment? [lucene]

2025-03-16 Thread via GitHub


navneet1v commented on issue #14341:
URL: https://github.com/apache/lucene/issues/14341#issuecomment-2727640413

   > What you primarily want in the referenced GH issue is the ability to 
filter on more metadata during traversal vs doing a pre filter on the candidate 
documents themselves. As Adrien pointed out, this is better solved with more 
efficient filtering approaches (I wonder if the recent work on ACORN would 
help?)
   > 
   > This issue is more towards the niche use cases where multiple graphs for a 
given segment would work. Multi threaded queries and disjoint cluster 
representation seem to be the immediate choices.
   
   What I have been thinking here is that if, at the segment level, we add the capability of having multiple graphs and make the process of generating those graphs generic, it will solve the problem I mentioned in the GitHub issue. Overall, how you split your docs into multiple clusters can be a choice for the user configuring the codec. If the user wants to use the k-means algorithm for clustering, they can use it; but if there are, say, other parameters in the document, those can be used for clustering/building different HNSW graphs.
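   
   The doc-partitioning half of that idea can be sketched as follows, where `clusterOf` is a hypothetical stand-in for k-means or a metadata-based assignment, and one graph would then be built per cluster:

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;
   import java.util.TreeMap;
   import java.util.function.IntUnaryOperator;

   // Toy sketch: partition a segment's docs by a user-chosen clustering
   // function; a (hypothetical) per-cluster HNSW graph would be built from
   // each resulting doc list. Illustrative only.
   public class PerClusterGraphs {
       static Map<Integer, List<Integer>> partition(int maxDoc, IntUnaryOperator clusterOf) {
           Map<Integer, List<Integer>> clusters = new TreeMap<>();
           for (int doc = 0; doc < maxDoc; doc++) {
               clusters.computeIfAbsent(clusterOf.applyAsInt(doc), k -> new ArrayList<>()).add(doc);
           }
           return clusters;
       }

       public static void main(String[] args) {
           // Doc ID parity stands in for a real clustering algorithm here.
           System.out.println(partition(6, doc -> doc % 2));
       }
   }
   ```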





Re: [I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]

2025-03-16 Thread via GitHub


navneet1v commented on issue #14348:
URL: https://github.com/apache/lucene/issues/14348#issuecomment-2727641512

   @viliam-durina if you have benchmarks that show the performance is better, it would be good to raise a PR. Once the PR is there, maintainers can also run more tests to see if that is really the case.
   
   cc'ing a few maintainers: @mikemccand, @benwtrent, @msokolov





Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub


dsmiley commented on PR #14357:
URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727499162

   Thanks for your confirmation of the problem.  The collect-per-clause is 
surprising to me; like what would benefit from that algorithm?  Wouldn't that 
_only_ be in fact _needed_ if scores are _needed_?  Since they need to be 
aggregated across the clauses on a matching document.  If so, this isn't really 
about TPIs, it's about the need for scores.





[PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]

2025-03-16 Thread via GitHub


rmuir opened a new pull request, #14360:
URL: https://github.com/apache/lucene/pull/14360

   Mermaid is a state-chart format that GitHub supports within fenced code blocks. For some reason it doesn't support dotty but instead the latest js tool; I'm sure in 2 months it will be a different tool.
   
   Beautification of any sort can happen at any time, this just makes it work correctly. Mermaid's state diagram really isn't the best for this, just look at how start/end states work there. So we just define a "class" that makes accept states blue. I'd rather have double-circle but I'm not a CSS guy.
   
   Closes #14351
   
   ```mermaid
   stateDiagram
 direction LR
 classDef accept border-width:5px;stroke-width:5px,stroke:#00
 0
 0 --> 13:\\U-k
 0 --> 1:l
 0 --> 13:m-t
 0 --> 19:u
 0 --> 13:v-\\U0010
 1
 1 --> 14:\\U-b
 1 --> 20:c
 1 --> 14:d-t
 1 --> 2:u
 1 --> 14:v-\\U0010
 2
 2 --> 15:\\U-b
 2 --> 3:c
 2 --> 15:d
 2 --> 21:e
 2 --> 15:f-\\U0010
 3
 3 --> 16:\\U-d
 3 --> 4:e
 3 --> 16:f-m
 3 --> 22:n
 3 --> 16:o-\\U0010
 4
 4 --> 17:\\U-d
 4 --> 23:e
 4 --> 17:f-m
 4 --> 5:n
 4 --> 17:o-\\U0010
 5
 class 5 accept
 5 --> 18:\\U-d
 5 --> 6:e
 5 --> 18:f-\\U0010
 6
 class 6 accept
 6 --> 12:\\U-\\U0010
 7
 7 --> 8:u
 8
 8 --> 9:c
 9
 9 --> 10:e
 10
 10 --> 11:n
 11
 11 --> 12:e
 12
 class 12 accept
 13
 13 --> 7:l
 13 --> 8:u
 14
 14 --> 9:c
 14 --> 8:u
 15
 15 --> 9:c
 15 --> 10:e
 16
 16 --> 10:e
 16 --> 11:n
 17
 17 --> 12:e
 17 --> 11:n
 18
 class 18 accept
 18 --> 12:e
 19
 19 --> 9:c
 19 --> 7:l
 19 --> 8:u
 20
 20 --> 9:c
 20 --> 10:e
 20 --> 8:u
 21
 21 --> 9:c
 21 --> 10:e
 21 --> 11:n
 22
 22 --> 24:e
 22 --> 11:n
 23
 class 23 accept
 23 --> 12:e
 23 --> 11:n
 24
 class 24 accept
 24 --> 11:n
   ```
   





Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub


jpountz commented on PR #14357:
URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727502419

   BS2 uses a heap to merge multiple `DocIdSetIterator`s. Unfortunately, reordering this heap on every call to `nextDoc()` or `advance(int)` is not completely free, and BS1's approach of loading all clauses into a shared bit set and then iterating the bits of this bit set proves to be faster.
   
   Presumably, when two-phase iterators are involved, the cost of reordering the heap wouldn't be a big deal compared to the cost of calling `TwoPhaseIterator#matches()`.
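   
   The BS1-style approach described above can be sketched with a toy example (illustrative only): every clause's doc IDs are OR'ed into a shared bit set, and the set bits are then iterated in doc ID order, with no per-doc heap reordering:

   ```java
   import java.util.Arrays;
   import java.util.BitSet;

   // Toy BS1-style disjunction: load all clauses into one bit set, then
   // iterate set bits. Not Lucene's actual code; clauses are modeled as
   // sorted int[] arrays instead of DocIdSetIterators.
   public class BitSetDisjunction {
       static int[] union(int maxDoc, int[][] clauses) {
           BitSet matches = new BitSet(maxDoc);
           for (int[] clause : clauses) {
               for (int doc : clause) {
                   matches.set(doc); // no heap reordering per doc
               }
           }
           return matches.stream().toArray(); // set bits, in doc ID order
       }

       public static void main(String[] args) {
           int[][] clauses = {{1, 5, 9}, {2, 5, 7}};
           System.out.println(Arrays.toString(union(16, clauses)));
           // prints [1, 2, 5, 7, 9]
       }
   }
   ```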





Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]

2025-03-16 Thread via GitHub


iverase commented on PR #14358:
URL: https://github.com/apache/lucene/pull/14358#issuecomment-2728327145

   I understand what you say, I will close this then.





Re: [PR] Reduce Lucene90DocValuesProducer memory footprint [lucene]

2025-03-16 Thread via GitHub


iverase commented on PR #14340:
URL: https://github.com/apache/lucene/pull/14340#issuecomment-2728329058

   See here 
https://github.com/apache/lucene/pull/14358#issuecomment-2727467298, I will 
close this.





Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub


dsmiley commented on PR #14357:
URL: https://github.com/apache/lucene/pull/14357#issuecomment-2728182572

   I could imagine improving BooleanScorer so that the TPI clauses are separated out and converted into a filter around the collector, to try to match docs *not* collected (i.e. test for docs in between those collected). All this logic could be separated into another class such that BooleanScorer stays nearly the same, or even identical if the caller were to compose it. DisjunctionDISIApproximation would be instrumental.





Re: [I] Multi-threaded vector search over multiple segments can lead to inconsistent results [lucene]

2025-03-16 Thread via GitHub


Zona-hu commented on issue #14180:
URL: https://github.com/apache/lucene/issues/14180#issuecomment-2728315199

   > ### Description
   > Related: [#14167](https://github.com/apache/lucene/pull/14167)
   > 
   > However, beyond multi-leaf collection (e.g. information sharing), multi-threaded search over multiple segments should also give consistent results at low values of `k`.
   > 
   > It is possible to get more consistent results, and probably eliminate most inconsistencies, simply by collecting more neighbors (`k` in the query, `fanout` in the lucene util tooling, or `efSearch` if you like..). However, even though HNSW search is approximate, we should strive for consistency.
   > 
   > ### Version and environment details
   > _No response_
   
   Hello, I filed this issue:
   https://github.com/apache/lucene/issues/14180
   Could you please let me know which future version of Elasticsearch will 
resolve the vector search consistency problem?





Re: [PR] Reduce Lucene90DocValuesProducer memory footprint [lucene]

2025-03-16 Thread via GitHub


iverase closed pull request #14340: Reduce Lucene90DocValuesProducer memory 
footprint
URL: https://github.com/apache/lucene/pull/14340





Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]

2025-03-16 Thread via GitHub


vigyasharma commented on PR #14325:
URL: https://github.com/apache/lucene/pull/14325#issuecomment-2728289419

   +1 to Adrien's comment, IndexDeletionPolicy can quite easily be implemented and configured by users in IndexWriterConfig. It is often configured outside of Lucene too, like the [CombinedDeletionPolicy](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/engine/CombinedDeletionPolicy.java) in OpenSearch and Elasticsearch.
   
   @DivyanshIITB Incidentally, I've recently been working on a segment replication optimization (#14219) that could benefit from keeping the last N commits in the index. Ah, I see you already linked this PR to that issue. I would suggest creating a separate IndexDeletionPolicy for this. You could also allow users to configure the `N` commits they want to retain instead of fixing it to 5.
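   
   The retention logic itself is simple; a self-contained sketch that mirrors the shape of the `IndexDeletionPolicy#onCommit` contract (which receives commits ordered oldest first). The names and the list-of-strings model are illustrative; real code would call `IndexCommit#delete()` on the commits to drop:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Toy keep-last-N retention policy. In Lucene you would subclass
   // IndexDeletionPolicy and delete commits in onInit/onCommit; here we
   // just compute which commits survive. Names are illustrative.
   public class KeepLastNSketch {
       static List<String> retain(List<String> commitsOldestFirst, int n) {
           int drop = Math.max(0, commitsOldestFirst.size() - n);
           // Everything before 'drop' would be deleted; the rest is kept.
           return new ArrayList<>(commitsOldestFirst.subList(drop, commitsOldestFirst.size()));
       }

       public static void main(String[] args) {
           List<String> commits = List.of("c1", "c2", "c3", "c4", "c5", "c6", "c7");
           System.out.println(retain(commits, 5)); // prints [c3, c4, c5, c6, c7]
       }
   }
   ```

   Making `n` a constructor argument of the policy is what would let users configure the number of retained commits instead of hard-coding 5.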





Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-03-16 Thread via GitHub


jpountz commented on code in PR #14203:
URL: https://github.com/apache/lucene/pull/14203#discussion_r1997535617


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PointsWriter.java:
##
@@ -105,15 +107,22 @@ public Lucene90PointsWriter(
 }
   }
 
+  public Lucene90PointsWriter(
+  SegmentWriteState writeState, int maxPointsInLeafNode, double 
maxMBSortInHeap)
+  throws IOException {
+this(writeState, maxPointsInLeafNode, maxMBSortInHeap, 
Lucene90PointsFormat.VERSION_CURRENT);
+  }

Review Comment:
   let's make all constructors that take a version pkg-private?



##
lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java:
##
@@ -139,6 +141,28 @@ public BKDWriter(
   BKDConfig config,
   double maxMBSortInHeap,
   long totalPointCount) {
+this(
+maxDoc,
+tempDir,
+tempFileNamePrefix,
+config,
+maxMBSortInHeap,
+totalPointCount,
+BKDWriter.VERSION_CURRENT);
+  }
+
+  public BKDWriter(

Review Comment:
   Can you add javadocs that this ctor should be only used for testing with 
older versions?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PointsFormat.java:
##
@@ -59,18 +61,39 @@ public final class Lucene90PointsFormat extends 
PointsFormat {
   public static final String META_EXTENSION = "kdm";
 
   static final int VERSION_START = 0;
-  static final int VERSION_CURRENT = VERSION_START;
+  static final int VERSION_BKD_VECTORIZED_BPV24 = 1;
+  static final int VERSION_CURRENT = VERSION_BKD_VECTORIZED_BPV24;
+
+  private static final Map<Integer, Integer> VERSION_TO_BKD_VERSION =
+  Map.of(
+  VERSION_START, BKDWriter.VERSION_META_FILE,
+  VERSION_BKD_VECTORIZED_BPV24, BKDWriter.VERSION_VECTORIZED_DOCID);
+
+  private final int version;
 
   /** Sole constructor */
-  public Lucene90PointsFormat() {}
+  public Lucene90PointsFormat() {
+this(VERSION_CURRENT);
+  }
+
+  public Lucene90PointsFormat(int version) {

Review Comment:
   Could it be pkg-private? I think we only need it for testing?



##
lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java:
##
@@ -114,31 +117,28 @@ void writeDocIds(int[] docIds, int start, int count, 
DataOutput out) throws IOEx
 } else {
   if (max <= 0xFF) {
 out.writeByte(BPV_24);
-// write them the same way we are reading them.
-int i;
-for (i = 0; i < count - 7; i += 8) {
-  int doc1 = docIds[start + i];
-  int doc2 = docIds[start + i + 1];
-  int doc3 = docIds[start + i + 2];
-  int doc4 = docIds[start + i + 3];
-  int doc5 = docIds[start + i + 4];
-  int doc6 = docIds[start + i + 5];
-  int doc7 = docIds[start + i + 6];
-  int doc8 = docIds[start + i + 7];
-  long l1 = (doc1 & 0xffffffL) << 40 | (doc2 & 0xffffffL) << 16 | ((doc3 >>> 8) & 0xffffL);
-  long l2 =
-  (doc3 & 0xffL) << 56
-  | (doc4 & 0xffffffL) << 32
-  | (doc5 & 0xffffffL) << 8
-  | ((doc6 >> 16) & 0xffL);
-  long l3 = (doc6 & 0xffffL) << 48 | (doc7 & 0xffffffL) << 24 | (doc8 & 0xffffffL);
-  out.writeLong(l1);
-  out.writeLong(l2);
-  out.writeLong(l3);
-}
-for (; i < count; ++i) {
-  out.writeShort((short) (docIds[start + i] >>> 8));
-  out.writeByte((byte) docIds[start + i]);
+if (version < BKDWriter.VERSION_VECTORIZED_DOCID) {
+  writeScalarInts24(docIds, start, count, out);
+} else {
+  // encode the docs in the format that can be vectorized decoded.
+  final int quarter = count >> 2;
+  final int numInts = quarter * 3;
+  for (int i = 0; i < numInts; i++) {
+scratch[i] = docIds[i + start] << 8;
+  }
+  for (int i = 0; i < quarter; i++) {
+final int longIdx = i + numInts + start;
+scratch[i] |= docIds[longIdx] >>> 16;
+scratch[i + quarter] |= (docIds[longIdx] >>> 8) & 0xFF;
+scratch[i + quarter * 2] |= docIds[longIdx] & 0xFF;

Review Comment:
   nit: maybe write bytes in little-endian order for consistency, so 
`scratch[i] |= docIds[longIdx] & 0xFF; scratch[i + quarter] |= (docIds[longIdx] 
>>> 8) & 0xFF; scratch[i + quarter * 2] |= docIds[longIdx] >>> 16;`, etc. ?
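
   The little-endian byte order suggested above can be checked with a plain-Java 
round trip. This is a sketch, not the PR's code: class and method names are made 
up, and the decoder is the inverse the suggested layout implies.

   ```java
   public class Pack24 {
     // Pack 4*q 24-bit doc IDs into 3*q ints. The first 3*q values occupy the
     // upper 24 bits of each int; the remaining q values are split byte-wise
     // across the low 8 bits, little-endian as suggested in the review comment.
     static int[] encode24(int[] docIds) {
       final int quarter = docIds.length >> 2;
       final int numInts = quarter * 3;
       int[] scratch = new int[numInts];
       for (int i = 0; i < numInts; i++) {
         scratch[i] = docIds[i] << 8;
       }
       for (int i = 0; i < quarter; i++) {
         final int d = docIds[numInts + i];
         scratch[i] |= d & 0xFF;                   // low byte
         scratch[i + quarter] |= (d >>> 8) & 0xFF; // middle byte
         scratch[i + quarter * 2] |= d >>> 16;     // high byte
       }
       return scratch;
     }

     static int[] decode24(int[] scratch, int count) {
       final int quarter = count >> 2;
       int[] docIds = new int[count];
       for (int i = 0; i < quarter * 3; i++) {
         docIds[i] = scratch[i] >>> 8; // upper 24 bits
       }
       for (int i = 0; i < quarter; i++) {
         docIds[quarter * 3 + i] =
             (scratch[i] & 0xFF)
                 | ((scratch[i + quarter] & 0xFF) << 8)
                 | ((scratch[i + quarter * 2] & 0xFF) << 16);
       }
       return docIds;
     }

     public static void main(String[] args) {
       int[] ids = {1, 2, 3, 4, 5, 6, 0xABCDEF, 0x123456};
       int[] roundTrip = decode24(encode24(ids), ids.length);
       System.out.println(java.util.Arrays.equals(ids, roundTrip)); // true
     }
   }
   ```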



##
lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java:
##
@@ -248,21 +281,68 @@ private void readBitSet(IndexInput in, int count, int[] 
docIDs) throws IOExcepti
 assert pos == count : "pos: " + pos + ", count: " + count;
   }
 
-  private static void readDelta16(IndexInput in, int count, int[] docIDs) 
throws IOException {
+  private static void readDelta16(IndexInput in, int count, int[] docIds) 
throws IOException {
 final int min = in.readVInt();
-final int halfLen = count >>> 1;
-in.readInts(docIDs, 0, halfLen);

Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-16 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r1997503104


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java:
##
@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class Trie {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILDREN_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILDREN_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+UNSAVED,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+private final int label;
+private final LinkedList<Node> children;
+private Output output;
+private long fp = -1;
+
+Node(int label, Output output, LinkedList<Node> children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.UNSAVED;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  Trie(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  void putAll(Trie trie) {
+if (status != Status.UNSAVED || trie.status != Status.UNSAVED) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+trie.status = Status.DESTROYED;
+putAll(this.root, trie.root);
+  }
+
+  private static void putAll(Node n, Node add) {
+assert n.label == add.label;
+if (add.output != null) {
+  n.output = add.output;

Review Comment:
   I rethought this and I agree we should have this check; added :)
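
   The check being discussed can be sketched with a simplified trie merge. This 
is an illustration, not the PR's code: `Node` and the `String` output are 
stand-ins for the quoted `Node`/`Output` types, and the duplicate-output check 
is the one agreed on above.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Minimal byte-label trie merge mirroring the recursive putAll above.
   public class TrieMerge {
     static class Node {
       final Map<Integer, Node> children = new HashMap<>();
       String output; // stand-in for the PR's Output record
     }

     // Merge `add` into `n`; fail fast if both tries define an output for
     // the same key instead of silently overwriting one of them.
     static void putAll(Node n, Node add) {
       if (add.output != null) {
         if (n.output != null) {
           throw new IllegalStateException("duplicate output for the same term");
         }
         n.output = add.output;
       }
       for (Map.Entry<Integer, Node> e : add.children.entrySet()) {
         Node child = n.children.computeIfAbsent(e.getKey(), k -> new Node());
         putAll(child, e.getValue());
       }
     }

     static void put(Node root, byte[] key, String output) {
       Node node = root;
       for (byte b : key) {
         node = node.children.computeIfAbsent(b & 0xFF, k -> new Node());
       }
       node.output = output;
     }

     public static void main(String[] args) {
       Node a = new Node();
       put(a, new byte[] {1, 2}, "A");
       Node b = new Node();
       put(b, new byte[] {1, 3}, "B");
       putAll(a, b); // disjoint keys merge cleanly
       System.out.println(a.children.get(1).children.get(3).output); // B
     }
   }
   ```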





Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]

2025-03-16 Thread via GitHub


gf2121 commented on code in PR #14203:
URL: https://github.com/apache/lucene/pull/14203#discussion_r1997571999


##
lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java:
##
@@ -248,21 +281,68 @@ private void readBitSet(IndexInput in, int count, int[] 
docIDs) throws IOExcepti
 assert pos == count : "pos: " + pos + ", count: " + count;
   }
 
-  private static void readDelta16(IndexInput in, int count, int[] docIDs) 
throws IOException {
+  private static void readDelta16(IndexInput in, int count, int[] docIds) 
throws IOException {
 final int min = in.readVInt();
-final int halfLen = count >>> 1;
-in.readInts(docIDs, 0, halfLen);
-for (int i = 0; i < halfLen; ++i) {
-  int l = docIDs[i];
+final int half = count >> 1;
+in.readInts(docIds, 0, half);
+if (count == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE) {
+  // Same format, but enabling the JVM to specialize the decoding logic 
for the default number
+  // of points per node proved to help on benchmarks
+  decode16(docIds, BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE / 2, min);
+  assert half * 2 == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE
+  : "we are assuming DEFAULT_MAX_POINTS_IN_LEAF_NODE is a multiple of 
2 here.";
+} else {
+  decode16(docIds, half, min);
+  for (int i = half << 1; i < count; i++) {
+docIds[i] = Short.toUnsignedInt(in.readShort()) + min;
+  }
+}
+  }
+
+  private static void decode16(int[] docIDs, int half, int min) {
+for (int i = 0; i < half; ++i) {
+  final int l = docIDs[i];
   docIDs[i] = (l >>> 16) + min;
-  docIDs[halfLen + i] = (l & 0xffff) + min;
+  docIDs[i + half] = (l & 0xffff) + min;
 }
-if ((count & 1) == 1) {
-  docIDs[count - 1] = Short.toUnsignedInt(in.readShort()) + min;
+  }
+
+  private void readInts24(IndexInput in, int count, int[] docIDs) throws 
IOException {
+int quarter = count >> 2;
+int numInts = quarter * 3;
+in.readInts(scratch, 0, numInts);
+if (count == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE) {
+  // Same format, but enabling the JVM to specialize the decoding logic 
for the default number
+  // of points per node proved to help on benchmarks
+  decode24(
+  docIDs,
+  scratch,
+  BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE / 4,
+  BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE / 4 * 3);
+  assert quarter * 4 == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE
+  : " we are assuming DEFAULT_MAX_POINTS_IN_LEAF_NODE is a multiple of 
4 here.";
+} else {
+  decode24(docIDs, scratch, quarter, numInts);
+  // Now read the remaining 0, 1, 2 or 3 values
+  for (int i = quarter << 2; i < count; ++i) {
+docIDs[i] = (in.readShort() & 0xffff) | (in.readByte() & 0xFF) << 16;
+  }

Review Comment:
   I want to keep `decode24` small, so I put the assertion under the `if`/`else` 
block instead. luceneutil and JMH showed similar performance.
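
   For reference, the delta-16 layout that `decode16` above consumes round-trips 
like this in plain Java. The encoder here is a sketch mirroring what 
`writeDelta16` would produce, assuming an even `count`; class and method names 
are illustrative.

   ```java
   public class Delta16 {
     // Pack doc IDs (even count for simplicity) as 16-bit deltas from `min`,
     // two per int: ids[i] in the high half, ids[i + half] in the low half --
     // exactly the layout decode16 expects.
     static int[] encode(int[] ids, int min) {
       final int half = ids.length >> 1;
       int[] packed = new int[half];
       for (int i = 0; i < half; i++) {
         packed[i] = ((ids[i] - min) << 16) | ((ids[i + half] - min) & 0xFFFF);
       }
       return packed;
     }

     static int[] decode(int[] packed, int min) {
       final int half = packed.length;
       int[] ids = new int[half * 2];
       for (int i = 0; i < half; i++) {
         final int l = packed[i];
         ids[i] = (l >>> 16) + min;        // high half -> first positions
         ids[i + half] = (l & 0xFFFF) + min; // low half -> second positions
       }
       return ids;
     }

     public static void main(String[] args) {
       int[] ids = {100, 101, 200, 300};
       int[] out = decode(encode(ids, 100), 100);
       System.out.println(java.util.Arrays.toString(out)); // [100, 101, 200, 300]
     }
   }
   ```

   Splitting first/second halves across the high/low 16 bits (rather than 
interleaving pairs) is what keeps the decode loop a straight-line, 
auto-vectorizable pass over the array.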


