Re: [PR] nocommit: demonstrate how a minor change in IndexSearcher can have an inexplicable performance impact [lucene]

2024-08-15 Thread via GitHub


epotyom commented on PR #13657:
URL: https://github.com/apache/lucene/pull/13657#issuecomment-2290926100

   Interesting... Thanks for isolating the change that causes the regression 
Greg!
   
   > impacted tasks are possibly taking the `leafSlices.length == 0` condition 
branch and avoiding the need to create the collectors list at all?
   
   I don't think it is that - if I understand correctly `leafSlices.length == 
0` only if there are no segments at all - which shouldn't be the case for 
benchmark tests we are running. Just in case I added a commit on top of this PR 
to never create the ArrayList for collectors if the length is zero: 
https://github.com/apache/lucene/commit/e30b76d77926d5f68d70f8b1d3b7129a73b76942
 , and it also shows regression:
   
   ```
   TaskQPS baseline  StdDevQPS 
my_modified_version  StdDevPct diff p-value
   HighTerm  525.44  (3.5%)  441.68  
(3.7%)  -15.9% ( -22% -   -9%) 0.000
   HighSloppyPhrase   33.23  (2.7%)   28.13  
(1.4%)  -15.3% ( -18% -  -11%) 0.000
MedTerm  479.40  (4.6%)  410.71  
(4.4%)  -14.3% ( -22% -   -5%) 0.000
LowTerm  433.49  (3.1%)  387.45  
(3.1%)  -10.6% ( -16% -   -4%) 0.000
  BrowseMonthTaxoFacets1.84  (9.0%)1.70  
(8.8%)   -7.9% ( -23% -   10%) 0.160
BrowseRandomLabelTaxoFacets1.37 (12.4%)1.26  
(0.5%)   -7.6% ( -18% -6%) 0.174
   HighSpanNear   31.21  (1.7%)   28.97  
(2.8%)   -7.2% ( -11% -   -2%) 0.000
   ...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Optimize decoding blocks of postings using the vector API. [lucene]

2024-08-15 Thread via GitHub


jpountz commented on PR #13636:
URL: https://github.com/apache/lucene/pull/13636#issuecomment-2291136659

   A few queries got a small-ish speedup, e.g.
- 
[AndMedOrHighHigh](https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html)
 +2.5%
- 
[AndHighHigh](https://home.apache.org/~mikemccand/lucenebench/AndHighHigh.html) 
+1.5%
- 
[CountAndHighMed](https://home.apache.org/~mikemccand/lucenebench/CountAndHighMed.html)
 +1.5%
- 
[AndHighMed](https://home.apache.org/~mikemccand/lucenebench/AndHighMed.html) 
+1%


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Speed up prefix sums when decoding doc IDs. [lucene]

2024-08-15 Thread via GitHub


jpountz opened a new pull request, #13658:
URL: https://github.com/apache/lucene/pull/13658

   This updates file formats to compute prefix sums by summing up 8 deltas per 
long at the same time if the number of bits per value is 4 or less, and 4 
deltas per long at the same time if the number of bits per value is between 5 
included and 11 included. Otherwise, we keep summing up 2 deltas per long like 
today.
   
   The `PostingDecodingUtil` was slightly modified due to the fact that more 
numbers of bits per value now need to apply different shifts to the input data. 
E.g. now that we store integers that require 5 bits per value as 16-bit 
integers under the hood rather than 8, we extract the first values by shifting 
by 16-5=11, 16-2*5=6 and 16-3*5=1 and then decode tail values from the 
remaining bit per 16-bit integer.
   
   Micro benchmarks suggest a noticeable speedup for prefix sums and bits per 
value 2, 3, 4 when we now sum up 8 values at once instead of 2, and a minor 
speedup otherwise. For reference, wikibigall is frequently using 2, 3 or 4 bits 
per value on blocks of doc deltas of stop words ("the", "a", "1", etc.)
   
   Before (without enabling the vector module):
   
   ```
   Benchmark  (bpv)   Mode  Cnt   Score   
Error   Units
   PostingIndexInputBenchmark.decode  2  thrpt   15  55,166 ± 
0,480  ops/us
   PostingIndexInputBenchmark.decode  3  thrpt   15  51,257 ± 
0,086  ops/us
   PostingIndexInputBenchmark.decode  4  thrpt   15  55,769 ± 
0,271  ops/us
   PostingIndexInputBenchmark.decode  5  thrpt   15  50,789 ± 
0,062  ops/us
   PostingIndexInputBenchmark.decode  6  thrpt   15  49,804 ± 
0,187  ops/us
   PostingIndexInputBenchmark.decode  7  thrpt   15  47,552 ± 
0,522  ops/us
   PostingIndexInputBenchmark.decode  8  thrpt   15  61,442 ± 
0,330  ops/us
   PostingIndexInputBenchmark.decode  9  thrpt   15  40,030 ± 
0,084  ops/us
   PostingIndexInputBenchmark.decode 10  thrpt   15  41,480 ± 
0,458  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  2  thrpt   15  20,808 ± 
0,063  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  3  thrpt   15  20,181 ± 
0,184  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  4  thrpt   15  20,931 ± 
0,141  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  5  thrpt   15  20,195 ± 
0,118  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  6  thrpt   15  20,304 ± 
0,185  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  7  thrpt   15  19,334 ± 
0,204  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  8  thrpt   15  21,579 ± 
0,071  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  9  thrpt   15  17,726 ± 
0,253  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum 10  thrpt   15  18,021 ± 
0,267  ops/us
   ```
   
   After (without enabling the vector module):
   
   ```
   Benchmark  (bpv)   Mode  Cnt   Score   
Error   Units
   PostingIndexInputBenchmark.decode  2  thrpt   15  57,723 ± 
0,501  ops/us
   PostingIndexInputBenchmark.decode  3  thrpt   15  52,526 ± 
0,255  ops/us
   PostingIndexInputBenchmark.decode  4  thrpt   15  57,395 ± 
0,424  ops/us
   PostingIndexInputBenchmark.decode  5  thrpt   15  50,513 ± 
0,076  ops/us
   PostingIndexInputBenchmark.decode  6  thrpt   15  47,176 ± 
0,146  ops/us
   PostingIndexInputBenchmark.decode  7  thrpt   15  44,838 ± 
0,138  ops/us
   PostingIndexInputBenchmark.decode  8  thrpt   15  61,604 ± 
0,262  ops/us
   PostingIndexInputBenchmark.decode  9  thrpt   15  37,737 ± 
0,057  ops/us
   PostingIndexInputBenchmark.decode 10  thrpt   15  37,079 ± 
0,364  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  2  thrpt   15  30,823 ± 
0,052  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  3  thrpt   15  28,148 ± 
0,155  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  4  thrpt   15  30,545 ± 
0,059  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  5  thrpt   15  22,332 ± 
0,100  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  6  thrpt   15  22,240 ± 
0,029  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  7  thrpt   15  21,809 ± 
0,038  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  8  thrpt   15  23,279 ± 
0,376  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum  9  thrpt   15  21,624 ± 
0,073  ops/us
   PostingIndexInputBenchmark.decodeAndPrefixSum 10  thrpt   15  21,348 ± 
0,033  ops/us
   ```
   
   After (with enabling the vector module - Apple M3 - preferredBitSize=128):
   ```
   Benchmark  (bpv)   Mode  Cnt   Score   
Error  

Re: [PR] Slightly speed up decoding blocks of postings/freqs/positions. [lucene]

2024-08-15 Thread via GitHub


jpountz commented on PR #13631:
URL: https://github.com/apache/lucene/pull/13631#issuecomment-2291233215

   @gsmiller Rebasing this PR proved a bit challenging after #13636 got merged, 
so I ended up creating a new one that only speeds up prefix sums without 
disabling slower numbers of bits per value: #13658. I plan on looking into 
disabling the slower bits per value in a different PR if it still proves 
appealing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] TestHnswFloatVectorGraph.testRandomReadWriteAndMerge fails with java.lang.IndexOutOfBoundsException [lucene]

2024-08-15 Thread via GitHub


ChrisHegarty opened a new issue, #13659:
URL: https://github.com/apache/lucene/issues/13659

   ### Description
   
   ```
   TestHnswFloatVectorGraph > testRandomReadWriteAndMerge FAILED
   java.lang.IndexOutOfBoundsException: Index 2147483647 out of bounds for 
length 18
   at 
__randomizedtesting.SeedInfo.seed([11F871BAF8E02ADC:343906638A98C7FC]:0)
   at 
java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:100)
   at 
java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
   at 
java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
   at java.base/java.util.Objects.checkIndex(Objects.java:385)
   at java.base/java.util.ArrayList.get(ArrayList.java:427)
   at 
org.apache.lucene.util.hnsw.RandomAccessVectorValues$1.vectorValue(RandomAccessVectorValues.java:135)
   at 
org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1.score(DefaultFlatVectorScorer.java:142)
   at 
org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel(HnswGraphSearcher.java:206)
   at 
org.apache.lucene.util.hnsw.HnswGraphBuilder.connectComponents(HnswGraphBuilder.java:462)
   at 
org.apache.lucene.util.hnsw.HnswGraphBuilder.connectComponents(HnswGraphBuilder.java:423)
   at 
org.apache.lucene.util.hnsw.HnswGraphBuilder.finish(HnswGraphBuilder.java:416)
   at 
org.apache.lucene.util.hnsw.HnswGraphBuilder.getCompletedGraph(HnswGraphBuilder.java:182)
   at 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter$FieldWriter.getGraph(Lucene99HnswVectorsWriter.java:618)
   at 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter.writeField(Lucene99HnswVectorsWriter.java:187)
   at 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter.flush(Lucene99HnswVectorsWriter.java:149)
   at 
org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.flush(PerFieldKnnVectorsFormat.java:115)
   at 
org.apache.lucene.index.VectorValuesConsumer.flush(VectorValuesConsumer.java:76)
   at 
org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:305)
   at 
org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:456)
   at 
org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:502)
   at 
org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:456)
   at 
org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:649)
   at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3674)
   at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4122)
   at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4084)
   at 
org.apache.lucene.util.hnsw.HnswGraphTestCase.testRandomReadWriteAndMerge(HnswGraphTestCase.java:189)
   at 
org.apache.lucene.util.hnsw.TestHnswFloatVectorGraph.testRandomReadWriteAndMerge(TestHnswFloatVectorGraph.java:40)
   ...
   ```
   
   ### Gradle command to reproduce
   
   ./gradlew :lucene:core:test --tests 
"org.apache.lucene.util.hnsw.TestHnswFloatVectorGraph.testRandomReadWriteAndMerge"
 -Ptests.jvms=6 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC 
-XX:ActiveProcessorCount=1" -Ptests.seed=11F871BAF8E02ADC -Ptests.locale=hy-AM 
-Ptests.timezone=America/Vancouver -Ptests.gui=false 
-Ptests.file.encoding=UTF-8 -Ptests.vectorsize=128 
-Ptests.forceintegervectors=true


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestHnswFloatVectorGraph.testReadWrite fails on branch 9x [lucene]

2024-08-15 Thread via GitHub


msokolov closed issue #13653: TestHnswFloatVectorGraph.testReadWrite fails on 
branch 9x
URL: https://github.com/apache/lucene/issues/13653


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestHnswFloatVectorGraph.testReadWrite fails on branch 9x [lucene]

2024-08-15 Thread via GitHub


msokolov commented on issue #13653:
URL: https://github.com/apache/lucene/issues/13653#issuecomment-2291316793

   OK I'll mark this fixed now since we seem to have gone ~24h without any more 
failures from automated builds that I can see


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Expose FlatVectorsFormat [lucene]

2024-08-15 Thread via GitHub


msokolov commented on PR #13469:
URL: https://github.com/apache/lucene/pull/13469#issuecomment-2291338923

   Hi @navneet1v, I think you may be right although TBH I find the way Lucene 
handles formats with SPI to be somewhat confusing.  We're using this in our own 
service with a custom Codec that seems to not require the  META-INF/SPI lookup, 
bnut maybe there is some shortcoming to that approach


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] nocommit: demonstrate how a minor change in IndexSearcher can have an inexplicable performance impact [lucene]

2024-08-15 Thread via GitHub


gsmiller commented on PR #13657:
URL: https://github.com/apache/lucene/pull/13657#issuecomment-2291353795

   I got an even tighter isolation on the regressing change but still can't 
understand why it could be happening. It seems to have to do with 
whether-or-not the collectors list gets created before or after calling 
`createWeight`. So this order _does not_ cause any regression:
   ```
   final Weight weight = createWeight(query, firstCollector.scoreMode(), 1);
   final List collectors = new ArrayList<>(leafSlices.length);
   ```
   
   While this does:
   ```
   final List collectors = new ArrayList<>(leafSlices.length);
   final Weight weight = createWeight(query, firstCollector.scoreMode(), 1);
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Revert changes to IndexSearcher brought in by GH#13568 [lucene]

2024-08-15 Thread via GitHub


gsmiller commented on PR #13656:
URL: https://github.com/apache/lucene/pull/13656#issuecomment-2291395658

   Investigation continues on how #13568 could possibly have caused a 
regression to various nightly benchmark tasks, but I've confirmed that this 
patch fixed the issues (see 
[Term](https://home.apache.org/~mikemccand/lucenebench/Term.html), 
[SloppyPhrase](https://home.apache.org/~mikemccand/lucenebench/SloppyPhrase.html),
 [SpanNear](https://home.apache.org/~mikemccand/lucenebench/SpanNear.html), 
[CombinedHighMed](https://home.apache.org/~mikemccand/lucenebench/CombinedHighMed.html)).
   
   I'll back port this patch to 9x to make sure it doesn't get forgotten and 
we'll keep working on a better solution as we understand what's going on better 
(see conversation in #13657).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]

2024-08-15 Thread via GitHub


msokolov commented on issue #13565:
URL: https://github.com/apache/lucene/issues/13565#issuecomment-2291406018

   OK I'm beginning to see that it would be easier to start with a tool that 
rewrites the index as you did with the `BPIndexReorderer` -- integrating this 
thing as a new index format or as a new kind of IndexSort introduces all sorts 
of other complications that are not needed if we want to just get an idea of 
what the impact of such a change might be.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Revert changes to IndexSearcher brought in by GH#13568 [lucene]

2024-08-15 Thread via GitHub


mikemccand commented on PR #13656:
URL: https://github.com/apache/lucene/pull/13656#issuecomment-2291583847

   Thanks @gsmiller -- looks like this did indeed fix the nightly benchy e.g. 
[CombinedTerm](https://home.apache.org/~mikemccand/lucenebench/CombinedTerm.html),
 [TermQuery](https://home.apache.org/~mikemccand/lucenebench/Term.html), 
[SloppyPhrase](https://home.apache.org/~mikemccand/lucenebench/SloppyPhrase.html).
  But I'm still baffled why this hurt performance so much...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Revert changes to IndexSearcher brought in by GH#13568 [lucene]

2024-08-15 Thread via GitHub


gsmiller commented on PR #13656:
URL: https://github.com/apache/lucene/pull/13656#issuecomment-2291752889

   Totally @mikemccand . Have you followed what @epotyom and I found so far in 
#13657? It's wild. Just moving the initialization of the array list before or 
after the call to createWeight is what's making the difference. All I can think 
is some sort of jvm optimization is being disabled because of the change since 
I can't think of any other relationship here. We're going to do a little more 
debugging but I'll probably send out an email to the dev@ list tomorrow to see 
if anyone with more jvm expertise might be able to shed some light on what's 
going on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestHnswFloatVectorGraph.testRandomReadWriteAndMerge fails with java.lang.IndexOutOfBoundsException [lucene]

2024-08-15 Thread via GitHub


benwtrent commented on issue #13659:
URL: https://github.com/apache/lucene/issues/13659#issuecomment-2291778869

   Yeah, gitbisect indicate `217828736c4 - gh-12627: HnswGraphBuilder connects 
disconnected HNSW graph components (#13566) (7 days ago) `
   
   I am not 100% sure whats up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestHnswFloatVectorGraph.testRandomReadWriteAndMerge fails with java.lang.IndexOutOfBoundsException [lucene]

2024-08-15 Thread via GitHub


benwtrent commented on issue #13659:
URL: https://github.com/apache/lucene/issues/13659#issuecomment-2291853791

   From what I can tell:
   
- `components` will return  a component where the entry_point is 
`NO_MORE_DOCS`
- This occurs because the `notFullyConnected.nextSetBit(0);` returns 
`NO_MORE_DOCS`
- When checking for connectedness, we look at `size` not entry point
```
 // connect other components to the largest one
 Component c0 = 
components.stream().max(Comparator.comparingInt(Component::size)).get();
   ```
   
   I am not sure what the behavior should be here, but it seems to me that a 
component whose entry point is `NO_MORE_DOCS` shouldn't get added to the 
components list.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Expose FlatVectorsFormat [lucene]

2024-08-15 Thread via GitHub


navneet1v commented on PR #13469:
URL: https://github.com/apache/lucene/pull/13469#issuecomment-2291858889

   Hi @msokolov thanks for sharing your thoughts.
   
   > I find the way Lucene handles formats with SPI to be somewhat confusing.
   
   +1. but I always feel my experience with SPI is low hence I might feel that. 
   
   
   > We're using this in our own service with a custom Codec that seems to not 
require the META-INF/SPI lookup, bnut maybe there is some shortcoming to that 
approach
   
   Would like to know how you are doing it if there is some reference. Because 
I am also using a custom codec. The one way I figured out to use this is I 
create my own KNNVectorsFormat and then used the FlatWriters and Readers 
directly. Let me know if there is any better way. But this is best I can think 
of.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestHnswFloatVectorGraph.testRandomReadWriteAndMerge fails with java.lang.IndexOutOfBoundsException [lucene]

2024-08-15 Thread via GitHub


benwtrent commented on issue #13659:
URL: https://github.com/apache/lucene/issues/13659#issuecomment-2291899075

   @msokolov ^ what say you? It seems that a `NO_MORE_DOCS` indicates that the 
`notFullyConnected` just didn't ever get selected. Then later, when iterating 
the components 
   
   ```
   RandomVectorScorer scorer = scorerSupplier.scorer(c.start());
   ```
   
   And `start()` is `NO_MORE_DOCS`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Only attempt to connect components when entry point is valid [lucene]

2024-08-15 Thread via GitHub


benwtrent opened a new pull request, #13660:
URL: https://github.com/apache/lucene/pull/13660

   We should only connect a component if its entry point is valid.
   
   closes: https://github.com/apache/lucene/issues/13659


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]

2024-08-15 Thread via GitHub


gsmiller commented on code in PR #13658:
URL: https://github.com/apache/lucene/pull/13658#discussion_r1718827005


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForDeltaUtil.java:
##
@@ -62,22 +256,374 @@ void encodeDeltas(long[] longs, DataOutput out) throws 
IOException {
   assert or != 0;
   final int bitsPerValue = PackedInts.bitsRequired(or);
   out.writeByte((byte) bitsPerValue);
-  forUtil.encode(longs, bitsPerValue, out);
+  encodeDeltas(longs, bitsPerValue, out);
+}
+  }
+
+  private void encodeDeltas(long[] longs, int bitsPerValue, DataOutput out) 
throws IOException {
+final int nextPrimitive;
+final int numLongs;
+if (bitsPerValue <= 4) {
+  nextPrimitive = 8;
+  numLongs = BLOCK_SIZE / 8;
+  collapse8(longs);
+} else if (bitsPerValue <= 11) {
+  nextPrimitive = 16;
+  numLongs = BLOCK_SIZE / 4;
+  collapse16(longs);
+} else {
+  nextPrimitive = 32;
+  numLongs = BLOCK_SIZE / 2;
+  collapse32(longs);
+}
+
+final int numLongsPerShift = bitsPerValue * 2;
+int idx = 0;
+int shift = nextPrimitive - bitsPerValue;
+for (int i = 0; i < numLongsPerShift; ++i) {
+  tmp[i] = longs[idx++] << shift;
+}
+for (shift = shift - bitsPerValue; shift >= 0; shift -= bitsPerValue) {
+  for (int i = 0; i < numLongsPerShift; ++i) {
+tmp[i] |= longs[idx++] << shift;
+  }
+}
+
+final int remainingBitsPerLong = shift + bitsPerValue;
+final long maskRemainingBitsPerLong;
+if (nextPrimitive == 8) {
+  maskRemainingBitsPerLong = MASKS8[remainingBitsPerLong];
+} else if (nextPrimitive == 16) {
+  maskRemainingBitsPerLong = MASKS16[remainingBitsPerLong];
+} else {
+  maskRemainingBitsPerLong = MASKS32[remainingBitsPerLong];
+}
+
+int tmpIdx = 0;
+int remainingBitsPerValue = bitsPerValue;
+while (idx < numLongs) {
+  if (remainingBitsPerValue >= remainingBitsPerLong) {
+remainingBitsPerValue -= remainingBitsPerLong;
+tmp[tmpIdx++] |= (longs[idx] >>> remainingBitsPerValue) & 
maskRemainingBitsPerLong;
+if (remainingBitsPerValue == 0) {
+  idx++;
+  remainingBitsPerValue = bitsPerValue;
+}
+  } else {
+final long mask1, mask2;
+if (nextPrimitive == 8) {
+  mask1 = MASKS8[remainingBitsPerValue];
+  mask2 = MASKS8[remainingBitsPerLong - remainingBitsPerValue];
+} else if (nextPrimitive == 16) {
+  mask1 = MASKS16[remainingBitsPerValue];
+  mask2 = MASKS16[remainingBitsPerLong - remainingBitsPerValue];
+} else {
+  mask1 = MASKS32[remainingBitsPerValue];
+  mask2 = MASKS32[remainingBitsPerLong - remainingBitsPerValue];
+}
+tmp[tmpIdx] |= (longs[idx++] & mask1) << (remainingBitsPerLong - 
remainingBitsPerValue);
+remainingBitsPerValue = bitsPerValue - remainingBitsPerLong + 
remainingBitsPerValue;
+tmp[tmpIdx++] |= (longs[idx] >>> remainingBitsPerValue) & mask2;
+  }
+}
+
+for (int i = 0; i < numLongsPerShift; ++i) {
+  out.writeLong(tmp[i]);
+}
+  }
+
+  /** Number of bytes required to encode 128 integers of {@code bitsPerValue} 
bits per value. */
+  int numBytes(int bitsPerValue) {
+return bitsPerValue << (BLOCK_SIZE_LOG2 - 3);
+  }
+
+  private static void decodeSlow(

Review Comment:
   And one more method that I think is duplicated from `ForUtil`?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForDeltaUtil.java:
##
@@ -62,22 +256,374 @@ void encodeDeltas(long[] longs, DataOutput out) throws 
IOException {
   assert or != 0;
   final int bitsPerValue = PackedInts.bitsRequired(or);
   out.writeByte((byte) bitsPerValue);
-  forUtil.encode(longs, bitsPerValue, out);
+  encodeDeltas(longs, bitsPerValue, out);
+}
+  }
+
+  private void encodeDeltas(long[] longs, int bitsPerValue, DataOutput out) 
throws IOException {

Review Comment:
   I think this is identical to `ForDelta#encode` right? Can we leverage that 
instead of duplicating it? I guess the issue is that `ForUtil` relies on its 
`tmp` buffer so the method isn't static, but we could split out a helper static 
method that takes the buffer as input that could be common I think.



##
lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultPostingDecodingUtil.java:
##
@@ -29,13 +29,14 @@ public DefaultPostingDecodingUtil(IndexInput in) {
 
   @Override
   public void splitLongs(

Review Comment:
   What if we add the "slow" implementation as a static method in 
`PostingDecodingUtil` (as sketched out below) and then delegate to it here and 
in the "slow path" in the vectorized implementation? That we can avoid the code 
duplication.
   
   ```
 protected static void splitLongsSlow(IndexInput in, int count, long[] b, 
int bShift, int dec, long bMask, long[] c, int cIndex, long cMask) throws 
IOEx

Re: [I] TestHnswFloatVectorGraph.testRandomReadWriteAndMerge fails with java.lang.IndexOutOfBoundsException [lucene]

2024-08-15 Thread via GitHub


msokolov commented on issue #13659:
URL: https://github.com/apache/lucene/issues/13659#issuecomment-2292058048

   Oh, the gift that keeps on giving! Your solution seems reasonable. I mean it 
would be nice if we didn't generate these degenerate Component in the first 
place?  But this will work


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Only attempt to connect components when entry point is valid [lucene]

2024-08-15 Thread via GitHub


msokolov commented on code in PR #13660:
URL: https://github.com/apache/lucene/pull/13660#discussion_r1718901258


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -456,6 +456,9 @@ private boolean connectComponents(int level) throws 
IOException {
 if (c != c0) {
   beam.clear();
   eps[0] = c0.start();
+  if (c.start() == NO_MORE_DOCS) {

Review Comment:
   nit: we could move this to the beginning of the loop



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Only attempt to connect components when entry point is valid [lucene]

2024-08-15 Thread via GitHub


msokolov commented on code in PR #13660:
URL: https://github.com/apache/lucene/pull/13660#discussion_r1718901258


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -456,6 +456,9 @@ private boolean connectComponents(int level) throws 
IOException {
 if (c != c0) {
   beam.clear();
   eps[0] = c0.start();
+  if (c.start() == NO_MORE_DOCS) {

Review Comment:
   nit: we could move this to the beginning of the block



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Upgrade spotless to 6.9.1, google java format to 1.23.0. [lucene]

2024-08-15 Thread via GitHub


dweiss opened a new pull request, #13661:
URL: https://github.com/apache/lucene/pull/13661

   This is a trivial bump of plugin dependencies. One commit updates versions, 
another is a tidy run on the newer spotless/ google java format (mostly 
indentation within comments of switch statements).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]

2024-08-15 Thread via GitHub


jpountz commented on code in PR #13658:
URL: https://github.com/apache/lucene/pull/13658#discussion_r1718950733


##
lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultPostingDecodingUtil.java:
##
@@ -29,13 +29,14 @@ public DefaultPostingDecodingUtil(IndexInput in) {
 
   @Override
   public void splitLongs(

Review Comment:
   > can we determine cIndex as `count * ((bShift - 1) / dec + 1)`?
   
   We could do something like that, though it's often more convenient to start 
at `0` when the `c` array is `tmp` rather than `longs`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]

2024-08-15 Thread via GitHub


jpountz commented on code in PR #13658:
URL: https://github.com/apache/lucene/pull/13658#discussion_r1718955964


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/PostingIndexInput.java:
##
@@ -50,6 +53,6 @@ public void decode(int bitsPerValue, long[] longs) throws 
IOException {
* and store results into {@code longs}.
*/
   public void decodeAndPrefixSum(int bitsPerValue, long base, long[] longs) 
throws IOException {

Review Comment:
   I forgot to add a comment about that: `PostingsDecodingUtil` instances can 
only be retrieved from specific call sites, see 
`VectorizationProvider#ensureCaller`. So I needed to introduce some abstraction 
that could be used from the benchmark-jmh module. Before this PR, this didn't 
look too odd because this abstraction was also used in 
`Lucene912PostingsReader`, but indeed now it's only used by the benchmark-jmh 
module. I'll add a comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]

2024-08-15 Thread via GitHub


jpountz commented on PR #13658:
URL: https://github.com/apache/lucene/pull/13658#issuecomment-2292214610

   Thanks for taking a look @gsmiller ! I believe that I addressed all your 
(good) comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Optimize decoding blocks of postings using the vector API. (#13636) [lucene]

2024-08-15 Thread via GitHub


jpountz commented on PR #13652:
URL: https://github.com/apache/lucene/pull/13652#issuecomment-2292242821

   OK, I did it for src/java20 too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]

2024-08-15 Thread via GitHub


gsmiller commented on code in PR #13658:
URL: https://github.com/apache/lucene/pull/13658#discussion_r1719059369


##
lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultPostingDecodingUtil.java:
##
@@ -29,13 +29,14 @@ public DefaultPostingDecodingUtil(IndexInput in) {
 
   @Override
   public void splitLongs(

Review Comment:
   > We could do something like that, though it's often more convenient to 
start at 0 when the c array is tmp rather than longs.
   
   Ha oops. That was a note to myself to think about it some more. Didn't mean 
for that to make it into my feedback :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]

2024-08-15 Thread via GitHub


gsmiller commented on code in PR #13658:
URL: https://github.com/apache/lucene/pull/13658#discussion_r1719062831


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/PostingIndexInput.java:
##
@@ -50,6 +53,6 @@ public void decode(int bitsPerValue, long[] longs) throws 
IOException {
* and store results into {@code longs}.
*/
   public void decodeAndPrefixSum(int bitsPerValue, long base, long[] longs) 
throws IOException {

Review Comment:
   That makes sense, thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce memory usage of SkipListWriter [lucene]

2024-08-15 Thread via GitHub


github-actions[bot] commented on PR #13576:
URL: https://github.com/apache/lucene/pull/13576#issuecomment-2292498818

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org