Re: [PR] nocommit: demonstrate how a minor change in IndexSearcher can have an inexplicable performance impact [lucene]
epotyom commented on PR #13657: URL: https://github.com/apache/lucene/pull/13657#issuecomment-2290926100 Interesting... Thanks for isolating the change that causes the regression Greg! > impacted tasks are possibly taking the `leafSlices.length == 0` condition branch and avoiding the need to create the collectors list at all? I don't think it is that - if I understand correctly `leafSlices.length == 0` only if there are no segments at all - which shouldn't be the case for benchmark tests we are running. Just in case I added a commit on top of this PR to never create the ArrayList for collectors if the length is zero: https://github.com/apache/lucene/commit/e30b76d77926d5f68d70f8b1d3b7129a73b76942 , and it also shows regression: ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value HighTerm 525.44 (3.5%) 441.68 (3.7%) -15.9% ( -22% - -9%) 0.000 HighSloppyPhrase 33.23 (2.7%) 28.13 (1.4%) -15.3% ( -18% - -11%) 0.000 MedTerm 479.40 (4.6%) 410.71 (4.4%) -14.3% ( -22% - -5%) 0.000 LowTerm 433.49 (3.1%) 387.45 (3.1%) -10.6% ( -16% - -4%) 0.000 BrowseMonthTaxoFacets1.84 (9.0%)1.70 (8.8%) -7.9% ( -23% - 10%) 0.160 BrowseRandomLabelTaxoFacets1.37 (12.4%)1.26 (0.5%) -7.6% ( -18% -6%) 0.174 HighSpanNear 31.21 (1.7%) 28.97 (2.8%) -7.2% ( -11% - -2%) 0.000 ... ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize decoding blocks of postings using the vector API. [lucene]
jpountz commented on PR #13636: URL: https://github.com/apache/lucene/pull/13636#issuecomment-2291136659 A few queries got a small-ish speedup, e.g. - [AndMedOrHighHigh](https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html) +2.5% - [AndHighHigh](https://home.apache.org/~mikemccand/lucenebench/AndHighHigh.html) +1.5% - [CountAndHighMed](https://home.apache.org/~mikemccand/lucenebench/CountAndHighMed.html) +1.5% - [AndHighMed](https://home.apache.org/~mikemccand/lucenebench/AndHighMed.html) +1% -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Speed up prefix sums when decoding doc IDs. [lucene]
jpountz opened a new pull request, #13658: URL: https://github.com/apache/lucene/pull/13658 This updates file formats to compute prefix sums by summing up 8 deltas per long at the same time if the number of bits per value is 4 or less, and 4 deltas per long at the same time if the number of bits per value is between 5 included and 11 included. Otherwise, we keep summing up 2 deltas per long like today. The `PostingDecodingUtil` was slightly modified due to the fact that more numbers of bits per value now need to apply different shifts to the input data. E.g. now that we store integers that require 5 bits per value as 16-bit integers under the hood rather than 8, we extract the first values by shifting by 16-5=11, 16-2*5=6 and 16-3*5=1 and then decode tail values from the remaining bit per 16-bit integer. Micro benchmarks suggest a noticeable speedup for prefix sums and bits per value 2, 3, 4 when we now sum up 8 values at once instead of 2, and a minor speedup otherwise. For reference, wikibigall is frequently using 2, 3 or 4 bits per value on blocks of doc deltas of stop words ("the", "a", "1", etc.) Before (without enabling the vector module): ``` Benchmark (bpv) Mode Cnt Score Error Units PostingIndexInputBenchmark.decode 2 thrpt 15 55,166 ± 0,480 ops/us PostingIndexInputBenchmark.decode 3 thrpt 15 51,257 ± 0,086 ops/us PostingIndexInputBenchmark.decode 4 thrpt 15 55,769 ± 0,271 ops/us PostingIndexInputBenchmark.decode 5 thrpt 15 50,789 ± 0,062 ops/us PostingIndexInputBenchmark.decode 6 thrpt 15 49,804 ± 0,187 ops/us PostingIndexInputBenchmark.decode 7 thrpt 15 47,552 ± 0,522 ops/us PostingIndexInputBenchmark.decode 8 thrpt 15 61,442 ± 0,330 ops/us PostingIndexInputBenchmark.decode 9 thrpt 15 40,030 ± 0,084 ops/us PostingIndexInputBenchmark.decode 10 thrpt 15 41,480 ± 0,458 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 2 thrpt 15 20,808 ± 0,063 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 3 thrpt 15 20,181 ± 0,184 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 4 thrpt 15 20,931 ± 0,141 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 5 thrpt 15 20,195 ± 0,118 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 6 thrpt 15 20,304 ± 0,185 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 7 thrpt 15 19,334 ± 0,204 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 8 thrpt 15 21,579 ± 0,071 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 9 thrpt 15 17,726 ± 0,253 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 10 thrpt 15 18,021 ± 0,267 ops/us ``` After (without enabling the vector module): ``` Benchmark (bpv) Mode Cnt Score Error Units PostingIndexInputBenchmark.decode 2 thrpt 15 57,723 ± 0,501 ops/us PostingIndexInputBenchmark.decode 3 thrpt 15 52,526 ± 0,255 ops/us PostingIndexInputBenchmark.decode 4 thrpt 15 57,395 ± 0,424 ops/us PostingIndexInputBenchmark.decode 5 thrpt 15 50,513 ± 0,076 ops/us PostingIndexInputBenchmark.decode 6 thrpt 15 47,176 ± 0,146 ops/us PostingIndexInputBenchmark.decode 7 thrpt 15 44,838 ± 0,138 ops/us PostingIndexInputBenchmark.decode 8 thrpt 15 61,604 ± 0,262 ops/us PostingIndexInputBenchmark.decode 9 thrpt 15 37,737 ± 0,057 ops/us PostingIndexInputBenchmark.decode 10 thrpt 15 37,079 ± 0,364 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 2 thrpt 15 30,823 ± 0,052 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 3 thrpt 15 28,148 ± 0,155 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 4 thrpt 15 30,545 ± 0,059 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 5 thrpt 15 22,332 ± 0,100 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 6 thrpt 15 22,240 ± 0,029 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 7 thrpt 15 21,809 ± 0,038 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 8 thrpt 15 23,279 ± 0,376 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 9 thrpt 15 21,624 ± 0,073 ops/us PostingIndexInputBenchmark.decodeAndPrefixSum 10 thrpt 15 21,348 ± 0,033 ops/us ``` After (with enabling the vector module - Apple M3 - preferredBitSize=128): ``` Benchmark (bpv) Mode Cnt Score Error
Re: [PR] Slightly speed up decoding blocks of postings/freqs/positions. [lucene]
jpountz commented on PR #13631: URL: https://github.com/apache/lucene/pull/13631#issuecomment-2291233215 @gsmiller Rebasing this PR proved a bit challenging after #13636 got merged, so I ended up creating a new one that only speeds up prefix sums without disabling slower numbers of bits per value: #13658. I plan on looking into disabling the slower bits per value in a different PR if it still proves appealing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] TestHnswFloatVectorGraph.testRandomReadWriteAndMerge fails with java.lang.IndexOutOfBoundsException [lucene]
ChrisHegarty opened a new issue, #13659: URL: https://github.com/apache/lucene/issues/13659 ### Description ``` TestHnswFloatVectorGraph > testRandomReadWriteAndMerge FAILED java.lang.IndexOutOfBoundsException: Index 2147483647 out of bounds for length 18 at __randomizedtesting.SeedInfo.seed([11F871BAF8E02ADC:343906638A98C7FC]:0) at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:100) at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106) at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302) at java.base/java.util.Objects.checkIndex(Objects.java:385) at java.base/java.util.ArrayList.get(ArrayList.java:427) at org.apache.lucene.util.hnsw.RandomAccessVectorValues$1.vectorValue(RandomAccessVectorValues.java:135) at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1.score(DefaultFlatVectorScorer.java:142) at org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel(HnswGraphSearcher.java:206) at org.apache.lucene.util.hnsw.HnswGraphBuilder.connectComponents(HnswGraphBuilder.java:462) at org.apache.lucene.util.hnsw.HnswGraphBuilder.connectComponents(HnswGraphBuilder.java:423) at org.apache.lucene.util.hnsw.HnswGraphBuilder.finish(HnswGraphBuilder.java:416) at org.apache.lucene.util.hnsw.HnswGraphBuilder.getCompletedGraph(HnswGraphBuilder.java:182) at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter$FieldWriter.getGraph(Lucene99HnswVectorsWriter.java:618) at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter.writeField(Lucene99HnswVectorsWriter.java:187) at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter.flush(Lucene99HnswVectorsWriter.java:149) at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.flush(PerFieldKnnVectorsFormat.java:115) at org.apache.lucene.index.VectorValuesConsumer.flush(VectorValuesConsumer.java:76) at org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:305) at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:456) at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:502) at org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:456) at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:649) at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3674) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4122) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4084) at org.apache.lucene.util.hnsw.HnswGraphTestCase.testRandomReadWriteAndMerge(HnswGraphTestCase.java:189) at org.apache.lucene.util.hnsw.TestHnswFloatVectorGraph.testRandomReadWriteAndMerge(TestHnswFloatVectorGraph.java:40) ... ``` ### Gradle command to reproduce ./gradlew :lucene:core:test --tests "org.apache.lucene.util.hnsw.TestHnswFloatVectorGraph.testRandomReadWriteAndMerge" -Ptests.jvms=6 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1" -Ptests.seed=11F871BAF8E02ADC -Ptests.locale=hy-AM -Ptests.timezone=America/Vancouver -Ptests.gui=false -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=128 -Ptests.forceintegervectors=true -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestHnswFloatVectorGraph.testReadWrite fails on branch 9x [lucene]
msokolov closed issue #13653: TestHnswFloatVectorGraph.testReadWrite fails on branch 9x URL: https://github.com/apache/lucene/issues/13653 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestHnswFloatVectorGraph.testReadWrite fails on branch 9x [lucene]
msokolov commented on issue #13653: URL: https://github.com/apache/lucene/issues/13653#issuecomment-2291316793 OK I'll mark this fixed now since we seem to have gone ~24h without any more failures from automated builds that I can see -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Expose FlatVectorsFormat [lucene]
msokolov commented on PR #13469: URL: https://github.com/apache/lucene/pull/13469#issuecomment-2291338923 Hi @navneet1v, I think you may be right although TBH I find the way Lucene handles formats with SPI to be somewhat confusing. We're using this in our own service with a custom Codec that seems to not require the META-INF/SPI lookup, bnut maybe there is some shortcoming to that approach -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] nocommit: demonstrate how a minor change in IndexSearcher can have an inexplicable performance impact [lucene]
gsmiller commented on PR #13657: URL: https://github.com/apache/lucene/pull/13657#issuecomment-2291353795 I got an even tighter isolation on the regressing change but still can't understand why it could be happening. It seems to have to do with whether-or-not the collectors list gets created before or after calling `createWeight`. So this order _does not_ cause any regression: ``` final Weight weight = createWeight(query, firstCollector.scoreMode(), 1); final List collectors = new ArrayList<>(leafSlices.length); ``` While this does: ``` final List collectors = new ArrayList<>(leafSlices.length); final Weight weight = createWeight(query, firstCollector.scoreMode(), 1); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Revert changes to IndexSearcher brought in by GH#13568 [lucene]
gsmiller commented on PR #13656: URL: https://github.com/apache/lucene/pull/13656#issuecomment-2291395658 Investigation continues on how #13568 could possibly have caused a regression to various nightly benchmark tasks, but I've confirmed that this patch fixed the issues (see [Term](https://home.apache.org/~mikemccand/lucenebench/Term.html), [SloppyPhrase](https://home.apache.org/~mikemccand/lucenebench/SloppyPhrase.html), [SpanNear](https://home.apache.org/~mikemccand/lucenebench/SpanNear.html), [CombinedHighMed](https://home.apache.org/~mikemccand/lucenebench/CombinedHighMed.html)). I'll back port this patch to 9x to make sure it doesn't get forgotten and we'll keep working on a better solution as we understand what's going on better (see conversation in #13657). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]
msokolov commented on issue #13565: URL: https://github.com/apache/lucene/issues/13565#issuecomment-2291406018 OK I'm beginning to see that it would be easier to start with a tool that rewrites the index as you did with the `BPIndexReorderer` -- integrating this thing as a new index format or as a new kind of IndexSort introduces all sorts of other complications that are not needed if we want to just get an idea of what the impact of such a change might be. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Revert changes to IndexSearcher brought in by GH#13568 [lucene]
mikemccand commented on PR #13656: URL: https://github.com/apache/lucene/pull/13656#issuecomment-2291583847 Thanks @gsmiller -- looks like this did indeed fix the nightly benchy e.g. [CombinedTerm](https://home.apache.org/~mikemccand/lucenebench/CombinedTerm.html), [TermQuery](https://home.apache.org/~mikemccand/lucenebench/Term.html), [SloppyPhrase](https://home.apache.org/~mikemccand/lucenebench/SloppyPhrase.html). But I'm still baffled why this hurt performance so much... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Revert changes to IndexSearcher brought in by GH#13568 [lucene]
gsmiller commented on PR #13656: URL: https://github.com/apache/lucene/pull/13656#issuecomment-2291752889 Totally @mikemccand . Have you followed what @epotyom and I found so far in #13657? It's wild. Just moving the initialization of the array list before or after the call to createWeight is what's making the difference. All I can think is some sort of jvm optimization is being disabled because of the change since I can't think of any other relationship here. We're going to do a little more debugging but I'll probably send out an email to the dev@ list tomorrow to see if anyone with more jvm expertise might be able to shed some light on what's going on. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestHnswFloatVectorGraph.testRandomReadWriteAndMerge fails with java.lang.IndexOutOfBoundsException [lucene]
benwtrent commented on issue #13659: URL: https://github.com/apache/lucene/issues/13659#issuecomment-2291778869 Yeah, gitbisect indicate `217828736c4 - gh-12627: HnswGraphBuilder connects disconnected HNSW graph components (#13566) (7 days ago) ` I am not 100% sure whats up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestHnswFloatVectorGraph.testRandomReadWriteAndMerge fails with java.lang.IndexOutOfBoundsException [lucene]
benwtrent commented on issue #13659: URL: https://github.com/apache/lucene/issues/13659#issuecomment-2291853791 From what I can tell: - `components` will return a component where the entry_point is `NO_MORE_DOCS` - This occurs because the `notFullyConnected.nextSetBit(0);` returns `NO_MORE_DOCS` - When checking for connectedness, we look at `size` not entry point ``` // connect other components to the largest one Component c0 = components.stream().max(Comparator.comparingInt(Component::size)).get(); ``` I am not sure what the behavior should be here, but it seems to me that a component whose entry point is `NO_MORE_DOCS` shouldn't get added to the components list. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Expose FlatVectorsFormat [lucene]
navneet1v commented on PR #13469: URL: https://github.com/apache/lucene/pull/13469#issuecomment-2291858889 Hi @msokolov thanks for sharing your thoughts. > I find the way Lucene handles formats with SPI to be somewhat confusing. +1. but I always feel my experience with SPI is low hence I might feel that. > We're using this in our own service with a custom Codec that seems to not require the META-INF/SPI lookup, bnut maybe there is some shortcoming to that approach Would like to know how you are doing it if there is some reference. Because I am also using a custom codec. The one way I figured out to use this is I create my own KNNVectorsFormat and then used the FlatWriters and Readers directly. Let me know if there is any better way. But this is best I can think of. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestHnswFloatVectorGraph.testRandomReadWriteAndMerge fails with java.lang.IndexOutOfBoundsException [lucene]
benwtrent commented on issue #13659: URL: https://github.com/apache/lucene/issues/13659#issuecomment-2291899075 @msokolov ^ what say you? It seems that a `NO_MORE_DOCS` indicates that the `notFullyConnected` just didn't ever get selected. Then later, when iterating the components ``` RandomVectorScorer scorer = scorerSupplier.scorer(c.start()); ``` And `start()` is `NO_MORE_DOCS`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Only attempt to connect components when entry point is valid [lucene]
benwtrent opened a new pull request, #13660: URL: https://github.com/apache/lucene/pull/13660 We should only connect a component if its entry point is valid. closes: https://github.com/apache/lucene/issues/13659 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]
gsmiller commented on code in PR #13658: URL: https://github.com/apache/lucene/pull/13658#discussion_r1718827005 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForDeltaUtil.java: ## @@ -62,22 +256,374 @@ void encodeDeltas(long[] longs, DataOutput out) throws IOException { assert or != 0; final int bitsPerValue = PackedInts.bitsRequired(or); out.writeByte((byte) bitsPerValue); - forUtil.encode(longs, bitsPerValue, out); + encodeDeltas(longs, bitsPerValue, out); +} + } + + private void encodeDeltas(long[] longs, int bitsPerValue, DataOutput out) throws IOException { +final int nextPrimitive; +final int numLongs; +if (bitsPerValue <= 4) { + nextPrimitive = 8; + numLongs = BLOCK_SIZE / 8; + collapse8(longs); +} else if (bitsPerValue <= 11) { + nextPrimitive = 16; + numLongs = BLOCK_SIZE / 4; + collapse16(longs); +} else { + nextPrimitive = 32; + numLongs = BLOCK_SIZE / 2; + collapse32(longs); +} + +final int numLongsPerShift = bitsPerValue * 2; +int idx = 0; +int shift = nextPrimitive - bitsPerValue; +for (int i = 0; i < numLongsPerShift; ++i) { + tmp[i] = longs[idx++] << shift; +} +for (shift = shift - bitsPerValue; shift >= 0; shift -= bitsPerValue) { + for (int i = 0; i < numLongsPerShift; ++i) { +tmp[i] |= longs[idx++] << shift; + } +} + +final int remainingBitsPerLong = shift + bitsPerValue; +final long maskRemainingBitsPerLong; +if (nextPrimitive == 8) { + maskRemainingBitsPerLong = MASKS8[remainingBitsPerLong]; +} else if (nextPrimitive == 16) { + maskRemainingBitsPerLong = MASKS16[remainingBitsPerLong]; +} else { + maskRemainingBitsPerLong = MASKS32[remainingBitsPerLong]; +} + +int tmpIdx = 0; +int remainingBitsPerValue = bitsPerValue; +while (idx < numLongs) { + if (remainingBitsPerValue >= remainingBitsPerLong) { +remainingBitsPerValue -= remainingBitsPerLong; +tmp[tmpIdx++] |= (longs[idx] >>> remainingBitsPerValue) & maskRemainingBitsPerLong; +if (remainingBitsPerValue == 0) { + idx++; + remainingBitsPerValue = bitsPerValue; +} + } else { +final long mask1, mask2; +if (nextPrimitive == 8) { + mask1 = MASKS8[remainingBitsPerValue]; + mask2 = MASKS8[remainingBitsPerLong - remainingBitsPerValue]; +} else if (nextPrimitive == 16) { + mask1 = MASKS16[remainingBitsPerValue]; + mask2 = MASKS16[remainingBitsPerLong - remainingBitsPerValue]; +} else { + mask1 = MASKS32[remainingBitsPerValue]; + mask2 = MASKS32[remainingBitsPerLong - remainingBitsPerValue]; +} +tmp[tmpIdx] |= (longs[idx++] & mask1) << (remainingBitsPerLong - remainingBitsPerValue); +remainingBitsPerValue = bitsPerValue - remainingBitsPerLong + remainingBitsPerValue; +tmp[tmpIdx++] |= (longs[idx] >>> remainingBitsPerValue) & mask2; + } +} + +for (int i = 0; i < numLongsPerShift; ++i) { + out.writeLong(tmp[i]); +} + } + + /** Number of bytes required to encode 128 integers of {@code bitsPerValue} bits per value. */ + int numBytes(int bitsPerValue) { +return bitsPerValue << (BLOCK_SIZE_LOG2 - 3); + } + + private static void decodeSlow( Review Comment: And one more method that I think is duplicated from `ForUtil`? ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForDeltaUtil.java: ## @@ -62,22 +256,374 @@ void encodeDeltas(long[] longs, DataOutput out) throws IOException { assert or != 0; final int bitsPerValue = PackedInts.bitsRequired(or); out.writeByte((byte) bitsPerValue); - forUtil.encode(longs, bitsPerValue, out); + encodeDeltas(longs, bitsPerValue, out); +} + } + + private void encodeDeltas(long[] longs, int bitsPerValue, DataOutput out) throws IOException { Review Comment: I think this is identical to `ForDelta#encode` right? Can we leverage that instead of duplicating it? I guess the issue is that `ForUtil` relies on its `tmp` buffer so the method isn't static, but we could split out a helper static method that takes the buffer as input that could be common I think. ## lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultPostingDecodingUtil.java: ## @@ -29,13 +29,14 @@ public DefaultPostingDecodingUtil(IndexInput in) { @Override public void splitLongs( Review Comment: What if we add the "slow" implementation as a static method in `PostingDecodingUtil` (as sketched out below) and then delegate to it here and in the "slow path" in the vectorized implementation? That we can avoid the code duplication. ``` protected static void splitLongsSlow(IndexInput in, int count, long[] b, int bShift, int dec, long bMask, long[] c, int cIndex, long cMask) throws IOEx
Re: [I] TestHnswFloatVectorGraph.testRandomReadWriteAndMerge fails with java.lang.IndexOutOfBoundsException [lucene]
msokolov commented on issue #13659: URL: https://github.com/apache/lucene/issues/13659#issuecomment-2292058048 Oh, the gift that keeps on giving! Your solution seems reasonable. I mean it would be nice if we didn't generate these degenerate Component in the first place? But this will work -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Only attempt to connect components when entry point is valid [lucene]
msokolov commented on code in PR #13660: URL: https://github.com/apache/lucene/pull/13660#discussion_r1718901258 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -456,6 +456,9 @@ private boolean connectComponents(int level) throws IOException { if (c != c0) { beam.clear(); eps[0] = c0.start(); + if (c.start() == NO_MORE_DOCS) { Review Comment: nit: we could move this to the beginning of the loop -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Only attempt to connect components when entry point is valid [lucene]
msokolov commented on code in PR #13660: URL: https://github.com/apache/lucene/pull/13660#discussion_r1718901258 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -456,6 +456,9 @@ private boolean connectComponents(int level) throws IOException { if (c != c0) { beam.clear(); eps[0] = c0.start(); + if (c.start() == NO_MORE_DOCS) { Review Comment: nit: we could move this to the beginning of the block -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Upgrade spotless to 6.9.1, google java format to 1.23.0. [lucene]
dweiss opened a new pull request, #13661: URL: https://github.com/apache/lucene/pull/13661 This is a trivial bump of plugin dependencies. One commit updates versions, another is a tidy run on the newer spotless/ google java format (mostly indentation within comments of switch statements). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]
jpountz commented on code in PR #13658: URL: https://github.com/apache/lucene/pull/13658#discussion_r1718950733 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultPostingDecodingUtil.java: ## @@ -29,13 +29,14 @@ public DefaultPostingDecodingUtil(IndexInput in) { @Override public void splitLongs( Review Comment: > can we determine cIndex as `count * ((bShift - 1) / dec + 1)`? We could do something like that, though it's often more convenient to start at `0` when the `c` array is `tmp` rather than `longs`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]
jpountz commented on code in PR #13658: URL: https://github.com/apache/lucene/pull/13658#discussion_r1718955964 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/PostingIndexInput.java: ## @@ -50,6 +53,6 @@ public void decode(int bitsPerValue, long[] longs) throws IOException { * and store results into {@code longs}. */ public void decodeAndPrefixSum(int bitsPerValue, long base, long[] longs) throws IOException { Review Comment: I forgot to add a comment about that: `PostingsDecodingUtil` instances can only be retrieved from specific call sites, see `VectorizationProvider#ensureCaller`. So I needed to introduce some abstraction that could be used from the benchmark-jmh module. Before this PR, this didn't look too odd because this abstraction was also used in `Lucene912PostingsReader`, but indeed now it's only used by the benchmark-jmh module. I'll add a comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]
jpountz commented on PR #13658: URL: https://github.com/apache/lucene/pull/13658#issuecomment-2292214610 Thanks for taking a look @gsmiller ! I believe that I addressed all your (good) comments. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize decoding blocks of postings using the vector API. (#13636) [lucene]
jpountz commented on PR #13652: URL: https://github.com/apache/lucene/pull/13652#issuecomment-2292242821 OK, I did it for src/java20 too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]
gsmiller commented on code in PR #13658: URL: https://github.com/apache/lucene/pull/13658#discussion_r1719059369 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultPostingDecodingUtil.java: ## @@ -29,13 +29,14 @@ public DefaultPostingDecodingUtil(IndexInput in) { @Override public void splitLongs( Review Comment: > We could do something like that, though it's often more convenient to start at 0 when the c array is tmp rather than longs. Ha oops. That was a note to myself to think about it some more. Didn't mean for that to make it into my feedback :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up prefix sums when decoding doc IDs. [lucene]
gsmiller commented on code in PR #13658: URL: https://github.com/apache/lucene/pull/13658#discussion_r1719062831 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/PostingIndexInput.java: ## @@ -50,6 +53,6 @@ public void decode(int bitsPerValue, long[] longs) throws IOException { * and store results into {@code longs}. */ public void decodeAndPrefixSum(int bitsPerValue, long base, long[] longs) throws IOException { Review Comment: That makes sense, thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Reduce memory usage of SkipListWriter [lucene]
github-actions[bot] commented on PR #13576: URL: https://github.com/apache/lucene/pull/13576#issuecomment-2292498818 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org