Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]
msokolov commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2430042116

With the most recent commit I saw these luceneutil/knnPerfTest.py results:

## 1. baseline

```
recall latency (ms) nDoc topK fanout maxConn beamWidth quantized index s force merge s num segments index size (MB)
0.816 0.294 15010 6 32 50 no 341.37 110.92 1 1534.03
0.811 0.308 15010 6 32 50 7 bits 346.68 93.22 1 1906.16
0.786 0.288 15010 6 32 50 4 bits 346.28 89.15 1 1906.10
```

## This change with defaults (no command line flags)

```
recall latency (ms) nDoc topK fanout maxConn beamWidth quantized index s force merge s num segments index size (MB)
0.817 0.304 15010 6 32 50 no 344.11 111.70 1 1533.94
0.812 0.231 15010 6 32 50 7 bits 354.29 89.76 1 1906.16
0.785 0.239 15010 6 32 50 4 bits 352.37 89.01 1 1906.12
```

## This change with vector api enabled:

```
recall latency (ms) nDoc topK fanout maxConn beamWidth quantized index s force merge s num segments index size (MB)
0.817 0.247 15010 6 32 50 no 0.00 0.17 1 1533.94
0.812 0.282 15010 6 32 50 7 bits 0.00 0.17 1 1906.16
0.785 0.207 15010 6 32 50 4 bits 0.00 0.17 1 1906.12
```

## This change with vector api and enable-native-access

```
recall latency (ms) nDoc topK fanout maxConn beamWidth quantized index s force merge s num segments index size (MB)
0.817 0.246 15010 6 32 50 no 0.00 0.17 1 1533.94
0.812 0.290 15010 6 32 50 7 bits 0.00 0.17 1 1906.16
0.785 0.206 15010 6 32 50 4 bits 0.00 0.18 1 1906.12
```

So I think there is some slowdown in the quantized indexing. I think we need to find a solution for the over-allocations due to having moved this logic from ScorerSupplier to Scorer. The best idea I have is to make Scorers mutable and supply them with new target vectors as needed. WDYT?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]
msokolov commented on code in PR #13872: URL: https://github.com/apache/lucene/pull/13872#discussion_r1811235616

## lucene/core/src/java/org/apache/lucene/codecs/hnsw/DefaultFlatVectorScorer.java: ##

@@ -88,34 +88,28 @@ public String toString() {
   /** RandomVectorScorerSupplier for bytes vector */
   private static final class ByteScoringSupplier implements RandomVectorScorerSupplier {
-    private final ByteVectorValues vectors;
-    private final ByteVectorValues vectors1;
-    private final ByteVectorValues vectors2;
+    private final ByteVectorValues vectorValues;
     private final VectorSimilarityFunction similarityFunction;

     private ByteScoringSupplier(
-        ByteVectorValues vectors, VectorSimilarityFunction similarityFunction) throws IOException {
-      this.vectors = vectors;
-      vectors1 = vectors.copy();
-      vectors2 = vectors.copy();
+        ByteVectorValues vectorValues, VectorSimilarityFunction similarityFunction)
+        throws IOException {
+      this.vectorValues = vectorValues;
       this.similarityFunction = similarityFunction;
     }

     @Override
-    public RandomVectorScorer scorer(int ord) {
-      return new RandomVectorScorer.AbstractRandomVectorScorer(vectors) {
+    public RandomVectorScorer scorer(int ord) throws IOException {
+      ByteVectorValues.Bytes vectors1 = vectorValues.vectors();
+      ByteVectorValues.Bytes vectors2 = vectorValues.vectors();
+      return new RandomVectorScorer.AbstractRandomVectorScorer(vectorValues) {

Review Comment: Yeah, this seems like a bad consequence. Maybe we could switch from a supplier/scorer to a mutable scorer that can be "set" to a new vector as needed?
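
The "mutable scorer" idea mentioned in the review comment could look roughly like the sketch below. Everything in it is hypothetical and only illustrates the shape of the proposal: `MutableScorer`, `setTarget`, and `DotProductScorer` are made-up names rather than Lucene APIs, and a plain `byte[][]` stands in for `ByteVectorValues`. The point is that the scratch buffer is allocated once per scorer instead of once per target vector.

```java
/** Hypothetical sketch, not Lucene API: one scorer reused across many target vectors. */
public class MutableScorerSketch {

  /** Hypothetical interface: the target ordinal can be changed after construction. */
  public interface MutableScorer {
    void setTarget(int ord);

    float score(int node);
  }

  /** Dot-product scorer over in-memory byte vectors; the scratch buffer is allocated once. */
  public static final class DotProductScorer implements MutableScorer {
    private final byte[][] vectors; // stand-in for ByteVectorValues
    private final byte[] target; // scratch reused for every target vector

    public DotProductScorer(byte[][] vectors, int dimension) {
      this.vectors = vectors;
      this.target = new byte[dimension];
    }

    @Override
    public void setTarget(int ord) {
      // Copy the new target into the existing scratch instead of allocating a new scorer.
      System.arraycopy(vectors[ord], 0, target, 0, target.length);
    }

    @Override
    public float score(int node) {
      int sum = 0;
      byte[] other = vectors[node];
      for (int i = 0; i < target.length; i++) {
        sum += target[i] * other[i];
      }
      return sum;
    }
  }

  public static void main(String[] args) {
    byte[][] vectors = {{1, 2}, {3, 4}, {5, 6}};
    MutableScorer scorer = new DotProductScorer(vectors, 2);
    for (int ord = 0; ord < vectors.length; ord++) {
      scorer.setTarget(ord); // one scorer, many targets
      System.out.println("target " + ord + " vs node 0: " + scorer.score(0));
    }
  }
}
```

A trade-off of this shape is that such a scorer is stateful and therefore not safe to share between threads, so each indexing or search thread would need its own instance.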
[PR] Ensure stability of clause order for DisjunctionMaxQuery toString [lucene]
ljak opened a new pull request, #13944: URL: https://github.com/apache/lucene/pull/13944

Since https://github.com/apache/lucene/pull/110, the disjuncts of a DisjunctionMaxQuery no longer have a defined order, which affects the output of the `toString` method. In isolation, that does not matter. But in Solr, when the debug component is used for a distributed query, every shard can return a different `toString` representation of the same query, and the corresponding keys of the debug response then hold an array of those different representations instead of a single value.

Example with the `parsedquery_toString` key (of a JSON response within Solr):

`"parsedquery_toString":["((docIdentifiers:\"Okarandeep Osingh\" docIdentifiers:Otest) | (docTitle:\"Okarandeep Osingh\" docTitle:Otest) | (docBody:\"Okarandeep Osingh\" docBody:Otest))","((docBody:\"Okarandeep Osingh\" docBody:Otest) | (docTitle:\"Okarandeep Osingh\" docTitle:Otest) | (docIdentifiers:\"Okarandeep Osingh\" docIdentifiers:Otest))"]`

When PR 110 was merged, Solr adapted its unit tests this way: https://github.com/apache/solr/pull/117. Later on, within Lucene, the toString method of DisjunctionIntervalsSource was adapted in anticipation of a potential similar future change: https://github.com/apache/lucene/pull/193. I adapted the toString method of DisjunctionMaxQuery in the same way as that PR.
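
To illustrate the kind of fix described here, the following is a minimal sketch that sorts the string form of each disjunct before joining them, so that the result no longer depends on set iteration order. It is not the code from the PR: `stableToString` is a hypothetical helper and plain strings stand in for the sub-queries.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch: order-independent string form for an unordered set of disjuncts. */
public class StableToStringSketch {

  /** Joins the string forms of the disjuncts in sorted order. */
  public static String stableToString(Set<String> disjuncts) {
    return disjuncts.stream().sorted().collect(Collectors.joining(" | ", "(", ")"));
  }

  public static void main(String[] args) {
    // Two sets with the same elements, built in a different order.
    Set<String> a = new HashSet<>(List.of("docTitle:test", "docBody:test", "docIdentifiers:test"));
    Set<String> b = new HashSet<>(List.of("docIdentifiers:test", "docBody:test", "docTitle:test"));
    System.out.println(stableToString(a));
    System.out.println(stableToString(a).equals(stableToString(b)));
  }
}
```

Sorting by string form costs a little extra work on each `toString` call, which seems acceptable for a method that is mainly used for debugging output.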
[PR] Remove TopScoreDocCollector's dependency on HitsThresholdChecker. [lucene]
jpountz opened a new pull request, #13943: URL: https://github.com/apache/lucene/pull/13943

`TopScoreDocCollectorManager` has a dependency on `HitsThresholdChecker`, which is essentially a shared counter that is incremented until it reaches the total hits threshold, at which point the scorer can start dynamically pruning hits. This PR removes that dependency. A consequence is that dynamic pruning may start later, namely as soon as:

- either the current slice collected `totalHitsThreshold` hits,
- or another slice collected `totalHitsThreshold` hits and the current slice collected enough hits (up to 1,024) to check the shared `MaxScoreAccumulator`.

So in short, it trades a bit more work globally for a bit less contention. A longer-term goal of mine is to stop specializing our `CollectorManager`s based on whether they are going to be used concurrently or not.
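
The two conditions above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the PR: `SharedState` is a hypothetical stand-in for the shared `MaxScoreAccumulator` that only records whether some slice reached the threshold, and the check interval is fixed at 1,024 hits, which is one reading of "up to 1,024".

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: per-slice hit counting with an occasional look at shared state. */
public class SliceThresholdSketch {

  /** Stand-in for the shared accumulator: records whether any slice reached the threshold. */
  public static final class SharedState {
    final AtomicBoolean thresholdReached = new AtomicBoolean();
  }

  /** One collector per slice; only localHits is touched on the hot path. */
  public static final class SliceCollector {
    private static final int CHECK_INTERVAL = 1024; // assumed interval between shared checks
    private final int totalHitsThreshold;
    private final SharedState shared;
    private int localHits;
    private boolean pruning;

    public SliceCollector(int totalHitsThreshold, SharedState shared) {
      this.totalHitsThreshold = totalHitsThreshold;
      this.shared = shared;
    }

    public void collect() {
      localHits++;
      if (pruning) {
        return;
      }
      if (localHits >= totalHitsThreshold) {
        // Condition 1: this slice reached the threshold on its own.
        shared.thresholdReached.set(true);
        pruning = true;
      } else if (localHits % CHECK_INTERVAL == 0 && shared.thresholdReached.get()) {
        // Condition 2: another slice reached it, noticed at the next periodic check.
        pruning = true;
      }
    }

    public boolean isPruning() {
      return pruning;
    }
  }

  public static void main(String[] args) {
    SharedState shared = new SharedState();
    SliceCollector fast = new SliceCollector(10, shared);
    SliceCollector slow = new SliceCollector(5000, shared);
    for (int i = 0; i < 10; i++) {
      fast.collect();
    }
    for (int i = 0; i < 1024; i++) {
      slow.collect();
    }
    System.out.println("fast pruning: " + fast.isPruning() + ", slow pruning: " + slow.isPruning());
  }
}
```

Compared with a shared counter that every slice increments on every hit, the shared state here is read only once per interval, which is where the reduced contention would come from.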
Re: [PR] Remove TopScoreDocCollector's dependency on HitsThresholdChecker. [lucene]
jpountz commented on PR #13943: URL: https://github.com/apache/lucene/pull/13943#issuecomment-2429765576

wikibigall with a `searchConcurrency` of 8 suggests that the slowdown is tiny:

```
Task                  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndHighLow              849.71  (3.3%)   826.18  (2.2%)  -2.8% ( -7% -  2%) 0.002
HighTermDayOfYearSort   253.62  (3.2%)   246.72  (3.0%)  -2.7% ( -8% -  3%) 0.005
TermDTSort              202.82  (3.4%)   198.21  (4.4%)  -2.3% ( -9% -  5%) 0.069
HighTermTitleBDVSort     49.87  (5.1%)    49.01  (5.5%)  -1.7% (-11% -  9%) 0.306
OrHighRare              245.78  (8.9%)   242.01  (9.2%)  -1.5% (-17% - 18%) 0.591
And2Terms2StopWords     184.35  (5.3%)   181.84  (4.5%)  -1.4% (-10% -  8%) 0.379
AndHighHigh             102.63  (6.9%)   101.25  (5.9%)  -1.3% (-13% - 12%) 0.507
CountTerm              8520.99  (3.9%)  8417.05  (5.1%)  -1.2% ( -9% -  8%) 0.396
Wildcard                112.90  (5.6%)   111.56  (4.8%)  -1.2% (-10% -  9%) 0.471
Fuzzy1                   79.74  (1.7%)    79.07  (1.7%)  -0.8% ( -4% -  2%) 0.114
OrHighLow               724.26  (2.3%)   719.70  (2.2%)  -0.6% ( -5% -  3%) 0.377
Or2Terms2StopWords      192.42  (5.4%)   191.24  (4.0%)  -0.6% ( -9% -  9%) 0.680
And3Terms               176.74  (5.3%)   175.76  (4.1%)  -0.6% ( -9% -  9%) 0.712
CountOrHighHigh          88.76  (4.9%)    88.34  (4.4%)  -0.5% ( -9% -  9%) 0.744
HighTermMonthSort      1066.76  (1.6%)  1062.79  (1.9%)  -0.4% ( -3% -  3%) 0.506
Fuzzy2                   75.06  (1.4%)    74.85  (1.8%)  -0.3% ( -3% -  3%) 0.597
CountOrHighMed          137.67  (5.3%)   137.45  (4.4%)  -0.2% ( -9% - 10%) 0.920
OrHighNotHigh           196.77  (3.3%)   196.66  (3.5%)  -0.1% ( -6% -  6%) 0.959
HighTermTitleSort        70.08  (6.6%)    70.09  (5.7%)   0.0% (-11% - 13%) 0.994
OrHighHigh               94.07  (4.8%)    94.12  (5.1%)   0.0% ( -9% - 10%) 0.975
AndHighMed              182.18  (4.1%)   182.43  (3.4%)   0.1% ( -7% -  8%) 0.909
OrNotHighMed            255.11  (2.8%)   255.50  (3.3%)   0.2% ( -5% -  6%) 0.874
OrHighMed               242.11  (2.4%)   242.65  (2.5%)   0.2% ( -4% -  5%) 0.772
OrNotHighHigh           235.61  (2.3%)   236.26  (3.4%)   0.3% ( -5% -  6%) 0.766
HighTerm                361.55  (2.6%)   362.84  (2.7%)   0.4% ( -4% -  5%) 0.669
MedTerm                 453.24  (2.8%)   455.07  (2.5%)   0.4% ( -4% -  5%) 0.628
OrHighNotMed            317.00  (3.0%)   318.40  (3.8%)   0.4% ( -6% -  7%) 0.680
PKLookup                277.48  (2.2%)   278.76  (2.7%)   0.5% ( -4% -  5%) 0.558
OrMany                   46.17  (2.3%)    46.41  (2.9%)   0.5% ( -4% -  5%) 0.520
Prefix3                  68.55  (4.0%)    69.01  (4.7%)   0.7% ( -7% -  9%) 0.627
OrHighNotLow            336.53  (3.2%)   339.73  (3.9%)   1.0% ( -5% -  8%) 0.395
AndStopWords             64.81  (5.3%)    65.46  (5.3%)   1.0% ( -9% - 12%) 0.543
LowTerm                 640.08  (3.1%)   647.88  (2.6%)   1.2% ( -4% -  7%) 0.176
CountAndHighHigh         74.37  (5.2%)    75.37  (5.6%)   1.3% ( -8% - 12%) 0.426
CountAndHighMed         161.25  (5.1%)   163.59  (5.7%)   1.4% ( -8% - 12%) 0.394
OrNotHighLow            865.87  (3.3%)   880.11  (2.7%)   1.6% ( -4% -  7%) 0.081
Or3Terms                175.34  (4.3%)   178.28  (4.9%)   1.7% ( -7% - 11%) 0.252
OrStopWords              69.26  (6.5%)    70.71  (6.4%)   2.1% (-10% - 15%) 0.303
IntNRQ                  166.81  (5.5%)   170.67 (10.5%)   2.3% (-12% - 19%) 0.381
```
Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]
msokolov commented on PR #13910: URL: https://github.com/apache/lucene/pull/13910#issuecomment-2429836870

Yes, maybe we should -- I think it would be a one-liner
Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]
msokolov commented on PR #13910: URL: https://github.com/apache/lucene/pull/13910#issuecomment-2429841476

There is another upgrade path: if you started with 9.0 and then "upgraded" your index by rewriting it via merge with 9.1-9.7 (e.g. with the IndexUpgrader tool), you could subsequently read the index with later versions. But this seemed kind of complex to explain for a case that probably doesn't exist.
Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]
msokolov commented on code in PR #13872: URL: https://github.com/apache/lucene/pull/13872#discussion_r1811216599

## lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapQuantizedByteVectorValues.java: ##

@@ -127,31 +121,42 @@ public int size() {
   }

   @Override
-  public byte[] vectorValue(int targetOrd) throws IOException {
-    if (lastOrd == targetOrd) {
-      return binaryValue;
-    }
-    slice.seek((long) targetOrd * byteSize);
-    slice.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), numBytes);
-    slice.readFloats(scoreCorrectionConstant, 0, 1);
-    decompressBytes(binaryValue, numBytes);
-    lastOrd = targetOrd;
-    return binaryValue;
-  }
+  public QuantizedBytes vectors() throws IOException {
+    return new QuantizedBytes() {
+      ByteBuffer byteBuffer = ByteBuffer.allocate(dimension);
+      byte[] binaryValue = byteBuffer.array();
+      IndexInput input = slice.clone();
+      float[] scoreCorrectionConstant = new float[1];

Review Comment: Personally I don't care about making these final - the compiler already ensures that they are or it wouldn't let you use them in a closure like this. As for private, I don't think you can make local variables private, but maybe I am missing something.
Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]
benwtrent commented on PR #13910: URL: https://github.com/apache/lucene/pull/13910#issuecomment-2429748958

@msokolov could we do a simpler patch for 9.12.1?
Re: [I] Should we avoid allocating a byte[] upfront for binary doc values [lucene]
iverase closed issue #13929: Should we avoid allocating a byte[] upfront for binary doc values URL: https://github.com/apache/lucene/issues/13929
Re: [I] Should we avoid allocating a byte[] upfront for binary doc values [lucene]
iverase commented on issue #13929: URL: https://github.com/apache/lucene/issues/13929#issuecomment-2429888740

I really wish our binary doc values didn't imply that you need to have everything on heap in order to read them; it feels wrong. But anyway, I understand it won't happen easily. Closing.
Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]
msokolov commented on code in PR #13872: URL: https://github.com/apache/lucene/pull/13872#discussion_r1811229378

## lucene/core/src/java21/org/apache/lucene/internal/vectorization/Lucene99MemorySegmentByteVectorScorerSupplier.java: ##

@@ -112,20 +96,20 @@ static final class CosineSupplier extends Lucene99MemorySegmentByteVectorScorerS
     @Override
     public RandomVectorScorer scorer(int ord) {
       checkOrdinal(ord);
+      MemorySegmentAccessInput slice = input.clone();
+      byte[] scratch1 = new byte[vectorByteSize];
+      byte[] scratch2 = new byte[vectorByteSize];

Review Comment: Yeah, this just seemed cleaner than trying to make that conditional, and my assumption is these scorers are not created that often? Once per search? Although I guess when indexing that could be a lot (once per doc).

The challenge here is that `getSegment()` is a member of the Supplier while the Scorers are the ones that should be supplying the scratch data, so we can't easily create scratch lazily. I guess we could create some new abstraction in here to handle that but it seems kind of messy. Is there some way to know "up front" whether a MemorySegment is going to be produced? If we knew that we could allocate scratch space or not based on that knowledge.

I have to say I'm a little lost in this java21 MemorySegment code -- maybe @ChrisHegarty will weigh in and explain what the conditions are that lead to segmentSliceOrNull returning null?
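
One shape the "new abstraction" for lazy scratch might take is a small holder that allocates its buffer only the first time the slow path needs it. The sketch below is purely illustrative: `Scratch` and `read` are hypothetical names, a nullable `byte[]` stands in for the result of `segmentSliceOrNull`, and none of this reflects how the actual Lucene classes are organized.

```java
/** Hypothetical sketch: scratch space that is allocated only if the slow path is taken. */
public class LazyScratchSketch {

  /** Holder owned by a scorer; the buffer stays null until it is first requested. */
  public static final class Scratch {
    private final int size;
    private byte[] buffer;
    private int allocations; // only tracked to make the behavior observable

    public Scratch(int size) {
      this.size = size;
    }

    public byte[] get() {
      if (buffer == null) {
        buffer = new byte[size];
        allocations++;
      }
      return buffer;
    }

    public int allocations() {
      return allocations;
    }
  }

  /**
   * Returns the direct view when one is available; otherwise copies the fallback source into
   * the lazily allocated scratch buffer and returns that.
   */
  public static byte[] read(byte[] sliceOrNull, byte[] fallbackSource, Scratch scratch) {
    if (sliceOrNull != null) {
      return sliceOrNull; // fast path: no scratch is ever allocated
    }
    byte[] buf = scratch.get();
    System.arraycopy(fallbackSource, 0, buf, 0, buf.length);
    return buf;
  }

  public static void main(String[] args) {
    Scratch scratch = new Scratch(2);
    read(new byte[] {1, 2}, null, scratch);
    System.out.println("allocations after fast path: " + scratch.allocations());
    read(null, new byte[] {3, 4}, scratch);
    read(null, new byte[] {5, 6}, scratch);
    System.out.println("allocations after slow path: " + scratch.allocations());
  }
}
```

The supplier could keep handing out slices while each scorer passes in its own holder, so the scorer remains the owner of the scratch data without allocating it eagerly.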
[PR] Removing the deprecated parameters, -fast, -slow, -crossCheckTermVectors from CheckIndex. [lucene]
slow-J opened a new pull request, #13942: URL: https://github.com/apache/lucene/pull/13942

Removing the deprecated parameters, -fast, -slow, -crossCheckTermVectors from CheckIndex. Their usage is replaced with `-level` with respective values of `1`, `3`, `3`. Follow-up on the deprecation done in #11023.
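
For callers that still build CheckIndex argument lists with the old flags, the mapping stated above can be written down as a small helper. This is a hypothetical migration aid and not part of Lucene: `migrate` is a made-up name, only the three flags and the `-level` values come from the PR description, and the sketch does not handle argument lists that already contain `-level`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: rewrite deprecated CheckIndex flags to the -level form. */
public class CheckIndexArgsSketch {

  /** Replaces -fast, -slow and -crossCheckTermVectors with a single trailing -level N. */
  public static List<String> migrate(List<String> args) {
    List<String> out = new ArrayList<>();
    int level = 0; // 0 means no deprecated flag was seen
    for (String arg : args) {
      switch (arg) {
        case "-fast":
          level = Math.max(level, 1);
          break;
        case "-slow":
        case "-crossCheckTermVectors":
          level = Math.max(level, 3);
          break;
        default:
          out.add(arg); // everything else is passed through unchanged
      }
    }
    if (level > 0) {
      out.add("-level");
      out.add(Integer.toString(level));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(migrate(List.of("/path/to/index", "-fast")));
    System.out.println(migrate(List.of("/path/to/index", "-slow", "-crossCheckTermVectors")));
  }
}
```

When several deprecated flags are present, the helper keeps the highest level, on the assumption that the most thorough requested check should win.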
Re: [PR] Have value and count in LabelAndValue only for TaxonomyFacets [lucene]
stefanvodita closed pull request #13740: Have value and count in LabelAndValue only for TaxonomyFacets URL: https://github.com/apache/lucene/pull/13740
Re: [I] Make CheckIndex doChecksumsOnly / -fast as default [LUCENE-9984] [lucene]
slow-J commented on issue #11023: URL: https://github.com/apache/lucene/issues/11023#issuecomment-2428849956

I'll clean up the deprecated CheckIndex params in Lucene 11.
Re: [PR] Make BooleanScorer work on top of Scorers rather than BulkScorers. [lucene]
jpountz commented on PR #13931: URL: https://github.com/apache/lucene/pull/13931#issuecomment-2429122034

There is a good speedup on nightly benchmarks too: https://benchmarks.mikemccandless.com/CountOrHighHigh.html.
Re: [PR] Speedup OrderIntervalsSource some more [lucene]
jpountz commented on PR #13937: URL: https://github.com/apache/lucene/pull/13937#issuecomment-2429119642

There is indeed a small speedup to intervals with a low p-value. https://benchmarks.mikemccandless.com/IntervalsOrdered.html I pushed an annotation.
Re: [PR] Reduce the compiled size of the collect() method on `TopScoreDocCollector`. [lucene]
jpountz merged PR #13939: URL: https://github.com/apache/lucene/pull/13939
Re: [PR] Introduce a heuristic to amortize the per-window overhead in MaxScoreBulkScorer. [lucene]
jpountz merged PR #13941: URL: https://github.com/apache/lucene/pull/13941
Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]
msokolov commented on PR #13910: URL: https://github.com/apache/lucene/pull/13910#issuecomment-2429428988

ok something like this:

Dear Lucene user community,

We recently uncovered a backwards compatibility bug that affects indexes created with version 9.0 containing KNN vector fields. Versions 9.8 - 9.12 are unable to search vectors in such indexes correctly and will return incorrect results without raising any error. We think it's likely very few if any of you are using 9.0 indexes, but if you are, possible mitigation steps are:

* Upgrade to 10.0 or later, or
* Do not upgrade past 9.7, or
* If you must use an affected Lucene version (9.8-9.12) and you have 9.0-written indexes including KNN vector fields, you must recreate those indexes from source with your current Lucene version.
[PR] Introduce a heuristic to amortize the per-window overhead in MaxScoreBulkScorer. [lucene]
jpountz opened a new pull request, #13941: URL: https://github.com/apache/lucene/pull/13941

It is sometimes possible for `MaxScoreBulkScorer` to compute windows that don't contain many candidate matches, resulting in more time spent evaluating maximum scores per window than evaluating candidate matches on this window. This PR introduces a heuristic that tries to require at least 32 candidate matches per clause per window to amortize the per-window overhead. This results in a speedup for the `OrMany` task.

```
Task                  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
OrHighLow               830.99  (2.8%)   821.55  (2.0%)  -1.1% ( -5% -  3%) 0.236
CountAndHighMed         149.53  (3.2%)   148.06  (1.8%)  -1.0% ( -5% -  4%) 0.335
CountAndHighHigh         49.23  (3.3%)    48.85  (2.1%)  -0.8% ( -6% -  4%) 0.483
OrHighRare              277.29  (5.9%)   275.20  (5.1%)  -0.8% (-11% - 10%) 0.728
LowTerm                1006.28  (2.7%)   999.28  (2.7%)  -0.7% ( -5% -  4%) 0.512
OrHighNotMed            461.91  (2.0%)   459.09  (3.1%)  -0.6% ( -5% -  4%) 0.556
AndHighMed              205.48  (2.0%)   204.44  (2.2%)  -0.5% ( -4% -  3%) 0.547
HighTermTitleBDVSort     20.30  (4.4%)    20.22  (4.0%)  -0.4% ( -8% -  8%) 0.798
OrHighNotLow            483.66  (2.2%)   481.97  (4.3%)  -0.3% ( -6% -  6%) 0.794
OrNotHighHigh           283.34  (2.3%)   282.47  (2.0%)  -0.3% ( -4% -  4%) 0.714
OrNotHighLow           1058.78  (3.5%)  1055.94  (2.6%)  -0.3% ( -6% -  6%) 0.826
AndHighHigh              78.53  (1.8%)    78.33  (1.9%)  -0.3% ( -3% -  3%) 0.721
OrHighHigh               77.35  (1.6%)    77.23  (1.6%)  -0.2% ( -3% -  3%) 0.812
OrNotHighMed            314.20  (2.9%)   313.96  (2.7%)  -0.1% ( -5% -  5%) 0.944
And2Terms2StopWords     155.15  (2.9%)   155.07  (1.8%)  -0.0% ( -4% -  4%) 0.961
OrHighNotHigh           285.50  (2.5%)   285.63  (1.8%)   0.0% ( -4% -  4%) 0.958
CountOrHighMed          104.73  (1.6%)   104.95  (1.6%)   0.2% ( -2% -  3%) 0.744
And3Terms               167.95  (3.2%)   168.63  (2.6%)   0.4% ( -5% -  6%) 0.729
IntNRQ                   90.83  (4.7%)    91.26 (14.9%)   0.5% (-18% - 21%) 0.913
OrHighMed               200.80  (2.1%)   201.78  (1.7%)   0.5% ( -3% -  4%) 0.511
HighTermTitleSort       149.37  (2.5%)   150.20  (2.0%)   0.6% ( -3% -  5%) 0.528
CountOrHighHigh          49.93  (1.4%)    50.24  (1.5%)   0.6% ( -2% -  3%) 0.270
AndHighLow             1079.98  (2.6%)  1086.73  (3.6%)   0.6% ( -5% -  7%) 0.613
Or2Terms2StopWords      158.09  (4.1%)   159.09  (2.4%)   0.6% ( -5% -  7%) 0.630
HighTerm                515.68  (2.2%)   519.07  (2.6%)   0.7% ( -4% -  5%) 0.490
HighTermMonthSort      3222.57  (3.4%)  3244.84  (2.9%)   0.7% ( -5% -  7%) 0.576
MedTerm                 582.99  (2.5%)   587.15  (2.5%)   0.7% ( -4% -  5%) 0.468
Wildcard                 82.76  (4.3%)    83.45  (3.8%)   0.8% ( -6% -  9%) 0.599
AndStopWords             30.49  (4.7%)    30.77  (2.4%)   0.9% ( -5% -  8%) 0.537
HighTermDayOfYearSort   813.54  (3.4%)   821.97  (2.1%)   1.0% ( -4% -  6%) 0.355
PKLookup                272.42  (2.7%)   275.38  (2.5%)   1.1% ( -4% -  6%) 0.288
Or3Terms                166.90  (4.3%)   168.77  (2.7%)   1.1% ( -5% -  8%) 0.424
OrStopWords              33.64  (6.5%)    34.29  (3.2%)   1.9% ( -7% - 12%) 0.335
TermDTSort              344.04  (6.6%)   351.30  (5.3%)   2.1% ( -9% - 15%) 0.371
Prefix3                 123.31  (3.5%)   126.03  (6.6%)   2.2% ( -7% - 12%) 0.286
CountTerm              8267.89  (4.4%)  8628.08  (4.7%)   4.4% ( -4% - 14%) 0.014
OrMany                   13.25  (3.7%)    18.87  (3.7%)  42.4% ( 33% - 51%) 0.000
```
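
As a rough illustration of how a heuristic of this kind could work, the sketch below grows a minimum window size whenever the previous window produced fewer than 32 candidates per clause. Only the figure of 32 comes from the PR description. The doubling strategy, the cap, and the names `nextMinWindowSize` and `MAX_WINDOW_SIZE` are assumptions made for the example and may differ from what `MaxScoreBulkScorer` actually does.

```java
/** Hypothetical sketch: adapt the minimum window size to amortize per-window overhead. */
public class WindowSizeSketch {

  static final int MIN_CANDIDATES_PER_CLAUSE = 32; // figure taken from the PR description
  static final int MAX_WINDOW_SIZE = 1 << 16; // assumed cap, chosen for illustration only

  /**
   * Returns the minimum size to use for the next window, given how many candidate matches the
   * previous window produced across numClauses clauses.
   */
  public static int nextMinWindowSize(int currentMinWindowSize, int numCandidates, int numClauses) {
    long threshold = (long) MIN_CANDIDATES_PER_CLAUSE * numClauses;
    if (numCandidates < threshold) {
      // Too few candidates to pay for the per-window setup: try a larger window next time.
      return (int) Math.min(2L * currentMinWindowSize, MAX_WINDOW_SIZE);
    }
    return currentMinWindowSize; // dense enough: keep the current size
  }

  public static void main(String[] args) {
    int minWindowSize = 1;
    int[] candidatesPerWindow = {3, 10, 40, 200};
    for (int candidates : candidatesPerWindow) {
      minWindowSize = nextMinWindowSize(minWindowSize, candidates, 2);
      System.out.println("candidates=" + candidates + " -> minWindowSize=" + minWindowSize);
    }
  }
}
```

Larger windows mean fewer maximum-score evaluations overall, at the cost of looser score upper bounds inside each window, which is presumably why such a heuristic would only kick in when windows are sparse.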