Re: [PR] Replace Map with IntObjectHashMap for KnnVectorsReader [lucene]
jpountz merged PR #13763: URL: https://github.com/apache/lucene/pull/13763 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Simplify codec setup in vector-related tests. [lucene]
jpountz opened a new pull request, #13970: URL: https://github.com/apache/lucene/pull/13970 Many vector-related tests set up a codec manually by extending the current codec. This makes bumping the current codec a bit painful as all these files need to be touched. This commit migrates to `TestUtil#alwaysKnnVectorsFormat`, similarly to what we do for postings and doc values.
Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]
gsmiller commented on PR #13969: URL: https://github.com/apache/lucene/pull/13969#issuecomment-2450206811 @jpountz thanks for the clarification! Let's not make the javadoc change then since it sounds like a reasonable reason to keep the requirement that values are sorted beginning at index `0` and not `from`. (We could always change it later if it seemed like there was a useful reason to not require values `[0, from]` to be sorted).
Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]
gsmiller commented on code in PR #13968: URL: https://github.com/apache/lucene/pull/13968#discussion_r1824790729 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/PostingDecodingUtil.java: ## @@ -55,4 +55,30 @@ public void splitLongs( c[cIndex + i] &= cMask; } } + + /** + * Core methods for decoding blocks of docs / freqs / positions / offsets. + * + * + * Read {@code count} ints. + * For all {@code i} >= 0 so that {@code bShift - i * dec} > 0, apply shift {@code + * bShift - i * dec} and store the result in {@code b} at offset {@code count * i}. + * Apply mask {@code cMask} and store the result in {@code c} starting at offset {@code + * cIndex}. + * + */ + public void splitInts( Review Comment: Should we drop `#splitLongs`? (Also, should we add `@lucene.internal` to this class so we're free to drop public methods?)
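The javadoc quoted in this review comment describes the split in words, so a scalar sketch may help when reading it. The real `splitInts` signature is cut off in the quoted diff, so the class name `SplitIntsSketch`, the parameter list, and the use of a plain `int[]` as the input source are all assumptions made for illustration, not the Lucene API.

```java
// Hypothetical scalar sketch of the steps listed in the quoted javadoc.
// The parameter list is assumed; the diff above truncates the real one.
final class SplitIntsSketch {

  /**
   * Reads {@code count} ints from {@code in} into {@code c}, writes the shifted
   * copies into {@code b}, then masks {@code c} in place. {@code dec} must be > 0.
   */
  static void splitInts(
      int[] in, int count, int[] b, int bShift, int dec, int[] c, int cIndex, int cMask) {
    // Step 1: read count ints (here: copy them from an in-memory array).
    System.arraycopy(in, 0, c, cIndex, count);
    for (int i = 0; i < count; i++) {
      // Step 2: for every j >= 0 with bShift - j * dec > 0, shift and store
      // the result in b at offset count * j.
      for (int j = 0; bShift - j * dec > 0; j++) {
        b[count * j + i] = c[cIndex + i] >>> (bShift - j * dec);
      }
      // Step 3: apply cMask and keep the result in c starting at cIndex.
      c[cIndex + i] &= cMask;
    }
  }

  public static void main(String[] args) {
    int[] b = new int[2];
    int[] c = new int[2];
    splitInts(new int[] {0xAB, 0xCD}, 2, b, 4, 4, c, 0, 0xF);
    System.out.println(Integer.toHexString(b[0]) + " " + Integer.toHexString(c[0]));
  }
}
```

With `bShift = 4` and `dec = 4` only one shift applies, so each input value is split into its high nibble (in `b`) and low nibble (in `c`).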
Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]
gsmiller commented on PR #13969: URL: https://github.com/apache/lucene/pull/13969#issuecomment-2450301205 @jpountz after thinking a little more, I wonder if an `assert` would make sense to guard against unsorted data between index `0` and `from`? Probably quite unlikely, but would also be nice if that use-case tripped an assert now instead of silently working and then failing later because it didn't adhere to the contract outlined in the javadoc? We could do something like `assert IntStream.range(0, length - 1).noneMatch(i -> buffer[i] > buffer[i + 1]);`. It's trivial but I'm happy to add this if you think it would be reasonable.
Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]
jpountz commented on PR #13969: URL: https://github.com/apache/lucene/pull/13969#issuecomment-2450344763 This sounds good to me, maybe extract it to a function to avoid increasing the method size too much? Now that you made me look harder at this code, I'm also considering renaming `length` to `to` since `length` usually is a number of entries after `from` while it's an absolute end offset here.
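A minimal sketch of what the two suggestions in this exchange could look like together: the sortedness check extracted into a helper, and `length` renamed to `to`. The class name `SortedCheck` and the linear-scan `findNextGEQ` are hypothetical stand-ins, not the code that was committed.

```java
import java.util.stream.IntStream;

// Hypothetical illustration of the assertion discussed above.
final class SortedCheck {

  /** Returns true if buffer[0..to) is sorted in non-decreasing order. */
  static boolean isSorted(int[] buffer, int to) {
    return IntStream.range(0, to - 1).noneMatch(i -> buffer[i] > buffer[i + 1]);
  }

  /**
   * Stand-in for findNextGEQ: returns the first index in [from, to) whose value
   * is >= target, or {@code to} if there is none. The assert only runs with -ea.
   */
  static int findNextGEQ(int[] buffer, int target, int from, int to) {
    assert isSorted(buffer, to) : "buffer must be sorted from index 0";
    for (int i = from; i < to; i++) {
      if (buffer[i] >= target) {
        return i;
      }
    }
    return to;
  }

  public static void main(String[] args) {
    System.out.println(findNextGEQ(new int[] {1, 3, 5, 7}, 4, 1, 4));
  }
}
```

Keeping the check in its own method means the hot method only grows by a single `assert` statement, which is the concern raised about method size.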
Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]
jpountz commented on code in PR #13968: URL: https://github.com/apache/lucene/pull/13968#discussion_r1824819989 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/PostingDecodingUtil.java: ## @@ -55,4 +55,30 @@ public void splitLongs( c[cIndex + i] &= cMask; } } + + /** + * Core methods for decoding blocks of docs / freqs / positions / offsets. + * + * + * Read {@code count} ints. + * For all {@code i} >= 0 so that {@code bShift - i * dec} > 0, apply shift {@code + * bShift - i * dec} and store the result in {@code b} at offset {@code count * i}. + * Apply mask {@code cMask} and store the result in {@code c} starting at offset {@code + * cIndex}. + * + */ + public void splitInts( Review Comment: FWIW this class may only be used from a very small set of explicitly named classes, see `org.apache.lucene.internal.vectorization.VectorizationProvider#VALID_CALLERS`, so there is no risk that users use this API.
Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]
jpountz commented on code in PR #13968: URL: https://github.com/apache/lucene/pull/13968#discussion_r1824816062 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/PostingDecodingUtil.java: ## @@ -55,4 +55,30 @@ public void splitLongs( c[cIndex + i] &= cMask; } } + + /** + * Core methods for decoding blocks of docs / freqs / positions / offsets. + * + * + * Read {@code count} ints. + * For all {@code i} >= 0 so that {@code bShift - i * dec} > 0, apply shift {@code + * bShift - i * dec} and store the result in {@code b} at offset {@code count * i}. + * Apply mask {@code cMask} and store the result in {@code c} starting at offset {@code + * cIndex}. + * + */ + public void splitInts( Review Comment: Thanks for catching, I had meant to do it but missed some bits obviously.
Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]
jpountz commented on PR #13969: URL: https://github.com/apache/lucene/pull/13969#issuecomment-2450030846 I wrote the javadocs this way on purpose, so that it would still work to create an IntVector/LongVector that starts before `from` and count the number of values that are less than the target. E.g. something like this:
```java
if (length >= LONG_SPECIES.length() && length - from < LONG_SPECIES.length()) {
  // less than LONG_SPECIES.length() doc IDs
  LongVector vector = LongVector.fromArray(LONG_SPECIES, values, length - LONG_SPECIES.length());
  VectorMask<Long> mask = vector.compare(VectorOperators.LT, target);
  return length - LONG_SPECIES.length() + mask.trueCount();
} else {
  // other cases
}
```
The current implementation doesn't take advantage of it, so I don't mind removing it, we could add it back later on if we want to take advantage of it since it's an internal API.
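For readers without the incubating Vector API at hand, the idea in the snippet above can be emulated with a plain loop. This is only a scalar sketch: `LANES` is a hypothetical stand-in for `LONG_SPECIES.length()`, and it assumes that `values` is sorted from index 0 and that every value before `from` is smaller than `target`, which is what makes it safe to start the window before `from`.

```java
// Scalar emulation of the "window that starts before from" trick.
final class WindowCount {

  // Hypothetical lane count standing in for LONG_SPECIES.length().
  static final int LANES = 4;

  static int findNextGEQ(long[] values, long target, int from, int length) {
    if (length >= LANES && length - from < LANES) {
      // Fewer than LANES values remain, so take the last LANES values of the
      // array (possibly starting before from) and count those below target.
      int start = length - LANES;
      int lessThan = 0;
      for (int i = start; i < length; i++) {
        if (values[i] < target) {
          lessThan++;
        }
      }
      // In sorted data the count of smaller values is the offset of the first
      // value that is >= target within the window.
      return start + lessThan;
    }
    // Other cases: plain linear scan.
    for (int i = from; i < length; i++) {
      if (values[i] >= target) {
        return i;
      }
    }
    return length;
  }

  public static void main(String[] args) {
    long[] values = {1, 3, 5, 7, 9, 11};
    System.out.println(findNextGEQ(values, 10, 4, 6));
  }
}
```

If the values in `[0, from)` were not sorted, or were not all smaller than `target`, the count would no longer point at the right index, which is why the javadoc requires sortedness from index `0` rather than from `from`.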
Re: [PR] Speed up advancing within a block, take 2. [lucene]
jpountz commented on PR #13958: URL: https://github.com/apache/lucene/pull/13958#issuecomment-2450044413 If you check out data at https://github.com/apache/lucene/pull/13692#issuecomment-2324658146, `AndHighHigh` and `AndHighMed` tend to advance a bit further than `CountAndHighHigh` and `CountAndHighMed`, so that might be the issue. I am tempted to not touch anything yet and see how nightlies react to https://github.com/apache/lucene/pull/13968, which should allow checking 2x more values at once.
Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]
gsmiller closed pull request #13969: minor javadoc correction on VectorUtilSupport#findNextGEQ URL: https://github.com/apache/lucene/pull/13969
Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]
mikemccand commented on PR #13910: URL: https://github.com/apache/lucene/pull/13910#issuecomment-2450133029 > Could you add a CHANGES entry in 9.12 for your bug fix for 9.12.1? Ahh yes sorry I will do that today!
Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]
jpountz commented on PR #13968: URL: https://github.com/apache/lucene/pull/13968#issuecomment-2450268830 I plan on merging tomorrow, so that we have two data points with longs on nightly benchmarks before seeing how it performs with ints.
Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]
gsmiller commented on PR #13969: URL: https://github.com/apache/lucene/pull/13969#issuecomment-2450411129 > I'm also considering renaming `length` to `to` since `length` usually is a number of entries after `from` while it's an absolute end offset here +1. I noticed this as well when writing the assertion.
Re: [I] Move vector search from IndexInput to RandomAccessInput [lucene]
msokolov commented on issue #13938: URL: https://github.com/apache/lucene/issues/13938#issuecomment-2449719834 I think this will be helpful since currently we cannot share these readers across threads -- they retain the state information about the current position. Not sure how much benefit that will be since they must still typically maintain some local temporary storage to retain the value that is read.
Re: [PR] Speed up advancing within a block, take 2. [lucene]
jpountz commented on PR #13958: URL: https://github.com/apache/lucene/pull/13958#issuecomment-2449813210 Nightly benchmarks just picked up the change with a mix of speedups and slowdowns: https://benchmarks.mikemccandless.com/2024.10.30.18.12.23.html. Here are the main ones I'm seeing:

Speedups:
- CountAndHighHigh: +5%
- CountAndHighMed: +2.5%

Slowdowns:
- Phrase: -3.5%
- AndHighOrMedMed: -3%
- OrHighRare: -3%
- AndHighHigh: -3%
- AndHighMed: -2.5%

I'm a bit surprised/disappointed at the `AndHighHigh`/`AndHighMed` slowdown since this change is supposed to help conjunctions, and the counting queries proved it helps. I'll look into it.
[PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]
gsmiller opened a new pull request, #13969: URL: https://github.com/apache/lucene/pull/13969 (no comment)
Re: [PR] Account for 0 graph size when initializing HNSW graph [lucene]
mayya-sharipova commented on PR #13964: URL: https://github.com/apache/lucene/pull/13964#issuecomment-2450001895 @john-wagster Thanks for the review. I tried to write tests, but they need a lot of setup and mocks, and I thought it wasn't worth it. But I plan to write an integration-style test that will cover the changed part as a part of #13447
Re: [PR] Account for 0 graph size when initializing HNSW graph [lucene]
mayya-sharipova merged PR #13964: URL: https://github.com/apache/lucene/pull/13964
Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]
gsmiller commented on PR #13969: URL: https://github.com/apache/lucene/pull/13969#issuecomment-244971 @jpountz was looking through #13958 retroactively to understand the change and I _think_ I spotted a small javadoc error. Can you take a peek? Even though this is super trivial, I wanted to check with you prior to merging to make sure I'm not missing something. Thanks!
Re: [PR] Rename NodeHash to FSTSuffixNodeCache [lucene]
dungba88 commented on PR #13259: URL: https://github.com/apache/lucene/pull/13259#issuecomment-2451381448 Hi Lucene community, would someone kindly take a look at this PR? This is only minor renaming and Javadoc improvement.
Re: [PR] Use Arrays.mismatch in FSTCompiler#add. [lucene]
github-actions[bot] commented on PR #13924: URL: https://github.com/apache/lucene/pull/13924#issuecomment-2451063927 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Remove TODO in FSTCompiler#freezeTail. [lucene]
github-actions[bot] commented on PR #13923: URL: https://github.com/apache/lucene/pull/13923#issuecomment-2451063955 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [I] Move vector search from IndexInput to RandomAccessInput [lucene]
dungba88 commented on issue #13938: URL: https://github.com/apache/lucene/issues/13938#issuecomment-2449356939 Hi, I'm learning Lucene KNN and this seems to be a workable PR for a beginner. Just curious about the motivation behind this change. Is it only for cleaner code, or are we also expecting a latency improvement from the absolute readFloats method compared to the current seek() + readFloats()?
Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]
jpountz commented on PR #13968: URL: https://github.com/apache/lucene/pull/13968#issuecomment-2449457800 Here is a `luceneutil` run against `wikibigall`:
```
Task                  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff (range)  p-value
CountOrHighHigh          76.20  (1.0%)    74.54  (0.9%)   -2.2% ( -4% -  0%)  0.000
CountOrHighMed          143.11  (1.2%)   141.35  (1.0%)   -1.2% ( -3% -  0%)  0.001
CountAndHighHigh         57.54  (1.1%)    57.30  (0.9%)   -0.4% ( -2% -  1%)  0.189
TermDTSort              376.04  (6.4%)   374.67  (5.7%)   -0.4% (-11% - 12%)  0.853
AndHighHigh              91.53  (1.4%)    91.32  (1.9%)   -0.2% ( -3% -  3%)  0.669
HighTermDayOfYearSort   901.22  (3.8%)   899.77  (3.4%)   -0.2% ( -7% -  7%)  0.890
AndHighLow             1205.47  (1.7%)  1203.76  (2.0%)   -0.1% ( -3% -  3%)  0.810
OrHighMed               206.20  (2.5%)   206.34  (2.7%)    0.1% ( -5% -  5%)  0.935
OrNotHighLow           1148.24  (2.1%)  1149.12  (2.0%)    0.1% ( -3% -  4%)  0.908
OrHighLow               756.64  (1.7%)   757.29  (1.6%)    0.1% ( -3% -  3%)  0.872
MedTerm                 742.62  (2.1%)   743.99  (2.2%)    0.2% ( -4% -  4%)  0.793
AndHighMed              184.06  (1.4%)   184.45  (1.3%)    0.2% ( -2% -  2%)  0.622
PKLookup                271.47  (2.2%)   272.19  (2.5%)    0.3% ( -4% -  5%)  0.728
OrHighRare              280.73  (4.2%)   281.50  (5.4%)    0.3% ( -8% - 10%)  0.861
OrHighNotHigh           262.74  (2.8%)   263.51  (2.6%)    0.3% ( -4% -  5%)  0.736
AndStopWords             32.53  (4.5%)    32.64  (4.2%)    0.3% ( -8% -  9%)  0.806
HighTerm                443.84  (2.8%)   445.95  (2.1%)    0.5% ( -4% -  5%)  0.551
OrHighNotMed            485.11  (3.0%)   487.66  (3.4%)    0.5% ( -5% -  7%)  0.612
LowTerm                1138.20  (2.7%)  1144.70  (2.8%)    0.6% ( -4% -  6%)  0.519
Fuzzy2                   75.60  (2.1%)    76.05  (2.5%)    0.6% ( -3% -  5%)  0.429
Wildcard                116.11  (3.2%)   116.86  (4.1%)    0.6% ( -6% -  8%)  0.597
OrHighHigh               93.59  (3.6%)    94.24  (3.4%)    0.7% ( -6% -  7%)  0.538
OrNotHighHigh           261.04  (2.8%)   262.93  (2.4%)    0.7% ( -4% -  6%)  0.396
Fuzzy1                   80.27  (2.6%)    80.86  (2.6%)    0.7% ( -4% -  6%)  0.381
And2Terms2StopWords     161.95  (2.6%)   163.18  (2.5%)    0.8% ( -4% -  5%)  0.354
HighTermTitleBDVSort     15.67  (6.6%)    15.80  (5.8%)    0.8% (-10% - 14%)  0.677
OrHighNotLow            447.52  (4.0%)   451.75  (4.1%)    0.9% ( -6% -  9%)  0.471
And3Terms               178.15  (3.2%)   179.88  (2.8%)    1.0% ( -4% -  7%)  0.319
Or2Terms2StopWords      164.13  (3.7%)   166.04  (3.4%)    1.2% ( -5% -  8%)  0.312
OrStopWords              36.12  (6.7%)    36.55  (6.1%)    1.2% (-10% - 14%)  0.564
Or3Terms                178.00  (3.7%)   180.14  (3.5%)    1.2% ( -5% -  8%)  0.309
Prefix3                  70.94  (4.1%)    71.81  (8.1%)    1.2% (-10% - 13%)  0.554
IntNRQ                  179.05  (5.1%)   181.32  (5.4%)    1.3% ( -8% - 12%)  0.459
HighTermMonthSort      3413.39  (2.2%)  3459.32  (3.0%)    1.3% ( -3% -  6%)  0.111
OrNotHighMed            384.09  (3.2%)   389.69  (2.5%)    1.5% ( -4% -  7%)  0.112
OrMany                   19.16  (3.5%)    19.44  (3.6%)    1.5% ( -5% -  8%)  0.203
CountTerm              9388.28  (3.3%)  9587.31  (4.2%)    2.1% ( -5% -  9%)  0.082
HighTermTitleSort       135.48  (1.9%)   139.76  (3.3%)    3.2% ( -1% -  8%)  0.000
CountAndHighMed         160.02  (1.3%)   168.58  (1.3%)    5.4% (  2% -  7%)  0.000
```
The `CountAndHighMed` and `HighTermTitleSort` speedups are consistently reproducible. I believe that the former is due to being able to compare 8 lanes at once instead of 4, and the latter is due to
[PR] Move postings back to int[]. [lucene]
jpountz opened a new pull request, #13968: URL: https://github.com/apache/lucene/pull/13968 In Lucene 8.4, we updated postings to work on long[] arrays internally. This allowed us to work around the lack of explicit vectorization support in the JVM (auto-vectorization doesn't detect all the scenarios that we would like to handle), for instance by summing up two integers in one operation. With explicit vectorization now available, it looks like we can get more benefits from the ability to compare multiple integers in one operation than from summing up two integers in one operation. Moving back to ints helps compare 2x more integers at once vs. longs. The diff is large because of the codec dance: `Lucene912PostingsFormat` and `Lucene100Codec` moved to `lucene/backward-codecs` and a new `Lucene101PostingsFormat` is a copy of the previous `Lucene912PostingsFormat` with a move from long[] arrays to int[] arrays, and changes to the on-disk format for blocks of packed integers. Note that `DataInput#readGroupVInt` and `VectorUtilSupport#findNextGEQ` have been cleaned up to only support `int[]` and no longer `long[]`.
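The "summing up two integers in one operation" trick mentioned in this description can be illustrated with a small sketch. The class `PackedSum` and its helpers are hypothetical and not taken from Lucene; the sketch assumes non-negative values whose low-half sums stay below 2^32, since otherwise a carry would spill into the high half.

```java
// Illustration of packing two ints into one long so that a single long
// addition sums both halves at once.
final class PackedSum {

  /** Packs hi into the upper 32 bits and lo into the lower 32 bits. */
  static long pack(int hi, int lo) {
    return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
  }

  /** Extracts the upper 32 bits. */
  static int hi(long packed) {
    return (int) (packed >>> 32);
  }

  /** Extracts the lower 32 bits. */
  static int lo(long packed) {
    return (int) packed;
  }

  public static void main(String[] args) {
    // One 64-bit addition performs two 32-bit additions.
    long sum = pack(3, 4) + pack(10, 20);
    System.out.println(hi(sum) + " " + lo(sum));
  }
}
```

With explicit SIMD the trade-off changes: a vector register of a given width holds twice as many 32-bit lanes as 64-bit lanes, which is the "2x more integers at once" argument made above.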
Re: [I] Move vector search from IndexInput to RandomAccessInput [lucene]
dungba88 commented on issue #13938: URL: https://github.com/apache/lucene/issues/13938#issuecomment-2451355944 > I think this will be helpful since currently we cannot share these readers across threads -- they retain the state information about the current position. Not sure how much benefit that will be since they must still typically maintain some local temporary storage to retain the value that is read. Gotcha, the current usage of seek + readFloats requires the Reader to keep the seek position. When we change to the RandomAccessInput, we expect the operation to have no side effects on the Reader, and thus they will be sharable.
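The difference between a positional read and an absolute read can be shown with `java.nio.ByteBuffer`, used here purely as an analogy for `IndexInput` versus `RandomAccessInput`. The class `AbsoluteRead` and its method names are hypothetical; the sketch only demonstrates which call mutates the reader's position, and makes no claim about the thread safety of `ByteBuffer` itself.

```java
import java.nio.ByteBuffer;

// Analogy only: relative reads move a cursor, absolute reads do not.
final class AbsoluteRead {

  /** seek + read: moves the buffer position, so the buffer carries state. */
  static float readRelative(ByteBuffer buf, int offset) {
    buf.position(offset);
    return buf.getFloat();
  }

  /** Absolute read: the offset is an argument and the position is untouched. */
  static float readAbsolute(ByteBuffer buf, int offset) {
    return buf.getFloat(offset);
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(8);
    buf.putFloat(0, 1.5f);
    buf.putFloat(4, 2.5f);
    System.out.println(readAbsolute(buf, 4) + " at position " + buf.position());
    System.out.println(readRelative(buf, 4) + " at position " + buf.position());
  }
}
```

Because the absolute variant leaves no trace on the reader, two callers reading different offsets cannot disturb each other's position, which is the sharing property discussed in this issue.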
Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]
jpountz merged PR #13968: URL: https://github.com/apache/lucene/pull/13968