Re: [PR] Replace Map with IntObjectHashMap for KnnVectorsReader [lucene]

2024-10-31 Thread via GitHub


jpountz merged PR #13763:
URL: https://github.com/apache/lucene/pull/13763


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Simplify codec setup in vector-related tests. [lucene]

2024-10-31 Thread via GitHub


jpountz opened a new pull request, #13970:
URL: https://github.com/apache/lucene/pull/13970

   Many vector-related tests set up a codec manually by extending the
current codec. This makes bumping the current codec a bit painful, as all these
files need to be touched. This commit migrates them to
`TestUtil#alwaysKnnVectorsFormat`, similar to what we do for postings and doc
values.





Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]

2024-10-31 Thread via GitHub


gsmiller commented on PR #13969:
URL: https://github.com/apache/lucene/pull/13969#issuecomment-2450206811

   @jpountz thanks for the clarification! Let's not make the javadoc change
then, since there is a good reason to keep the requirement that values are
sorted beginning at index `0` and not `from`. (We could always change it later
if there turned out to be a useful reason not to require values in `[0, from]`
to be sorted.)





Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]

2024-10-31 Thread via GitHub


gsmiller commented on code in PR #13968:
URL: https://github.com/apache/lucene/pull/13968#discussion_r1824790729


##
lucene/core/src/java/org/apache/lucene/internal/vectorization/PostingDecodingUtil.java:
##
@@ -55,4 +55,30 @@ public void splitLongs(
       c[cIndex + i] &= cMask;
     }
   }
+
+  /**
+   * Core methods for decoding blocks of docs / freqs / positions / offsets.
+   *
+   *
+   *   Read {@code count} ints.
+   *   For all {@code i} >= 0 so that {@code bShift - i * dec} > 0, apply shift {@code
+   *   bShift - i * dec} and store the result in {@code b} at offset {@code count * i}.
+   *   Apply mask {@code cMask} and store the result in {@code c} starting at offset {@code
+   *   cIndex}.
+   *
+   */
+  public void splitInts(

Review Comment:
   Should we drop `#splitLongs`? (Also, should we add `@lucene.internal` to 
this class so we're free to drop public methods?)
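
A minimal scalar sketch of the three steps listed in the `splitInts` javadoc above may help when reading the review. It is an illustration under stated assumptions, not the code from the PR: the parameter list (including `bMask`) is not shown in the diff excerpt and is assumed here, and the input is a plain `int[]` instead of an index input.

```java
import java.util.Arrays;

public class SplitIntsSketch {
  /**
   * Scalar sketch of the split described in the javadoc. The signature is an assumption:
   * the source is an int[] (hypothetical) and bMask is an assumed extra parameter.
   */
  static void splitInts(
      int[] in, int count, int[] b, int bShift, int dec, int bMask, int[] c, int cIndex, int cMask) {
    // Step 1: read count ints into c, starting at cIndex.
    System.arraycopy(in, 0, c, cIndex, count);
    // Largest j such that bShift - j * dec > 0.
    int maxIter = (bShift - 1) / dec;
    for (int i = 0; i < count; i++) {
      for (int j = 0; j <= maxIter; j++) {
        // Step 2: the value shifted by bShift - j * dec lands in b at offset count * j.
        b[count * j + i] = (c[cIndex + i] >>> (bShift - j * dec)) & bMask;
      }
      // Step 3: keep only the masked low bits in c.
      c[cIndex + i] &= cMask;
    }
  }

  public static void main(String[] args) {
    int[] in = {0x00ABCDEF, 0x00123456};
    int[] b = new int[4];
    int[] c = new int[2];
    // With bShift = 16 and dec = 8, the shifts 16 and 8 satisfy bShift - j * dec > 0.
    splitInts(in, 2, b, 16, 8, 0xFF, c, 0, 0xFF);
    System.out.println(Arrays.toString(b) + " " + Arrays.toString(c));
  }
}
```

Each input int is split into byte-sized pieces here: the two upper pieces go to `b` (one stripe of `count` values per shift) and the lowest piece stays in `c`.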






Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]

2024-10-31 Thread via GitHub


gsmiller commented on PR #13969:
URL: https://github.com/apache/lucene/pull/13969#issuecomment-2450301205

   @jpountz after thinking a little more, I wonder if an `assert` would make 
sense to guard against unsorted data between index `0` and `from`? Probably 
quite unlikely, but would also be nice if that use-case tripped an assert now 
instead of silently working and then failing later because it didn't adhere to 
the contract outlined in the javadoc? We could do something like `assert 
IntStream.range(0, length - 1).noneMatch(i -> buffer[i] > buffer[i + 1]);`. 
It's trivial but I'm happy to add this if you think it would be reasonable.





Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]

2024-10-31 Thread via GitHub


jpountz commented on PR #13969:
URL: https://github.com/apache/lucene/pull/13969#issuecomment-2450344763

   This sounds good to me, maybe extract it to a function to avoid increasing 
the method size too much? Now that you made me look harder at this code, I'm 
also considering renaming `length` to `to` since `length` usually is a number 
of entries after `from` while it's an absolute end offset here.
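
A minimal sketch of what the extracted check could look like, assuming an `int[]` buffer and an end offset named `to` as proposed in this thread. The class name, the helper name `isSorted` and the scalar `findNextGEQ` body are hypothetical stand-ins, not the actual Lucene code.

```java
import java.util.stream.IntStream;

public class SortedCheckSketch {
  /** Returns true if buffer[0..to) is in non-decreasing order; to is an absolute end offset. */
  static boolean isSorted(int[] buffer, int to) {
    return IntStream.range(0, to - 1).noneMatch(i -> buffer[i] > buffer[i + 1]);
  }

  /** Hypothetical scalar stand-in: first index in [from, to) whose value is >= target, else to. */
  static int findNextGEQ(int[] buffer, int target, int from, int to) {
    // The check covers [0, to), not just [from, to), to match the documented contract.
    assert isSorted(buffer, to) : "buffer must be sorted on [0, to)";
    for (int i = from; i < to; i++) {
      if (buffer[i] >= target) {
        return i;
      }
    }
    return to;
  }

  public static void main(String[] args) {
    int[] docs = {1, 3, 3, 7};
    System.out.println(findNextGEQ(docs, 4, 1, docs.length));
  }
}
```

Keeping the stream-based check in its own method keeps the hot method small, and the check only runs when assertions are enabled (`java -ea`).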





Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]

2024-10-31 Thread via GitHub


jpountz commented on code in PR #13968:
URL: https://github.com/apache/lucene/pull/13968#discussion_r1824819989


##
lucene/core/src/java/org/apache/lucene/internal/vectorization/PostingDecodingUtil.java:
##
@@ -55,4 +55,30 @@ public void splitLongs(
       c[cIndex + i] &= cMask;
     }
   }
+
+  /**
+   * Core methods for decoding blocks of docs / freqs / positions / offsets.
+   *
+   *
+   *   Read {@code count} ints.
+   *   For all {@code i} >= 0 so that {@code bShift - i * dec} > 0, apply shift {@code
+   *   bShift - i * dec} and store the result in {@code b} at offset {@code count * i}.
+   *   Apply mask {@code cMask} and store the result in {@code c} starting at offset {@code
+   *   cIndex}.
+   *
+   */
+  public void splitInts(

Review Comment:
   FWIW this class may only be used from a very small set of explicitly named 
classes, see 
`org.apache.lucene.internal.vectorization.VectorizationProvider#VALID_CALLERS`, 
so there is no risk that users use this API.






Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]

2024-10-31 Thread via GitHub


jpountz commented on code in PR #13968:
URL: https://github.com/apache/lucene/pull/13968#discussion_r1824816062


##
lucene/core/src/java/org/apache/lucene/internal/vectorization/PostingDecodingUtil.java:
##
@@ -55,4 +55,30 @@ public void splitLongs(
       c[cIndex + i] &= cMask;
     }
   }
+
+  /**
+   * Core methods for decoding blocks of docs / freqs / positions / offsets.
+   *
+   *
+   *   Read {@code count} ints.
+   *   For all {@code i} >= 0 so that {@code bShift - i * dec} > 0, apply shift {@code
+   *   bShift - i * dec} and store the result in {@code b} at offset {@code count * i}.
+   *   Apply mask {@code cMask} and store the result in {@code c} starting at offset {@code
+   *   cIndex}.
+   *
+   */
+  public void splitInts(

Review Comment:
   Thanks for catching, I had meant to do it but missed some bits obviously.






Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]

2024-10-31 Thread via GitHub


jpountz commented on PR #13969:
URL: https://github.com/apache/lucene/pull/13969#issuecomment-2450030846

   I wrote the javadocs this way on purpose, so that it would still work to 
create an IntVector/LongVector that starts before `from` and count the number 
of values that are less than the target. E.g. something like this:
   
   ```java
   if (length >= LONG_SPECIES.length() && length - from < LONG_SPECIES.length()) {
     // less than LONG_SPECIES.length() doc IDs
     LongVector vector =
         LongVector.fromArray(LONG_SPECIES, values, length - LONG_SPECIES.length());
     VectorMask<Long> mask = vector.compare(VectorOperators.LT, target);
     return length - LONG_SPECIES.length() + mask.trueCount();
   } else {
     // other cases
   }
   ```
   
   The current implementation doesn't take advantage of it, so I don't mind
removing it. We could add it back later on if we want to take advantage of it,
since it's an internal API.
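
The counting trick in the snippet above can be sketched in scalar Java, without the Vector API. This is an illustration only: `LANES` is a hypothetical stand-in for `LONG_SPECIES.length()`, and the sketch assumes that every value before `from` is less than `target` (which holds when advancing forward), in addition to the values being sorted from index `0`.

```java
public class TrailingWindowSketch {
  // Hypothetical stand-in for LONG_SPECIES.length().
  static final int LANES = 4;

  /** Returns the first index in [from, length) whose value is >= target, or length if none. */
  static int findNextGEQ(long[] values, long target, int from, int length) {
    if (length >= LANES && length - from < LANES) {
      // Fewer than LANES values remain: inspect the last LANES values instead.
      // The window may start before from, which is why sortedness from index 0 matters.
      int start = length - LANES;
      int lessThan = 0;
      for (int i = start; i < length; i++) {
        if (values[i] < target) {
          lessThan++;
        }
      }
      // In a sorted window, the number of smaller values is the offset of the first GEQ value.
      return start + lessThan;
    }
    // Other cases: plain linear scan.
    for (int i = from; i < length; i++) {
      if (values[i] >= target) {
        return i;
      }
    }
    return length;
  }

  public static void main(String[] args) {
    long[] docs = {1, 3, 5, 7, 9, 11};
    System.out.println(findNextGEQ(docs, 10, 4, docs.length));
  }
}
```

If the values in `[start, from)` were not sorted, the count would no longer equal the position of the first matching value, so the result could be wrong even though `[from, length)` is sorted.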





Re: [PR] Speed up advancing within a block, take 2. [lucene]

2024-10-31 Thread via GitHub


jpountz commented on PR #13958:
URL: https://github.com/apache/lucene/pull/13958#issuecomment-2450044413

   If you check out the data at
https://github.com/apache/lucene/pull/13692#issuecomment-2324658146,
`AndHighHigh` and `AndHighMed` tend to advance a bit further than
`CountAndHighHigh` and `CountAndHighMed`, so that might be the issue. I am
tempted not to touch anything yet and see how nightlies react to
https://github.com/apache/lucene/pull/13968, which should allow checking 2x
more values at once.





Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]

2024-10-31 Thread via GitHub


gsmiller closed pull request #13969: minor javadoc correction on 
VectorUtilSupport#findNextGEQ
URL: https://github.com/apache/lucene/pull/13969





Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

2024-10-31 Thread via GitHub


mikemccand commented on PR #13910:
URL: https://github.com/apache/lucene/pull/13910#issuecomment-2450133029

   > Could you add a CHANGES entry in 9.12 for your bug fix for 9.12.1?
   
   Ahh yes sorry I will do that today!





Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]

2024-10-31 Thread via GitHub


jpountz commented on PR #13968:
URL: https://github.com/apache/lucene/pull/13968#issuecomment-2450268830

   I plan on merging tomorrow, so that we have two data points with longs on 
nightly benchmarks before seeing how it performs with ints.





Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]

2024-10-31 Thread via GitHub


gsmiller commented on PR #13969:
URL: https://github.com/apache/lucene/pull/13969#issuecomment-2450411129

   > I'm also considering renaming length to to since length usually is a 
number of entries after from while it's an absolute end offset here
   
   +1. I noticed this as well when writing the assertion.





Re: [I] Move vector search from IndexInput to RandomAccessInput [lucene]

2024-10-31 Thread via GitHub


msokolov commented on issue #13938:
URL: https://github.com/apache/lucene/issues/13938#issuecomment-2449719834

   I think this will be helpful since currently we cannot share these readers 
across threads -- they retain the state information about the current position. 
 Not sure how much benefit that will be since they must still typically 
maintain some local temporary storage to retain the value that is read





Re: [PR] Speed up advancing within a block, take 2. [lucene]

2024-10-31 Thread via GitHub


jpountz commented on PR #13958:
URL: https://github.com/apache/lucene/pull/13958#issuecomment-2449813210

   Nightly benchmarks just picked up the change with a mix of speedups and 
slowdowns: https://benchmarks.mikemccandless.com/2024.10.30.18.12.23.html. Here 
are the main ones I'm seeing:
   
   Speedups:
    - CountAndHighHigh: +5%
    - CountAndHighMed: +2.5%
   
   Slowdowns:
    - Phrase: -3.5%
    - AndHighOrMedMed: -3%
    - OrHighRare: -3%
    - AndHighHigh: -3%
    - AndHighMed: -2.5%
   
   I'm a bit surprised/disappointed at the `AndHighHigh`/`AndHighMed` slowdown 
since this change is supposed to help conjunctions, and the counting queries 
proved it helps. I'll look into it.





[PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]

2024-10-31 Thread via GitHub


gsmiller opened a new pull request, #13969:
URL: https://github.com/apache/lucene/pull/13969

   (no comment)





Re: [PR] Account for 0 graph size when initializing HNSW graph [lucene]

2024-10-31 Thread via GitHub


mayya-sharipova commented on PR #13964:
URL: https://github.com/apache/lucene/pull/13964#issuecomment-2450001895

   @john-wagster Thanks for the review. I tried to write tests, but they need
a lot of setup and mocks, and I thought it wasn't worth it.
   
   But I plan to write an integration-style test that will cover the changed
part as part of #13447.
   
   





Re: [PR] Account for 0 graph size when initializing HNSW graph [lucene]

2024-10-31 Thread via GitHub


mayya-sharipova merged PR #13964:
URL: https://github.com/apache/lucene/pull/13964





Re: [PR] minor javadoc correction on VectorUtilSupport#findNextGEQ [lucene]

2024-10-31 Thread via GitHub


gsmiller commented on PR #13969:
URL: https://github.com/apache/lucene/pull/13969#issuecomment-244971

   @jpountz was looking through #13958 retroactively to understand the change 
and I _think_ I spotted a small javadoc error. Can you take a peek? Even though 
this is super trivial, I wanted to check with you prior to merging to make sure 
I'm not missing something. Thanks!





Re: [PR] Rename NodeHash to FSTSuffixNodeCache [lucene]

2024-10-31 Thread via GitHub


dungba88 commented on PR #13259:
URL: https://github.com/apache/lucene/pull/13259#issuecomment-2451381448

   Hi Lucene community, would someone kindly take a look at this PR? This is 
only minor renaming and Javadoc improvement.





Re: [PR] Use Arrays.mismatch in FSTCompiler#add. [lucene]

2024-10-31 Thread via GitHub


github-actions[bot] commented on PR #13924:
URL: https://github.com/apache/lucene/pull/13924#issuecomment-2451063927

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Remove TODO in FSTCompiler#freezeTail. [lucene]

2024-10-31 Thread via GitHub


github-actions[bot] commented on PR #13923:
URL: https://github.com/apache/lucene/pull/13923#issuecomment-2451063955

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [I] Move vector search from IndexInput to RandomAccessInput [lucene]

2024-10-31 Thread via GitHub


dungba88 commented on issue #13938:
URL: https://github.com/apache/lucene/issues/13938#issuecomment-2449356939

   Hi, I'm learning Lucene KNN and this seems to be a workable PR for a
beginner. Just curious about the motivation behind this change. Is it only for
cleaner code, or do we also expect a latency improvement from the absolute
readFloats method compared to the current seek() + readFloats()?





Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]

2024-10-31 Thread via GitHub


jpountz commented on PR #13968:
URL: https://github.com/apache/lucene/pull/13968#issuecomment-2449457800

   Here is a `luceneutil` run against `wikibigall`:
   
   ```
   Task                  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff                  p-value
   CountOrHighHigh              76.20  (1.0%)                    74.54  (0.9%)     -2.2%  (  -4% -    0%)   0.000
   CountOrHighMed              143.11  (1.2%)                   141.35  (1.0%)     -1.2%  (  -3% -    0%)   0.001
   CountAndHighHigh             57.54  (1.1%)                    57.30  (0.9%)     -0.4%  (  -2% -    1%)   0.189
   TermDTSort                  376.04  (6.4%)                   374.67  (5.7%)     -0.4%  ( -11% -   12%)   0.853
   AndHighHigh                  91.53  (1.4%)                    91.32  (1.9%)     -0.2%  (  -3% -    3%)   0.669
   HighTermDayOfYearSort       901.22  (3.8%)                   899.77  (3.4%)     -0.2%  (  -7% -    7%)   0.890
   AndHighLow                 1205.47  (1.7%)                  1203.76  (2.0%)     -0.1%  (  -3% -    3%)   0.810
   OrHighMed                   206.20  (2.5%)                   206.34  (2.7%)      0.1%  (  -5% -    5%)   0.935
   OrNotHighLow               1148.24  (2.1%)                  1149.12  (2.0%)      0.1%  (  -3% -    4%)   0.908
   OrHighLow                   756.64  (1.7%)                   757.29  (1.6%)      0.1%  (  -3% -    3%)   0.872
   MedTerm                     742.62  (2.1%)                   743.99  (2.2%)      0.2%  (  -4% -    4%)   0.793
   AndHighMed                  184.06  (1.4%)                   184.45  (1.3%)      0.2%  (  -2% -    2%)   0.622
   PKLookup                    271.47  (2.2%)                   272.19  (2.5%)      0.3%  (  -4% -    5%)   0.728
   OrHighRare                  280.73  (4.2%)                   281.50  (5.4%)      0.3%  (  -8% -   10%)   0.861
   OrHighNotHigh               262.74  (2.8%)                   263.51  (2.6%)      0.3%  (  -4% -    5%)   0.736
   AndStopWords                 32.53  (4.5%)                    32.64  (4.2%)      0.3%  (  -8% -    9%)   0.806
   HighTerm                    443.84  (2.8%)                   445.95  (2.1%)      0.5%  (  -4% -    5%)   0.551
   OrHighNotMed                485.11  (3.0%)                   487.66  (3.4%)      0.5%  (  -5% -    7%)   0.612
   LowTerm                    1138.20  (2.7%)                  1144.70  (2.8%)      0.6%  (  -4% -    6%)   0.519
   Fuzzy2                       75.60  (2.1%)                    76.05  (2.5%)      0.6%  (  -3% -    5%)   0.429
   Wildcard                    116.11  (3.2%)                   116.86  (4.1%)      0.6%  (  -6% -    8%)   0.597
   OrHighHigh                   93.59  (3.6%)                    94.24  (3.4%)      0.7%  (  -6% -    7%)   0.538
   OrNotHighHigh               261.04  (2.8%)                   262.93  (2.4%)      0.7%  (  -4% -    6%)   0.396
   Fuzzy1                       80.27  (2.6%)                    80.86  (2.6%)      0.7%  (  -4% -    6%)   0.381
   And2Terms2StopWords         161.95  (2.6%)                   163.18  (2.5%)      0.8%  (  -4% -    5%)   0.354
   HighTermTitleBDVSort         15.67  (6.6%)                    15.80  (5.8%)      0.8%  ( -10% -   14%)   0.677
   OrHighNotLow                447.52  (4.0%)                   451.75  (4.1%)      0.9%  (  -6% -    9%)   0.471
   And3Terms                   178.15  (3.2%)                   179.88  (2.8%)      1.0%  (  -4% -    7%)   0.319
   Or2Terms2StopWords          164.13  (3.7%)                   166.04  (3.4%)      1.2%  (  -5% -    8%)   0.312
   OrStopWords                  36.12  (6.7%)                    36.55  (6.1%)      1.2%  ( -10% -   14%)   0.564
   Or3Terms                    178.00  (3.7%)                   180.14  (3.5%)      1.2%  (  -5% -    8%)   0.309
   Prefix3                      70.94  (4.1%)                    71.81  (8.1%)      1.2%  ( -10% -   13%)   0.554
   IntNRQ                      179.05  (5.1%)                   181.32  (5.4%)      1.3%  (  -8% -   12%)   0.459
   HighTermMonthSort          3413.39  (2.2%)                  3459.32  (3.0%)      1.3%  (  -3% -    6%)   0.111
   OrNotHighMed                384.09  (3.2%)                   389.69  (2.5%)      1.5%  (  -4% -    7%)   0.112
   OrMany                       19.16  (3.5%)                    19.44  (3.6%)      1.5%  (  -5% -    8%)   0.203
   CountTerm                  9388.28  (3.3%)                  9587.31  (4.2%)      2.1%  (  -5% -    9%)   0.082
   HighTermTitleSort           135.48  (1.9%)                   139.76  (3.3%)      3.2%  (  -1% -    8%)   0.000
   CountAndHighMed             160.02  (1.3%)                   168.58  (1.3%)      5.4%  (   2% -    7%)   0.000
   ```
   
   The `CountAndHighMed` and `HighTermTitleSort` speedups are consistently 
reproducible. I believe that the former is due to being able to compare 8 lanes 
at once instead of 4, and the latter is due to 

[PR] Move postings back to int[]. [lucene]

2024-10-31 Thread via GitHub


jpountz opened a new pull request, #13968:
URL: https://github.com/apache/lucene/pull/13968

   In Lucene 8.4, we updated postings to work on long[] arrays internally. This
allowed us to work around the lack of explicit vectorization support in the JVM
(auto-vectorization doesn't detect all the scenarios that we would like to
handle), for instance by summing up two integers in one operation.
   
   With explicit vectorization now available, it looks like we can get more
benefits from the ability to compare multiple integers in one operation than
from summing up two integers in one operation. Moving back to ints helps
compare 2x more integers at once vs. longs.
   
   The diff is large because of the codec dance: `Lucene912PostingsFormat` and 
`Lucene100Codec` moved to `lucene/backward-codecs` and a new 
`Lucene101PostingsFormat` is a copy of the previous `Lucene912PostingsFormat` 
with a move from long[] arrays to int[] arrays, and changes to the on-disk 
format for blocks of packed integers.
   
   Note that `DataInput#readGroupVInt` and `VectorUtilSupport#findNextGEQ` have 
been cleaned up to only support `int[]` and no longer `long[]`.
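
To make the trade-off above concrete, here is a minimal sketch of the "two integers in one long" idea next to the lane arithmetic behind "2x more integers at once". The class and helper names are hypothetical, the packing layout is an assumption for illustration rather than Lucene's actual encoding, and the 256-bit vector width is an assumed example.

```java
public class PackedLanesSketch {
  /** Packs two non-negative ints into one long: hi in the upper 32 bits, lo in the lower 32. */
  static long pack(int hi, int lo) {
    return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
  }

  static int hi(long packed) {
    return (int) (packed >>> 32);
  }

  static int lo(long packed) {
    return (int) packed;
  }

  /** Number of lanes of the given element size that fit in a vector of vectorBits bits. */
  static int lanes(int vectorBits, int elementBits) {
    return vectorBits / elementBits;
  }

  public static void main(String[] args) {
    // One 64-bit addition sums both pairs, as long as the low halves do not overflow 32 bits.
    long sum = pack(10, 20) + pack(1, 2);
    System.out.println(hi(sum) + " " + lo(sum));
    // With an assumed 256-bit vector, int lanes are twice as numerous as long lanes.
    System.out.println(lanes(256, Long.SIZE) + " vs " + lanes(256, Integer.SIZE));
  }
}
```

The packed addition is the kind of workaround that long[] arrays enabled, while the lane count is what a vectorized comparison over int[] arrays can use.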





Re: [I] Move vector search from IndexInput to RandomAccessInput [lucene]

2024-10-31 Thread via GitHub


dungba88 commented on issue #13938:
URL: https://github.com/apache/lucene/issues/13938#issuecomment-2451355944

   > I think this will be helpful since currently we cannot share these readers 
across threads -- they retain the state information about the current position. 
Not sure how much benefit that will be since they must still typically maintain 
some local temporary storage to retain the value that is read
   
   Gotcha, the current usage of seek + readFloats requires the reader to keep
the seek position. When we change to RandomAccessInput, we expect the
operation to have no side effects on the reader, so readers become shareable.
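
The difference between the two access styles can be illustrated with `java.nio.ByteBuffer`, used here purely as an analogy: this is not Lucene's `IndexInput` or `RandomAccessInput` API, and the class and method names below are hypothetical.

```java
import java.nio.ByteBuffer;

public class AbsoluteReadSketch {
  /** Stateful style: the read depends on, and changes, the buffer's current position. */
  static float readWithSeek(ByteBuffer buf, int offset) {
    buf.position(offset); // "seek": mutates state that other users of buf would observe
    return buf.getFloat(); // relative read: advances the position by 4 bytes
  }

  /** Absolute style: the offset is an argument and the buffer's position is left untouched. */
  static float readAbsolute(ByteBuffer buf, int offset) {
    return buf.getFloat(offset);
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(8);
    buf.putFloat(0, 1.5f);
    buf.putFloat(4, 2.5f);
    System.out.println(readAbsolute(buf, 4) + " position=" + buf.position());
    System.out.println(readWithSeek(buf, 4) + " position=" + buf.position());
  }
}
```

The absolute read leaves no trace in the buffer's state, which is the property that makes sharing plausible. Callers would still need their own scratch space for the values they read, as noted earlier in this thread.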





Re: [PR] Move postings back to int[] to take advantage of having more lanes per vector. [lucene]

2024-10-31 Thread via GitHub


jpountz merged PR #13968:
URL: https://github.com/apache/lucene/pull/13968

