date:20241025

Re: [PR] Ensure stability of clause order for DisjunctionMaxQuery toString [lucene]

2024-10-25 Thread via GitHub



jpountz commented on PR #13944:
URL: https://github.com/apache/lucene/pull/13944#issuecomment-2435611867

   Yes, exactly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [I] Performance difference between files getting opened with IOContext.RANDOM vs IOContext.READ during merges [lucene]

2024-10-25 Thread via GitHub



shatejas commented on issue #13920:
URL: https://github.com/apache/lucene/issues/13920#issuecomment-2435944343

   >  @shatejas I think all the required details are present, so are you going 
to raise a PR for this?
   
   Yeah, I am working on it. I have the changes and I am trying to figure out a 
good way to benchmark Lucene.



Re: [I] Check ahead of time if the `count` can be obtained [lucene]

2024-10-25 Thread via GitHub



LuXugang closed issue #13890: Check ahead of time if the `count` can be obtained
URL: https://github.com/apache/lucene/issues/13890



Re: [PR] Check ahead if we can get the count [lucene]

2024-10-25 Thread via GitHub



LuXugang merged PR #13899:
URL: https://github.com/apache/lucene/pull/13899



Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-10-25 Thread via GitHub



yugushihuang commented on PR #13572:
URL: https://github.com/apache/lucene/pull/13572#issuecomment-2435780436

   We have measured performance using 
[knnPerfTest.py](https://github.com/mikemccand/luceneutil/blob/main/src/python/knnPerfTest.py)
 in luceneutil with this PR 
[commit](https://github.com/goankur/lucene/commit/85d78116f87b679078a80cf606cd4bc7219ee793)
 as the candidate branch. 
   ### cmd
   ```
   '/usr/lib/jvm/java-21-amazon-corretto/bin/java', '-cp', [...], 
'--add-modules', 'jdk.incubator.vector', 
'-Djava.library.path=/home/[user_name]/lucene_candidate/lucene/native/build/libs/dotProduct/shared',
 'knn.KnnGraphTester', '-quantize', '-ndoc', '150', '-maxConn', '32', 
'-beamWidthIndex', '50', '-fanout', '6', '-quantizeBits', '7', 
'-numMergeWorker', '12', '-numMergeThread', '4', '-encoding', 'float32', 
'-topK', '10', '-dim', '768', '-docs', 'enwiki-20120502-lines-1k-mpnet.vec', 
'-reindex', '-search-and-stats', 'enwiki-20120502-mpnet.vec', '-forceMerge', 
'-quiet'
   ```
   ### Lucene_Baseline
   ```
   Graph level=3 size=46, connectedness=1.00
   Graph level=2 size=1405, connectedness=1.00
   Graph level=1 size=46174, connectedness=1.00
   Graph level=0 size=150, connectedness=1.00
   
   Results:
   recall  latency (ms)  nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
    0.332         0.333   150    10       6       32         50     7 bits   432.69         271.51             1          5558.90
   ```
   ### Lucene_Candidate
   ```
   Graph level=3 size=46, connectedness=1.00
   Graph level=2 size=1410, connectedness=1.00
   Graph level=1 size=46205, connectedness=1.00
   Graph level=0 size=150, connectedness=1.00
   
   Results:
   recall  latency (ms)  nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
    0.337         0.260   150    10       6       32         50     7 bits   441.25         293.41             1          5558.91
   ```
   
   The latency has dropped from 0.333ms to 0.26ms.



Re: [PR] Check ahead if we can get the count [lucene]

2024-10-25 Thread via GitHub



jpountz commented on code in PR #13899:
URL: https://github.com/apache/lucene/pull/13899#discussion_r1815247300


##
lucene/core/src/java/org/apache/lucene/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##
@@ -186,10 +186,44 @@ public boolean isCacheable(LeafReaderContext ctx) {
   @Override
   public int count(LeafReaderContext context) throws IOException {
     if (context.reader().hasDeletions() == false) {
-      IteratorAndCount itAndCount = getDocIdSetIteratorOrNull(context);
+      if (lowerValue > upperValue) {
+        return 0;
+      }

Review Comment:
   This could be moved before the check of whether the segment has deletes?
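
A minimal sketch of the reordering suggested above, assuming a simplified model in which a segment's values are a sorted `long[]`. The class `RangeCountSketch`, its helper methods and the `-1` "cannot count cheaply" return value are hypothetical and are not taken from the PR. The point is only that the empty-range check does not depend on deletions, so it can run first.

```java
/** Hypothetical stand-in for a per-segment range count; not Lucene's actual class. */
final class RangeCountSketch {
  private final long lowerValue;
  private final long upperValue;

  RangeCountSketch(long lowerValue, long upperValue) {
    this.lowerValue = lowerValue;
    this.upperValue = upperValue;
  }

  /** Returns the number of values in [lowerValue, upperValue], or -1 if it is not cheap to compute. */
  int count(long[] sortedValues, boolean hasDeletions) {
    // An empty range matches nothing whether or not the segment has deletes,
    // so this check can run before the deletion check.
    if (lowerValue > upperValue) {
      return 0;
    }
    // With deletions, an exact count would need to consult live docs; give up here.
    if (hasDeletions) {
      return -1;
    }
    return firstIndexGreaterThan(sortedValues, upperValue)
        - firstIndexAtLeast(sortedValues, lowerValue);
  }

  /** Binary search for the first index whose value is >= key. */
  private static int firstIndexAtLeast(long[] a, long key) {
    int lo = 0, hi = a.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (a[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
  }

  /** Binary search for the first index whose value is > key. */
  private static int firstIndexGreaterThan(long[] a, long key) {
    int lo = 0, hi = a.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (a[mid] <= key) lo = mid + 1; else hi = mid;
    }
    return lo;
  }
}
```

With this ordering, an inverted range returns 0 even on a segment that has deletions, instead of falling through to the slower path.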




Re: [PR] Make some BooleanQuery methods public and a new `#add(Collection)` method for BQ builder [lucene]

2024-10-25 Thread via GitHub



mikemccand commented on code in PR #13950:
URL: https://github.com/apache/lucene/pull/13950#discussion_r1814888763


##
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:
##
@@ -87,6 +87,28 @@ public Builder add(BooleanClause clause) {
       return this;
     }
 
+    /**
+     * Add a collection of BooleanClause's to this {@link Builder}. Note that the order in which

Review Comment:
   Remove the `'` -- just `BooleanClauses`.




Re: [PR] Allow reading binary doc values as a RandomAccessInput [lucene]

2024-10-25 Thread via GitHub



iverase commented on code in PR #13948:
URL: https://github.com/apache/lucene/pull/13948#discussion_r1816322441


##
lucene/core/src/java/org/apache/lucene/store/RandomAccessInputDataInput.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+
+/**
+ * DataInput backed by a {@link RandomAccessInput}. WARNING: This class omits all low-level
+ * checks.
+ *
+ * @lucene.experimental
+ */
+public final class RandomAccessInputDataInput extends DataInput {
+
+  private RandomAccessInput input;
+
+  private long pos;
+
+  public RandomAccessInputDataInput() {}
+
+  // NOTE: sets pos to 0, which is not right if you had
+  // called reset w/ non-zero offset!!
+  public void rewind() {
+    pos = 0;
+  }
+
+  public long getPosition() {
+    return pos;
+  }
+
+  public void setPosition(long pos) {
+    this.pos = pos;
+  }
+
+  public void reset(RandomAccessInput input) {
+    this.input = input;
+    pos = 0;
+  }

Review Comment:
   done




Re: [PR] Allow reading binary doc values as a RandomAccessInput [lucene]

2024-10-25 Thread via GitHub



iverase commented on code in PR #13948:
URL: https://github.com/apache/lucene/pull/13948#discussion_r1816321990


##
lucene/core/src/java/org/apache/lucene/index/BinaryDocValues.java:
##
@@ -33,4 +34,15 @@ protected BinaryDocValues() {}
    * @return binary value
    */
   public abstract BytesRef binaryValue() throws IOException;
+
+  /**
+   * Returns the binary value as a {@link RandomAccessInput} for the current document ID. The bytes
+   * start at position 0 up to {@link RandomAccessInput#length()}. It is illegal to call this method
+   * after {@link #advanceExact(int)} returned {@code false}.

Review Comment:
   done




Re: [PR] Allow reading binary doc values as a RandomAccessInput [lucene]

2024-10-25 Thread via GitHub



iverase commented on code in PR #13948:
URL: https://github.com/apache/lucene/pull/13948#discussion_r1816323891


##
lucene/core/src/java/org/apache/lucene/store/RandomAccessInputDataInput.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+
+/**
+ * DataInput backed by a {@link RandomAccessInput}. WARNING: This class omits all low-level
+ * checks.
+ *
+ * @lucene.experimental
+ */
+public final class RandomAccessInputDataInput extends DataInput {
+
+  private RandomAccessInput input;
+
+  private long pos;
+
+  public RandomAccessInputDataInput() {}
+
+  // NOTE: sets pos to 0, which is not right if you had
+  // called reset w/ non-zero offset!!
+  public void rewind() {
+    pos = 0;
+  }
+
+  public long getPosition() {
+    return pos;
+  }
+
+  public void setPosition(long pos) {
+    this.pos = pos;
+  }
+
+  public void reset(RandomAccessInput input) {
+    this.input = input;
+    pos = 0;
+  }
+
+  public long length() {
+    return input.length();
+  }
+
+  @Override
+  public void skipBytes(long count) {
+    pos += count;
+  }
+
+  @Override
+  public short readShort() throws IOException {
+    try {
+      return input.readShort(pos);
+    } finally {
+      pos += Short.BYTES;

Review Comment:
   This class is a copy / paste from ByteArrayDataInput so it makes me wonder 
if that's something we need to change in that implementation too.




Re: [PR] Ensure stability of clause order for DisjunctionMaxQuery toString [lucene]

2024-10-25 Thread via GitHub



ljak commented on PR #13944:
URL: https://github.com/apache/lucene/pull/13944#issuecomment-2435609721

   Ha, I see. Could we say that the new `List<Query> orderedQueries` would have 
the same behavior as `Query[] disjuncts` had before 
https://github.com/apache/lucene/pull/110/files ? If yes, I presume it would 
work. 



Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-25 Thread via GitHub



msokolov commented on code in PR #13872:
URL: https://github.com/apache/lucene/pull/13872#discussion_r1816770842


##
lucene/core/src/java21/org/apache/lucene/internal/vectorization/Lucene99MemorySegmentByteVectorScorerSupplier.java:
##
@@ -112,20 +96,20 @@ static final class CosineSupplier extends Lucene99MemorySegmentByteVectorScorerS
     @Override
     public RandomVectorScorer scorer(int ord) {
       checkOrdinal(ord);
+      MemorySegmentAccessInput slice = input.clone();
+      byte[] scratch1 = new byte[vectorByteSize];
+      byte[] scratch2 = new byte[vectorByteSize];

Review Comment:
   I'm not sure I understand your idea, Chris, but if you want to have a go at 
it, by all means please do, and maybe I'll understand then :)




Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-25 Thread via GitHub



benwtrent commented on PR #13872:
URL: https://github.com/apache/lucene/pull/13872#issuecomment-2437820707

   I think a "merging scorer" would be good. The only place the "scorer 
supplier" is used is during graph building. 
   
   My initial concern with a "mutable scorer" is that it would also make the 
single scorer mutable, which seems weird to me. But I am happy to revisit 
this, especially since it's blocking a nice refactor.
   
   Given that all this random scorer stuff is internal API, we can do 
whatever is best with what we have.



Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-25 Thread via GitHub



ChrisHegarty commented on code in PR #13872:
URL: https://github.com/apache/lucene/pull/13872#discussion_r1816687000


##
lucene/core/src/java21/org/apache/lucene/internal/vectorization/Lucene99MemorySegmentByteVectorScorerSupplier.java:
##
@@ -112,20 +96,20 @@ static final class CosineSupplier extends Lucene99MemorySegmentByteVectorScorerS
     @Override
     public RandomVectorScorer scorer(int ord) {
       checkOrdinal(ord);
+      MemorySegmentAccessInput slice = input.clone();
+      byte[] scratch1 = new byte[vectorByteSize];
+      byte[] scratch2 = new byte[vectorByteSize];

Review Comment:
   I have another idea. Maybe we just delegate the null cases to the other 
on-heap scorer. That might be simpler. We do something similar in the native 
scorer we have in Elasticsearch. I can see how this looks in the branch, if you 
like?




Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-25 Thread via GitHub



msokolov commented on PR #13872:
URL: https://github.com/apache/lucene/pull/13872#issuecomment-2437836935

   Yes, OK, I now see quite a bit of this is a "preexisting condition" and maybe 
not exacerbated by this change. I think we are still creating more scratch 
arrays than we did before, though. Previously we would `copy()` the 
VectorValues in a caller and allocate a new scratch array there. Now that we 
have pushed the "create new scratch array" step down into Scorer creation, 
which happens many more times than we would previously have copied the 
VectorValues, we are creating and destroying many more of these scratch 
arrays. Maybe this is acceptable and we can iterate in a further cleanup? Let me 
try a few more benchmarking runs and be a little clearer about the impact on 
query and indexing times. I'd also like to report allocations, but I'm not sure 
how to do that with luceneutil.



Re: [PR] Allow reading binary doc values as a RandomAccessInput [lucene]

2024-10-25 Thread via GitHub



jpountz commented on PR #13948:
URL: https://github.com/apache/lucene/pull/13948#issuecomment-2437732473

   In my experience, binary doc values are more often used to encode structured 
data, such as maps that help build scoring signals, geo shapes, etc. than 
actual binary content, so this change makes sense to me. I'm interested in 
having more opinions though.
   
   Would be nice to extend AssertingBinaryDocValues to make sure that all reads 
in the input are within bounds.
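
A minimal sketch of the kind of bounds-checking wrapper suggested above. It assumes a hypothetical `RandomAccessBytes` interface that mirrors only `length()` and a positional `readByte`. Neither the interface nor `BoundsCheckedBytes` is part of Lucene, and the real asserting doc values class may be structured quite differently.

```java
/** Hypothetical minimal random-access reader, used only for this sketch. */
interface RandomAccessBytes {
  long length();

  byte readByte(long pos);
}

/** Wrapper that validates every read position before delegating. */
final class BoundsCheckedBytes implements RandomAccessBytes {
  private final RandomAccessBytes in;

  BoundsCheckedBytes(RandomAccessBytes in) {
    this.in = in;
  }

  /** Convenience factory that exposes a byte array through the sketch interface. */
  static RandomAccessBytes ofArray(byte[] bytes) {
    return new RandomAccessBytes() {
      @Override
      public long length() {
        return bytes.length;
      }

      @Override
      public byte readByte(long pos) {
        return bytes[(int) pos];
      }
    };
  }

  @Override
  public long length() {
    return in.length();
  }

  @Override
  public byte readByte(long pos) {
    // Fail loudly on any read outside [0, length()).
    if (pos < 0 || pos >= in.length()) {
      throw new IndexOutOfBoundsException("pos=" + pos + " length=" + in.length());
    }
    return in.readByte(pos);
  }
}
```

A test framework could wrap the input returned for each document in such a checker so that out-of-bounds reads surface as failures rather than silently reading neighboring data.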



Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-25 Thread via GitHub



msokolov commented on PR #13872:
URL: https://github.com/apache/lucene/pull/13872#issuecomment-2437752945

   > Can you clarify which allocation is the problematic one, and where it's 
done on the indexing path?
   
   See Ben's comments from ~2 weeks ago where he calls out the problem of 
overallocation. During indexing we call HnswGraphBuilder.diversityCheck() 
multiple times for each document (graph node) we insert, and in each of those 
calls we create scorers multiple times -- this is an n^2 algorithm (with n ~ 
number of neighbors). I'm proposing that instead of calling scorer() and 
creating a new scorer each time (which may in turn create a MemorySegment or a 
scratch array of some sort), that we instead have a mutable Scorer that can 
accept a new target vector.  
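
A minimal sketch of the mutable-scorer idea described above, assuming on-heap `float[]` vectors and dot-product scoring. The class `MutableDotProductScorer` and its `setTarget` method are hypothetical names for illustration and are not an existing Lucene API. The sketch only shows that retargeting one scorer avoids allocating a new scorer (and any scratch storage) per comparison.

```java
/** Hypothetical reusable scorer; one instance serves many target vectors. */
final class MutableDotProductScorer {
  private final float[][] vectors;
  private float[] target;

  MutableDotProductScorer(float[][] vectors) {
    this.vectors = vectors;
  }

  /** Points the scorer at a new target vector without allocating anything. */
  void setTarget(int targetOrd) {
    this.target = vectors[targetOrd];
  }

  /** Scores the vector at {@code ord} against the current target. */
  float score(int ord) {
    float[] other = vectors[ord];
    float sum = 0f;
    for (int i = 0; i < target.length; i++) {
      sum += target[i] * other[i];
    }
    return sum;
  }
}
```

In a loop such as a diversity check, the caller would create one scorer up front and call `setTarget` for each candidate neighbor, rather than asking a supplier for a fresh scorer each time.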



Re: [PR] Ensure stability of clause order for DisjunctionMaxQuery toString [lucene]

2024-10-25 Thread via GitHub



ljak commented on PR #13944:
URL: https://github.com/apache/lucene/pull/13944#issuecomment-2438170606

   Done. Thanks for reviewing!



[PR] Speed up advancing within a block, take 2. [lucene]

2024-10-25 Thread via GitHub



jpountz opened a new pull request, #13958:
URL: https://github.com/apache/lucene/pull/13958

   PR #13692 tried to speed up advancing by using branchless binary search, but 
while this yielded a speedup on my machine, it yielded a slowdown on nightly 
benchmarks.
   
   This PR tries a different approach using vectorization. Experimentation 
suggests that it slightly slows down queries whose advancing often goes to the 
very next doc ID, such as term queries and `OrHighNotXXX` tasks. But it speeds 
up queries that advance to the next few doc IDs, such as `AndHighHigh`. I think 
that this is a good trade-off since it slows down some already fast queries in 
exchange for a speedup on some more expensive queries.
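
As a rough scalar illustration of why a comparison-counting search is amenable to vectorization (this is an assumption about the general technique, not code from this PR): in a sorted block, the index of the first doc ID that is greater than or equal to the target equals the number of entries smaller than the target. That count can be accumulated without data-dependent branches. The class and method names below are hypothetical, and doc IDs and targets are assumed to be non-negative.

```java
/** Hypothetical sketch of a branch-free search within a sorted block of doc IDs. */
final class BlockAdvanceSketch {

  /**
   * Returns the index of the first entry >= target, or the block length if there is none.
   * Assumes non-negative, sorted doc IDs and a non-negative target.
   */
  static int firstIndexAtLeast(int[] sortedBlock, int target) {
    int count = 0;
    for (int doc : sortedBlock) {
      // (doc - target) is negative exactly when doc < target; with non-negative
      // inputs the subtraction cannot overflow, so the sign bit is 1 in that case.
      count += (doc - target) >>> 31;
    }
    return count;
  }
}
```

Every entry is inspected regardless of where the target lies, which matches the trade-off described above: a fixed cost that is slightly worse when the answer is the very next doc ID, but cheaper than a branchy scan when the answer is a few entries away.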
   
   Here is a `luceneutil` run on `wikibigall` with `-searchConcurrency 0`:
   
   ```
   Task                  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff                   p-value
   OrHighNotHigh               302.78  (2.4%)                   283.75  (2.9%)     -6.3%  ( -11% -   -1%)    0.000
   OrHighNotMed                384.69  (3.0%)                   363.33  (2.8%)     -5.6%  ( -10% -    0%)    0.000
   MedTerm                     564.86  (2.2%)                   537.04  (3.5%)     -4.9%  ( -10% -    0%)    0.000
   LowTerm                    1014.02  (2.2%)                   967.37  (3.6%)     -4.6%  ( -10% -    1%)    0.000
   OrHighNotLow                446.38  (3.4%)                   427.10  (3.3%)     -4.3%  ( -10% -    2%)    0.000
   HighTerm                    485.41  (1.9%)                   464.49  (3.2%)     -4.3%  (  -9% -    0%)    0.000
   OrNotHighHigh               229.78  (2.4%)                   221.51  (3.1%)     -3.6%  (  -8% -    1%)    0.000
   OrNotHighMed                396.63  (2.7%)                   382.41  (3.1%)     -3.6%  (  -9% -    2%)    0.000
   Prefix3                     145.65  (3.6%)                   142.39  (3.7%)     -2.2%  (  -9% -    5%)    0.051
   IntNRQ                      158.04  (4.7%)                   154.77  (5.6%)     -2.1%  ( -11% -    8%)    0.205
   CountTerm                  8320.96  (3.2%)                  8198.56  (4.7%)     -1.5%  (  -9% -    6%)    0.246
   PKLookup                    273.35  (3.6%)                   269.71  (5.2%)     -1.3%  (  -9% -    7%)    0.345
   Wildcard                     83.30  (3.4%)                    82.28  (3.1%)     -1.2%  (  -7% -    5%)    0.234
   HighTermMonthSort          3235.98  (3.1%)                  3198.04  (2.9%)     -1.2%  (  -6% -    4%)    0.215
   HighTermTitleSort           148.94  (2.5%)                   148.38  (2.6%)     -0.4%  (  -5% -    4%)    0.638
   CountOrHighMed              104.51  (2.0%)                   104.22  (1.7%)     -0.3%  (  -3% -    3%)    0.640
   HighTermTitleBDVSort         14.67  (5.3%)                    14.64  (5.9%)     -0.2%  ( -10% -   11%)    0.899
   AndStopWords                 30.68  (3.0%)                    30.66  (2.7%)     -0.1%  (  -5% -    5%)    0.941
   CountOrHighHigh              50.17  (2.0%)                    50.19  (1.9%)      0.0%  (  -3% -    3%)    0.947
   OrHighRare                  273.82  (4.5%)                   273.96  (3.8%)      0.0%  (  -7% -    8%)    0.971
   TermDTSort                  353.37  (6.4%)                   354.23  (6.7%)      0.2%  ( -12% -   14%)    0.907
   Fuzzy1                       77.85  (2.6%)                    78.12  (2.0%)      0.3%  (  -4% -    4%)    0.633
   Fuzzy2                       73.23  (2.5%)                    73.50  (1.9%)      0.4%  (  -3% -    4%)    0.594
   HighTermDayOfYearSort       836.62  (3.1%)                   841.07  (4.0%)      0.5%  (  -6% -    7%)    0.639
   And2Terms2StopWords         154.49  (1.8%)                   155.41  (2.1%)      0.6%  (  -3% -    4%)    0.340
   OrHighLow                   771.90  (2.0%)                   778.20  (2.2%)      0.8%  (  -3% -    5%)    0.217
   And3Terms                   167.63  (2.3%)                   169.23  (2.2%)      1.0%  (  -3% -    5%)    0.176
   OrStopWords                  33.99  (4.6%)                    34.39  (4.1%)      1.2%  (  -7% -   10%)    0.388
   CountAndHighMed             148.01  (2.4%)                   149.91  (1.0%)      1.3%  (  -2% -    4%)    0.025
   Or2Terms2StopWords          156.93  (2.8%)                   159.21  (3.0%)      1.5%  (  -4% -    7%)    0.117
   AndHighHigh                  67.06  (1.3%)                    68.07  (1.6%)      1.5%  (  -1% -    4%)    0.001
   OrMany                       18.67  (2.9%)                    18.96  (2.9%)      1.5%  (  -4% -    7%)    0.089
   AndHighMed                  185.02  (1.6%)                   189.06  (1.3%)      2.2%  (   0% -    5%)    0.000
   AndHighLow                  948.34  (2.6%)                   970.47  (2.6%)      2.3%  (  -2% -    7%)    0.004
   OrHighHigh                   68.42  (1.4%)                    70.08  (1.3%)      2.4%  (   0% -    5%)    0.000
   Or3Terms

Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]

2024-10-25 Thread via GitHub



derreisende77 commented on issue #13959:
URL: https://github.com/apache/lucene/issues/13959#issuecomment-2438658215

   I made some tests with Ubuntu 24.10:
   JDK 23: 9.9 seconds
   JDK 22: 1.4 seconds



Re: [PR] Make some BooleanQuery methods public and a new `#add(Collection)` method for BQ builder [lucene]

2024-10-25 Thread via GitHub



jpountz commented on code in PR #13950:
URL: https://github.com/apache/lucene/pull/13950#discussion_r1815173658


##
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:
##
@@ -136,20 +158,20 @@ public List<BooleanClause> clauses() {
   }
 
   /** Return the collection of queries for the given {@link Occur}. */
-  Collection<Query> getClauses(Occur occur) {
+  public Collection<Query> getClauses(Occur occur) {
     return clauseSets.get(occur);
   }
 
   /**
    * Whether this query is a pure disjunction, ie. it only has SHOULD clauses and it is enough for a
    * single clause to match for this boolean query to match.
    */
-  boolean isPureDisjunction() {
+  public boolean isPureDisjunction() {
     return clauses.size() == getClauses(Occur.SHOULD).size() && minimumNumberShouldMatch <= 1;
   }
 
   /** Whether this query is a two clause disjunction with two term query clauses. */
-  boolean isTwoClausePureDisjunctionWithTerms() {
+  public boolean isTwoClausePureDisjunctionWithTerms() {

Review Comment:
   I can understand why someone would want to make `getClauses` public, but I 
wouldn't make the two above methods public, these are just implementation 
details of some rewrite rules?




Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-25 Thread via GitHub



jpountz commented on PR #13872:
URL: https://github.com/apache/lucene/pull/13872#issuecomment-2437740226

   Can you clarify which allocation is the problematic one, and where it's done 
on the indexing path?



Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-25 Thread via GitHub



ChrisHegarty commented on code in PR #13872:
URL: https://github.com/apache/lucene/pull/13872#discussion_r1816669062


##
lucene/core/src/java21/org/apache/lucene/internal/vectorization/Lucene99MemorySegmentByteVectorScorerSupplier.java:
##
@@ -112,20 +96,20 @@ static final class CosineSupplier extends Lucene99MemorySegmentByteVectorScorerS
     @Override
     public RandomVectorScorer scorer(int ord) {
       checkOrdinal(ord);
+      MemorySegmentAccessInput slice = input.clone();
+      byte[] scratch1 = new byte[vectorByteSize];
+      byte[] scratch2 = new byte[vectorByteSize];

Review Comment:
   We don't know during construction whether or not access to the vector data 
in the backing segment will *always* be available. The main reason is that a 
vector may span multiple memory segments (one MSIndexInput can be made up of 
several memory segments).
   
   This change is not right. The scratch buffers are created per supplier, 
since we know that this is safe with the threading model. Creating scratch 
buffers per scorer will be too expensive.
   




Re: [PR] Make DirectMonotonicReader.Meta more compact [lucene]

2024-10-25 Thread via GitHub



jpountz commented on PR #13864:
URL: https://github.com/apache/lucene/pull/13864#issuecomment-2437765755

   Sorry, I don't feel good about relying on `paddingBitsNeeded` on the read 
path. I suggest we close this PR; IMO the better fix would be to change the way 
we store terms dictionaries to rely less on `DirectMonotonicReader`.



Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-25 Thread via GitHub



ChrisHegarty commented on PR #13872:
URL: https://github.com/apache/lucene/pull/13872#issuecomment-2437761782

   > that we instead have a mutable Scorer that can accept a new target vector.
   
   Yes, that is something that I've noodled on for a while now too: a scorer 
that accepts two ords and returns the score. This will save gigabytes of 
garbage, which can be seen in the blunder output of the nightly luceneutil 
runs. Though, you do not have to do it all in this PR.
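
As an illustration of the two-ordinal idea, the following is a minimal sketch under stated assumptions. `PairwiseByteScorer` is a hypothetical name, the vectors live in a plain `byte[][]`, and the score is a simple dot product; none of this is taken from the PR.

```java
/** Hypothetical scorer that compares two stored vectors by ordinal, with no per-target object. */
final class PairwiseByteScorer {
  private final byte[][] vectors; // stands in for the indexed vector data

  PairwiseByteScorer(byte[][] vectors) {
    this.vectors = vectors;
  }

  /** Returns the dot product of the vectors stored at the two ordinals. */
  int score(int firstOrd, int secondOrd) {
    byte[] a = vectors[firstOrd];
    byte[] b = vectors[secondOrd];
    int dot = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
    }
    return dot;
  }
}
```

Because a single instance serves every pair of ordinals, the caller does not need to create a new scorer object for each target.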



Re: [PR] Make DirectMonotonicReader.Meta more compact [lucene]

2024-10-25 Thread via GitHub



original-brownbear closed pull request #13864: Make DirectMonotonicReader.Meta 
more compact
URL: https://github.com/apache/lucene/pull/13864



Re: [PR] Make DirectMonotonicReader.Meta more compact [lucene]

2024-10-25 Thread via GitHub



original-brownbear commented on PR #13864:
URL: https://github.com/apache/lucene/pull/13864#issuecomment-2437848161

   Yeah, that's cool, sorry, I forgot about this one. For starters we can just 
store the offsets in a more compact form; that'll help already. I'll open a PR 
once I find a little time :) 



Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-25 Thread via GitHub



msokolov commented on PR #13872:
URL: https://github.com/apache/lucene/pull/13872#issuecomment-2437853233

   Maybe we could add a `RandomVectorScorer.setTarget(int node)` method that 
would only be implemented by the Scorers returned from ScorerSuppliers?
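
A minimal sketch of what such a retargetable scorer could look like follows. `RetargetableScorer` is a hypothetical class, the vectors are a plain `float[][]`, and the score is a dot product; the real `RandomVectorScorer` API is not reproduced here.

```java
/** Hypothetical mutable scorer: one instance is reused for many target nodes. */
final class RetargetableScorer {
  private final float[][] vectors; // stands in for the indexed vector data
  private int target = -1;         // -1 means no target has been set yet

  RetargetableScorer(float[][] vectors) {
    this.vectors = vectors;
  }

  /** Points this scorer at a new target node without allocating a new scorer. */
  void setTarget(int node) {
    if (node < 0 || node >= vectors.length) {
      throw new IllegalArgumentException("node out of range: " + node);
    }
    target = node;
  }

  /** Scores {@code node} against the current target using a dot product. */
  float score(int node) {
    if (target < 0) {
      throw new IllegalStateException("setTarget must be called first");
    }
    float dot = 0f;
    for (int i = 0; i < vectors[target].length; i++) {
      dot += vectors[target][i] * vectors[node][i];
    }
    return dot;
  }
}
```

A graph builder would call `setTarget(node)` once per node and then `score(neighbor)` for each candidate, so a single scorer lives for the whole build.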



Re: [PR] Remove some useless code in TopScoreDocCollector. [lucene]

2024-10-25 Thread via GitHub



jpountz merged PR #13955:
URL: https://github.com/apache/lucene/pull/13955



Re: [PR] Make some BooleanQuery methods public and a new `#add(Collection)` method for BQ builder [lucene]

2024-10-25 Thread via GitHub



jpountz merged PR #13950:
URL: https://github.com/apache/lucene/pull/13950



Re: [PR] Add MIGRATE entry about the fact that readVLong() may now read negative values, and up to 10 bytes. [lucene]

2024-10-25 Thread via GitHub



jpountz merged PR #13956:
URL: https://github.com/apache/lucene/pull/13956



Re: [PR] Ensure stability of clause order for DisjunctionMaxQuery toString [lucene]

2024-10-25 Thread via GitHub



jpountz commented on PR #13944:
URL: https://github.com/apache/lucene/pull/13944#issuecomment-2437549776

   Can you add an entry to `lucene/CHANGES.txt` under version 10.1.0? Then I'll 
merge.



Re: [PR] Ensure doc order for TestCommonTermsQuery#testMinShouldMatch [lucene]

2024-10-25 Thread via GitHub



benwtrent merged PR #13953:
URL: https://github.com/apache/lucene/pull/13953



Re: [I] TestCommonTermsQuery.testMinShouldMatch test failure [lucene]

2024-10-25 Thread via GitHub



benwtrent closed issue #13946: TestCommonTermsQuery.testMinShouldMatch test 
failure
URL: https://github.com/apache/lucene/issues/13946



[I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]

2024-10-25 Thread via GitHub



derreisende77 opened a new issue, #13959:
URL: https://github.com/apache/lucene/issues/13959

   ### Description
   
   I have been using Lucene happily in my app for several years, with JDKs up 
to 22.
   My use case searches through film data, and Lucene can return fairly large 
result sets to my app, which was never a problem until now. 
   I upgraded my app to JDK 23.0.1 on my MacBook Air (macOS 15.01, 16GB RAM).
   
   ```
   openjdk version "23.0.1" 2024-10-15
   OpenJDK Runtime Environment (build 23.0.1+13)
   OpenJDK 64-Bit Server VM (build 23.0.1+13, mixed mode, sharing)
   ```
   
   and started to notice **horrible** Lucene performance.
   I query my results with the following code snippet:
   
   ```java
   var reader = DirectoryReader.open(list.getLuceneDirectory());
   final var searcher = new IndexSearcher(reader);
   final var docs = searcher.search(finalQuery, list.size());
   final var hit_length = docs.scoreDocs.length;
   
   var storedFields = searcher.storedFields();
   // the for loop takes ages with JDK23...
   for (final var hit : docs.scoreDocs) {
     var docId = hit.doc;
     //storedFields.prefetch(docId); //<-- doesn't change anything
     var d = storedFields.document(docId, INTEREST_SET); //<-- this takes ages
     //filmNrSet.add(Integer.parseInt(d.get(LuceneIndexKeys.ID)));
   }
   ```
   
   In the use case I explored, Lucene always returned the expected 558333 hits 
out of 802k documents.
   99% of the app runs take **5.7 seconds** to get the result. However, when I 
am lucky, *1% of the app runs* get the same result back in **756 
milliseconds**.
   If Lucene starts out fast, it stays fast; if it starts out slow, it remains 
slow. I have no idea what triggers this.
   
   I moved from `NRTCachingDirectory` to `MMapDirectory` but the performance 
remained bad. I tried some other suggestions from the internet, with the same 
result.
   
   I switched back to JDK 22.
   
   ```
   openjdk version "22.0.2" 2024-07-16
   OpenJDK Runtime Environment (build 22.0.2+11)
   OpenJDK 64-Bit Server VM (build 22.0.2+11, mixed mode, sharing)
   ```
   
   The same source that performed horribly with JDK 23 was **consistently 
fast** with JDK 22:
   
   ```
   Search took: 763.8 ms
   ```
   
   I am using the following flags for the JVM:
   ```
   -ea
   -XX:+UseShenandoahGC
   -XX:ShenandoahGCHeuristics=compact
   -XX:+UseStringDeduplication
   -XX:MaxRAMPercentage=50.0
   --enable-native-access=ALL-UNNAMED
   --add-modules jdk.incubator.vector
   ```
   
   I removed `--enable-native-access=ALL-UNNAMED` and `--add-modules 
jdk.incubator.vector` for testing purposes but the performance remained bad 
with JDK23.
   
   I ran the same tests with JDK 23 on my Windows 11 AMD Ryzen 4900H 16GB RAM 
laptop. There I get the results back in **13.63 seconds** with JDK 23. JDK 22 
does the same consistently in **2.1 seconds**.
   
   Switching between Lucene `9.11.1` and `10.0.0` made no difference: always 
terrible performance with JDK 23 on both macOS and Windows, and consistent 
performance with JDK 22.
   
   ### Version and environment details
   
   _No response_



Re: [PR] Ensure stability of clause order for DisjunctionMaxQuery toString [lucene]

2024-10-25 Thread via GitHub



jpountz merged PR #13944:
URL: https://github.com/apache/lucene/pull/13944



Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-10-25 Thread via GitHub



goankur commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1817245059


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapQuantizedByteVectorValues.java:
##
@@ -146,6 +146,7 @@ public float getScoreCorrectionConstant(int targetOrd) throws IOException {
 }
 slice.seek(((long) targetOrd * byteSize) + numBytes);
 slice.readFloats(scoreCorrectionConstant, 0, 1);
+lastOrd = targetOrd;

Review Comment:
   Got it! I will remove this in the next revision. I was just trying to 
optimize for the case when `getScoreCorrectionConstant(int targetOrd)` gets 
invoked with the same `targetOrd` multiple times in succession.
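
For readers unfamiliar with the pattern being discussed, the sketch below shows a "remember the last ordinal" cache in isolation. `CachedCorrectionReader`, its `float[]` backing store, and the `reads` counter are hypothetical and only illustrate the intent described in the comment; as the comment says, the change itself is being removed from the PR.

```java
/** Hypothetical reader that remembers the most recently requested ordinal. */
final class CachedCorrectionReader {
  private final float[] stored; // stands in for values read from the index slice
  private int lastOrd = -1;     // -1 means nothing has been cached yet
  private float lastValue;
  int reads;                    // number of simulated slice reads, for illustration only

  CachedCorrectionReader(float[] stored) {
    this.stored = stored;
  }

  /** Returns the value for {@code targetOrd}, skipping the read on a repeated ordinal. */
  float getScoreCorrectionConstant(int targetOrd) {
    if (targetOrd == lastOrd) {
      return lastValue; // same ordinal as the previous call: reuse the cached value
    }
    lastValue = stored[targetOrd]; // simulated seek and read
    reads++;
    lastOrd = targetOrd;
    return lastValue;
  }
}
```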




Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-10-25 Thread via GitHub



benwtrent commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2438671673

   Hey @vigyasharma there is a lot of good work here. 
   
   I am going to shift my focus and see about how I can help here more fully. 
What are the next steps?
   
   I am guessing the next step is handling all the merging from main; I can 
take care of that sometime next week. 
   
   Just wondering where I can help.



Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]

2024-10-25 Thread via GitHub



benwtrent commented on issue #13959:
URL: https://github.com/apache/lucene/issues/13959#issuecomment-2438673337

   @derreisende77 do you have profiling of the two different runs? Maybe 
through async-profiler? It would be interesting to see where the time is being 
spent.



Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-10-25 Thread via GitHub



goankur commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1817385236


##
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/VectorUtilBenchmark.java:
##
@@ -84,6 +91,76 @@ public void init() {
   floatsA[i] = random.nextFloat();
   floatsB[i] = random.nextFloat();
 }
+// Java 21+ specific initialization
+final int runtimeVersion = Runtime.version().feature();
+if (runtimeVersion >= 21) {
+  // Reflection based code to eliminate the use of Preview classes in JMH benchmarks
+  try {
+final Class vectorUtilSupportClass = VectorUtil.getVectorUtilSupportClass();
+final var className = "org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport";
+if (vectorUtilSupportClass.getName().equals(className) == false) {
+  nativeBytesA = null;
+  nativeBytesB = null;
+} else {
+  MethodHandles.Lookup lookup = MethodHandles.lookup();
+  final var MemorySegment = "java.lang.foreign.MemorySegment";
+  final var methodType =
+  MethodType.methodType(lookup.findClass(MemorySegment), byte[].class);
+  MethodHandle nativeMemorySegment =
+  lookup.findStatic(vectorUtilSupportClass, "nativeMemorySegment", methodType);
+  byte[] a = new byte[size];
Review Comment:
   Yes, this is the setup code for the benchmark. We run setup once every 
`iteration` for a total of `15` iterations across `3` forks (5 iterations per 
fork) for each `size` being tested. Each fork is preceded by 3 warm-up 
iterations.
   So before **each** iteration we generate random numbers in the range 
[0, 127] in two on-heap `byte[]`, allocate off-heap memory segments, and 
populate them with the contents of the `byte[]`. These off-heap memory segments 
are provided to the `VectorUtil.NATIVE_DOT_PRODUCT` method handle. 
   
   (Code snippet below for reference)
   
   ```
   @Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
   int size;
   
   @Setup(Level.Iteration)
   public void init() {
   ...
   }
   ```
   
   > I wonder if we would see something different if we generated a large 
number of vectors and randomized which ones we compare on each run. Also would 
performance vary if the vectors are sequential in their buffer (ie vector 0 
starts at 0, vector 1 starts at size...)
   
   I guess the question you are hinting at is how the performance varies 
when the two candidate vectors are further apart in memory (L1 cache / L2 cache 
/ L3 cache / main memory). Do the gains from the native implementation become 
insignificant with increasing distance? It's an interesting question, and I 
propose that we add benchmark method(s) to answer it in a follow-up PR. Does 
that sound reasonable?
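
The access pattern in the quoted suggestion could be sketched as follows. This is plain Java rather than a JMH benchmark, and `RandomPairDotProduct` with its parameters is hypothetical: many vectors are laid out back to back in one buffer (vector `i` starts at `i * size`), and random pairs are scored so that the two operands are usually far apart in memory.

```java
import java.util.Random;

/** Hypothetical driver for scoring random pairs of vectors stored sequentially in one buffer. */
final class RandomPairDotProduct {
  /** Dot product of two vectors of length {@code size} starting at the given offsets. */
  static int dotProduct(byte[] buffer, int offsetA, int offsetB, int size) {
    int dot = 0;
    for (int i = 0; i < size; i++) {
      dot += buffer[offsetA + i] * buffer[offsetB + i];
    }
    return dot;
  }

  /** Sums dot products over random pairs; the sum is returned so the work cannot be discarded. */
  static long run(int numVectors, int size, int pairs, long seed) {
    Random random = new Random(seed);
    byte[] buffer = new byte[numVectors * size];
    for (int i = 0; i < buffer.length; i++) {
      buffer[i] = (byte) random.nextInt(128); // values in [0, 127], as in the setup above
    }
    long sum = 0;
    for (int p = 0; p < pairs; p++) {
      int a = random.nextInt(numVectors);
      int b = random.nextInt(numVectors);
      sum += dotProduct(buffer, a * size, b * size, size);
    }
    return sum;
  }
}
```

Reliable timings would still need a harness such as JMH; this only shows how the data layout and pair selection might be arranged.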
   
   
   




Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]

2024-10-25 Thread via GitHub



derreisende77 commented on issue #13959:
URL: https://github.com/apache/lucene/issues/13959#issuecomment-2438907870

   @benwtrent I have JProfiler, but I am not really experienced in using it, or 
in profiling at all.
   I made two runs on macOS and took screenshots of the hotspot page.
   
   JDK23:
   
![jdk23](https://github.com/user-attachments/assets/d2753b47-1f68-49c8-9455-daf761956d96)
   
   JDK22:
   
![jdk22](https://github.com/user-attachments/assets/6292f639-9550-4e46-8e05-c13ac688f21f)
   
   The highlighted line in the JDK22 image marks the function that contains the 
code snippet I posted at the beginning. 
   The `StoredFields.document` line above it corresponds to the `var d = 
storedFields.document(docId, INTEREST_SET);` line.
   
   What I have seen during the profile run:
   - `ArrayUtil.growExact` takes a lot more time on JDK23 than on JDK 22.
   - `UnicodeUtil.UTF16toUTF8` calls happen during the index creation 
phase and take almost the same time on JDK 22 and 23.
   
   HTH



Re: [PR] Speed up advancing within a block, take 2. [lucene]

2024-10-25 Thread via GitHub



jpountz commented on PR #13958:
URL: https://github.com/apache/lucene/pull/13958#issuecomment-2438911637

   And I seem to be getting a better speedup by using `trueCount()` instead of 
`firstTrue()`:
   
   ```
                     Task  QPS baseline   StdDev  QPS my_modified_version   StdDev              Pct diff  p-value
                CountTerm       8621.82   (5.6%)                  8504.44   (4.6%)   -1.4% (-10% -   9%)    0.401
             AndStopWords         31.14   (1.4%)                    30.83   (4.6%)   -1.0% ( -6% -   5%)    0.363
                  Prefix3         96.42   (5.7%)                    95.50   (4.4%)   -1.0% (-10% -   9%)    0.557
     HighTermTitleBDVSort         15.80   (6.0%)                    15.65   (5.0%)   -0.9% (-11% -  10%)    0.587
              OrStopWords         34.67   (2.9%)                    34.45   (5.7%)   -0.6% ( -8% -   8%)    0.657
             OrNotHighMed        385.71   (4.2%)                   384.12   (3.2%)   -0.4% ( -7% -   7%)    0.725
               TermDTSort        346.51   (5.7%)                   345.26   (6.2%)   -0.4% (-11% -  12%)    0.847
        HighTermTitleSort        153.13   (1.7%)                   152.59   (3.3%)   -0.4% ( -5% -   4%)    0.670
                   OrMany         19.06   (1.6%)                    18.99   (3.2%)   -0.3% ( -5% -   4%)    0.671
        HighTermMonthSort       3126.69   (2.9%)                  3117.99   (3.7%)   -0.3% ( -6% -   6%)    0.791
          CountOrHighHigh         50.32   (1.6%)                    50.26   (2.1%)   -0.1% ( -3% -   3%)    0.862
           CountOrHighMed        104.69   (1.7%)                   104.70   (2.0%)    0.0% ( -3% -   3%)    0.981
                 PKLookup        270.86   (2.7%)                   270.98   (2.7%)    0.0% ( -5% -   5%)    0.960
               OrHighRare        281.93   (3.4%)                   282.35   (4.8%)    0.1% ( -7% -   8%)    0.911
                 Wildcard         49.07   (3.7%)                    49.15   (4.2%)    0.2% ( -7% -   8%)    0.893
       Or2Terms2StopWords        160.10   (1.5%)                   160.52   (3.5%)    0.3% ( -4% -   5%)    0.756
      And2Terms2StopWords        156.75   (1.5%)                   157.35   (2.8%)    0.4% ( -3% -   4%)    0.586
                OrHighLow        855.65   (2.4%)                   859.93   (2.7%)    0.5% ( -4% -   5%)    0.542
    HighTermDayOfYearSort        800.87   (2.8%)                   805.06   (2.9%)    0.5% ( -5% -   6%)    0.562
                And3Terms        169.90   (1.5%)                   170.87   (3.1%)    0.6% ( -3% -   5%)    0.455
                   Fuzzy1         77.88   (3.3%)                    78.52   (2.9%)    0.8% ( -5% -   7%)    0.409
                   Fuzzy2         73.27   (3.0%)                    73.93   (2.4%)    0.9% ( -4% -   6%)    0.295
             OrNotHighLow       1099.84   (3.7%)                  1114.61   (3.8%)    1.3% ( -5% -   9%)    0.260
                 Or3Terms        169.45   (1.5%)                   171.80   (3.7%)    1.4% ( -3% -   6%)    0.118
          CountAndHighMed        148.89   (2.5%)                   151.58   (3.0%)    1.8% ( -3% -   7%)    0.040
                  LowTerm       1033.62   (3.6%)                  1052.61   (2.8%)    1.8% ( -4% -   8%)    0.075
             OrHighNotMed        371.62   (3.1%)                   378.74   (3.5%)    1.9% ( -4% -   8%)    0.066
            OrHighNotHigh        296.15   (3.1%)                   302.30   (3.1%)    2.1% ( -4% -   8%)    0.036
              AndHighHigh         70.55   (1.6%)                    72.20   (2.4%)    2.3% ( -1% -   6%)    0.000
               OrHighHigh         94.03   (1.6%)                    96.25   (2.0%)    2.4% ( -1% -   6%)    0.000
             OrHighNotLow        442.74   (3.0%)                   454.42   (3.6%)    2.6% ( -3% -   9%)    0.011
                OrHighMed        232.09   (2.5%)                   238.43   (2.5%)    2.7% ( -2% -   7%)    0.001
                   IntNRQ        110.25  (15.4%)                   113.35  (17.9%)    2.8% (-26% -  42%)    0.594
                  MedTerm        601.09   (3.7%)                   619.19   (2.2%)    3.0% ( -2% -   9%)    0.002
               AndHighMed        221.49   (1.9%)                   228.33   (2.4%)    3.1% ( -1% -   7%)    0.000
                 HighTerm        520.52   (3.4%)                   537.37   (2.6%)    3.2% ( -2% -   9%)    0.001
               AndHighLow       1047.38   (2.8%)                  1082.62   (2.7%)    3.4% ( -2% -   9%)    0.000
            OrNotHighHigh        276.13   (3.5%)                   286.23   (3.4%)    3.7% ( -3% -  10%)    0.001
         CountAndHighHigh         49.28   (2.3%)                    54.98   (2.4%)   11.6% (  6% -  16%)    0.000
   ```
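
One plausible reading of the `trueCount()` versus `firstTrue()` choice, stated here as an assumption and not taken from the PR's diff, is that in a sorted block the index of the first entry at or above the target equals the number of entries below it. The scalar sketch below shows that equivalence with a hypothetical `BlockAdvance` class; it does not use the Panama Vector API.

```java
/** Hypothetical scalar analogue of two ways to advance within a sorted block of doc IDs. */
final class BlockAdvance {
  /** Index of the first entry >= target, found by scanning for the first match (like firstTrue()). */
  static int firstAtLeast(int[] sortedDocs, int target) {
    for (int i = 0; i < sortedDocs.length; i++) {
      if (sortedDocs[i] >= target) {
        return i;
      }
    }
    return sortedDocs.length;
  }

  /** Same index, found by counting the entries < target with no early exit (like trueCount()). */
  static int countLessThan(int[] sortedDocs, int target) {
    int count = 0;
    for (int doc : sortedDocs) {
      count += doc < target ? 1 : 0;
    }
    return count;
  }
}
```

The two methods agree only because the block is sorted; on unsorted input the count would no longer be a valid index.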



Re: [PR] Speed up advancing within a block, take 2. [lucene]

2024-10-25 Thread via GitHub



rmuir commented on PR #13958:
URL: https://github.com/apache/lucene/pull/13958#issuecomment-2438925486

   You are using VectorMask; only use this where it is implemented in hardware 
(AVX-512 and ARM SVE).



Re: [PR] Speed up advancing within a block, take 2. [lucene]

2024-10-25 Thread via GitHub



jpountz commented on PR #13958:
URL: https://github.com/apache/lucene/pull/13958#issuecomment-2438919587

   I ran this PR on my Mac laptop (M3), where it gives a massive slowdown, I 
imagine because some of the vector operations I'm using are emulated. I need to 
find what to check against in order to avoid this, like we did for vectors with 
`PanamaVectorConstants.HAS_FAST_INTEGER_VECTORS`.



Re: [PR] Speed up advancing within a block, take 2. [lucene]

2024-10-25 Thread via GitHub



rmuir commented on PR #13958:
URL: https://github.com/apache/lucene/pull/13958#issuecomment-2438947715

   For these uses of VectorMask you are OK with AVX2 (so just use the existing 
FAST_INTEGER_VECTORS check):
   
   
https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1597-L1603
   
   So if you want to add this one without slowdowns, I would check 
`FAST_INTEGER_VECTORS && amd64`.
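
A guard along these lines could be sketched as below. `VectorMaskGuard` is hypothetical, the boolean parameter stands in for the existing fast-integer-vectors flag, and reading the architecture from the `os.arch` system property is an assumption; how Lucene actually detects the CPU is not shown in this thread.

```java
import java.util.Locale;

/** Hypothetical guard that enables a VectorMask code path only on 64-bit x86 with fast integer vectors. */
final class VectorMaskGuard {
  /** Returns true when the given os.arch value names a 64-bit x86 CPU. */
  static boolean isAmd64(String osArch) {
    String arch = osArch.toLowerCase(Locale.ROOT);
    return arch.equals("amd64") || arch.equals("x86_64");
  }

  /** Combines the fast-integer-vectors flag with the architecture check. */
  static boolean useVectorMaskPath(boolean hasFastIntegerVectors, String osArch) {
    return hasFastIntegerVectors && isAmd64(osArch);
  }
}
```

A caller would pass `System.getProperty("os.arch")` as the second argument, so an M3 laptop reporting `aarch64` falls back to the scalar path.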



Re: [PR] Early reset scratchBytes in Lucene90BlockTreeTermsWriter.compileIndex. [lucene]

2024-10-25 Thread via GitHub



vsop-479 commented on PR #13915:
URL: https://github.com/apache/lucene/pull/13915#issuecomment-2437267763

   I will close it, since it is insignificant.



Re: [PR] Early reset scratchBytes in Lucene90BlockTreeTermsWriter.compileIndex. [lucene]

2024-10-25 Thread via GitHub



vsop-479 closed pull request #13915: Early reset scratchBytes in 
Lucene90BlockTreeTermsWriter.compileIndex.
URL: https://github.com/apache/lucene/pull/13915



[PR] Remove LeafSimScorer abstraction. [lucene]

2024-10-25 Thread via GitHub



jpountz opened a new pull request, #13957:
URL: https://github.com/apache/lucene/pull/13957

   `LeafSimScorer` is a specialization of a `SimScorer` for a given segment. It 
doesn't add much value, but benchmarks suggest that it adds measurable overhead 
to queries sorted by score.
   
   Here is a `luceneutil` run with `-searchConcurrency 0` on `wikibigall`:
   
   ```
   Task                    QPS baseline    StdDev  QPS my_modified_version    StdDev  Pct diff                  p-value
   CountAndHighMed               148.80    (3.6%)                   146.79    (3.3%)     -1.4% (  -8% -    5%)  0.219
   Prefix3                       210.12    (3.4%)                   208.12    (3.1%)     -1.0% (  -7% -    5%)  0.355
   OrNotHighLow                  930.49    (2.9%)                   922.26    (2.8%)     -0.9% (  -6% -    4%)  0.326
   CountOrHighMed                104.34    (1.6%)                   103.50    (1.5%)     -0.8% (  -3% -    2%)  0.099
   CountAndHighHigh               48.93    (3.6%)                    48.55    (3.4%)     -0.8% (  -7% -    6%)  0.485
   HighTermMonthSort            3011.98    (2.9%)                  2989.18    (4.1%)     -0.8% (  -7% -    6%)  0.498
   TermDTSort                    342.40    (7.1%)                   340.02    (6.1%)     -0.7% ( -13% -   13%)  0.741
   CountOrHighHigh                49.93    (1.6%)                    49.76    (1.2%)     -0.3% (  -3% -    2%)  0.451
   HighTermTitleSort             111.58    (2.3%)                   111.22    (3.1%)     -0.3% (  -5% -    5%)  0.710
   OrNotHighHigh                 308.36    (3.1%)                   307.70    (3.3%)     -0.2% (  -6% -    6%)  0.835
   Fuzzy2                         71.17    (1.6%)                    71.07    (2.2%)     -0.1% (  -3% -    3%)  0.824
   OrHighLow                     726.98    (1.6%)                   727.36    (2.5%)      0.1% (  -4% -    4%)  0.939
   HighTermDayOfYearSort         764.56    (3.8%)                   765.85    (3.4%)      0.2% (  -6% -    7%)  0.882
   OrNotHighMed                  350.64    (3.4%)                   351.46    (4.3%)      0.2% (  -7% -    8%)  0.848
   Fuzzy1                         75.46    (1.9%)                    75.80    (1.8%)      0.5% (  -3% -    4%)  0.448
   IntNRQ                        139.45   (13.7%)                   140.08   (14.5%)      0.5% ( -24% -   33%)  0.918
   HighTermTitleBDVSort           15.35    (5.7%)                    15.42    (5.5%)      0.5% ( -10% -   12%)  0.781
   PKLookup                      265.51    (2.5%)                   267.01    (1.6%)      0.6% (  -3% -    4%)  0.389
   AndHighLow                    989.77    (1.9%)                   995.39    (2.2%)      0.6% (  -3% -    4%)  0.387
   CountTerm                    7984.92    (3.9%)                  8051.09    (5.0%)      0.8% (  -7% -   10%)  0.557
   OrHighNotHigh                 321.43    (2.7%)                   324.15    (3.1%)      0.8% (  -4% -    6%)  0.357
   OrMany                         18.24    (2.4%)                    18.45    (2.1%)      1.1% (  -3% -    5%)  0.107
   Wildcard                      117.97    (3.2%)                   119.40    (3.2%)      1.2% (  -5% -    7%)  0.230
   OrHighRare                    269.54    (5.3%)                   273.78    (6.5%)      1.6% (  -9% -   14%)  0.401
   OrHighMed                     219.25    (2.5%)                   222.89    (2.7%)      1.7% (  -3% -    7%)  0.044
   And2Terms2StopWords           151.65    (1.8%)                   154.21    (1.6%)      1.7% (  -1% -    5%)  0.002
   Or2Terms2StopWords            153.46    (3.1%)                   156.15    (2.8%)      1.8% (  -4% -    7%)  0.061
   Or3Terms                      164.81    (2.4%)                   168.57    (2.9%)      2.3% (  -2% -    7%)  0.007
   MedTerm                       610.37    (3.5%)                   625.30    (3.7%)      2.4% (  -4% -   10%)  0.032
   OrHighNotMed                  417.48    (2.8%)                   427.78    (3.1%)      2.5% (  -3% -    8%)  0.008
   LowTerm                       981.78    (2.8%)                  1008.35    (3.8%)      2.7% (  -3% -    9%)  0.010
   And3Terms                     165.41    (1.8%)                   170.05    (1.7%)      2.8% (   0% -    6%)  0.000
   AndStopWords                   30.15    (3.0%)                    31.07    (3.8%)      3.0% (  -3% -   10%)  0.005
   HighTerm                      455.84    (3.4%)                   469.91    (4.0%)      3.1% (  -4% -   10%)  0.009
   OrHighHigh                     68.52    (1.7%)                    70.69    (3.7%)      3.2% (  -2% -    8%)  0.000
   OrHighNotLow                  412.63    (2.8%)                   427.86    (3.5%)      3.7% (  -2% -   10%)  0.000
   OrStopWords                    33.50    (3.8%)                    34.75    (5.1%)      3.7% (  -4% -   13%)  0.009
   AndHighMed                    165.41    (1.9%)                   171.81    (1.7%)      3.9% (   0% -    7%)  0.000
   AndHighHigh                    72.22    (1.7%)                    76.11    (1.4%)      5.4% (   2% -    8%)  0.000
   ```

Re: [PR] Disable exchanging minimum scores across slices for exhaustive evaluation. [lucene]

2024-10-25 Thread via GitHub



jpountz merged PR #13954:
URL: https://github.com/apache/lucene/pull/13954



Re: [PR] Support multi-tenant RAM buffers for IndexWriter [lucene]

2024-10-25 Thread via GitHub



jpountz commented on PR #13951:
URL: https://github.com/apache/lucene/pull/13951#issuecomment-2437616406

   > I couldn't think of a clean way to integrate the two... but I'll give it 
some more thought
   
   For what it's worth, these classes are package-private, so we can feel free to change their API.
   



Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2024-10-25 Thread via GitHub



HoustonPutman commented on code in PR #13914:
URL: https://github.com/apache/lucene/pull/13914#discussion_r1810967900


##
lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java:
##
@@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) {
    * is used to compute the equi-weight per bin.
    */
   public static List<DynamicRangeInfo> computeDynamicNumericRanges(
-      long[] values, long[] weights, int len, long totalWeight, int topN) {
+      long[] values, long[] weights, int len, long totalValue, long totalWeight, int topN) {
     assert values.length == weights.length && len <= values.length && len >= 0;
     assert topN >= 0;
     List<DynamicRangeInfo> dynamicRangeResult = new ArrayList<>();
     if (len == 0 || topN == 0) {
       return dynamicRangeResult;
     }
 
-    new InPlaceMergeSorter() {
-      @Override
-      protected int compare(int index1, int index2) {
-        int cmp = Long.compare(values[index1], values[index2]);
-        if (cmp == 0) {
-          // If the values are equal, sort based on the weights.
-          // Any weight order is correct as long as it's deterministic.
-          return Long.compare(weights[index1], weights[index2]);
-        }
-        return cmp;
-      }
+    double rangeWeightTarget = (double) totalWeight / topN;
+    double[] kWeights = new double[topN];
+    for (int i = 0; i < topN; i++) {
+      kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;

Review Comment:
   Wow yeah, both are better (though I like the first). This is the beauty of PR reviews haha. When you are 500 lines into a change, who knows what dumb things you will write...
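
For illustration only: the suggestions referred to in this comment are not quoted in the message, so the following is a guess at one possible simplification, not the reviewer's actual proposal. It computes the same evenly spaced cumulative weight targets as the quoted loop, but in closed form. `CumulativeTargetsSketch` and `cumulativeTargets` are hypothetical names that do not appear in the PR.

```java
final class CumulativeTargetsSketch {
  // Hypothetical helper: returns topN evenly spaced cumulative weight targets.
  // Assumes topN > 0, as the quoted method returns early when topN == 0.
  static double[] cumulativeTargets(long totalWeight, int topN) {
    double rangeWeightTarget = (double) totalWeight / topN;
    double[] kWeights = new double[topN];
    for (int i = 0; i < topN; i++) {
      // The target for bin i is (i + 1) shares of the per-range weight.
      kWeights[i] = (i + 1) * rangeWeightTarget;
    }
    return kWeights;
  }
}
```

Multiplying instead of accumulating can differ from the running sum in the last bits of the floating-point result, which may or may not matter for how the bins are cut.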



##
lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java:
##
@@ -0,0 +1,407 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.SplittableRandom;
+
+/**
+ * Adaptive selection algorithm based on the introspective quick select algorithm. The quick select
+ * algorithm uses an interpolation variant of Tukey's ninther median-of-medians for pivot, and
+ * Bentley-McIlroy 3-way partitioning. For the introspective protection, it shuffles the sub-range
+ * if the max recursive depth is exceeded.
+ *
+ * This selection algorithm is fast on most data shapes, especially on nearly sorted data, or
+ * when k is close to the boundaries. It runs in linear time on average.
+ *
+ * @lucene.internal
+ */
+public abstract class WeightedSelector {
+
+  // This selector is used repeatedly by the radix selector for sub-ranges of less than
+  // 100 entries. This means this selector is also optimized to be fast on small ranges.
+  // It uses the variant of medians-of-medians and 3-way partitioning, and finishes the
+  // last tiny range (3 entries or less) with a very specialized sort.
+
+  private SplittableRandom random;
+
+  protected abstract long getWeight(int i);
+
+  protected abstract long getValue(int i);
+
+  public final WeightRangeInfo[] select(

Review Comment:
   Absolutely. Was going to go through and add docs, just wanted to make sure it was a good direction to go in first. Probably worth doing the benchmarking first 🥹
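
As background for the Javadoc quoted above, here is a minimal quick select sketch. It is deliberately much simpler than the class under review: it uses a random pivot and a plain Dijkstra-style 3-way partition rather than the ninther pivot and Bentley-McIlroy partitioning the Javadoc mentions, it ignores weights entirely, and it works on a copy of the input. `QuickSelectSketch` and `select` are hypothetical names, not part of the PR.

```java
import java.util.SplittableRandom;

final class QuickSelectSketch {
  // Returns the k-th smallest element (0-based) without fully sorting the input.
  static long select(long[] values, int k) {
    if (k < 0 || k >= values.length) {
      throw new IllegalArgumentException("k out of range: " + k);
    }
    long[] a = values.clone(); // leave the caller's array untouched
    SplittableRandom random = new SplittableRandom(42);
    int lo = 0;
    int hi = a.length - 1;
    while (true) {
      long pivot = a[lo + random.nextInt(hi - lo + 1)];
      // 3-way partition: a[lo..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..hi] > pivot.
      int lt = lo;
      int i = lo;
      int gt = hi;
      while (i <= gt) {
        if (a[i] < pivot) {
          swap(a, lt++, i++);
        } else if (a[i] > pivot) {
          swap(a, i, gt--);
        } else {
          i++;
        }
      }
      if (k < lt) {
        hi = lt - 1; // k lies in the "smaller" part
      } else if (k > gt) {
        lo = gt + 1; // k lies in the "larger" part
      } else {
        return pivot; // k lies in the block of elements equal to the pivot
      }
    }
  }

  private static void swap(long[] a, int i, int j) {
    long tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
  }
}
```

Only one side of each partition is visited again, which is where the average linear running time comes from. The 3-way split keeps runs of equal values from degrading it.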



##
lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java:
##
@@ -0,0 +1,407 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.util.Arrays;
+import

Re: [PR] Speed up advancing within a block, take 2. [lucene]

2024-10-25 Thread via GitHub



rmuir commented on PR #13958:
URL: https://github.com/apache/lucene/pull/13958#issuecomment-2438973598

   Maybe it's a bug that it doesn't work on your Mac either, because elsewhere they have code that looks like it is supposed to be doing this stuff: 
https://github.com/openjdk/jdk/blob/f1a9a8d25b2e1f9b5dbe8719abb66ec4cd9057dc/src/hotspot/cpu/aarch64/aarch64_vector_ad.m4#L3782



Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-10-25 Thread via GitHub



goankur commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1817415010


##
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/VectorUtilBenchmark.java:
##
@@ -84,6 +91,76 @@ public void init() {
       floatsA[i] = random.nextFloat();
       floatsB[i] = random.nextFloat();
     }
+    // Java 21+ specific initialization
+    final int runtimeVersion = Runtime.version().feature();
+    if (runtimeVersion >= 21) {
+      // Reflection based code to eliminate the use of Preview classes in JMH benchmarks
+      try {
+        final Class vectorUtilSupportClass = VectorUtil.getVectorUtilSupportClass();
+        final var className = "org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport";
+        if (vectorUtilSupportClass.getName().equals(className) == false) {
+          nativeBytesA = null;
+          nativeBytesB = null;
+        } else {
+          MethodHandles.Lookup lookup = MethodHandles.lookup();
+          final var MemorySegment = "java.lang.foreign.MemorySegment";
+          final var methodType =
+              MethodType.methodType(lookup.findClass(MemorySegment), byte[].class);
+          MethodHandle nativeMemorySegment =
+              lookup.findStatic(vectorUtilSupportClass, "nativeMemorySegment", methodType);
+          byte[] a = new byte[size];

Review Comment:
   Nonetheless I will simplify the setup code to make it a bit more readable in the next iteration.
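
The setup code quoted above resolves a class by name and then looks up a static method through `MethodHandles`, so that the benchmark never references the target type at compile time. A minimal, self-contained sketch of that pattern follows. It targets `Integer.parseInt` from the JDK purely as a stand-in, and `ReflectiveLookupSketch` and `parseViaHandle` are hypothetical names; none of this is the PR's code.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class ReflectiveLookupSketch {
  // Calls Integer.parseInt(String) without naming Integer at compile time.
  static int parseViaHandle(String text) {
    try {
      MethodHandles.Lookup lookup = MethodHandles.lookup();
      // Resolve the class from its name, as the benchmark does for MemorySegment.
      Class<?> target = lookup.findClass("java.lang.Integer");
      // Describe the signature: returns int, takes a single String argument.
      MethodType type = MethodType.methodType(int.class, String.class);
      MethodHandle parseInt = lookup.findStatic(target, "parseInt", type);
      return (int) parseInt.invoke(text);
    } catch (Throwable t) {
      // invoke() is declared to throw Throwable; wrap it for this sketch.
      throw new IllegalStateException("reflective call failed", t);
    }
  }
}
```

Looking the handle up once during setup and reusing it keeps the lookup cost out of the measured path, which is presumably why the benchmark does this in `init()`.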




Re: [PR] Speed up advancing within a block, take 2. [lucene]

2024-10-25 Thread via GitHub



jpountz commented on PR #13958:
URL: https://github.com/apache/lucene/pull/13958#issuecomment-2438737799

   Specializing `ImpactsDISI#nextDoc()` helped get rid of the slowdown:
   
   ```
   Task                    QPS baseline    StdDev  QPS my_modified_version    StdDev  Pct diff                  p-value
   AndStopWords                   31.34    (1.8%)                    30.84    (4.0%)     -1.6% (  -7% -    4%)  0.105
   CountTerm                    8573.12    (3.8%)                  8449.05    (4.7%)     -1.4% (  -9% -    7%)  0.284
   CountOrHighMed                105.75    (2.1%)                   104.50    (1.4%)     -1.2% (  -4% -    2%)  0.039
   TermDTSort                    363.06    (6.4%)                   358.98    (6.6%)     -1.1% ( -13% -   12%)  0.585
   CountOrHighHigh                50.62    (2.4%)                    50.28    (1.7%)     -0.7% (  -4% -    3%)  0.305
   IntNRQ                        453.67    (4.7%)                   451.13    (4.5%)     -0.6% (  -9% -    9%)  0.700
   OrHighRare                    283.32    (3.8%)                   282.52    (3.8%)     -0.3% (  -7% -    7%)  0.813
   Fuzzy1                         78.58    (2.1%)                    78.42    (3.0%)     -0.2% (  -5% -    5%)  0.812
   HighTermDayOfYearSort         850.86    (4.4%)                   849.52    (3.0%)     -0.2% (  -7% -    7%)  0.895
   HighTermTitleBDVSort           13.97    (6.3%)                    13.96    (5.5%)     -0.1% ( -11% -   12%)  0.974
   And2Terms2StopWords           157.31    (1.3%)                   157.27    (2.2%)     -0.0% (  -3% -    3%)  0.965
   LowTerm                       985.67    (3.0%)                   986.01    (1.8%)      0.0% (  -4% -    4%)  0.964
   HighTermMonthSort            3216.69    (2.2%)                  3217.92    (3.9%)      0.0% (  -5% -    6%)  0.969
   Fuzzy2                         73.69    (2.0%)                    73.74    (2.4%)      0.1% (  -4% -    4%)  0.910
   AndHighHigh                    65.88    (2.1%)                    66.18    (2.0%)      0.5% (  -3% -    4%)  0.472
   And3Terms                     169.85    (2.0%)                   170.81    (2.4%)      0.6% (  -3% -    5%)  0.424
   OrMany                         19.10    (1.7%)                    19.22    (1.7%)      0.6% (  -2% -    4%)  0.237
   Or2Terms2StopWords            160.88    (1.4%)                   161.91    (2.0%)      0.6% (  -2% -    4%)  0.241
   OrStopWords                    34.90    (1.4%)                    35.15    (3.9%)      0.7% (  -4% -    6%)  0.450
   OrHighLow                     799.18    (1.6%)                   805.33    (1.5%)      0.8% (  -2% -    3%)  0.117
   CountAndHighMed               149.99    (3.1%)                   151.23    (1.1%)      0.8% (  -3% -    5%)  0.261
   Wildcard                       88.47    (2.7%)                    89.32    (3.2%)      1.0% (  -4% -    7%)  0.309
   PKLookup                      270.87    (3.8%)                   273.47    (1.7%)      1.0% (  -4% -    6%)  0.307
   Prefix3                        93.00    (8.2%)                    94.14    (6.3%)      1.2% ( -12% -   17%)  0.599
   MedTerm                       690.05    (2.6%)                   701.55    (1.3%)      1.7% (  -2% -    5%)  0.010
   OrHighNotMed                  359.57    (2.7%)                   366.02    (1.9%)      1.8% (  -2% -    6%)  0.014
   Or3Terms                      170.81    (1.3%)                   173.98    (2.1%)      1.9% (  -1% -    5%)  0.001
   OrHighNotLow                  432.25    (3.4%)                   440.76    (2.4%)      2.0% (  -3% -    8%)  0.035
   HighTermTitleSort             159.15    (4.8%)                   162.44    (2.9%)      2.1% (  -5% -   10%)  0.096
   AndHighMed                    225.25    (2.6%)                   229.93    (1.4%)      2.1% (  -1% -    6%)  0.002
   HighTerm                      455.45    (2.4%)                   465.69    (2.1%)      2.2% (  -2% -    6%)  0.002
   OrHighHigh                     78.87    (1.5%)                    80.64    (1.5%)      2.3% (   0% -    5%)  0.000
   OrHighNotHigh                 218.32    (2.7%)                   224.10    (2.0%)      2.6% (  -2% -    7%)  0.000
   OrNotHighLow                     .11    (2.8%)                  1144.28    (2.5%)      3.0% (  -2% -    8%)  0.000
   OrHighMed                     267.13    (1.8%)                   275.57    (1.3%)      3.2% (   0% -    6%)  0.000
   OrNotHighMed                  303.24    (3.0%)                   313.56    (2.5%)      3.4% (  -2% -    9%)  0.000
   OrNotHighHigh                 230.18    (2.8%)                   238.62    (2.2%)      3.7% (  -1% -    8%)  0.000
   AndHighLow                    866.39    (2.7%)                   903.54    (2.4%)      4.3% (   0% -    9%)  0.000
   CountAndHighHigh               49.60    (3.1%)                    53.54    (0.9%)      7.9% (   3% -   12%)  0.000
   ```



Re: [PR] Speed up advancing within a block, take 2. [lucene]

2024-10-25 Thread via GitHub



rmuir commented on PR #13958:
URL: https://github.com/apache/lucene/pull/13958#issuecomment-2438944785

   
https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L280-L283



Re: [PR] Simplify leaf slice calculation [lucene]

2024-10-25 Thread via GitHub



github-actions[bot] commented on PR #13893:
URL: https://github.com/apache/lucene/pull/13893#issuecomment-2439076438

   This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!



Re: [PR] Optimize slice calculation in IndexSearcher a little [lucene]

2024-10-25 Thread via GitHub



github-actions[bot] commented on PR #13860:
URL: https://github.com/apache/lucene/pull/13860#issuecomment-2439076472

   This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!



Re: [PR] Reduce allocations in BKDReaderDocIDSetIterator [lucene]

2024-10-25 Thread via GitHub



github-actions[bot] commented on PR #13888:
URL: https://github.com/apache/lucene/pull/13888#issuecomment-2439076449

   This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!


