Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-22 Thread via GitHub


msokolov commented on PR #13872:
URL: https://github.com/apache/lucene/pull/13872#issuecomment-2430042116

   With the most recent commit I saw these luceneutil/knnPerfTest.py results:
   
   ## 1. baseline
   ```
   recall  latency (ms) nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
   0.816 0.294  15010   6   32 50 no       341.37  110.92 1  1534.03
   0.811 0.308  15010   6   32 50 7 bits   346.68   93.22 1  1906.16
   0.786 0.288  15010   6   32 50 4 bits   346.28   89.15 1  1906.10
   ```
   
   ## this change with defaults (no command line flags)
   ```
   recall  latency (ms) nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
   0.817 0.304  15010   6   32 50 no       344.11  111.70 1  1533.94
   0.812 0.231  15010   6   32 50 7 bits   354.29   89.76 1  1906.16
   0.785 0.239  15010   6   32 50 4 bits   352.37   89.01 1  1906.12
   ```
   
   ## This change with vector api enabled:
   ```
   recall  latency (ms) nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
   0.817 0.247  15010   6   32 50 no       0.00   0.17 1  1533.94
   0.812 0.282  15010   6   32 50 7 bits   0.00   0.17 1  1906.16
   0.785 0.207  15010   6   32 50 4 bits   0.00   0.17 1  1906.12
   ```
   
   ## This change with vector api and enable-native-access
   ```
   recall  latency (ms) nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
   0.817 0.246  15010   6   32 50 no       0.00   0.17 1  1533.94
   0.812 0.290  15010   6   32 50 7 bits   0.00   0.17 1  1906.16
   0.785 0.206  15010   6   32 50 4 bits   0.00   0.18 1  1906.12
   ```
   
   So I think there is some slowdown in the quantized indexing. I think we need 
to find a solution for the over-allocations due to having moved this logic from 
ScorerSupplier to Scorer. The best idea I have is to make Scorers mutable and 
supply them with new target vectors as needed. WDYT?
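
To make the mutable-scorer idea concrete, here is a rough sketch. The names (`MutableDotProductScorer`, `setTarget`) are hypothetical and not part of Lucene's API, and a plain `float[][]` stands in for the real vector storage. The point is only that the scratch buffer is allocated once and the scorer is re-pointed at a new target instead of being re-created per document.

```java
/** Hypothetical mutable scorer: one instance is reused for many target vectors. */
final class MutableDotProductScorer {
  private final float[][] vectors; // stand-in for the real (off-heap) vector storage
  private final float[] target;    // scratch buffer, allocated once per scorer
  private int targetOrd = -1;      // -1 means no target has been set yet

  MutableDotProductScorer(float[][] vectors, int dimension) {
    this.vectors = vectors;
    this.target = new float[dimension];
  }

  /** Re-points the scorer at a new target vector without allocating. */
  void setTarget(int ord) {
    System.arraycopy(vectors[ord], 0, target, 0, target.length);
    targetOrd = ord;
  }

  /** Scores the vector at {@code ord} against the current target (dot product). */
  float score(int ord) {
    if (targetOrd < 0) {
      throw new IllegalStateException("setTarget must be called before score");
    }
    float[] other = vectors[ord];
    float sum = 0f;
    for (int i = 0; i < target.length; i++) {
      sum += target[i] * other[i];
    }
    return sum;
  }
}
```

An indexing loop would then call `setTarget(ord)` once per document and reuse the same instance, rather than asking a supplier for a fresh scorer each time.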


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-22 Thread via GitHub


msokolov commented on code in PR #13872:
URL: https://github.com/apache/lucene/pull/13872#discussion_r1811235616


##
lucene/core/src/java/org/apache/lucene/codecs/hnsw/DefaultFlatVectorScorer.java:
##
@@ -88,34 +88,28 @@ public String toString() {
 
   /** RandomVectorScorerSupplier for bytes vector */
   private static final class ByteScoringSupplier implements RandomVectorScorerSupplier {
-    private final ByteVectorValues vectors;
-    private final ByteVectorValues vectors1;
-    private final ByteVectorValues vectors2;
+    private final ByteVectorValues vectorValues;
     private final VectorSimilarityFunction similarityFunction;
 
     private ByteScoringSupplier(
-        ByteVectorValues vectors, VectorSimilarityFunction similarityFunction) throws IOException {
-      this.vectors = vectors;
-      vectors1 = vectors.copy();
-      vectors2 = vectors.copy();
+        ByteVectorValues vectorValues, VectorSimilarityFunction similarityFunction)
+        throws IOException {
+      this.vectorValues = vectorValues;
       this.similarityFunction = similarityFunction;
     }
 
     @Override
-    public RandomVectorScorer scorer(int ord) {
-      return new RandomVectorScorer.AbstractRandomVectorScorer(vectors) {
+    public RandomVectorScorer scorer(int ord) throws IOException {
+      ByteVectorValues.Bytes vectors1 = vectorValues.vectors();
+      ByteVectorValues.Bytes vectors2 = vectorValues.vectors();
+      return new RandomVectorScorer.AbstractRandomVectorScorer(vectorValues) {

Review Comment:
   yeah this seems like a bad consequence. Maybe we could switch from a 
supplier/scorer to a mutable scorer that can be "set" to a new vector as needed?






[PR] Ensure stability of clause order for DisjunctionMaxQuery toString [lucene]

2024-10-22 Thread via GitHub


ljak opened a new pull request, #13944:
URL: https://github.com/apache/lucene/pull/13944

   Since https://github.com/apache/lucene/pull/110, the disjuncts of a 
DisjunctionMaxQuery no longer have a defined order, which affects the output of 
its `toString` method. In isolation, that does not matter. But in Solr, when the 
debug component is used for a distributed query, each shard can return a 
different `toString` representation of the same query. The corresponding keys of 
the debug response then hold an array of those differing representations 
instead of a single value.
   
   Example with the `parsedquery_toString` key (of a JSON response within Solr):
   `"parsedquery_toString":["((docIdentifiers:\"Okarandeep Osingh\" 
docIdentifiers:Otest) | (docTitle:\"Okarandeep Osingh\" docTitle:Otest) | 
(docBody:\"Okarandeep Osingh\" docBody:Otest))","((docBody:\"Okarandeep 
Osingh\" docBody:Otest) | (docTitle:\"Okarandeep Osingh\" docTitle:Otest) | 
(docIdentifiers:\"Okarandeep Osingh\" docIdentifiers:Otest))"]`
   
   When PR #110 was merged, Solr adapted its unit tests this way: 
https://github.com/apache/solr/pull/117 but, later on within Lucene, the 
`toString` method of DisjunctionIntervalsSource was adapted in anticipation of a 
potential similar future change: https://github.com/apache/lucene/pull/193. 
   
   I adapted the `toString` method of DisjunctionMaxQuery in the same way as that PR.
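
For illustration, a minimal sketch of the approach, assuming the fix amounts to sorting the clauses' string forms before joining them. `StableDisjunctionToString` and `render` are made-up names, and the sketch ignores the rest of the real `toString` formatting.

```java
import java.util.Collection;
import java.util.stream.Collectors;

/** Hypothetical helper: renders a set of disjuncts in an order-independent way. */
final class StableDisjunctionToString {
  /**
   * Joins the clauses' string forms in sorted order, so the result does not
   * depend on the iteration order of the underlying collection.
   */
  static String render(Collection<?> disjuncts) {
    return disjuncts.stream()
        .map(Object::toString)
        .sorted()
        .collect(Collectors.joining(" | ", "(", ")"));
  }
}
```

With this, two shards that hold the same disjuncts in different iteration orders produce the same string, so the debug response collapses to a single value.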
   
   
   
   
   





[PR] Remove TopScoreDocCollector's dependency on HitsThresholdChecker. [lucene]

2024-10-22 Thread via GitHub


jpountz opened a new pull request, #13943:
URL: https://github.com/apache/lucene/pull/13943

   `TopScoreDocCollectorManager` has a dependency on `HitsThresholdChecker`, 
which is essentially a shared counter that is incremented until it reaches the 
total hits threshold, at which point the scorer can start dynamically pruning hits.
   
   A consequence of removing this dependency is that dynamic pruning may start 
later, namely as soon as:
    - either the current slice has collected `totalHitsThreshold` hits,
    - or another slice has collected `totalHitsThreshold` hits and the current 
slice has collected enough hits (up to 1,024) to check the shared 
`MaxScoreAccumulator`.
   
   So in short, it trades a bit more work globally for a bit less 
contention. A longer-term goal of mine is to stop specializing our 
`CollectorManager`s based on whether they are going to be used concurrently or 
not.
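
The two conditions above can be sketched roughly as follows. This only models *when* a slice may start pruning, not the pruning itself. `SliceCollectorSketch`, `CHECK_INTERVAL`, and the use of a `LongAccumulator` as a "some slice reached the threshold" flag are assumptions made for the example, standing in for the real collector and `MaxScoreAccumulator`.

```java
import java.util.concurrent.atomic.LongAccumulator;

/** Hypothetical per-slice collector that counts hits locally instead of in a shared counter. */
final class SliceCollectorSketch {
  static final int CHECK_INTERVAL = 1024; // hits between reads of the shared state

  private final int totalHitsThreshold;
  private final LongAccumulator thresholdReached; // stand-in for the shared accumulator
  private int localHits;
  private boolean pruning;

  SliceCollectorSketch(int totalHitsThreshold, LongAccumulator thresholdReached) {
    this.totalHitsThreshold = totalHitsThreshold;
    this.thresholdReached = thresholdReached;
  }

  /** Called once per collected hit; returns true once this slice may prune. */
  boolean collect() {
    localHits++;
    if (pruning) {
      return true;
    }
    if (localHits >= totalHitsThreshold) {
      // This slice reached the threshold on its own: start pruning and publish it.
      pruning = true;
      thresholdReached.accumulate(1L);
    } else if (localHits % CHECK_INTERVAL == 0 && thresholdReached.get() > 0L) {
      // Another slice reached the threshold; we only notice at a periodic check.
      pruning = true;
    }
    return pruning;
  }
}
```

The shared state is touched at most once every 1,024 hits per slice, which is where the reduced contention comes from; the cost is that a slice may collect up to 1,023 extra hits before noticing.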





Re: [PR] Remove TopScoreDocCollector's dependency on HitsThresholdChecker. [lucene]

2024-10-22 Thread via GitHub


jpountz commented on PR #13943:
URL: https://github.com/apache/lucene/pull/13943#issuecomment-2429765576

   wikibigall with a `searchConcurrency` of 8 suggests that the slowdown is 
tiny:
   
   ```
                     Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
               AndHighLow        849.71   (3.3%)                   826.18   (2.2%)   -2.8% (  -7% -    2%)  0.002
    HighTermDayOfYearSort        253.62   (3.2%)                   246.72   (3.0%)   -2.7% (  -8% -    3%)  0.005
               TermDTSort        202.82   (3.4%)                   198.21   (4.4%)   -2.3% (  -9% -    5%)  0.069
     HighTermTitleBDVSort         49.87   (5.1%)                    49.01   (5.5%)   -1.7% ( -11% -    9%)  0.306
               OrHighRare        245.78   (8.9%)                   242.01   (9.2%)   -1.5% ( -17% -   18%)  0.591
      And2Terms2StopWords        184.35   (5.3%)                   181.84   (4.5%)   -1.4% ( -10% -    8%)  0.379
              AndHighHigh        102.63   (6.9%)                   101.25   (5.9%)   -1.3% ( -13% -   12%)  0.507
                CountTerm       8520.99   (3.9%)                  8417.05   (5.1%)   -1.2% (  -9% -    8%)  0.396
                 Wildcard        112.90   (5.6%)                   111.56   (4.8%)   -1.2% ( -10% -    9%)  0.471
                   Fuzzy1         79.74   (1.7%)                    79.07   (1.7%)   -0.8% (  -4% -    2%)  0.114
                OrHighLow        724.26   (2.3%)                   719.70   (2.2%)   -0.6% (  -5% -    3%)  0.377
       Or2Terms2StopWords        192.42   (5.4%)                   191.24   (4.0%)   -0.6% (  -9% -    9%)  0.680
                And3Terms        176.74   (5.3%)                   175.76   (4.1%)   -0.6% (  -9% -    9%)  0.712
          CountOrHighHigh         88.76   (4.9%)                    88.34   (4.4%)   -0.5% (  -9% -    9%)  0.744
        HighTermMonthSort       1066.76   (1.6%)                  1062.79   (1.9%)   -0.4% (  -3% -    3%)  0.506
                   Fuzzy2         75.06   (1.4%)                    74.85   (1.8%)   -0.3% (  -3% -    3%)  0.597
           CountOrHighMed        137.67   (5.3%)                   137.45   (4.4%)   -0.2% (  -9% -   10%)  0.920
            OrHighNotHigh        196.77   (3.3%)                   196.66   (3.5%)   -0.1% (  -6% -    6%)  0.959
        HighTermTitleSort         70.08   (6.6%)                    70.09   (5.7%)    0.0% ( -11% -   13%)  0.994
               OrHighHigh         94.07   (4.8%)                    94.12   (5.1%)    0.0% (  -9% -   10%)  0.975
               AndHighMed        182.18   (4.1%)                   182.43   (3.4%)    0.1% (  -7% -    8%)  0.909
             OrNotHighMed        255.11   (2.8%)                   255.50   (3.3%)    0.2% (  -5% -    6%)  0.874
                OrHighMed        242.11   (2.4%)                   242.65   (2.5%)    0.2% (  -4% -    5%)  0.772
            OrNotHighHigh        235.61   (2.3%)                   236.26   (3.4%)    0.3% (  -5% -    6%)  0.766
                 HighTerm        361.55   (2.6%)                   362.84   (2.7%)    0.4% (  -4% -    5%)  0.669
                  MedTerm        453.24   (2.8%)                   455.07   (2.5%)    0.4% (  -4% -    5%)  0.628
             OrHighNotMed        317.00   (3.0%)                   318.40   (3.8%)    0.4% (  -6% -    7%)  0.680
                 PKLookup        277.48   (2.2%)                   278.76   (2.7%)    0.5% (  -4% -    5%)  0.558
                   OrMany         46.17   (2.3%)                    46.41   (2.9%)    0.5% (  -4% -    5%)  0.520
                  Prefix3         68.55   (4.0%)                    69.01   (4.7%)    0.7% (  -7% -    9%)  0.627
             OrHighNotLow        336.53   (3.2%)                   339.73   (3.9%)    1.0% (  -5% -    8%)  0.395
             AndStopWords         64.81   (5.3%)                    65.46   (5.3%)    1.0% (  -9% -   12%)  0.543
                  LowTerm        640.08   (3.1%)                   647.88   (2.6%)    1.2% (  -4% -    7%)  0.176
         CountAndHighHigh         74.37   (5.2%)                    75.37   (5.6%)    1.3% (  -8% -   12%)  0.426
          CountAndHighMed        161.25   (5.1%)                   163.59   (5.7%)    1.4% (  -8% -   12%)  0.394
             OrNotHighLow        865.87   (3.3%)                   880.11   (2.7%)    1.6% (  -4% -    7%)  0.081
                 Or3Terms        175.34   (4.3%)                   178.28   (4.9%)    1.7% (  -7% -   11%)  0.252
              OrStopWords         69.26   (6.5%)                    70.71   (6.4%)    2.1% ( -10% -   15%)  0.303
                   IntNRQ        166.81   (5.5%)                   170.67  (10.5%)    2.3% ( -12% -   19%)  0.381
   ```



Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

2024-10-22 Thread via GitHub


msokolov commented on PR #13910:
URL: https://github.com/apache/lucene/pull/13910#issuecomment-2429836870

   Yes, maybe we should -- I think it would be a one-liner





Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

2024-10-22 Thread via GitHub


msokolov commented on PR #13910:
URL: https://github.com/apache/lucene/pull/13910#issuecomment-2429841476

   There is another upgrade path -- if you started with 9.0 and then "upgraded" 
your index by rewriting it (e.g. with the IndexUpgrader tool) via merge on 9.1-9.7, you 
could subsequently read the index with later versions. But this seemed kind of 
complex to explain for a case that probably doesn't exist.





Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-22 Thread via GitHub


msokolov commented on code in PR #13872:
URL: https://github.com/apache/lucene/pull/13872#discussion_r1811216599


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapQuantizedByteVectorValues.java:
##
@@ -127,31 +121,42 @@ public int size() {
   }
 
   @Override
-  public byte[] vectorValue(int targetOrd) throws IOException {
-    if (lastOrd == targetOrd) {
-      return binaryValue;
-    }
-    slice.seek((long) targetOrd * byteSize);
-    slice.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), numBytes);
-    slice.readFloats(scoreCorrectionConstant, 0, 1);
-    decompressBytes(binaryValue, numBytes);
-    lastOrd = targetOrd;
-    return binaryValue;
-  }
+  public QuantizedBytes vectors() throws IOException {
+    return new QuantizedBytes() {
+      ByteBuffer byteBuffer = ByteBuffer.allocate(dimension);
+      byte[] binaryValue = byteBuffer.array();
+      IndexInput input = slice.clone();
+      float[] scoreCorrectionConstant = new float[1];

Review Comment:
   personally I don't care about making these final - the compiler already 
ensures that they are or it wouldn't let you use them in a closure like this. 
As for private, I don't think you can make local variables private, but maybe I 
am missing something.






Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

2024-10-22 Thread via GitHub


benwtrent commented on PR #13910:
URL: https://github.com/apache/lucene/pull/13910#issuecomment-2429748958

   @msokolov could we do a simpler patch for 9.12.1?





Re: [I] Should we avoid allocating a byte[] upfront for binary doc values [lucene]

2024-10-22 Thread via GitHub


iverase closed issue #13929: Should we avoid allocating a byte[] upfront for 
binary doc values
URL: https://github.com/apache/lucene/issues/13929





Re: [I] Should we avoid allocating a byte[] upfront for binary doc values [lucene]

2024-10-22 Thread via GitHub


iverase commented on issue #13929:
URL: https://github.com/apache/lucene/issues/13929#issuecomment-2429888740

   I really wish our binary doc values didn't imply that you need to have 
everything on heap in order to read them, it feels wrong. 
   
   But anyway, I understand it won't happen easily. Closing.





Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-10-22 Thread via GitHub


msokolov commented on code in PR #13872:
URL: https://github.com/apache/lucene/pull/13872#discussion_r1811229378


##
lucene/core/src/java21/org/apache/lucene/internal/vectorization/Lucene99MemorySegmentByteVectorScorerSupplier.java:
##
@@ -112,20 +96,20 @@ static final class CosineSupplier extends Lucene99MemorySegmentByteVectorScorerS
     @Override
     public RandomVectorScorer scorer(int ord) {
       checkOrdinal(ord);
+      MemorySegmentAccessInput slice = input.clone();
+      byte[] scratch1 = new byte[vectorByteSize];
+      byte[] scratch2 = new byte[vectorByteSize];

Review Comment:
   Yeah, this just seemed cleaner than trying to make that conditional, and my 
assumption is these scorers are not created that often? Once per search? 
Although I guess when indexing that could be a lot (once per doc). The 
challenge here is that `getSegment()` is a member of the Supplier while the 
Scorers are the ones that should be supplying the scratch data, so we can't 
easily create scratch lazily. I guess we could create some new abstraction in 
here to handle that but it seems kind of messy.
   
   Is there some way to know "up front" whether a MemorySegment is going to be 
produced? If we knew that, we could allocate scratch space or not based on that 
knowledge. I have to say I'm a little lost in this java21 MemorySegment code -- 
maybe @ChrisHegarty will weigh in and explain what conditions lead 
to `segmentSliceOrNull` returning null?
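
For reference, lazy scratch allocation in isolation could look something like the sketch below. It deliberately ignores the structural obstacle described above (the segment lookup living on the Supplier), and every name in it (`LazyScratchScorer`, `inPlace`, `fallbackReader`) is hypothetical: `inPlace` stands in for a vector that can be read directly from a memory segment, and `fallbackReader` for copying the vector onto the heap when it cannot.

```java
import java.util.function.Consumer;

/** Hypothetical scorer that only allocates scratch space when the fallback path is hit. */
final class LazyScratchScorer {
  private final int vectorByteSize;
  private byte[] scratch;  // null until the fallback path is first taken
  private int allocations; // instrumentation for this sketch only

  LazyScratchScorer(int vectorByteSize) {
    this.vectorByteSize = vectorByteSize;
  }

  /**
   * Computes a byte dot product. If {@code inPlace} is non-null the vector is used
   * directly; otherwise it is copied into a lazily created scratch buffer first.
   */
  int dotProduct(byte[] query, byte[] inPlace, Consumer<byte[]> fallbackReader) {
    byte[] doc = inPlace;
    if (doc == null) {
      if (scratch == null) {
        scratch = new byte[vectorByteSize];
        allocations++;
      }
      fallbackReader.accept(scratch);
      doc = scratch;
    }
    int sum = 0;
    for (int i = 0; i < vectorByteSize; i++) {
      sum += query[i] * doc[i];
    }
    return sum;
  }

  int allocations() {
    return allocations;
  }
}
```

A scorer that always gets an in-place vector never allocates, and one that falls back allocates exactly once, however many vectors it scores.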






[PR] Removing the deprecated parameters, -fast, -slow, -crossCheckTermVectors from CheckIndex. [lucene]

2024-10-22 Thread via GitHub


slow-J opened a new pull request, #13942:
URL: https://github.com/apache/lucene/pull/13942

   Removing the deprecated parameters `-fast`, `-slow`, and `-crossCheckTermVectors` 
from CheckIndex.
   Their usage is replaced by `-level` with values of `1`, `3`, and `3`, 
respectively.
   
   Follow-up on the deprecation done in #11023.
   





Re: [PR] Have value and count in LabelAndValue only for TaxonomyFacets [lucene]

2024-10-22 Thread via GitHub


stefanvodita closed pull request #13740: Have value and count in LabelAndValue 
only for TaxonomyFacets
URL: https://github.com/apache/lucene/pull/13740





Re: [I] Make CheckIndex doChecksumsOnly / -fast as default [LUCENE-9984] [lucene]

2024-10-22 Thread via GitHub


slow-J commented on issue #11023:
URL: https://github.com/apache/lucene/issues/11023#issuecomment-2428849956

   I'll clean up the deprecated CheckIndex params in Lucene 11.





Re: [PR] Make BooleanScorer work on top of Scorers rather than BulkScorers. [lucene]

2024-10-22 Thread via GitHub


jpountz commented on PR #13931:
URL: https://github.com/apache/lucene/pull/13931#issuecomment-2429122034

   There is a good speedup on nightly benchmarks too: 
https://benchmarks.mikemccandless.com/CountOrHighHigh.html.





Re: [PR] Speedup OrderIntervalsSource some more [lucene]

2024-10-22 Thread via GitHub


jpountz commented on PR #13937:
URL: https://github.com/apache/lucene/pull/13937#issuecomment-2429119642

   There is indeed a small speedup to intervals with a low p-value. 
https://benchmarks.mikemccandless.com/IntervalsOrdered.html I pushed an 
annotation.





Re: [PR] Reduce the compiled size of the collect() method on `TopScoreDocCollector`. [lucene]

2024-10-22 Thread via GitHub


jpountz merged PR #13939:
URL: https://github.com/apache/lucene/pull/13939





Re: [PR] Introduce a heuristic to amortize the per-window overhead in MaxScoreBulkScorer. [lucene]

2024-10-22 Thread via GitHub


jpountz merged PR #13941:
URL: https://github.com/apache/lucene/pull/13941





Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

2024-10-22 Thread via GitHub


msokolov commented on PR #13910:
URL: https://github.com/apache/lucene/pull/13910#issuecomment-2429428988

   ok something like this:
   
   Dear Lucene user community,
   
   We recently uncovered a backwards compatibility bug that affects indexes 
created with version 9.0 containing KNN vector fields. Versions 9.8 - 9.12 are 
unable to search vectors in such indexes correctly and will return incorrect 
results without raising any error. We think it's likely very few if any of you 
are using 9.0 indexes, but if you are, possible mitigation steps are:
   
   * Upgrade to 10.0 or later, or
   * Do not upgrade past 9.7, or
   * If you must use an affected Lucene version (9.8-9.12) and you have 
9.0-written indexes including KNN vector fields, you must recreate those 
indexes from source with your current Lucene version.  
   





[PR] Introduce a heuristic to amortize the per-window overhead in MaxScoreBulkScorer. [lucene]

2024-10-22 Thread via GitHub


jpountz opened a new pull request, #13941:
URL: https://github.com/apache/lucene/pull/13941

   It is sometimes possible for `MaxScoreBulkScorer` to compute windows that 
don't contain many candidate matches, resulting in more time spent evaluating 
maximum scores per window than evaluating candidate matches on this window.
   
   This PR introduces a heuristic that tries to require at least 32 candidate 
matches per clause per window to amortize the per-window overhead. This results 
in a speedup for the `OrMany` task.
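
One way such a heuristic could work is sketched below. This is an assumption for illustration, not the code in this PR: `WindowSizingSketch`, `nextMinWindowSize`, and the doubling policy are invented here. The idea is that when a window yields fewer than 32 candidates per clause, the next window is forced to be larger, so the cost of computing maximum scores is spread over more candidates.

```java
/** Hypothetical window-sizing helper; not the implementation used by MaxScoreBulkScorer. */
final class WindowSizingSketch {
  static final int MIN_MATCHES_PER_CLAUSE = 32;

  /**
   * Returns the minimum size for the next window: doubled (up to maxWindowSize) when
   * the last window had too few candidates for the number of clauses, reset to 1 otherwise.
   */
  static int nextMinWindowSize(
      int currentMinWindowSize, int candidatesInLastWindow, int numClauses, int maxWindowSize) {
    long required = (long) MIN_MATCHES_PER_CLAUSE * numClauses;
    if (candidatesInLastWindow < required) {
      return (int) Math.min((long) currentMinWindowSize * 2, maxWindowSize);
    }
    return 1;
  }
}
```

For queries with many clauses, such as `OrMany`, the required candidate count per window grows with the number of clauses, which would explain why that task benefits most.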
   
   ```
                     Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
                OrHighLow        830.99   (2.8%)                   821.55   (2.0%)   -1.1% (  -5% -    3%)  0.236
          CountAndHighMed        149.53   (3.2%)                   148.06   (1.8%)   -1.0% (  -5% -    4%)  0.335
         CountAndHighHigh         49.23   (3.3%)                    48.85   (2.1%)   -0.8% (  -6% -    4%)  0.483
               OrHighRare        277.29   (5.9%)                   275.20   (5.1%)   -0.8% ( -11% -   10%)  0.728
                  LowTerm       1006.28   (2.7%)                   999.28   (2.7%)   -0.7% (  -5% -    4%)  0.512
             OrHighNotMed        461.91   (2.0%)                   459.09   (3.1%)   -0.6% (  -5% -    4%)  0.556
               AndHighMed        205.48   (2.0%)                   204.44   (2.2%)   -0.5% (  -4% -    3%)  0.547
     HighTermTitleBDVSort         20.30   (4.4%)                    20.22   (4.0%)   -0.4% (  -8% -    8%)  0.798
             OrHighNotLow        483.66   (2.2%)                   481.97   (4.3%)   -0.3% (  -6% -    6%)  0.794
            OrNotHighHigh        283.34   (2.3%)                   282.47   (2.0%)   -0.3% (  -4% -    4%)  0.714
             OrNotHighLow       1058.78   (3.5%)                  1055.94   (2.6%)   -0.3% (  -6% -    6%)  0.826
              AndHighHigh         78.53   (1.8%)                    78.33   (1.9%)   -0.3% (  -3% -    3%)  0.721
               OrHighHigh         77.35   (1.6%)                    77.23   (1.6%)   -0.2% (  -3% -    3%)  0.812
             OrNotHighMed        314.20   (2.9%)                   313.96   (2.7%)   -0.1% (  -5% -    5%)  0.944
      And2Terms2StopWords        155.15   (2.9%)                   155.07   (1.8%)   -0.0% (  -4% -    4%)  0.961
            OrHighNotHigh        285.50   (2.5%)                   285.63   (1.8%)    0.0% (  -4% -    4%)  0.958
           CountOrHighMed        104.73   (1.6%)                   104.95   (1.6%)    0.2% (  -2% -    3%)  0.744
                And3Terms        167.95   (3.2%)                   168.63   (2.6%)    0.4% (  -5% -    6%)  0.729
                   IntNRQ         90.83   (4.7%)                    91.26  (14.9%)    0.5% ( -18% -   21%)  0.913
                OrHighMed        200.80   (2.1%)                   201.78   (1.7%)    0.5% (  -3% -    4%)  0.511
        HighTermTitleSort        149.37   (2.5%)                   150.20   (2.0%)    0.6% (  -3% -    5%)  0.528
          CountOrHighHigh         49.93   (1.4%)                    50.24   (1.5%)    0.6% (  -2% -    3%)  0.270
               AndHighLow       1079.98   (2.6%)                  1086.73   (3.6%)    0.6% (  -5% -    7%)  0.613
       Or2Terms2StopWords        158.09   (4.1%)                   159.09   (2.4%)    0.6% (  -5% -    7%)  0.630
                 HighTerm        515.68   (2.2%)                   519.07   (2.6%)    0.7% (  -4% -    5%)  0.490
        HighTermMonthSort       3222.57   (3.4%)                  3244.84   (2.9%)    0.7% (  -5% -    7%)  0.576
                  MedTerm        582.99   (2.5%)                   587.15   (2.5%)    0.7% (  -4% -    5%)  0.468
                 Wildcard         82.76   (4.3%)                    83.45   (3.8%)    0.8% (  -6% -    9%)  0.599
             AndStopWords         30.49   (4.7%)                    30.77   (2.4%)    0.9% (  -5% -    8%)  0.537
    HighTermDayOfYearSort        813.54   (3.4%)                   821.97   (2.1%)    1.0% (  -4% -    6%)  0.355
                 PKLookup        272.42   (2.7%)                   275.38   (2.5%)    1.1% (  -4% -    6%)  0.288
                 Or3Terms        166.90   (4.3%)                   168.77   (2.7%)    1.1% (  -5% -    8%)  0.424
              OrStopWords         33.64   (6.5%)                    34.29   (3.2%)    1.9% (  -7% -   12%)  0.335
               TermDTSort        344.04   (6.6%)                   351.30   (5.3%)    2.1% (  -9% -   15%)  0.371
                  Prefix3        123.31   (3.5%)                   126.03   (6.6%)    2.2% (  -7% -   12%)  0.286
                CountTerm       8267.89   (4.4%)                  8628.08   (4.7%)    4.4% (  -4% -   14%)  0.014
                   OrMany         13.25   (3.7%)                    18.87   (3.7%)   42.4% (  33% -   51%)  0.000
   ```
   

