Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]

2025-03-28 Thread via GitHub


jainankitk commented on code in PR #14397:
URL: https://github.com/apache/lucene/pull/14397#discussion_r2019666086


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java:
##
@@ -512,6 +512,7 @@ private void doReset(int docID) throws IOException {
     bytes.offset = bytes.length = 0;
     for (int decompressed = 0; decompressed < totalLength; ) {
       final int toDecompress = Math.min(totalLength - decompressed, chunkSize);
+      decompressor.reset();
       decompressor.decompress(fieldsStream, toDecompress, 0, toDecompress, spare);

Review Comment:
   I am wondering if reset should be the default behavior. We can pass another 
flag to indicate reuse if possible.
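   For illustration, a minimal sketch of what that default-reset API could look like; the `reuse` flag and the class shape are hypothetical, not the PR's code:
   
   ```
   import java.io.IOException;
   import org.apache.lucene.store.IndexInput;
   import org.apache.lucene.util.BytesRef;
   
   // Sketch only: a decompressor that resets per-document state by default,
   // with an opt-in flag for callers that know reuse is safe.
   abstract class StatefulDecompressor {
     /** Drop any cached state, e.g. a cached preset dictionary. */
     abstract void reset();
   
     abstract void doDecompress(IndexInput in, int originalLength, int offset, int length, BytesRef bytes)
         throws IOException;
   
     /** Resets by default; pass reuse=true to keep cached state across calls. */
     final void decompress(IndexInput in, int originalLength, int offset, int length, BytesRef bytes, boolean reuse)
         throws IOException {
       if (reuse == false) {
         reset();
       }
       doDecompress(in, originalLength, offset, length, bytes);
     }
   }
   ```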



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java:
##
@@ -144,10 +148,85 @@ public void decompress(DataInput in, int originalLength, int offset, int length,
   assert bytes.isValid();
 }
 
+@Override
+public void decompress(
+    IndexInput in, int originalLength, int offset, int length, BytesRef bytes)
+    throws IOException {
+  assert offset + length <= originalLength;
+
+  if (length == 0) {
+    bytes.length = 0;
+    return;
+  }
+
+  final int dictLength = in.readVInt();
+  final int blockLength = in.readVInt();
+
+  final int numBlocks = readCompressedLengths(in, originalLength, dictLength, blockLength);
+
+  buffer = ArrayUtil.grow(buffer, dictLength + blockLength);
+  long startPointer = in.getFilePointer();
+  bytes.length = 0;
+  if (cachedDictFilPointer == startPointer) {
+    assert cachedDictLength == dictLength && dictEndFilePointer > 0;
+    in.seek(dictEndFilePointer);

Review Comment:
   I am wondering if we can reuse the common code across this and the other decompress method?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]

2025-03-28 Thread via GitHub


vsop-479 commented on PR #14371:
URL: https://github.com/apache/lucene/pull/14371#issuecomment-2763043900

   @gf2121 
   I also implemented `advanceExact` with vectors; there is still a slowdown. I will try to measure it on another laptop (with more vector lanes).
   
   
   
   Benchmark                                       Mode  Cnt    Score    Error   Units
   AdvanceSparseDISIBenchmark.advanceExact        thrpt   15  727.403 ± 33.060  ops/ms
   AdvanceSparseDISIBenchmark.advanceExactVector  thrpt   15  520.427 ±  0.868  ops/ms
   
   





Re: [PR] Fix HistogramCollector to not create zero-count buckets. [lucene]

2025-03-28 Thread via GitHub


jainankitk commented on code in PR #14421:
URL: https://github.com/apache/lucene/pull/14421#discussion_r2019647007


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java:
##
@@ -137,6 +140,23 @@ private void doTestSkipIndex(IndexWriterConfig cfg) throws IOException {
         IllegalStateException.class,
         () -> searcher.search(new MatchAllDocsQuery(), new HistogramCollectorManager("f", 4, 1)));
 
+    // Create a query so that bucket "1" (values from 4 to 8), which is in the middle of the range,
+    // doesn't match any docs. HistogramCollector should not add an entry with a count of 0 in this
+    // case.
+    Query query =
+        new BooleanQuery.Builder()
+            .add(NumericDocValuesField.newSlowRangeQuery("f", Long.MIN_VALUE, 2), Occur.SHOULD)
+            .add(NumericDocValuesField.newSlowRangeQuery("f", 10, Long.MAX_VALUE), Occur.SHOULD)
+            .build();
+    actualCounts = searcher.search(query, new HistogramCollectorManager("f", 4));

Review Comment:
   Initially, I was wondering if `checkMaxBuckets(collectorCounts.size(), maxBuckets)` might cause some of the earlier tests that expect an exception to fail. But all the tests with `counts[i] == 0` use `1024` as `maxBuckets`, which is comfortably above `collectorCounts.size()`.
   
   Maybe we can specify 3 as `maxBuckets` for an even stronger condition:
   
   ```
   searcher.search(query, new HistogramCollectorManager("f", 4, 3));
   ```



##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java:
##
@@ -279,7 +279,9 @@ public void collect(DocIdStream stream) throws IOException {
 public void finish() throws IOException {
   // Put counts that we computed in the int[] back into the hash map.
   for (int i = 0; i < counts.length; ++i) {
-    collectorCounts.addTo(leafMinBucket + i, counts[i]);
+    if (counts[i] != 0) {
+      collectorCounts.addTo(leafMinBucket + i, counts[i]);
+    }
   }
   checkMaxBuckets(collectorCounts.size(), maxBuckets);

Review Comment:
   While unrelated to this change, I am wondering if we should check eagerly to prevent unnecessary iterations:
   
   ```
   if (counts[i] != 0) {
     collectorCounts.addTo(leafMinBucket + i, counts[i]);
     checkMaxBuckets(collectorCounts.size(), maxBuckets);
   }
   ```
   
   We are doing similar validation in other places:
   
   ```
   if (bucket != prevBucket) {
     counts.addTo(bucket, 1);
     checkMaxBuckets(counts.size(), maxBuckets);
     prevBucket = bucket;
   }
   ```
   
   






Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]

2025-03-28 Thread via GitHub


kkewwei commented on code in PR #14397:
URL: https://github.com/apache/lucene/pull/14397#discussion_r2019696179


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java:
##
@@ -144,10 +148,85 @@ public void decompress(DataInput in, int originalLength, int offset, int length,
   assert bytes.isValid();
 }
 
+@Override
+public void decompress(
+    IndexInput in, int originalLength, int offset, int length, BytesRef bytes)
+    throws IOException {
+  assert offset + length <= originalLength;
+
+  if (length == 0) {
+    bytes.length = 0;
+    return;
+  }
+
+  final int dictLength = in.readVInt();
+  final int blockLength = in.readVInt();
+
+  final int numBlocks = readCompressedLengths(in, originalLength, dictLength, blockLength);
+
+  buffer = ArrayUtil.grow(buffer, dictLength + blockLength);
+  long startPointer = in.getFilePointer();
+  bytes.length = 0;
+  if (cachedDictFilPointer == startPointer) {
+    assert cachedDictLength == dictLength && dictEndFilePointer > 0;
+    in.seek(dictEndFilePointer);

Review Comment:
   Yes, you are right.
   
   Initially, I intended to use IndexInput to seek. But `compressedLengths` already holds the compressed length of the dictionary, so I can use it directly.









Re: [PR] For hnsw merger, do not pop from empty heap [lucene]

2025-03-28 Thread via GitHub


benwtrent merged PR #14420:
URL: https://github.com/apache/lucene/pull/14420





Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-03-28 Thread via GitHub


sgup432 commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2019729847


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -142,6 +144,14 @@ public LRUQueryCache(
 missCount = new LongAdder();
   }
 
+  public float getSkipCacheFactor() {
+return skipCacheFactor;
+  }
+
+  public void setSkipCacheFactor(float skipCacheFactor) {
+this.skipCacheFactor = skipCacheFactor;
+  }

Review Comment:
   Added



##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -125,7 +125,9 @@ public LRUQueryCache(
 this.maxSize = maxSize;
 this.maxRamBytesUsed = maxRamBytesUsed;
 this.leavesToCache = leavesToCache;
-if (skipCacheFactor >= 1 == false) { // NaN >= 1 evaluates false
+if (skipCacheFactor < 1) { //
+  // NaN >= 1
+  // evaluates false

Review Comment:
   Reverted






Re: [I] Examine the affects of MADV_RANDOM when MGLRU is enabled in Linux kernel [lucene]

2025-03-28 Thread via GitHub


rmuir commented on issue #14408:
URL: https://github.com/apache/lucene/issues/14408#issuecomment-2755385165

   A better default would also be possible, I think, if individual codecs such as vectors did not set this directly, but instead allowed the Directory class to be in control.
   
   The user/Directory is only partially in control now: there is an `org.apache.lucene.store.defaultReadAdvice` system property, but this codec just ignores it, which I don't think is good. So a user has to write a custom Directory and mess around with IOContext to correct the issue, which isn't obvious.
   
   Let the defaults be as smart as they need to be. Maybe check `/sys/kernel/mm/lru_gen/enabled` as part of the decision-making! But IMO let the user have the final say, in an easy way.
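   A minimal sketch (not Lucene code) of how such a check could read the MGLRU switch; the path and its meaning come from the comment above, everything else is an assumption:
   
   ```
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   
   final class MglruCheck {
     /** Returns true if the kernel reports multi-generational LRU as enabled. */
     static boolean mglruEnabled() {
       Path p = Path.of("/sys/kernel/mm/lru_gen/enabled");
       try {
         // The file holds a hex bitmask, e.g. "0x0007"; non-zero means enabled.
         String v = Files.readString(p).trim();
         return Long.decode(v) != 0;
       } catch (IOException | NumberFormatException e) {
         return false; // not Linux, or MGLRU not present
       }
     }
   }
   ```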
   
   





Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]

2025-03-28 Thread via GitHub


rmuir commented on code in PR #14388:
URL: https://github.com/apache/lucene/pull/14388#discussion_r2008149095


##
lucene/expressions/src/generated/checksums/generateAntlr.json:
##
@@ -1,7 +1,8 @@
 {
 "lucene/expressions/src/java/org/apache/lucene/expressions/js/Javascript.g4": "818e89aae0b6c7601051802013898c128fe7c1ba",
 "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptBaseVisitor.java": "6965abdb8b069aaceac1ce4f32ed965b194f3a25",
-"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "b8d6b259ebbfce09a5379a1a2aa4c1ddd4e378eb",
-"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "7a3a7b9de17f4a8d41ef342312eae5c55e483e08",
-"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e"
+"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "6508dc5008e96a1ad28c967a3401407ba83f140b",
+"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "ba6d0c00af113f115fc7a1f165da7726afb2e8c5",
+"lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e",
+"property:antlr-version": "4.13.2"

Review Comment:
   It's fine, thanks for making it easier to bump these deps. Would love it if we could get dependabot or renovatebot to send us PRs for this stuff. With the latter I've done something similar with `postUpgradeCommand`: we'd have to run their docker container in a scheduled build or something, and not use their "service", though.






Re: [PR] For hnsw merger, do not pop from empty heap [lucene]

2025-03-28 Thread via GitHub


tveasey commented on PR #14420:
URL: https://github.com/apache/lucene/pull/14420#issuecomment-2761706401

   I discussed this a bit with Mayya.
   
   So the underlying issue is how we account for the gain for degree-1 vertices. We say that the gain contribution from any vertex is at least 2, i.e. $\max\left(2, \frac{|N_s(v)|}{4}\right)$. This forces us to insert degree-1 vertices, which was intentional. The problem is we can get at most a +1 contribution from them to $g_{tot}$. Mayya tried it, and when we correct the gain to $\min\left(\max\left(2, \frac{|N_s(v)|}{4}\right), |N_s(v)|\right)$ we terminate before exhausting the heap, as expected.
   
   Arguably this is slightly more conceptually correct. However, it represents a change in behaviour and invalidates the testing Mayya has done. The current behaviour increases the relative size of $J$ for graphs with many degree-one vertices. This seems conceptually OK. In the case the heap is exhausted we've just picked $J = V$, the set of all vertices, and reverted to the original merge procedure, which is clearly harmless. However, this is going to be an extreme edge case, since we normally only have a small fraction of degree-one vertices and we'll nearly always terminate on $g_{tot}$.

Re: [I] New merging hnsw failures with BP policy [lucene]

2025-03-28 Thread via GitHub


benwtrent closed issue #14407: New merging hnsw failures with BP policy
URL: https://github.com/apache/lucene/issues/14407





Re: [PR] For hnsw merger, do not pop from empty heap [lucene]

2025-03-28 Thread via GitHub


mayya-sharipova commented on PR #14420:
URL: https://github.com/apache/lucene/pull/14420#issuecomment-2761615009

   @benwtrent Please go ahead with the fix. Confirmed with @tveasey, the algorithm's author.
   
   But we will also look further into it. In this particular test, the graphs are unusual (small graphs with 65 and 73 connections).





Re: [PR] For hnsw merger, do not pop from empty heap [lucene]

2025-03-28 Thread via GitHub


benwtrent commented on PR #14420:
URL: https://github.com/apache/lucene/pull/14420#issuecomment-2761394895

   > Does this case actually only happen when docs are reordered, or is it a general edge case that we happened to find on a test run when docs happened to be reordered?
   
   I think it's a general case. BP exposed it.
   
   My concern, and I'm waiting for @mayya-sharipova to clarify, is whether we want to return the set if our heap got exhausted, or consider it "incomplete" and return an empty set instead.





Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]

2025-03-28 Thread via GitHub


thecoop commented on code in PR #14419:
URL: https://github.com/apache/lucene/pull/14419#discussion_r2018614545


##
lucene/core/src/test/org/apache/lucene/internal/vectorization/TestVectorUtilSupport.java:
##
@@ -210,9 +210,13 @@ public void testMinMaxScalarQuantize() {
   }
 
   private void assertFloatReturningProviders(ToDoubleFunction<VectorUtilSupport> func) {
-    assertThat(
-        func.applyAsDouble(PANAMA_PROVIDER.getVectorUtilSupport()),
-        closeTo(func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()), delta));
+    double luceneValue = func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport());

Review Comment:
   I think `closeTo` gives better error messages (specifying the delta in the error message, rather than just giving the two values and letting you work it out). It's also a clearer assert specification in code.
   
   But yeah, I hadn't spotted that it doesn't handle NaN in the same way.






Re: [PR] Make DenseConjunctionBulkScorer align scoring windows with #docIDRunEnd(). [lucene]

2025-03-28 Thread via GitHub


jpountz commented on code in PR #14400:
URL: https://github.com/apache/lucene/pull/14400#discussion_r2018633191


##
lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:
##
@@ -171,37 +171,36 @@ private int scoreWindow(
   }
 }
 
-if (acceptDocs == null) {
-  int minDocIDRunEnd = max;
-  for (DisiWrapper w : iterators) {
-    if (w.docID() > min) {
-      minDocIDRunEnd = min;
-      break;
-    } else {
-      minDocIDRunEnd = Math.min(minDocIDRunEnd, w.docIDRunEnd());
-    }
-  }
-
-  if (minDocIDRunEnd - min >= WINDOW_SIZE / 2) {
-    // We have a large range of doc IDs that all match.
-    rangeDocIdStream.from = min;
-    rangeDocIdStream.to = minDocIDRunEnd;
-    collector.collect(rangeDocIdStream);
-    return minDocIDRunEnd;
-  }
-}
-
-int bitsetWindowMax = (int) Math.min(max, (long) min + WINDOW_SIZE);
-
+// Partition clauses of the conjunction into:
+//  - clauses that don't fully match the first half of the window and get evaluated via
+//    #loadIntoBitSet or leaf-frog,
+//  - other clauses that are used to compute the greatest possible window size that they fully
+//    match.
+// This logic helps align scoring windows with the natural #docIDRunEnd() boundaries of the
+// data, which helps evaluate fewer clauses per window - without allowing windows to become too
+// small thanks to the WINDOW_SIZE/2 threshold.
+int minDocIDRunEnd = max;
 for (DisiWrapper w : iterators) {
-  if (w.docID() > min || w.docIDRunEnd() < bitsetWindowMax) {
+  int docIdRunEnd = w.docIDRunEnd();
+  if (w.docID() > min || (docIdRunEnd - min) < WINDOW_SIZE / 2) {

Review Comment:
   I like it, I'll apply this suggestion.






Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-28 Thread via GitHub


gsmiller commented on code in PR #14273:
URL: https://github.com/apache/lucene/pull/14273#discussion_r2018769975


##
lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java:
##
@@ -204,6 +205,40 @@ public int cardinality() {
     return Math.toIntExact(tot);
   }
 
+  /**
+   * Return the number of set bits between indexes {@code from} inclusive and {@code to} exclusive.
+   */
+  public int cardinality(int from, int to) {
+    Objects.checkFromToIndex(from, to, length());
+
+    int cardinality = 0;
+
+    // First, align `from` with a word start, ie. a multiple of Long.SIZE (64)
+    if ((from & 0x3F) != 0) {
+      long bits = this.bits[from >> 6] >>> from;
+      int numBitsTilNextWord = -from & 0x3F;
+      if (to - from < numBitsTilNextWord) {
+        bits &= (1L << (to - from)) - 1L;
+        return Long.bitCount(bits);
+      }
+      cardinality += Long.bitCount(bits);
+      from += numBitsTilNextWord;
+      assert (from & 0x3F) == 0;
+    }
+
+    for (int i = from >> 6, end = to >> 6; i < end; ++i) {
+      cardinality += Long.bitCount(bits[i]);
+    }
+
+    // Now handle bits between the last complete word and to
+    if ((to & 0x3F) != 0) {
+      long bits = this.bits[to >> 6] << -to;

Review Comment:
   Thanks for the detailed explanation. This makes sense to me; it just left me scratching my head for a bit initially to figure out the intention behind the different approaches :)






[PR] Fix HistogramCollector to not create zero-count buckets. [lucene]

2025-03-28 Thread via GitHub


jpountz opened a new pull request, #14421:
URL: https://github.com/apache/lucene/pull/14421

   If a bucket in the middle of the range doesn't match any docs, it would be returned with a count of zero. Better not to return it at all.





Re: [PR] PointInSetQuery use reverse collection to improve performance [lucene]

2025-03-28 Thread via GitHub


github-actions[bot] commented on PR #14352:
URL: https://github.com/apache/lucene/pull/14352#issuecomment-2762926172

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [I] Leverage sparse doc value indexes for range and value facet collection [lucene]

2025-03-28 Thread via GitHub


epotyom commented on issue #14406:
URL: https://github.com/apache/lucene/issues/14406#issuecomment-2761891472

   We might still be able to benefit from the skipper in the sandbox facets module. In the FacetCutter for long ranges (https://github.com/apache/lucene/blob/b5d13e29f1431ba30ae7df43c89def87e1677db0/lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/cutters/ranges/LongRangeFacetCutter.java#L247) we can reuse the elementary interval ordinal for all documents within a block of docs that falls entirely within the interval. It won't be as fast as collecting a DocIdStream, but it might still make things faster.





[PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]

2025-03-28 Thread via GitHub


iverase opened a new pull request, #14419:
URL: https://github.com/apache/lucene/pull/14419

   I noticed the following error in CI:
   
   ```
   ./gradlew test --tests TestVectorUtilSupport.testBinaryVectors -Dtests.seed=B12D50704230E803 -Dtests.locale=ce -Dtests.timezone=SystemV/AST4 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   
     > java.lang.AssertionError:
      > Expected: a numeric value within <1.0E-5> of 
      >  but:  differed by  more than delta <1.0E-5>
      > at __randomizedtesting.SeedInfo.seed([B12D50704230E803:FF2F6188C28072F0]:0)
      > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
      > at org.junit.Assert.assertThat(Assert.java:964)
      > at org.junit.Assert.assertThat(Assert.java:930)
      > at org.apache.lucene.internal.vectorization.TestVectorUtilSupport.assertFloatReturningProviders(TestVectorUtilSupport.java:213)
      > at org.apache.lucene.internal.vectorization.TestVectorUtilSupport.testBinaryVectors(TestVectorUtilSupport.java:71)
      > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
   ```
   
   To avoid this, this commit makes `assertFloatReturningProviders` resilient to NaN results.
   





Re: [PR] removing constructor with deprecated attribute 'onlyLongestMatch [lucene]

2025-03-28 Thread via GitHub


renatoh commented on PR #14356:
URL: https://github.com/apache/lucene/pull/14356#issuecomment-2760436593

   @rmuir Please have a look at it when you have the chance. Thanks.





Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]

2025-03-28 Thread via GitHub


vsop-479 commented on PR #14371:
URL: https://github.com/apache/lucene/pull/14371#issuecomment-2760720322

   @gf2121 
   I implemented the `VectorMask` approach. There is still a slowdown. I think the reason is my laptop (Mac M2).
   
   Benchmark                                  Mode  Cnt    Score    Error   Units
   AdvanceSparseDISIBenchmark.advance        thrpt   15  654.472 ±  2.349  ops/ms
   AdvanceSparseDISIBenchmark.advanceVector  thrpt   15  498.590 ± 66.751  ops/ms
   





Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]

2025-03-28 Thread via GitHub


iverase commented on code in PR #14419:
URL: https://github.com/apache/lucene/pull/14419#discussion_r2018525082


##
lucene/core/src/test/org/apache/lucene/internal/vectorization/TestVectorUtilSupport.java:
##
@@ -210,9 +210,13 @@ public void testMinMaxScalarQuantize() {
   }
 
   private void assertFloatReturningProviders(ToDoubleFunction<VectorUtilSupport> func) {
-    assertThat(
-        func.applyAsDouble(PANAMA_PROVIDER.getVectorUtilSupport()),
-        closeTo(func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()), delta));
+    double luceneValue = func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport());

Review Comment:
   I updated the code to use `assertEquals`






Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-03-28 Thread via GitHub


sgup432 commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2019087636


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -99,7 +100,7 @@ public class LRUQueryCache implements QueryCache, Accountable {
   private final Map<IndexReader.CacheKey, LeafCache> cache;
   private final ReentrantReadWriteLock.ReadLock readLock;
   private final ReentrantReadWriteLock.WriteLock writeLock;
-  private final float skipCacheFactor;
+  private final AtomicReference<Float> skipCacheFactor;

Review Comment:
   Yeah, considering we are just doing set/get, volatile should be enough.
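   For illustration, a minimal sketch of the volatile-field approach, with the javadocs requested elsewhere in this thread; the wording and validation message are assumptions, not the PR's exact code:
   
   ```
   // Sketch only: a volatile float field replaces the AtomicReference discussed above.
   private volatile float skipCacheFactor;
   
   /** Returns the current skip-cache factor. */
   public float getSkipCacheFactor() {
     return skipCacheFactor;
   }
   
   /**
    * Dynamically updates the skip-cache factor. Volatile semantics make the new
    * value visible to concurrent readers without further synchronization.
    */
   public void setSkipCacheFactor(float skipCacheFactor) {
     if (skipCacheFactor >= 1 == false) { // NaN >= 1 evaluates false, so NaN keeps being rejected
       throw new IllegalArgumentException("skipCacheFactor must be no less than 1, got: " + skipCacheFactor);
     }
     this.skipCacheFactor = skipCacheFactor;
   }
   ```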






Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-03-28 Thread via GitHub


gf2121 commented on code in PR #14401:
URL: https://github.com/apache/lucene/pull/14401#discussion_r2018024992


##
lucene/core/src/java/org/apache/lucene/search/LeafCollector.java:
##
@@ -83,6 +84,21 @@ public interface LeafCollector {
    */
   void collect(int doc) throws IOException;
 
+  /**
+   * Collect a range of doc IDs, between {@code min} inclusive and {@code max} exclusive.
+   *
+   * <p>Extending this method is typically useful to take advantage of pre-aggregated data exposed
+   * in a {@link DocValuesSkipper}.
+   *
+   * <p>The default implementation calls {@link #collect(DocIdStream)} on a {@link DocIdStream} that

Review Comment:
   Maybe clarify whether we have any guarantee on the given values, like `max > min`?



##
lucene/core/src/java/org/apache/lucene/search/RangeDocIdStream.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+
+final class RangeDocIdStream extends DocIdStream {
+
+  private final int min, max;
+
+  RangeDocIdStream(int min, int max) {
+if (max < min) {

Review Comment:
   Do we intend to allow `min==max`, which is actually an empty range? We need 
to update this or the exception message anyway.






Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]

2025-03-28 Thread via GitHub


jainankitk commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2760401254

   One of the failing checks is:
   
   ```
   --
   1. ERROR in /home/runner/work/lucene/lucene/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/QueryProfilerBreakdown.java (at line 63)
       currentThreadId, ctx -> new QuerySliceProfilerBreakdown());
                        ^^^
   The value of the lambda parameter ctx is not used
   --
   1 problem (1 error)
   
   > Task :lucene:sandbox:ecjLintMain FAILED
   ```
   
   I am wondering if there is a workaround for this? One option is to use `putIfAbsent`, which doesn't require a function as input, but then we'd need to explicitly `get` before returning.





[PR] For hnsw merger, do not pop from empty heap [lucene]

2025-03-28 Thread via GitHub


benwtrent opened a new pull request, #14420:
URL: https://github.com/apache/lucene/pull/14420

   It seems there are edge cases when the heap is empty. This prevents us from 
attempting to pop from an empty heap.
   
   This bug fix should go into 10.2 to make the 10.2.0 release.
   
   closes: https://github.com/apache/lucene/issues/14407





Re: [PR] For hnsw merger, do not pop from empty heap [lucene]

2025-03-28 Thread via GitHub


benwtrent commented on PR #14420:
URL: https://github.com/apache/lucene/pull/14420#issuecomment-2761161246

   //cc @iverase as you are handling the 10.2 release.





Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-03-28 Thread via GitHub


jpountz commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2019386671


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -125,7 +125,9 @@ public LRUQueryCache(
 this.maxSize = maxSize;
 this.maxRamBytesUsed = maxRamBytesUsed;
 this.leavesToCache = leavesToCache;
-if (skipCacheFactor >= 1 == false) { // NaN >= 1 evaluates false
+if (skipCacheFactor < 1) { //
+  // NaN >= 1
+  // evaluates false

Review Comment:
   Please revert so that we keep rejecting skipCacheFactor=Float.NaN?



##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -142,6 +144,14 @@ public LRUQueryCache(
 missCount = new LongAdder();
   }
 
+  public float getSkipCacheFactor() {
+return skipCacheFactor;
+  }
+
+  public void setSkipCacheFactor(float skipCacheFactor) {
+this.skipCacheFactor = skipCacheFactor;
+  }

Review Comment:
   Can you add javadocs?






Re: [PR] Make DenseConjunctionBulkScorer align scoring windows with #docIDRunEnd(). [lucene]

2025-03-28 Thread via GitHub


jpountz merged PR #14400:
URL: https://github.com/apache/lucene/pull/14400





Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-03-28 Thread via GitHub


gsmiller commented on PR #14401:
URL: https://github.com/apache/lucene/pull/14401#issuecomment-2761478694

   > I don't think so, or rather taking advantage of range collection shouldn't 
help more than what https://github.com/apache/lucene/pull/14273 does with 
RangeDocIdStream?
   
   My thinking here was that `HistogramCollector` should benefit from any 
scorers that can provide it with a `DocIdStream` for collection, and that this 
change lays the groundwork for more scorers to pass streams to collectors 
instead of individual docs (specifically thinking about some of the query 
use-cases you mention in the description). I probably should have been more 
clear :) (and maybe I'm still getting confused and this isn't true...)





Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-03-28 Thread via GitHub


jpountz commented on PR #14401:
URL: https://github.com/apache/lucene/pull/14401#issuecomment-2761651163

   Ah, that's right. We have a good number of queries that are already covered; in my opinion the next natural step is to look into making disjunctions collect ranges when any clause would collect a range.





Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-03-28 Thread via GitHub


jpountz commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2018865956


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -122,12 +123,30 @@ public LRUQueryCache(
       long maxRamBytesUsed,
       Predicate<LeafReaderContext> leavesToCache,
       float skipCacheFactor) {
+    this(maxSize, maxRamBytesUsed, leavesToCache, new AtomicReference<>(skipCacheFactor));
+  }
+
+  /**
+   * Additionally, allows the ability to pass skipCacheFactor in form of AtomicReference where the
+   * caller can dynamically update (in a thread safe way) its value by calling skipCacheFactor.set()
+   * on their end.
+   */
+  public LRUQueryCache(
+      int maxSize,
+      long maxRamBytesUsed,
+      Predicate<LeafReaderContext> leavesToCache,
+      AtomicReference<Float> skipCacheFactor) {

Review Comment:
   I don't think that this is a reason for exposing it via an AtomicReference 
instead of a getter/setter. TieredMergePolicy, to give an example, is the same: 
many of its setters are not exposed on MergePolicy.






Re: [PR] PointInSetQuery clips segments by lower and upper [lucene]

2025-03-28 Thread via GitHub


gsmiller commented on code in PR #14268:
URL: https://github.com/apache/lucene/pull/14268#discussion_r2018814824


##
lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java:
##
@@ -108,6 +110,8 @@ protected PointInSetQuery(String field, int numDims, int bytesPerDim, Stream pac
   }
   if (previous == null) {
     previous = new BytesRefBuilder();
+    lowerPoint = new byte[bytesPerDim * numDims];
+    System.arraycopy(current.bytes, current.offset, lowerPoint, 0, lowerPoint.length);

Review Comment:
   minor: I think it'd be slightly more readable if you used `current.length` here instead of `lowerPoint.length` (and I might also throw an `assert lowerPoint.length == current.length` immediately before this line to make it clear they should be equal).



##
lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java:
##
@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
       return null;
     }
 
+    if (values.getDocCount() == 0) {
+      return null;
+    } else if (lowerPoint != null && upperPoint != null) {

Review Comment:
   Should it be true that either 1) both of these are null, or 2) both are non-null? I think so, right? If that's right, I would check `lowerPoint != null` and then put an `assert upperPoint != null` in the condition branch.



##
lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java:
##
@@ -122,6 +126,11 @@ protected PointInSetQuery(String field, int numDims, int bytesPerDim, Stream pac
     }
     sortedPackedPoints = builder.finish();
     sortedPackedPointsHashCode = sortedPackedPoints.hashCode();
+    if (previous != null) {
+      BytesRef max = previous.get();
+      upperPoint = new byte[bytesPerDim * numDims];
+      System.arraycopy(max.bytes, max.offset, upperPoint, 0, upperPoint.length);

Review Comment:
   minor: same comment here about using `previous.length` in place of 
`upperPoint.length`. I don't feel strongly about this so please feel free to 
disagree if you have a different perspective. 



##
lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java:
##
@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
       return null;
     }
 
+    if (values.getDocCount() == 0) {

Review Comment:
   I'm not 100% sure of this, but I don't think it's possible to get back a non-null instance from `reader#getPointValues` that has a zero doc count. I believe you'll always get back a null instance if the points field has no docs in a segment. Can you confirm with a test and/or some debugging?






Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-03-28 Thread via GitHub


jpountz commented on PR #14401:
URL: https://github.com/apache/lucene/pull/14401#issuecomment-2761666414

   Any opinion on `collect(int min, int max)` vs. `collectRange(int min, int max)`? I leaned towards `collectRange`, since we already have `collect(int doc)` and it wouldn't be obvious from the parameter types whether `collect(int, int)` is collecting a range or two random docs. No strong feeling either way, though. `collect(DocIdStream)` is called "collect" rather than "collectDocIdStream", so I guess that `collect(int min, int max)` would be more consistent from this perspective.








Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-03-28 Thread via GitHub


jpountz commented on code in PR #14401:
URL: https://github.com/apache/lucene/pull/14401#discussion_r2018638902


##
lucene/core/src/java/org/apache/lucene/search/RangeDocIdStream.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+
+final class RangeDocIdStream extends DocIdStream {
+
+  private final int min, max;
+
+  RangeDocIdStream(int min, int max) {
+if (max < min) {

Review Comment:
   This is a good question; I had overlooked it. While I don't think that empty ranges would cause issues for any impl, I think we should reject them. I'll update the PR.
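   For illustration, a hedged sketch of what the tightened check might look like (the exact message text is an assumption, not the PR's final code):
   
   ```
   RangeDocIdStream(int min, int max) {
     // Reject empty ranges too: require min < max rather than min <= max.
     if (max <= min) {
       throw new IllegalArgumentException("min = " + min + " must be < max = " + max);
     }
     this.min = min;
     this.max = max;
   }
   ```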






Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-03-28 Thread via GitHub


gsmiller commented on PR #14401:
URL: https://github.com/apache/lucene/pull/14401#issuecomment-2761941694

   I prefer `collectRange` as well to make usage a little less error-prone. I 
don't have a strong opinion though.





Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-03-28 Thread via GitHub


sgup432 commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2019040588


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -122,12 +123,30 @@ public LRUQueryCache(
       long maxRamBytesUsed,
       Predicate<LeafReaderContext> leavesToCache,
       float skipCacheFactor) {
+    this(maxSize, maxRamBytesUsed, leavesToCache, new AtomicReference<>(skipCacheFactor));
+  }
+
+  /**
+   * Additionally, allows the ability to pass skipCacheFactor in form of AtomicReference where the
+   * caller can dynamically update (in a thread safe way) its value by calling skipCacheFactor.set()
+   * on their end.
+   */
+  public LRUQueryCache(
+      int maxSize,
+      long maxRamBytesUsed,
+      Predicate<LeafReaderContext> leavesToCache,
+      AtomicReference<Float> skipCacheFactor) {

Review Comment:
   OK, I will change it to a getter/setter then.






Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]

2025-03-28 Thread via GitHub


iverase merged PR #14419:
URL: https://github.com/apache/lucene/pull/14419





Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]

2025-03-28 Thread via GitHub


jpountz commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2761334037

   You just need to replace `ctx` with `_`.
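   For illustration, a sketch of the suggested fix under stated assumptions: the map name `profileBreakdownForSlice` is hypothetical (the error output above only shows the lambda), and unnamed lambda parameters need JDK 21 with preview features or JDK 22+:
   
   ```
   // The unnamed parameter `_` silences ecj's unused-lambda-parameter lint.
   profileBreakdownForSlice.computeIfAbsent(currentThreadId, _ -> new QuerySliceProfilerBreakdown());
   ```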





Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-28 Thread via GitHub


jpountz commented on code in PR #14273:
URL: https://github.com/apache/lucene/pull/14273#discussion_r2018661757


##
lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java:
##
@@ -204,6 +205,40 @@ public int cardinality() {
     return Math.toIntExact(tot);
   }
 
+  /**
+   * Return the number of set bits between indexes {@code from} inclusive and {@code to} exclusive.
+   */
+  public int cardinality(int from, int to) {
+    Objects.checkFromToIndex(from, to, length());
+
+    int cardinality = 0;
+
+    // First, align `from` with a word start, ie. a multiple of Long.SIZE (64)
+    if ((from & 0x3F) != 0) {
+      long bits = this.bits[from >> 6] >>> from;
+      int numBitsTilNextWord = -from & 0x3F;
+      if (to - from < numBitsTilNextWord) {
+        bits &= (1L << (to - from)) - 1L;
+        return Long.bitCount(bits);
+      }
+      cardinality += Long.bitCount(bits);
+      from += numBitsTilNextWord;
+      assert (from & 0x3F) == 0;
+    }
+
+    for (int i = from >> 6, end = to >> 6; i < end; ++i) {
+      cardinality += Long.bitCount(bits[i]);
+    }
+
+    // Now handle bits between the last complete word and to
+    if ((to & 0x3F) != 0) {
+      long bits = this.bits[to >> 6] << -to;

Review Comment:
   You are right. We need the low `to % 64` bits of `bits[to >> 6]`. `x << -to` is nice because it requires fewer instructions, but it also moves bits to a different index. This makes iterating over set bits slightly more cumbersome, but it doesn't matter for counting bits. That's why `forEach` applies a mask instead. I haven't checked actual performance; I suspect it doesn't matter in practice.
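   A small self-contained check of the equivalence described above, for counting (not iterating); everything here is local to the sketch:
   
   ```
   public class BitCountCheck {
     public static void main(String[] args) {
       long word = 0xDEADBEEFCAFEBABEL;
       int to = 37; // any index with (to & 0x3F) != 0
       // The shift keeps the low (to % 64) bits but moves them to the top of the word...
       long shifted = word << -to; // same as word << (64 - (to & 0x3F))
       // ...while the mask keeps the same bits in place, which is friendlier for iteration.
       long masked = word & ((1L << (to & 0x3F)) - 1);
       System.out.println(Long.bitCount(shifted) == Long.bitCount(masked)); // true
     }
   }
   ```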






Re: [PR] For hnsw merger, do not pop from empty heap [lucene]

2025-03-28 Thread via GitHub


jpountz commented on PR #14420:
URL: https://github.com/apache/lucene/pull/14420#issuecomment-2761367741

   Does this case actually only happen when docs are reordered, or is it a 
general edge case that we happened to find on a test run when docs happened to 
be reordered?





Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-03-28 Thread via GitHub


sgup432 commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2019109207


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -142,6 +161,10 @@ public LRUQueryCache(
     missCount = new LongAdder();
   }
 
+  AtomicReference<Float> getSkipCacheFactor() {
+    return skipCacheFactor;
+  }

Review Comment:
   Done






Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-03-28 Thread via GitHub


sgup432 commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2019115721


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -269,8 +292,8 @@ boolean requiresEviction() {
   }
 
   CacheAndCount get(Query key, IndexReader.CacheHelper cacheHelper) {
-    assert key instanceof BoostQuery == false;
-    assert key instanceof ConstantScoreQuery == false;
+    assert !(key instanceof BoostQuery);
+    assert !(key instanceof ConstantScoreQuery);

Review Comment:
   Changed it back.






Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]

2025-03-28 Thread via GitHub


gf2121 commented on code in PR #14419:
URL: https://github.com/apache/lucene/pull/14419#discussion_r2018496277


##
lucene/core/src/test/org/apache/lucene/internal/vectorization/TestVectorUtilSupport.java:
##
@@ -210,9 +210,13 @@ public void testMinMaxScalarQuantize() {
   }
 
   private void assertFloatReturningProviders(ToDoubleFunction<VectorUtilSupport> func) {
-    assertThat(
-        func.applyAsDouble(PANAMA_PROVIDER.getVectorUtilSupport()),
-        closeTo(func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()), delta));
+    double luceneValue = func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport());

Review Comment:
   I was a bit surprised that we did not have a good utility to deal with double equality, so I grepped for `1e-5` in the code base. It seems we used to use `public static void assertEquals(double expected, double actual, double delta)` for this kind of assertion. It deals with `NaN` well.
   
   Do we have a special reason to use `closeTo` here?
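   For illustration, a tiny sketch of the difference under stated assumptions: JUnit 4's `Assert.assertEquals(double, double, double)` first compares via `Double.compare`, which treats two `NaN`s as equal, while Hamcrest's `closeTo` computes `|actual - expected| <= delta`, which is false for `NaN`:
   
   ```
   import org.junit.Assert;
   import org.junit.Test;
   
   public class NaNAssertionSketch {
     @Test
     public void nanHandling() {
       // Passes: JUnit checks Double.compare(expected, actual) == 0 first, and NaN equals NaN there.
       Assert.assertEquals(Double.NaN, Double.NaN, 1e-5);
       // Would fail: |NaN - NaN| <= 1e-5 is false, so closeTo never matches NaN.
       // Assert.assertThat(Double.NaN, org.hamcrest.Matchers.closeTo(Double.NaN, 1e-5));
     }
   }
   ```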


