Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]
jainankitk commented on code in PR #14397: URL: https://github.com/apache/lucene/pull/14397#discussion_r2019666086

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java:

@@ -512,6 +512,7 @@ private void doReset(int docID) throws IOException {
   bytes.offset = bytes.length = 0;
   for (int decompressed = 0; decompressed < totalLength; ) {
     final int toDecompress = Math.min(totalLength - decompressed, chunkSize);
+    decompressor.reset();
     decompressor.decompress(fieldsStream, toDecompress, 0, toDecompress, spare);

Review Comment: I am wondering if reset should be the default behavior. We can pass another flag to indicate reuse if possible.

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java:

@@ -144,10 +148,85 @@ public void decompress(DataInput in, int originalLength, int offset, int length,
   assert bytes.isValid();
 }

+@Override
+public void decompress(
+    IndexInput in, int originalLength, int offset, int length, BytesRef bytes)
+    throws IOException {
+  assert offset + length <= originalLength;
+
+  if (length == 0) {
+    bytes.length = 0;
+    return;
+  }
+
+  final int dictLength = in.readVInt();
+  final int blockLength = in.readVInt();
+
+  final int numBlocks = readCompressedLengths(in, originalLength, dictLength, blockLength);
+
+  buffer = ArrayUtil.grow(buffer, dictLength + blockLength);
+  long startPointer = in.getFilePointer();
+  bytes.length = 0;
+  if (cachedDictFilPointer == startPointer) {
+    assert cachedDictLength == dictLength && dictEndFilePointer > 0;
+    in.seek(dictEndFilePointer);

Review Comment: I am wondering if we can reuse the common code across this and the other decompress method?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
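The reset-by-default idea above can be sketched as follows. This is a hypothetical shape only: `ResettableDecompressor`, the `reuse` flag, and the method names are all made up for illustration and are not the actual Lucene `Decompressor` API.

```java
// Hypothetical sketch of "reset is the default, reuse is opt-in"; none of
// these names are the real Lucene Decompressor API.
public class ResettableDecompressor {

  private int resetCount = 0;
  private boolean hasState = false;

  // Callers that know the previous state (e.g. a cached preset dictionary)
  // is still valid pass reuse=true; everyone else gets a clean slate.
  public void decompress(byte[] compressed, boolean reuse) {
    if (!reuse || !hasState) {
      reset();
    }
    hasState = true;
    // ... actual decompression would happen here ...
  }

  private void reset() {
    resetCount++;
    hasState = false;
  }

  public int resetCount() {
    return resetCount;
  }

  public static void main(String[] args) {
    ResettableDecompressor d = new ResettableDecompressor();
    d.decompress(new byte[0], false); // resets
    d.decompress(new byte[0], true);  // reuses cached state, no reset
    d.decompress(new byte[0], false); // resets again
    System.out.println(d.resetCount()); // prints 2
  }
}
```

With this shape, forgetting the flag errs on the safe side (a reset), which matches the suggestion that reset be the default behavior.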
Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]
vsop-479 commented on PR #14371: URL: https://github.com/apache/lucene/pull/14371#issuecomment-2763043900

@gf2121 I also implemented `advanceExact` with vectors, and there is still a slowdown. I will try to measure it on another laptop (with more vector lanes).

Benchmark                                       Mode  Cnt    Score    Error   Units
AdvanceSparseDISIBenchmark.advanceExact        thrpt   15  727.403 ± 33.060  ops/ms
AdvanceSparseDISIBenchmark.advanceExactVector  thrpt   15  520.427 ±  0.868  ops/ms
Re: [PR] Fix HistogramCollector to not create zero-count buckets. [lucene]
jainankitk commented on code in PR #14421: URL: https://github.com/apache/lucene/pull/14421#discussion_r2019647007

## lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java:

@@ -137,6 +140,23 @@ private void doTestSkipIndex(IndexWriterConfig cfg) throws IOException {
   IllegalStateException.class,
   () -> searcher.search(new MatchAllDocsQuery(), new HistogramCollectorManager("f", 4, 1)));
+// Create a query so that bucket "1" (values from 4 to 8), which is in the middle of the range,
+// doesn't match any docs. HistogramCollector should not add an entry with a count of 0 in this
+// case.
+Query query =
+    new BooleanQuery.Builder()
+        .add(NumericDocValuesField.newSlowRangeQuery("f", Long.MIN_VALUE, 2), Occur.SHOULD)
+        .add(NumericDocValuesField.newSlowRangeQuery("f", 10, Long.MAX_VALUE), Occur.SHOULD)
+        .build();
+actualCounts = searcher.search(query, new HistogramCollectorManager("f", 4));

Review Comment: Initially, I was wondering if `checkMaxBuckets(collectorCounts.size(), maxBuckets)` might cause some of the earlier tests expecting an exception to fail. But all the tests with `counts[i] == 0` use `1024` as `maxBuckets`, which is comfortably over `collectorCounts.size()`. Maybe we can specify 3 as `maxBuckets` for an even stronger condition:

```
searcher.search(query, new HistogramCollectorManager("f", 4, 3));
```

## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java:

@@ -279,7 +279,9 @@ public void collect(DocIdStream stream) throws IOException {
 public void finish() throws IOException {
   // Put counts that we computed in the int[] back into the hash map.
   for (int i = 0; i < counts.length; ++i) {
-    collectorCounts.addTo(leafMinBucket + i, counts[i]);
+    if (counts[i] != 0) {
+      collectorCounts.addTo(leafMinBucket + i, counts[i]);
+    }
   }
   checkMaxBuckets(collectorCounts.size(), maxBuckets);

Review Comment: While unrelated to this change, I am wondering if we should check eagerly to prevent unnecessary iterations:

```
if (counts[i] != 0) {
  collectorCounts.addTo(leafMinBucket + i, counts[i]);
  checkMaxBuckets(collectorCounts.size(), maxBuckets);
}
```

We are doing similar validation in other places:

```
if (bucket != prevBucket) {
  counts.addTo(bucket, 1);
  checkMaxBuckets(counts.size(), maxBuckets);
  prevBucket = bucket;
}
```
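The eager check suggested above can be exercised standalone. In the sketch below, `merge` and `checkMaxBuckets` are stand-ins that mirror the PR's logic over a plain `HashMap`; they are not the actual sandbox classes.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the eager max-buckets check proposed above; the names
// mirror the PR but these are stand-ins, not the actual Lucene classes.
public class EagerBucketCheck {

  static void checkMaxBuckets(int size, int maxBuckets) {
    if (size > maxBuckets) {
      throw new IllegalStateException("Too many buckets: " + size + " > " + maxBuckets);
    }
  }

  // Fold per-leaf counts into the global map, skipping zero-count buckets and
  // failing as soon as the cap is exceeded rather than after the full loop.
  static Map<Long, Integer> merge(int[] counts, long leafMinBucket, int maxBuckets) {
    Map<Long, Integer> collectorCounts = new HashMap<>();
    for (int i = 0; i < counts.length; ++i) {
      if (counts[i] != 0) {
        collectorCounts.merge(leafMinBucket + i, counts[i], Integer::sum);
        checkMaxBuckets(collectorCounts.size(), maxBuckets);
      }
    }
    return collectorCounts;
  }

  public static void main(String[] args) {
    // Bucket at index 1 has a zero count and is never materialized.
    Map<Long, Integer> merged = merge(new int[] {5, 0, 2, 1}, 0, 3);
    System.out.println(merged.containsKey(1L)); // false
    System.out.println(merged.size());          // 3
  }
}
```

The eager variant trades one extra size check per non-zero bucket for failing fast instead of finishing the loop before discovering the cap was exceeded.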
Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]
kkewwei commented on code in PR #14397: URL: https://github.com/apache/lucene/pull/14397#discussion_r2019696179

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java:

@@ -144,10 +148,85 @@ public void decompress(DataInput in, int originalLength, int offset, int length,
   assert bytes.isValid();
 }

+@Override
+public void decompress(
+    IndexInput in, int originalLength, int offset, int length, BytesRef bytes)
+    throws IOException {
+  assert offset + length <= originalLength;
+
+  if (length == 0) {
+    bytes.length = 0;
+    return;
+  }
+
+  final int dictLength = in.readVInt();
+  final int blockLength = in.readVInt();
+
+  final int numBlocks = readCompressedLengths(in, originalLength, dictLength, blockLength);
+
+  buffer = ArrayUtil.grow(buffer, dictLength + blockLength);
+  long startPointer = in.getFilePointer();
+  bytes.length = 0;
+  if (cachedDictFilPointer == startPointer) {
+    assert cachedDictLength == dictLength && dictEndFilePointer > 0;
+    in.seek(dictEndFilePointer);

Review Comment: Yes, you are right. Initially, I intended to utilize IndexInput for seek. But the compressedLengths already holds the compressed length of the dictionary, so I can use it directly.
Re: [PR] For hnsw merger, do not pop from empty heap [lucene]
benwtrent merged PR #14420: URL: https://github.com/apache/lucene/pull/14420
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2019729847

## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:

@@ -142,6 +144,14 @@ public LRUQueryCache(
   missCount = new LongAdder();
 }

+public float getSkipCacheFactor() {
+  return skipCacheFactor;
+}
+
+public void setSkipCacheFactor(float skipCacheFactor) {
+  this.skipCacheFactor = skipCacheFactor;
+}

Review Comment: Added

## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:

@@ -125,7 +125,9 @@ public LRUQueryCache(
   this.maxSize = maxSize;
   this.maxRamBytesUsed = maxRamBytesUsed;
   this.leavesToCache = leavesToCache;
-  if (skipCacheFactor >= 1 == false) { // NaN >= 1 evaluates false
+  if (skipCacheFactor < 1) {
+    // NaN >= 1
+    // evaluates false

Review Comment: Reverted
Re: [I] Examine the effects of MADV_RANDOM when MGLRU is enabled in Linux kernel [lucene]
rmuir commented on issue #14408: URL: https://github.com/apache/lucene/issues/14408#issuecomment-2755385165

A better default would also be possible, I think, if individual codecs, such as vectors, did not set this directly but instead allowed the Directory class to be in control. The user/Directory is only partially in control now: there is an `org.apache.lucene.store.defaultReadAdvice` system property, but this codec just ignores it, which I don't think is good. So a user has to write a custom Directory and mess around with IOContext to correct the issue, which isn't obvious. Let the defaults be as smart as they need to be. Maybe check `/sys/kernel/mm/lru_gen/enabled` as part of the decision-making! But IMO let the user have the final say, in an easy way.
Re: [PR] bump antlr 4.11.1 -> 4.13.2 [lucene]
rmuir commented on code in PR #14388: URL: https://github.com/apache/lucene/pull/14388#discussion_r2008149095

## lucene/expressions/src/generated/checksums/generateAntlr.json:

@@ -1,7 +1,8 @@
 {
   "lucene/expressions/src/java/org/apache/lucene/expressions/js/Javascript.g4": "818e89aae0b6c7601051802013898c128fe7c1ba",
   "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptBaseVisitor.java": "6965abdb8b069aaceac1ce4f32ed965b194f3a25",
-  "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "b8d6b259ebbfce09a5379a1a2aa4c1ddd4e378eb",
-  "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "7a3a7b9de17f4a8d41ef342312eae5c55e483e08",
-  "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e"
+  "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java": "6508dc5008e96a1ad28c967a3401407ba83f140b",
+  "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptParser.java": "ba6d0c00af113f115fc7a1f165da7726afb2e8c5",
+  "lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptVisitor.java": "ec24bb2b9004bc38ee808970870deed12351039e",
+  "property:antlr-version": "4.13.2"

Review Comment: It's fine, thanks for making it easier to bump these deps. Would love it if we could get dependabot or renovatebot to send us PRs for this stuff. With the latter I've done something similar with `postUpgradeCommand`: we'd have to run their docker container in a scheduled build or something, and not use their "service", though.
Re: [PR] For hnsw merger, do not pop from empty heap [lucene]
tveasey commented on PR #14420: URL: https://github.com/apache/lucene/pull/14420#issuecomment-2761706401

I discussed this a bit with Mayya. So the underlying issue is how we account for the gain for degree-1 vertices. We say that the gain contribution from any vertex is at least 2, i.e. $\max\left(2, \frac{|N_s(v)|}{4}\right)$. This forces us to insert degree-1 vertices, which was intentional. The problem is we can get at most a +1 contribution from them to $g_{tot}$.

Mayya tried it, and when we correct the gain to $\min\left(\max\left(2, \frac{|N_s(v)|}{4}\right), |N_s(v)|\right)$ we terminate before exhausting the heap, as expected. Arguably this is slightly more conceptually correct. However, it represents a change in behaviour and invalidates the testing Mayya has done.

The current behaviour increases the relative size of $J$ for graphs with many degree-one vertices. This seems conceptually ok. In the case where the heap is exhausted, we've just picked $J = V$, the set of all vertices, and reverted to the original merge procedure, which is clearly harmless. However, this is going to be an extreme edge case, since we normally only have a small fraction of degree-one vertices and we'll nearly always terminate on $g_{tot}$.
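As a toy check of the two caps above (illustrative Java only, not the HNSW merger's actual code; `degree` stands for $|N_s(v)|$): the original lower bound credits a degree-1 vertex with a gain of 2, while it can contribute at most its degree to $g_{tot}$, which is exactly the overcount the corrected formula removes.

```java
// Toy comparison of the two gain caps discussed above; `degree` stands for
// |N_s(v)|. This is illustrative only, not the merger's actual code.
public class GainCap {

  // Original: gain is at least 2 for every vertex.
  static double originalGain(int degree) {
    return Math.max(2.0, degree / 4.0);
  }

  // Corrected: a vertex cannot contribute more than its degree.
  static double correctedGain(int degree) {
    return Math.min(originalGain(degree), degree);
  }

  public static void main(String[] args) {
    System.out.println(originalGain(1));   // 2.0: overstates a degree-1 vertex
    System.out.println(correctedGain(1));  // 1.0: capped at the degree
    System.out.println(correctedGain(16)); // 4.0: cap is inactive at high degree
  }
}
```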
Re: [I] New merging hnsw failures with BP policy [lucene]
benwtrent closed issue #14407: New merging hnsw failures with BP policy URL: https://github.com/apache/lucene/issues/14407
Re: [PR] For hnsw merger, do not pop from empty heap [lucene]
mayya-sharipova commented on PR #14420: URL: https://github.com/apache/lucene/pull/14420#issuecomment-2761615009

@benwtrent Please go ahead with the fix; confirmed with @tveasey, the algorithm's author. But we will also look into it further. In this particular test, the graphs are unusual (small graphs with 65 and 73 connections).
Re: [PR] For hnsw merger, do not pop from empty heap [lucene]
benwtrent commented on PR #14420: URL: https://github.com/apache/lucene/pull/14420#issuecomment-2761394895

> Does this case actually only happen when docs are reordered, or is it a general edge case that we happened to find on a test run when docs happened to be reordered?

I think it's a general case; BP exposed it. My concern, and I am waiting for @mayya-sharipova to clarify, is whether we want to return the set if our heap got exhausted, or whether we consider it "incomplete" and return an empty set instead.
Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]
thecoop commented on code in PR #14419: URL: https://github.com/apache/lucene/pull/14419#discussion_r2018614545

## lucene/core/src/test/org/apache/lucene/internal/vectorization/TestVectorUtilSupport.java:

@@ -210,9 +210,13 @@ public void testMinMaxScalarQuantize() {
 }

 private void assertFloatReturningProviders(ToDoubleFunction func) {
-  assertThat(
-      func.applyAsDouble(PANAMA_PROVIDER.getVectorUtilSupport()),
-      closeTo(func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()), delta));
+  double luceneValue = func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport());

Review Comment: I think `closeTo` gives better error messages (specifying the delta in the error message, rather than just giving the two values and letting you work it out). It's also a clearer assert specification in code. But yeah, I hadn't spotted that it doesn't handle NaN in the same way.
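For context, a minimal illustration (no Hamcrest dependency; these helper names are made up) of why a plain delta check rejects NaN while a JUnit-style `assertEquals(double, double, delta)` accepts two NaNs: the latter first tests `Double.compare(expected, actual) == 0`, which treats NaN as equal to itself.

```java
// Sketch contrasting a closeTo-style delta check with a JUnit-style delta
// check; helper names are made up for illustration.
public class NaNDelta {

  // Mirrors the logic of a delta matcher: any comparison against NaN is
  // false, so Math.abs(NaN - NaN) <= delta fails.
  static boolean closeTo(double expected, double actual, double delta) {
    return Math.abs(expected - actual) <= delta;
  }

  // Mirrors JUnit's assertEquals(double, double, double): compare-equality
  // first, which makes NaN equal to NaN.
  static boolean equalsWithDelta(double expected, double actual, double delta) {
    if (Double.compare(expected, actual) == 0) {
      return true;
    }
    return Math.abs(expected - actual) <= delta;
  }

  public static void main(String[] args) {
    System.out.println(closeTo(Double.NaN, Double.NaN, 1e-5));         // false
    System.out.println(equalsWithDelta(Double.NaN, Double.NaN, 1e-5)); // true
    System.out.println(equalsWithDelta(1.0, 1.0 + 1e-6, 1e-5));        // true
  }
}
```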
Re: [PR] Make DenseConjunctionBulkScorer align scoring windows with #docIDRunEnd(). [lucene]
jpountz commented on code in PR #14400: URL: https://github.com/apache/lucene/pull/14400#discussion_r2018633191

## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java:

@@ -171,37 +171,36 @@ private int scoreWindow(
       }
     }

-    if (acceptDocs == null) {
-      int minDocIDRunEnd = max;
-      for (DisiWrapper w : iterators) {
-        if (w.docID() > min) {
-          minDocIDRunEnd = min;
-          break;
-        } else {
-          minDocIDRunEnd = Math.min(minDocIDRunEnd, w.docIDRunEnd());
-        }
-      }
-
-      if (minDocIDRunEnd - min >= WINDOW_SIZE / 2) {
-        // We have a large range of doc IDs that all match.
-        rangeDocIdStream.from = min;
-        rangeDocIdStream.to = minDocIDRunEnd;
-        collector.collect(rangeDocIdStream);
-        return minDocIDRunEnd;
-      }
-    }
-
-    int bitsetWindowMax = (int) Math.min(max, (long) min + WINDOW_SIZE);
-
+    // Partition clauses of the conjunction into:
+    // - clauses that don't fully match the first half of the window and get evaluated via
+    // #loadIntoBitSet or leaf-frog,
+    // - other clauses that are used to compute the greatest possible window size that they fully
+    // match.
+    // This logic helps align scoring windows with the natural #docIDRunEnd() boundaries of the
+    // data, which helps evaluate fewer clauses per window - without allowing windows to become too
+    // small thanks to the WINDOW_SIZE/2 threshold.
+    int minDocIDRunEnd = max;
     for (DisiWrapper w : iterators) {
-      if (w.docID() > min || w.docIDRunEnd() < bitsetWindowMax) {
+      int docIdRunEnd = w.docIDRunEnd();
+      if (w.docID() > min || (docIdRunEnd - min) < WINDOW_SIZE / 2) {

Review Comment: I like it, I'll apply this suggestion.
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
gsmiller commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2018769975

## lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java:

@@ -204,6 +205,40 @@ public int cardinality() {
   return Math.toIntExact(tot);
 }

+/**
+ * Return the number of set bits between indexes {@code from} inclusive and {@code to} exclusive.
+ */
+public int cardinality(int from, int to) {
+  Objects.checkFromToIndex(from, to, length());
+
+  int cardinality = 0;
+
+  // First, align `from` with a word start, ie. a multiple of Long.SIZE (64)
+  if ((from & 0x3F) != 0) {
+    long bits = this.bits[from >> 6] >>> from;
+    int numBitsTilNextWord = -from & 0x3F;
+    if (to - from < numBitsTilNextWord) {
+      bits &= (1L << (to - from)) - 1L;
+      return Long.bitCount(bits);
+    }
+    cardinality += Long.bitCount(bits);
+    from += numBitsTilNextWord;
+    assert (from & 0x3F) == 0;
+  }
+
+  for (int i = from >> 6, end = to >> 6; i < end; ++i) {
+    cardinality += Long.bitCount(bits[i]);
+  }
+
+  // Now handle bits between the last complete word and to
+  if ((to & 0x3F) != 0) {
+    long bits = this.bits[to >> 6] << -to;

Review Comment: Thanks for the detailed explanation. This makes sense to me; it just left me scratching my head for a bit initially to figure out the intention behind the different approaches :)
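The masking tricks in the snippet above can be exercised in isolation. The sketch below reimplements the same ranged popcount over a bare `long[]` (bit `i` lives in word `i >> 6`, matching `FixedBitSet`'s layout), purely to illustrate the head/body/tail split; it is not the Lucene class itself.

```java
// Standalone sketch of the ranged-cardinality logic above, over a raw long[].
// Bit i is stored in word i >> 6, matching FixedBitSet's layout.
public class RangeCardinality {

  static int cardinality(long[] bits, int from, int to) {
    int cardinality = 0;

    // Head: align `from` with a 64-bit word boundary. Java shifts use only
    // the low 6 bits of the shift amount, so >>> from means >>> (from & 63).
    if ((from & 0x3F) != 0) {
      long word = bits[from >> 6] >>> from;
      int numBitsTilNextWord = -from & 0x3F;
      if (to - from < numBitsTilNextWord) {
        // The whole range fits inside a single word.
        word &= (1L << (to - from)) - 1L;
        return Long.bitCount(word);
      }
      cardinality += Long.bitCount(word);
      from += numBitsTilNextWord;
    }

    // Body: whole 64-bit words between `from` and `to`.
    for (int i = from >> 6, end = to >> 6; i < end; ++i) {
      cardinality += Long.bitCount(bits[i]);
    }

    // Tail: shifting left by -to (i.e. 64 - (to & 63)) keeps only the low
    // (to & 63) bits of the last word.
    if ((to & 0x3F) != 0) {
      cardinality += Long.bitCount(bits[to >> 6] << -to);
    }
    return cardinality;
  }

  public static void main(String[] args) {
    long[] allOnes = {-1L, -1L};
    System.out.println(cardinality(allOnes, 3, 70));  // 67
    System.out.println(cardinality(allOnes, 0, 128)); // 128
    System.out.println(cardinality(new long[] {0b1011L}, 1, 4)); // 2
  }
}
```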
[PR] Fix HistogramCollector to not create zero-count buckets. [lucene]
jpountz opened a new pull request, #14421: URL: https://github.com/apache/lucene/pull/14421

If a bucket in the middle of the range doesn't match any docs, it would be returned with a count of zero. Better to not return it at all.
Re: [PR] PointInSetQuery use reverse collection to improve performance [lucene]
github-actions[bot] commented on PR #14352: URL: https://github.com/apache/lucene/pull/14352#issuecomment-2762926172

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [I] Leverage sparse doc value indexes for range and value facet collection [lucene]
epotyom commented on issue #14406: URL: https://github.com/apache/lucene/issues/14406#issuecomment-2761891472

We might still be able to benefit from the Skipper in the sandbox facets module. In the FacetCutter for long ranges https://github.com/apache/lucene/blob/b5d13e29f1431ba30ae7df43c89def87e1677db0/lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/cutters/ranges/LongRangeFacetCutter.java#L247 we can reuse the elementary interval ordinal for all documents within a block of docs that falls entirely within the interval. It won't be as fast as collecting a DocIdStream, but it might still make things faster.
[PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]
iverase opened a new pull request, #14419: URL: https://github.com/apache/lucene/pull/14419

I noticed the following error in CI:

```
./gradlew test --tests TestVectorUtilSupport.testBinaryVectors -Dtests.seed=B12D50704230E803 -Dtests.locale=ce -Dtests.timezone=SystemV/AST4 -Dtests.asserts=true -Dtests.file.encoding=UTF-8

> java.lang.AssertionError:
> Expected: a numeric value within <1.0E-5> of
>      but: differed by more than delta <1.0E-5>
>    at __randomizedtesting.SeedInfo.seed([B12D50704230E803:FF2F6188C28072F0]:0)
>    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
>    at org.junit.Assert.assertThat(Assert.java:964)
>    at org.junit.Assert.assertThat(Assert.java:930)
>    at org.apache.lucene.internal.vectorization.TestVectorUtilSupport.assertFloatReturningProviders(TestVectorUtilSupport.java:213)
>    at org.apache.lucene.internal.vectorization.TestVectorUtilSupport.testBinaryVectors(TestVectorUtilSupport.java:71)
>    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
```

To avoid this, this commit makes `assertFloatReturningProviders` resilient to NaN results.
Re: [PR] removing constructor with deprecated attribute 'onlyLongestMatch [lucene]
renatoh commented on PR #14356: URL: https://github.com/apache/lucene/pull/14356#issuecomment-2760436593

@rmuir Please have a look at it when you have the chance. Thanks!
Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]
vsop-479 commented on PR #14371: URL: https://github.com/apache/lucene/pull/14371#issuecomment-2760720322

@gf2121 I implemented the `VectorMask` approach. There is still a slowdown; I think the reason is my laptop (Mac M2).

Benchmark                                  Mode  Cnt    Score    Error   Units
AdvanceSparseDISIBenchmark.advance        thrpt   15  654.472 ±  2.349  ops/ms
AdvanceSparseDISIBenchmark.advanceVector  thrpt   15  498.590 ± 66.751  ops/ms
Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]
iverase commented on code in PR #14419: URL: https://github.com/apache/lucene/pull/14419#discussion_r2018525082

## lucene/core/src/test/org/apache/lucene/internal/vectorization/TestVectorUtilSupport.java:

@@ -210,9 +210,13 @@ public void testMinMaxScalarQuantize() {
 }

 private void assertFloatReturningProviders(ToDoubleFunction func) {
-  assertThat(
-      func.applyAsDouble(PANAMA_PROVIDER.getVectorUtilSupport()),
-      closeTo(func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()), delta));
+  double luceneValue = func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport());

Review Comment: I updated the code to use `assertEquals`
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2019087636

## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:

@@ -99,7 +100,7 @@ public class LRUQueryCache implements QueryCache, Accountable {
   private final Map cache;
   private final ReentrantReadWriteLock.ReadLock readLock;
   private final ReentrantReadWriteLock.WriteLock writeLock;
-  private final float skipCacheFactor;
+  private final AtomicReference skipCacheFactor;

Review Comment: Yeah, considering we are just doing set/get, volatile should be enough.
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
gf2121 commented on code in PR #14401: URL: https://github.com/apache/lucene/pull/14401#discussion_r2018024992

## lucene/core/src/java/org/apache/lucene/search/LeafCollector.java:

@@ -83,6 +84,21 @@ public interface LeafCollector {
    */
   void collect(int doc) throws IOException;

+  /**
+   * Collect a range of doc IDs, between {@code min} inclusive and {@code max} exclusive.
+   *
+   * Extending this method is typically useful to take advantage of pre-aggregated data exposed
+   * in a {@link DocValuesSkipper}.
+   *
+   * The default implementation calls {@link #collect(DocIdStream)} on a {@link DocIdStream} that

Review Comment: Maybe clarify whether we have any guarantee on the given values, like `max > min`?

## lucene/core/src/java/org/apache/lucene/search/RangeDocIdStream.java:

@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+
+final class RangeDocIdStream extends DocIdStream {
+
+  private final int min, max;
+
+  RangeDocIdStream(int min, int max) {
+    if (max < min) {

Review Comment: Do we intend to allow `min == max`, which is actually an empty range? We need to update this or the exception message anyway.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
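The empty-range question raised above (`min == max`) can be made concrete with a minimal sketch. This is a hypothetical stand-in, not the actual package-private `RangeDocIdStream` in `org.apache.lucene.search`: with the check as written, `min == max` is accepted and simply collects nothing.

```java
// Hypothetical stand-in for the range validation under discussion; the real
// RangeDocIdStream may differ.
class RangeSketch {
  final int min, max;

  RangeSketch(int min, int max) {
    // "max < min" permits the empty range min == max, which is the case
    // gf2121 asks about; rejecting empty ranges would require "max <= min".
    if (max < min) {
      throw new IllegalArgumentException("min must be <= max, got min=" + min + ", max=" + max);
    }
    this.min = min;
    this.max = max;
  }

  // Number of doc IDs in [min, max): zero for an empty range.
  int count() {
    return max - min;
  }
}
```

With this check, `new RangeSketch(5, 5)` is accepted and `count()` returns 0, while `new RangeSketch(5, 4)` throws, which is why either the condition or the exception message needs updating.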
Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2760401254 One of the failing checks is: ``` -- 1. ERROR in /home/runner/work/lucene/lucene/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/QueryProfilerBreakdown.java (at line 63) currentThreadId, ctx -> new QuerySliceProfilerBreakdown()); ^^^ The value of the lambda parameter ctx is not used -- 1 problem (1 error) > Task :lucene:sandbox:ecjLintMain FAILED ``` I am wondering if there is a workaround for this? One option is to use `putIfAbsent`, which doesn't require a function as input, but then we would need to explicitly `get` before returning.
[PR] For hnsw merger, do not pop from empty heap [lucene]
benwtrent opened a new pull request, #14420: URL: https://github.com/apache/lucene/pull/14420 It seems there are edge cases when the heap is empty. This prevents us from attempting to pop from an empty heap. This bug fix should go into 10.2 to make the 10.2.0 release. closes: https://github.com/apache/lucene/issues/14407
Re: [PR] For hnsw merger, do not pop from empty heap [lucene]
benwtrent commented on PR #14420: URL: https://github.com/apache/lucene/pull/14420#issuecomment-2761161246 //cc @iverase as you are handling the 10.2 release.
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
jpountz commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2019386671 ## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -125,7 +125,9 @@ public LRUQueryCache( this.maxSize = maxSize; this.maxRamBytesUsed = maxRamBytesUsed; this.leavesToCache = leavesToCache; -if (skipCacheFactor >= 1 == false) { // NaN >= 1 evaluates false +if (skipCacheFactor < 1) { // + // NaN >= 1 + // evaluates false Review Comment: Please revert so that we keep rejecting skipCacheFactor=Float.NaN? ## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -142,6 +144,14 @@ public LRUQueryCache( missCount = new LongAdder(); } + public float getSkipCacheFactor() { +return skipCacheFactor; + } + + public void setSkipCacheFactor(float skipCacheFactor) { +this.skipCacheFactor = skipCacheFactor; + } Review Comment: Can you add javadocs?
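The subtlety behind this review request is that every ordered comparison involving `NaN` evaluates to false, so `skipCacheFactor < 1` silently accepts `Float.NaN`, while the original `skipCacheFactor >= 1 == false` form rejects it. A small standalone demonstration (not code from the PR):

```java
// Demonstrates why "skipCacheFactor >= 1 == false" and "skipCacheFactor < 1"
// are NOT equivalent: all ordered comparisons with NaN evaluate to false.
class NaNComparison {
  // The original, NaN-rejecting form of the check.
  static boolean rejectsOriginal(float skipCacheFactor) {
    return (skipCacheFactor >= 1) == false;
  }

  // The rewritten form from the diff: lets NaN through.
  static boolean rejectsRewritten(float skipCacheFactor) {
    return skipCacheFactor < 1;
  }
}
```

For any ordinary float the two checks agree; they only diverge on `NaN`, which is exactly the case the reviewer wants to keep rejecting.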
Re: [PR] Make DenseConjunctionBulkScorer align scoring windows with #docIDRunEnd(). [lucene]
jpountz merged PR #14400: URL: https://github.com/apache/lucene/pull/14400
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
gsmiller commented on PR #14401: URL: https://github.com/apache/lucene/pull/14401#issuecomment-2761478694 > I don't think so, or rather taking advantage of range collection shouldn't help more than what https://github.com/apache/lucene/pull/14273 does with RangeDocIdStream? My thinking here was that `HistogramCollector` should benefit from any scorers that can provide it with a `DocIdStream` for collection, and that this change lays the groundwork for more scorers to pass streams to collectors instead of individual docs (specifically thinking about some of the query use-cases you mention in the description). I probably should have been more clear :) (and maybe I'm still getting confused and this isn't true...)
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
jpountz commented on PR #14401: URL: https://github.com/apache/lucene/pull/14401#issuecomment-2761651163 Ah, that's right. We have a good number of queries that are already covered, in my opinion the next natural step is to look into making ranges collect ranges when any clause would collect a range.
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
jpountz commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2018865956 ## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -122,12 +123,30 @@ public LRUQueryCache( long maxRamBytesUsed, Predicate<LeafReaderContext> leavesToCache, float skipCacheFactor) { +this(maxSize, maxRamBytesUsed, leavesToCache, new AtomicReference<>(skipCacheFactor)); + } + + /** + * Additionally, allows the ability to pass skipCacheFactor in form of AtomicReference where the + * caller can dynamically update(in a thread safe way) its value by calling skipCacheFactor.set() + * on their end. + */ + public LRUQueryCache( + int maxSize, + long maxRamBytesUsed, + Predicate<LeafReaderContext> leavesToCache, + AtomicReference<Float> skipCacheFactor) { Review Comment: I don't think that this is a reason for exposing it via an AtomicReference instead of a getter/setter. TieredMergePolicy, to give an example, is the same: many of its setters are not exposed on MergePolicy.
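The getter/setter alternative the reviewer suggests can rely on a `volatile` field, which gives the same safe cross-thread publication as `AtomicReference.get()/set()` for a single value. A hypothetical sketch of that shape, not the PR's final code:

```java
// Hypothetical sketch of a getter/setter over a volatile field as an
// alternative to AtomicReference<Float> for a dynamically updatable setting.
class SkipCacheConfig {
  private volatile float skipCacheFactor;

  SkipCacheConfig(float skipCacheFactor) {
    setSkipCacheFactor(skipCacheFactor);
  }

  float getSkipCacheFactor() {
    return skipCacheFactor;
  }

  void setSkipCacheFactor(float skipCacheFactor) {
    // Written as ">= 1 == false" so that NaN is rejected too.
    if ((skipCacheFactor >= 1) == false) {
      throw new IllegalArgumentException(
          "skipCacheFactor must be no less than 1, got: " + skipCacheFactor);
    }
    this.skipCacheFactor = skipCacheFactor;
  }
}
```

Validating in the setter keeps the NaN-rejecting invariant that the constructor check enforced, which an `AtomicReference` updated directly by the caller could not guarantee.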
Re: [PR] PointInSetQuery clips segments by lower and upper [lucene]
gsmiller commented on code in PR #14268: URL: https://github.com/apache/lucene/pull/14268#discussion_r2018814824 ## lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java: ## @@ -108,6 +110,8 @@ protected PointInSetQuery(String field, int numDims, int bytesPerDim, Stream pac } if (previous == null) { previous = new BytesRefBuilder(); +lowerPoint = new byte[bytesPerDim * numDims]; +System.arraycopy(current.bytes, current.offset, lowerPoint, 0, lowerPoint.length); Review Comment: minor: I think it'd be slightly more readable if you used `current.length` here instead of ` lowerPoint.length` (and I might also throw an `assert lowerPoint.length == current.length` immediately before this line to make it clear they should be equal). ## lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java: ## @@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti return null; } +if (values.getDocCount() == 0) { + return null; +} else if (lowerPoint != null && upperPoint != null) { Review Comment: Should it be true that either 1) both of these are null, or 2) both are non-null? I think so right? If that's right, I would check `lowerPoint != null` then put an `assert upperPoint != null` in the condition branch. ## lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java: ## @@ -122,6 +126,11 @@ protected PointInSetQuery(String field, int numDims, int bytesPerDim, Stream pac } sortedPackedPoints = builder.finish(); sortedPackedPointsHashCode = sortedPackedPoints.hashCode(); +if (previous != null) { + BytesRef max = previous.get(); + upperPoint = new byte[bytesPerDim * numDims]; + System.arraycopy(max.bytes, max.offset, upperPoint, 0, upperPoint.length); Review Comment: minor: same comment here about using `previous.length` in place of `upperPoint.length`. I don't feel strongly about this so please feel free to disagree if you have a different perspective. 
## lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java: ## @@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti return null; } +if (values.getDocCount() == 0) { Review Comment: I'm not 100% sure of this but I don't think it's possible to get back a non-null instance from `reader#getPointValues` that has a zero doc count. I believe you'll always get back a null instance if the points field has no docs in a segment. Can you confirm with a test and/or some debugging?
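The readability suggestion in this review (size the copy by the source length and assert it matches the destination) can be sketched as follows; the helper, array contents, and dimensions here are illustrative, not taken from the PR:

```java
import java.util.Arrays;

// Illustrative helper showing the suggested pattern: copy using the source
// length and assert that it equals the destination sized bytesPerDim * numDims.
class PackedPointCopy {
  static byte[] copyPoint(byte[] srcBytes, int srcOffset, int srcLength, int bytesPerDim, int numDims) {
    byte[] point = new byte[bytesPerDim * numDims];
    // Makes the invariant the reviewer asks for explicit before copying.
    assert point.length == srcLength;
    System.arraycopy(srcBytes, srcOffset, point, 0, srcLength);
    return point;
  }
}
```

Copying `srcLength` bytes rather than `point.length` bytes behaves identically when the assert holds; the gain is purely that a reader can see the source drives the copy.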
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
jpountz commented on PR #14401: URL: https://github.com/apache/lucene/pull/14401#issuecomment-2761666414 Any opinion on `collect(int min, int max)` vs. `collectRange(int min, int max)`? I leaned towards `collectRange` since we already have `collect(int doc)` and it wouldn't be obvious from the parameter types whether `collect(int, int)` is collecting a range or two random docs. No strong feeling either way though. `collect(DocIdStream)` is called "collect" rather than "collectDocIdStream" so I guess that `collect(int min, int max)` would be more consistent from this perspective.
Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]
thecoop commented on code in PR #14419: URL: https://github.com/apache/lucene/pull/14419#discussion_r2018614545 ## lucene/core/src/test/org/apache/lucene/internal/vectorization/TestVectorUtilSupport.java: ## @@ -210,9 +210,13 @@ public void testMinMaxScalarQuantize() { } private void assertFloatReturningProviders(ToDoubleFunction<VectorUtilSupport> func) { -assertThat( -func.applyAsDouble(PANAMA_PROVIDER.getVectorUtilSupport()), -closeTo(func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()), delta)); +double luceneValue = func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()); Review Comment: I think `closeTo` gives better error messages (specifying the delta, as in the error message, rather than just giving the two values and letting you work it out). It's also a clearer assert specification in code.
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
jpountz commented on code in PR #14401: URL: https://github.com/apache/lucene/pull/14401#discussion_r2018638902 ## lucene/core/src/java/org/apache/lucene/search/RangeDocIdStream.java: ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; + +final class RangeDocIdStream extends DocIdStream { + + private final int min, max; + + RangeDocIdStream(int min, int max) { +if (max < min) { Review Comment: This is a good question, I had overlooked it. While I don't think that empty ranges would cause issues for any impl, I think we should reject them. I'll update the PR.
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
gsmiller commented on PR #14401: URL: https://github.com/apache/lucene/pull/14401#issuecomment-2761941694 I prefer `collectRange` as well to make usage a little less error-prone. I don't have a strong opinion though.
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2019040588 ## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -122,12 +123,30 @@ public LRUQueryCache( long maxRamBytesUsed, Predicate<LeafReaderContext> leavesToCache, float skipCacheFactor) { +this(maxSize, maxRamBytesUsed, leavesToCache, new AtomicReference<>(skipCacheFactor)); + } + + /** + * Additionally, allows the ability to pass skipCacheFactor in form of AtomicReference where the + * caller can dynamically update(in a thread safe way) its value by calling skipCacheFactor.set() + * on their end. + */ + public LRUQueryCache( + int maxSize, + long maxRamBytesUsed, + Predicate<LeafReaderContext> leavesToCache, + AtomicReference<Float> skipCacheFactor) { Review Comment: Ok then, I will change it to a getter/setter then.
Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]
iverase merged PR #14419: URL: https://github.com/apache/lucene/pull/14419
Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]
jpountz commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2761334037 You just need to replace `ctx` with `_`.
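The two options discussed in this thread (an unnamed lambda parameter, or `putIfAbsent` followed by an explicit `get`) can be sketched as below; the map and value types are stand-ins, not the actual profiler classes from the PR:

```java
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for QueryProfilerBreakdown's per-thread map; the real value type
// is QuerySliceProfilerBreakdown.
class PerThreadBreakdowns {
  private final ConcurrentHashMap<Long, StringBuilder> breakdowns = new ConcurrentHashMap<>();

  // Workaround mentioned in the thread: putIfAbsent takes a value, not a
  // function, so no lambda parameter exists for ecjLint to flag -- at the cost
  // of eagerly creating the value and needing an explicit get() afterwards.
  StringBuilder forThread(long threadId) {
    breakdowns.putIfAbsent(threadId, new StringBuilder());
    return breakdowns.get(threadId);
    // jpountz's simpler fix keeps computeIfAbsent and renames the unused
    // parameter to "_" (an unnamed variable on recent Java versions):
    //   return breakdowns.computeIfAbsent(threadId, _ -> new StringBuilder());
  }
}
```

Both variants return the same instance for repeated calls with the same thread id; the `computeIfAbsent` form additionally avoids allocating a value when the key is already present.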
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2018661757 ## lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java: ## @@ -204,6 +205,40 @@ public int cardinality() { return Math.toIntExact(tot); } + /** + * Return the number of set bits between indexes {@code from} inclusive and {@code to} exclusive. + */ + public int cardinality(int from, int to) { +Objects.checkFromToIndex(from, to, length()); + +int cardinality = 0; + +// First, align `from` with a word start, ie. a multiple of Long.SIZE (64) +if ((from & 0x3F) != 0) { + long bits = this.bits[from >> 6] >>> from; + int numBitsTilNextWord = -from & 0x3F; + if (to - from < numBitsTilNextWord) { +bits &= (1L << (to - from)) - 1L; +return Long.bitCount(bits); + } + cardinality += Long.bitCount(bits); + from += numBitsTilNextWord; + assert (from & 0x3F) == 0; +} + +for (int i = from >> 6, end = to >> 6; i < end; ++i) { + cardinality += Long.bitCount(bits[i]); +} + +// Now handle bits between the last complete word and to +if ((to & 0x3F) != 0) { + long bits = this.bits[to >> 6] << -to; Review Comment: You are right. We need the low "to % 64" bits of `bits[to >> 6]`. `x << -to` is nice because it requires fewer instructions but it also moves bits to a different index. This makes iterating over set bits slightly more cumbersome, but doesn't matter for counting bits. Hence why `forEach` applies a mask instead. I haven't checked actual performance, I suspect that it doesn't matter in practice.
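The `bits << -to` idiom discussed here works because Java masks a long shift count to its low six bits, so `x << -to` equals `x << (64 - (to & 0x3F))`: the low `to % 64` bits move to the top of the word. Bit positions change, but `Long.bitCount` does not care, which is why counting can use the cheaper shift while `forEach` must apply a mask. A standalone demonstration (not Lucene code):

```java
// Shows that (word << -to) and (word & mask) keep the same *number* of set
// bits, even though the shift relocates them to the high end of the word.
class ShiftVsMask {
  static int countLowBitsViaMask(long word, int to) {
    return Long.bitCount(word & ((1L << to) - 1)); // valid for 0 < to < 64
  }

  static int countLowBitsViaShift(long word, int to) {
    // The shift count is taken mod 64, so "<< -to" means "<< (64 - to)" here:
    // bits at positions >= to fall off the top, the low bits survive.
    return Long.bitCount(word << -to);
  }
}
```

For iterating over the surviving set bits, the mask form is preferable since it leaves each bit at its original index.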
Re: [PR] For hnsw merger, do not pop from empty heap [lucene]
jpountz commented on PR #14420: URL: https://github.com/apache/lucene/pull/14420#issuecomment-2761367741 Does this case actually only happen when docs are reordered, or is it a general edge case that we happened to find on a test run when docs happened to be reordered?
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2019109207 ## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -142,6 +161,10 @@ public LRUQueryCache( missCount = new LongAdder(); } + AtomicReference<Float> getSkipCacheFactor() { +return skipCacheFactor; + } Review Comment: Done
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2019115721 ## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -269,8 +292,8 @@ boolean requiresEviction() { } CacheAndCount get(Query key, IndexReader.CacheHelper cacheHelper) { -assert key instanceof BoostQuery == false; -assert key instanceof ConstantScoreQuery == false; +assert !(key instanceof BoostQuery); +assert !(key instanceof ConstantScoreQuery); Review Comment: Changed it back.
Re: [PR] Handle NaN results in TestVectorUtilSupport.testBinaryVectors [lucene]
gf2121 commented on code in PR #14419: URL: https://github.com/apache/lucene/pull/14419#discussion_r2018496277 ## lucene/core/src/test/org/apache/lucene/internal/vectorization/TestVectorUtilSupport.java: ## @@ -210,9 +210,13 @@ public void testMinMaxScalarQuantize() { } private void assertFloatReturningProviders(ToDoubleFunction<VectorUtilSupport> func) { -assertThat( -func.applyAsDouble(PANAMA_PROVIDER.getVectorUtilSupport()), -closeTo(func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()), delta)); +double luceneValue = func.applyAsDouble(LUCENE_PROVIDER.getVectorUtilSupport()); Review Comment: I was a bit surprised that we did not have a good utility to deal with double equality, so I grepped for `1e-5` in the code base. It seems we used to use `public static void assertEquals(double expected, double actual, double delta)` for this kind of assertion. It deals with `NaN` well. Do we have a special reason to use `closeTo` here?
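The behavioral difference comes down to how the two assertions compare doubles: JUnit's `assertEquals(double, double, double)` first checks for exact equality via `Double.compare` (under which `NaN` equals `NaN`) before falling back to the delta, whereas a pure `|expected - actual| <= delta` check, which is essentially what Hamcrest's `closeTo` performs, is always false when either side is `NaN`. A dependency-free sketch of the two semantics (simplified; the real matchers do more):

```java
// Simplified models of the two double-comparison styles from the thread.
class DoubleEquality {
  // closeTo-style: a plain delta comparison; NaN poisons it, since any
  // arithmetic comparison involving NaN is false.
  static boolean closeToStyle(double expected, double actual, double delta) {
    return Math.abs(expected - actual) <= delta;
  }

  // assertEquals-style: treat "same double" (including NaN vs NaN) as equal
  // before applying the delta.
  static boolean assertEqualsStyle(double expected, double actual, double delta) {
    return Double.compare(expected, actual) == 0 || Math.abs(expected - actual) <= delta;
  }
}
```

This is why the PR's explicit NaN handling is needed with `closeTo` but would not be with the delta-based `assertEquals`.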