Re: [PR] Fix HistogramCollector to not create zero-count buckets. [lucene]
jpountz merged PR #14421: URL: https://github.com/apache/lucene/pull/14421 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix HistogramCollector to not create zero-count buckets. [lucene]
jpountz commented on PR #14421: URL: https://github.com/apache/lucene/pull/14421#issuecomment-2766465560 I backported to branch_10_2 since this is a bugfix cc @iverase -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] quick exit on filter query matching no docs when rewriting knn query [lucene]
jpountz merged PR #14418: URL: https://github.com/apache/lucene/pull/14418 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Add more information to IOContext [lucene]
thecoop opened a new issue, #14422: URL: https://github.com/apache/lucene/issues/14422 ### Description Currently Lucene tends to use `ReadAdvice.RANDOM` quite liberally, in various hard-coded locations. We have seen that this causes some problems with recent kernel versions, such as #14408. There are also several other aspects about opening files - such as which file access implementation to use, how to map memory, etc, that are dependent on the context of opening such files, and cannot be determined solely by the codec that is used. For example, you may want to access a flat vector file differently if it’s being used for an exhaustive scan vs being used to access specific vectors. To solve both these issues, Directory implementations could have information on the context a particular file is being opened with, to help it decide how and with what options it should open the file. However, there is no such information provided to Directory classes in the IOContext, and there is no space to provide such information in an extensible way. So there needs to be a way to pass additional context-specific information in an IOContext, to help Directory implementations decide on the ReadAdvice to use, and how and with what options it should open a particular file. There are two main ways I can see of doing this: 1. Make IOContext an interface. This allows custom implementations to be created, which can then be checked for specific types inside the custom Directory implementation to then cast and pull out the required information (prototype can be found [here](https://github.com/apache/lucene/commit/739f816b5492a505b1bc964214a6d2de60558d47)) 2. Add a `Map` or `Map` payload to IOContext allowing for arbitrary information to be included with an IOContext, that can then be checked by the Directory implementation As part of this, we could introduce some standard patterns, or some standard map keys for existing Lucene directories to replace the existing ReadAdvice logic, to provide more information on how files should be opened. A later piece of work could be to remove ReadAdvice from IOContext, and provide some other way Directory implementations could determine the access pattern that is likely to be used. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] quick exit on filter query matching no docs when rewriting knn query [lucene]
bugmakerr commented on PR #14418: URL: https://github.com/apache/lucene/pull/14418#issuecomment-2766274186 > So in the case when the filter rewrites to MatchNoDocsQuery, `IndexSearcher#search` would now return immediately, while it may currently need to wait for tasks to be submitted (if the queue is not empty). Yes, you are right. > Can you add a CHANGES entry? Added. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
gf2121 commented on PR #14401: URL: https://github.com/apache/lucene/pull/14401#issuecomment-2765918007 The change of https://github.com/apache/lucene/pull/14421 is also included, which seems not expected? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Disable the query cache by default. [lucene]
msfroh commented on code in PR #14187: URL: https://github.com/apache/lucene/pull/14187#discussion_r2021571472 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -77,7 +77,8 @@ public class IndexSearcher { static int maxClauseCount = 1024; - private static QueryCache DEFAULT_QUERY_CACHE; + // Caching is disabled by default. + private static QueryCache DEFAULT_QUERY_CACHE = null; private static QueryCachingPolicy DEFAULT_CACHING_POLICY = new UsageTrackingQueryCachingPolicy(); Review Comment: > In general, I'm in favor of this, would looking into it in a follow-up work for you? Sounds good to me, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]
github-actions[bot] commented on PR #14360: URL: https://github.com/apache/lucene/pull/14360#issuecomment-2767713257 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
jpountz merged PR #14401: URL: https://github.com/apache/lucene/pull/14401 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding profiling support for concurrent segment search [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2767417910 @jpountz - Can you provide your thoughts on above? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] PointInSetQuery clips segments by lower and upper [lucene]
gsmiller commented on code in PR #14268: URL: https://github.com/apache/lucene/pull/14268#discussion_r2021381006 ## lucene/CHANGES.txt: ## @@ -186,6 +186,8 @@ Optimizations * GITHUB#14272: Use DocIdSetIterator#range for continuous-id BKD leaves. (Guo Feng) +* GITHUB#14268: PointInSetQuery clips segments by lower and upper (hanbj) Review Comment: Let's move the changes entry to 10.3 since the 10.2 branch has already been cut and I don't think we need to squeeze this in with that release? Also, can we make this entry a little more descriptive? Maybe something like `PointInSetQuery optimization for the case when no segment docs can intersect with the query values`? ## lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java: ## @@ -248,6 +255,33 @@ public long cost() { } } + private boolean checkValidPointValues(PointValues values) throws IOException { Review Comment: I'd prefer going back to having this logic inlined as it was. I don't think checking `values == null` is really part of validating the `PointValues`. And there's nothing else calling this, so I think it's a bit easier to read when inlined. ## lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java: ## @@ -248,6 +255,33 @@ public long cost() { } } + private boolean checkValidPointValues(PointValues values) throws IOException { Review Comment: (But I do agree we should do the validation before the optimization you introduced) ## lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java: ## @@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti return null; } +if (values.getDocCount() == 0) { Review Comment: Looking at the git annotations in `PointRangeQuery`, these checks were added in two different changes. I think it's likely this was just overlooked. I do not believe it's possible to have a non-null `PointValue` at this point that returns a zero doc count. (I also played around with this using some unit tests and a debugger and can confirm that behavior). All that said, I'm not strongly opposed to leaving the check in there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]
github-actions[bot] commented on PR #14325: URL: https://github.com/apache/lucene/pull/14325#issuecomment-2767713463 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add more information to IOContext [lucene]
jpountz commented on issue #14422: URL: https://github.com/apache/lucene/issues/14422#issuecomment-2767398441 @rmuir I'm curious if you can expand a bit more on what you have in mind, what you are describing sounds to me like how things are today where `ReadAdvice.RANDOM` is the context and the `MADV_RANDOM` flag to `madvise` is the implementation. So I'm sure I'm missing something. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]
jainankitk commented on code in PR #14267: URL: https://github.com/apache/lucene/pull/14267#discussion_r2021794022 ## lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java: ## @@ -129,6 +141,16 @@ public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, fl private boolean matches(byte[] packedValue) { int offset = 0; + +if (equalValues) { + for (int dim = 0; dim < numDims; dim++, offset += bytesPerDim) { +if (comparator.compare(packedValue, offset, lowerPoint, offset) != 0) { + return false; +} + } + return true; +} Review Comment: Given we know about `equalValues` being true/false while initializing the `PointRangeQuery`, I would have a separate weight object, instead of having this additional logic when `lowerPoint != upperPoint`. For example : ``` @Override public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) throws IOException { if (this.equalValues) { return new ConstantScoreWeight(this, boost) {} } // We don't use RandomAccessWeight here: it's no good to approximate with "match all docs". // This is an inverted structure and should be used in the first pass: return new ConstantScoreWeight(this, boost) {} } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] PointInSetQuery early exit on non-matching segments [lucene]
hanbj commented on code in PR #14268: URL: https://github.com/apache/lucene/pull/14268#discussion_r2022087763 ## lucene/CHANGES.txt: ## @@ -186,6 +186,8 @@ Optimizations * GITHUB#14272: Use DocIdSetIterator#range for continuous-id BKD leaves. (Guo Feng) +* GITHUB#14268: PointInSetQuery clips segments by lower and upper (hanbj) Review Comment: This has been modified. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]
jainankitk commented on code in PR #14397: URL: https://github.com/apache/lucene/pull/14397#discussion_r2021750116 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java: ## @@ -98,12 +98,17 @@ public void decompress(DataInput in, int originalLength, int offset, int length, final int blockLength = in.readVInt(); final int numBlocks = readCompressedLengths(in, originalLength, dictLength, blockLength); - - buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength); bytes.length = 0; - // Read the dictionary - if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) { -throw new CorruptIndexException("Illegal dict length", in); + if (reused) { +assert buffer.length >= dictLength + blockLength; +in.skipBytes(compressedLengths[0]); + } else { +// Read the dictionary +buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength); +if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) { + throw new CorruptIndexException("Illegal dict length", in); +} +reused = true; Review Comment: I am wondering if we should consider exposing metric (simple counter maybe) on how many times we could reuse, and how many times had to read from the disk? That would provide some useful insights on the usefulness of this change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]
jainankitk commented on code in PR #14397: URL: https://github.com/apache/lucene/pull/14397#discussion_r2021748068 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java: ## @@ -144,10 +148,85 @@ public void decompress(DataInput in, int originalLength, int offset, int length, assert bytes.isValid(); } +@Override +public void decompress( +IndexInput in, int originalLength, int offset, int length, BytesRef bytes) +throws IOException { + assert offset + length <= originalLength; + + if (length == 0) { +bytes.length = 0; +return; + } + + final int dictLength = in.readVInt(); + final int blockLength = in.readVInt(); + + final int numBlocks = readCompressedLengths(in, originalLength, dictLength, blockLength); + + buffer = ArrayUtil.grow(buffer, dictLength + blockLength); + long startPointer = in.getFilePointer(); + bytes.length = 0; + if (cachedDictFilPointer == startPointer) { +assert cachedDictLength == dictLength && dictEndFilePointer > 0; +in.seek(dictEndFilePointer); Review Comment: Yes, looks so much cleaner now. Thanks for the refactoring! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org