Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2028456102

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
    * The default implementation is empty
    */
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
The file extension is quite cute, but it discloses more of the implementation details to the caller, which seems unnecessary.

I deliberately chose to represent the categories as a `String` (rather than, say, an enum), since this offers more flexibility to the implementation while standardising on the three main categories.
Re: [PR] Adding profiling support for concurrent segment search [lucene]
jpountz commented on code in PR #14413:
URL: https://github.com/apache/lucene/pull/14413#discussion_r2028458611

## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/QueryProfilerBreakdown.java:
@@ -17,46 +17,113 @@
 package org.apache.lucene.sandbox.search;

+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
 import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+import org.apache.lucene.search.Query;
 import org.apache.lucene.util.CollectionUtil;

 /**
  * A record of timings for the various operations that may happen during query execution. A node's
  * time may be composed of several internal attributes (rewriting, weighting, scoring, etc).
  */
 class QueryProfilerBreakdown {
-
-  /** The accumulated timings for this query node */
-  private final QueryProfilerTimer[] timers;
+  private static final Collection QUERY_LEVEL_TIMING_TYPE =
+      Arrays.stream(QueryProfilerTimingType.values()).filter(t -> !t.isSliceLevel()).toList();
+  private final Map queryProfilerTimers;
+  private final ConcurrentMap threadToSliceBreakdown;

Review Comment:
There seems to be a general problem with how profiling figures out which slice it runs in. We should look into reliably identifying slices, or breaking down by thread rather than by slice?
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2028417528

## lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:
@@ -257,6 +259,19 @@ public long ramBytesUsed() {
     return size;
   }

+  @Override
+  public Map getOffHeapByteSize(FieldInfo fieldInfo) {
+    Objects.requireNonNull(fieldInfo);
+    var raw = rawVectorsReader.getOffHeapByteSize(fieldInfo);
+    var fieldEntry = fields.get(fieldInfo.name);
+    if (fieldEntry == null) {
+      assert fieldInfo.getVectorEncoding() == VectorEncoding.BYTE;

Review Comment:
I did think that too; however, both this format and the ScalarQuantizedFormat can be used directly with bytes, in which case there is no auto-quantization, just a byte vector store.
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2028456102

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
    * The default implementation is empty
    */
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
~The file extension is quite cute, but discloses more of the implementation details to the caller which seems unnecessary.~

~I deliberately chose to represent the categories as a `String` (rather than say an enum), since this offers more flexibility to the implementation while standardising on the three main categories.~
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2028541365

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
    * The default implementation is empty
    */
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
I chatted with Jim offline; the file extension relates nicely with how preloading currently works, and makes that connection more concrete, which is a very nice property. It also reduces the surface of the API change, allowing the constant strings to be removed. I'll update the code to use the file extension.
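For illustration, a caller could aggregate the proposed per-extension sizes across a reader's vector fields roughly as below. This is a minimal sketch against the PR's proposed `getOffHeapByteSize(FieldInfo)` method; the helper itself, and the lowercase `vec`/`veq`/`vex` keys, are assumptions rather than the final API.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;

// Hypothetical helper: sums the proposed per-category (file-extension) off-heap
// sizes over all vector fields of one segment's reader.
static Map<String, Long> offHeapTotals(KnnVectorsReader reader, FieldInfos fieldInfos) {
  Map<String, Long> totals = new HashMap<>();
  for (FieldInfo fi : fieldInfos) {
    if (fi.hasVectorValues()) {
      reader.getOffHeapByteSize(fi).forEach((ext, bytes) -> totals.merge(ext, bytes, Long::sum));
    }
  }
  // e.g. totals.getOrDefault("vec", 0L) for raw vectors,
  //      totals.getOrDefault("veq", 0L) for quantized vectors,
  //      totals.getOrDefault("vex", 0L) for the HNSW graph
  return totals;
}
```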
Re: [I] QueryParser parsing a phrase with a wildcard [lucene]
mkhludnev commented on issue #14440:
URL: https://github.com/apache/lucene/issues/14440#issuecomment-2778755848

Just a gentle reminder about `ComplexPhraseQueryParser`
Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]
jainankitk commented on code in PR #14435:
URL: https://github.com/apache/lucene/pull/14435#discussion_r2028202175

## lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
@@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String value) {
   public static Query newSetQuery(String field, Collection values) {
     Objects.requireNonNull(field, "field must not be null");
     Objects.requireNonNull(values, "values must not be null");
-    Query indexQuery = new TermInSetQuery(field, values);
-    Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, field, values);
-    return new IndexOrDocValuesQuery(indexQuery, dvQuery);
+    return TermInSetQuery.newIndexOrDocValuesQuery(
+        MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values);

Review Comment:
Then we don't need that method, right?
Re: [PR] New IndexReaderFunctions.positionLength from the norm [lucene]
dsmiley commented on PR #14433:
URL: https://github.com/apache/lucene/pull/14433#issuecomment-2778653535

I'd expect a hypothetical `IndexReaderFunctions.numTerms(field)` to return the number of terms in the index for that field. That's not even close to what we want! "Length" should be a component of the name.
[I] [Release Wizard] Move choice for signing the release to when it is needed [lucene]
iverase opened a new issue, #14441:
URL: https://github.com/apache/lucene/issues/14441

The release wizard asks, in the `Prerequisites/GPG key id is configured` step, for the method for signing the release:

```
Q: Do you want to sign the release with gradle plugin? No means gpg (y/n):
```

For most of us (at least for me), this choice is very obscure, and its consequences are not very clear. Probably this is the only time you have to deal with gpg!

As far as I understand, this choice is only used when calling the script `buildAndPushRelease.py`, which uses a different signing method based on the choice above. This happens later in the release wizard, under `Build the release artifacts\Build the release candidate`. If something goes wrong because of the choice above, it is actually difficult to track down why.

I wonder if the question should be moved to just before running the script, so there is more connection between actions and results.
Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]
jpountz commented on code in PR #14435:
URL: https://github.com/apache/lucene/pull/14435#discussion_r2028403966

## lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
@@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String value) {
   public static Query newSetQuery(String field, Collection values) {
     Objects.requireNonNull(field, "field must not be null");
     Objects.requireNonNull(values, "values must not be null");
-    Query indexQuery = new TermInSetQuery(field, values);
-    Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, field, values);
-    return new IndexOrDocValuesQuery(indexQuery, dvQuery);
+    return TermInSetQuery.newIndexOrDocValuesQuery(
+        MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values);

Review Comment:
I would agree with this. `KeywordField#newSetQuery` is the convenience method, so it's nice to not have to pass a rewrite; `TermInSetQuery#newIndexOrDocValuesQuery` is the expert method, and experts can figure out how they want their multi-term queries to be rewritten.
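To make the convenience/expert split concrete, usage might look like the following sketch, assuming the `TermInSetQuery.newIndexOrDocValuesQuery` signature proposed in this PR; the field name and values are illustrative.

```java
import java.util.List;
import org.apache.lucene.document.KeywordField;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

static void example() {
  List<BytesRef> values = List.of(new BytesRef("red"), new BytesRef("blue"));

  // Convenience: no rewrite method to think about, sensible defaults.
  Query simple = KeywordField.newSetQuery("color", values);

  // Expert: the caller chooses how the index-side query is rewritten.
  Query expert =
      TermInSetQuery.newIndexOrDocValuesQuery(
          MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, "color", values);
}
```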
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2029105175

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
    * The default implementation is empty
    */
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
Done.
[PR] Adding logic for collecting Histogram efficiently using Point Trees [lucene]
jainankitk opened a new pull request, #14439:
URL: https://github.com/apache/lucene/pull/14439

### Description

This PR adds multi-range traversal logic to collect the histogram on a numeric field indexed as point values for MATCH_ALL cases. Even for non-MATCH_ALL cases like `PointRangeQuery`, this logic can be used when the query field == histogram field. For the latter, the `PointRangeQuery` bounds need to be supplied for building the appropriate `Ranges` to be collected.

One of the key assumptions is the absence of any deleted documents. Maybe going forward (especially if the deleted-documents percentage is low), we can consider correcting the collected `Ranges` by subtracting for deleted documents. Although, if I remember correctly, getting histogram field values for just the deleted documents was a non-trivial task!
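As a rough sketch of the traversal this describes (a minimal illustration, not the PR's actual code, assuming a single-dimension long field and no deleted documents): when a point-tree cell lies entirely inside a bucket's range, its `size()` is taken wholesale instead of visiting documents; per-value checks only happen at partially overlapping leaves. The method name and bucket handling are hypothetical; the `PointValues.PointTree` API is Lucene's.

```java
import java.io.IOException;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.PointValues;

// Counts matching points for one bucket [min, max]. Assumes a 1D long field
// and no deletes (so tree.size() is exact for fully-contained cells).
static void countBucket(PointValues.PointTree tree, long min, long max, long[] count)
    throws IOException {
  long cellMin = LongPoint.decodeDimension(tree.getMinPackedValue(), 0);
  long cellMax = LongPoint.decodeDimension(tree.getMaxPackedValue(), 0);
  if (cellMax < min || cellMin > max) {
    return; // cell is disjoint from the bucket
  }
  if (cellMin >= min && cellMax <= max) {
    count[0] += tree.size(); // cell fully inside: no per-doc visiting needed
    return;
  }
  if (tree.moveToChild()) { // partial overlap: recurse into children
    do {
      countBucket(tree, min, max, count);
    } while (tree.moveToSibling());
    tree.moveToParent();
  } else { // leaf cell: fall back to checking individual values
    tree.visitDocValues(
        new PointValues.IntersectVisitor() {
          @Override
          public void visit(int docID) {
            count[0]++; // only called for fully-matching cells
          }

          @Override
          public void visit(int docID, byte[] packedValue) {
            long v = LongPoint.decodeDimension(packedValue, 0);
            if (v >= min && v <= max) {
              count[0]++;
            }
          }

          @Override
          public PointValues.Relation compare(byte[] minPacked, byte[] maxPacked) {
            return PointValues.Relation.CELL_CROSSES_QUERY; // force per-value visits
          }
        });
  }
}
```

A caller would obtain the cursor via `leafReader.getPointValues(field).getPointTree()` and invoke this once per bucket (or, more efficiently, walk all buckets in a single pass).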
Re: [PR] Let Decompressor implement the Closeable interface. [lucene]
mulugetam commented on PR #14438:
URL: https://github.com/apache/lucene/pull/14438#issuecomment-2779253165

> Unfortunately, you can't easily use close() to release resources from a Decompressor, because `StoredFieldsReader` is cloneable, and close() is never called on the clones. The only workaround that comes to mind would consist of using thread-locals, but I don't think we want to support that.

Got it, thanks!
Re: [PR] Let Decompressor implement the Closeable interface. [lucene]
mulugetam closed pull request #14438: Let Decompressor implement the Closeable interface.
URL: https://github.com/apache/lucene/pull/14438
Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]
houserjohn commented on issue #14431:
URL: https://github.com/apache/lucene/issues/14431#issuecomment-2779350159

Apologies if I am misunderstanding your question, but the case it is great for is right after the full indexing of your documents. The indexing likely created many deletes during the process, which can accumulate and affect performance over time. If we use `forceMergeDeletes`, we can address this issue before we finish the full indexing stage. If we do not wait, then this problem will carry on past our full indexing step, which may not be ideal for some use cases. Without a timeout you have less control over how long your full indexing step takes, and some deletes being addressed is better than none.
Re: [PR] Remove slices creation overhead from IndexSearcher constructor [lucene]
jainankitk commented on code in PR #14428:
URL: https://github.com/apache/lucene/pull/14428#discussion_r2026306457

## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
@@ -223,12 +223,6 @@ public IndexSearcher(IndexReaderContext context, Executor executor) {
     executor == null ? new TaskExecutor(Runnable::run) : new TaskExecutor(executor);

Review Comment:
We need to use `DIRECT_TASK_EXECUTOR` here, for `taskExecutor == TaskExecutor.DIRECT_TASK_EXECUTOR` in `computeAndCacheSlices` to be true, right?
Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]
vigyasharma commented on PR #14173:
URL: https://github.com/apache/lucene/pull/14173#issuecomment-2744585265

re: using long for graph node ids, I can see how using int ordinals can be limiting for the number of vectors we can index per segment. However, adapting to long node ids is also a non-trivial change because of how our current code is structured. I described it in more detail [here](https://github.com/apache/lucene/pull/14173#issuecomment-2629167044).

I think we should do both (multi-vectors and long node ids) but decouple these two changes. If we can support multi-vectors without impacting performance for existing single-vector use cases, we should add that support. Separately, we should work on moving to long nodeIds, which will give us the full power of multi-vector models.
Re: [I] TestIndexSortBackwardsCompatibility.testSortedIndexAddDocBlocks fails reproducibly [lucene]
dweiss commented on issue #14344:
URL: https://github.com/apache/lucene/issues/14344#issuecomment-2776875359

This has been fixed. Closing.
Re: [I] Incorrect use of fsync [lucene]
viliam-durina commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2770735674

Without fsync, there's no guarantee that anything you wrote was actually written; the OS delays the writes. If you close, the data is still in memory; the OS later tries to write it, fails with an IO error, has no way to tell you about the problem, and discards the pages. If you read the file back, it might be garbage, truncated, or something else, and no IO error was ever reported, let alone a system crash.
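The failure mode described above is why an explicit fsync matters. A minimal sketch of the distinction, using plain NIO (the path, data, and method name are illustrative):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

static void writeDurably(Path path, byte[] data) throws IOException {
  try (FileChannel ch =
      FileChannel.open(path, StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
    ch.write(ByteBuffer.wrap(data));
    // write() only dirties the page cache; a later asynchronous flush can fail
    // with nobody left to report the error to. force(true) (fsync) makes that
    // failure surface here, as an IOException, while the caller can still act.
    ch.force(true);
  }
}
```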
Re: [PR] Disable HNSW connectedComponents (#14214) [lucene]
benwtrent merged PR #14436:
URL: https://github.com/apache/lucene/pull/14436
Re: [I] Add more information to IOContext [lucene]
rmuir commented on issue #14422:
URL: https://github.com/apache/lucene/issues/14422#issuecomment-2767743265

> [@rmuir](https://github.com/rmuir) I'm curious if you can expand a bit more on what you have in mind; what you are describing sounds to me like how things are today, where `ReadAdvice.RANDOM` is the context and the `MADV_RANDOM` flag to `madvise` is the implementation. So I'm sure I'm missing something.

I think today the context is trying too hard to direct the implementation. For reference, the current structure looks like this:

```java
public enum Context { MERGE, FLUSH, DEFAULT };

public enum ReadAdvice { NORMAL, RANDOM, SEQUENTIAL, RANDOM_PRELOAD }

// dictating implementation: let the Directory decide this
public static final IOContext DEFAULT = new IOContext(Constants.DEFAULT_READADVICE);

// dictating implementation: let the Directory decide this
public static final IOContext READONCE = new IOContext(ReadAdvice.SEQUENTIAL);

// a jazillion IOContext ctors taking different types of Objects, all of which
// set various defaults sneakily instead of letting Directory decide.
```

If I wanted to baby-step this, I'd walk the whole thing back to an `int`, as a start. I definitely think a codec should be able to OR together multiple different flags, and the directory should be able to make use of them in a generic way. Feel free to translate this into EnumSets or whatever Java does :) This is just brainstorming and not properly thought out, so these are just examples:

The codec should be able to specify that the file is one of, say, `METADATA_FILE`, `DATA_FILE`, `INDEX_FILE` (these might be bad names, but think slurped-metadata vs stored-fields-data vs stored-fields-index). This allows the Directory to make decisions based upon the general category of a file (or CFS range?), without hardcoded file extensions, etc. Ideally type-safe(ish), which is an improvement over matching "*.fdt" or other hacks.

The codec should be able to specify the purpose of the file, say `POSTINGS`, `STORED_FIELDS`, `VECTORS`. Similar reasons to the above: give the "context"; file extensions are not the solution. I don't even see any reason to pass `.class` names of formats here, that's also limiting. We should be able to tell the directory it's `TERMS`, even though that's sort of an impl detail of PostingsFormat and not a first-class codec API citizen. It is just a flag, let's be practical.

Separately, the codec should maybe be able to specify stuff such as whether a file is accessed `SEQUENTIAL` or `RANDOM`: and that's just expressing what may happen, unrelated and disconnected from read-ahead or other possible implementations. But I've given this piece no thought; I think the first problem is that the Directory doesn't even have the "basic context", such as a file's general category and purpose. And Directory should be in control of all "defaults"; push those out of Codec, IOContext, etc.
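A rough translation of that brainstorm into the EnumSet shape mentioned above might look like the sketch below. None of these types exist in Lucene today; the names are taken straight from the comment and are purely illustrative.

```java
import java.util.EnumSet;
import java.util.Set;

enum FileCategory { METADATA_FILE, DATA_FILE, INDEX_FILE }

enum FilePurpose { POSTINGS, TERMS, STORED_FIELDS, VECTORS }

enum AccessPattern { SEQUENTIAL, RANDOM }

// The codec states what the file *is*; the Directory decides what to do about it.
record IOHints(FileCategory category, FilePurpose purpose, Set<AccessPattern> access) {}

// e.g. a codec opening a stored-fields data file might pass:
// new IOHints(FileCategory.DATA_FILE, FilePurpose.STORED_FIELDS,
//     EnumSet.of(AccessPattern.SEQUENTIAL))
```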
Re: [PR] Fix HistogramCollector to not create zero-count buckets. [lucene]
jainankitk commented on code in PR #14421:
URL: https://github.com/apache/lucene/pull/14421#discussion_r2020439736

## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java:
@@ -279,7 +279,9 @@ public void collect(DocIdStream stream) throws IOException {
   public void finish() throws IOException {
     // Put counts that we computed in the int[] back into the hash map.
     for (int i = 0; i < counts.length; ++i) {
-      collectorCounts.addTo(leafMinBucket + i, counts[i]);
+      if (counts[i] != 0) {
+        collectorCounts.addTo(leafMinBucket + i, counts[i]);
+      }
     }
     checkMaxBuckets(collectorCounts.size(), maxBuckets);

Review Comment:
Sounds fair. I was expecting a low bound like 1024; just wanted to confirm!
Re: [PR] Add leafReaders() Method to IndexReader and Unit Test [lucene]
benwtrent commented on PR #14370:
URL: https://github.com/apache/lucene/pull/14370#issuecomment-2736659338

Eh, I am not sold that this change needs to occur, if ever. While "this is how it's always been" isn't a good argument for some things, I think expanding the public, and then backwards-compatible, API needs careful consideration. I am not sure the syntactic sugar is worth the cavities.

If others in the community really want this, I am cool with it. My opinion is that it isn't necessary.
Re: [PR] Fix test delta in minMaxScalarQuantize [lucene]
benwtrent merged PR #14403:
URL: https://github.com/apache/lucene/pull/14403
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 commented on PR #14417:
URL: https://github.com/apache/lucene/pull/14417#issuecomment-2766081605

Thanks, @vigyasharma

> I think we can add a couple more tests to make it robust.
>
> 1. Some tests around concurrency – index with multiple threads, then advance the counter in one of the threads, and validate behavior. You can look at `ThreadedIndexingAndSearchingTestCase` and its derived tests for motivation.
> 2. A test for the crash-recovery scenario, which I suppose is the primary use case. We could make the writer index a bunch of docs, then kill it, start a new writer on the same index, and advance its counter.

I have added the following tests according to the suggestions:
1. For the crash-recovery scenario, I have added `TestIndexWriter#testAdvanceSegmentCounterInCrashAndRecoveryScenario`.
2. For scenarios with multi-threaded concurrent writes, I have modified the test `ThreadedIndexingAndSearchingTestCase`: in the concurrent write thread, I added logic to randomly modify the segment counter, with a check after all write threads have ended.
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
jimczi commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2027537161

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
    * The default implementation is empty
    */
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
Yep, then maybe using the file name extension is more consistent? That would be `VEC`, `VEX`, `VEQ` and `VEB`
Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]
jpountz commented on issue #14431:
URL: https://github.com/apache/lucene/issues/14431#issuecomment-2778108878

For my understanding, what is the benefit of waiting until the timeout is reached rather than not waiting at all?
Re: [I] build support: java 24 [lucene]
rmuir commented on issue #14379:
URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740946896

maybe this one will fix it long-term and this is the last time we have to go thru it? unfortunately it just went live in java 24 so it doesn't help us now: https://openjdk.org/jeps/484
Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]
vsop-479 commented on PR #14371:
URL: https://github.com/apache/lucene/pull/14371#issuecomment-2774626745

@gf2121, I measured it on a linux server (`uses preferredBitSize=512; FMA enabled`); there is still a massive slowdown. I will dig more ...

```
Benchmark                                       Mode  Cnt    Score   Error   Units
AdvanceSparseDISIBenchmark.advance             thrpt   15  386.100 ± 0.162  ops/ms
AdvanceSparseDISIBenchmark.advanceVector       thrpt   15  162.697 ± 0.581  ops/ms
AdvanceSparseDISIBenchmark.advanceExact        thrpt   15  437.998 ± 0.644  ops/ms
AdvanceSparseDISIBenchmark.advanceExactVector  thrpt   15  271.823 ± 0.625  ops/ms
```
Re: [I] Incorrect use of fsync [lucene]
viliam-durina commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771736474

> every application everywhere and everytime would need to call fsync on close

Yeah, I don't know what the use case for not fsyncing before closing would be. Maybe if you don't care whether the file is written or not, but then why are you writing it in the first place? If you don't fsync, it's like submitting an async operation, but not checking the future's result. It will mostly succeed, but if not, you won't know.

> If a temporary file in the lucene index is corrupted then lucene will bail out

Are we sure that we can detect every possible corruption in the file? E.g. if the file size is 0, but there should have been some data? If so, then yes, the commit will crash, which is better than producing an incorrect index. Only the exception will be misleading: instead of an IOException you'll have unexpected EOFExceptions, buffer overflows, array indexes out of bounds, invalid values, etc. A checksum violation is the best of these. And you won't be able to reproduce any of these errors in a test. It's unlikely that you'll think that there was a hidden IO failure.

The unresolved issue #10906 is also a result of a missing fsync, IMO.

For temporary files, we should either fsync before closing, or start reading without closing the file. The latter is more complicated, but has no performance hit.
Re: [PR] Adding profiling support for concurrent segment search [lucene]
jainankitk commented on code in PR #14413:
URL: https://github.com/apache/lucene/pull/14413#discussion_r2029297568

## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/QueryProfilerBreakdown.java:
@@ -17,46 +17,113 @@
 package org.apache.lucene.sandbox.search;

+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
 import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+import org.apache.lucene.search.Query;
 import org.apache.lucene.util.CollectionUtil;

 /**
  * A record of timings for the various operations that may happen during query execution. A node's
  * time may be composed of several internal attributes (rewriting, weighting, scoring, etc).
  */
 class QueryProfilerBreakdown {
-
-  /** The accumulated timings for this query node */
-  private final QueryProfilerTimer[] timers;
+  private static final Collection QUERY_LEVEL_TIMING_TYPE =
+      Arrays.stream(QueryProfilerTimingType.values()).filter(t -> !t.isSliceLevel()).toList();
+  private final Map queryProfilerTimers;
+  private final ConcurrentMap threadToSliceBreakdown;

Review Comment:
> We should look into reliably identifying slices, or breaking down by thread rather than by slice?

The code currently breaks down by thread only, even though it mentions slice. I incorrectly assumed that every slice is processed by a unique thread, which need not be true.

Thinking a bit more on this, I feel a breakdown by thread is good enough, as our primary intent is to understand the concurrent processing. If slice 1 and slice 2 are processed sequentially by thread A, it is actually better to aggregate that information as thread A's breakdown instead of separate breakdowns for slice 1 and slice 2. Thoughts?
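A minimal sketch of what a purely thread-keyed breakdown could look like; the `ThreadBreakdown` container is hypothetical, and the point is only that the map key comes from the executing thread, so slices run sequentially on one thread naturally aggregate into a single entry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class PerThreadProfile {
  // Hypothetical holder for the slice-level timers of one thread.
  static final class ThreadBreakdown {}

  private final ConcurrentMap<Long, ThreadBreakdown> perThread = new ConcurrentHashMap<>();

  ThreadBreakdown current() {
    // Slice 1 and slice 2 run sequentially by thread A land in the same entry.
    return perThread.computeIfAbsent(
        Thread.currentThread().getId(), id -> new ThreadBreakdown());
  }
}
```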
Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]
jpountz commented on PR #14364:
URL: https://github.com/apache/lucene/pull/14364#issuecomment-2738217161

Agreed!
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
jpountz commented on PR #14401:
URL: https://github.com/apache/lucene/pull/14401#issuecomment-2766473986

It is unexpected indeed! I'll fix this and add a CHANGES entry.
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on PR #14412:
URL: https://github.com/apache/lucene/pull/14412#issuecomment-2766561573

@jpountz Thanks for expanding on the reasoning above w.r.t. query cache usage. I was working on [refactoring](https://github.com/apache/lucene/issues/14222) the query cache so that it is not a bottleneck itself when utilized well. Maybe we can add a caching policy that considers high upfront cost or query latency, so that we only cache queries which will give us the benefit. Let me know if it's still worth a shot; meanwhile, we can work on optimizing the queries themselves. I can open up a PR soon for that.
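For what a cost-aware admission policy could look like against Lucene's existing `QueryCachingPolicy` interface; the cost estimation itself is left abstract and is entirely hypothetical, the interface is real.

```java
import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryCachingPolicy;

// Rough sketch: only admit queries whose caching is expected to pay off.
// How estimatedCostNanos is measured (e.g. from recorded per-query timings)
// is deliberately left out.
class CostAwareCachingPolicy implements QueryCachingPolicy {
  private final long minCostNanos;

  CostAwareCachingPolicy(long minCostNanos) {
    this.minCostNanos = minCostNanos;
  }

  @Override
  public void onUse(Query query) {
    // record usage/latency stats for this query (omitted)
  }

  @Override
  public boolean shouldCache(Query query) throws IOException {
    return estimatedCostNanos(query) >= minCostNanos;
  }

  private long estimatedCostNanos(Query query) {
    return 0; // placeholder: would come from the recorded stats above
  }
}
```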
Re: [I] Address gradle temp file pollution insanity [lucene]
dweiss closed issue #14385: Address gradle temp file pollution insanity
URL: https://github.com/apache/lucene/issues/14385
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
jpountz commented on PR #14412:
URL: https://github.com/apache/lucene/pull/14412#issuecomment-2765833637

Sorry I'm not sure I get your suggestion.
Re: [I] Incorrect use of fsync [lucene]
viliam-durina commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771963385

> Please stop arguing here about problems that don't exist. Issue https://github.com/apache/lucene/issues/10906 has nothing to do with temporary files.

This issue is not only about temporary files, but about the way Lucene does or does not do fsync. Issue #10906, I think, was about a lost IOException due to closing without fsyncing. I used it to illustrate the kind of errors we get from corrupted files if we don't check the IO errors correctly.

> This is not possible because Lucene solely uses MMap for reading files, and for writing it uses OutputStreams, which can't be reused to read again.

It's possible to continue using the `OutputStream` API for writing and the mmap API for reading, with a single underlying read-write file descriptor:

```java
import static java.lang.Math.toIntExact;
import static java.nio.channels.FileChannel.MapMode.READ_ONLY;
import static java.nio.file.StandardOpenOption.CREATE_NEW;
import static java.nio.file.StandardOpenOption.READ;
import static java.nio.file.StandardOpenOption.WRITE;

import java.io.IOException;
import java.io.OutputStream;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public static void main(String[] args) throws IOException {
    Path tempFolder = Files.createTempDirectory("foo");
    Path filePath = tempFolder.resolve("test_file");
    try (FileChannel ch = FileChannel.open(filePath, READ, WRITE, CREATE_NEW)) {
        // writing
        OutputStream os = Channels.newOutputStream(ch);
        os.write("hello world".getBytes());
        // When done writing, we must not close the output stream, because it will
        // close the channel. But we must flush it so that all in-flight changes are
        // sent to the underlying channel (e.g. if the output stream was wrapped in
        // some buffering output stream).
        os.flush();

        // reading
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = ch.map(READ_ONLY, 0, ch.size(), arena);
            byte[] buf = new byte[toIntExact(segment.byteSize())];
            segment.asByteBuffer().get(buf);
            System.out.println("Data read: " + new String(buf));
        }
    }
    // channel closed now, we can delete the file
    Files.delete(filePath);
    Files.delete(tempFolder);
}
```

For example, `Directory.createTempOutput()` can open such a read-write channel, create an `OutputStream` from it, and use that to create an `IndexOutput`, so that existing code doesn't have to be modified. Then the `IndexOutput` should not be closed, but it can be converted to an `IndexInput`, e.g. with a new method `IndexOutput.convertToInput()`, reusing the file channel. We need to figure out how to handle the `IOContext` with this. When the `IndexInput` is closed, the channel is closed.

There are probably many ways to implement this; the above was just an example. I have checked a few of the `createTempOutput` usages; they might be pretty straightforward to update.
Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]
jainankitk commented on code in PR #14435:
URL: https://github.com/apache/lucene/pull/14435#discussion_r2029301580

## lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
@@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String value) {
   public static Query newSetQuery(String field, Collection values) {
     Objects.requireNonNull(field, "field must not be null");
     Objects.requireNonNull(values, "values must not be null");
-    Query indexQuery = new TermInSetQuery(field, values);
-    Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, field, values);
-    return new IndexOrDocValuesQuery(indexQuery, dvQuery);
+    return TermInSetQuery.newIndexOrDocValuesQuery(
+        MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values);

Review Comment:
We should remove the unused method `TermInSetQuery#newIndexOrDocValuesQuery(String field, Collection terms)` in that case, as anyone invoking `TermInSetQuery#newIndexOrDocValuesQuery` should use the expert method `TermInSetQuery#newIndexOrDocValuesQuery(RewriteMethod indexRewriteMethod, String field, Collection terms)`?
Re: [I] Case insensitive regex query with character range [lucene]
rmuir commented on issue #14378:
URL: https://github.com/apache/lucene/issues/14378#issuecomment-2741911493

we are better armed to begin looking at this one after recent changes, see https://github.com/apache/lucene/pull/14193#issuecomment-2638840849

now that there is an actual single parsing node that correlates with a single automaton state for a character class, we can support the caseless ranges using the "icu algorithm" (iterating over every char in the range and adding its variants) without an explosion of objects: but unicode space is big, so it will use up some CPU time during parsing. This is why I think it needs an opt-in flag.

Also we have the potential problem that Automaton.finishState() will collapse a huge `int[]` into efficient ranges, but it doesn't seem to ever free the memory. e.g. if I add 50,000 transitions to it and then it reduces them to one simple range, I think the underlying array is still 50,000. I will attempt to test and address this first if possible; let's avoid memory issues.
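A conceptual sketch of that per-code-point expansion; `addTransition` is a placeholder for an automaton-builder call, not Lucene's actual API, and real Unicode case folding has more variants than simple upper/lower.

```java
// Walk each code point in a character-class range and add its simple case
// variants as extra transitions; this trades parse-time CPU for correctness,
// which is why it would sit behind an opt-in flag.
void addCaselessRange(int state, int from, int to) {
  for (int cp = from; cp <= to; cp++) {
    addTransition(state, cp, cp);
    int upper = Character.toUpperCase(cp);
    if (upper != cp) addTransition(state, upper, upper);
    int lower = Character.toLowerCase(cp);
    if (lower != cp) addTransition(state, lower, lower);
    // full case folding adds further variants (titlecase, Kelvin sign,
    // dotless i, ...) that would come from Unicode case-folding data
  }
}

void addTransition(int state, int min, int max) {
  // placeholder: record a [min, max] transition on the given state
}
```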
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
benwtrent commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2027504697

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
    * The default implementation is empty
    */
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
> I wonder if we should rather reflect the underlying format here. Something like flat_vector_float, flat_vector_byte, flat_vector_bbq?

IMO, this is then transforming into just the codec format name, as I would expect different formats to have different characteristics, even between versions
Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]
github-actions[bot] commented on PR #14389:
URL: https://github.com/apache/lucene/pull/14389#issuecomment-2779945827

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Integrating GPU based Vector Search using cuVS [lucene]
github-actions[bot] commented on PR #14131:
URL: https://github.com/apache/lucene/pull/14131#issuecomment-2779945929

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [I] Add more information to IOContext [lucene]
jainankitk commented on issue #14422:
URL: https://github.com/apache/lucene/issues/14422#issuecomment-2767494904

Looks related to the discussion here - https://github.com/apache/lucene/issues/14348#issuecomment-2730902349. More general than specifically vector files
[PR] Remove slices creation overhead from IndexSearcher constructor [lucene]
javanna opened a new pull request, #14428:
URL: https://github.com/apache/lucene/pull/14428

We simplified the slices creation in IndexSearcher with #13893. That removed the need for a caching supplier, but had an unexpected effect: we now eagerly create slices for the case where no executor is provided, and that may cause some overhead. There are situations in which a searcher is created but the search methods that trigger the creation of slices are never called; in that case we can avoid going through that trouble.
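The deferral could be as simple as a memoized supplier; below is a minimal sketch of the idea (the class and field names are illustrative, not the actual IndexSearcher change; `IndexSearcher.LeafSlice` is Lucene's type).

```java
import java.util.function.Supplier;
import org.apache.lucene.search.IndexSearcher.LeafSlice;

// Hypothetical memoized holder: slices are computed at most once, and only if
// a search method actually asks for them.
final class LazySlices {
  private final Supplier<LeafSlice[]> compute;
  private volatile LeafSlice[] slices;

  LazySlices(Supplier<LeafSlice[]> compute) {
    this.compute = compute;
  }

  LeafSlice[] get() {
    LeafSlice[] s = slices;
    if (s == null) {
      synchronized (this) { // double-checked locking: compute once, race-free
        s = slices;
        if (s == null) {
          slices = s = compute.get();
        }
      }
    }
    return s;
  }
}
```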