Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-04 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2028456102


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
* The default implementation is empty
*/
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
   The file extension is quite cute, but discloses more of the implementation details to the caller, which seems unnecessary.
   
   I deliberately chose to represent the categories as a `String` (rather than 
say an enum), since this offers more flexibility to the implementation while 
standardising on the three main categories.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Adding profiling support for concurrent segment search [lucene]

2025-04-04 Thread via GitHub


jpountz commented on code in PR #14413:
URL: https://github.com/apache/lucene/pull/14413#discussion_r2028458611


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/QueryProfilerBreakdown.java:
##
@@ -17,46 +17,113 @@
 
 package org.apache.lucene.sandbox.search;
 
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
 import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+import org.apache.lucene.search.Query;
 import org.apache.lucene.util.CollectionUtil;
 
 /**
  * A record of timings for the various operations that may happen during query 
execution. A node's
  * time may be composed of several internal attributes (rewriting, weighting, 
scoring, etc).
  */
 class QueryProfilerBreakdown {
-
-  /** The accumulated timings for this query node */
-  private final QueryProfilerTimer[] timers;
+  private static final Collection 
QUERY_LEVEL_TIMING_TYPE =
+  Arrays.stream(QueryProfilerTimingType.values()).filter(t -> 
!t.isSliceLevel()).toList();
+  private final Map 
queryProfilerTimers;
+  private final ConcurrentMap 
threadToSliceBreakdown;

Review Comment:
   There seems to be a general problem with how profiling figures out which 
slice it runs in. We should look into reliably identifying slices, or breaking 
down by thread rather than by slice?
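Breaking the profile down by thread rather than by slice can be illustrated without any Lucene types. The class below is a toy stand-in (the name and structure are assumptions, not the PR's `QueryProfilerBreakdown`): timings are keyed by the calling thread's id, so no slice identification is needed at all.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Toy per-thread profiler breakdown: each search thread accumulates timings
// under its own thread id, so there is no need to identify which slice the
// current collector belongs to.
class PerThreadBreakdown {
  private final ConcurrentHashMap<Long, LongAdder> nanosByThread = new ConcurrentHashMap<>();

  // Record elapsed nanos against the calling thread.
  void record(long nanos) {
    nanosByThread
        .computeIfAbsent(Thread.currentThread().getId(), id -> new LongAdder())
        .add(nanos);
  }

  // Merge all per-thread totals into one figure for reporting.
  long totalNanos() {
    long total = 0;
    for (LongAdder a : nanosByThread.values()) {
      total += a.sum();
    }
    return total;
  }

  Map<Long, LongAdder> view() {
    return nanosByThread;
  }
}
```

`computeIfAbsent` on `ConcurrentHashMap` makes the per-thread registration race-free, which is the property a slice-based keying scheme struggles to guarantee.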
   
   






Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-04 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2028417528


##
lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:
##
@@ -257,6 +259,19 @@ public long ramBytesUsed() {
 return size;
   }
 
+  @Override
+  public Map getOffHeapByteSize(FieldInfo fieldInfo) {
+Objects.requireNonNull(fieldInfo);
+var raw = rawVectorsReader.getOffHeapByteSize(fieldInfo);
+var fieldEntry = fields.get(fieldInfo.name);
+if (fieldEntry == null) {
+  assert fieldInfo.getVectorEncoding() == VectorEncoding.BYTE;

Review Comment:
   I did think that too, however both this format and the ScalarQuantizedFormat 
can be used directly with bytes - in which case there is no auto-quantization, 
just a byte vector store.






Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-04 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2028456102


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
* The default implementation is empty
*/
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
   ~The file extension is quite cute, but discloses more of the implementation 
details to the caller which seems unnecessary.~
   
   ~I deliberately chose to represent the categories as a `String` (rather than 
say an enum), since this offers more flexibility to the implementation while 
standardising on the three main categories.~






Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-04 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2028541365


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
* The default implementation is empty
*/
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
   I chatted with Jim offline; the file extension relates nicely to how preloading currently works, and makes that connection more concrete, which is a very nice property. It also reduces the surface of the API change, allowing us to remove the constant strings. I'll update the code to use the file extension.
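Assuming the API ends up returning a per-field map keyed by file extension (e.g. `vec`, `vex`, `veq`, `veb`), a caller could total the per-field maps as sketched below. This is a hedged sketch of the aggregation idea only, not the PR's final API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy aggregation of per-field off-heap sizes keyed by file extension, e.g.
// "vec" (raw vectors), "vex" (HNSW graph), "veq"/"veb" (quantized vectors).
class OffHeapSizes {
  // Sum several per-field maps into one extension -> total-bytes map.
  static Map<String, Long> mergeAll(Iterable<Map<String, Long>> perField) {
    Map<String, Long> total = new HashMap<>();
    for (Map<String, Long> m : perField) {
      m.forEach((ext, bytes) -> total.merge(ext, bytes, Long::sum));
    }
    return total;
  }
}
```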






Re: [I] QueryParser parsing a phrase with a wildcard [lucene]

2025-04-04 Thread via GitHub


mkhludnev commented on issue #14440:
URL: https://github.com/apache/lucene/issues/14440#issuecomment-2778755848

   Just a gentle reminder about `ComplexPhraseQueryParser` 





Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]

2025-04-04 Thread via GitHub


jainankitk commented on code in PR #14435:
URL: https://github.com/apache/lucene/pull/14435#discussion_r2028202175


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String 
value) {
   public static Query newSetQuery(String field, Collection values) {
 Objects.requireNonNull(field, "field must not be null");
 Objects.requireNonNull(values, "values must not be null");
-Query indexQuery = new TermInSetQuery(field, values);
-Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, 
field, values);
-return new IndexOrDocValuesQuery(indexQuery, dvQuery);
+return TermInSetQuery.newIndexOrDocValuesQuery(
+MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values);

Review Comment:
   Then we don't need that method, right?






Re: [PR] New IndexReaderFunctions.positionLength from the norm [lucene]

2025-04-04 Thread via GitHub


dsmiley commented on PR #14433:
URL: https://github.com/apache/lucene/pull/14433#issuecomment-2778653535

   I'd expect a hypothetical `IndexReaderFunctions.numTerms(field)` to return 
the number of terms in the index for that field.  That's not even close to what 
we want!  "Length" should be a component of the name.





[I] [Release Wizard] Move choice for signing the release to when is needed [lucene]

2025-04-04 Thread via GitHub


iverase opened a new issue, #14441:
URL: https://github.com/apache/lucene/issues/14441

   In `Prerequisites/GPG key id is configured`, the release wizard asks for the method for signing the release:
   
   ```
   Q: Do you want to sign the release with gradle plugin? No means gpg (y/n):
   ```
   
   For most of us (at least for me), this choice is very obscure, and its consequences are not very clear. This is probably the only time you have to deal with gpg!
   
   As far as I understand, this choice is only used when calling the script `buildAndPushRelease.py`, which uses a different signing method based on the choice above. This happens later in the release wizard, under `Build the release artifacts\Build the release candidate`. If something goes wrong because of the choice above, it is actually difficult to track down why.
   
   I wonder if the question should be moved to just before running the script, so there is more connection between actions and results.





Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]

2025-04-04 Thread via GitHub


jpountz commented on code in PR #14435:
URL: https://github.com/apache/lucene/pull/14435#discussion_r2028403966


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String 
value) {
   public static Query newSetQuery(String field, Collection values) {
 Objects.requireNonNull(field, "field must not be null");
 Objects.requireNonNull(values, "values must not be null");
-Query indexQuery = new TermInSetQuery(field, values);
-Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, 
field, values);
-return new IndexOrDocValuesQuery(indexQuery, dvQuery);
+return TermInSetQuery.newIndexOrDocValuesQuery(
+MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values);

Review Comment:
   I would agree with this. `KeywordField#newSetQuery` is the convenience 
method, so it's nice to not have to pass a rewrite, 
`TermInSetQuery#newIndexOrDocValuesQuery` is the expert method, and experts can 
figure out how they want their multi-term queries to be rewritten.






Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-04 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2029105175


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
* The default implementation is empty
*/
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
   Done.






[PR] Adding logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-04 Thread via GitHub


jainankitk opened a new pull request, #14439:
URL: https://github.com/apache/lucene/pull/14439

   ### Description
   
   This PR adds multi-range traversal logic to collect the histogram on a numeric field indexed as pointValues, for MATCH_ALL cases. Even for non-MATCH_ALL cases like `PointRangeQuery`, this logic can be used if the query field == histogram field. For the latter, we need to supply the `PointRangeQuery` bounds for building the appropriate `Ranges` to be collected.
   
   One of the key assumptions is the absence of any deleted documents. Maybe going forward (especially if the deleted documents percentage is low), we can consider correcting the collected `Ranges` by subtracting counts for deleted documents. Although, if I remember correctly, getting histogram field values for just the deleted documents was a non-trivial task!
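The core trick behind multi-range traversal can be illustrated with plain bucket arithmetic. This is a toy sketch under the assumption of fixed-width buckets, not the PR's actual point-tree API: when a subtree's value range falls entirely inside one histogram bucket, its doc count can be added in bulk instead of visiting each document.

```java
// Toy version of the multi-range counting idea (not the PR's actual API):
// if a point-tree node's [min, max] value range falls entirely inside one
// fixed-width histogram bucket, the node's doc count can be added in bulk
// instead of visiting every document under it.
class RangeCount {
  static long bucketOf(long value, long bucketWidth) {
    return Math.floorDiv(value, bucketWidth);
  }

  // True when the whole node can be counted into a single bucket.
  static boolean fullyContained(long minValue, long maxValue, long bucketWidth) {
    return bucketOf(minValue, bucketWidth) == bucketOf(maxValue, bucketWidth);
  }
}
```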





Re: [PR] Let Decompressor implement the Closeable interface. [lucene]

2025-04-04 Thread via GitHub


mulugetam commented on PR #14438:
URL: https://github.com/apache/lucene/pull/14438#issuecomment-2779253165

   > Unfortunately, you can't easily use close() to release resources from a 
Decompressor, because `StoredFieldsReader` is cloneable, and close() is never 
called on the clones. The only workaround that comes to mind would consist of 
using thread-locals, but I don't think we want to support that.
   
   Got it, thanks!





Re: [PR] Let Decompressor implement the Closeable interface. [lucene]

2025-04-04 Thread via GitHub


mulugetam closed pull request #14438: Let Decompressor implement the Closeable 
interface.
URL: https://github.com/apache/lucene/pull/14438





Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]

2025-04-04 Thread via GitHub


houserjohn commented on issue #14431:
URL: https://github.com/apache/lucene/issues/14431#issuecomment-2779350159

   Apologies if I am misunderstanding your question, but the case it is great for is right after the full indexing of your documents. The indexing likely created many deletes along the way, which can accumulate and affect performance over time. If we use `forceMergeDeletes`, we can address this issue before we finish the full indexing stage. If we do not wait, then this problem will carry on past our full indexing step, which may not be ideal for some use cases. Without a timeout, you have less control over how long your full indexing step takes, and some deletes being addressed is better than none.





Re: [PR] Remove slices creation overhead from IndexSearcher constructor [lucene]

2025-04-04 Thread via GitHub


jainankitk commented on code in PR #14428:
URL: https://github.com/apache/lucene/pull/14428#discussion_r2026306457


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -223,12 +223,6 @@ public IndexSearcher(IndexReaderContext context, Executor 
executor) {
 executor == null ? new TaskExecutor(Runnable::run) : new 
TaskExecutor(executor);

Review Comment:
   We need to use `DIRECT_TASK_EXECUTOR` here, for `taskExecutor == 
TaskExecutor.DIRECT_TASK_EXECUTOR` in `computeAndCacheSlices` to be true, right?






Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

2025-04-04 Thread via GitHub


vigyasharma commented on PR #14173:
URL: https://github.com/apache/lucene/pull/14173#issuecomment-2744585265

   re: using long for graph node ids, I can see how using int ordinals can be 
limiting for the no. of vectors we can index per segment. However, adapting to 
long node ids is also a non-trivial change because of how our current code is 
structured. I described it in more detail 
[here](https://github.com/apache/lucene/pull/14173#issuecomment-2629167044).
   
   I think we should do both (multi-vectors and long node ids) but decouple 
these two changes. If we can support multi-vectors without impacting 
performance for existing single-vector use cases, we should add that support. 
Separately, we should work on moving to long nodeIds, which will give us the 
full power of multi-vector models.





Re: [I] TestIndexSortBackwardsCompatibility.testSortedIndexAddDocBlocks fails reproducibly [lucene]

2025-04-04 Thread via GitHub


dweiss commented on issue #14344:
URL: https://github.com/apache/lucene/issues/14344#issuecomment-2776875359

   This has been fixed. Closing.





Re: [I] Incorrect use of fsync [lucene]

2025-04-04 Thread via GitHub


viliam-durina commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2770735674

   Without fsync, there's no guarantee that anything you wrote was actually written. The OS delays the writes. If you close the file, the data is still in memory; if the OS later tries to write it and fails with an IO error, it has no way to tell you about the problem, and it discards the pages. If you read the data back, it might be garbage, truncated, or something else, with no IO error ever reported, and all without any system crash.
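In Java terms, the durability point above corresponds to `FileChannel.force(true)`. A minimal sketch (a hypothetical helper, not Lucene's `FSDirectory` sync path):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal illustration of the durability point: close() only hands data to
// the OS page cache, while force(true) (fsync) flushes data and metadata to
// stable storage and surfaces write errors to the caller.
class DurableWrite {
  static void writeDurably(Path path, byte[] data) throws IOException {
    try (FileChannel ch =
        FileChannel.open(
            path,
            StandardOpenOption.CREATE,
            StandardOpenOption.WRITE,
            StandardOpenOption.TRUNCATE_EXISTING)) {
      ch.write(ByteBuffer.wrap(data));
      ch.force(true); // without this, a later background write failure is silent
    }
  }
}
```

`force(true)` flushes file content and metadata; `force(false)` would flush content only. Either way, an IO error surfaces here, at a point where the application can still react to it.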





Re: [PR] Disable HNSW connectedComponents (#14214) [lucene]

2025-04-04 Thread via GitHub


benwtrent merged PR #14436:
URL: https://github.com/apache/lucene/pull/14436





Re: [I] Add more information to IOContext [lucene]

2025-04-04 Thread via GitHub


rmuir commented on issue #14422:
URL: https://github.com/apache/lucene/issues/14422#issuecomment-2767743265

   > [@rmuir](https://github.com/rmuir) I'm curious if you can expand a bit 
more on what you have in mind, what you are describing sounds to me like how 
things are today where `ReadAdvice.RANDOM` is the context and the `MADV_RANDOM` 
flag to `madvise` is the implementation. So I'm sure I'm missing something.
   
   I think today the context is trying too hard to direct the implementation. 
   For reference, the current structure looks like this:
   
   ```java
   public enum Context {
 MERGE,
 FLUSH,
 DEFAULT
   };
   
   public enum ReadAdvice {
 NORMAL,
 RANDOM,
 SEQUENTIAL,
 RANDOM_PRELOAD
   }
   
   // dictating implementation: let the Directory decide this
   public static final IOContext DEFAULT = new 
IOContext(Constants.DEFAULT_READADVICE);
   // dictating implementation: let the Directory decide this
   public static final IOContext READONCE = new 
IOContext(ReadAdvice.SEQUENTIAL);
   
   // a jazillion IOContext ctors taking different types of Objects, all of 
which
   // set various defaults sneakily instead of letting Directory decide.
   ```
   
   If I wanted to baby-step this, I'd walk the whole thing back to an `int`, as a start. I definitely think a codec should be able to OR together multiple different flags, and that the directory should be able to make use of them in a generic way. Feel free to translate this into EnumSets or whatever Java does :)
   
   This is just brainstorming and not properly thought out, so it's just examples:
   
   codec should be able to specify that the file is one of, say 
`METADATA_FILE`, `DATA_FILE`, `INDEX_FILE` (these might be bad names, but think 
slurped-metadata vs stored-fields-data vs stored-fields-index). It allows the 
Directory to make decisions based upon the general category of a file (or CFS 
range?), without hardcoded file extensions, etc. Ideally type-safe(ish) which 
is an improvement over matching "*.fdt" or other hacks.
   
   codec should be able to specify the purpose of the file, say `POSTINGS`, 
`STORED_FIELDS`, `VECTORS`. Similar reasons to the above: give the "context", 
file extensions are not the solution. I don't even see any reason to pass 
`.class` names of formats here; that's also limiting. We should be able to tell the directory it's `TERMS`, even though that's sort of an impl-detail of PostingsFormat and not a first-class codec API citizen. It is just a flag; let's be practical.
   
   Separately, codec should maybe be able to specify stuff such as whether a file is accessed `SEQUENTIAL` or `RANDOM`: that's just expressing what may happen, unrelated and disconnected from read-ahead or other possible implementations. But I've given this piece no thought; I think the first problem is that the Directory doesn't even have the "basic context" such as a file's general category and purpose. And Directory should be in control of all "defaults"; push those out of Codec, IOContext, etc.





Re: [PR] Fix HistogramCollector to not create zero-count buckets. [lucene]

2025-04-04 Thread via GitHub


jainankitk commented on code in PR #14421:
URL: https://github.com/apache/lucene/pull/14421#discussion_r2020439736


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java:
##
@@ -279,7 +279,9 @@ public void collect(DocIdStream stream) throws IOException {
 public void finish() throws IOException {
   // Put counts that we computed in the int[] back into the hash map.
   for (int i = 0; i < counts.length; ++i) {
-collectorCounts.addTo(leafMinBucket + i, counts[i]);
+if (counts[i] != 0) {
+  collectorCounts.addTo(leafMinBucket + i, counts[i]);
+}
   }
   checkMaxBuckets(collectorCounts.size(), maxBuckets);

Review Comment:
   Sounds fair. I was expecting a low bound like 1024; just wanted to confirm!
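The fix shown in the diff above amounts to the following standalone sketch (a toy, not Lucene's `HistogramCollector`): fold a dense per-leaf counts array back into the sparse map, skipping zero-count buckets so they don't inflate the bucket-cardinality check.

```java
import java.util.Map;

// Toy version of the zero-count-bucket fix: only non-zero entries of the
// dense per-leaf counts array are merged into the sparse collector map.
class BucketMerge {
  static void addCounts(Map<Long, Long> collectorCounts, long leafMinBucket, long[] counts) {
    for (int i = 0; i < counts.length; ++i) {
      if (counts[i] != 0) {
        collectorCounts.merge(leafMinBucket + i, counts[i], Long::sum);
      }
    }
  }
}
```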






Re: [PR] Add leafReaders() Method to IndexReader and Unit Test [lucene]

2025-04-04 Thread via GitHub


benwtrent commented on PR #14370:
URL: https://github.com/apache/lucene/pull/14370#issuecomment-2736659338

   Eh, I am not sold that this change needs to occur, if ever. While "this is how it's always been" isn't a good argument for some things, I think expanding the public (and thus backwards-compatible) API needs careful consideration.
   
   I am not sure the syntactic sugar is worth the cavities.
   
   If others in the community really want this, I am cool with it. My opinion 
is that it isn't necessary.





Re: [PR] Fix test delta in minMaxScalarQuantize [lucene]

2025-04-04 Thread via GitHub


benwtrent merged PR #14403:
URL: https://github.com/apache/lucene/pull/14403





Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-04-04 Thread via GitHub


guojialiang92 commented on PR #14417:
URL: https://github.com/apache/lucene/pull/14417#issuecomment-2766081605

   Thanks, @vigyasharma 
   > I think we can add a couple more tests to make it robust.
   > 
   > 1. Some tests around concurrency –  index with multiple threads, then 
advance the counter in one of the threads, and validate behavior. You can look 
at `ThreadedIndexingAndSearchingTestCase` and its derived tests for motivation.
   > 2. A test for the crash-recovery scenario, which I suppose is the primary use case. We could make the writer index a bunch of docs, then kill it, start a new writer on the same index, and advance its counter.
   
   I have added the following tests according to the suggestions:
   1. For the crash-recovery scenario, I have added `TestIndexWriter#testAdvanceSegmentCounterInCrashAndRecoveryScenario`.
   2. For scenarios with multi-threaded concurrent writes, I have modified the test `ThreadedIndexingAndSearchingTestCase`: in the concurrent write threads, I added logic that randomly advances the segment counter, and the counter is checked after all write threads have ended.





Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-04 Thread via GitHub


jimczi commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2027537161


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
* The default implementation is empty
*/
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
   Yep, then maybe using the file name extension is more consistent? That would 
be `VEC`, `VEX`, `VEQ` and `VEB` 






Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]

2025-04-04 Thread via GitHub


jpountz commented on issue #14431:
URL: https://github.com/apache/lucene/issues/14431#issuecomment-2778108878

   For my understanding, what is the benefit of waiting until the timeout is 
reached rather than not waiting at all?





Re: [I] build support: java 24 [lucene]

2025-04-04 Thread via GitHub


rmuir commented on issue #14379:
URL: https://github.com/apache/lucene/issues/14379#issuecomment-2740946896

   maybe this one will fix it long-term and this is the last time we have to go 
thru it? unfortunately it just went live in java 24 so it doesn't help us now: 
https://openjdk.org/jeps/484





Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]

2025-04-04 Thread via GitHub


vsop-479 commented on PR #14371:
URL: https://github.com/apache/lucene/pull/14371#issuecomment-2774626745

   @gf2121 , I measured it on a Linux server (`uses preferredBitSize=512; FMA 
enabled`), and there is still a massive slowdown. I will dig more ...
   
   ```
   Benchmark                                       Mode  Cnt    Score   Error   Units
   AdvanceSparseDISIBenchmark.advance             thrpt   15  386.100 ± 0.162  ops/ms
   AdvanceSparseDISIBenchmark.advanceVector       thrpt   15  162.697 ± 0.581  ops/ms
   AdvanceSparseDISIBenchmark.advanceExact        thrpt   15  437.998 ± 0.644  ops/ms
   AdvanceSparseDISIBenchmark.advanceExactVector  thrpt   15  271.823 ± 0.625  ops/ms
   ```





Re: [I] Incorrect use of fsync [lucene]

2025-04-04 Thread via GitHub


viliam-durina commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771736474

   >every application everywhere and everytime would need to call fsync on close
   
   Yeah, I don't know what the use case would be for not fsyncing before 
closing. Maybe if you don't care whether the file gets written or not, but then 
why are you writing it in the first place?
   
   If you don't fsync, it's like submitting an async operation, but not 
checking the future's result. It will mostly succeed, but if not, you won't 
know.
   
   >If a temporary file in the lucene index is corrupted then lucene will bail 
out
   
   Are we sure that we can detect every possible corruption in the file? E.g. 
if the file size is 0, but there should have been some data? If so then yes, 
the commit will crash, which is better than producing an incorrect index. Only 
the exception will be misleading: instead of an IOException you'll get 
unexpected EOFExceptions, buffer overflows, out-of-bounds array indexes, 
invalid values, etc. A checksum violation is the best of these. And you won't 
be able to reproduce any of these errors in a test, so it's unlikely that 
you'll suspect a hidden IO failure. The unresolved issue #10906 is also a 
result of a missing fsync IMO.
   
   For temporary files, we should either fsync before closing, or start reading 
without closing the file. The latter is more complicated, but has no 
performance hit.





Re: [PR] Adding profiling support for concurrent segment search [lucene]

2025-04-04 Thread via GitHub


jainankitk commented on code in PR #14413:
URL: https://github.com/apache/lucene/pull/14413#discussion_r2029297568


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/QueryProfilerBreakdown.java:
##
@@ -17,46 +17,113 @@
 
 package org.apache.lucene.sandbox.search;
 
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
 import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+import org.apache.lucene.search.Query;
 import org.apache.lucene.util.CollectionUtil;
 
 /**
  * A record of timings for the various operations that may happen during query 
execution. A node's
  * time may be composed of several internal attributes (rewriting, weighting, 
scoring, etc).
  */
 class QueryProfilerBreakdown {
-
-  /** The accumulated timings for this query node */
-  private final QueryProfilerTimer[] timers;
+  private static final Collection<QueryProfilerTimingType> QUERY_LEVEL_TIMING_TYPE =
+      Arrays.stream(QueryProfilerTimingType.values()).filter(t -> !t.isSliceLevel()).toList();
+  private final Map queryProfilerTimers;
+  private final ConcurrentMap threadToSliceBreakdown;

Review Comment:
   > We should look into reliably identifying slices, or breaking down by 
thread rather than by slice?
   
   The code currently breaks down by thread only, even though it mentions 
slices. I incorrectly assumed that every slice is processed by a unique thread, 
which need not be true. Thinking a bit more about this, I feel a breakdown by 
thread is good enough, as our primary intent is to understand the concurrent 
processing. If slice 1 and slice 2 are processed sequentially by thread A, it 
is actually better to aggregate that information as thread A's breakdown 
instead of separate breakdowns for slice 1 and slice 2. Thoughts?






Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-04-04 Thread via GitHub


jpountz commented on PR #14364:
URL: https://github.com/apache/lucene/pull/14364#issuecomment-2738217161

   Agreed!





Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-04-04 Thread via GitHub


jpountz commented on PR #14401:
URL: https://github.com/apache/lucene/pull/14401#issuecomment-2766473986

   It is unexpected indeed! I'll fix this and add a CHANGES entry.





Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-04-04 Thread via GitHub


sgup432 commented on PR #14412:
URL: https://github.com/apache/lucene/pull/14412#issuecomment-2766561573

   @jpountz Thanks for expanding on the reasoning above w.r.t. query cache 
usage. I was working on 
[refactoring](https://github.com/apache/lucene/issues/14222) the query cache so 
that it is not a bottleneck itself when utilized well. Maybe we can add an 
additional caching policy that takes the upfront cost or query latency into 
account, so that we only cache queries that will actually give us a benefit. 
Let me know if it's still worth a shot; meanwhile we can work on optimizing the 
queries themselves. I can open up a PR soon for that.
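   As a rough illustration of such a policy (a hypothetical, stdlib-only 
sketch, not Lucene's actual `QueryCachingPolicy` API): only admit a query into 
the cache once its observed execution cost crosses a threshold, so cheap 
queries never pay the caching overhead.
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   
   class CostBasedAdmissionSketch {
     private final long minCostNanos;
     // Highest execution cost observed per query key.
     private final Map<String, Long> observedCost = new ConcurrentHashMap<>();
   
     CostBasedAdmissionSketch(long minCostNanos) {
       this.minCostNanos = minCostNanos;
     }
   
     // Record the cost of an uncached execution of the query.
     void onExecution(String queryKey, long costNanos) {
       observedCost.merge(queryKey, costNanos, Math::max);
     }
   
     // Cache only queries whose execution was expensive enough to amortize caching.
     boolean shouldCache(String queryKey) {
       return observedCost.getOrDefault(queryKey, 0L) >= minCostNanos;
     }
   }
   ```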





Re: [I] Address gradle temp file pollution insanity [lucene]

2025-04-04 Thread via GitHub


dweiss closed issue #14385: Address gradle temp file pollution insanity
URL: https://github.com/apache/lucene/issues/14385





Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-04-04 Thread via GitHub


jpountz commented on PR #14412:
URL: https://github.com/apache/lucene/pull/14412#issuecomment-2765833637

   Sorry I'm not sure I get your suggestion.





Re: [I] Incorrect use of fsync [lucene]

2025-04-04 Thread via GitHub


viliam-durina commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771963385

   >Please stop arguing here about problems that don't exist. Issue 
https://github.com/apache/lucene/issues/10906 has nothing to do with temporary 
files.
   
   This issue is not only about temporary files, but about the way Lucene does 
or does not do fsync. Issue #10906, I think, was about a lost IOException due 
to closing without fsyncing. I used it to illustrate the kind of errors we get 
from corrupted files if we don't check for IO errors correctly.
   
   >This is not possible because Lucene solely uses MMap for reading files, and 
for writing it uses OutputStreams, which can't be reused to read again.
   
   It's possible to continue using the `OutputStream` API for writing and the 
mmap API for reading, with a single underlying read-write file descriptor:
   
   ```java
   // Assumed imports for this snippet:
   //   import static java.lang.Math.toIntExact;
   //   import static java.nio.channels.FileChannel.MapMode.READ_ONLY;
   //   import static java.nio.file.StandardOpenOption.*;
   //   import java.io.*; import java.lang.foreign.*;
   //   import java.nio.channels.*; import java.nio.file.*;
   public static void main(String[] args) throws IOException {
     Path tempFolder = Files.createTempDirectory("foo");
     Path filePath = tempFolder.resolve("test_file");
     try (FileChannel ch = FileChannel.open(filePath, READ, WRITE, CREATE_NEW)) {
       // writing
       OutputStream os = Channels.newOutputStream(ch);
       os.write("hello world".getBytes());
       // When done writing, we must not close the output stream, because it
       // will close the channel. But we must flush it so that all in-flight
       // changes are sent to the underlying channel (e.g. if the output
       // stream was wrapped in some buffering output stream).
       os.flush();
   
       // reading
       try (Arena arena = Arena.ofConfined()) {
         MemorySegment segment = ch.map(READ_ONLY, 0, ch.size(), arena);
         byte[] buf = new byte[toIntExact(segment.byteSize())];
         segment.asByteBuffer().get(buf);
         System.out.println("Data read: " + new String(buf));
       }
     } // channel closed now
   
     // we can delete the file now
     Files.delete(filePath);
     Files.delete(tempFolder);
   }
   ```
   
   For example, `Directory.createTempOutput()` could open such a read-write 
channel, create an `OutputStream` from it, and use that to create an 
`IndexOutput`, so that existing code doesn't have to be modified. Then the 
`IndexOutput` would not be closed, but converted to an `IndexInput`, e.g. with 
a new method `IndexOutput.convertToInput()`, reusing the file channel. We need 
to figure out how to handle the `IOContext` with this. When the `IndexInput` 
is closed, the channel is closed. 
   
   There are probably many ways to implement this; the above is just an 
example. I have checked a few of the `createTempOutput` usages, and they might 
be pretty straightforward to update.





Re: [PR] KeywordField.newSetQuery() to reuse prefixed terms in IndexOrDocValuesQuery [lucene]

2025-04-04 Thread via GitHub


jainankitk commented on code in PR #14435:
URL: https://github.com/apache/lucene/pull/14435#discussion_r2029301580


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -175,9 +174,8 @@ public static Query newExactQuery(String field, String value) {
   public static Query newSetQuery(String field, Collection<BytesRef> values) {
 Objects.requireNonNull(field, "field must not be null");
 Objects.requireNonNull(values, "values must not be null");
-Query indexQuery = new TermInSetQuery(field, values);
-Query dvQuery = new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, 
field, values);
-return new IndexOrDocValuesQuery(indexQuery, dvQuery);
+return TermInSetQuery.newIndexOrDocValuesQuery(
+MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, field, values);

Review Comment:
   We should remove the unused method 
`TermInSetQuery#newIndexOrDocValuesQuery(String field, Collection<BytesRef> 
terms)` in that case, as anyone invoking 
`TermInSetQuery#newIndexOrDocValuesQuery` should use the expert method 
`TermInSetQuery#newIndexOrDocValuesQuery(RewriteMethod indexRewriteMethod, 
String field, Collection<BytesRef> terms)`?






Re: [I] Case insensitive regex query with character range [lucene]

2025-04-04 Thread via GitHub


rmuir commented on issue #14378:
URL: https://github.com/apache/lucene/issues/14378#issuecomment-2741911493

   we are better armed to begin looking at this one after recent changes, see 
https://github.com/apache/lucene/pull/14193#issuecomment-2638840849
   
   now that there is an actual single parsing node that correlates with single 
automaton state for a character class, we can support the caseless ranges using 
"icu algorithm" (iterating over every char in the range and adding its 
variants) without explosion of objects: but unicode space is big, so it will 
use up some CPU time during parsing. This is why I think it needs opt-in flag.
   
   Also we have the potential problem that Automaton.finishState() will 
collapse a huge `int[]` into efficient ranges, but it doesn't seem to ever free 
the memory. E.g. if I add 50,000 transitions and they get reduced to one simple 
range, I think the underlying array still has length 50,000. I will attempt to 
test and address this first if possible; let's avoid memory issues.
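   The range expansion described above can be sketched with the JDK's simple 
case mappings (a stdlib-only illustration; the real change would use Lucene's 
case-folding data, which covers more alternates, and would add automaton 
transitions rather than fill a `Set`):
   
   ```java
   import java.util.Set;
   import java.util.TreeSet;
   
   class CaselessRangeSketch {
     // Walk every code point in [min, max] and add its simple case variants.
     static Set<Integer> expandCaselessRange(int min, int max) {
       Set<Integer> codePoints = new TreeSet<>();
       for (int cp = min; cp <= max; cp++) {
         codePoints.add(cp);
         int upper = Character.toUpperCase(cp);
         int lower = Character.toLowerCase(cp);
         if (upper != cp) codePoints.add(upper);
         if (lower != cp) codePoints.add(lower);
       }
       return codePoints;
     }
   }
   ```
   
   Iterating the full Unicode space this way is what makes parsing CPU-heavy, 
hence the opt-in flag.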





Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-04 Thread via GitHub


benwtrent commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2027504697


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -130,4 +134,56 @@ public KnnVectorsReader getMergeInstance() {
* The default implementation is empty
*/
   public void finishMerge() throws IOException {}
+
+  /** A string representing the off-heap category for quantized vectors. */
+  public static final String QUANTIZED = "QUANTIZED";

Review Comment:
   >  I wonder if we should rather reflect the underlying format here. 
Something like flat_vector_float, flat_vector_byte, flat_vector_bbq?
   
   IMO, this is then transforming to just the codec format name as I would 
expect different formats to have different characteristics, even between 
versions






Re: [PR] Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter [lucene]

2025-04-04 Thread via GitHub


github-actions[bot] commented on PR #14389:
URL: https://github.com/apache/lucene/pull/14389#issuecomment-2779945827

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Integrating GPU based Vector Search using cuVS [lucene]

2025-04-04 Thread via GitHub


github-actions[bot] commented on PR #14131:
URL: https://github.com/apache/lucene/pull/14131#issuecomment-2779945929

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [I] Add more information to IOContext [lucene]

2025-04-04 Thread via GitHub


jainankitk commented on issue #14422:
URL: https://github.com/apache/lucene/issues/14422#issuecomment-2767494904

   This looks related to the discussion here - 
https://github.com/apache/lucene/issues/14348#issuecomment-2730902349. It is 
more general than vector files specifically.





[PR] Remove slices creation overhead from IndexSearcher constructor [lucene]

2025-04-04 Thread via GitHub


javanna opened a new pull request, #14428:
URL: https://github.com/apache/lucene/pull/14428

   We simplified the slices creation in IndexSearcher with #13893. That removed 
the need for a caching supplier, but had an unexpected effect: we eagerly 
create slices now for the case where no executor is provided, and that may 
cause some overhead.
   
   There are situations in which a searcher is created, although search methods 
that trigger the creation of slices are never called. We can in that case avoid 
going through that trouble.
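   One possible shape for this (a hedged sketch, not the actual patch): wrap 
the slice computation in a memoized supplier, so the work only happens the 
first time a search method actually asks for slices.
   
   ```java
   import java.util.function.Supplier;
   
   // Double-checked-locking memoization: compute runs at most once, on first use.
   final class LazySlicesSketch<T> implements Supplier<T> {
     private final Supplier<T> compute;
     private volatile T value;
   
     LazySlicesSketch(Supplier<T> compute) {
       this.compute = compute;
     }
   
     @Override
     public T get() {
       T v = value;
       if (v == null) {
         synchronized (this) {
           v = value;
           if (v == null) {
             value = v = compute.get();
           }
         }
       }
       return v;
     }
   }
   ```
   
   A searcher constructed this way pays nothing for slices until a search 
method that needs them is invoked.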

