Re: [I] Add more information to IOContext [lucene]

2025-04-01 Thread via GitHub


jpountz commented on issue #14422:
URL: https://github.com/apache/lucene/issues/14422#issuecomment-2769197888

   Thank you, I had started thinking along those lines but got blocked because I hadn't thought about using multiple "dimensions" for the context, i.e. metadata/index/data is one dimension and terms/postings/vectors/etc. is another. This sounds like a good step in the right direction.
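
   For illustration, a minimal sketch of what two independent context dimensions could look like; all names below are hypothetical, not the actual Lucene IOContext API:

   ```java
   // Hypothetical sketch only: models the two "dimensions" discussed above.
   // The record and enum names are invented for illustration.
   record ReadContext(DataKind kind, FileComponent component) {

     // First dimension: what role the bytes play in the index.
     enum DataKind { METADATA, INDEX, DATA }

     // Second dimension: which part of the format is being read.
     enum FileComponent { TERMS, POSTINGS, VECTORS, DOC_VALUES, STORED_FIELDS, POINTS }
   }
   ```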


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] PointInSetQuery early exit on non-matching segments [lucene]

2025-04-01 Thread via GitHub


gsmiller merged PR #14268:
URL: https://github.com/apache/lucene/pull/14268





Re: [PR] PointInSetQuery early exit on non-matching segments [lucene]

2025-04-01 Thread via GitHub


gsmiller commented on PR #14268:
URL: https://github.com/apache/lucene/pull/14268#issuecomment-2769642302

   This looks great! Taking care of the merge now. Thank you @hanbj!








Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-01 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022995174


##
lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:
##
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
 return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+

Review Comment:
   I don't deny that this is not ideal, but it's not a new model; it already exists - `Accountable`. One should always consider whether or not children should be counted, and if so how. Lemme see if/how a named accountable could help here.
   






Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-01 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2023074747


##
lucene/core/src/java/org/apache/lucene/util/OffHeapAccountable.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util;
+
+import java.util.Collection;
+import java.util.Collections;
+
+/**
+ * An object whose off-heap memory requirements can be computed.
+ *
+ * @lucene.internal
+ */
+public interface OffHeapAccountable {

Review Comment:
   I reworded things a bit and added a bit of javadoc to make things clearer.






Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]

2025-04-01 Thread via GitHub


hanbj commented on code in PR #14267:
URL: https://github.com/apache/lucene/pull/14267#discussion_r2022373498


##
lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
##
@@ -129,6 +141,16 @@ public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, fl
 
   private boolean matches(byte[] packedValue) {
     int offset = 0;
+
+    if (equalValues) {
+      for (int dim = 0; dim < numDims; dim++, offset += bytesPerDim) {
+        if (comparator.compare(packedValue, offset, lowerPoint, offset) != 0) {
+          return false;
+        }
+      }
+      return true;
+    }

Review Comment:
   This implementation will result in a lot of code duplication. I will 
implement it first. Thank you for your hard work in the review.






Re: [I] Reuse packedTerms between two TermInSetQuery what combined with IndexOrDocValuesQuery [lucene]

2025-04-01 Thread via GitHub


jpountz commented on issue #14425:
URL: https://github.com/apache/lucene/issues/14425#issuecomment-2769408354

   The builder approach should work. Or maybe a static helper like `public static Query newIndexOrDocValuesSetQuery(RewriteMethod indexRewriteMethod, String field, Collection<BytesRef> terms)` that could take advantage of the private constructor, since this is only a problem for creating an `IndexOrDocValuesQuery`.
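
   A minimal sketch of that helper, assuming a (hypothetical) package-private constructor and packing helper on `TermInSetQuery` that accept a pre-built `PrefixCodedTerms`; the actual constructor shape may differ:

   ```java
   // Sketch only: packTerms(...) and the PrefixCodedTerms-accepting constructor
   // are assumptions for illustration, not the existing TermInSetQuery API.
   public static Query newIndexOrDocValuesSetQuery(
       MultiTermQuery.RewriteMethod indexRewriteMethod, String field, Collection<BytesRef> terms) {
     PrefixCodedTerms packed = TermInSetQuery.packTerms(field, terms); // hypothetical helper
     return new IndexOrDocValuesQuery(
         new TermInSetQuery(indexRewriteMethod, field, packed), // hypothetical private ctor
         new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, field, packed));
   }
   ```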





Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-01 Thread via GitHub


benwtrent commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022890436


##
lucene/core/src/java/org/apache/lucene/util/OffHeapAccountable.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util;
+
+import java.util.Collection;
+import java.util.Collections;
+
+/**
+ * An object whose off-heap memory requirements can be computed.
+ *
+ * @lucene.internal
+ */
+public interface OffHeapAccountable {

Review Comment:
   I wonder if we want to call this "OffHeapAccountable".
   
   I am not sure: "OnHeap" has a "hard limit" on heap usage to prevent OOMs.
   
   This "OffHeap" is a "soft limit" that relates to performance, not to keeping the system running.
   
   So, it's not a "requirements" type of thing.
   
   Maybe I am just stuck on the phrase "resource requirements". I also see this being useful for other field types besides HNSW knn, where holding everything in off-heap memory is critical.



##
lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:
##
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
 return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+

Review Comment:
   Here is where we need to be careful with the API. Technically, binary quantized vectors keep the raw vectors around (same with other quantization techniques).
   
   Technically, those add to the overall off-heap size, though they may not need to be loaded into off-heap memory for search to be fast.
   
   For example, if somebody is using `byte` type vectors, they aren't actually quantized, but they are searchable. Meaning, they would all need to be in off-heap memory to be fast.



##
lucene/core/src/java/org/apache/lucene/util/OffHeapAccountable.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util;
+
+import java.util.Collection;
+import java.util.Collections;
+
+/**
+ * An object whose off-heap memory requirements can be computed.
+ *
+ * @lucene.internal

Review Comment:
   Is this really an "internal" API? Maybe I misunderstand its use, but it 
seems designed for folks to make decisions or provide feedback directly using 
this API.
   
   So, maybe we call it "experimental" if unsure of its usage/stability right 
now?






Re: [PR] Adding profiling support for concurrent segment search [lucene]

2025-04-01 Thread via GitHub


jpountz commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2769456423

   I'd have a top-level tree for everything related to initializing the search 
and combining results (rewrite(), createWeight(), CollectorManager#reduce) and 
then a list of trees for each slice. Related, it'd be nice if each per-slice 
object could also tell us about the thread that it ran in and its start time so 
that we could understand how exactly Lucene managed to parallelize the search.
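
   As a rough illustration of the shape described above (all names invented, not an actual Lucene API):

   ```java
   import java.util.List;

   // Hypothetical sketch of a profiling result for a concurrent search.
   record ProfileNode(String description, long totalNanos, List<ProfileNode> children) {}

   // One tree per slice, annotated with the executing thread and its start time.
   record SliceProfile(String threadName, long startTimeNanos, ProfileNode tree) {}

   // A top-level tree for rewrite(), createWeight() and CollectorManager#reduce,
   // plus a list of per-slice trees.
   record SearchProfile(ProfileNode queryLevel, List<SliceProfile> slices) {}
   ```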





Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-01 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022920202


##
lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:
##
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
 return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+

Review Comment:
   Right, that is what `getChildOffHeapResources` tries to address - by allowing the graph, quantized, and raw vectors to be reported separately. But I do accept that this may not be all that obvious to the user. Actually, I need to fix a missing `getChildOffHeapResources` from Lucene102... for the raw vectors.
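
   A hedged sketch of how a caller might consume such a breakdown, assuming child resources expose a name and size roughly as discussed in this PR (the accessor names here are illustrative):

   ```java
   // Sketch only: name()/offHeapBytes() on children are assumed accessors,
   // not necessarily the final API shape of this PR.
   static void printOffHeapBreakdown(KnnVectorsReader reader) {
     System.out.println("total: " + reader.offHeapBytes() + " bytes off-heap");
     for (var child : reader.getChildOffHeapResources()) {
       // e.g. "graph", "quantized vectors", "raw vectors"
       System.out.println(child.name() + ": " + child.offHeapBytes() + " bytes");
     }
   }
   ```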






Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-01 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022923407


##
lucene/core/src/java/org/apache/lucene/util/OffHeapAccountable.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util;
+
+import java.util.Collection;
+import java.util.Collections;
+
+/**
+ * An object whose off-heap memory requirements can be computed.
+ *
+ * @lucene.internal
+ */
+public interface OffHeapAccountable {

Review Comment:
   Yeah, "requirements" is not the best word here. It's just a "size".






Re: [I] Reuse packedTerms between two TermInSetQuery what combined with IndexOrDocValuesQuery [lucene]

2025-04-01 Thread via GitHub


jpountz commented on issue #14425:
URL: https://github.com/apache/lucene/issues/14425#issuecomment-2769216017

   Indeed, it would be nice if `KeywordField#newSetQuery` didn't pay the CPU and heap price of creating the `PrefixCodedTerms` instance twice. At the same time, let's keep `PrefixCodedTerms` an implementation detail that users don't have to deal with?





Re: [I] Reuse packedTerms between two TermInSetQuery what combined with IndexOrDocValuesQuery [lucene]

2025-04-01 Thread via GitHub


mkhludnev commented on issue #14425:
URL: https://github.com/apache/lucene/issues/14425#issuecomment-2769246604

   Presumably, one TermInSetQuery could create another with a different rewrite method specified. WDYT?
   Or a TermInSetQueryBuilder could create query after query with different rewrites?
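
   For the first idea, a minimal sketch (the copy method shown is hypothetical, not the existing `TermInSetQuery` API):

   ```java
   // Sketch only: withRewriteMethod(...) is an invented copy method that would
   // share the already-packed terms between the two query instances.
   TermInSetQuery indexQuery =
       new TermInSetQuery(MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, name(), iBytesRefs);
   Query query =
       new IndexOrDocValuesQuery(
           indexQuery, indexQuery.withRewriteMethod(MultiTermQuery.DOC_VALUES_REWRITE));
   ```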





Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-04-01 Thread via GitHub


tteofili commented on code in PR #14094:
URL: https://github.com/apache/lucene/pull/14094#discussion_r2022551917


##
lucene/core/src/java/org/apache/lucene/search/HnswKnnCollector.java:
##
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+/**
+ * {@link KnnCollector} that exposes methods to hook into specific parts of 
the HNSW algorithm.
+ *
+ * @lucene.experimental
+ */
+public abstract class HnswKnnCollector extends KnnCollector.Decorator {

Review Comment:
   I've spent some time trying to refactor this and extract a wider `nextVectorsBlock` API, but it sounds like conflating too much into this PR, so I'd opt to "only" get rid of the `HnswKnnCollector` class and rely on the strategy.









Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-04-01 Thread via GitHub


tteofili commented on code in PR #14094:
URL: https://github.com/apache/lucene/pull/14094#discussion_r2022636246


##
lucene/core/src/java/org/apache/lucene/search/HnswKnnCollector.java:
##
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+/**
+ * {@link KnnCollector} that exposes methods to hook into specific parts of 
the HNSW algorithm.
+ *
+ * @lucene.experimental
+ */
+public abstract class HnswKnnCollector extends KnnCollector.Decorator {

Review Comment:
   as a first step I've dropped `HnswKnnCollector` in favor of adding the `nextVectorsBlock` API to `KnnCollector.Decorator`.












Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-01 Thread via GitHub


ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022693959


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##
@@ -306,6 +306,16 @@ private HnswGraph getGraphValues(FieldEntry entry) throws 
IOException {
 return new OffHeapHnswGraph(entry, bytesSlice);
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      var f = field.value;
+      bytes += f.vectorDataLength() + f.indexDataLength();
+    }
+    return bytes;

Review Comment:
   That's a good point, and one that I did think of too, but I wasn't able to come up with something simple, as all these fields are their own individual inner classes, and any refactoring I experimented with just seemed to add complexity. :-(






Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-01 Thread via GitHub


tteofili commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022684543


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##
@@ -306,6 +306,16 @@ private HnswGraph getGraphValues(FieldEntry entry) throws 
IOException {
 return new OffHeapHnswGraph(entry, bytesSlice);
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      var f = field.value;
+      bytes += f.vectorDataLength() + f.indexDataLength();
+    }
+    return bytes;

Review Comment:
   What about a `util` static method that can be applied here and for the other `*VectorsReader`s, without having to re-implement the same code block multiple times?
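
   Since each reader defines its own field-entry inner class, one option is a generic summing helper that takes a per-entry size function; a minimal sketch with illustrative names:

   ```java
   import java.util.Collection;
   import java.util.function.ToLongFunction;

   // Sketch only: names are illustrative, not an actual Lucene utility.
   final class OffHeapUtil {
     private OffHeapUtil() {}

     static <T> long sumOffHeapBytes(Collection<T> entries, ToLongFunction<T> sizeOf) {
       long bytes = 0L;
       for (T entry : entries) {
         bytes += sizeOf.applyAsLong(entry);
       }
       return bytes;
     }
   }

   // Usage in a reader would then be a one-liner, e.g.:
   // return OffHeapUtil.sumOffHeapBytes(fields.values(), f -> f.vectorDataLength() + f.indexDataLength());
   ```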






Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-01 Thread via GitHub


benwtrent commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2023094888


##
lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:
##
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
 return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+

Review Comment:
   @ChrisHegarty yeah, I am not sure :(. I think this is a nice first step, but the goal here is to signal the off-heap size so that the user can determine if they have "enough" off-heap for speed.
   
   But then they need to know how to actually calculate what "enough" is for their typical search case for that format.
   
   Not ideal, but maybe it's enough for a step 1.






Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-01 Thread via GitHub


gf2121 commented on PR #14333:
URL: https://github.com/apache/lucene/pull/14333#issuecomment-2770201445

   > We should add this format to RandomCodec then, so that it gets included as 
part of codec randomization.
   
   OK, did not see this. I know how to do it then. Thanks Adrien :)





Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-01 Thread via GitHub


jpountz commented on PR #14333:
URL: https://github.com/apache/lucene/pull/14333#issuecomment-2770123261

   > Once we think this is ready, we should prolly merge at first as the 
non-default Codec
   
   We should add this format to `RandomCodec` then, so that it gets included as 
part of codec randomization. Otherwise it will only be exercised in 
`TestXXXPostingsFormat`.
   
   We just cut branch_10_2, so this should give us some time to let it bake 
while still having it released in 10.3.





Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]

2025-04-01 Thread via GitHub


jainankitk commented on code in PR #14267:
URL: https://github.com/apache/lucene/pull/14267#discussion_r2023189174


##
lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
##
@@ -120,381 +132,447 @@ public void visit(QueryVisitor visitor) {
   public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
       throws IOException {
 
+    if (this.equalValues) { // lowerPoint==upperPoint
+      return new SinglePointConstantScoreWeight(this, scoreMode, boost);
+    }
     // We don't use RandomAccessWeight here: it's no good to approximate with "match all docs".
     // This is an inverted structure and should be used in the first pass:
+    return new MultiPointsConstantScoreWeight(this, scoreMode, boost);
+  }
 
-    return new ConstantScoreWeight(this, boost) {
+  /**
+   * Essentially, it is to reduce the number of comparisons. This is an optimization, used for the
+   * case of lowerPoint==upperPoint.
+   */
+  protected class SinglePointConstantScoreWeight extends MultiPointsConstantScoreWeight {

Review Comment:
   I am assuming we are reusing some of the methods from `MultiPointsConstantScoreWeight`; that's why we are extending from that class. May I suggest creating a class, say `PointRangeQueryWeight`, that extends `ConstantScoreWeight`, and having both `SinglePointRangeQueryWeight` and `MultiPointRangeQueryWeight` extend from `PointRangeQueryWeight`?
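
   A hedged sketch of that hierarchy; the class names follow the suggestion above, while the bodies are simplified placeholders (the real classes would extend Lucene's `ConstantScoreWeight` and implement its abstract methods, omitted here):

   ```java
   // Sketch only; the ConstantScoreWeight plumbing is omitted for brevity.
   abstract class PointRangeQueryWeight /* extends ConstantScoreWeight */ {
     // Shared traversal/scoring logic lives here; subclasses supply matching.
     abstract boolean matches(byte[] packedValue);
   }

   final class SinglePointRangeQueryWeight extends PointRangeQueryWeight {
     private final byte[] point; // lowerPoint == upperPoint
     private final int numDims, bytesPerDim;

     SinglePointRangeQueryWeight(byte[] point, int numDims, int bytesPerDim) {
       this.point = point;
       this.numDims = numDims;
       this.bytesPerDim = bytesPerDim;
     }

     @Override
     boolean matches(byte[] packedValue) {
       // One comparison per dimension instead of separate lower/upper checks.
       for (int dim = 0, offset = 0; dim < numDims; dim++, offset += bytesPerDim) {
         if (!java.util.Arrays.equals(
             packedValue, offset, offset + bytesPerDim, point, offset, offset + bytesPerDim)) {
           return false;
         }
       }
       return true;
     }
   }
   ```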






Re: [I] Incorrect use of fsync [lucene]

2025-04-01 Thread via GitHub


viliam-durina commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2768802025

   I think we must also fsync the temporary files. Without fsyncing, when we read them back, they might be incomplete and no error might be reported. We could perhaps avoid fsyncing these if we open them for reading and writing, and do the reading after writing without closing the descriptor in the meantime.
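
   A small sketch of that alternative in plain NIO terms: writing and reading back through the same open channel, so the process sees its own completed writes without an fsync (illustrative only; not how Lucene's Directory API is structured):

   ```java
   import java.io.IOException;
   import java.nio.ByteBuffer;
   import java.nio.channels.FileChannel;
   import java.nio.file.Path;
   import static java.nio.file.StandardOpenOption.*;

   final class TempFileRoundTrip {
     static void writeThenReadBack(Path tmp) throws IOException {
       try (FileChannel ch = FileChannel.open(tmp, CREATE_NEW, READ, WRITE)) {
         ch.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));
         ByteBuffer buf = ByteBuffer.allocate(3);
         ch.read(buf, 0); // read back via the same descriptor; no close/reopen, no fsync
       }
     }
   }
   ```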





Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]

2025-04-01 Thread via GitHub


benwtrent commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022982193


##
lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:
##
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
 return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+

Review Comment:
   So, the user needs to know when to sum them and when not to sum them, without any information as to what the child resources actually provide?
   
   I guess they can do an `instanceof` and determine what they want to do per codec name :(
   
   Seems brittle. I am not sure of a better way.






Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-01 Thread via GitHub


gf2121 commented on PR #14333:
URL: https://github.com/apache/lucene/pull/14333#issuecomment-2770173066

   Thank you very much for all these careful, warm and helpful comments!
   
   > Are there any major items / blockers?
   
   I think I've addressed all of them (hopefully didn't miss any).
   
   > Do all Lucene tests pass if you force tests to use this in their Codec?
   
   The `Lucene90BlockTreeTermsWriter` is now used by the default codec, so tests should catch it easily. All Lucene core tests passed.
   
   > Maybe beast things a bit (run nightly, increase TEST_ITERS, etc.) :)
   
   I've finished a round of `./gradlew :lucene:core:test -Ptests.nightly=true 
-Ptests.monster=true` and all tests passed. I'll run more.
   
   > Once we think this is ready, we should prolly merge at first as the 
non-default Codec, and let CI chew on things / let it bake for a while in main? 
If all goes well, then maybe at some point we backport and make it default for 
a 10.x?
   
   I understand the idea of merging it into main but not backporting immediately. But I do not know how to make CI catch this change and test it heavily without making it the default codec in main. Or do you mean not backporting to 10.x, but making it the default codec in main? (Sorry for my poor English!)





Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]

2025-04-01 Thread via GitHub


tteofili commented on PR #14094:
URL: https://github.com/apache/lucene/pull/14094#issuecomment-2769837089

   @benwtrent I've reworked the design, exposing `KnnSearchStrategy#nextVectorsBlock`; `PatienceKnnVectorQuery` leverages a `Patience` strategy that calls `HnswQueueSaturationCollector#nextCandidate` on `nextVectorsBlock`.
   Hopefully this is a bit cleaner now.





Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]

2025-04-01 Thread via GitHub


hanbj commented on code in PR #14267:
URL: https://github.com/apache/lucene/pull/14267#discussion_r2024183379


##
lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
##
@@ -120,381 +132,447 @@ public void visit(QueryVisitor visitor) {
   public final Weight createWeight(IndexSearcher searcher, ScoreMode 
scoreMode, float boost)
   throws IOException {
 
+if (this.equalValues) { // lowerPoint==upperPoint
+  return new SinglePointConstantScoreWeight(this, scoreMode, boost);
+}
 // We don't use RandomAccessWeight here: it's no good to approximate with 
"match all docs".
 // This is an inverted structure and should be used in the first pass:
+return new MultiPointsConstantScoreWeight(this, scoreMode, boost);
+  }
 
-return new ConstantScoreWeight(this, boost) {
+  /**
+   * Essentially, it is to reduce the number of comparisons. This is an 
optimization, used for the
+   * case of lowerPoint==upperPoint.
+   */
+  protected class SinglePointConstantScoreWeight extends 
MultiPointsConstantScoreWeight {

Review Comment:
   This suggestion is great. `SinglePointRangeQueryWeight` and `MultiPointRangeQueryWeight` only need to implement their own point value matching logic and relationship judgment.






Re: [I] Incorrect use of fsync [lucene]

2025-04-01 Thread via GitHub


dweiss commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771397718

   But why would you want to read a temporary file after a crash? These are... 
temporary - if a process crashed, there is no recovery at all (at least 
concerning temporary files).





Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]

2025-04-01 Thread via GitHub


jainankitk commented on code in PR #14397:
URL: https://github.com/apache/lucene/pull/14397#discussion_r2021750116


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java:
##
@@ -98,12 +98,17 @@ public void decompress(DataInput in, int originalLength, int offset, int length,
       final int blockLength = in.readVInt();
 
       final int numBlocks = readCompressedLengths(in, originalLength, dictLength, blockLength);
-
-      buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength);
       bytes.length = 0;
-      // Read the dictionary
-      if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
-        throw new CorruptIndexException("Illegal dict length", in);
+      if (reused) {
+        assert buffer.length >= dictLength + blockLength;
+        in.skipBytes(compressedLengths[0]);
+      } else {
+        // Read the dictionary
+        buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength);
+        if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
+          throw new CorruptIndexException("Illegal dict length", in);
+        }
+        reused = true;

Review Comment:
   I am wondering if we should consider exposing a metric on how many times we could reuse the dictionary, and how many times we had to read it from disk? That would provide some useful insight into the usefulness of this change.
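
   A minimal sketch of such a metric, with invented names (two counters plus a derived reuse ratio):

   ```java
   import java.util.concurrent.atomic.LongAdder;

   // Sketch only: counts dictionary reuses vs. fresh reads from disk.
   final class PresetDictReuseMetrics {
     private final LongAdder reused = new LongAdder();
     private final LongAdder loaded = new LongAdder();

     void onReuse() { reused.increment(); }
     void onLoad() { loaded.increment(); }

     double reuseRatio() {
       long r = reused.sum(), l = loaded.sum();
       return (r + l) == 0 ? 0d : (double) r / (r + l);
     }
   }
   ```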






[I] fix TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce [lucene]

2025-04-01 Thread via GitHub


guojialiang92 opened a new issue, #14423:
URL: https://github.com/apache/lucene/issues/14423

   ### Description
   
   I found that the test `TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce` may fail in rare cases. The exception information is as follows:
   `java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 108 open files`
   
   I looked up some other related issues and finally located the problem. I also added a test that reliably reproduces the problem.
   
   ### Analysis
   The root cause is that `IndexWriter#ensureOpen()` throws `AlreadyClosedException` when `IndexWriter#closed` or `IndexWriter#closing` is `true`. In the test code, after `IndexWriter#commit` throws `AlreadyClosedException`, `IndexWriter#close` is not called again to ensure that the `IndexWriter` closes normally.
   ```
   try {
     writer.commit();
     writer.close();
     success = true;
   } catch (
       @SuppressWarnings("unused")
       AlreadyClosedException ace) {
     // OK: abort closes the writer
     assertTrue(writer.isDeleterClosed());
   } catch (
       @SuppressWarnings("unused")
       IOException ioe) {
     writer.rollback();
     failure.clearDoFail();
   }
   ```
   
   ### To Reproduce
   In order to reproduce the problem reliably, I added a test `TestIndexWriterWithThreads#testIOExceptionWithMergeNotEndLongTime`.
   
   The code will be executed in the following order:
   
   1. Set `LiveIndexWriterConfig#readerPooling` to `true` to ensure that 
`ReaderPool#release` does not release `ReadersAndUpdates`
   2. Start an `IndexerThread` and start writing data
   3. Wait for a merge thread to start working (using `mergeCountDownLatch`), 
then simulate write failure via `MockDirectoryWrapper#failOn`
   4. Block the execution of the merge thread until 
`IndexWriter#updateDocument` throws an exception due to write failure (using 
`updateDocumentFailCountDownLatch`).
   5. The merge thread and the IW thread will both call `IndexWriter#maybeCloseOnTragicEvent`; I arrange for the merge thread to execute `IndexWriter#rollbackInternalNoCommit` first, so the IW thread will skip it (using `mergeCloseCountDownLatch`).
   6. Because the execution of the merge thread is asynchronous, the test will continue until `IndexWriter#commit` is called and an `AlreadyClosedException` is thrown.
   7. In order to prevent `IndexInput` from being closed before calling 
`MockDirectoryWrapper#close`, I let the `ConcurrentMergeScheduler#close` sleep 
for 5s.
   
   ### Solution
   The closing of the `IndexInput`s needs to wait for the execution of `ReaderPool#close` in `IndexWriter#rollbackInternalNoCommit`. After `IndexWriter#commit` throws `AlreadyClosedException`, it is necessary to call `IndexWriter#close` in a finally block.
   
   ```
   try {
     writer.commit();
     writer.close();
     success = true;
   } catch (
       @SuppressWarnings("unused")
       AlreadyClosedException ace) {
     // OK: abort closes the writer
     assertTrue(writer.isDeleterClosed());
   } catch (
       @SuppressWarnings("unused")
       IOException ioe) {
     writer.rollback();
     failure.clearDoFail();
   } finally {
     writer.close();
   }
   ```
   
   ### Related issues
   [10930](https://github.com/apache/lucene/issues/10930), 
[13552](https://github.com/apache/lucene/issues/13552)
   
   ### Version and environment details
   
   _No response_





Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]

2025-04-01 Thread via GitHub


vsop-479 commented on PR #14371:
URL: https://github.com/apache/lucene/pull/14371#issuecomment-2768597044

   Adjusted `ENABLE_ADVANCE_WITHIN_BLOCK_VECTOR_OPTO` to 16 (at least 16 lanes, such as AVX or AVX-512).
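
   One plausible reading of that gate, sketched with the Panama Vector API; the exact constant and wiring in the PR may differ:

   ```java
   import jdk.incubator.vector.IntVector;

   // Sketch only: enable the vectorized advance path when the preferred
   // species offers at least 16 lanes (e.g. AVX-512 for ints).
   final class AdvanceWithinBlockOpto {
     static final int ENABLE_ADVANCE_WITHIN_BLOCK_VECTOR_OPTO = 16;

     static boolean useVectorizedAdvance() {
       return IntVector.SPECIES_PREFERRED.length() >= ENABLE_ADVANCE_WITHIN_BLOCK_VECTOR_OPTO;
     }
   }
   ```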





Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-04-01 Thread via GitHub


mikemccand commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2022734745


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix points to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block consists entirely of
+   *     pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing
+   *     a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+    BUILDING,
+    SAVED,
+    DESTROYED
+  }
+
+  private static class Node {
+
+    // The utf8 digit that leads to this Node, 0 for root node
+    private final int label;
+    // The children listed in order by their utf8 label
+    private final LinkedList<Node> children;
+    // The output of this node.
+    private Output output;
+
+    // Vars used during saving:
+
+    // The file pointer point to where the node saved. -1 means the node has not been saved.
+    private long fp = -1;
+    // The iterator whose next() point to the first child has not been saved.
+    private Iterator<Node> childrenIterator;
+
+    Node(int label, Output output, LinkedList<Node> children) {
+      this.label = label;
+      this.output = output;
+      this.children = children;
+    }
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+    return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+    if (k.length == 0) {
+      root.output = v;
+      return;
+    }
+    Node parent = root;
+    for (int i = 0; i < k.length; i++) {
+      int b = k.bytes[i + k.offset] & 0xFF;
+      Output output = i == k.length - 1 ? v : null;
+      Node node = new Node(b, output, new LinkedList<>());
+      parent.children.add(node);
+      parent = node;
+    }
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not
+   * have a key that already exists in this one, otherwise an {@link IllegalArgumentException} will
+   * be thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+    if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+      throw new IllegalStateException("tries should be unsaved");
+    }
+    // Use a simple stack to avoid recursion.
+    Deque<Runnable> stack = new ArrayDeque<>();
+    stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+    while (!stack.isEmpty()) {
+      stack.pop().run();
+    }
+    trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque<Runnable> stack) {
+    assert n.label == add.label;
+    if (add.output != null) {
+      if (n.output != null) {
+




[I] Reuse packedTerms between two TermInSetQuery what combined with IndexOrDocValuesQuery [lucene]

2025-04-01 Thread via GitHub


mkhludnev opened a new issue, #14425:
URL: https://github.com/apache/lucene/issues/14425

   ### Description
   
   In cases like these:
   ```
   new IndexOrDocValuesQuery(
       new TermInSetQuery(MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, name(), iBytesRefs),
       new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, name(), dVByteRefs));
   ```
   I would like to reuse packedTerms between `TermInSetQuery` instances. What's the best approach to do it? Or is it not worthwhile?





Re: [PR] SortedSet DV Multi Range query [lucene]

2025-04-01 Thread via GitHub


mkhludnev commented on code in PR #13974:
URL: https://github.com/apache/lucene/pull/13974#discussion_r2023179556


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java:
##
@@ -0,0 +1,300 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Objects;
+import org.apache.lucene.document.SortedSetDocValuesField;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.ConstantScoreScorer;
+import org.apache.lucene.search.ConstantScoreWeight;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.DocValuesRangeIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Scorer;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TwoPhaseIterator;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongBitSet;
+
+/** A union of multiple ranges over SortedSetDocValuesField */
+public class SortedSetMultiRangeQuery extends Query {
+  private final String field;
+  private final int bytesPerDim;
+  private final ArrayUtil.ByteArrayComparator comparator;
+  List<MultiRangeQuery.RangeClause> rangeClauses;
+
+  SortedSetMultiRangeQuery(
+      String name,
+      List<MultiRangeQuery.RangeClause> clauses,
+      int bytes,
+      ArrayUtil.ByteArrayComparator comparator) {
+    this.field = name;
+    this.rangeClauses = clauses;
+    this.bytesPerDim = bytes;
+    this.comparator = comparator;
+  }
+
+  /** Builder for creating a SortedSetMultiRangeQuery. */
+  public static class Builder {
+    private final String name;
+    protected final List<MultiRangeQuery.RangeClause> clauses = new ArrayList<>();
+    private final int bytes;
+    private final ArrayUtil.ByteArrayComparator comparator;
+
+    public Builder(String name, int bytes) {
+      this.name = Objects.requireNonNull(name);
+      this.bytes = bytes; // TODO assert positive
+      this.comparator = ArrayUtil.getUnsignedComparator(bytes);
+    }
+
+    public Builder add(BytesRef lowerValue, BytesRef upperValue) {
+      byte[] low = lowerValue.clone().bytes;
+      byte[] up = upperValue.clone().bytes;
+      if (this.comparator.compare(low, 0, up, 0) > 0) {
+        throw new IllegalArgumentException("lowerValue must be <= upperValue");
+      } else {
+        clauses.add(new MultiRangeQuery.RangeClause(low, up));
+      }
+      return this;
+    }
+
+    public Query build() {
+      if (clauses.isEmpty()) {
+        return new BooleanQuery.Builder().build();
+      }
+      if (clauses.size() == 1) {
+        return SortedSetDocValuesField.newSlowRangeQuery(
+            name,
+            new BytesRef(clauses.getFirst().lowerValue),
+            new BytesRef(clauses.getFirst().upperValue),
+            true,
+            true);
+      }
+      return new SortedSetMultiRangeQuery(name, clauses, this.bytes, comparator);
+    }
+  }
+
+  @Override
+  public Query rewrite(IndexSearcher indexSearcher) throws IOException {
+    ArrayList<MultiRangeQuery.RangeClause> sortedClauses = new ArrayList<>(this.rangeClauses);
+    sortedClauses.sort(
+        (o1, o2) -> {
+          // if (result == 0) {
+          //   return comparator.compare(o1.upperValue, 0, o2.upperValue, 0);
+          // } else {
+          return comparator.compare(o1.lowerValue, 0, o2.lowerValue, 0);
+          // }
+        });
+    if (!this.rangeClauses.equals(sortedClauses)) {
+      return new SortedSetMultiRangeQuery(
+          this.field, sortedClauses, this.bytesPerDim, this.comparator);
+    } else {
+      return this;
+    }
+  }
+
+  @Override
+  public String toString(String fld) {
+    return "SortedSetMultiRangeQuery{"
+        + "field='"
+        + fld
+