Re: [I] Add more information to IOContext [lucene]
jpountz commented on issue #14422:
URL: https://github.com/apache/lucene/issues/14422#issuecomment-2769197888

Thank you, I had started thinking along those lines but got blocked because I hadn't thought about using multiple "dimensions" for the context, i.e. metadata/index/data is one dimension, terms/postings/vectors/etc. is another one. This sounds like a good step in the right direction.
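For illustration only, a minimal sketch of the two-dimensional idea being discussed; none of these names exist in Lucene today, they are purely hypothetical:

```java
// Hypothetical sketch: two orthogonal dimensions describing what an opened
// file (or slice) holds, so a Directory could tune caching or readahead per
// combination instead of per single flag.
enum FileRole { METADATA, INDEX, DATA }

enum DataKind { TERMS, POSTINGS, VECTORS, DOC_VALUES, POINTS, NORMS, STORED_FIELDS }

record FileAccessHint(FileRole role, DataKind kind) {}
```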
Re: [PR] PointInSetQuery early exit on non-matching segments [lucene]
gsmiller merged PR #14268:
URL: https://github.com/apache/lucene/pull/14268
Re: [PR] PointInSetQuery early exit on non-matching segments [lucene]
gsmiller commented on PR #14268:
URL: https://github.com/apache/lucene/pull/14268#issuecomment-2769642302

This looks great! Taking care of the merge now. Thank you @hanbj !
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022995174

## lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:

```
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
     return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+
```

Review Comment: I don't deny that this is not ideal, but it's not a new model, it already exists - `Accountable`. One should always consider whether or not children should be counted, and if so how.
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022995174

## lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:

```
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
     return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+
```

Review Comment: I don't deny that this is not ideal, but it's not a new model, it already exists - `Accountable`. One should always consider whether or not children should be counted, and if so how. Lemme see if/how named accountable could help here.
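For context, the existing `Accountable` model referenced here already supports labeled child resources via Lucene's `Accountables` helper; a minimal illustration of that pattern (the "off-heap vector data" label is made up for this example):

```java
import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.Accountables;

class NamedAccountableExample {
  // Wrap a raw byte count with a description; getChildResources()
  // implementations typically return such named entries so callers can break
  // a total down per component.
  static Accountable named(long offHeapVectorBytes) {
    return Accountables.namedAccountable("off-heap vector data", offHeapVectorBytes);
  }
}
```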
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2023074747

## lucene/core/src/java/org/apache/lucene/util/OffHeapAccountable.java:

```
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. (standard ASF license header, elided)
+ */
+
+package org.apache.lucene.util;
+
+import java.util.Collection;
+import java.util.Collections;
+
+/**
+ * An object whose off-heap memory requirements can be computed.
+ *
+ * @lucene.internal
+ */
+public interface OffHeapAccountable {
```

Review Comment: I reworded things a bit and added a bit of javadoc to make things clearer.
Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]
hanbj commented on code in PR #14267:
URL: https://github.com/apache/lucene/pull/14267#discussion_r2022373498

## lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:

```
@@ -129,6 +141,16 @@ public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, fl
 
     private boolean matches(byte[] packedValue) {
       int offset = 0;
+
+      if (equalValues) {
+        for (int dim = 0; dim < numDims; dim++, offset += bytesPerDim) {
+          if (comparator.compare(packedValue, offset, lowerPoint, offset) != 0) {
+            return false;
+          }
+        }
+        return true;
+      }
```

Review Comment: This implementation will result in a lot of code duplication. I will implement it first. Thank you for your hard work in the review.
Re: [I] Reuse packedTerms between two TermInSetQuery what combined with IndexOrDocValuesQuery [lucene]
jpountz commented on issue #14425:
URL: https://github.com/apache/lucene/issues/14425#issuecomment-2769408354

The builder approach should work. Or maybe a static helper like `public static Query newIndexOrDocValuesSetQuery(RewriteMethod indexRewriteMethod, String field, Collection<BytesRef> terms)` that could take advantage of the private constructor, since this is only a problem for creating an `IndexOrDocValuesQuery`.
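A hedged sketch of what that helper could look like. It assumes a private/hypothetical `TermInSetQuery` constructor accepting an already-built `PrefixCodedTerms`, and a hypothetical `packTerms` accessor, so the packed terms are created once and shared by both branches; neither exists as public API today:

```java
import java.util.Collection;
import org.apache.lucene.index.PrefixCodedTerms;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

class SetQueryHelper {
  static Query newIndexOrDocValuesSetQuery(
      MultiTermQuery.RewriteMethod indexRewriteMethod,
      String field,
      Collection<BytesRef> terms) {
    // hypothetical: pack once, reuse for both clauses
    PrefixCodedTerms packed = TermInSetQuery.packTerms(field, terms);
    return new IndexOrDocValuesQuery(
        new TermInSetQuery(indexRewriteMethod, field, packed), // hypothetical ctor
        new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, field, packed));
  }
}
```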
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
benwtrent commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022890436

## lucene/core/src/java/org/apache/lucene/util/OffHeapAccountable.java:

```
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. (standard ASF license header, elided)
+ */
+
+package org.apache.lucene.util;
+
+import java.util.Collection;
+import java.util.Collections;
+
+/**
+ * An object whose off-heap memory requirements can be computed.
+ *
+ * @lucene.internal
+ */
+public interface OffHeapAccountable {
```

Review Comment: I wonder if we want to call this "OffHeapAccountable". I am not sure, "OnHeap" has a "hard limit" on heap usage to prevent OOMs. This "OffHeap" is a "soft limit" that relates to performance, not the system running. So, it's not a "requirements" type of thing. Maybe I am just stuck on the phrase "resource requirements". As I also see this being useful for other field types, besides HNSW knn, where holding everything in off-heap memory is critical.

## lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:

```
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
     return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+
```

Review Comment: Here is where we need to be careful with the API. Technically binary quantized vectors keeps the raw vectors (same with other quantized techniques). Technically, they add to the overall off-heap size, though they may not need to be loaded into off-heap memory for search to be fast. For example, if somebody is using `byte` type vectors, they aren't actually quantized, but it's searchable. Meaning, they would all need to be in off-heap memory to be fast.

## lucene/core/src/java/org/apache/lucene/util/OffHeapAccountable.java:

```
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. (standard ASF license header, elided)
+ */
+
+package org.apache.lucene.util;
+
+import java.util.Collection;
+import java.util.Collections;
+
+/**
+ * An object whose off-heap memory requirements can be computed.
+ *
+ * @lucene.internal
```

Review Comment: Is this really an "internal" API? Maybe I misunderstand its use, but it seems designed for folks to make decisions or provide feedback directly using this API. So, maybe we call it "experimental" if unsure of its usage/stability right now?
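To make the intended usage concrete, here is a hedged sketch of how a caller might consume the proposed method. The `OffHeapAccountable`/`offHeapBytes` names come from this PR and may change before merge; the per-category breakdown via child resources is ignored here:

```java
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.util.OffHeapAccountable; // proposed in this PR

class OffHeapSizing {
  // Sum each segment's reported off-heap vector bytes; a caller could then
  // compare the total against the memory available to the page cache.
  static long totalOffHeapBytes(IndexReader reader) {
    long total = 0;
    for (LeafReaderContext ctx : reader.leaves()) {
      if (ctx.reader() instanceof CodecReader cr
          && cr.getVectorReader() instanceof OffHeapAccountable oha) {
        total += oha.offHeapBytes();
      }
    }
    return total;
  }
}
```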
Re: [PR] Adding profiling support for concurrent segment search [lucene]
jpountz commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2769456423

I'd have a top-level tree for everything related to initializing the search and combining results (rewrite(), createWeight(), CollectorManager#reduce) and then a list of trees for each slice. Related, it'd be nice if each per-slice object could also tell us about the thread that it ran in and its start time so that we could understand how exactly Lucene managed to parallelize the search.
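A rough sketch of the data shape being suggested; all names below are hypothetical, not an existing Lucene API:

```java
import java.util.List;

// One top-level tree for rewrite()/createWeight()/CollectorManager#reduce,
// plus one tree per slice, annotated with the executing thread and its start
// timestamp so the parallelization can be reconstructed afterwards.
record ProfileNode(String name, long totalNanos, List<ProfileNode> children) {}

record SliceProfile(String threadName, long startNanos, ProfileNode tree) {}

record SearchProfile(ProfileNode setupAndReduce, List<SliceProfile> slices) {}
```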
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022920202

## lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:

```
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
     return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+
```

Review Comment: Right, that is what `getChildOffHeapResources` tries to address - by allowing to separate graph, quantized, and raw vectors. But I do accept that this may not be all that obvious to the user. Actually I need to fix a missing `getChildOffHeapResources` from Lucene102... for the raw vectors.
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022923407

## lucene/core/src/java/org/apache/lucene/util/OffHeapAccountable.java:

```
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. (standard ASF license header, elided)
+ */
+
+package org.apache.lucene.util;
+
+import java.util.Collection;
+import java.util.Collections;
+
+/**
+ * An object whose off-heap memory requirements can be computed.
+ *
+ * @lucene.internal
+ */
+public interface OffHeapAccountable {
```

Review Comment: Yeah, "requirements" is not the best word here. It's just a "size".
Re: [I] Reuse packedTerms between two TermInSetQuery what combined with IndexOrDocValuesQuery [lucene]
jpountz commented on issue #14425:
URL: https://github.com/apache/lucene/issues/14425#issuecomment-2769216017

Indeed it would be nice if `KeywordField#newSetQuery` didn't pay the CPU and heap price for creating the `PrefixCodedTerms` instance twice. At the same time, let's keep `PrefixCodedTerms` an implementation detail that users don't have to deal with?
Re: [I] Reuse packedTerms between two TermInSetQuery what combined with IndexOrDocValuesQuery [lucene]
mkhludnev commented on issue #14425:
URL: https://github.com/apache/lucene/issues/14425#issuecomment-2769246604

Presumably, one TermInSetQuery may create another with the rewrite method specified. WDYT? Or a TermInSetQueryBuilder may create query after query with different rewrites?
Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]
tteofili commented on code in PR #14094:
URL: https://github.com/apache/lucene/pull/14094#discussion_r2022551917

## lucene/core/src/java/org/apache/lucene/search/HnswKnnCollector.java:

```
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. (standard ASF license header, elided)
+ */
+package org.apache.lucene.search;
+
+/**
+ * {@link KnnCollector} that exposes methods to hook into specific parts of the HNSW algorithm.
+ *
+ * @lucene.experimental
+ */
+public abstract class HnswKnnCollector extends KnnCollector.Decorator {
```

Review Comment: I've spent some time trying to refactor this and extract a wider `nextVectorsBlock` API, but it sounds like conflating too much into this PR, so I'd opt to "only" get rid of the `HnswKnnCollector` class and rely on the strategy.
Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]
tteofili commented on code in PR #14094:
URL: https://github.com/apache/lucene/pull/14094#discussion_r2022636246

## lucene/core/src/java/org/apache/lucene/search/HnswKnnCollector.java:

```
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. (standard ASF license header, elided)
+ */
+package org.apache.lucene.search;
+
+/**
+ * {@link KnnCollector} that exposes methods to hook into specific parts of the HNSW algorithm.
+ *
+ * @lucene.experimental
+ */
+public abstract class HnswKnnCollector extends KnnCollector.Decorator {
```

Review Comment: as a first step I've dropped `HnswKnnCollector` in favor of adding the `nextVectorsBlock` API to `KnnCollector.Decorator`.
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022693959

## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:

```
@@ -306,6 +306,16 @@ private HnswGraph getGraphValues(FieldEntry entry) throws IOException {
     return new OffHeapHnswGraph(entry, bytesSlice);
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      var f = field.value;
+      bytes += f.vectorDataLength() + f.indexDataLength();
+    }
+    return bytes;
```

Review Comment: That's a good point, and one that I did think of too, but I wasn't able to come up with something simple as all these fields are their own individual inner classes, and any refactoring I experimented with just seemed to add complexity. :-(
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
tteofili commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022684543

## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:

```
@@ -306,6 +306,16 @@ private HnswGraph getGraphValues(FieldEntry entry) throws IOException {
     return new OffHeapHnswGraph(entry, bytesSlice);
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      var f = field.value;
+      bytes += f.vectorDataLength() + f.indexDataLength();
+    }
+    return bytes;
```

Review Comment: what about a `util` static method that can be applied here and for other `*VectorsReader` without having to re-implement the same code block multiple times?
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
benwtrent commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2023094888

## lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:

```
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
     return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+
```

Review Comment: @ChrisHegarty yeah, I am not sure :(. I think this is a nice first step, but the goal here is to signal the off-heap size so that the user can determine if they have "enough" off-heap for speed. But, then they need to know how to actually calculate what "enough" is for their typical search case for that format. Not ideal, but maybe it's enough for a step 1.
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on PR #14333:
URL: https://github.com/apache/lucene/pull/14333#issuecomment-2770201445

> We should add this format to RandomCodec then, so that it gets included as part of codec randomization.

OK, did not see this. I know how to do it then. Thanks Adrien :)
Re: [PR] A specialized Trie for Block Tree Index [lucene]
jpountz commented on PR #14333:
URL: https://github.com/apache/lucene/pull/14333#issuecomment-2770123261

> Once we think this is ready, we should prolly merge at first as the non-default Codec

We should add this format to `RandomCodec` then, so that it gets included as part of codec randomization. Otherwise it will only be exercised in `TestXXXPostingsFormat`. We just cut branch_10_2, so this should give us some time to let it bake while still having it released in 10.3.
Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]
jainankitk commented on code in PR #14267:
URL: https://github.com/apache/lucene/pull/14267#discussion_r2023189174

## lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:

```
@@ -120,381 +132,447 @@ public void visit(QueryVisitor visitor) {
   public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
       throws IOException {
+    if (this.equalValues) { // lowerPoint==upperPoint
+      return new SinglePointConstantScoreWeight(this, scoreMode, boost);
+    }
     // We don't use RandomAccessWeight here: it's no good to approximate with "match all docs".
     // This is an inverted structure and should be used in the first pass:
+    return new MultiPointsConstantScoreWeight(this, scoreMode, boost);
+  }
 
-    return new ConstantScoreWeight(this, boost) {
+  /**
+   * Essentially, it is to reduce the number of comparisons. This is an optimization, used for the
+   * case of lowerPoint==upperPoint.
+   */
+  protected class SinglePointConstantScoreWeight extends MultiPointsConstantScoreWeight {
```

Review Comment: I am assuming we are reusing some of the methods from `MultiPointsConstantScoreWeight`. That's why we are extending from that class. May I suggest creating a class, say `PointRangeQueryWeight`, that extends from `ConstantScoreWeight`? And both `SinglePointRangeQueryWeight` and `MultiPointRangeQueryWeight` extend from `PointRangeQueryWeight`?
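A sketch of the hierarchy being proposed in this review; the class names are the reviewer's suggestion, not existing Lucene classes, and the bodies are skeletons only (each concrete subclass would still implement `ConstantScoreWeight`'s abstract methods such as `scorerSupplier` and `isCacheable`):

```java
import org.apache.lucene.search.ConstantScoreWeight;
import org.apache.lucene.search.Query;

// Shared base: common scorer/visit plumbing would live here.
abstract class PointRangeQueryWeight extends ConstantScoreWeight {
  PointRangeQueryWeight(Query query, float boost) { super(query, boost); }

  /** Per-variant point-match logic. */
  abstract boolean matches(byte[] packedValue);
}

// matches(): one equality comparison per dimension against the single point.
abstract class SinglePointRangeQueryWeight extends PointRangeQueryWeight {
  SinglePointRangeQueryWeight(Query query, float boost) { super(query, boost); }
}

// matches(): lower- and upper-bound comparisons per dimension.
abstract class MultiPointRangeQueryWeight extends PointRangeQueryWeight {
  MultiPointRangeQueryWeight(Query query, float boost) { super(query, boost); }
}
```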
Re: [I] Incorrect use of fsync [lucene]
viliam-durina commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2768802025

I think we must fsync the temporary files too. Without fsyncing, when we read them back, they might be incomplete and no error might be reported. We could perhaps avoid fsyncing these if we open them for reading and writing, and do the reading after writing without closing the descriptor in the meantime.
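A minimal sketch of the keep-the-descriptor-open alternative mentioned above, assuming plain NIO (this is not Lucene's Directory abstraction): the readback goes through the same open channel, so it observes the written bytes without relying on a close+fsync cycle in between.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class TempFileRoundTrip {
  static byte[] writeThenRead(Path tmp, byte[] data) throws IOException {
    try (FileChannel ch = FileChannel.open(
        tmp, StandardOpenOption.CREATE_NEW, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
      ch.write(ByteBuffer.wrap(data)); // no fsync (force) before reading back
      ch.position(0);
      ByteBuffer back = ByteBuffer.allocate(data.length);
      while (back.hasRemaining() && ch.read(back) >= 0) {}
      return back.array();
    }
  }
}
```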
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
benwtrent commented on code in PR #14426:
URL: https://github.com/apache/lucene/pull/14426#discussion_r2022982193

## lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102BinaryQuantizedVectorsReader.java:

```
@@ -257,6 +257,15 @@ public long ramBytesUsed() {
     return size;
   }
 
+  @Override
+  public long offHeapBytes() {
+    long bytes = 0L;
+    for (var field : fields.values()) {
+      bytes += field.vectorDataLength();
+    }
+    return bytes;
+  }
+
```

Review Comment: So, the user needs to know "when to sum them" and when "not to sum them" without any information as to what the child resources actually provide? I guess they can do an `instanceof` and determine what they want to do per codec name :( seems brittle. I am not sure of a better way.
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on PR #14333:
URL: https://github.com/apache/lucene/pull/14333#issuecomment-2770173066

Thank you very much for all these careful, warm and helpful comments!

> Are there any major items / blockers?

I think I've addressed all of them (hopefully didn't miss any).

> Do all Lucene tests pass if you force tests to use this in their Codec?

The `Lucene90BlockTreeTermsWriter` is now used by the default codec so tests should catch it easily. All Lucene core tests passed.

> Maybe beast things a bit (run nightly, increase TEST_ITERS, etc.) :)

I've finished a round of `./gradlew :lucene:core:test -Ptests.nightly=true -Ptests.monster=true` and all tests passed. I'll run more.

> Once we think this is ready, we should prolly merge at first as the non-default Codec, and let CI chew on things / let it bake for a while in main? If all goes well, then maybe at some point we backport and make it default for a 10.x?

I understand the idea of merging it into main but not backporting immediately. But I do not know how to make CI catch this change and test it heavily without making it the default codec in main. Or do you mean not backporting to 10.x, but making it the default codec in main? (Sorry for my poor English!)
Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]
tteofili commented on PR #14094:
URL: https://github.com/apache/lucene/pull/14094#issuecomment-2769837089

@benwtrent I've reworked the design exposing `KnnSearchStrategy#nextVectorsBlock`; `PatienceKnnVectorQuery` leverages a `Patience` strategy that calls `HnswQueueSaturationCollector#nextCandidate` on `nextVectorsBlock`. Hopefully this is a bit cleaner now.
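For readers following along, a hedged, illustration-only sketch of the "patience"-style early exit this PR implements: stop exploring once the candidate set stops changing for several consecutive blocks of visited vectors. None of the names below are the PR's actual API, and the real collector tracks queue saturation rather than a hash signature.

```java
// Self-contained toy version of the saturation idea.
final class SaturationTracker {
  private final int patience; // blocks to tolerate without improvement
  private int stableBlocks;
  private long lastSignature;

  SaturationTracker(int patience) { this.patience = patience; }

  /** Call once per visited block with a cheap signature of the current top-k. */
  boolean saturated(long topKSignature) {
    if (topKSignature == lastSignature) {
      stableBlocks++;
    } else {
      stableBlocks = 0;
      lastSignature = topKSignature;
    }
    return stableBlocks >= patience; // caller would then end the HNSW search early
  }
}
```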
Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]
hanbj commented on code in PR #14267:
URL: https://github.com/apache/lucene/pull/14267#discussion_r2024183379

## lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:

```
@@ -120,381 +132,447 @@ public void visit(QueryVisitor visitor) {
   public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
       throws IOException {
+    if (this.equalValues) { // lowerPoint==upperPoint
+      return new SinglePointConstantScoreWeight(this, scoreMode, boost);
+    }
     // We don't use RandomAccessWeight here: it's no good to approximate with "match all docs".
     // This is an inverted structure and should be used in the first pass:
+    return new MultiPointsConstantScoreWeight(this, scoreMode, boost);
+  }
 
-    return new ConstantScoreWeight(this, boost) {
+  /**
+   * Essentially, it is to reduce the number of comparisons. This is an optimization, used for the
+   * case of lowerPoint==upperPoint.
+   */
+  protected class SinglePointConstantScoreWeight extends MultiPointsConstantScoreWeight {
```

Review Comment: This suggestion is great: SinglePointRangeQueryWeight and MultiPointRangeQueryWeight only need to implement their own point-value matching logic and relation judgment.
Re: [I] Incorrect use of fsync [lucene]
dweiss commented on issue #14334:
URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771397718

But why would you want to read a temporary file after a crash? These are... temporary - if a process crashed, there is no recovery at all (at least concerning temporary files).
Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]
jainankitk commented on code in PR #14397:
URL: https://github.com/apache/lucene/pull/14397#discussion_r2021750116

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java:

```
@@ -98,12 +98,17 @@ public void decompress(DataInput in, int originalLength, int offset, int length,
       final int blockLength = in.readVInt();
       final int numBlocks = readCompressedLengths(in, originalLength, dictLength, blockLength);
-
-      buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength);
       bytes.length = 0;
-      // Read the dictionary
-      if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
-        throw new CorruptIndexException("Illegal dict length", in);
+      if (reused) {
+        assert buffer.length >= dictLength + blockLength;
+        in.skipBytes(compressedLengths[0]);
+      } else {
+        // Read the dictionary
+        buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength);
+        if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
+          throw new CorruptIndexException("Illegal dict length", in);
+        }
+        reused = true;
```

Review Comment: I am wondering if we should consider exposing a metric on how many times we could reuse the dictionary, and how many times we had to read it from disk? That would provide some useful insight into the usefulness of this change.
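A small sketch of what such a metric could look like; Lucene exposes no such counters today, so everything below is hypothetical. The decompressor would bump one counter per branch of the `if (reused)` check quoted above.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical hit/miss counters for the cached preset dictionary.
final class PresetDictStats {
  final LongAdder reused = new LongAdder(); // dict served from the cached buffer
  final LongAdder loaded = new LongAdder(); // dict decompressed from disk

  double reuseRatio() {
    long r = reused.sum(), l = loaded.sum();
    return r + l == 0 ? 0.0 : (double) r / (r + l);
  }
}
```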
[I] fix TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce [lucene]
guojialiang92 opened a new issue, #14423:
URL: https://github.com/apache/lucene/issues/14423

### Description

I found that the test `TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce` may fail in rare cases. The exception is:

`java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 108 open files`

I looked up some other related issues and finally located the problem. At the same time, I also added a test that reproduces the problem reliably.

### Analysis

The root cause is that `IndexWriter#ensureOpen()` throws `AlreadyClosedException` when `IndexWriter#closed` or `IndexWriter#closing` is `true`. In the test code, after `IndexWriter#commit` throws `AlreadyClosedException`, `IndexWriter#close` is not called again to ensure that the `IndexWriter` closes normally.

```
try {
  writer.commit();
  writer.close();
  success = true;
} catch (
    @SuppressWarnings("unused")
    AlreadyClosedException ace) {
  // OK: abort closes the writer
  assertTrue(writer.isDeleterClosed());
} catch (
    @SuppressWarnings("unused")
    IOException ioe) {
  writer.rollback();
  failure.clearDoFail();
}
```

### To Reproduce

To reproduce the problem reliably, I added a test `TestIndexWriterWithThreads#testIOExceptionWithMergeNotEndLongTime`. The code executes in the following order:

1. Set `LiveIndexWriterConfig#readerPooling` to `true` to ensure that `ReaderPool#release` does not release `ReadersAndUpdates`.
2. Start an `IndexerThread` and start writing data.
3. Wait for a merge thread to start working (using `mergeCountDownLatch`), then simulate a write failure via `MockDirectoryWrapper#failOn`.
4. Block the execution of the merge thread until `IndexWriter#updateDocument` throws an exception due to the write failure (using `updateDocumentFailCountDownLatch`).
5. The merge thread and the IW thread will both call `IndexWriter#maybeCloseOnTragicEvent`; I let the merge thread execute `IndexWriter#rollbackInternalNoCommit` first, and the IW thread skips it (using `mergeCloseCountDownLatch`).
6. Because the execution of the merge thread is asynchronous, the test continues until `IndexWriter#commit` is called and an `AlreadyClosedException` is thrown.
7. To prevent `IndexInput` from being closed before `MockDirectoryWrapper#close` is called, I let `ConcurrentMergeScheduler#close` sleep for 5s.

### Solution

The closing of `IndexInput` needs to wait until `ReaderPool#close` runs in `IndexWriter#rollbackInternalNoCommit`. After `IndexWriter#commit` throws `AlreadyClosedException`, it is necessary to call `IndexWriter#close` in a finally block:

```
try {
  writer.commit();
  writer.close();
  success = true;
} catch (
    @SuppressWarnings("unused")
    AlreadyClosedException ace) {
  // OK: abort closes the writer
  assertTrue(writer.isDeleterClosed());
} catch (
    @SuppressWarnings("unused")
    IOException ioe) {
  writer.rollback();
  failure.clearDoFail();
} finally {
  writer.close();
}
```

### Related issues

[10930](https://github.com/apache/lucene/issues/10930), [13552](https://github.com/apache/lucene/issues/13552)

### Version and environment details

_No response_
Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]
vsop-479 commented on PR #14371:
URL: https://github.com/apache/lucene/pull/14371#issuecomment-2768597044

Adjust `ENABLE_ADVANCE_WITHIN_BLOCK_VECTOR_OPTO` to 16 (at least 16 lanes, such as AVX, AVX-512).
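For context on the 16-lane threshold: on the Panama vector API the preferred int species reports its lane count (8 for 256-bit AVX2, 16 for 512-bit AVX-512), which is the kind of value such a threshold would be compared against. A standalone illustration (requires `--add-modules jdk.incubator.vector`; this is not the PR's code):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class LaneCount {
  public static void main(String[] args) {
    // The widest int vector shape the JVM prefers on this CPU.
    VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
    System.out.println("preferred int lanes: " + species.length());
  }
}
```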
Re: [PR] A specialized Trie for Block Tree Index [lucene]
mikemccand commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2022734745

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:

```
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. (standard ASF license header, elided)
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix point to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block consists entirely of
+   *     pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing
+   *     a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+    BUILDING,
+    SAVED,
+    DESTROYED
+  }
+
+  private static class Node {
+
+    // The utf8 digit that leads to this Node, 0 for root node
+    private final int label;
+    // The children listed in order by their utf8 label
+    private final LinkedList<Node> children;
+    // The output of this node.
+    private Output output;
+
+    // Vars used during saving:
+
+    // The file pointer point to where the node saved. -1 means the node has not been saved.
+    private long fp = -1;
+    // The iterator whose next() point to the first child has not been saved.
+    private Iterator<Node> childrenIterator;
+
+    Node(int label, Output output, LinkedList<Node> children) {
+      this.label = label;
+      this.output = output;
+      this.children = children;
+    }
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+    return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+    if (k.length == 0) {
+      root.output = v;
+      return;
+    }
+    Node parent = root;
+    for (int i = 0; i < k.length; i++) {
+      int b = k.bytes[i + k.offset] & 0xFF;
+      Output output = i == k.length - 1 ? v : null;
+      Node node = new Node(b, output, new LinkedList<>());
+      parent.children.add(node);
+      parent = node;
+    }
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not
+   * have key that already exists in this one, otherwise a {@link IllegalArgumentException } will be
+   * thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+    if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+      throw new IllegalStateException("tries should be unsaved");
+    }
+    // Use a simple stack to avoid recursion.
+    Deque<Runnable> stack = new ArrayDeque<>();
+    stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+    while (!stack.isEmpty()) {
+      stack.pop().run();
+    }
+    trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque<Runnable> stack) {
+    assert n.label == add.label;
+    if (add.output != null) {
+      if (n.output != null) {
```

(The quoted hunk is truncated here in the original message.)
Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]
hanbj commented on code in PR #14267:
URL: https://github.com/apache/lucene/pull/14267#discussion_r2022373498

## lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:

```
@@ -129,6 +141,16 @@ public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, fl
 
     private boolean matches(byte[] packedValue) {
       int offset = 0;
+
+      if (equalValues) {
+        for (int dim = 0; dim < numDims; dim++, offset += bytesPerDim) {
+          if (comparator.compare(packedValue, offset, lowerPoint, offset) != 0) {
+            return false;
+          }
+        }
+        return true;
+      }
```

Review Comment: I will implement it first. Thank you for your hard work in the review.
[I] Reuse packedTerms between two TermInSetQuery what combined with IndexOrDocValuesQuery [lucene]
mkhludnev opened a new issue, #14425:
URL: https://github.com/apache/lucene/issues/14425

### Description

In cases like these:

```
new IndexOrDocValuesQuery(
    new TermInSetQuery(MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, name(), iBytesRefs),
    new TermInSetQuery(MultiTermQuery.DOC_VALUES_REWRITE, name(), dVByteRefs));
```

I would like to reuse packedTerms between `TermInSetQuery` instances. What's the best approach to do it? Or is it not worthwhile?
Re: [PR] SortedSet DV Multi Range query [lucene]
mkhludnev commented on code in PR #13974:
URL: https://github.com/apache/lucene/pull/13974#discussion_r2023179556

## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java:

```
@@ -0,0 +1,300 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. (standard ASF license header, elided)
+ */
+package org.apache.lucene.sandbox.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Objects;
+import org.apache.lucene.document.SortedSetDocValuesField;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.ConstantScoreScorer;
+import org.apache.lucene.search.ConstantScoreWeight;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.DocValuesRangeIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Scorer;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TwoPhaseIterator;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongBitSet;
+
+/** A union multiple ranges over SortedSetDocValuesField */
+public class SortedSetMultiRangeQuery extends Query {
+  private final String field;
+  private final int bytesPerDim;
+  private final ArrayUtil.ByteArrayComparator comparator;
+  List<MultiRangeQuery.RangeClause> rangeClauses;
+
+  SortedSetMultiRangeQuery(
+      String name,
+      List<MultiRangeQuery.RangeClause> clauses,
+      int bytes,
+      ArrayUtil.ByteArrayComparator comparator) {
+    this.field = name;
+    this.rangeClauses = clauses;
+    this.bytesPerDim = bytes;
+    this.comparator = comparator;
+  }
+
+  /** Builder for creating a SortedSetMultiRangeQuery. */
+  public static class Builder {
+    private final String name;
+    protected final List<MultiRangeQuery.RangeClause> clauses = new ArrayList<>();
+    private final int bytes;
+    private final ArrayUtil.ByteArrayComparator comparator;
+
+    public Builder(String name, int bytes) {
+      this.name = Objects.requireNonNull(name);
+      this.bytes = bytes; // TODO assrt positive
+      this.comparator = ArrayUtil.getUnsignedComparator(bytes);
+    }
+
+    public Builder add(BytesRef lowerValue, BytesRef upperValue) {
+      byte[] low = lowerValue.clone().bytes;
+      byte[] up = upperValue.clone().bytes;
+      if (this.comparator.compare(low, 0, up, 0) > 0) {
+        throw new IllegalArgumentException("lowerValue must be <= upperValue");
+      } else {
+        clauses.add(new MultiRangeQuery.RangeClause(low, up));
+      }
+      return this;
+    }
+
+    public Query build() {
+      if (clauses.isEmpty()) {
+        return new BooleanQuery.Builder().build();
+      }
+      if (clauses.size() == 1) {
+        return SortedSetDocValuesField.newSlowRangeQuery(
+            name,
+            new BytesRef(clauses.getFirst().lowerValue),
+            new BytesRef(clauses.getFirst().upperValue),
+            true,
+            true);
+      }
+      return new SortedSetMultiRangeQuery(name, clauses, this.bytes, comparator);
+    }
+  }
+
+  @Override
+  public Query rewrite(IndexSearcher indexSearcher) throws IOException {
+    ArrayList<MultiRangeQuery.RangeClause> sortedClauses = new ArrayList<>(this.rangeClauses);
+    sortedClauses.sort(
+        (o1, o2) -> {
+          // if (result == 0) {
+          //   return comparator.compare(o1.upperValue, 0, o2.upperValue, 0);
+          // } else {
+          return comparator.compare(o1.lowerValue, 0, o2.lowerValue, 0);
+          // }
+        });
+    if (!this.rangeClauses.equals(sortedClauses)) {
+      return new SortedSetMultiRangeQuery(
+          this.field, sortedClauses, this.bytesPerDim, this.comparator);
+    } else {
+      return this;
+    }
+  }
+
+  @Override
+  public String toString(String fld) {
+    return "SortedSetMultiRangeQuery{"
+        + "field='"
+        + fld
```

(The quoted hunk is truncated here in the original message.)