Re: [PR] Replace Map with IntObjectHashMap for KnnVectorsReader [lucene]
bugmakerr commented on code in PR #13763: URL: https://github.com/apache/lucene/pull/13763#discussion_r1756317165 ## lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java: ## @@ -239,51 +245,69 @@ public FieldsReader(final SegmentReadState readState) throws IOException { * @param field the name of a numeric vector field */ public KnnVectorsReader getFieldReader(String field) { - return fields.get(field); + final FieldInfo info = fieldInfos.fieldInfo(field); + if (info == null) { +return null; + } + return fields.get(info.number); } @Override public void checkIntegrity() throws IOException { - for (KnnVectorsReader reader : fields.values()) { -reader.checkIntegrity(); + for (ObjectCursor cursor : fields.values()) { +cursor.value.checkIntegrity(); } } @Override public FloatVectorValues getFloatVectorValues(String field) throws IOException { - KnnVectorsReader knnVectorsReader = fields.get(field); - if (knnVectorsReader == null) { + final FieldInfo info = fieldInfos.fieldInfo(field); + KnnVectorsReader reader; + if (info == null || (reader = fields.get(info.number)) == null) { return null; Review Comment: forgive me if i misunderstand something, but the javadoc says that **the return value is never null**, and we do return null in `PerFieldKnnVectorsFormat`. If we check the field info in caller, `AssertingKnnVectorsReader` will always get the non-null value from the delegate. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]
javanna commented on code in PR #13747: URL: https://github.com/apache/lucene/pull/13747#discussion_r1756305014 ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: I wonder if we need to introduce these new abstractions: MergeableCollector and MergeableCollectorManager. I can see what you are up to, many times we use Collector themselves to reduce multiple collectors into. I am not a big fan of that design though, it seems like a quick way to move to using collector managers, but ideally what reduce and hence the search method returns isn't a collector, and we would not use collectors to retrieve state after search is executed. How much more complicated would the code be without these new concepts, writing the collector managers that are needed and adding that logic to reduce? Would that cause a lot of code duplication? ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: Another take at this, maybe `merge` does not fit that well in a `Collector` design but we could have a base collector manager impl that exposes the same merge method. After all, I think merging should be a collector manager's concern? Maybe we would even call it reduce: `protected C reduce(C collector);` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Can `gradle tidy` reflow text properly when it inserts newlines? [lucene]
dweiss commented on issue #13766: URL: https://github.com/apache/lucene/issues/13766#issuecomment-2345552152 So I just double checked and block-level comments are treated as preformatted within code. This: ``` /* * fba sdifhopi hweifh swehf doish dsopih ewfpoih efwspoihb wepihb apin wef. spedfih [wefph wepfih wqefdpoi qp[hjfaph we. * bar bar */ // fba sdifhopi hweifh swehf doish dsopih ewfpoih efwspoihb wepihb apin wef. spedfih [wefph wepfih wqefdpoi qp[hjfaph we. // bar bar ``` is tidied up to this: ``` /* * fba sdifhopi hweifh swehf doish dsopih ewfpoih efwspoihb wepihb apin wef. spedfih [wefph wepfih wqefdpoi qp[hjfaph we. * bar bar */ // fba sdifhopi hweifh swehf doish dsopih ewfpoih efwspoihb wepihb apin wef. spedfih [wefph // wepfih wqefdpoi qp[hjfaph we. // bar bar ``` I personally think this is the right choice by gjf - breaks long lines into visibility but does not break any other line comments (and potentially damage them). Block comments ```/* */``` can be used to preserve any formatting that should not be touched at all. I think this issue can be closed, sorry if it bothers you, but it's an operator's error. :) There are more headaches coming with the markdown comments spec [1]. It'll be fun. [1] https://openjdk.org/jeps/467 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]
msfroh commented on code in PR #13747: URL: https://github.com/apache/lucene/pull/13747#discussion_r1756363078 ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: Makes sense! Given that the merge logic is collector-specific, I'm thinking that collectors should expose what they produce after reduction. This also makes it easier to push the N-way merge into the collectors. Maybe the resulting CollectorManager might be more reusable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]
msfroh commented on code in PR #13747: URL: https://github.com/apache/lucene/pull/13747#discussion_r1756378484 ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: Actually, maybe we're disagreeing? I think you're suggesting that the CollectorManager should handle merging, while I'm suggesting that Collector should have some output and know how to merge multiple outputs. This would mean that every collector needs a corresponding CollectorManager, or at least a CollectorManager for every output type, right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]
msfroh commented on code in PR #13747: URL: https://github.com/apache/lucene/pull/13747#discussion_r1756385824 ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: Hmm... While it may be a bigger change, it feels like we'll we should really have just one CollectorManager that merges results from Collecors based on what the Collectors collect. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]
msfroh commented on code in PR #13747: URL: https://github.com/apache/lucene/pull/13747#discussion_r1756385824 ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: Hmm... While it may be a bigger change, it feels like we should really have just one CollectorManager that merges results from Collectors based on what the Collectors collect. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]
msfroh commented on code in PR #13747: URL: https://github.com/apache/lucene/pull/13747#discussion_r1756429132 ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: I can reimplent my follow-up PR to do three separate CollectorManagers that handle the merge logic, I suppose. That would be more consistent with the existing Lucene code, but I'm not sure that I agree with that approach. It essentially means that every Collector needs a parallel CollectorManager that knows how to merge that Collector. For something like these join Collectors, it also means that the CollectorManager needs to know how to implement the ScoreMode (min, max, sum, or avg). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]
msfroh commented on code in PR #13747: URL: https://github.com/apache/lucene/pull/13747#discussion_r1756429132 ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: I can reimplement my follow-up PR to do three separate CollectorManagers that handle the merge logic, I suppose. That would be more consistent with the existing Lucene code, but I'm not sure that I agree with that approach. It essentially means that every Collector needs a parallel CollectorManager that knows how to merge that Collector. For something like these join Collectors, it also means that the CollectorManager needs to know how to implement the ScoreMode (min, max, sum, or avg). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]
cpoerschke commented on code in PR #13757: URL: https://github.com/apache/lucene/pull/13757#discussion_r1756534635 ## lucene/core/src/java/org/apache/lucene/search/similarities/DFRSimilarity.java: ## @@ -94,6 +94,26 @@ public class DFRSimilarity extends SimilarityBase { */ public DFRSimilarity( BasicModel basicModel, AfterEffect afterEffect, Normalization normalization) { +this(basicModel, afterEffect, normalization, true); + } + + /** + * Creates DFRSimilarity from the three components. + * + * Note that null values are not allowed: if you want no normalization, instead + * pass {@link NoNormalization}. + * + * @param basicModel Basic model of information content + * @param afterEffect First normalization of information gain + * @param normalization Second (length) normalization + * @param discountOverlaps True if overlap tokens (tokens with a position of increment of zero) Review Comment: Good catch! ```suggestion * @param discountOverlaps True if overlap tokens (tokens with a position of increment of zero) * are discounted from the document's length. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]
cpoerschke commented on code in PR #13757: URL: https://github.com/apache/lucene/pull/13757#discussion_r1756540884 ## lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java: ## @@ -111,7 +133,17 @@ protected Similarity() {} * @param state current processing state for this field * @return computed norm value */ - public abstract long computeNorm(FieldInvertState state); + public long computeNorm(FieldInvertState state) { Review Comment: I don't feel knowledgeable enough to properly document/edit on this. _"Allow edits and access to secrets by maintainers"_ is enabled for the pull request -- please feel free to (in-browser or otherwise) change the documentation here. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Can we decrease the overhead of skipping? [lucene]
jpountz commented on issue #13106: URL: https://github.com/apache/lucene/issues/13106#issuecomment-2345810711 Fixed by https://github.com/apache/lucene/pull/13585 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Can we decrease the overhead of skipping? [lucene]
jpountz closed issue #13106: Can we decrease the overhead of skipping? URL: https://github.com/apache/lucene/issues/13106 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]
javanna opened a new pull request, #13773: URL: https://github.com/apache/lucene/pull/13773 The order in which documents are processed is not a guarantee, hence we may return a different set of documents when early terminating the search. Those are not incorrect results though. I opted for increasing the number of hits so that we never early terminate the search for this test. I believe this test should have failed with inter-segment concurrency as well, and intra-segment concurrency makes it more likely to fail (although not so frequently). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]
javanna commented on code in PR #13747: URL: https://github.com/apache/lucene/pull/13747#discussion_r1756577220 ## lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java: ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.join; + +import java.io.IOException; +import org.apache.lucene.search.Collector; + +interface MergeableCollector> extends Collector { + + /** Folds the results from another collector of the same type into this collector. */ + void merge(C collector) throws IOException; +} Review Comment: > This would mean that every collector needs a corresponding CollectorManager, or at least a CollectorManager for every output type, right? I guess it depends, you could also reuse the same collector manager across collectors if they share a sub-type perhaps. But as a general guideline, yes. After all, search goes through collector managers and collector managers are responsible to create collectors and reduce multiple collectors into a single result, whether that be a collector or some more meaningful result class. I don't have a strong opinion, but I thought that moving your merge method to the collector manager should not be a big change, I may be missing some details perhaps. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] PayloadScoreQuery javadoc update w.r.t. SpanQuery use [lucene]
cpoerschke merged PR #13731: URL: https://github.com/apache/lucene/pull/13731 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Address length used to copy array in FacetsCollector to not be out of bounds [lucene]
javanna merged PR #13774: URL: https://github.com/apache/lucene/pull/13774 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]
javanna opened a new pull request, #13775: URL: https://github.com/apache/lucene/pull/13775 There's two tests where we use 250_000 as number of collected hits, but we only ever index max 2000 docs. That makes use create a priority queue of size 250_000 for each segment partition which causes out of memory errors when the number of partitions is higher than a few. With this commit I propose that we lower the threshold to 2000 for those tests that need a high number of collected hits. The assumption that a priority queue is not built within the LargeNumHitsTopDocsCollector still holds so this change should not defeat the purpose of the tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]
javanna commented on code in PR #13775: URL: https://github.com/apache/lucene/pull/13775#discussion_r1756658422 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/LargeNumHitsTopDocsCollector.java: ## @@ -137,7 +137,7 @@ protected void populateResults(ScoreDoc[] results, int howMany) { } // Total number of hits collected were less than requestedHitCount -assert totalHits < requestedHitCount; +assert totalHits <= requestedHitCount; Review Comment: this appears to be an existing bug -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Can we remove `compress` option for quantized KNN vector indexing? [lucene]
mikemccand commented on issue #13768: URL: https://github.com/apache/lucene/issues/13768#issuecomment-2346033320 Well, I ran [`knnPerfTest.py`](https://github.com/mikemccand/luceneutil/blob/f4a07ed8de36c47aacb6033a3709e236bc42aca4/src/python/knnPerfTest.py) on my Linux dev box (x86-64 Raptorlake i9-13900K). This CPU has crazy number of flags, but NOT AVX-512: ``` flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities ``` This is with Panama enabled (`INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled`). I'll try with Panama disabled next. Results: ``` recall latency (ms) nDoc topK fanout maxConn beamWidth quantized index s force merge s num segments index size (MB) 0.319 0.167 15010 6 32 50 4 bits 86.01 64.49 1 5013.19 0.326 0.187 15010 6 32 50 4 bits 89.03 74.99 1 5562.51 ``` Unfortunately the output doesn't state it, but the first row is `compress=True` and 2nd is `compress=False`. Indeed, latency (search time) got faster (187 usec -> 167 usec) with `compress=True`, and this is quite a reduction (~10% ish) in index size. Indexing and force merge time did get a bit slower ... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]
jpountz commented on PR #13773: URL: https://github.com/apache/lucene/pull/13773#issuecomment-2346043431 > Those are not incorrect results though. I think that these are incorrect results. Inter-segment concurrency has a mechanism so that if the top-N-th hit has score S and doc ID D, then it would require a score of at least Math.nextUp(S) for doc IDs > D and S for doc IDs < D (relying on tie-breaking by doc ID). It looks like this mechanism no longer works with intra-segment concurrency? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]
jpountz commented on PR #13775: URL: https://github.com/apache/lucene/pull/13775#issuecomment-2346049667 A property of this collector is that it's supposed to allocate RAM in order of the actual number of collected hits rather than the max number of hits to retrieve. So this change defeats a bit the prupose of the test. I would rather update the test to contain the total number of slices. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Can we remove `compress` option for quantized KNN vector indexing? [lucene]
mikemccand commented on issue #13768: URL: https://github.com/apache/lucene/issues/13768#issuecomment-2346091188 OK I disabled Panama (via temporary code change in `VectorizationProvider.java` -- we don't accept sysprops to disable this anymore right?): ``` recall latency (ms) nDoc topK fanout maxConn beamWidth quantized index s force merge s num segments index size (MB) 0.318 0.312 15010 6 32 50 4 bits 175.21 125.64 1 5013.21 0.319 0.285 15010 6 32 50 4 bits 174.54 124.80 1 5562.51 ``` Indeed there is some performance penalty now (285 usec -> 312 usec, ~9.5%) ... recall also bounced around a bit, but prolly that's acceptable HNSW randomness noise. And wow look how much slower indexing / force merging got ... those SIMD instructions clearly help ;) But I don't think we should block removing `compress` option due to non-SIMD results? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Integrate merge-time index reordering with the intra-merge executor. [lucene]
jpountz merged PR #13289: URL: https://github.com/apache/lucene/pull/13289 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Migrate more classes to records. [lucene]
shubhamvishu commented on code in PR #13772: URL: https://github.com/apache/lucene/pull/13772#discussion_r1756801465 ## lucene/core/src/java/org/apache/lucene/util/TermAndVector.java: ## @@ -24,37 +24,24 @@ * * @lucene.experimental */ -public class TermAndVector { - - private final BytesRef term; - private final float[] vector; - - public TermAndVector(BytesRef term, float[] vector) { -this.term = term; -this.vector = vector; - } - - public BytesRef getTerm() { -return this.term; - } - - public float[] getVector() { -return this.vector; - } +public record TermAndVector(BytesRef term, float[] vector) { public int size() { return vector.length; } - public void normalizeVector() { + /** Return a {@link TermAndVector} whose vector is normalized according to the L2 norm. */ + public TermAndVector normalizeVector() { float vectorLength = 0; for (int i = 0; i < vector.length; i++) { vectorLength += vector[i] * vector[i]; } vectorLength = (float) Math.sqrt(vectorLength); +float[] newVector = new float[vector.length]; for (int i = 0; i < vector.length; i++) { - vector[i] /= vectorLength; + newVector[i] = vector[i] / vectorLength; } Review Comment: Could we instead just use the existing `VectorUtil#l2normalize` here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Can we remove `compress` option for quantized KNN vector indexing? [lucene]
ChrisHegarty commented on issue #13768: URL: https://github.com/apache/lucene/issues/13768#issuecomment-2346214427 > But I don't think we should block removing compress option due to non-SIMD results? I agree. At this point we're just comparing the scalar and SIMD implementation of the distance functions. For vector operations, we really need SIMD, and I think we're ok with this approach. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
shubhamvishu commented on code in PR #12868: URL: https://github.com/apache/lucene/pull/12868#discussion_r1756832910 ## lucene/core/src/java/org/apache/lucene/util/StringHelper.java: ## @@ -209,6 +209,156 @@ public static int murmurhash3_x86_32(BytesRef bytes, int seed) { return murmurhash3_x86_32(bytes.bytes, bytes.offset, bytes.length, seed); } + /** + * Generates 128-bit hash from the byte array with the given offset, length and seed. + * + * The code is adopted from Apache Commons (https://commons.apache.org/proper/commons-codec/jacoco/org.apache.commons.codec.digest/MurmurHash3.java.html";>link) + * + * @param data The input byte array + * @param offset The first element of array + * @param length The length of array + * @param seed The initial seed value + * @return The 128-bit hash (2 longs) + */ + public static long[] murmurhash3_x64_128( + final byte[] data, final int offset, final int length, final int seed) { +// Use an unsigned 32-bit integer as the seed +return murmurhash3_x64_128(data, offset, length, seed & 0xL); + } + + @SuppressWarnings("fallthrough") + private static long[] murmurhash3_x64_128( + final byte[] data, final int offset, final int length, final long seed) { +long h1 = seed; +long h2 = seed; +final int nblocks = length >> 4; + +// Constants for 128-bit variant +final long C1 = 0x87c37b91114253d5L; +final long C2 = 0x4cf5ad432745937fL; +final int R1 = 31; +final int R2 = 27; +final int R3 = 33; +final int M = 5; +final int N1 = 0x52dce729; +final int N2 = 0x38495ab5; + +// body +for (int i = 0; i < nblocks; i++) { + final int index = offset + (i << 4); + long k1 = (long) BitUtil.VH_LE_LONG.get(data, index); + long k2 = (long) BitUtil.VH_LE_LONG.get(data, index + 8); + + // mix functions for k1 + k1 *= C1; + k1 = Long.rotateLeft(k1, R1); + k1 *= C2; + h1 ^= k1; + h1 = Long.rotateLeft(h1, R2); + h1 += h2; + h1 = h1 * M + N1; + + // mix functions for k2 + k2 *= C2; + k2 = Long.rotateLeft(k2, R3); + k2 *= C1; + h2 ^= k2; + h2 = Long.rotateLeft(h2, R1); + h2 += h1; + h2 = h2 * M + N2; +} + +// tail +long k1 = 0; +long k2 = 0; +final int index = offset + (nblocks << 4); +switch (offset + length - index) { Review Comment: Makes sense. I'll update it (Maybe we could do this in the [Apache commons impl](https://github.com/apache/commons-codec/blob/master/src/main/java/org/apache/commons/codec/digest/MurmurHash3.java) also) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
shubhamvishu commented on code in PR #12868: URL: https://github.com/apache/lucene/pull/12868#discussion_r1756836548 ## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java: ## @@ -150,9 +150,10 @@ private FuzzySet(FixedBitSet filter, int bloomSize, int hashCount) { * @return NO or MAYBE */ public ContainsResult contains(BytesRef value) { -long hash = hashFunction.hash(value); -int msb = (int) (hash >>> Integer.SIZE); -int lsb = (int) hash; +long[] hash = hashFunction.hash128(value); + +int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1; +int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1; Review Comment: I'm fine with doing this (though I'll look into it separately why the smaller has values are showing better gains in the benchmark). For now, lets keep it as you propose. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]
rmuir commented on PR #13757: URL: https://github.com/apache/lucene/pull/13757#issuecomment-2346252096 I took a stab at improving the docs in https://github.com/apache/lucene/pull/13757/commits/47d4fa02f3c52fffc5aac9328a1de9c0e640ff27 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Migrate more classes to records. [lucene]
jpountz commented on code in PR #13772: URL: https://github.com/apache/lucene/pull/13772#discussion_r1756843565 ## lucene/core/src/java/org/apache/lucene/util/TermAndVector.java: ## @@ -24,37 +24,24 @@ * * @lucene.experimental */ -public class TermAndVector { - - private final BytesRef term; - private final float[] vector; - - public TermAndVector(BytesRef term, float[] vector) { -this.term = term; -this.vector = vector; - } - - public BytesRef getTerm() { -return this.term; - } - - public float[] getVector() { -return this.vector; - } +public record TermAndVector(BytesRef term, float[] vector) { public int size() { return vector.length; } - public void normalizeVector() { + /** Return a {@link TermAndVector} whose vector is normalized according to the L2 norm. */ + public TermAndVector normalizeVector() { float vectorLength = 0; for (int i = 0; i < vector.length; i++) { vectorLength += vector[i] * vector[i]; } vectorLength = (float) Math.sqrt(vectorLength); +float[] newVector = new float[vector.length]; for (int i = 0; i < vector.length; i++) { - vector[i] /= vectorLength; + newVector[i] = vector[i] / vectorLength; } Review Comment: Good call -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]
javanna commented on PR #13775: URL: https://github.com/apache/lucene/pull/13775#issuecomment-2346263866 Thanks for clarifying! But then is it a feature that it goes out of memory now? Especially given that it's a side by side comparison, and what causes the OOM is the ordinary top docs search, and not the large num hits collector, which suggests that is indeed more optimized in terms of memory allocated. I would think that if memory usage is what we want to test, we should probably have a different test? Or maybe it makes sense to lower the threshold only for the top docs collector part, which is the one that causes issues and we are only using to compare final results? Note that the second search does not even use concurrency as it uses the deprecated `search(Query, Collector)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Simplify FST return [lucene]
jpountz merged PR #13771: URL: https://github.com/apache/lucene/pull/13771 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]
jpountz commented on PR #13775: URL: https://github.com/apache/lucene/pull/13775#issuecomment-2346271119 > Or maybe it makes sense to lower the threshold only for the top docs collector part, +1 to that, let's pass n=reader.numDocs()? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]
javanna commented on PR #13773: URL: https://github.com/apache/lucene/pull/13773#issuecomment-2346282034 Happy that you commented, thanks! Can you help me better understand why these are incorrect results? ``` Error Message: java.lang.AssertionError: Hit 0 docnumbers don't match Hits length1=25 length2=25 hit=0: doc0=1.0 shardIndex=-1, doc155=1.0 shardIndex=-1 hit=1: doc1=1.0 shardIndex=-1, doc156=1.0 shardIndex=-1 hit=2: doc2=1.0 shardIndex=-1, doc157=1.0 shardIndex=-1 hit=3: doc3=1.0 shardIndex=-1, doc158=1.0 shardIndex=-1 hit=4: doc4=1.0 shardIndex=-1, doc159=1.0 shardIndex=-1 hit=5: doc5=1.0 shardIndex=-1, doc160=1.0 shardIndex=-1 hit=6: doc6=1.0 shardIndex=-1, doc161=1.0 shardIndex=-1 hit=7: doc7=1.0 shardIndex=-1, doc162=1.0 shardIndex=-1 hit=8: doc8=1.0 shardIndex=-1, doc163=1.0 shardIndex=-1 hit=9: doc9=1.0 shardIndex=-1, doc164=1.0 shardIndex=-1 hit=10: doc10=1.0 shardIndex=-1, doc165=1.0 shardIndex=-1 hit=11: doc11=1.0 shardIndex=-1, doc166=1.0 shardIndex=-1 hit=12: doc12=1.0 shardIndex=-1, doc167=1.0 shardIndex=-1 hit=13: doc13=1.0 shardIndex=-1, doc168=1.0 shardIndex=-1 hit=14: doc14=1.0 shardIndex=-1, doc169=1.0 shardIndex=-1 hit=15: doc15=1.0 shardIndex=-1, doc170=1.0 shardIndex=-1 hit=16: doc16=1.0 shardIndex=-1, doc171=1.0 shardIndex=-1 hit=17: doc17=1.0 shardIndex=-1, doc172=1.0 shardIndex=-1 hit=18: doc18=1.0 shardIndex=-1, doc173=1.0 shardIndex=-1 hit=19: doc19=1.0 shardIndex=-1, doc174=1.0 shardIndex=-1 hit=20: doc20=1.0 shardIndex=-1, doc175=1.0 shardIndex=-1 hit=21: doc21=1.0 shardIndex=-1, doc176=1.0 shardIndex=-1 hit=22: doc22=1.0 shardIndex=-1, doc177=1.0 shardIndex=-1 hit=23: doc23=1.0 shardIndex=-1, doc178=1.0 shardIndex=-1 hit=24: doc24=1.0 shardIndex=-1, doc179=1.0 shardIndex=-1 for query:field:* ``` From the failure, I see all docs have the same score, but doc ids don't match. I assume that the column on the right targeted a leaf partition, which is why doc ids don't start from 0. I think that we are still early terminating though or you mean that we shouldn't have as there were docs with same score but lower doc ids? I do think that I am missing something though, or this test could have failed with inter-segment concurrency too, because it would otherwise rely on the first segment to be the first one being searched at all times? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]
javanna commented on PR #13775: URL: https://github.com/apache/lucene/pull/13775#issuecomment-2346289887 > +1 to that, let's pass n=reader.numDocs()? Sure, I made that change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce ProfilerCollectorManager [lucene]
javanna merged PR #13746: URL: https://github.com/apache/lucene/pull/13746 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]
jpountz commented on PR #13773: URL: https://github.com/apache/lucene/pull/13773#issuecomment-2346309641 TopScoreDocCollectorManager is supposed to return the top-N results that have the best score, tie-broken by doc ID. So (docID=0, score=1) compares better than (docID=155, score=1) and should appear first in the resulting top docs. > I think that we are still early terminating though or you mean that we shouldn't have as there were docs with same score but lower doc ids? Right. Looking at the code, it looks like the problem is here: https://github.com/apache/lucene/blob/ff8b81afc59dc09ea4dfdd02a58398f08fcad8a8/lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java#L260. If a segment has two partitions, then its leaf collectors will collect hits with the same "docBase", so the tie-breaking logic will not know that hits from the first partition should compare better than hits from the second partition. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]
jpountz commented on code in PR #13775: URL: https://github.com/apache/lucene/pull/13775#discussion_r1756898252 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/LargeNumHitsTopDocsCollector.java: ## @@ -137,7 +137,7 @@ protected void populateResults(ScoreDoc[] results, int howMany) { } // Total number of hits collected were less than requestedHitCount -assert totalHits < requestedHitCount; +assert totalHits <= requestedHitCount; Review Comment: I believe you're right. And then a few lines above it should be `assert totalHits > requestedHitCount;`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Further reduce the search concurrency overhead. [lucene]
javanna commented on code in PR #13606: URL: https://github.com/apache/lucene/pull/13606#discussion_r1756897923 ## lucene/core/src/java/org/apache/lucene/search/HitsThresholdChecker.java: ## @@ -24,6 +24,10 @@ abstract class HitsThresholdChecker { /** Implementation of HitsThresholdChecker which allows global hit counting */ private static class GlobalHitsThresholdChecker extends HitsThresholdChecker { private final LongAdder globalHitCount = new LongAdder(); +// Cache whether the threshold has been reached already. It is not volatile or synchronized on +// purpose to contain the overhead of reading the value similarly to what String#hashCode() +// does. This does not affect correctness. +private boolean thresholdReached = false; Review Comment: I bumped into this only now, and I was wondering: this is mutable state accessed concurrently by multiple threads? If so can it be a primitive boolean without any thread-safety measure? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]
javanna closed pull request #13773: Fix TestPrefixRandom failures when intra-segment concurrency is enabled URL: https://github.com/apache/lucene/pull/13773 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]
javanna commented on PR #13773: URL: https://github.com/apache/lucene/pull/13773#issuecomment-2346327496 Understood, thanks for the explanation, I did not realize doc ids played a role, I thought it was an incorrect assumption and it worked with inter-segment concurrency only by coincidence. I will go ahead and close this PR, and see if I can fix the production code instead of adapting the test. There's other recent failures around the same problem. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]
javanna commented on code in PR #13775: URL: https://github.com/apache/lucene/pull/13775#discussion_r1756915409 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/LargeNumHitsTopDocsCollector.java: ## @@ -137,7 +137,7 @@ protected void populateResults(ScoreDoc[] results, int howMany) { } // Total number of hits collected were less than requestedHitCount -assert totalHits < requestedHitCount; +assert totalHits <= requestedHitCount; Review Comment: good catch! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Further reduce the search concurrency overhead. [lucene]
jpountz commented on code in PR #13606: URL: https://github.com/apache/lucene/pull/13606#discussion_r1756914836 ## lucene/core/src/java/org/apache/lucene/search/HitsThresholdChecker.java: ## @@ -24,6 +24,10 @@ abstract class HitsThresholdChecker { /** Implementation of HitsThresholdChecker which allows global hit counting */ private static class GlobalHitsThresholdChecker extends HitsThresholdChecker { private final LongAdder globalHitCount = new LongAdder(); +// Cache whether the threshold has been reached already. It is not volatile or synchronized on +// purpose to contain the overhead of reading the value similarly to what String#hashCode() +// does. This does not affect correctness. +private boolean thresholdReached = false; Review Comment: Indeed it may be accessed by multiple threads, but it is still safe, we may just have multiple threads independently set it to the same value. If the threshold isn't reached, then the value may only be `false`. Vice-versa if we read a `true`, the the threshold is guaranteed to have been reached. It is possible to read a `false` while the threshold has been reached (e.g. if a write in another thread hasn't been made visible to the current thread), but then this thread will take care of setting the boolean to true, same value that has been set by the other thread. This SO post tries to answer the same question for String#hashCode: https://stackoverflow.com/questions/41704185/is-javas-string-hashcode-function-thread-safe-if-its-cache-setter-does-not-us -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestShapeDocValues.testLatLonPolygonBBox [lucene]
msokolov commented on PR #13743: URL: https://github.com/apache/lucene/pull/13743#issuecomment-2346349295 Should we have an epsilon here rather than exact comparison with 0? Usually that's the way of things with floating point checks - we would check |a-b| < epsilon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]
javanna merged PR #13775: URL: https://github.com/apache/lucene/pull/13775 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]
cpoerschke commented on PR #13757: URL: https://github.com/apache/lucene/pull/13757#issuecomment-2346451914 > I took a stab at improving the docs in [47d4fa0](https://github.com/apache/lucene/commit/47d4fa02f3c52fffc5aac9328a1de9c0e640ff27) Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Further reduce the search concurrency overhead. [lucene]
javanna commented on code in PR #13606: URL: https://github.com/apache/lucene/pull/13606#discussion_r1757003986 ## lucene/core/src/java/org/apache/lucene/search/HitsThresholdChecker.java: ## @@ -24,6 +24,10 @@ abstract class HitsThresholdChecker { /** Implementation of HitsThresholdChecker which allows global hit counting */ private static class GlobalHitsThresholdChecker extends HitsThresholdChecker { private final LongAdder globalHitCount = new LongAdder(); +// Cache whether the threshold has been reached already. It is not volatile or synchronized on +// purpose to contain the overhead of reading the value similarly to what String#hashCode() +// does. This does not affect correctness. +private boolean thresholdReached = false; Review Comment: sounds good, thanks for clarifying -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Make HnswLock and LockedRow final [lucene]
ChrisHegarty opened a new pull request, #13776: URL: https://github.com/apache/lucene/pull/13776 Trivial commit that makes HnswLock final and LockedRow a record. This is general clean up and helps the JIT a little when reasoning about these types - which show quite a bit in indexing and search profiles. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestShapeDocValues.testLatLonPolygonBBox [lucene]
shubhamvishu commented on PR #13743: URL: https://github.com/apache/lucene/pull/13743#issuecomment-2346624909 Yes, usually we do that but in this case there is a catch. This test ONLY fails when the generated points are non collinear(as expected) but the `area == 0` ([checked in this condition](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/geo/Tessellator.java#L1320-L1327)), which makes it throw the error of `at least three non-collinear points required`. If the `area > 0` then this test runs as expected without any issues. So with comparison to 0 we would rightly avoid only those cases where it actually trips but, epsilon might cause it to simply skip more cases where test could have run successfully. Moreover, as the generated coordinates have extremely small absolute values and close distance (eg : SomeNumE-84, SomeNumE-240 etc), its tricky to find some sane epsilon in this case (which would only skip running the test unnecessarily). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestShapeDocValues.testLatLonPolygonBBox [lucene]
stefanvodita commented on code in PR #13743: URL: https://github.com/apache/lucene/pull/13743#discussion_r1757126653 ## lucene/core/src/test/org/apache/lucene/document/TestShapeDocValues.java: ## @@ -255,4 +257,9 @@ private List getTessellation(XYPolygon p) { } return tess; } + + /** Compute signed area of rectangle */ + private static double area(Polygon p) { +return (p.maxLon - p.minLon) * (p.maxLat - p.minLat); Review Comment: Isn't it simpler to check the min/max for equality? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add unit-of-least-precision float comparison [lucene]
stefanvodita commented on PR #13723: URL: https://github.com/apache/lucene/pull/13723#issuecomment-2346657253 Thank you for the feedback! I've added a comparison method for doubles and a test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]
javanna commented on PR #13773: URL: https://github.com/apache/lucene/pull/13773#issuecomment-2346773435 I opened #13777 . It needs some proper review because I am not entirely sure what side effects it may have. There is currently no direct link between leaf collector and scorers, to pass along the minDocId that we could have used somehow as docBase in the maxScoreAccumulator. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestShapeDocValues.testLatLonPolygonBBox [lucene]
shubhamvishu commented on code in PR #13743: URL: https://github.com/apache/lucene/pull/13743#discussion_r1757244198 ## lucene/core/src/test/org/apache/lucene/document/TestShapeDocValues.java: ## @@ -255,4 +257,9 @@ private List getTessellation(XYPolygon p) { } return tess; } + + /** Compute signed area of rectangle */ + private static double area(Polygon p) { +return (p.maxLon - p.minLon) * (p.maxLat - p.minLat); Review Comment: Nope, its possible that both the parts of multiplication are `> 0`(non zero) but the result/area is 0 if the result goes beyond the double range(Eg: (`1E-240 * 1E-78` is `0.0`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix TestShapeDocValues.testLatLonPolygonBBox [lucene]
shubhamvishu commented on code in PR #13743: URL: https://github.com/apache/lucene/pull/13743#discussion_r1757244198 ## lucene/core/src/test/org/apache/lucene/document/TestShapeDocValues.java: ## @@ -255,4 +257,9 @@ private List getTessellation(XYPolygon p) { } return tess; } + + /** Compute signed area of rectangle */ + private static double area(Polygon p) { +return (p.maxLon - p.minLon) * (p.maxLat - p.minLat); Review Comment: Nope, its possible that both the parts of multiplication are `> 0`(non zero) but the result/area is 0 if the result goes beyond the double range(Eg: (`1E-250 * 1E-78` is `0.0`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]
javanna commented on PR #13777: URL: https://github.com/apache/lucene/pull/13777#issuecomment-2346813650 I am not getting many test failures, which is surprising, because I think that this is entirely the wrong fix, in that it defeats the early termination purpose of top score docs collector. I don't have good ideas on how to fix this properly so far, without API changes, in that leaf collector don't have a way to get notified that the current segment is seen as part of a segment partition. It would be enough to get the min doc id of the partition, and use it as doc base in the accumulator, but it's hard to get it there with the current API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Can we remove `compress` option for quantized KNN vector indexing? [lucene]
benwtrent commented on issue #13768: URL: https://github.com/apache/lucene/issues/13768#issuecomment-2346877986 If we are ok with the perf hit on non-panama, I am cool with it :). It will definitely simplify the code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add unit-of-least-precision float comparison [lucene]
uschindler commented on PR #13723: URL: https://github.com/apache/lucene/pull/13723#issuecomment-2346992406 I don't like the last commit because it changes from a assert-like method to a boolean returning method. Could we not keep the previous method signature and still add a test? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2346995734 > Is "default run" from this PR? No. "default run" is knn search where each embedding is a separate document with no relationship between them. I'm still wiring things up to see benchmark results for this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] First-class random access API for KnnVectorValues [lucene]
msokolov opened a new pull request, #13779: URL: https://github.com/apache/lucene/pull/13779 addresses https://github.com/apache/lucene/issues/13778 Key things in this PR: 1. Introduces abstract `KnnVectorValues` from which `ByteVectorValues` and `FloatVectorValues` derive; 2. Folds `RandomAccessVectorValues` into `KnnVectorValues` thus eliminating some casts. 3. `RandomAccessVectorValues.Floats` becomes `FloatVectorValues` and `RandomAccessVectorValues.Bytes` becomes `ByteVectorValues`. `RandomAccessQuantizedByteVectorValues` folded into `QuantizedByteVectorValues`. 4. `IndexInput getSlice()` moved to a new `HasIndexSlice` interface. 5. Introduces `VectorEncoding KnnVectorValues.getEncoding()` to enable type-specific branches in a few places where we are dealing with abstract `KnnVectorValues` (tests only IIRC). Also used that to provide a default `getVectorByteLength()`. 6. `KnnVectorValues` no longer extends `DocIdSetIterator`; rather it provides a tightly-coupled `iterator()`. This enables refactoring common iteration patterns that were repeated many times in the code base. This new iterator, `DocIndexIterator` provides an additional method `index()` analogous to `IndexedDISI`. Some of the methods on `KnnVectorValues` have default impls that throw `UnsupportedOperationException` enabling subclasses to provide partial implementations and relying on testing to catch missing required methods. I'd like feedback on this. Should we provide implementations we never use, just to make these classes complete? That didn't make sense to me. But the previous alternative of attempting to provide strict adherence to declarative contracts was becoming in my view, overly restrictive and leading to hard-to-maintain code. Some of these readers would only ever be used iteratively. Random access is required for search, but not used when merging the values themselves, and when we merge we do search, but using a temporary file so that searching is always done over a file-based value. Random access also gets used during merging when the index is sorted, again this is provided by specialized readers, so not every reader needs to implement random access. But the API maintenance is greatly simplified if we allow partial implementation. Anyway that is the idea I am trying out here. Can we live with a little less API purity and gain some simplicity and less boilerplate? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]
vigyasharma commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1757395276 ## lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarityFunction.java: ## @@ -0,0 +1,203 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.util.ArrayList; +import java.util.List; +import org.apache.lucene.util.ArrayUtil; + +/** + * Multi-vector similarity function; used in search to return top K most similar multi-vectors to a + * target multi-vector. This method is used during indexing and searching of the multi-vectors in + * order to determine the nearest neighbors. + */ +// no commit +public class MultiVectorSimilarityFunction implements MultiVectorSimilarity { + + /** Aggregation function to combine similarity across multiple vector values */ + public enum Aggregation { +/** Placeholder aggregation that is not intended to be used. */ +NONE { + @Override + public float aggregate( + float[] outer, + float[] inner, + VectorSimilarityFunction vectorSimilarityFunction, + int dimension) { +throw new UnsupportedOperationException(); + } + + @Override + public float aggregate( + byte[] outer, + byte[] inner, + VectorSimilarityFunction vectorSimilarityFunction, + int dimension) { +throw new UnsupportedOperationException(); + } +}, + +/** + * SumMaxSimilarity between two multi-vectors. Aggregates using the sum of maximum similarity + * found for each vector in the first multi-vector against all vectors in the second + * multi-vector. + */ +SUM_MAX { + @Override + public float aggregate( + float[] outer, + float[] inner, + VectorSimilarityFunction vectorSimilarityFunction, + int dimension) { +if (outer.length % dimension != 0 || inner.length % dimension != 0) { + throw new IllegalArgumentException("Multi vectors do not match provided dimensions"); +} +// TODO: can we avoid making vector copies? +List outerList = new ArrayList<>(); +List innerList = new ArrayList<>(); +for (int i = 0; i <= outer.length; i += dimension) { + outerList.add(ArrayUtil.copyOfSubArray(outer, i, dimension)); +} +for (int i = 0; i <= inner.length; i += dimension) { + innerList.add(ArrayUtil.copyOfSubArray(inner, i, dimension)); +} Review Comment: yes, that's what I had in mind. It'll be slower right now with all the extra allocation but we can change it later by making our sim. fn impl (probably in the vectorization provider?) work with offsets and lengths. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on PR #13779: URL: https://github.com/apache/lucene/pull/13779#issuecomment-2347012727 another concern I have is how this would impact ongoing work to enable multiple vectors per doc/field. There would almost certainly be conflicts with that PR on the surface, but I hope this could actually simplify things in that the new `DocIndexIterator` class could be enhanced / extended to provide access to a series of values (maybe a list or array?) instead of (or in addition to?) a single one, possibly centralizing the required changes (since we have many fewer iterator implementations after this change). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]
jpountz commented on code in PR #13777: URL: https://github.com/apache/lucene/pull/13777#discussion_r1757465545 ## lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java: ## @@ -254,10 +254,9 @@ protected void updateMinCompetitiveScore(Scorable scorer) throws IOException { totalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO; minCompetitiveScore = localMinScore; if (minScoreAcc != null) { - // we don't use the next float but we register the document - // id so that other leaves can require it if they are after - // the current maximum - minScoreAcc.accumulate(docBase, pqTop.score); + // we don't use the next float but we register the document id so that other leaves or + // leaf partitions can require it if they are after the current maximum + minScoreAcc.accumulate(docBase + pqTop.doc, pqTop.score); Review Comment: pqTop.doc should already be a top-level doc ID, so adding `docBase`shouldn't be needed? ## lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java: ## @@ -232,7 +232,7 @@ protected void updateGlobalMinCompetitiveScore(Scorable scorer) throws IOExcepti // the next float if the global minimum score is set on a document id that is // smaller than the ids in the current leaf float score = - docBase >= maxMinScore.docBase() ? Math.nextUp(maxMinScore.score()) : maxMinScore.score(); + docBase >= maxMinScore.docId() ? Math.nextUp(maxMinScore.score()) : maxMinScore.score(); Review Comment: In theory, we could replace this docBase with the current doc ID as well (not important for correctness, but could help skip more docs when some segments have multiple partitions). I'm good to make this change in a follow-up PR instead of this one if it helps. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]
jpountz commented on PR #13777: URL: https://github.com/apache/lucene/pull/13777#issuecomment-2347079637 > I think that this is entirely the wrong fix Why is it the wrong fix? This looks correct to me: we want to consider docs with equal or greater scores if the doc ID is less than the top-n-th hit so far, but only docs with a greater doc ID otherwise? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Remove IndexSearcher#search(List, Weight, Collector) [lucene]
javanna opened a new pull request, #13780: URL: https://github.com/apache/lucene/pull/13780 With the introduction of intra-segment concurrency, we have introduced a new protected search(LeafReaderContextPartition, Weight, Collector) method. The previous variant that accepts a list of leaf reader contexts was left deprecated as there is one leftover usages coming from search(Query, Collector). The hope was that the latter was going to be removed soon as well, but there is actually no need to tie the two removals. It is easier to fold this method into its only caller, in order for it to still bypass the collector manager based methods. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]
javanna commented on code in PR #13777: URL: https://github.com/apache/lucene/pull/13777#discussion_r1757481284 ## lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java: ## @@ -254,10 +254,9 @@ protected void updateMinCompetitiveScore(Scorable scorer) throws IOException { totalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO; minCompetitiveScore = localMinScore; if (minScoreAcc != null) { - // we don't use the next float but we register the document - // id so that other leaves can require it if they are after - // the current maximum - minScoreAcc.accumulate(docBase, pqTop.score); + // we don't use the next float but we register the document id so that other leaves or + // leaf partitions can require it if they are after the current maximum + minScoreAcc.accumulate(docBase + pqTop.doc, pqTop.score); Review Comment: ah good point! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]
javanna commented on PR #13777: URL: https://github.com/apache/lucene/pull/13777#issuecomment-2347094763 > Why is it the wrong fix? Happy to be proven wrong, especially when I believe that my fix is wrong. ahah I was under the impression that we may no longer early terminate if we replace the docBase with the docId. A test failure on `TestConstantScoreScorer` seems to show that pretty clearly. I also thought that this is too easy to be true especially if it does not have side effects :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Migrate more classes to records. [lucene]
jpountz merged PR #13772: URL: https://github.com/apache/lucene/pull/13772 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make HnswLock and LockedRow final [lucene]
ChrisHegarty merged PR #13776: URL: https://github.com/apache/lucene/pull/13776 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
benwtrent commented on PR #13779: URL: https://github.com/apache/lucene/pull/13779#issuecomment-2347102278 > but I hope this could actually simplify things That is my intuition as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]
javanna commented on PR #13777: URL: https://github.com/apache/lucene/pull/13777#issuecomment-2347103042 > A test failure on TestConstantScoreScorer seems to show that pretty clearly. Ok, that was caused by using the docBase when already included in the doc id. Phew. This looks to be right after all! Sorry for the back and forth and thanks for the help. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]
javanna commented on code in PR #13777: URL: https://github.com/apache/lucene/pull/13777#discussion_r1757494555 ## lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java: ## @@ -232,7 +232,7 @@ protected void updateGlobalMinCompetitiveScore(Scorable scorer) throws IOExcepti // the next float if the global minimum score is set on a document id that is // smaller than the ids in the current leaf float score = - docBase >= maxMinScore.docBase() ? Math.nextUp(maxMinScore.score()) : maxMinScore.score(); + docBase >= maxMinScore.docId() ? Math.nextUp(maxMinScore.score()) : maxMinScore.score(); Review Comment: I see what you mean , I gave it a try, but then I reverted it, I prefer to think about testing and do it as a follow-up -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]
javanna commented on PR #13777: URL: https://github.com/apache/lucene/pull/13777#issuecomment-2347172996 This should be good to go now :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
jpountz commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1757529511 ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; + + /** Returns the vector byte length, defaults to dimension multiplied by float byte size */ + public int getVectorByteLength() { +return dimension() * getEncoding().byteSize; + } + + /** The vector encoding of these values. */ + public abstract VectorEncoding getEncoding(); + + /** Returns a Bits accepting docs accepted by the argument and having a vector value */ + public Bits getAcceptOrds(Bits acceptDocs) { +// FIXME: change default to return acceptDocs and provide this impl +// somewhere more specialized (in every non-dense impl). +if (acceptDocs == null) { + return null; +} +return new Bits() { + @Override + public boolean get(int index) { +return acceptDocs.get(ordToDoc(index)); + } + + @Override + public int length() { +return size(); + } +}; + } + + /** + * Return the iterator for this instance. If you need multiple iterators, call + * this.copy().iterator(). + */ + public DocIndexIterator iterator() { +if (iterator == null) { + iterator = createIterator(); +} +return iterator; + } + + /** + * Create an iterator for this instance; typically called once by iterator(). Wrapper + * value classes delegate to their inner instance's iterator and shouldn't implement this. + */ + protected DocIndexIterator createIterator() { +throw new UnsupportedOperationException(); + } + + /** + * A DocIdSetIterator that also provides an index() method tracking a distinct ordinal for a + * vector associated with each doc. + */ + public abstract static class DocIndexIterator extends DocIdSetIterator { + +/** return the value index (aka "ordinal" or "ord") corresponding to the current doc */ +public abstract int index(); + +@Override +public int advance(int target) throws IOException { + return slowAdvance(target); +} + +@Override +public long cost() { + throw new UnsupportedOperationException(); +} + +/** + * Returns an iterator that delegates to the IndexedDISI. Advancing this iterator will advance + * the underlying IndexedDISI, and vice-versa. + */ +public static DocIndexIterator fromIndexedDISI(IndexedDISI disi) { + // can we replace with fromDISI? + return new DocIndexIterator() { +@Override +public int docID() { + return disi.docID(); +} + +@Override +public int index() { + return disi.index(); +}
Re: [PR] First-class random access API for KnnVectorValues [lucene]
jpountz commented on PR #13779: URL: https://github.com/apache/lucene/pull/13779#issuecomment-2347207782 Am guessing correctly that you're targeting 10.0 for this change? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1757568397 ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; + + /** Returns the vector byte length, defaults to dimension multiplied by float byte size */ + public int getVectorByteLength() { Review Comment: I think it can possibly vary due to quantization --- but you may be right and it refers only to the original size? Will have to confirm, or maybe @benwtrent can comment -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on PR #13779: URL: https://github.com/apache/lucene/pull/13779#issuecomment-2347232511 Thanks for the quick review! I will get started on addressing. As for timeline for this change, it would definitely be convenient to get in to 10.0 release. I think you had said 9/22 would be a feature freeze date; it seems we could possibly meet that timeline. I will be traveling starting tomorrow for a week, but I should be able to put in some quality time on the plane LOL -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1757605144 ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; + + /** Returns the vector byte length, defaults to dimension multiplied by float byte size */ + public int getVectorByteLength() { +return dimension() * getEncoding().byteSize; + } + + /** The vector encoding of these values. */ + public abstract VectorEncoding getEncoding(); + + /** Returns a Bits accepting docs accepted by the argument and having a vector value */ + public Bits getAcceptOrds(Bits acceptDocs) { +// FIXME: change default to return acceptDocs and provide this impl +// somewhere more specialized (in every non-dense impl). +if (acceptDocs == null) { + return null; +} +return new Bits() { + @Override + public boolean get(int index) { +return acceptDocs.get(ordToDoc(index)); + } + + @Override + public int length() { +return size(); + } +}; + } + + /** + * Return the iterator for this instance. If you need multiple iterators, call + * this.copy().iterator(). + */ + public DocIndexIterator iterator() { +if (iterator == null) { + iterator = createIterator(); +} +return iterator; + } + + /** + * Create an iterator for this instance; typically called once by iterator(). Wrapper + * value classes delegate to their inner instance's iterator and shouldn't implement this. + */ + protected DocIndexIterator createIterator() { +throw new UnsupportedOperationException(); + } + + /** + * A DocIdSetIterator that also provides an index() method tracking a distinct ordinal for a + * vector associated with each doc. + */ + public abstract static class DocIndexIterator extends DocIdSetIterator { + +/** return the value index (aka "ordinal" or "ord") corresponding to the current doc */ +public abstract int index(); + +@Override +public int advance(int target) throws IOException { + return slowAdvance(target); +} + +@Override +public long cost() { + throw new UnsupportedOperationException(); +} + +/** + * Returns an iterator that delegates to the IndexedDISI. Advancing this iterator will advance + * the underlying IndexedDISI, and vice-versa. + */ +public static DocIndexIterator fromIndexedDISI(IndexedDISI disi) { + // can we replace with fromDISI? + return new DocIndexIterator() { +@Override +public int docID() { + return disi.docID(); +} + +@Override +public int index() { + return disi.index(); +
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1757605492 ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; + + /** Returns the vector byte length, defaults to dimension multiplied by float byte size */ + public int getVectorByteLength() { +return dimension() * getEncoding().byteSize; + } + + /** The vector encoding of these values. */ + public abstract VectorEncoding getEncoding(); + + /** Returns a Bits accepting docs accepted by the argument and having a vector value */ + public Bits getAcceptOrds(Bits acceptDocs) { +// FIXME: change default to return acceptDocs and provide this impl +// somewhere more specialized (in every non-dense impl). +if (acceptDocs == null) { + return null; +} +return new Bits() { + @Override + public boolean get(int index) { +return acceptDocs.get(ordToDoc(index)); + } + + @Override + public int length() { +return size(); + } +}; + } + + /** + * Return the iterator for this instance. If you need multiple iterators, call + * this.copy().iterator(). + */ + public DocIndexIterator iterator() { +if (iterator == null) { + iterator = createIterator(); +} +return iterator; Review Comment: Let me try - I was also a bit unhappy about this, but at one point along this journey I was relying on being able to recover the shared state - maybe I finally was able to get rid of that and just didn't notice! ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorFiel
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1757605699 ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; + + /** Returns the vector byte length, defaults to dimension multiplied by float byte size */ + public int getVectorByteLength() { +return dimension() * getEncoding().byteSize; + } + + /** The vector encoding of these values. */ + public abstract VectorEncoding getEncoding(); + + /** Returns a Bits accepting docs accepted by the argument and having a vector value */ + public Bits getAcceptOrds(Bits acceptDocs) { +// FIXME: change default to return acceptDocs and provide this impl +// somewhere more specialized (in every non-dense impl). +if (acceptDocs == null) { + return null; +} +return new Bits() { + @Override + public boolean get(int index) { +return acceptDocs.get(ordToDoc(index)); + } + + @Override + public int length() { +return size(); + } +}; + } + + /** + * Return the iterator for this instance. If you need multiple iterators, call + * this.copy().iterator(). + */ + public DocIndexIterator iterator() { +if (iterator == null) { + iterator = createIterator(); +} +return iterator; + } + + /** + * Create an iterator for this instance; typically called once by iterator(). Wrapper + * value classes delegate to their inner instance's iterator and shouldn't implement this. + */ + protected DocIndexIterator createIterator() { +throw new UnsupportedOperationException(); + } + + /** + * A DocIdSetIterator that also provides an index() method tracking a distinct ordinal for a + * vector associated with each doc. + */ + public abstract static class DocIndexIterator extends DocIdSetIterator { + +/** return the value index (aka "ordinal" or "ord") corresponding to the current doc */ +public abstract int index(); + +@Override +public int advance(int target) throws IOException { + return slowAdvance(target); +} + +@Override +public long cost() { + throw new UnsupportedOperationException(); Review Comment: hmm I think cost() is rarely used in the vector reader/writers which instead are concerned with KnnVectorValues.size() -- they typically want to know how many vector values there are and to the extent they care about the number of docs it's only when they must iterate through all of them and have no use for an estimate. These iterators aren't really used during searching? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub
Re: [PR] Provide a custom hash implementation in HnswLock [lucene]
msokolov commented on PR #13751: URL: https://github.com/apache/lucene/pull/13751#issuecomment-2347296049 thank you @ChrisHegarty ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1757672091 ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; Review Comment: The way `SortedDocValuesTermsEnum` is, calling its `next` method will overwrite the internal buffer ofd the `SortedDocValues` on which it is built, defeating the purpose of `copy()` which is to provide two completely independent sources. Another thing we could do is to add `vectorValue(int ord, float[] scratch)` allowing the caller to provide the memory to write to. If we had that, we wouldn't need `copy()`. Maybe we could manage to squeeze that into 10.0 too, but I'd rather do it in a separate PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add unit-of-least-precision float comparison [lucene]
stefanvodita commented on PR #13723: URL: https://github.com/apache/lucene/pull/13723#issuecomment-2347389307 I changed it away from an assertion because I liked this more. It makes it so you can assert on floats *not* being equal or use their equality in a condition, without making an assertion statement much longer. Do you not like it because it's too generic? We could move it to `TestUtil` or we could provide assertion methods alongside the equality methods, if that helps. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Use Arrays.compareUnsigned instead of iterating compare in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 opened a new pull request, #13781: URL: https://github.com/apache/lucene/pull/13781 ### Description This change similars to https://github.com/apache/lucene/pull/13252. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned instead of iterating compare in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 closed pull request #13781: Use Arrays.compareUnsigned instead of iterating compare in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. URL: https://github.com/apache/lucene/pull/13781 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 opened a new pull request, #13782: URL: https://github.com/apache/lucene/pull/13782 ### Description Description This change similars to https://github.com/apache/lucene/pull/13252. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2347950684 @uschindler Please take a look when you get a chance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
jpountz commented on code in PR #13782: URL: https://github.com/apache/lucene/pull/13782#discussion_r1758190884 ## lucene/CHANGES.txt: ## @@ -292,6 +292,8 @@ Build API Changes - +* GITHUB#13782: Replace handwritten loops compare with Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. (zhouhui) Review Comment: We usually put under "API changes" things that require users to change their code upon upgrading, which is not the case here. Can you move it under "Optimizations" instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
jpountz commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1758191856 ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; + + /** Returns the vector byte length, defaults to dimension multiplied by float byte size */ + public int getVectorByteLength() { +return dimension() * getEncoding().byteSize; + } + + /** The vector encoding of these values. */ + public abstract VectorEncoding getEncoding(); + + /** Returns a Bits accepting docs accepted by the argument and having a vector value */ + public Bits getAcceptOrds(Bits acceptDocs) { +// FIXME: change default to return acceptDocs and provide this impl +// somewhere more specialized (in every non-dense impl). +if (acceptDocs == null) { + return null; +} +return new Bits() { + @Override + public boolean get(int index) { +return acceptDocs.get(ordToDoc(index)); + } + + @Override + public int length() { +return size(); + } +}; + } + + /** + * Return the iterator for this instance. If you need multiple iterators, call + * this.copy().iterator(). + */ + public DocIndexIterator iterator() { +if (iterator == null) { + iterator = createIterator(); +} +return iterator; + } + + /** + * Create an iterator for this instance; typically called once by iterator(). Wrapper + * value classes delegate to their inner instance's iterator and shouldn't implement this. + */ + protected DocIndexIterator createIterator() { +throw new UnsupportedOperationException(); + } + + /** + * A DocIdSetIterator that also provides an index() method tracking a distinct ordinal for a + * vector associated with each doc. + */ + public abstract static class DocIndexIterator extends DocIdSetIterator { + +/** return the value index (aka "ordinal" or "ord") corresponding to the current doc */ +public abstract int index(); + +@Override +public int advance(int target) throws IOException { + return slowAdvance(target); +} + +@Override +public long cost() { + throw new UnsupportedOperationException(); +} + +/** + * Returns an iterator that delegates to the IndexedDISI. Advancing this iterator will advance + * the underlying IndexedDISI, and vice-versa. + */ +public static DocIndexIterator fromIndexedDISI(IndexedDISI disi) { + // can we replace with fromDISI? + return new DocIndexIterator() { +@Override +public int docID() { + return disi.docID(); +} + +@Override +public int index() { + return disi.index(); +}
Re: [PR] First-class random access API for KnnVectorValues [lucene]
jpountz commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1758192953 ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; Review Comment: But if you call SortedDocValues#termsEnum twice, this would give you two independent sources of terms? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add prefetching support to term vectors. [lucene]
jpountz merged PR #13758: URL: https://github.com/apache/lucene/pull/13758 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
jpountz commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1758220291 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java: ## @@ -361,33 +385,46 @@ private MergedByteVectorValues(List subs, MergeState mergeS totalSize += sub.values.size(); } size = totalSize; -docId = -1; } @Override - public byte[] vectorValue() throws IOException { -return current.values.vectorValue(); + public byte[] vectorValue(int ord) throws IOException { +return current.values.vectorValue(current.values.iterator().index()); Review Comment: This part feels a bit hacky, could we instead merge the ord->vector mappings of the various vector values by concatenating them? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 commented on code in PR #13782: URL: https://github.com/apache/lucene/pull/13782#discussion_r1758230120 ## lucene/CHANGES.txt: ## @@ -292,6 +292,8 @@ Build API Changes - +* GITHUB#13782: Replace handwritten loops compare with Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. (zhouhui) Review Comment: Thanks @jpountz . That is a mistake, I fixed it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
vsop-479 commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348109781 @jpountz I think this check failure relats to https://github.com/apache/lucene/commit/5045d3c67b18d23c534a32cee1d95827b0b7c482 . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]
jpountz commented on PR #13782: URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348146820 Indeed it likely does, can you merge main back into your branch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]
javanna merged PR #13777: URL: https://github.com/apache/lucene/pull/13777 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org