Re: [PR] Replace Map with IntObjectHashMap for KnnVectorsReader [lucene]

2024-09-12 Thread via GitHub


bugmakerr commented on code in PR #13763:
URL: https://github.com/apache/lucene/pull/13763#discussion_r1756317165


##
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
##
@@ -239,51 +245,69 @@ public FieldsReader(final SegmentReadState readState) 
throws IOException {
  * @param field the name of a numeric vector field
  */
 public KnnVectorsReader getFieldReader(String field) {
-  return fields.get(field);
+  final FieldInfo info = fieldInfos.fieldInfo(field);
+  if (info == null) {
+return null;
+  }
+  return fields.get(info.number);
 }
 
 @Override
 public void checkIntegrity() throws IOException {
-  for (KnnVectorsReader reader : fields.values()) {
-reader.checkIntegrity();
+  for (ObjectCursor cursor : fields.values()) {
+cursor.value.checkIntegrity();
   }
 }
 
 @Override
 public FloatVectorValues getFloatVectorValues(String field) throws 
IOException {
-  KnnVectorsReader knnVectorsReader = fields.get(field);
-  if (knnVectorsReader == null) {
+  final FieldInfo info = fieldInfos.fieldInfo(field);
+  KnnVectorsReader reader;
+  if (info == null || (reader = fields.get(info.number)) == null) {
 return null;

Review Comment:
   forgive me if i misunderstand something, but the javadoc says that **the 
return value is never null**, and we do return null in 
`PerFieldKnnVectorsFormat`. If we check the field info in caller, 
`AssertingKnnVectorsReader` will always get the non-null value from the 
delegate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2024-09-12 Thread via GitHub


javanna commented on code in PR #13747:
URL: https://github.com/apache/lucene/pull/13747#discussion_r1756305014


##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   I wonder if we need to introduce these new abstractions: MergeableCollector 
and MergeableCollectorManager. I can see what you are up to, many times we use 
Collector themselves to reduce multiple collectors into. I am not a big fan of 
that design though, it seems like a quick way to move to using collector 
managers, but ideally what reduce and hence the search method returns isn't a 
collector, and we would not use collectors to retrieve state after search is 
executed.
   
   How much more complicated would the code be without these new concepts, 
writing the collector managers that are needed and adding that logic to reduce? 
Would that cause a lot of code duplication?



##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   Another take at this, maybe `merge` does not fit that well in a `Collector` 
design but we could have a base collector manager impl that exposes the same 
merge method. After all, I think merging should be a collector manager's 
concern? Maybe we would even call it reduce:
   
   `protected C reduce(C collector);`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Can `gradle tidy` reflow text properly when it inserts newlines? [lucene]

2024-09-12 Thread via GitHub


dweiss commented on issue #13766:
URL: https://github.com/apache/lucene/issues/13766#issuecomment-2345552152

   So I just double checked and block-level comments are treated as 
preformatted within code. This:
   ```
   /*
* fba sdifhopi hweifh swehf doish dsopih ewfpoih efwspoihb wepihb apin 
wef. spedfih [wefph wepfih wqefdpoi qp[hjfaph we.
* bar bar
*/
   
   // fba sdifhopi hweifh swehf doish dsopih ewfpoih efwspoihb wepihb apin 
wef. spedfih [wefph wepfih wqefdpoi qp[hjfaph we.
   // bar bar
   ```
   is tidied up to this:
   ```
   /*
* fba sdifhopi hweifh swehf doish dsopih ewfpoih efwspoihb wepihb apin 
wef. spedfih [wefph wepfih wqefdpoi qp[hjfaph we.
* bar bar
*/
   
   // fba sdifhopi hweifh swehf doish dsopih ewfpoih efwspoihb wepihb apin 
wef. spedfih [wefph
   // wepfih wqefdpoi qp[hjfaph we.
   // bar bar
   ```
   
   I personally think this is the right choice by gjf - breaks long lines into 
visibility but does not break any other line comments (and potentially damage 
them). Block comments ```/* */``` can be used to preserve any formatting that 
should not be touched at all.
   
   I think this issue can be closed, sorry if it bothers you, but it's an 
operator's error. :)
   
   There are more headaches coming with the markdown comments spec [1]. It'll 
be fun.
   
   [1] https://openjdk.org/jeps/467


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2024-09-12 Thread via GitHub


msfroh commented on code in PR #13747:
URL: https://github.com/apache/lucene/pull/13747#discussion_r1756363078


##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   Makes sense! 
   
   Given that the merge logic is collector-specific, I'm thinking that 
collectors should expose what they produce after reduction.
   
   This also makes it easier to push the N-way merge into the collectors.
   
   Maybe the resulting CollectorManager might be more reusable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2024-09-12 Thread via GitHub


msfroh commented on code in PR #13747:
URL: https://github.com/apache/lucene/pull/13747#discussion_r1756378484


##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   Actually, maybe we're disagreeing? I think you're suggesting that the 
CollectorManager should handle merging, while I'm suggesting that Collector 
should have some output and know how to merge multiple outputs.
   
   This would mean that every collector needs a corresponding CollectorManager, 
or at least a CollectorManager for every output type, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2024-09-12 Thread via GitHub


msfroh commented on code in PR #13747:
URL: https://github.com/apache/lucene/pull/13747#discussion_r1756385824


##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   Hmm... While it may be a bigger change, it feels like we'll we should really 
have just one CollectorManager that merges results from Collecors based on what 
the Collectors collect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2024-09-12 Thread via GitHub


msfroh commented on code in PR #13747:
URL: https://github.com/apache/lucene/pull/13747#discussion_r1756385824


##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   Hmm... While it may be a bigger change, it feels like we should really have 
just one CollectorManager that merges results from Collectors based on what the 
Collectors collect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2024-09-12 Thread via GitHub


msfroh commented on code in PR #13747:
URL: https://github.com/apache/lucene/pull/13747#discussion_r1756429132


##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   I can reimplent my follow-up PR to do three separate CollectorManagers that 
handle the merge logic, I suppose.
   
   That would be more consistent with the existing Lucene code, but I'm not 
sure that I agree with that approach.
   
   It essentially means that every Collector needs a parallel CollectorManager 
that knows how to merge that Collector. For something like these join 
Collectors, it also means that the CollectorManager needs to know how to 
implement the ScoreMode (min, max, sum, or avg).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2024-09-12 Thread via GitHub


msfroh commented on code in PR #13747:
URL: https://github.com/apache/lucene/pull/13747#discussion_r1756429132


##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   I can reimplement my follow-up PR to do three separate CollectorManagers 
that handle the merge logic, I suppose.
   
   That would be more consistent with the existing Lucene code, but I'm not 
sure that I agree with that approach.
   
   It essentially means that every Collector needs a parallel CollectorManager 
that knows how to merge that Collector. For something like these join 
Collectors, it also means that the CollectorManager needs to know how to 
implement the ScoreMode (min, max, sum, or avg).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]

2024-09-12 Thread via GitHub


cpoerschke commented on code in PR #13757:
URL: https://github.com/apache/lucene/pull/13757#discussion_r1756534635


##
lucene/core/src/java/org/apache/lucene/search/similarities/DFRSimilarity.java:
##
@@ -94,6 +94,26 @@ public class DFRSimilarity extends SimilarityBase {
*/
   public DFRSimilarity(
   BasicModel basicModel, AfterEffect afterEffect, Normalization 
normalization) {
+this(basicModel, afterEffect, normalization, true);
+  }
+
+  /**
+   * Creates DFRSimilarity from the three components.
+   *
+   * Note that null values are not allowed: if you want no 
normalization, instead
+   * pass {@link NoNormalization}.
+   *
+   * @param basicModel Basic model of information content
+   * @param afterEffect First normalization of information gain
+   * @param normalization Second (length) normalization
+   * @param discountOverlaps True if overlap tokens (tokens with a position of 
increment of zero)

Review Comment:
   Good catch!
   ```suggestion
  * @param discountOverlaps True if overlap tokens (tokens with a position 
of increment of zero)
  * are discounted from the document's length.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]

2024-09-12 Thread via GitHub


cpoerschke commented on code in PR #13757:
URL: https://github.com/apache/lucene/pull/13757#discussion_r1756540884


##
lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java:
##
@@ -111,7 +133,17 @@ protected Similarity() {}
* @param state current processing state for this field
* @return computed norm value
*/
-  public abstract long computeNorm(FieldInvertState state);
+  public long computeNorm(FieldInvertState state) {

Review Comment:
   I don't feel knowledgeable enough to properly document/edit on this.
   
   _"Allow edits and access to secrets by maintainers"_ is enabled for the pull 
request -- please feel free to (in-browser or otherwise) change the 
documentation here. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Can we decrease the overhead of skipping? [lucene]

2024-09-12 Thread via GitHub


jpountz commented on issue #13106:
URL: https://github.com/apache/lucene/issues/13106#issuecomment-2345810711

   Fixed by https://github.com/apache/lucene/pull/13585


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Can we decrease the overhead of skipping? [lucene]

2024-09-12 Thread via GitHub


jpountz closed issue #13106: Can we decrease the overhead of skipping?
URL: https://github.com/apache/lucene/issues/13106


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]

2024-09-12 Thread via GitHub


javanna opened a new pull request, #13773:
URL: https://github.com/apache/lucene/pull/13773

   The order in which documents are processed is not a guarantee, hence we may 
return a different set of documents when early terminating the search. Those 
are not incorrect results though.
   
   I opted for increasing the number of hits so that we never early terminate 
the search for this test.
   
   I believe this test should have failed with inter-segment concurrency as 
well, and intra-segment concurrency makes it more likely to fail (although not 
so frequently).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2024-09-12 Thread via GitHub


javanna commented on code in PR #13747:
URL: https://github.com/apache/lucene/pull/13747#discussion_r1756577220


##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   > This would mean that every collector needs a corresponding 
CollectorManager, or at least a CollectorManager for every output type, right?
   
   I guess it depends, you could also reuse the same collector manager across 
collectors if they share a sub-type perhaps. But as a general guideline, yes. 
After all, search goes through collector managers and collector managers are 
responsible to create collectors and reduce multiple collectors into a single 
result, whether that be a collector or some more meaningful result class.
   
   I don't have a strong opinion, but I thought that moving your merge method 
to the collector manager should not be a big change, I may be missing some 
details perhaps.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] PayloadScoreQuery javadoc update w.r.t. SpanQuery use [lucene]

2024-09-12 Thread via GitHub


cpoerschke merged PR #13731:
URL: https://github.com/apache/lucene/pull/13731


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Address length used to copy array in FacetsCollector to not be out of bounds [lucene]

2024-09-12 Thread via GitHub


javanna merged PR #13774:
URL: https://github.com/apache/lucene/pull/13774


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]

2024-09-12 Thread via GitHub


javanna opened a new pull request, #13775:
URL: https://github.com/apache/lucene/pull/13775

   There's two tests where we use 250_000 as number of collected hits, but we 
only ever index max 2000 docs. That makes use create a priority queue of size 
250_000 for each segment partition which causes out of memory errors when the 
number of partitions is higher than a few.
   
   With this commit I propose that we lower the threshold to 2000 for those 
tests that need a high number of collected hits. The assumption that a priority 
queue is not built within the LargeNumHitsTopDocsCollector still holds so this 
change should not defeat the purpose of the tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]

2024-09-12 Thread via GitHub


javanna commented on code in PR #13775:
URL: https://github.com/apache/lucene/pull/13775#discussion_r1756658422


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/LargeNumHitsTopDocsCollector.java:
##
@@ -137,7 +137,7 @@ protected void populateResults(ScoreDoc[] results, int 
howMany) {
 }
 
 // Total number of hits collected were less than requestedHitCount
-assert totalHits < requestedHitCount;
+assert totalHits <= requestedHitCount;

Review Comment:
   this appears to be an existing bug



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Can we remove `compress` option for quantized KNN vector indexing? [lucene]

2024-09-12 Thread via GitHub


mikemccand commented on issue #13768:
URL: https://github.com/apache/lucene/issues/13768#issuecomment-2346033320

   Well, I ran 
[`knnPerfTest.py`](https://github.com/mikemccand/luceneutil/blob/f4a07ed8de36c47aacb6033a3709e236bc42aca4/src/python/knnPerfTest.py)
 on my Linux dev box (x86-64 Raptorlake i9-13900K).  This CPU has crazy number 
of flags, but NOT AVX-512:
   
   ```
   flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl 
xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 
monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 
x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 
3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow 
flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms 
invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 
xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp 
hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg 
gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize 
pconfig arch_lbr ibt flush_l1d arch_capabilities
   ```
   
   This is with Panama enabled (`INFO: Java vector incubator API enabled; uses 
preferredBitSize=256; FMA enabled`).  I'll try with Panama disabled next.
   
   Results:
   
   ```
   recall  latency (ms) nDoc  topK  fanout  maxConn  beamWidth  quantized  
index s  force merge s  num segments  index size (MB)
0.319 0.167  15010   6   32 50 4 bits   
 86.01  64.49 1  5013.19
0.326 0.187  15010   6   32 50 4 bits   
 89.03  74.99 1  5562.51
   ```
   
   Unfortunately the output doesn't state it, but the first row is 
`compress=True` and 2nd is `compress=False`.  Indeed, latency (search time) got 
faster (187 usec -> 167 usec) with `compress=True`, and this is quite a 
reduction (~10% ish) in index size.  Indexing and force merge time did get a 
bit slower ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]

2024-09-12 Thread via GitHub


jpountz commented on PR #13773:
URL: https://github.com/apache/lucene/pull/13773#issuecomment-2346043431

   > Those are not incorrect results though.
   
   I think that these are incorrect results. Inter-segment concurrency has a 
mechanism so that if the top-N-th hit has score S and doc ID D, then it would 
require a score of at least Math.nextUp(S) for doc IDs > D and S for doc IDs < 
D (relying on tie-breaking by doc ID). It looks like this mechanism no longer 
works with intra-segment concurrency?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]

2024-09-12 Thread via GitHub


jpountz commented on PR #13775:
URL: https://github.com/apache/lucene/pull/13775#issuecomment-2346049667

   A property of this collector is that it's supposed to allocate RAM in order 
of the actual number of collected hits rather than the max number of hits to 
retrieve. So this change defeats a bit the prupose of the test. I would rather 
update the test to contain the total number of slices.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Can we remove `compress` option for quantized KNN vector indexing? [lucene]

2024-09-12 Thread via GitHub


mikemccand commented on issue #13768:
URL: https://github.com/apache/lucene/issues/13768#issuecomment-2346091188

   OK I disabled Panama (via temporary code change in 
`VectorizationProvider.java` -- we don't accept sysprops to disable this 
anymore right?):
   
   ```
   recall  latency (ms) nDoc  topK  fanout  maxConn  beamWidth  quantized  
index s  force merge s  num segments  index size (MB)
0.318 0.312  15010   6   32 50 4 bits   
175.21 125.64 1  5013.21
0.319 0.285  15010   6   32 50 4 bits   
174.54 124.80 1  5562.51
   ```
   
   Indeed there is some performance penalty now (285 usec -> 312 usec, ~9.5%) 
... recall also bounced around a bit, but prolly that's acceptable HNSW 
randomness noise.  And wow look how much slower indexing / force merging got 
... those SIMD instructions clearly help ;)
   
   But I don't think we should block removing `compress` option due to non-SIMD 
results?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Integrate merge-time index reordering with the intra-merge executor. [lucene]

2024-09-12 Thread via GitHub


jpountz merged PR #13289:
URL: https://github.com/apache/lucene/pull/13289


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Migrate more classes to records. [lucene]

2024-09-12 Thread via GitHub


shubhamvishu commented on code in PR #13772:
URL: https://github.com/apache/lucene/pull/13772#discussion_r1756801465


##
lucene/core/src/java/org/apache/lucene/util/TermAndVector.java:
##
@@ -24,37 +24,24 @@
  *
  * @lucene.experimental
  */
-public class TermAndVector {
-
-  private final BytesRef term;
-  private final float[] vector;
-
-  public TermAndVector(BytesRef term, float[] vector) {
-this.term = term;
-this.vector = vector;
-  }
-
-  public BytesRef getTerm() {
-return this.term;
-  }
-
-  public float[] getVector() {
-return this.vector;
-  }
+public record TermAndVector(BytesRef term, float[] vector) {
 
   public int size() {
 return vector.length;
   }
 
-  public void normalizeVector() {
+  /** Return a {@link TermAndVector} whose vector is normalized according to 
the L2 norm. */
+  public TermAndVector normalizeVector() {
 float vectorLength = 0;
 for (int i = 0; i < vector.length; i++) {
   vectorLength += vector[i] * vector[i];
 }
 vectorLength = (float) Math.sqrt(vectorLength);
+float[] newVector = new float[vector.length];
 for (int i = 0; i < vector.length; i++) {
-  vector[i] /= vectorLength;
+  newVector[i] = vector[i] / vectorLength;
 }

Review Comment:
   Could we instead just use the existing `VectorUtil#l2normalize` here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Can we remove `compress` option for quantized KNN vector indexing? [lucene]

2024-09-12 Thread via GitHub


ChrisHegarty commented on issue #13768:
URL: https://github.com/apache/lucene/issues/13768#issuecomment-2346214427

   > But I don't think we should block removing compress option due to non-SIMD 
results?
   
   I agree. At this point we're just comparing the scalar and SIMD 
implementation of the distance functions. For vector operations, we really need 
SIMD, and I think we're ok with this approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]

2024-09-12 Thread via GitHub


shubhamvishu commented on code in PR #12868:
URL: https://github.com/apache/lucene/pull/12868#discussion_r1756832910


##
lucene/core/src/java/org/apache/lucene/util/StringHelper.java:
##
@@ -209,6 +209,156 @@ public static int murmurhash3_x86_32(BytesRef bytes, int 
seed) {
 return murmurhash3_x86_32(bytes.bytes, bytes.offset, bytes.length, seed);
   }
 
+  /**
+   * Generates 128-bit hash from the byte array with the given offset, length 
and seed.
+   *
+   * The code is adopted from Apache Commons (https://commons.apache.org/proper/commons-codec/jacoco/org.apache.commons.codec.digest/MurmurHash3.java.html";>link)
+   *
+   * @param data The input byte array
+   * @param offset The first element of array
+   * @param length The length of array
+   * @param seed The initial seed value
+   * @return The 128-bit hash (2 longs)
+   */
+  public static long[] murmurhash3_x64_128(
+  final byte[] data, final int offset, final int length, final int seed) {
+// Use an unsigned 32-bit integer as the seed
+return murmurhash3_x64_128(data, offset, length, seed & 0xL);
+  }
+
+  @SuppressWarnings("fallthrough")
+  private static long[] murmurhash3_x64_128(
+  final byte[] data, final int offset, final int length, final long seed) {
+long h1 = seed;
+long h2 = seed;
+final int nblocks = length >> 4;
+
+// Constants for 128-bit variant
+final long C1 = 0x87c37b91114253d5L;
+final long C2 = 0x4cf5ad432745937fL;
+final int R1 = 31;
+final int R2 = 27;
+final int R3 = 33;
+final int M = 5;
+final int N1 = 0x52dce729;
+final int N2 = 0x38495ab5;
+
+// body
+for (int i = 0; i < nblocks; i++) {
+  final int index = offset + (i << 4);
+  long k1 = (long) BitUtil.VH_LE_LONG.get(data, index);
+  long k2 = (long) BitUtil.VH_LE_LONG.get(data, index + 8);
+
+  // mix functions for k1
+  k1 *= C1;
+  k1 = Long.rotateLeft(k1, R1);
+  k1 *= C2;
+  h1 ^= k1;
+  h1 = Long.rotateLeft(h1, R2);
+  h1 += h2;
+  h1 = h1 * M + N1;
+
+  // mix functions for k2
+  k2 *= C2;
+  k2 = Long.rotateLeft(k2, R3);
+  k2 *= C1;
+  h2 ^= k2;
+  h2 = Long.rotateLeft(h2, R1);
+  h2 += h1;
+  h2 = h2 * M + N2;
+}
+
+// tail
+long k1 = 0;
+long k2 = 0;
+final int index = offset + (nblocks << 4);
+switch (offset + length - index) {

Review Comment:
   Makes sense. I'll update it (Maybe we could do this in the [Apache commons 
impl](https://github.com/apache/commons-codec/blob/master/src/main/java/org/apache/commons/codec/digest/MurmurHash3.java)
 also)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]

2024-09-12 Thread via GitHub


shubhamvishu commented on code in PR #12868:
URL: https://github.com/apache/lucene/pull/12868#discussion_r1756836548


##
lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java:
##
@@ -150,9 +150,10 @@ private FuzzySet(FixedBitSet filter, int bloomSize, int 
hashCount) {
* @return NO or MAYBE
*/
   public ContainsResult contains(BytesRef value) {
-long hash = hashFunction.hash(value);
-int msb = (int) (hash >>> Integer.SIZE);
-int lsb = (int) hash;
+long[] hash = hashFunction.hash128(value);
+
+int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> 
Integer.SIZE) >>> 1;
+int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;

Review Comment:
   I'm fine with doing this (though I'll look into it separately why the 
smaller has values are showing better gains in the benchmark). For now, lets 
keep it as you propose.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]

2024-09-12 Thread via GitHub


rmuir commented on PR #13757:
URL: https://github.com/apache/lucene/pull/13757#issuecomment-2346252096

   I took a stab at improving the docs in 
https://github.com/apache/lucene/pull/13757/commits/47d4fa02f3c52fffc5aac9328a1de9c0e640ff27


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Migrate more classes to records. [lucene]

2024-09-12 Thread via GitHub


jpountz commented on code in PR #13772:
URL: https://github.com/apache/lucene/pull/13772#discussion_r1756843565


##
lucene/core/src/java/org/apache/lucene/util/TermAndVector.java:
##
@@ -24,37 +24,24 @@
  *
  * @lucene.experimental
  */
-public class TermAndVector {
-
-  private final BytesRef term;
-  private final float[] vector;
-
-  public TermAndVector(BytesRef term, float[] vector) {
-this.term = term;
-this.vector = vector;
-  }
-
-  public BytesRef getTerm() {
-return this.term;
-  }
-
-  public float[] getVector() {
-return this.vector;
-  }
+public record TermAndVector(BytesRef term, float[] vector) {
 
   public int size() {
 return vector.length;
   }
 
-  public void normalizeVector() {
+  /** Return a {@link TermAndVector} whose vector is normalized according to 
the L2 norm. */
+  public TermAndVector normalizeVector() {
 float vectorLength = 0;
 for (int i = 0; i < vector.length; i++) {
   vectorLength += vector[i] * vector[i];
 }
 vectorLength = (float) Math.sqrt(vectorLength);
+float[] newVector = new float[vector.length];
 for (int i = 0; i < vector.length; i++) {
-  vector[i] /= vectorLength;
+  newVector[i] = vector[i] / vectorLength;
 }

Review Comment:
   Good call



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]

2024-09-12 Thread via GitHub


javanna commented on PR #13775:
URL: https://github.com/apache/lucene/pull/13775#issuecomment-2346263866

   Thanks for clarifying!
   
   But then is it a feature that it goes out of memory now? Especially given 
that it's a side by side comparison, and what causes the OOM is the ordinary 
top docs search, and not the large num hits collector, which suggests that is 
indeed more optimized in terms of memory allocated. 
   
   I would think that if memory usage is what we want to test, we should 
probably have a different test? Or maybe it makes sense to lower the threshold 
only for the top docs collector part, which is the one that causes issues and 
we are only using to compare final results? Note that the second search does 
not even use concurrency as it uses the deprecated `search(Query, Collector)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Simplify FST return [lucene]

2024-09-12 Thread via GitHub


jpountz merged PR #13771:
URL: https://github.com/apache/lucene/pull/13771


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]

2024-09-12 Thread via GitHub


jpountz commented on PR #13775:
URL: https://github.com/apache/lucene/pull/13775#issuecomment-2346271119

   > Or maybe it makes sense to lower the threshold only for the top docs 
collector part,
   
   +1 to that, let's pass n=reader.numDocs()?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]

2024-09-12 Thread via GitHub


javanna commented on PR #13773:
URL: https://github.com/apache/lucene/pull/13773#issuecomment-2346282034

   Happy that you commented, thanks!
   
   Can you help me better understand why these are incorrect results?
   
   ```
   Error Message:
   java.lang.AssertionError: Hit 0 docnumbers don't match
   Hits length1=25 length2=25
   hit=0: doc0=1.0 shardIndex=-1,   doc155=1.0 shardIndex=-1
   hit=1: doc1=1.0 shardIndex=-1,   doc156=1.0 shardIndex=-1
   hit=2: doc2=1.0 shardIndex=-1,   doc157=1.0 shardIndex=-1
   hit=3: doc3=1.0 shardIndex=-1,   doc158=1.0 shardIndex=-1
   hit=4: doc4=1.0 shardIndex=-1,   doc159=1.0 shardIndex=-1
   hit=5: doc5=1.0 shardIndex=-1,   doc160=1.0 shardIndex=-1
   hit=6: doc6=1.0 shardIndex=-1,   doc161=1.0 shardIndex=-1
   hit=7: doc7=1.0 shardIndex=-1,   doc162=1.0 shardIndex=-1
   hit=8: doc8=1.0 shardIndex=-1,   doc163=1.0 shardIndex=-1
   hit=9: doc9=1.0 shardIndex=-1,   doc164=1.0 shardIndex=-1
   hit=10: doc10=1.0 shardIndex=-1, doc165=1.0 shardIndex=-1
   hit=11: doc11=1.0 shardIndex=-1, doc166=1.0 shardIndex=-1
   hit=12: doc12=1.0 shardIndex=-1, doc167=1.0 shardIndex=-1
   hit=13: doc13=1.0 shardIndex=-1, doc168=1.0 shardIndex=-1
   hit=14: doc14=1.0 shardIndex=-1, doc169=1.0 shardIndex=-1
   hit=15: doc15=1.0 shardIndex=-1, doc170=1.0 shardIndex=-1
   hit=16: doc16=1.0 shardIndex=-1, doc171=1.0 shardIndex=-1
   hit=17: doc17=1.0 shardIndex=-1, doc172=1.0 shardIndex=-1
   hit=18: doc18=1.0 shardIndex=-1, doc173=1.0 shardIndex=-1
   hit=19: doc19=1.0 shardIndex=-1, doc174=1.0 shardIndex=-1
   hit=20: doc20=1.0 shardIndex=-1, doc175=1.0 shardIndex=-1
   hit=21: doc21=1.0 shardIndex=-1, doc176=1.0 shardIndex=-1
   hit=22: doc22=1.0 shardIndex=-1, doc177=1.0 shardIndex=-1
   hit=23: doc23=1.0 shardIndex=-1, doc178=1.0 shardIndex=-1
   hit=24: doc24=1.0 shardIndex=-1, doc179=1.0 shardIndex=-1
   for query:field:*
   ```
   
   From the failure, I see all docs have the same score, but doc ids don't 
match. I assume that the column on the right targeted a leaf partition, which 
is why doc ids don't start from 0. I think that we are still early terminating 
though or you mean that we shouldn't have as there were docs with same score 
but lower doc ids? I do think that I am missing something though, or this test 
could have failed with inter-segment concurrency too, because it would 
otherwise rely on the first segment to be the first one being searched at all 
times?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]

2024-09-12 Thread via GitHub


javanna commented on PR #13775:
URL: https://github.com/apache/lucene/pull/13775#issuecomment-2346289887

   > +1 to that, let's pass n=reader.numDocs()?
   
   Sure, I made that change. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce ProfilerCollectorManager [lucene]

2024-09-12 Thread via GitHub


javanna merged PR #13746:
URL: https://github.com/apache/lucene/pull/13746


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]

2024-09-12 Thread via GitHub


jpountz commented on PR #13773:
URL: https://github.com/apache/lucene/pull/13773#issuecomment-2346309641

   TopScoreDocCollectorManager is supposed to return the top-N results that 
have the best score, tie-broken by doc ID. So (docID=0, score=1) compares 
better than (docID=155, score=1) and should appear first in the resulting top 
docs.
   
   > I think that we are still early terminating though or you mean that we 
shouldn't have as there were docs with same score but lower doc ids?
   
   Right.
   
   Looking at the code, it looks like the problem is here: 
https://github.com/apache/lucene/blob/ff8b81afc59dc09ea4dfdd02a58398f08fcad8a8/lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java#L260.
 If a segment has two partitions, then its leaf collectors will collect hits 
with the same "docBase", so the tie-breaking logic will not know that hits from 
the first partition should compare better than hits from the second partition.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]

2024-09-12 Thread via GitHub


jpountz commented on code in PR #13775:
URL: https://github.com/apache/lucene/pull/13775#discussion_r1756898252


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/LargeNumHitsTopDocsCollector.java:
##
@@ -137,7 +137,7 @@ protected void populateResults(ScoreDoc[] results, int 
howMany) {
 }
 
 // Total number of hits collected were less than requestedHitCount
-assert totalHits < requestedHitCount;
+assert totalHits <= requestedHitCount;

Review Comment:
   I believe you're right. And then a few lines above it should be `assert 
totalHits > requestedHitCount;`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Further reduce the search concurrency overhead. [lucene]

2024-09-12 Thread via GitHub


javanna commented on code in PR #13606:
URL: https://github.com/apache/lucene/pull/13606#discussion_r1756897923


##
lucene/core/src/java/org/apache/lucene/search/HitsThresholdChecker.java:
##
@@ -24,6 +24,10 @@ abstract class HitsThresholdChecker {
   /** Implementation of HitsThresholdChecker which allows global hit counting 
*/
   private static class GlobalHitsThresholdChecker extends HitsThresholdChecker 
{
 private final LongAdder globalHitCount = new LongAdder();
+// Cache whether the threshold has been reached already. It is not 
volatile or synchronized on
+// purpose to contain the overhead of reading the value similarly to what 
String#hashCode()
+// does. This does not affect correctness.
+private boolean thresholdReached = false;

Review Comment:
   I bumped into this only now, and I was wondering: this is mutable state 
accessed concurrently by multiple threads? If so can it be a primitive boolean 
without any thread-safety measure?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]

2024-09-12 Thread via GitHub


javanna closed pull request #13773: Fix TestPrefixRandom failures when 
intra-segment concurrency is enabled
URL: https://github.com/apache/lucene/pull/13773


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]

2024-09-12 Thread via GitHub


javanna commented on PR #13773:
URL: https://github.com/apache/lucene/pull/13773#issuecomment-2346327496

   Understood, thanks for the explanation, I did not realize doc ids played a 
role, I thought it was an incorrect assumption and it worked with inter-segment 
concurrency only by coincidence. I will go ahead and close this PR, and see if 
I can fix the production code instead of adapting the test. There's other 
recent failures around the same problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]

2024-09-12 Thread via GitHub


javanna commented on code in PR #13775:
URL: https://github.com/apache/lucene/pull/13775#discussion_r1756915409


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/LargeNumHitsTopDocsCollector.java:
##
@@ -137,7 +137,7 @@ protected void populateResults(ScoreDoc[] results, int 
howMany) {
 }
 
 // Total number of hits collected were less than requestedHitCount
-assert totalHits < requestedHitCount;
+assert totalHits <= requestedHitCount;

Review Comment:
   good catch!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Further reduce the search concurrency overhead. [lucene]

2024-09-12 Thread via GitHub


jpountz commented on code in PR #13606:
URL: https://github.com/apache/lucene/pull/13606#discussion_r1756914836


##
lucene/core/src/java/org/apache/lucene/search/HitsThresholdChecker.java:
##
@@ -24,6 +24,10 @@ abstract class HitsThresholdChecker {
   /** Implementation of HitsThresholdChecker which allows global hit counting 
*/
   private static class GlobalHitsThresholdChecker extends HitsThresholdChecker 
{
 private final LongAdder globalHitCount = new LongAdder();
+// Cache whether the threshold has been reached already. It is not 
volatile or synchronized on
+// purpose to contain the overhead of reading the value similarly to what 
String#hashCode()
+// does. This does not affect correctness.
+private boolean thresholdReached = false;

Review Comment:
   Indeed it may be accessed by multiple threads, but it is still safe, we may 
just have multiple threads independently set it to the same value. If the 
threshold isn't reached, then the value may only be `false`. Vice-versa if we 
read a `true`, the the threshold is guaranteed to have been reached. It is 
possible to read a `false` while the threshold has been reached (e.g. if a 
write in another thread hasn't been made visible to the current thread), but 
then this thread will take care of setting the boolean to true, same value that 
has been set by the other thread. This SO post tries to answer the same 
question for String#hashCode: 
https://stackoverflow.com/questions/41704185/is-javas-string-hashcode-function-thread-safe-if-its-cache-setter-does-not-us



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestShapeDocValues.testLatLonPolygonBBox [lucene]

2024-09-12 Thread via GitHub


msokolov commented on PR #13743:
URL: https://github.com/apache/lucene/pull/13743#issuecomment-2346349295

   Should we have an epsilon here rather than exact comparison with 0? Usually 
that's the way of things with floating point checks - we would check |a-b| < 
epsilon


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Lower number of hits in TestLargeNumHitsTopDocsCollector [lucene]

2024-09-12 Thread via GitHub


javanna merged PR #13775:
URL: https://github.com/apache/lucene/pull/13775


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]

2024-09-12 Thread via GitHub


cpoerschke commented on PR #13757:
URL: https://github.com/apache/lucene/pull/13757#issuecomment-2346451914

   > I took a stab at improving the docs in 
[47d4fa0](https://github.com/apache/lucene/commit/47d4fa02f3c52fffc5aac9328a1de9c0e640ff27)
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Further reduce the search concurrency overhead. [lucene]

2024-09-12 Thread via GitHub


javanna commented on code in PR #13606:
URL: https://github.com/apache/lucene/pull/13606#discussion_r1757003986


##
lucene/core/src/java/org/apache/lucene/search/HitsThresholdChecker.java:
##
@@ -24,6 +24,10 @@ abstract class HitsThresholdChecker {
   /** Implementation of HitsThresholdChecker which allows global hit counting 
*/
   private static class GlobalHitsThresholdChecker extends HitsThresholdChecker 
{
 private final LongAdder globalHitCount = new LongAdder();
+// Cache whether the threshold has been reached already. It is not 
volatile or synchronized on
+// purpose to contain the overhead of reading the value similarly to what 
String#hashCode()
+// does. This does not affect correctness.
+private boolean thresholdReached = false;

Review Comment:
   sounds good, thanks for clarifying



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Make HnswLock and LockedRow final [lucene]

2024-09-12 Thread via GitHub


ChrisHegarty opened a new pull request, #13776:
URL: https://github.com/apache/lucene/pull/13776

   Trivial commit that makes HnswLock final and LockedRow a record. This is 
general clean up and helps the JIT a little when reasoning about these types - 
which show quite a bit in indexing and search profiles.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestShapeDocValues.testLatLonPolygonBBox [lucene]

2024-09-12 Thread via GitHub


shubhamvishu commented on PR #13743:
URL: https://github.com/apache/lucene/pull/13743#issuecomment-2346624909

   Yes, usually we do that but in this case there is a catch. This test ONLY 
fails when the generated points are non collinear(as expected) but the `area == 
0` ([checked in this 
condition](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/geo/Tessellator.java#L1320-L1327)),
 which makes it throw the error of `at least three non-collinear points 
required`. If the `area > 0` then this test runs as expected without any 
issues. So with comparison to 0 we would rightly avoid only those cases where 
it actually trips but, epsilon might cause it to simply skip more cases where 
test could have run successfully. Moreover, as the generated coordinates have 
extremely small absolute values and close distance (eg : SomeNumE-84, 
SomeNumE-240 etc), its tricky to find some sane epsilon in this case (which 
would only skip running the test unnecessarily).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestShapeDocValues.testLatLonPolygonBBox [lucene]

2024-09-12 Thread via GitHub


stefanvodita commented on code in PR #13743:
URL: https://github.com/apache/lucene/pull/13743#discussion_r1757126653


##
lucene/core/src/test/org/apache/lucene/document/TestShapeDocValues.java:
##
@@ -255,4 +257,9 @@ private List 
getTessellation(XYPolygon p) {
 }
 return tess;
   }
+
+  /** Compute signed area of rectangle */
+  private static double area(Polygon p) {
+return (p.maxLon - p.minLon) * (p.maxLat - p.minLat);

Review Comment:
   Isn't it simpler to check the min/max for equality?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-12 Thread via GitHub


stefanvodita commented on PR #13723:
URL: https://github.com/apache/lucene/pull/13723#issuecomment-2346657253

   Thank you for the feedback! I've added a comparison method for doubles and a 
test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestPrefixRandom failures when intra-segment concurrency is enabled [lucene]

2024-09-12 Thread via GitHub


javanna commented on PR #13773:
URL: https://github.com/apache/lucene/pull/13773#issuecomment-2346773435

   I opened #13777 . It needs some proper review because I am not entirely sure 
what side effects it may have. There is currently no direct link between leaf 
collector and scorers, to pass along the minDocId that we could have used 
somehow as docBase in the maxScoreAccumulator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestShapeDocValues.testLatLonPolygonBBox [lucene]

2024-09-12 Thread via GitHub


shubhamvishu commented on code in PR #13743:
URL: https://github.com/apache/lucene/pull/13743#discussion_r1757244198


##
lucene/core/src/test/org/apache/lucene/document/TestShapeDocValues.java:
##
@@ -255,4 +257,9 @@ private List 
getTessellation(XYPolygon p) {
 }
 return tess;
   }
+
+  /** Compute signed area of rectangle */
+  private static double area(Polygon p) {
+return (p.maxLon - p.minLon) * (p.maxLat - p.minLat);

Review Comment:
   Nope, its possible that both the parts of multiplication are `> 0`(non zero) 
but the result/area is 0 if the result goes beyond the double range(Eg: 
(`1E-240 * 1E-78` is `0.0`).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestShapeDocValues.testLatLonPolygonBBox [lucene]

2024-09-12 Thread via GitHub


shubhamvishu commented on code in PR #13743:
URL: https://github.com/apache/lucene/pull/13743#discussion_r1757244198


##
lucene/core/src/test/org/apache/lucene/document/TestShapeDocValues.java:
##
@@ -255,4 +257,9 @@ private List 
getTessellation(XYPolygon p) {
 }
 return tess;
   }
+
+  /** Compute signed area of rectangle */
+  private static double area(Polygon p) {
+return (p.maxLon - p.minLon) * (p.maxLat - p.minLat);

Review Comment:
   Nope, its possible that both the parts of multiplication are `> 0`(non zero) 
but the result/area is 0 if the result goes beyond the double range(Eg: 
(`1E-250 * 1E-78` is `0.0`).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]

2024-09-12 Thread via GitHub


javanna commented on PR #13777:
URL: https://github.com/apache/lucene/pull/13777#issuecomment-2346813650

   I am not getting many test failures, which is surprising, because I think 
that this is entirely the wrong fix, in that it defeats  the early termination 
purpose of top score docs collector. I don't have good ideas on how to fix this 
properly so far, without API changes, in that leaf collector don't have a way 
to get notified that the current segment is seen as part of a segment 
partition. It would be enough to get the min doc id of the partition, and use 
it as doc base in the accumulator, but it's hard to get it there with the 
current API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Can we remove `compress` option for quantized KNN vector indexing? [lucene]

2024-09-12 Thread via GitHub


benwtrent commented on issue #13768:
URL: https://github.com/apache/lucene/issues/13768#issuecomment-2346877986

   If we are ok with the perf hit on non-panama, I am cool with it :). It will 
definitely simplify the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-12 Thread via GitHub


uschindler commented on PR #13723:
URL: https://github.com/apache/lucene/pull/13723#issuecomment-2346992406

   I don't like the last commit because it changes from a assert-like method to 
a boolean returning method.
   
   Could we not keep the previous method signature and still add a test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-09-12 Thread via GitHub


vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2346995734

   > Is "default run" from this PR?
   
   No. "default run" is knn search where each embedding is a separate document 
with no relationship between them. I'm still wiring things up to see benchmark 
results for this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


msokolov opened a new pull request, #13779:
URL: https://github.com/apache/lucene/pull/13779

   addresses https://github.com/apache/lucene/issues/13778
   
   Key things in this PR:
   
   1. Introduces abstract `KnnVectorValues` from which `ByteVectorValues` and 
`FloatVectorValues` derive;
   2. Folds `RandomAccessVectorValues` into `KnnVectorValues` thus eliminating 
some casts.
   3. `RandomAccessVectorValues.Floats` becomes `FloatVectorValues` and 
`RandomAccessVectorValues.Bytes` becomes `ByteVectorValues`. 
`RandomAccessQuantizedByteVectorValues` folded into `QuantizedByteVectorValues`.
   4. `IndexInput getSlice()` moved to a new `HasIndexSlice` interface.
   5. Introduces `VectorEncoding KnnVectorValues.getEncoding()` to enable 
type-specific branches in a few places where we are dealing with abstract 
`KnnVectorValues` (tests only IIRC). Also used that to provide a default 
`getVectorByteLength()`.
   6. `KnnVectorValues` no longer extends  `DocIdSetIterator`; rather it 
provides a tightly-coupled `iterator()`. This enables refactoring common 
iteration patterns that were repeated many times in the code base. This new 
iterator, `DocIndexIterator` provides an additional method `index()` analogous 
to `IndexedDISI`.
   
   Some of the methods on `KnnVectorValues` have default impls that throw 
`UnsupportedOperationException` enabling subclasses to provide partial 
implementations and relying on testing to catch missing required methods. I'd 
like feedback on this. Should we provide implementations we never use, just to 
make these classes complete? That didn't make sense to me. But the previous 
alternative of attempting to provide strict adherence to declarative contracts 
was becoming in my view, overly restrictive and leading to hard-to-maintain 
code. Some of these readers would only ever be used iteratively. Random access 
is required for search, but not used when  merging the values themselves, and 
when we merge we do search, but using a temporary file so that searching is 
always done over a file-based value. Random access also gets used during 
merging when the index is sorted, again this is provided by specialized 
readers, so not every reader needs to implement random access. But the API 
maintenance 
 is greatly simplified if we allow partial implementation. Anyway that is the 
idea I am trying out here. Can we live with a little less API purity and gain 
some simplicity and less boilerplate?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-09-12 Thread via GitHub


vigyasharma commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1757395276


##
lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarityFunction.java:
##
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * Multi-vector similarity function; used in search to return top K most 
similar multi-vectors to a
+ * target multi-vector. This method is used during indexing and searching of 
the multi-vectors in
+ * order to determine the nearest neighbors.
+ */
+// no commit
+public class MultiVectorSimilarityFunction implements MultiVectorSimilarity {
+
+  /** Aggregation function to combine similarity across multiple vector values 
*/
+  public enum Aggregation {
+/** Placeholder aggregation that is not intended to be used. */
+NONE {
+  @Override
+  public float aggregate(
+  float[] outer,
+  float[] inner,
+  VectorSimilarityFunction vectorSimilarityFunction,
+  int dimension) {
+throw new UnsupportedOperationException();
+  }
+
+  @Override
+  public float aggregate(
+  byte[] outer,
+  byte[] inner,
+  VectorSimilarityFunction vectorSimilarityFunction,
+  int dimension) {
+throw new UnsupportedOperationException();
+  }
+},
+
+/**
+ * SumMaxSimilarity between two multi-vectors. Aggregates using the sum of 
maximum similarity
+ * found for each vector in the first multi-vector against all vectors in 
the second
+ * multi-vector.
+ */
+SUM_MAX {
+  @Override
+  public float aggregate(
+  float[] outer,
+  float[] inner,
+  VectorSimilarityFunction vectorSimilarityFunction,
+  int dimension) {
+if (outer.length % dimension != 0 || inner.length % dimension != 0) {
+  throw new IllegalArgumentException("Multi vectors do not match 
provided dimensions");
+}
+// TODO: can we avoid making vector copies?
+List outerList = new ArrayList<>();
+List innerList = new ArrayList<>();
+for (int i = 0; i <= outer.length; i += dimension) {
+  outerList.add(ArrayUtil.copyOfSubArray(outer, i, dimension));
+}
+for (int i = 0; i <= inner.length; i += dimension) {
+  innerList.add(ArrayUtil.copyOfSubArray(inner, i, dimension));
+}

Review Comment:
   yes, that's what I had in mind. It'll be slower right now with all the extra 
allocation but we can change it later by making our sim. fn impl (probably in 
the vectorization provider?) work with offsets and lengths.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


msokolov commented on PR #13779:
URL: https://github.com/apache/lucene/pull/13779#issuecomment-2347012727

   another concern I have is how this would impact ongoing work to enable 
multiple vectors per doc/field. There would almost certainly be conflicts with 
that PR on the surface, but I hope this could actually simplify things in that 
the new `DocIndexIterator` class could be enhanced / extended to provide access 
to a series of values (maybe a list or array?) instead of (or in addition to?) 
a single one, possibly centralizing the required changes (since we have many 
fewer iterator implementations after this change).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]

2024-09-12 Thread via GitHub


jpountz commented on code in PR #13777:
URL: https://github.com/apache/lucene/pull/13777#discussion_r1757465545


##
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java:
##
@@ -254,10 +254,9 @@ protected void updateMinCompetitiveScore(Scorable scorer) 
throws IOException {
 totalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;
 minCompetitiveScore = localMinScore;
 if (minScoreAcc != null) {
-  // we don't use the next float but we register the document
-  // id so that other leaves can require it if they are after
-  // the current maximum
-  minScoreAcc.accumulate(docBase, pqTop.score);
+  // we don't use the next float but we register the document id so 
that other leaves or
+  // leaf partitions can require it if they are after the current 
maximum
+  minScoreAcc.accumulate(docBase + pqTop.doc, pqTop.score);

Review Comment:
   pqTop.doc should already be a top-level doc ID, so adding `docBase`shouldn't 
be needed?



##
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java:
##
@@ -232,7 +232,7 @@ protected void updateGlobalMinCompetitiveScore(Scorable 
scorer) throws IOExcepti
   // the next float if the global minimum score is set on a document id 
that is
   // smaller than the ids in the current leaf
   float score =
-  docBase >= maxMinScore.docBase() ? Math.nextUp(maxMinScore.score()) 
: maxMinScore.score();
+  docBase >= maxMinScore.docId() ? Math.nextUp(maxMinScore.score()) : 
maxMinScore.score();

Review Comment:
   In theory, we could replace this docBase with the current doc ID as well 
(not important for correctness, but could help skip more docs when some 
segments have multiple partitions). I'm good to make this change in a follow-up 
PR instead of this one if it helps.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]

2024-09-12 Thread via GitHub


jpountz commented on PR #13777:
URL: https://github.com/apache/lucene/pull/13777#issuecomment-2347079637

   > I think that this is entirely the wrong fix
   
   Why is it the wrong fix? This looks correct to me: we want to consider docs 
with equal or greater scores if the doc ID is less than the top-n-th hit so 
far, but only docs with a greater doc ID otherwise?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Remove IndexSearcher#search(List, Weight, Collector) [lucene]

2024-09-12 Thread via GitHub


javanna opened a new pull request, #13780:
URL: https://github.com/apache/lucene/pull/13780

   With the introduction of intra-segment concurrency, we have introduced a new 
protected search(LeafReaderContextPartition, Weight, Collector) method. The 
previous variant that accepts a list of leaf reader contexts was left 
deprecated as there is one leftover usages coming from search(Query, 
Collector). The hope was that the latter was going to be removed soon as well, 
but there is actually no need to tie the two removals. It is easier to fold 
this method into its only caller, in order for it to still bypass the collector 
manager based methods.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]

2024-09-12 Thread via GitHub


javanna commented on code in PR #13777:
URL: https://github.com/apache/lucene/pull/13777#discussion_r1757481284


##
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java:
##
@@ -254,10 +254,9 @@ protected void updateMinCompetitiveScore(Scorable scorer) 
throws IOException {
 totalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;
 minCompetitiveScore = localMinScore;
 if (minScoreAcc != null) {
-  // we don't use the next float but we register the document
-  // id so that other leaves can require it if they are after
-  // the current maximum
-  minScoreAcc.accumulate(docBase, pqTop.score);
+  // we don't use the next float but we register the document id so 
that other leaves or
+  // leaf partitions can require it if they are after the current 
maximum
+  minScoreAcc.accumulate(docBase + pqTop.doc, pqTop.score);

Review Comment:
   ah good point!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]

2024-09-12 Thread via GitHub


javanna commented on PR #13777:
URL: https://github.com/apache/lucene/pull/13777#issuecomment-2347094763

   > Why is it the wrong fix?
   
   Happy to be proven wrong, especially when I believe that my fix is wrong. 
ahah
   
   I was under the impression that we may no longer early terminate if we 
replace the docBase with the docId. A test failure on `TestConstantScoreScorer` 
seems to show that pretty clearly. I also thought that this is too easy to be 
true especially if it does not have side effects :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Migrate more classes to records. [lucene]

2024-09-12 Thread via GitHub


jpountz merged PR #13772:
URL: https://github.com/apache/lucene/pull/13772


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make HnswLock and LockedRow final [lucene]

2024-09-12 Thread via GitHub


ChrisHegarty merged PR #13776:
URL: https://github.com/apache/lucene/pull/13776


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


benwtrent commented on PR #13779:
URL: https://github.com/apache/lucene/pull/13779#issuecomment-2347102278

   >  but I hope this could actually simplify things 
   
   That is my intuition as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]

2024-09-12 Thread via GitHub


javanna commented on PR #13777:
URL: https://github.com/apache/lucene/pull/13777#issuecomment-2347103042

   > A test failure on TestConstantScoreScorer seems to show that pretty 
clearly.
   
   Ok, that was caused by using the docBase when already included in the doc 
id. Phew. This looks to be right after all! Sorry for the back and forth and 
thanks for the help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]

2024-09-12 Thread via GitHub


javanna commented on code in PR #13777:
URL: https://github.com/apache/lucene/pull/13777#discussion_r1757494555


##
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java:
##
@@ -232,7 +232,7 @@ protected void updateGlobalMinCompetitiveScore(Scorable 
scorer) throws IOExcepti
   // the next float if the global minimum score is set on a document id 
that is
   // smaller than the ids in the current leaf
   float score =
-  docBase >= maxMinScore.docBase() ? Math.nextUp(maxMinScore.score()) 
: maxMinScore.score();
+  docBase >= maxMinScore.docId() ? Math.nextUp(maxMinScore.score()) : 
maxMinScore.score();

Review Comment:
   I see what you mean , I gave it a try, but then I reverted it, I prefer to 
think about testing and do it as a follow-up



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]

2024-09-12 Thread via GitHub


javanna commented on PR #13777:
URL: https://github.com/apache/lucene/pull/13777#issuecomment-2347172996

   This should be good to go now :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


jpountz commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1757529511


##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;
+
+  /** Returns the vector byte length, defaults to dimension multiplied by 
float byte size */
+  public int getVectorByteLength() {
+return dimension() * getEncoding().byteSize;
+  }
+
+  /** The vector encoding of these values. */
+  public abstract VectorEncoding getEncoding();
+
+  /** Returns a Bits accepting docs accepted by the argument and having a 
vector value */
+  public Bits getAcceptOrds(Bits acceptDocs) {
+// FIXME: change default to return acceptDocs and provide this impl
+// somewhere more specialized (in every non-dense impl).
+if (acceptDocs == null) {
+  return null;
+}
+return new Bits() {
+  @Override
+  public boolean get(int index) {
+return acceptDocs.get(ordToDoc(index));
+  }
+
+  @Override
+  public int length() {
+return size();
+  }
+};
+  }
+
+  /**
+   * Return the iterator for this instance. If you need multiple iterators, 
call 
+   * this.copy().iterator().
+   */
+  public DocIndexIterator iterator() {
+if (iterator == null) {
+  iterator = createIterator();
+}
+return iterator;
+  }
+
+  /**
+   * Create an iterator for this instance; typically called once by 
iterator(). Wrapper
+   * value classes delegate to their inner instance's iterator and shouldn't 
implement this.
+   */
+  protected DocIndexIterator createIterator() {
+throw new UnsupportedOperationException();
+  }
+
+  /**
+   * A DocIdSetIterator that also provides an index() method tracking a 
distinct ordinal for a
+   * vector associated with each doc.
+   */
+  public abstract static class DocIndexIterator extends DocIdSetIterator {
+
+/** return the value index (aka "ordinal" or "ord") corresponding to the 
current doc */
+public abstract int index();
+
+@Override
+public int advance(int target) throws IOException {
+  return slowAdvance(target);
+}
+
+@Override
+public long cost() {
+  throw new UnsupportedOperationException();
+}
+
+/**
+ * Returns an iterator that delegates to the IndexedDISI. Advancing this 
iterator will advance
+ * the underlying IndexedDISI, and vice-versa.
+ */
+public static DocIndexIterator fromIndexedDISI(IndexedDISI disi) {
+  // can we replace with fromDISI?
+  return new DocIndexIterator() {
+@Override
+public int docID() {
+  return disi.docID();
+}
+
+@Override
+public int index() {
+  return disi.index();
+}

Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


jpountz commented on PR #13779:
URL: https://github.com/apache/lucene/pull/13779#issuecomment-2347207782

   Am guessing correctly that you're targeting 10.0 for this change?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


msokolov commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1757568397


##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;
+
+  /** Returns the vector byte length, defaults to dimension multiplied by 
float byte size */
+  public int getVectorByteLength() {

Review Comment:
   I think it can possibly vary due to quantization --- but you may be right 
and it refers only to the original size?  Will have to confirm, or maybe 
@benwtrent can comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


msokolov commented on PR #13779:
URL: https://github.com/apache/lucene/pull/13779#issuecomment-2347232511

   Thanks for the quick review! I will get started on addressing. As for 
timeline for this change, it would definitely be convenient to get in to 10.0 
release. I think you had said 9/22 would be a feature freeze date; it seems we 
could possibly meet that timeline. I will be traveling starting tomorrow for a 
week, but I should be able to put in some quality time on the plane LOL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


msokolov commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1757605144


##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;
+
+  /** Returns the vector byte length, defaults to dimension multiplied by 
float byte size */
+  public int getVectorByteLength() {
+return dimension() * getEncoding().byteSize;
+  }
+
+  /** The vector encoding of these values. */
+  public abstract VectorEncoding getEncoding();
+
+  /** Returns a Bits accepting docs accepted by the argument and having a 
vector value */
+  public Bits getAcceptOrds(Bits acceptDocs) {
+// FIXME: change default to return acceptDocs and provide this impl
+// somewhere more specialized (in every non-dense impl).
+if (acceptDocs == null) {
+  return null;
+}
+return new Bits() {
+  @Override
+  public boolean get(int index) {
+return acceptDocs.get(ordToDoc(index));
+  }
+
+  @Override
+  public int length() {
+return size();
+  }
+};
+  }
+
+  /**
+   * Return the iterator for this instance. If you need multiple iterators, 
call 
+   * this.copy().iterator().
+   */
+  public DocIndexIterator iterator() {
+if (iterator == null) {
+  iterator = createIterator();
+}
+return iterator;
+  }
+
+  /**
+   * Create an iterator for this instance; typically called once by 
iterator(). Wrapper
+   * value classes delegate to their inner instance's iterator and shouldn't 
implement this.
+   */
+  protected DocIndexIterator createIterator() {
+throw new UnsupportedOperationException();
+  }
+
+  /**
+   * A DocIdSetIterator that also provides an index() method tracking a 
distinct ordinal for a
+   * vector associated with each doc.
+   */
+  public abstract static class DocIndexIterator extends DocIdSetIterator {
+
+/** return the value index (aka "ordinal" or "ord") corresponding to the 
current doc */
+public abstract int index();
+
+@Override
+public int advance(int target) throws IOException {
+  return slowAdvance(target);
+}
+
+@Override
+public long cost() {
+  throw new UnsupportedOperationException();
+}
+
+/**
+ * Returns an iterator that delegates to the IndexedDISI. Advancing this 
iterator will advance
+ * the underlying IndexedDISI, and vice-versa.
+ */
+public static DocIndexIterator fromIndexedDISI(IndexedDISI disi) {
+  // can we replace with fromDISI?
+  return new DocIndexIterator() {
+@Override
+public int docID() {
+  return disi.docID();
+}
+
+@Override
+public int index() {
+  return disi.index();
+

Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


msokolov commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1757605492


##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;
+
+  /** Returns the vector byte length, defaults to dimension multiplied by 
float byte size */
+  public int getVectorByteLength() {
+return dimension() * getEncoding().byteSize;
+  }
+
+  /** The vector encoding of these values. */
+  public abstract VectorEncoding getEncoding();
+
+  /** Returns a Bits accepting docs accepted by the argument and having a 
vector value */
+  public Bits getAcceptOrds(Bits acceptDocs) {
+// FIXME: change default to return acceptDocs and provide this impl
+// somewhere more specialized (in every non-dense impl).
+if (acceptDocs == null) {
+  return null;
+}
+return new Bits() {
+  @Override
+  public boolean get(int index) {
+return acceptDocs.get(ordToDoc(index));
+  }
+
+  @Override
+  public int length() {
+return size();
+  }
+};
+  }
+
+  /**
+   * Return the iterator for this instance. If you need multiple iterators, 
call 
+   * this.copy().iterator().
+   */
+  public DocIndexIterator iterator() {
+if (iterator == null) {
+  iterator = createIterator();
+}
+return iterator;

Review Comment:
   Let me try - I was also a bit unhappy about this, but at one point along 
this journey I was relying on being able to recover the shared state - maybe I 
finally was able to get rid of that and just didn't notice!



##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorFiel

Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


msokolov commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1757605699


##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;
+
+  /** Returns the vector byte length, defaults to dimension multiplied by 
float byte size */
+  public int getVectorByteLength() {
+return dimension() * getEncoding().byteSize;
+  }
+
+  /** The vector encoding of these values. */
+  public abstract VectorEncoding getEncoding();
+
+  /** Returns a Bits accepting docs accepted by the argument and having a 
vector value */
+  public Bits getAcceptOrds(Bits acceptDocs) {
+// FIXME: change default to return acceptDocs and provide this impl
+// somewhere more specialized (in every non-dense impl).
+if (acceptDocs == null) {
+  return null;
+}
+return new Bits() {
+  @Override
+  public boolean get(int index) {
+return acceptDocs.get(ordToDoc(index));
+  }
+
+  @Override
+  public int length() {
+return size();
+  }
+};
+  }
+
+  /**
+   * Return the iterator for this instance. If you need multiple iterators, 
call 
+   * this.copy().iterator().
+   */
+  public DocIndexIterator iterator() {
+if (iterator == null) {
+  iterator = createIterator();
+}
+return iterator;
+  }
+
+  /**
+   * Create an iterator for this instance; typically called once by 
iterator(). Wrapper
+   * value classes delegate to their inner instance's iterator and shouldn't 
implement this.
+   */
+  protected DocIndexIterator createIterator() {
+throw new UnsupportedOperationException();
+  }
+
+  /**
+   * A DocIdSetIterator that also provides an index() method tracking a 
distinct ordinal for a
+   * vector associated with each doc.
+   */
+  public abstract static class DocIndexIterator extends DocIdSetIterator {
+
+/** return the value index (aka "ordinal" or "ord") corresponding to the 
current doc */
+public abstract int index();
+
+@Override
+public int advance(int target) throws IOException {
+  return slowAdvance(target);
+}
+
+@Override
+public long cost() {
+  throw new UnsupportedOperationException();

Review Comment:
   hmm I think cost() is rarely used in the vector reader/writers which instead 
are concerned with KnnVectorValues.size() -- they typically want to know how 
many vector values there are and to the extent they care about the number of 
docs it's only when they must iterate through all of them and have no use for 
an estimate. These iterators aren't really used during searching?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub 

Re: [PR] Provide a custom hash implementation in HnswLock [lucene]

2024-09-12 Thread via GitHub


msokolov commented on PR #13751:
URL: https://github.com/apache/lucene/pull/13751#issuecomment-2347296049

   thank you @ChrisHegarty !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


msokolov commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1757672091


##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;

Review Comment:
   The way `SortedDocValuesTermsEnum` is, calling its `next` method will 
overwrite the internal buffer ofd the `SortedDocValues` on which it is built, 
defeating the  purpose of `copy()` which is to provide two completely 
independent sources. Another thing we could do is to add `vectorValue(int ord, 
float[] scratch)` allowing the caller to provide the memory to write to. If we 
had that, we wouldn't need `copy()`. Maybe we could manage to squeeze that into 
10.0 too, but I'd rather do it in a separate PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-12 Thread via GitHub


stefanvodita commented on PR #13723:
URL: https://github.com/apache/lucene/pull/13723#issuecomment-2347389307

   I changed it away from an assertion because I liked this more. It makes it 
so you can assert on floats *not* being equal or use their equality in a 
condition, without making an assertion statement much longer. Do you not like 
it because it's too generic? We could move it to `TestUtil` or we could provide 
assertion methods alongside the equality methods, if that helps.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Use Arrays.compareUnsigned instead of iterating compare in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-12 Thread via GitHub


vsop-479 opened a new pull request, #13781:
URL: https://github.com/apache/lucene/pull/13781

   ### Description
   
   This change similars to https://github.com/apache/lucene/pull/13252.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned instead of iterating compare in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-12 Thread via GitHub


vsop-479 closed pull request #13781: Use Arrays.compareUnsigned instead of 
iterating compare in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum.
URL: https://github.com/apache/lucene/pull/13781


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-12 Thread via GitHub


vsop-479 opened a new pull request, #13782:
URL: https://github.com/apache/lucene/pull/13782

   ### Description
   
   Description
   This change similars to https://github.com/apache/lucene/pull/13252.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-12 Thread via GitHub


vsop-479 commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2347950684

   @uschindler 
   Please take a look when you get a chance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-12 Thread via GitHub


jpountz commented on code in PR #13782:
URL: https://github.com/apache/lucene/pull/13782#discussion_r1758190884


##
lucene/CHANGES.txt:
##
@@ -292,6 +292,8 @@ Build
 API Changes
 -
 
+* GITHUB#13782: Replace handwritten loops compare with Arrays.compareUnsigned 
in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. (zhouhui)

Review Comment:
   We usually put under "API changes" things that require users to change their 
code upon upgrading, which is not the case here. Can you move it under 
"Optimizations" instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


jpountz commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1758191856


##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;
+
+  /** Returns the vector byte length, defaults to dimension multiplied by 
float byte size */
+  public int getVectorByteLength() {
+return dimension() * getEncoding().byteSize;
+  }
+
+  /** The vector encoding of these values. */
+  public abstract VectorEncoding getEncoding();
+
+  /** Returns a Bits accepting docs accepted by the argument and having a 
vector value */
+  public Bits getAcceptOrds(Bits acceptDocs) {
+// FIXME: change default to return acceptDocs and provide this impl
+// somewhere more specialized (in every non-dense impl).
+if (acceptDocs == null) {
+  return null;
+}
+return new Bits() {
+  @Override
+  public boolean get(int index) {
+return acceptDocs.get(ordToDoc(index));
+  }
+
+  @Override
+  public int length() {
+return size();
+  }
+};
+  }
+
+  /**
+   * Return the iterator for this instance. If you need multiple iterators, 
call 
+   * this.copy().iterator().
+   */
+  public DocIndexIterator iterator() {
+if (iterator == null) {
+  iterator = createIterator();
+}
+return iterator;
+  }
+
+  /**
+   * Create an iterator for this instance; typically called once by 
iterator(). Wrapper
+   * value classes delegate to their inner instance's iterator and shouldn't 
implement this.
+   */
+  protected DocIndexIterator createIterator() {
+throw new UnsupportedOperationException();
+  }
+
+  /**
+   * A DocIdSetIterator that also provides an index() method tracking a 
distinct ordinal for a
+   * vector associated with each doc.
+   */
+  public abstract static class DocIndexIterator extends DocIdSetIterator {
+
+/** return the value index (aka "ordinal" or "ord") corresponding to the 
current doc */
+public abstract int index();
+
+@Override
+public int advance(int target) throws IOException {
+  return slowAdvance(target);
+}
+
+@Override
+public long cost() {
+  throw new UnsupportedOperationException();
+}
+
+/**
+ * Returns an iterator that delegates to the IndexedDISI. Advancing this 
iterator will advance
+ * the underlying IndexedDISI, and vice-versa.
+ */
+public static DocIndexIterator fromIndexedDISI(IndexedDISI disi) {
+  // can we replace with fromDISI?
+  return new DocIndexIterator() {
+@Override
+public int docID() {
+  return disi.docID();
+}
+
+@Override
+public int index() {
+  return disi.index();
+}

Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


jpountz commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1758192953


##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;

Review Comment:
   But if you call SortedDocValues#termsEnum twice, this would give you two 
independent sources of terms?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add prefetching support to term vectors. [lucene]

2024-09-12 Thread via GitHub


jpountz merged PR #13758:
URL: https://github.com/apache/lucene/pull/13758


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-12 Thread via GitHub


jpountz commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1758220291


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -361,33 +385,46 @@ private MergedByteVectorValues(List 
subs, MergeState mergeS
   totalSize += sub.values.size();
 }
 size = totalSize;
-docId = -1;
   }
 
   @Override
-  public byte[] vectorValue() throws IOException {
-return current.values.vectorValue();
+  public byte[] vectorValue(int ord) throws IOException {
+return current.values.vectorValue(current.values.iterator().index());

Review Comment:
   This part feels a bit hacky, could we instead merge the ord->vector mappings 
of the various vector values by concatenating them?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-12 Thread via GitHub


vsop-479 commented on code in PR #13782:
URL: https://github.com/apache/lucene/pull/13782#discussion_r1758230120


##
lucene/CHANGES.txt:
##
@@ -292,6 +292,8 @@ Build
 API Changes
 -
 
+* GITHUB#13782: Replace handwritten loops compare with Arrays.compareUnsigned 
in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. (zhouhui)

Review Comment:
   Thanks @jpountz . That is a mistake, I fixed it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-12 Thread via GitHub


vsop-479 commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348109781

   @jpountz 
   I think this check failure relats to 
https://github.com/apache/lucene/commit/5045d3c67b18d23c534a32cee1d95827b0b7c482
 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-12 Thread via GitHub


jpountz commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348146820

   Indeed it likely does, can you merge main back into your branch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace docBase with actual docId in MaxScoreAccumulator [lucene]

2024-09-12 Thread via GitHub


javanna merged PR #13777:
URL: https://github.com/apache/lucene/pull/13777


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org