[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274490843


##
lucene/core/src/java/org/apache/lucene/util/hnsw/KnnResults.java:
##
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TotalHits;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  /** KnnResults when exiting search early and returning empty top docs */
+  static class EmptyKnnResults extends KnnResults {
+public EmptyKnnResults(int k, int visitedCount, int visitLimit) {
+  super(k, visitLimit);
+  this.visitedCount = visitedCount;
+}
+
+@Override
+public void doClear() {}
+
+@Override
+public boolean collect(int vectorId, float similarity) {
+  throw new IllegalArgumentException();
+}
+
+@Override
+public boolean isFull() {
+  return true;
+}
+
+@Override
+public float minSimilarity() {
+  return 0;
+}
+
+@Override
+public TopDocs topDocs() {
+  TotalHits th = new TotalHits(visitedCount, 
TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO);
+  return new TopDocs(th, new ScoreDoc[0]);
+}
+  }
+
+  static class OrdinalTranslatedKnnResults extends KnnResults {
+private final KnnResults in;
+private final IntToIntFunction vectorOrdinalToDocId;
+
+OrdinalTranslatedKnnResults(KnnResults in, IntToIntFunction 
vectorOrdinalToDocId) {
+  super(in.k, in.visitLimit);
+  this.in = in;
+  this.vectorOrdinalToDocId = vectorOrdinalToDocId;
+}
+
+@Override
+void doClear() {
+  in.clear();
+}
+
+@Override
+boolean collect(int vectorId, float similarity) {
+  return in.collect(vectorOrdinalToDocId.apply(vectorId), similarity);
+}
+
+@Override
+boolean isFull() {
+  return in.isFull();
+}
+
+@Override
+float minSimilarity() {
+  return in.minSimilarity();
+}
+
+@Override
+public TopDocs topDocs() {
+  TopDocs td = in.topDocs();
+  return new TopDocs(
+  new TotalHits(
+  visitedCount(),
+  incomplete()
+  ? TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO
+  : TotalHits.Relation.EQUAL_TO),
+  td.scoreDocs);
+}
+  }
+
+  protected int visitedCount;
+  private final int visitLimit;
+  private final int k;
+
+  protected KnnResults(int k, int visitLimit) {
+this.visitLimit = visitLimit;
+this.k = k;
+  }
+
+  final void clear() {
+this.visitedCount = 0;
+doClear();
+  }
+
+  /** Clear the current results. */
+  abstract void doClear();
+
+  /**
+   * @return is the current result set marked as incomplete?
+   */
+  final boolean incomplete() {
+return visitedCount >= visitLimit;
+  }
+
+  final void incVisitedCount(int count) {
+assert count > 0;
+this.visitedCount += count;
+  }
+
+  /**
+   * @return the current visited count
+   */
+  final int visitedCount() {
+return visitedCount;
+  }
+
+  final int visitLimit() {
+return visitLimit;
+  }
+
+  public final int k() {
+return k;
+  }
+
+  /**
+   * Collect the provided vectorId and include in the result set.
+   *
+   * @param vectorId the vector to collect
+   * @param similarity its calculated similarity
+   * @return true if the vector is collected
+   */
+  abstract boolean collect(int vectorId, float similarity);
+
+  /**
+   * @return Is the current result set considered full
+   */
+  abstract boolean isFull();

Review Comment:
   I see what you are saying, but I think I still like the "trusting 
implementers" option better. The semantics I'd like for this method is 
something like "the minimum similarity for a vector to be competitive", so it 
would naturally be NEGATIVE_INFINITY as long as the queue is not full. If we 
don't trust implementers, then we need to update javadocs of `minSimilarity()` 
to add something like "it is only legal to call this method when isFull() 
returns true" which isn'

[GitHub] [lucene-jira-archive] uschindler closed pull request #151: Bump certifi from 2022.6.15 to 2023.7.22 in /migration

2023-07-26 Thread via GitHub


uschindler closed pull request #151: Bump certifi from 2022.6.15 to 2023.7.22 
in /migration
URL: https://github.com/apache/lucene-jira-archive/pull/151


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] uschindler closed pull request #150: Bump requests from 2.28.0 to 2.31.0 in /migration

2023-07-26 Thread via GitHub


uschindler closed pull request #150: Bump requests from 2.28.0 to 2.31.0 in 
/migration
URL: https://github.com/apache/lucene-jira-archive/pull/150


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] dependabot[bot] commented on pull request #151: Bump certifi from 2022.6.15 to 2023.7.22 in /migration

2023-07-26 Thread via GitHub


dependabot[bot] commented on PR #151:
URL: 
https://github.com/apache/lucene-jira-archive/pull/151#issuecomment-1651200940

   OK, I won't notify you again about this release, but will get in touch when 
a new version is available. If you'd rather skip all updates until the next 
major or minor version, let me know by commenting `@dependabot ignore this 
major version` or `@dependabot ignore this minor version`.
   
   If you change your mind, just re-open this PR and I'll resolve any conflicts 
on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] dependabot[bot] commented on pull request #150: Bump requests from 2.28.0 to 2.31.0 in /migration

2023-07-26 Thread via GitHub


dependabot[bot] commented on PR #150:
URL: 
https://github.com/apache/lucene-jira-archive/pull/150#issuecomment-1651201006

   OK, I won't notify you again about this release, but will get in touch when 
a new version is available. If you'd rather skip all updates until the next 
major or minor version, let me know by commenting `@dependabot ignore this 
major version` or `@dependabot ignore this minor version`.
   
   If you change your mind, just re-open this PR and I'll resolve any conflicts 
on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274552571


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;
+  private final int k;
+
+  protected KnnResults(int k, int visitLimit) {
+this.visitLimit = visitLimit;
+this.k = k;
+  }
+
+  /**
+   * @return is the current result set marked as incomplete?
+   */
+  public final boolean incomplete() {
+return visitedCount >= visitLimit;
+  }
+
+  public final void incVisitedCount(int count) {
+assert count > 0;
+this.visitedCount += count;
+  }
+
+  /**
+   * @return the current visited count
+   */
+  public final int visitedCount() {
+return visitedCount;
+  }
+
+  public final int visitLimit() {
+return visitLimit;
+  }
+
+  public final int k() {

Review Comment:
   I wonder if we can avoid adding it to the public API, the search logic 
doesn't need to know how many hits are being searched?



##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;

Review Comment:
   Thinking out loud: maybe we should make these two quantities `long`s instead 
of `int`s so that we don't have to break it when introducing multi-value 
support.



##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;
+  private final int k;
+
+  protected KnnResults(int k, int visitLimit) {
+this.visitLimit = visitLimit;
+this.k = k;
+  }
+
+  /**
+   * @return is the current result set marked as incomplete?
+   */
+  public final boolean incomplete() {
+return visitedCount >= visitLimit;
+  }
+
+  public final void incVisitedCount(int count) {
+assert count > 0;
+this

[GitHub] [lucene] jpountz commented on pull request #12405: Skip docs with Docvalues in NumericLeafComparator

2023-07-26 Thread via GitHub


jpountz commented on PR #12405:
URL: https://github.com/apache/lucene/pull/12405#issuecomment-1651405438

   @LuXugang I pushed a commit with the changes I had in mind, what do you 
think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] original-brownbear commented on a diff in pull request #12455: Clean up writing String to ByteBuffersDataOutput

2023-07-26 Thread via GitHub


original-brownbear commented on code in PR #12455:
URL: https://github.com/apache/lucene/pull/12455#discussion_r1274811181


##
lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java:
##
@@ -549,66 +540,22 @@ private static int computeBlockSizeBitsFor(long bytes) {
 return blockBits;
   }
 
-  // TODO: move this block-based conversion to UnicodeUtil.
-
-  private static final long HALF_SHIFT = 10;
-  private static final int SURROGATE_OFFSET =
-  Character.MIN_SUPPLEMENTARY_CODE_POINT
-  - (UnicodeUtil.UNI_SUR_HIGH_START << HALF_SHIFT)
-  - UnicodeUtil.UNI_SUR_LOW_START;
-
-  /** A consumer-based UTF16-UTF8 encoder (writes the input string in smaller 
buffers.). */
-  private static int UTF16toUTF8(
-  final CharSequence s,
-  final int offset,
-  final int length,
-  byte[] buf,
-  IntConsumer bufferFlusher) {
-int utf8Len = 0;
-int j = 0;
-for (int i = offset, end = offset + length; i < end; i++) {
-  final int chr = (int) s.charAt(i);
-
-  if (j + 4 >= buf.length) {
-bufferFlusher.accept(j);
-utf8Len += j;
-j = 0;
-  }
-
-  if (chr < 0x80) buf[j++] = (byte) chr;
-  else if (chr < 0x800) {
-buf[j++] = (byte) (0xC0 | (chr >> 6));
-buf[j++] = (byte) (0x80 | (chr & 0x3F));
-  } else if (chr < 0xD800 || chr > 0xDFFF) {
-buf[j++] = (byte) (0xE0 | (chr >> 12));
-buf[j++] = (byte) (0x80 | ((chr >> 6) & 0x3F));
-buf[j++] = (byte) (0x80 | (chr & 0x3F));
-  } else {
-// A surrogate pair. Confirm valid high surrogate.
-if (chr < 0xDC00 && (i < end - 1)) {
-  int utf32 = (int) s.charAt(i + 1);
-  // Confirm valid low surrogate and write pair.
-  if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
-utf32 = (chr << 10) + utf32 + SURROGATE_OFFSET;
-i++;
-buf[j++] = (byte) (0xF0 | (utf32 >> 18));
-buf[j++] = (byte) (0x80 | ((utf32 >> 12) & 0x3F));
-buf[j++] = (byte) (0x80 | ((utf32 >> 6) & 0x3F));
-buf[j++] = (byte) (0x80 | (utf32 & 0x3F));
-continue;
-  }
-}
-// Replace unpaired surrogate or out-of-order low surrogate
-// with substitution character.
-buf[j++] = (byte) 0xEF;
-buf[j++] = (byte) 0xBF;
-buf[j++] = (byte) 0xBD;
+  /** Writes a long string in chunks */
+  private void writeLongString(final String s) throws IOException {
+final int byteLen = UnicodeUtil.calcUTF16toUTF8Length(s, 0, s.length());
+writeVInt(byteLen);
+final byte[] buf =
+new byte[Math.min(byteLen, UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR * 
MAX_CHARS_PER_WINDOW)];
+for (int i = 0, end = s.length(); i < end; ) {
+  int step = Math.min(end - i, MAX_CHARS_PER_WINDOW);
+  // don't split a surrogate pair, since the pair together takes 2 bytes 
per char but have space
+  // for 3 bytes per char it's safe to encode one more char here

Review Comment:
   Sure, I guess there's not much value in optimising for one char here :) I 
went with the -1 on the step size now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274813058


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;
+  private final int k;
+
+  protected KnnResults(int k, int visitLimit) {
+this.visitLimit = visitLimit;
+this.k = k;
+  }
+
+  /**
+   * @return is the current result set marked as incomplete?
+   */
+  public final boolean incomplete() {
+return visitedCount >= visitLimit;
+  }
+
+  public final void incVisitedCount(int count) {
+assert count > 0;
+this.visitedCount += count;
+  }
+
+  /**
+   * @return the current visited count
+   */
+  public final int visitedCount() {
+return visitedCount;
+  }
+
+  public final int visitLimit() {
+return visitLimit;
+  }
+
+  public final int k() {

Review Comment:
   @jpountz it allows us to pre-allocate candidate collection. If we don't 
pre-allocate, we will likely have more than one array copy and extension while 
collecting candidates. 
   
   We could call it something else, but I we should still have it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274813661


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;

Review Comment:
   I can do that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274813992


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;
+  private final int k;
+
+  protected KnnResults(int k, int visitLimit) {
+this.visitLimit = visitLimit;
+this.k = k;
+  }
+
+  /**
+   * @return is the current result set marked as incomplete?
+   */
+  public final boolean incomplete() {

Review Comment:
   Or `earlyTerminated`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274813661


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;

Review Comment:
   I can do that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274821505


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;

Review Comment:
   @jpountz, visitLimit is always "

[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274825776


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;
+  private final int k;
+
+  protected KnnResults(int k, int visitLimit) {
+this.visitLimit = visitLimit;
+this.k = k;
+  }
+
+  /**
+   * @return is the current result set marked as incomplete?
+   */
+  public final boolean incomplete() {
+return visitedCount >= visitLimit;
+  }
+
+  public final void incVisitedCount(int count) {
+assert count > 0;
+this.visitedCount += count;
+  }
+
+  /**
+   * @return the current visited count
+   */
+  public final int visitedCount() {
+return visitedCount;
+  }
+
+  public final int visitLimit() {
+return visitLimit;
+  }
+
+  public final int k() {

Review Comment:
   I'm not seeing where it's used to pre-allocate candidate collection? (To be 
clear, I'm only suggesting we remove it from the public API, `TopKnnResults` 
would still take a `k` parameter in its ctor)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274829365


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;

Review Comment:
   Agreed. For reference, points have a similar challenge, see e.g. 
`PointValues#estimateDocCount` for instance which tries to convert a number of 
matching values to a number of matching docs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274825776


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;
+  private final int k;
+
+  protected KnnResults(int k, int visitLimit) {
+this.visitLimit = visitLimit;
+this.k = k;
+  }
+
+  /**
+   * @return is the current result set marked as incomplete?
+   */
+  public final boolean incomplete() {
+return visitedCount >= visitLimit;
+  }
+
+  public final void incVisitedCount(int count) {
+assert count > 0;
+this.visitedCount += count;
+  }
+
+  /**
+   * @return the current visited count
+   */
+  public final int visitedCount() {
+return visitedCount;
+  }
+
+  public final int visitLimit() {
+return visitLimit;
+  }
+
+  public final int k() {

Review Comment:
   I'm not seeing where it's used to pre-allocate candidate collection? (To be 
clear, I'm only suggesting we remove it from the public `KnnResults` API, 
`TopKnnResults` would still take a `k` parameter in its ctor)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274831373


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;
+  private final int k;
+
+  protected KnnResults(int k, int visitLimit) {
+this.visitLimit = visitLimit;
+this.k = k;
+  }
+
+  /**
+   * @return is the current result set marked as incomplete?
+   */
+  public final boolean incomplete() {
+return visitedCount >= visitLimit;
+  }
+
+  public final void incVisitedCount(int count) {
+assert count > 0;
+this.visitedCount += count;
+  }
+
+  /**
+   * @return the current visited count
+   */
+  public final int visitedCount() {
+return visitedCount;
+  }
+
+  public final int visitLimit() {
+return visitLimit;
+  }
+
+  public final int k() {

Review Comment:
   @jpountz 
   ```
   new NeighborQueue(knnResults.k(), true),
   ```
   This is in the 
   ```
   public static KnnResults search(
 byte[] query,
 KnnResults knnResults,
 RandomAccessVectorValues vectors,
 VectorEncoding vectorEncoding,
 VectorSimilarityFunction similarityFunction,
 HnswGraph graph,
 Bits acceptOrds)
   ```
   Method for both `float[]` and `byte[]`. Creating a `maxHeap` for keeping 
track of expected candidates



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274837238


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;

Review Comment:
   @jpountz , Ah, so multi-value would have some inverted 
`estimateVisitedVectors` given some filtered set, and that could be greater 
than `int`. Indeed, a `long` here would be useful in that case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274841199


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;
+  private final int k;
+
+  protected KnnResults(int k, int visitLimit) {
+this.visitLimit = visitLimit;
+this.k = k;
+  }
+
+  /**
+   * @return is the current result set marked as incomplete?
+   */
+  public final boolean incomplete() {
+return visitedCount >= visitLimit;
+  }
+
+  public final void incVisitedCount(int count) {
+assert count > 0;
+this.visitedCount += count;
+  }
+
+  /**
+   * @return the current visited count
+   */
+  public final int visitedCount() {
+return visitedCount;
+  }
+
+  public final int visitLimit() {
+return visitLimit;
+  }
+
+  public final int k() {

Review Comment:
   OK I see it now, sorry for the noise.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274861961


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected int visitedCount;
+  private final int visitLimit;

Review Comment:
   Updated, they are both `long` value now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #12455: Clean up writing String to ByteBuffersDataOutput

2023-07-26 Thread via GitHub


jpountz merged PR #12455:
URL: https://github.com/apache/lucene/pull/12455


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on PR #12434:
URL: https://github.com/apache/lucene/pull/12434#issuecomment-1651725953

   @msokolov && @alessandrobenedetti pinging y'all as you will probably be the 
most interested in this change.
   
   @alessandrobenedetti the original design did take some inspiration from your 
multi-value vector work. However, benchmarking & testing required significant 
changes. For deduplicating parent docIds during search, the hashMap is now part 
of the queue instead of iterating a cache outside  the heap. This improved 
performance significantly.
   
   I would say this is how folks should represent multi-valued vectors when 
they require access to the matching passage or additional metadata. Otherwise, 
deep changes are required in the codec to attach arbitrary metadata to the 
vectors themselves, which seems like overkill to me when we already have `join`.
   
   This does not obviate the need for "true" multi-value vector support (e.g. 
for late-interaction models, or multi-value vectors that don't require 
metadata). This does lay some nice groundwork that can improve that 
implementation (a custom collector that can deduplicate vectors to a docId 
while searching).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274892128


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##
@@ -278,7 +276,7 @@ public TopDocs search(String field, float[] target, int k, 
Bits acceptDocs, int
   }
 
   @Override
-  public TopDocs search(String field, byte[] target, int k, Bits acceptDocs, 
int visitedLimit)
+  public TopDocs search(String field, byte[] target, KnnResults knnResults, 
Bits acceptDocs)

Review Comment:
   I'm not clear if this is a temporary thing and you plan on collecting 
vectors into the `KnnResults` object in a follow-up commit, or if this is the 
way things will be. I don't like that this is bypassing the `KnnResults` 
object, e.g. passing a `ToParentJoinKnnResults` would still return child doc 
IDs instead of parent doc IDs? I'm assuming it's fixable, but maybe this old 
codec makes it challenging?



##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -268,8 +272,68 @@ public abstract TopDocs searchNearestVectors(
* @return the k nearest neighbor documents, along with their 
(searchStrategy-specific) scores.
* @lucene.experimental
*/
+  public final TopDocs searchNearestVectors(
+  String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException {
+return searchNearestVectors(field, target, new TopKnnResults(k, 
visitedLimit), acceptDocs);
+  }
+
+  /**
+   * Return the k nearest neighbor documents as determined by comparison of 
their vector values for
+   * this field, to the given vector, by the field's similarity function. The 
score of each document
+   * is derived from the vector similarity in a way that ensures scores are 
positive and that a
+   * larger score corresponds to a higher ranking.
+   *
+   * The search is allowed to be approximate, meaning the results are not 
guaranteed to be the
+   * true k closest neighbors. For large values of k (for example when k is 
close to the total
+   * number of documents), the search may also retrieve fewer than k documents.
+   *
+   * The returned {@link TopDocs} will contain a {@link ScoreDoc} for each 
nearest neighbor, in
+   * order of their similarity to the query vector (decreasing scores). The 
{@link TotalHits}
+   * contains the number of documents visited during the search. If the search 
stopped early because
+   * it hit {@code visitedLimit}, it is indicated through the relation {@code
+   * TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO}.
+   *
+   * The behavior is undefined if the given field doesn't have KNN vectors 
enabled on its {@link
+   * FieldInfo}. The return value is never {@code null}.
+   *
+   * @param field the vector field to search
+   * @param target the vector-valued query
+   * @param knnResults collector and topK for gathering the vector results
+   * @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
+   * if they are all allowed to match.
+   * @return the k nearest neighbor documents, along with their 
(similarity-specific) scores.
+   */
+  public abstract TopDocs searchNearestVectors(

Review Comment:
   should it return `void` as well and require callers to pull top docs on the 
`KnnResults` object?



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -79,14 +80,13 @@ protected KnnVectorsReader() {}
*
* @param field the vector field to search
* @param target the vector-valued query
-   * @param k the number of docs to return
+   * @param knnResults a KnnResults collector and relevant settings for 
gathering vector results
* @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
* if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is 
allowed to visit
* @return the k nearest neighbor documents, along with their 
(similarity-specific) scores.
*/
   public abstract TopDocs search(
-  String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException;
+  String field, float[] target, KnnResults knnResults, Bits acceptDocs) 
throws IOException;

Review Comment:
   Should these methods return `void`, and make callers responsible for pulling 
top docs from the `KnnResults` object?



##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http:

[GitHub] [lucene] jpountz commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12405:
URL: https://github.com/apache/lucene/pull/12405#discussion_r1274926106


##
lucene/core/src/test/org/apache/lucene/search/TestSortOptimization.java:
##
@@ -149,6 +149,11 @@ public void testLongSortOptimization() throws IOException {
 dir.close();
   }
 
+  /**
+   * test that if we skip docs only with {@link 
org.apache.lucene.index.NumericDocValues}, result
+   * must be equal to no optimization case
+   */

Review Comment:
   did you mean to add a test there?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274930576


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##
@@ -278,7 +276,7 @@ public TopDocs search(String field, float[] target, int k, 
Bits acceptDocs, int
   }
 
   @Override
-  public TopDocs search(String field, byte[] target, int k, Bits acceptDocs, 
int visitedLimit)
+  public TopDocs search(String field, byte[] target, KnnResults knnResults, 
Bits acceptDocs)

Review Comment:
   >  I'm assuming it's fixable, but maybe this old codec makes it challenging?
   
   This old codec doesn't use the typical search path. Honestly, since new 
versions don't allow writing new documents to old codecs, I didn't see support 
for a new way to search as critical. 
   
   They would only need this change if a user indexed `join` documents into 
Lucene90 and haven't been able to deduplicate over parent doc id for the last 
year+. If that is the case, it seems weird that we haven't seen any issues 
related to this :/



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274931211


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -79,14 +80,13 @@ protected KnnVectorsReader() {}
*
* @param field the vector field to search
* @param target the vector-valued query
-   * @param k the number of docs to return
+   * @param knnResults a KnnResults collector and relevant settings for 
gathering vector results
* @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
* if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is 
allowed to visit
* @return the k nearest neighbor documents, along with their 
(similarity-specific) scores.
*/
   public abstract TopDocs search(
-  String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException;
+  String field, float[] target, KnnResults knnResults, Bits acceptDocs) 
throws IOException;

Review Comment:
   🤔 that seems ok to me.



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -79,14 +80,13 @@ protected KnnVectorsReader() {}
*
* @param field the vector field to search
* @param target the vector-valued query
-   * @param k the number of docs to return
+   * @param knnResults a KnnResults collector and relevant settings for 
gathering vector results
* @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
* if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is 
allowed to visit
* @return the k nearest neighbor documents, along with their 
(similarity-specific) scores.
*/
   public abstract TopDocs search(
-  String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException;
+  String field, float[] target, KnnResults knnResults, Bits acceptDocs) 
throws IOException;

Review Comment:
   🤔 that seems ok to me.



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -79,14 +80,13 @@ protected KnnVectorsReader() {}
*
* @param field the vector field to search
* @param target the vector-valued query
-   * @param k the number of docs to return
+   * @param knnResults a KnnResults collector and relevant settings for 
gathering vector results
* @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
* if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is 
allowed to visit
* @return the k nearest neighbor documents, along with their 
(similarity-specific) scores.
*/
   public abstract TopDocs search(
-  String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException;
+  String field, float[] target, KnnResults knnResults, Bits acceptDocs) 
throws IOException;

Review Comment:
   🤔 that seems ok to me. I don't have a strong opinion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274933739


##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -268,8 +272,68 @@ public abstract TopDocs searchNearestVectors(
* @return the k nearest neighbor documents, along with their 
(searchStrategy-specific) scores.
* @lucene.experimental
*/
+  public final TopDocs searchNearestVectors(
+  String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException {
+return searchNearestVectors(field, target, new TopKnnResults(k, 
visitedLimit), acceptDocs);
+  }
+
+  /**
+   * Return the k nearest neighbor documents as determined by comparison of 
their vector values for
+   * this field, to the given vector, by the field's similarity function. The 
score of each document
+   * is derived from the vector similarity in a way that ensures scores are 
positive and that a
+   * larger score corresponds to a higher ranking.
+   *
+   * The search is allowed to be approximate, meaning the results are not 
guaranteed to be the
+   * true k closest neighbors. For large values of k (for example when k is 
close to the total
+   * number of documents), the search may also retrieve fewer than k documents.
+   *
+   * The returned {@link TopDocs} will contain a {@link ScoreDoc} for each 
nearest neighbor, in
+   * order of their similarity to the query vector (decreasing scores). The 
{@link TotalHits}
+   * contains the number of documents visited during the search. If the search 
stopped early because
+   * it hit {@code visitedLimit}, it is indicated through the relation {@code
+   * TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO}.
+   *
+   * The behavior is undefined if the given field doesn't have KNN vectors 
enabled on its {@link
+   * FieldInfo}. The return value is never {@code null}.
+   *
+   * @param field the vector field to search
+   * @param target the vector-valued query
+   * @param knnResults collector and topK for gathering the vector results
+   * @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
+   * if they are all allowed to match.
+   * @return the k nearest neighbor documents, along with their 
(similarity-specific) scores.
+   */
+  public abstract TopDocs searchNearestVectors(

Review Comment:
   If we make the reader void, then this should be as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on issue #12463: Learned sorting algorithm for Lucene

2023-07-26 Thread via GitHub


jpountz commented on issue #12463:
URL: https://github.com/apache/lucene/issues/12463#issuecomment-1651774505

   This is certainly interesting and possibly applicable to Lucene as indexing 
involves a lot of sorting, but also looks complicated to integrate. 
Contributions are welcome. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274934075


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered

Review Comment:
   I am fine with the unified naming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274938033


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected long visitedCount;

Review Comment:
   It should be protected as these collectors can be used during graph building 
and are reused between nodes being indexed. Another option is making 
`earlyTerminated` overridable so that the `GraphBuilderKnnResults` can override 
it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274948941


##
lucene/core/src/java/org/apache/lucene/search/KnnResults.java:
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnResults is a collector for gathering kNN results and providing topDocs 
from the gathered
+ * neighbors
+ */
+public abstract class KnnResults {
+
+  protected long visitedCount;

Review Comment:
   If I make `KnnResults` an interface and make `GraphBuilderKnnResults` 
implement it instead, I could just ignore many of the methods. Then this 
visited count could be private. I will refactor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1274952636


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##
@@ -278,7 +276,7 @@ public TopDocs search(String field, float[] target, int k, 
Bits acceptDocs, int
   }
 
   @Override
-  public TopDocs search(String field, byte[] target, int k, Bits acceptDocs, 
int visitedLimit)
+  public TopDocs search(String field, byte[] target, KnnResults knnResults, 
Bits acceptDocs)

Review Comment:
   You are right, `ToParentJoinKnnResults` was not the best example. I'm mostly 
trying to think of whether bypassing the `KnnResults` object could yield issues 
and if we should collect entries from the result `NeighborQueue` into the 
`KnnResults` object.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1275036311


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -79,14 +80,13 @@ protected KnnVectorsReader() {}
*
* @param field the vector field to search
* @param target the vector-valued query
-   * @param k the number of docs to return
+   * @param knnResults a KnnResults collector and relevant settings for 
gathering vector results
* @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
* if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is 
allowed to visit
* @return the k nearest neighbor documents, along with their 
(similarity-specific) scores.
*/
   public abstract TopDocs search(
-  String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException;
+  String field, float[] target, KnnResults knnResults, Bits acceptDocs) 
throws IOException;

Review Comment:
   Well, this may cause weirdness with Lucene90 :/ If this is a strongly held 
opinion, I can work on figuring out that searcher. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1275050439


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##
@@ -278,7 +276,7 @@ public TopDocs search(String field, float[] target, int k, 
Bits acceptDocs, int
   }
 
   @Override
-  public TopDocs search(String field, byte[] target, int k, Bits acceptDocs, 
int visitedLimit)
+  public TopDocs search(String field, byte[] target, KnnResults knnResults, 
Bits acceptDocs)

Review Comment:
   > if we should collect entries from the result NeighborQueue into the 
KnnResults object.
   
   🤔 I could. This would then allow the change of the method definition to be 
`void` and we always just add things to the collector. 



##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##
@@ -278,7 +276,7 @@ public TopDocs search(String field, float[] target, int k, 
Bits acceptDocs, int
   }
 
   @Override
-  public TopDocs search(String field, byte[] target, int k, Bits acceptDocs, 
int visitedLimit)
+  public TopDocs search(String field, byte[] target, KnnResults knnResults, 
Bits acceptDocs)

Review Comment:
   > if we should collect entries from the result NeighborQueue into the 
KnnResults object.
   
   🤔 I could. This would then allow the change of the method definition to be 
`void` and we always just add things to the collector. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1275147272


##
lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java:
##
@@ -78,8 +79,16 @@ public KnnFloatVectorQuery(String field, float[] target, int 
k, Query filter) {
   @Override
   protected TopDocs approximateSearch(LeafReaderContext context, Bits 
acceptDocs, int visitedLimit)
   throws IOException {
+FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+if (fi == null || fi.getVectorDimension() == 0) {
+  // The field does not exist or does not index vectors
+  return NO_RESULTS;
+}
+int k = Math.min(this.k, 
context.reader().getFloatVectorValues(fi.name).size());
 TopDocs results =
-context.reader().searchNearestVectors(field, target, k, acceptDocs, 
visitedLimit);
+context
+.reader()
+.searchNearestVectors(field, target, new TopKnnCollector(k, 
visitedLimit), acceptDocs);

Review Comment:
   like on the byte vector query, we could push this logic of checking for 
field existence and reducing `k` to `LeafReader`?



##
lucene/core/src/java/org/apache/lucene/search/KnnCollector.java:
##
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnCollector is a knn collector used for gathering kNN results and 
providing topDocs from the
+ * gathered neighbors
+ */
+public interface KnnCollector {
+
+  /**
+   * If search visits too many documents, the results collector will terminate 
early. Usually, this
+   * is due to some restricted filter on the document set.
+   *
+   * When collection is earlyTerminated, the results are not a correct 
representation of k
+   * nearest neighbors.
+   *
+   * @return is the current result set marked as incomplete?
+   */
+  boolean earlyTerminated();
+
+  /**
+   * @param count increments the visited vector count, must be greater than 0.
+   */
+  void incVisitedCount(int count);
+
+  /**
+   * @return the current visited vector count
+   */
+  long visitedCount();
+
+  /**
+   * @return the visited vector limit
+   */
+  long visitLimit();
+
+  /**
+   * @return the expected number of collected results
+   */
+  int k();
+
+  /**
+   * Collect the provided docId and include in the result set.
+   *
+   * @param docId of the vector to collect
+   * @param similarity its calculated similarity
+   * @return true if the vector is collected
+   */
+  boolean collect(int docId, float similarity);
+
+  /**
+   * This method is utilized during search to ensure only competitive results 
are explored.
+   *
+   * Consequently, if this results collector wants to collect `k` results, 
this should return
+   * {@link Float#NEGATIVE_INFINITY} when not full.
+   *
+   * When full, the minimum score should be returned.
+   *
+   * @return the current minimum competitive similarity in the collection
+   */
+  float minCompetitiveSimilarity();
+
+  /**
+   * This drains the collected nearest kNN results and returns them in a new 
{@link TopDocs}
+   * collection, ordered by score descending

Review Comment:
   "drains" suggests it already, but for extra clarity, maybe mention that this 
is a destructive operation



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -79,14 +80,13 @@ protected KnnVectorsReader() {}
*
* @param field the vector field to search
* @param target the vector-valued query
-   * @param k the number of docs to return
+   * @param knnResults a KnnResults collector and relevant settings for 
gathering vector results
* @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
* if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is 
allowed to visit
* @return the k nearest neighbor documents, along with their 
(similarity-specific) scores.
*/
   public abstract TopDocs search(
-  String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException;
+  String field, float[] target, KnnResults knnResults, Bits acceptDocs)

[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1275153663


##
lucene/core/src/java/org/apache/lucene/util/hnsw/TopKnnCollector.java:
##
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import org.apache.lucene.search.AbstractKnnCollector;
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TotalHits;
+
+/**
+ * TopKnnCollector is a specific KnnResults. A minHeap is used to keep track 
of the currently
+ * collected vectors allowing for efficient updates as better vectors are 
collected.
+ */
+public final class TopKnnCollector extends AbstractKnnCollector {
+
+  private final NeighborQueue queue;
+
+  /**
+   * @param k the number of neighbors to collect
+   * @param visitLimit how many vector nodes the results are allowed to visit
+   */
+  public TopKnnCollector(int k, int visitLimit) {
+super(k, visitLimit);
+this.queue = new NeighborQueue(k, false);
+  }
+
+  @Override
+  public boolean collect(int docId, float similarity) {
+return queue.insertWithOverflow(docId, similarity);
+  }
+
+  @Override
+  public float minCompetitiveSimilarity() {
+return queue.size() >= k() ? queue.topScore() : Float.NEGATIVE_INFINITY;
+  }
+
+  @Override
+  public TopDocs topDocs() {
+while (queue.size() > k()) {
+  queue.pop();
+}
+int i = 0;
+ScoreDoc[] scoreDocs = new ScoreDoc[queue.size()];
+while (i < scoreDocs.length) {
+  int node = queue.topNode();
+  float score = queue.topScore();
+  queue.pop();
+  scoreDocs[scoreDocs.length - ++i] = new ScoreDoc(node, score);
+}
+TotalHits.Relation relation =
+earlyTerminated()
+? TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO
+: TotalHits.Relation.EQUAL_TO;
+return new TopDocs(new TotalHits(visitedCount(), relation), scoreDocs);
+  }
+
+  @Override
+  public String toString() {
+return "TopKnnResults[" + queue.size() + "]";

Review Comment:
   ```suggestion
   return "TopKnnResults[k=" + k() + ", size=" + queue.size() + "]";
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12436: Move max vector dims limit to Codec

2023-07-26 Thread via GitHub


jpountz commented on code in PR #12436:
URL: https://github.com/apache/lucene/pull/12436#discussion_r1275169909


##
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##
@@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws 
IOException {
   final Sort indexSort = indexWriterConfig.getIndexSort();
   validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType);
 }
+if (s.vectorDimension != 0) {
+  validateMaxVectorDimension(
+  pf.fieldName,
+  s.vectorDimension,
+  indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
+}

Review Comment:
   I worry that this adds a hashtable lookup on a hot code path. Maybe it's not 
that bad for vectors, which are slow to index anyway, but I'd rather avoid it. 
What about making the codec responsible for checking the limit? Something like 
below:
   
   ```patch
   diff --git 
a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
 
b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
   index cb3e5ef8b10..6c365e53528 100644
   --- 
a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
   +++ 
b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
   @@ -108,6 +108,9 @@ public final class Lucene95HnswVectorsFormat extends 
KnnVectorsFormat {
  public static final int VERSION_START = 0;
  public static final int VERSION_CURRENT = VERSION_START;

   +  /** A maximum number of vector dimensions supported by this codeс */
   +  public static final int MAX_DIMENSIONS = 1024;
   +
  /**
   * A maximum configurable maximum max conn.
   *
   @@ -177,7 +180,7 @@ public final class Lucene95HnswVectorsFormat extends 
KnnVectorsFormat {

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws 
IOException {
   -return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth);
   +return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth, 
MAX_DIMENSIONS);
  }

  @Override
   diff --git 
a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
 
b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
   index 5358d66f16e..196f12a21ad 100644
   --- 
a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
   +++ 
b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
   @@ -60,13 +60,15 @@ public final class Lucene95HnswVectorsWriter extends 
KnnVectorsWriter {
  private final IndexOutput meta, vectorData, vectorIndex;
  private final int M;
  private final int beamWidth;
   +  private final int maxDimension;

  private final List> fields = new ArrayList<>();
  private boolean finished;

   -  Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth) 
throws IOException {
   +  Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth, 
int maxDimension) throws IOException {
this.M = M;
this.beamWidth = beamWidth;
   +this.maxDimension = maxDimension;
segmentWriteState = state;
String metaFileName =
IndexFileNames.segmentFileName(
   @@ -117,6 +119,9 @@ public final class Lucene95HnswVectorsWriter extends 
KnnVectorsWriter {

  @Override
  public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) throws 
IOException {
   +if (fieldInfo.getVectorDimension() > maxDimension) {
   +  throw new IllegalArgumentException("Number of dimensions " + 
fieldInfo.getVectorDimension() + " for field " + fieldInfo.name + " exceeds the 
limit of " + maxDimension);
   +}
FieldWriter newField =
FieldWriter.create(fieldInfo, M, beamWidth, 
segmentWriteState.infoStream);
fields.add(newField);
   ```



##
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##
@@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws 
IOException {
   final Sort indexSort = indexWriterConfig.getIndexSort();
   validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType);
 }
+if (s.vectorDimension != 0) {
+  validateMaxVectorDimension(
+  pf.fieldName,
+  s.vectorDimension,
+  indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
+}

Review Comment:
   I worry that this adds a hashtable lookup on a hot code path. Maybe it's not 
that bad for vectors, which are slow to index anyway, but I'd rather avoid it. 
What about making the codec responsible for checking the limit? Something like 
below:
   
   ```patch
   diff --git 
a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
 
b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java

[GitHub] [lucene] jpountz closed pull request #12435: Remove sort for uniqueValues in NumericDocValues

2023-07-26 Thread via GitHub


jpountz closed pull request #12435: Remove sort for uniqueValues in 
NumericDocValues
URL: https://github.com/apache/lucene/pull/12435


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12435: Remove sort for uniqueValues in NumericDocValues

2023-07-26 Thread via GitHub


jpountz commented on PR #12435:
URL: https://github.com/apache/lucene/pull/12435#issuecomment-1652157003

   Closing as per the above comment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1275240534


##
lucene/core/src/java/org/apache/lucene/util/hnsw/TopKnnCollector.java:
##
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;

Review Comment:
   @jpountz I am not sure, other nearest neighbor algos (like ivfpq) require 
more parameters that we currently don't provide. If we were to do something 
other than HNSW, I think a different collector would be required providing 
separate options.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1275257249


##
lucene/core/src/java/org/apache/lucene/search/KnnCollector.java:
##
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnCollector is a knn collector used for gathering kNN results and 
providing topDocs from the
+ * gathered neighbors
+ */
+public interface KnnCollector {
+
+  /**
+   * If search visits too many documents, the results collector will terminate 
early. Usually, this
+   * is due to some restricted filter on the document set.
+   *
+   * When collection is earlyTerminated, the results are not a correct 
representation of k
+   * nearest neighbors.
+   *
+   * @return is the current result set marked as incomplete?
+   */
+  boolean earlyTerminated();
+
+  /**
+   * @param count increments the visited vector count, must be greater than 0.
+   */
+  void incVisitedCount(int count);
+
+  /**
+   * @return the current visited vector count
+   */
+  long visitedCount();
+
+  /**
+   * @return the visited vector limit
+   */
+  long visitLimit();
+
+  /**
+   * @return the expected number of collected results
+   */
+  int k();
+
+  /**
+   * Collect the provided docId and include in the result set.
+   *
+   * @param docId of the vector to collect
+   * @param similarity its calculated similarity
+   * @return true if the vector is collected
+   */
+  boolean collect(int docId, float similarity);
+
+  /**
+   * This method is utilized during search to ensure only competitive results 
are explored.
+   *
+   * Consequently, if this results collector wants to collect `k` results, 
this should return
+   * {@link Float#NEGATIVE_INFINITY} when not full.
+   *
+   * When full, the minimum score should be returned.
+   *
+   * @return the current minimum competitive similarity in the collection
+   */
+  float minCompetitiveSimilarity();
+
+  /**
+   * This drains the collected nearest kNN results and returns them in a new 
{@link TopDocs}
+   * collection, ordered by score descending

Review Comment:
   will do!



##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnCollector.java:
##
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * AbstractKnnCollector is the default implementation for a knn collector used 
for gathering kNN
+ * results and providing topDocs from the gathered neighbors
+ */
+public abstract class AbstractKnnCollector implements KnnCollector {
+
+  private long visitedCount;
+  private final long visitLimit;
+  private final int k;
+
+  protected AbstractKnnCollector(int k, long visitLimit) {
+this.visitLimit = visitLimit;
+this.k = k;
+  }
+
+  /**
+   * If search visits too many documents, the results collector will terminate 
early. Usually, this
+   * is due to some restricted filter on the document set.
+   *
+   * When collection is earlyTerminated, the results are not a correct 
representation of k
+   * nearest neighbors.
+   *
+   * @return is the current result set marked as incomplete?
+   */
+  @Override
+  public final boolean earlyTerminated() {
+return visitedCount >= visitLimit;
+  }
+
+  /**
+   * @param count increments the visited vector count, must be greater than 0.
+   */
+  @Override
+  public final void incVisitedCount(int count) {
+assert count > 0;
+

[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1275258761


##
lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java:
##
@@ -77,8 +78,16 @@ public KnnByteVectorQuery(String field, byte[] target, int 
k, Query filter) {
   @Override
   protected TopDocs approximateSearch(LeafReaderContext context, Bits 
acceptDocs, int visitedLimit)
   throws IOException {
+FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+if (fi == null || fi.getVectorDimension() == 0) {
+  // The field does not exist or does not index vectors
+  return NO_RESULTS;
+}
+int k = Math.min(this.k, 
context.reader().getByteVectorValues(fi.name).size());

Review Comment:
   > that takes an int k, and call it here instead of the expert method that 
takes a KnnCollector?
   
   🤔 I suppose? This way this queries use the `final` method from the 
leafReader and don't have to worry about this.



##
lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java:
##
@@ -77,8 +78,16 @@ public KnnByteVectorQuery(String field, byte[] target, int 
k, Query filter) {
   @Override
   protected TopDocs approximateSearch(LeafReaderContext context, Bits 
acceptDocs, int visitedLimit)
   throws IOException {
+FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+if (fi == null || fi.getVectorDimension() == 0) {
+  // The field does not exist or does not index vectors
+  return NO_RESULTS;
+}
+int k = Math.min(this.k, 
context.reader().getByteVectorValues(fi.name).size());

Review Comment:
   > that takes an int k, and call it here instead of the expert method that 
takes a KnnCollector?
   
   🤔 I suppose? This way this queries use the `final` method from the 
leafReader and don't have to worry about this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support

2023-07-26 Thread via GitHub


benwtrent commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1275357936


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -79,14 +80,13 @@ protected KnnVectorsReader() {}
*
* @param field the vector field to search
* @param target the vector-valued query
-   * @param k the number of docs to return
+   * @param knnResults a KnnResults collector and relevant settings for 
gathering vector results
* @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
* if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is 
allowed to visit
* @return the k nearest neighbor documents, along with their 
(similarity-specific) scores.
*/
   public abstract TopDocs search(
-  String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException;
+  String field, float[] target, KnnResults knnResults, Bits acceptDocs) 
throws IOException;

Review Comment:
   Since this method accepts a collector, it should be void. Methods that do 
not accept a collector and instead create one itself continue to return 
`TopDocs` where appropriate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

2023-07-26 Thread via GitHub


jmazanec15 commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1652318052

   Thanks @searchivarius sounds good.
   
   Additonally, @benwtrent in regards to 
https://github.com/apache/lucene/issues/12342#issuecomment-1650293826, I ran 
the non-transformed reverse test when using `np.flip(np_flat_ds_sorted, 
axis=0)` and got pretty high recall numbers:
   ```
   0.7100.86   40  0   48  200 10  2135520 1.00
post-filter
   0.9734.03   40  90  48  200 100 2150331 1.00
post-filter
   0.9916.72   40  190 48  200 200 2151494 1.00
post-filter
   0.998   14.23   40  490 48  200 500 2169778 1.00
post-filter
   ```
   
   Did you change this for your tests? I can repeat the set of tests again, but 
want to just confirm everything looks good


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #12436: Move max vector dims limit to Codec

2023-07-26 Thread via GitHub


mayya-sharipova commented on code in PR #12436:
URL: https://github.com/apache/lucene/pull/12436#discussion_r1275409313


##
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##
@@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws 
IOException {
   final Sort indexSort = indexWriterConfig.getIndexSort();
   validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType);
 }
+if (s.vectorDimension != 0) {
+  validateMaxVectorDimension(
+  pf.fieldName,
+  s.vectorDimension,
+  indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
+}

Review Comment:
   @jpountz Thank you for the additional feedback.
   
   > I worry that this adds a hashtable lookup on a hot code path. Maybe it's 
not that bad for vectors, which are slow to index anyway, but I'd rather avoid 
it. 
   
   This is not really a hot code path. We ask for 
`getCodec().knnVectorsFormat().getMaxDimensions`   in the `initializeFieldInfo` 
function, that happens only once per a new field per segment.
   
   > What about making the codec responsible for checking the limit?
   
   I experimented with this idea, and encountered the following difficulty with 
it:
   -  we need to create a new `FieldInfo` before passing it to 
`KnnFieldVectorsWriter addField(FieldInfo fieldInfo)`. 
   - The way we create it is : `FieldInfo fi = fieldInfos.add(` by adding to 
the global fieldInfos. This means that if `FieldInfo` contains incorrect number 
of dimensions, it will be stored like this in the global fieldInfos, and we 
can't change it (for example with a second document with correct number of 
dims).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #12436: Move max vector dims limit to Codec

2023-07-26 Thread via GitHub


mayya-sharipova commented on code in PR #12436:
URL: https://github.com/apache/lucene/pull/12436#discussion_r1275409313


##
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##
@@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws 
IOException {
   final Sort indexSort = indexWriterConfig.getIndexSort();
   validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType);
 }
+if (s.vectorDimension != 0) {
+  validateMaxVectorDimension(
+  pf.fieldName,
+  s.vectorDimension,
+  indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
+}

Review Comment:
   @jpountz Thank you for the additional feedback.
   
   > I worry that this adds a hashtable lookup on a hot code path. Maybe it's 
not that bad for vectors, which are slow to index anyway, but I'd rather avoid 
it. 
   
   This is not really a hot code path. We ask for 
`getCodec().knnVectorsFormat().getMaxDimensions`   in the `initializeFieldInfo` 
function, that happens only once per a new field per segment.
   
   > What about making the codec responsible for checking the limit?
   
   Thanks for the suggestion, I experimented with this idea, and encountered 
the following difficulty with it:
   -  we need to create a new `FieldInfo` before passing it to 
`KnnFieldVectorsWriter addField(FieldInfo fieldInfo)`. 
   - The way we create it is : `FieldInfo fi = fieldInfos.add(` by adding to 
the global fieldInfos. This means that if `FieldInfo` contains incorrect number 
of dimensions, it will be stored like this in the global fieldInfos, and we 
can't change it (for example with a second document with correct number of 
dims).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] donnerpeter opened a new pull request, #12464: hunspell: make the hash table load factor customizable

2023-07-26 Thread via GitHub


donnerpeter opened a new pull request, #12464:
URL: https://github.com/apache/lucene/pull/12464

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #12436: Move max vector dims limit to Codec

2023-07-26 Thread via GitHub


mayya-sharipova commented on code in PR #12436:
URL: https://github.com/apache/lucene/pull/12436#discussion_r1275409313


##
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##
@@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws 
IOException {
   final Sort indexSort = indexWriterConfig.getIndexSort();
   validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType);
 }
+if (s.vectorDimension != 0) {
+  validateMaxVectorDimension(
+  pf.fieldName,
+  s.vectorDimension,
+  indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
+}

Review Comment:
   @jpountz Thank you for the additional feedback.
   
   > I worry that this adds a hashtable lookup on a hot code path. Maybe it's 
not that bad for vectors, which are slow to index anyway, but I'd rather avoid 
it. 
   
   This is not really a hot code path. We ask for 
`getCodec().knnVectorsFormat().getMaxDimensions`   in the `initializeFieldInfo` 
function, that happens only once per a new field per segment.
   
   > What about making the codec responsible for checking the limit?
   
   Thanks for the suggestion, I experimented with this idea, and encountered 
the following difficulty with it:
   -  we need to create a new `FieldInfo` before passing it to 
`KnnFieldVectorsWriter addField(FieldInfo fieldInfo)`. 
   - The way we create it is : `FieldInfo fi = fieldInfos.add(` by adding to 
the global fieldInfos. This means that if `FieldInfo` contains incorrect number 
of dimensions, it will be stored like this in the global fieldInfos, and we 
can't change it (for example with a second document with correct number of 
dims).
   
   May be as an alternative we can do a validation as a separate method of 
`KnnVectorsWriter`:
   ```java
   public void validateFieldDims(int dims)
   ```
   
   What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gautamworah96 opened a new pull request, #12465: Potential bug in IndexedDISI90#SPARSE->advanceExactWithinBlock

2023-07-26 Thread via GitHub


gautamworah96 opened a new pull request, #12465:
URL: https://github.com/apache/lucene/pull/12465

   While upgrading our Lucene to 9.7 from 9.6 we noticed that a few tests were 
failing.
   
   On digging deeper into the tests we realized that they were failing on a 
particular assert. This one:
   
   ```
   if (disi.nextExistDocInBlock > targetInBlock) {
 assert !disi.exists;
 return false;
   }
   ```
   
   This changes the behavior of this function as compared to the behavior in 
Lucene 9.6 (and before). It could be possible that we intentionally decided to 
change the behavior of this function, but to my eye, it feels like that was not 
the case?
   
   I attached a test that shows the assert failing. The test tries to call 
`advanceExact` which underneath it calls `advanceExactWithinBlock` on a 
document that is behind the current doc position. In the previous version of 
Lucene this would return a `false` and would exit. But with the upgrade, it 
fails mysteriously on an assert. 
   
   I created this PR because IIUC the assert checks for something that is 
completely unrelated to the current method call? Why would we want to assert on 
the existence of a previous document when calling for `advanceExactWithinBlock` 
on a new document.. My preference here would be to remove just the assert 
statement and to revert to the previous behavior (of returning false when 
calling for a document that is behind the current cursor position). 
   
   I could also be wrong though. Would appreciate a second look from some folks 
in the community.. Thanks!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] LuXugang commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator

2023-07-26 Thread via GitHub


LuXugang commented on code in PR #12405:
URL: https://github.com/apache/lucene/pull/12405#discussion_r1275760334


##
lucene/core/src/java/org/apache/lucene/util/NumericUtils.java:
##
@@ -132,6 +132,52 @@ public static void add(int bytesPerDim, int dim, byte[] a, 
byte[] b, byte[] resu
 }
   }
 
+  /**
+   * Modify in-place the given bytes to the next value of the same length that 
compares greater than
+   * the current value. This returns false if, and only if, the value is 
currently equal to the
+   * maximum possible value.
+   */
+  public static boolean nextUp(byte[] bytes) {

Review Comment:
   `nextup` and `nextdown` only work well for int/long
   
   double/float case can't pass:
   ```java
 public void testNextUpDouble() {
   for (double i : new double[] {Double.MIN_VALUE, Double.MAX_VALUE}) {
 byte[] b = new byte[Long.BYTES];
 long a = NumericUtils.doubleToSortableLong(i);
 NumericUtils.longToSortableBytes(a, b, 0);
 assertFalse(NumericUtils.nextUp(b));
 assertEquals(i, 
NumericUtils.sortableLongToDouble(NumericUtils.sortableBytesToLong(b, 0)), 0.0);
   }
 }
   ```
   
   Because `Double.MIN_VALUE`, `Double.MAX_VALUE`'s byte values are: [-128, 0, 
0, 0, 0, 0, 0, 1] and [-1, -17, -1, -1, -1, -1, -1, -1]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] LuXugang commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator

2023-07-26 Thread via GitHub


LuXugang commented on code in PR #12405:
URL: https://github.com/apache/lucene/pull/12405#discussion_r1275760334


##
lucene/core/src/java/org/apache/lucene/util/NumericUtils.java:
##
@@ -132,6 +132,52 @@ public static void add(int bytesPerDim, int dim, byte[] a, 
byte[] b, byte[] resu
 }
   }
 
+  /**
+   * Modify in-place the given bytes to the next value of the same length that 
compares greater than
+   * the current value. This returns false if, and only if, the value is 
currently equal to the
+   * maximum possible value.
+   */
+  public static boolean nextUp(byte[] bytes) {

Review Comment:
   @jpountz  `nextup` and `nextdown` seems only work well for int/long
   
   double/float case can't pass:
   ```java
 public void testNextUpDouble() {
   for (double i : new double[] {Double.MIN_VALUE, Double.MAX_VALUE}) {
 byte[] b = new byte[Long.BYTES];
 long a = NumericUtils.doubleToSortableLong(i);
 NumericUtils.longToSortableBytes(a, b, 0);
 if(i == Double.MIN_VALUE){
   assertFalse(NumericUtils.nextDown(b));
 }else {
   assertFalse(NumericUtils.nextUp(b));
 }
   }
 }
   ```
   
   Because `Double.MIN_VALUE`, `Double.MAX_VALUE`'s byte values are: [-128, 0, 
0, 0, 0, 0, 0, 1] and [-1, -17, -1, -1, -1, -1, -1, -1]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org