[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support
jpountz commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274490843 ## lucene/core/src/java/org/apache/lucene/util/hnsw/KnnResults.java: ## @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.hnsw; + +import org.apache.lucene.search.ScoreDoc; +import org.apache.lucene.search.TopDocs; +import org.apache.lucene.search.TotalHits; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + /** KnnResults when exiting search early and returning empty top docs */ + static class EmptyKnnResults extends KnnResults { +public EmptyKnnResults(int k, int visitedCount, int visitLimit) { + super(k, visitLimit); + this.visitedCount = visitedCount; +} + +@Override +public void doClear() {} + +@Override +public boolean collect(int vectorId, float similarity) { + throw new IllegalArgumentException(); +} + +@Override +public boolean isFull() { + return true; +} + +@Override +public float minSimilarity() { + return 0; +} + +@Override +public TopDocs topDocs() { + TotalHits th = new TotalHits(visitedCount, TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO); + return new TopDocs(th, new ScoreDoc[0]); +} + } + + static class OrdinalTranslatedKnnResults extends KnnResults { +private final KnnResults in; +private final IntToIntFunction vectorOrdinalToDocId; + +OrdinalTranslatedKnnResults(KnnResults in, IntToIntFunction vectorOrdinalToDocId) { + super(in.k, in.visitLimit); + this.in = in; + this.vectorOrdinalToDocId = vectorOrdinalToDocId; +} + +@Override +void doClear() { + in.clear(); +} + +@Override +boolean collect(int vectorId, float similarity) { + return in.collect(vectorOrdinalToDocId.apply(vectorId), similarity); +} + +@Override +boolean isFull() { + return in.isFull(); +} + +@Override +float minSimilarity() { + return in.minSimilarity(); +} + +@Override +public TopDocs topDocs() { + TopDocs td = in.topDocs(); + return new TopDocs( + new TotalHits( + visitedCount(), + incomplete() + ? TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO + : TotalHits.Relation.EQUAL_TO), + td.scoreDocs); +} + } + + protected int visitedCount; + private final int visitLimit; + private final int k; + + protected KnnResults(int k, int visitLimit) { +this.visitLimit = visitLimit; +this.k = k; + } + + final void clear() { +this.visitedCount = 0; +doClear(); + } + + /** Clear the current results. */ + abstract void doClear(); + + /** + * @return is the current result set marked as incomplete? + */ + final boolean incomplete() { +return visitedCount >= visitLimit; + } + + final void incVisitedCount(int count) { +assert count > 0; +this.visitedCount += count; + } + + /** + * @return the current visited count + */ + final int visitedCount() { +return visitedCount; + } + + final int visitLimit() { +return visitLimit; + } + + public final int k() { +return k; + } + + /** + * Collect the provided vectorId and include in the result set. + * + * @param vectorId the vector to collect + * @param similarity its calculated similarity + * @return true if the vector is collected + */ + abstract boolean collect(int vectorId, float similarity); + + /** + * @return Is the current result set considered full + */ + abstract boolean isFull(); Review Comment: I see what you are saying, but I think I still like the "trusting implementers" option better. The semantics I'd like for this method is something like "the minimum similarity for a vector to be competitive", so it would naturally be NEGATIVE_INFINITY as long as the queue is not full. If we don't trust implementers, then we need to update javadocs of `minSimilarity()` to add something like "it is only legal to call this method when isFull() returns true" which isn'
[GitHub] [lucene-jira-archive] uschindler closed pull request #151: Bump certifi from 2022.6.15 to 2023.7.22 in /migration
uschindler closed pull request #151: Bump certifi from 2022.6.15 to 2023.7.22 in /migration URL: https://github.com/apache/lucene-jira-archive/pull/151 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] uschindler closed pull request #150: Bump requests from 2.28.0 to 2.31.0 in /migration
uschindler closed pull request #150: Bump requests from 2.28.0 to 2.31.0 in /migration URL: https://github.com/apache/lucene-jira-archive/pull/150 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] dependabot[bot] commented on pull request #151: Bump certifi from 2022.6.15 to 2023.7.22 in /migration
dependabot[bot] commented on PR #151: URL: https://github.com/apache/lucene-jira-archive/pull/151#issuecomment-1651200940 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] dependabot[bot] commented on pull request #150: Bump requests from 2.28.0 to 2.31.0 in /migration
dependabot[bot] commented on PR #150: URL: https://github.com/apache/lucene-jira-archive/pull/150#issuecomment-1651201006 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support
jpountz commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274552571 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; + private final int k; + + protected KnnResults(int k, int visitLimit) { +this.visitLimit = visitLimit; +this.k = k; + } + + /** + * @return is the current result set marked as incomplete? + */ + public final boolean incomplete() { +return visitedCount >= visitLimit; + } + + public final void incVisitedCount(int count) { +assert count > 0; +this.visitedCount += count; + } + + /** + * @return the current visited count + */ + public final int visitedCount() { +return visitedCount; + } + + public final int visitLimit() { +return visitLimit; + } + + public final int k() { Review Comment: I wonder if we can avoid adding it to the public API, the search logic doesn't need to know how many hits are being searched? ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; Review Comment: Thinking out loud: maybe we should make these two quantities `long`s instead of `int`s so that we don't have to break it when introducing multi-value support. ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; + private final int k; + + protected KnnResults(int k, int visitLimit) { +this.visitLimit = visitLimit; +this.k = k; + } + + /** + * @return is the current result set marked as incomplete? + */ + public final boolean incomplete() { +return visitedCount >= visitLimit; + } + + public final void incVisitedCount(int count) { +assert count > 0; +this
[GitHub] [lucene] jpountz commented on pull request #12405: Skip docs with Docvalues in NumericLeafComparator
jpountz commented on PR #12405: URL: https://github.com/apache/lucene/pull/12405#issuecomment-1651405438 @LuXugang I pushed a commit with the changes I had in mind, what do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] original-brownbear commented on a diff in pull request #12455: Clean up writing String to ByteBuffersDataOutput
original-brownbear commented on code in PR #12455: URL: https://github.com/apache/lucene/pull/12455#discussion_r1274811181 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java: ## @@ -549,66 +540,22 @@ private static int computeBlockSizeBitsFor(long bytes) { return blockBits; } - // TODO: move this block-based conversion to UnicodeUtil. - - private static final long HALF_SHIFT = 10; - private static final int SURROGATE_OFFSET = - Character.MIN_SUPPLEMENTARY_CODE_POINT - - (UnicodeUtil.UNI_SUR_HIGH_START << HALF_SHIFT) - - UnicodeUtil.UNI_SUR_LOW_START; - - /** A consumer-based UTF16-UTF8 encoder (writes the input string in smaller buffers.). */ - private static int UTF16toUTF8( - final CharSequence s, - final int offset, - final int length, - byte[] buf, - IntConsumer bufferFlusher) { -int utf8Len = 0; -int j = 0; -for (int i = offset, end = offset + length; i < end; i++) { - final int chr = (int) s.charAt(i); - - if (j + 4 >= buf.length) { -bufferFlusher.accept(j); -utf8Len += j; -j = 0; - } - - if (chr < 0x80) buf[j++] = (byte) chr; - else if (chr < 0x800) { -buf[j++] = (byte) (0xC0 | (chr >> 6)); -buf[j++] = (byte) (0x80 | (chr & 0x3F)); - } else if (chr < 0xD800 || chr > 0xDFFF) { -buf[j++] = (byte) (0xE0 | (chr >> 12)); -buf[j++] = (byte) (0x80 | ((chr >> 6) & 0x3F)); -buf[j++] = (byte) (0x80 | (chr & 0x3F)); - } else { -// A surrogate pair. Confirm valid high surrogate. -if (chr < 0xDC00 && (i < end - 1)) { - int utf32 = (int) s.charAt(i + 1); - // Confirm valid low surrogate and write pair. - if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) { -utf32 = (chr << 10) + utf32 + SURROGATE_OFFSET; -i++; -buf[j++] = (byte) (0xF0 | (utf32 >> 18)); -buf[j++] = (byte) (0x80 | ((utf32 >> 12) & 0x3F)); -buf[j++] = (byte) (0x80 | ((utf32 >> 6) & 0x3F)); -buf[j++] = (byte) (0x80 | (utf32 & 0x3F)); -continue; - } -} -// Replace unpaired surrogate or out-of-order low surrogate -// with substitution character. -buf[j++] = (byte) 0xEF; -buf[j++] = (byte) 0xBF; -buf[j++] = (byte) 0xBD; + /** Writes a long string in chunks */ + private void writeLongString(final String s) throws IOException { +final int byteLen = UnicodeUtil.calcUTF16toUTF8Length(s, 0, s.length()); +writeVInt(byteLen); +final byte[] buf = +new byte[Math.min(byteLen, UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR * MAX_CHARS_PER_WINDOW)]; +for (int i = 0, end = s.length(); i < end; ) { + int step = Math.min(end - i, MAX_CHARS_PER_WINDOW); + // don't split a surrogate pair, since the pair together takes 2 bytes per char but have space + // for 3 bytes per char it's safe to encode one more char here Review Comment: Sure, I guess there's not much value in optimising for one char here :) I went with the -1 on the step size now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274813058 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; + private final int k; + + protected KnnResults(int k, int visitLimit) { +this.visitLimit = visitLimit; +this.k = k; + } + + /** + * @return is the current result set marked as incomplete? + */ + public final boolean incomplete() { +return visitedCount >= visitLimit; + } + + public final void incVisitedCount(int count) { +assert count > 0; +this.visitedCount += count; + } + + /** + * @return the current visited count + */ + public final int visitedCount() { +return visitedCount; + } + + public final int visitLimit() { +return visitLimit; + } + + public final int k() { Review Comment: @jpountz it allows us to pre-allocate candidate collection. If we don't pre-allocate, we will likely have more than one array copy and extension while collecting candidates. We could call it something else, but I we should still have it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274813661 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; Review Comment: I can do that -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274813992 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; + private final int k; + + protected KnnResults(int k, int visitLimit) { +this.visitLimit = visitLimit; +this.k = k; + } + + /** + * @return is the current result set marked as incomplete? + */ + public final boolean incomplete() { Review Comment: Or `earlyTerminated`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274813661 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; Review Comment: I can do that -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274821505 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; Review Comment: @jpountz, visitLimit is always "
[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support
jpountz commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274825776 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; + private final int k; + + protected KnnResults(int k, int visitLimit) { +this.visitLimit = visitLimit; +this.k = k; + } + + /** + * @return is the current result set marked as incomplete? + */ + public final boolean incomplete() { +return visitedCount >= visitLimit; + } + + public final void incVisitedCount(int count) { +assert count > 0; +this.visitedCount += count; + } + + /** + * @return the current visited count + */ + public final int visitedCount() { +return visitedCount; + } + + public final int visitLimit() { +return visitLimit; + } + + public final int k() { Review Comment: I'm not seeing where it's used to pre-allocate candidate collection? (To be clear, I'm only suggesting we remove it from the public API, `TopKnnResults` would still take a `k` parameter in its ctor) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support
jpountz commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274829365 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; Review Comment: Agreed. For reference, points have a similar challenge, see e.g. `PointValues#estimateDocCount` for instance which tries to convert a number of matching values to a number of matching docs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support
jpountz commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274825776 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; + private final int k; + + protected KnnResults(int k, int visitLimit) { +this.visitLimit = visitLimit; +this.k = k; + } + + /** + * @return is the current result set marked as incomplete? + */ + public final boolean incomplete() { +return visitedCount >= visitLimit; + } + + public final void incVisitedCount(int count) { +assert count > 0; +this.visitedCount += count; + } + + /** + * @return the current visited count + */ + public final int visitedCount() { +return visitedCount; + } + + public final int visitLimit() { +return visitLimit; + } + + public final int k() { Review Comment: I'm not seeing where it's used to pre-allocate candidate collection? (To be clear, I'm only suggesting we remove it from the public `KnnResults` API, `TopKnnResults` would still take a `k` parameter in its ctor) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274831373 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; + private final int k; + + protected KnnResults(int k, int visitLimit) { +this.visitLimit = visitLimit; +this.k = k; + } + + /** + * @return is the current result set marked as incomplete? + */ + public final boolean incomplete() { +return visitedCount >= visitLimit; + } + + public final void incVisitedCount(int count) { +assert count > 0; +this.visitedCount += count; + } + + /** + * @return the current visited count + */ + public final int visitedCount() { +return visitedCount; + } + + public final int visitLimit() { +return visitLimit; + } + + public final int k() { Review Comment: @jpountz ``` new NeighborQueue(knnResults.k(), true), ``` This is in the ``` public static KnnResults search( byte[] query, KnnResults knnResults, RandomAccessVectorValues vectors, VectorEncoding vectorEncoding, VectorSimilarityFunction similarityFunction, HnswGraph graph, Bits acceptOrds) ``` Method for both `float[]` and `byte[]`. Creating a `maxHeap` for keeping track of expected candidates -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274837238 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; Review Comment: @jpountz , Ah, so multi-value would have some inverted `estimateVisitedVectors` given some filtered set, and that could be greater than `int`. Indeed, a `long` here would be useful in that case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support
jpountz commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274841199 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; + private final int k; + + protected KnnResults(int k, int visitLimit) { +this.visitLimit = visitLimit; +this.k = k; + } + + /** + * @return is the current result set marked as incomplete? + */ + public final boolean incomplete() { +return visitedCount >= visitLimit; + } + + public final void incVisitedCount(int count) { +assert count > 0; +this.visitedCount += count; + } + + /** + * @return the current visited count + */ + public final int visitedCount() { +return visitedCount; + } + + public final int visitLimit() { +return visitLimit; + } + + public final int k() { Review Comment: OK I see it now, sorry for the noise. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274861961 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected int visitedCount; + private final int visitLimit; Review Comment: Updated, they are both `long` value now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #12455: Clean up writing String to ByteBuffersDataOutput
jpountz merged PR #12455: URL: https://github.com/apache/lucene/pull/12455 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support
benwtrent commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1651725953 @msokolov && @alessandrobenedetti pinging y'all as you will probably be the most interested in this change. @alessandrobenedetti the original design did take some inspiration from your multi-value vector work. However, benchmarking & testing required significant changes. For deduplicating parent docIds during search, the hashMap is now part of the queue instead of iterating a cache outside the heap. This improved performance significantly. I would say this is how folks should represent multi-valued vectors when they require access to the matching passage or additional metadata. Otherwise, deep changes are required in the codec to attach arbitrary metadata to the vectors themselves, which seems like overkill to me when we already have `join`. This does not obviate the need for "true" multi-value vector support (e.g. for late-interaction models, or multi-value vectors that don't require metadata). This does lay some nice groundwork that can improve that implementation (a custom collector that can deduplicate vectors to a docId while searching). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support
jpountz commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274892128 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java: ## @@ -278,7 +276,7 @@ public TopDocs search(String field, float[] target, int k, Bits acceptDocs, int } @Override - public TopDocs search(String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) + public TopDocs search(String field, byte[] target, KnnResults knnResults, Bits acceptDocs) Review Comment: I'm not clear if this is a temporary thing and you plan on collecting vectors into the `KnnResults` object in a follow-up commit, or if this is the way things will be. I don't like that this is bypassing the `KnnResults` object, e.g. passing a `ToParentJoinKnnResults` would still return child doc IDs instead of parent doc IDs? I'm assuming it's fixable, but maybe this old codec makes it challenging? ## lucene/core/src/java/org/apache/lucene/index/LeafReader.java: ## @@ -268,8 +272,68 @@ public abstract TopDocs searchNearestVectors( * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores. * @lucene.experimental */ + public final TopDocs searchNearestVectors( + String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException { +return searchNearestVectors(field, target, new TopKnnResults(k, visitedLimit), acceptDocs); + } + + /** + * Return the k nearest neighbor documents as determined by comparison of their vector values for + * this field, to the given vector, by the field's similarity function. The score of each document + * is derived from the vector similarity in a way that ensures scores are positive and that a + * larger score corresponds to a higher ranking. + * + * The search is allowed to be approximate, meaning the results are not guaranteed to be the + * true k closest neighbors. For large values of k (for example when k is close to the total + * number of documents), the search may also retrieve fewer than k documents. + * + * The returned {@link TopDocs} will contain a {@link ScoreDoc} for each nearest neighbor, in + * order of their similarity to the query vector (decreasing scores). The {@link TotalHits} + * contains the number of documents visited during the search. If the search stopped early because + * it hit {@code visitedLimit}, it is indicated through the relation {@code + * TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO}. + * + * The behavior is undefined if the given field doesn't have KNN vectors enabled on its {@link + * FieldInfo}. The return value is never {@code null}. + * + * @param field the vector field to search + * @param target the vector-valued query + * @param knnResults collector and topK for gathering the vector results + * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} + * if they are all allowed to match. + * @return the k nearest neighbor documents, along with their (similarity-specific) scores. + */ + public abstract TopDocs searchNearestVectors( Review Comment: should it return `void` as well and require callers to pull top docs on the `KnnResults` object? ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -79,14 +80,13 @@ protected KnnVectorsReader() {} * * @param field the vector field to search * @param target the vector-valued query - * @param k the number of docs to return + * @param knnResults a KnnResults collector and relevant settings for gathering vector results * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} * if they are all allowed to match. - * @param visitedLimit the maximum number of nodes that the search is allowed to visit * @return the k nearest neighbor documents, along with their (similarity-specific) scores. */ public abstract TopDocs search( - String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException; + String field, float[] target, KnnResults knnResults, Bits acceptDocs) throws IOException; Review Comment: Should these methods return `void`, and make callers responsible for pulling top docs from the `KnnResults` object? ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http:
[GitHub] [lucene] jpountz commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator
jpountz commented on code in PR #12405: URL: https://github.com/apache/lucene/pull/12405#discussion_r1274926106 ## lucene/core/src/test/org/apache/lucene/search/TestSortOptimization.java: ## @@ -149,6 +149,11 @@ public void testLongSortOptimization() throws IOException { dir.close(); } + /** + * test that if we skip docs only with {@link org.apache.lucene.index.NumericDocValues}, result + * must be equal to no optimization case + */ Review Comment: did you mean to add a test there? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274930576 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java: ## @@ -278,7 +276,7 @@ public TopDocs search(String field, float[] target, int k, Bits acceptDocs, int } @Override - public TopDocs search(String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) + public TopDocs search(String field, byte[] target, KnnResults knnResults, Bits acceptDocs) Review Comment: > I'm assuming it's fixable, but maybe this old codec makes it challenging? This old codec doesn't use the typical search path. Honestly, since new versions don't allow writing new documents to old codecs, I didn't see support for a new way to search as critical. They would only need this change if a user indexed `join` documents into Lucene90 and haven't been able to deduplicate over parent doc id for the last year+. If that is the case, it seems weird that we haven't seen any issues related to this :/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274931211 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -79,14 +80,13 @@ protected KnnVectorsReader() {} * * @param field the vector field to search * @param target the vector-valued query - * @param k the number of docs to return + * @param knnResults a KnnResults collector and relevant settings for gathering vector results * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} * if they are all allowed to match. - * @param visitedLimit the maximum number of nodes that the search is allowed to visit * @return the k nearest neighbor documents, along with their (similarity-specific) scores. */ public abstract TopDocs search( - String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException; + String field, float[] target, KnnResults knnResults, Bits acceptDocs) throws IOException; Review Comment: 🤔 that seems ok to me. ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -79,14 +80,13 @@ protected KnnVectorsReader() {} * * @param field the vector field to search * @param target the vector-valued query - * @param k the number of docs to return + * @param knnResults a KnnResults collector and relevant settings for gathering vector results * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} * if they are all allowed to match. - * @param visitedLimit the maximum number of nodes that the search is allowed to visit * @return the k nearest neighbor documents, along with their (similarity-specific) scores. */ public abstract TopDocs search( - String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException; + String field, float[] target, KnnResults knnResults, Bits acceptDocs) throws IOException; Review Comment: 🤔 that seems ok to me. ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -79,14 +80,13 @@ protected KnnVectorsReader() {} * * @param field the vector field to search * @param target the vector-valued query - * @param k the number of docs to return + * @param knnResults a KnnResults collector and relevant settings for gathering vector results * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} * if they are all allowed to match. - * @param visitedLimit the maximum number of nodes that the search is allowed to visit * @return the k nearest neighbor documents, along with their (similarity-specific) scores. */ public abstract TopDocs search( - String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException; + String field, float[] target, KnnResults knnResults, Bits acceptDocs) throws IOException; Review Comment: 🤔 that seems ok to me. I don't have a strong opinion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274933739 ## lucene/core/src/java/org/apache/lucene/index/LeafReader.java: ## @@ -268,8 +272,68 @@ public abstract TopDocs searchNearestVectors( * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores. * @lucene.experimental */ + public final TopDocs searchNearestVectors( + String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException { +return searchNearestVectors(field, target, new TopKnnResults(k, visitedLimit), acceptDocs); + } + + /** + * Return the k nearest neighbor documents as determined by comparison of their vector values for + * this field, to the given vector, by the field's similarity function. The score of each document + * is derived from the vector similarity in a way that ensures scores are positive and that a + * larger score corresponds to a higher ranking. + * + * The search is allowed to be approximate, meaning the results are not guaranteed to be the + * true k closest neighbors. For large values of k (for example when k is close to the total + * number of documents), the search may also retrieve fewer than k documents. + * + * The returned {@link TopDocs} will contain a {@link ScoreDoc} for each nearest neighbor, in + * order of their similarity to the query vector (decreasing scores). The {@link TotalHits} + * contains the number of documents visited during the search. If the search stopped early because + * it hit {@code visitedLimit}, it is indicated through the relation {@code + * TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO}. + * + * The behavior is undefined if the given field doesn't have KNN vectors enabled on its {@link + * FieldInfo}. The return value is never {@code null}. + * + * @param field the vector field to search + * @param target the vector-valued query + * @param knnResults collector and topK for gathering the vector results + * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} + * if they are all allowed to match. + * @return the k nearest neighbor documents, along with their (similarity-specific) scores. + */ + public abstract TopDocs searchNearestVectors( Review Comment: If we make the reader void, then this should be as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on issue #12463: Learned sorting algorithm for Lucene
jpountz commented on issue #12463: URL: https://github.com/apache/lucene/issues/12463#issuecomment-1651774505 This is certainly interesting and possibly applicable to Lucene as indexing involves a lot of sorting, but also looks complicated to integrate. Contributions are welcome. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274934075 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered Review Comment: I am fine with the unified naming. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274938033 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected long visitedCount; Review Comment: It should be protected as these collectors can be used during graph building and are reused between nodes being indexed. Another option is making `earlyTerminated` overridable so that the `GraphBuilderKnnResults` can override it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274948941 ## lucene/core/src/java/org/apache/lucene/search/KnnResults.java: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnResults is a collector for gathering kNN results and providing topDocs from the gathered + * neighbors + */ +public abstract class KnnResults { + + protected long visitedCount; Review Comment: If I make `KnnResults` an interface and make `GraphBuilderKnnResults` implement it instead, I could just ignore many of the methods. Then this visited count could be private. I will refactor. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support
jpountz commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1274952636 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java: ## @@ -278,7 +276,7 @@ public TopDocs search(String field, float[] target, int k, Bits acceptDocs, int } @Override - public TopDocs search(String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) + public TopDocs search(String field, byte[] target, KnnResults knnResults, Bits acceptDocs) Review Comment: You are right, `ToParentJoinKnnResults` was not the best example. I'm mostly trying to think of whether bypassing the `KnnResults` object could yield issues and if we should collect entries from the result `NeighborQueue` into the `KnnResults` object. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1275036311 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -79,14 +80,13 @@ protected KnnVectorsReader() {} * * @param field the vector field to search * @param target the vector-valued query - * @param k the number of docs to return + * @param knnResults a KnnResults collector and relevant settings for gathering vector results * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} * if they are all allowed to match. - * @param visitedLimit the maximum number of nodes that the search is allowed to visit * @return the k nearest neighbor documents, along with their (similarity-specific) scores. */ public abstract TopDocs search( - String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException; + String field, float[] target, KnnResults knnResults, Bits acceptDocs) throws IOException; Review Comment: Well, this may cause weirdness with Lucene90 :/ If this is a strongly held opinion, I can work on figuring out that searcher. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1275050439 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java: ## @@ -278,7 +276,7 @@ public TopDocs search(String field, float[] target, int k, Bits acceptDocs, int } @Override - public TopDocs search(String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) + public TopDocs search(String field, byte[] target, KnnResults knnResults, Bits acceptDocs) Review Comment: > if we should collect entries from the result NeighborQueue into the KnnResults object. 🤔 I could. This would then allow the change of the method definition to be `void` and we always just add things to the collector. ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java: ## @@ -278,7 +276,7 @@ public TopDocs search(String field, float[] target, int k, Bits acceptDocs, int } @Override - public TopDocs search(String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) + public TopDocs search(String field, byte[] target, KnnResults knnResults, Bits acceptDocs) Review Comment: > if we should collect entries from the result NeighborQueue into the KnnResults object. 🤔 I could. This would then allow the change of the method definition to be `void` and we always just add things to the collector. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support
jpountz commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1275147272 ## lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java: ## @@ -78,8 +79,16 @@ public KnnFloatVectorQuery(String field, float[] target, int k, Query filter) { @Override protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException { +FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field); +if (fi == null || fi.getVectorDimension() == 0) { + // The field does not exist or does not index vectors + return NO_RESULTS; +} +int k = Math.min(this.k, context.reader().getFloatVectorValues(fi.name).size()); TopDocs results = -context.reader().searchNearestVectors(field, target, k, acceptDocs, visitedLimit); +context +.reader() +.searchNearestVectors(field, target, new TopKnnCollector(k, visitedLimit), acceptDocs); Review Comment: like on the byte vector query, we could push this logic of checking for field existence and reducing `k` to `LeafReader`? ## lucene/core/src/java/org/apache/lucene/search/KnnCollector.java: ## @@ -0,0 +1,85 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnCollector is a knn collector used for gathering kNN results and providing topDocs from the + * gathered neighbors + */ +public interface KnnCollector { + + /** + * If search visits too many documents, the results collector will terminate early. Usually, this + * is due to some restricted filter on the document set. + * + * When collection is earlyTerminated, the results are not a correct representation of k + * nearest neighbors. + * + * @return is the current result set marked as incomplete? + */ + boolean earlyTerminated(); + + /** + * @param count increments the visited vector count, must be greater than 0. + */ + void incVisitedCount(int count); + + /** + * @return the current visited vector count + */ + long visitedCount(); + + /** + * @return the visited vector limit + */ + long visitLimit(); + + /** + * @return the expected number of collected results + */ + int k(); + + /** + * Collect the provided docId and include in the result set. + * + * @param docId of the vector to collect + * @param similarity its calculated similarity + * @return true if the vector is collected + */ + boolean collect(int docId, float similarity); + + /** + * This method is utilized during search to ensure only competitive results are explored. + * + * Consequently, if this results collector wants to collect `k` results, this should return + * {@link Float#NEGATIVE_INFINITY} when not full. + * + * When full, the minimum score should be returned. + * + * @return the current minimum competitive similarity in the collection + */ + float minCompetitiveSimilarity(); + + /** + * This drains the collected nearest kNN results and returns them in a new {@link TopDocs} + * collection, ordered by score descending Review Comment: "drains" suggests it already, but for extra clarity, maybe mention that this is a destructive operation ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -79,14 +80,13 @@ protected KnnVectorsReader() {} * * @param field the vector field to search * @param target the vector-valued query - * @param k the number of docs to return + * @param knnResults a KnnResults collector and relevant settings for gathering vector results * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} * if they are all allowed to match. - * @param visitedLimit the maximum number of nodes that the search is allowed to visit * @return the k nearest neighbor documents, along with their (similarity-specific) scores. */ public abstract TopDocs search( - String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException; + String field, float[] target, KnnResults knnResults, Bits acceptDocs)
[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support
jpountz commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1275153663 ## lucene/core/src/java/org/apache/lucene/util/hnsw/TopKnnCollector.java: ## @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.hnsw; + +import org.apache.lucene.search.AbstractKnnCollector; +import org.apache.lucene.search.ScoreDoc; +import org.apache.lucene.search.TopDocs; +import org.apache.lucene.search.TotalHits; + +/** + * TopKnnCollector is a specific KnnResults. A minHeap is used to keep track of the currently + * collected vectors allowing for efficient updates as better vectors are collected. + */ +public final class TopKnnCollector extends AbstractKnnCollector { + + private final NeighborQueue queue; + + /** + * @param k the number of neighbors to collect + * @param visitLimit how many vector nodes the results are allowed to visit + */ + public TopKnnCollector(int k, int visitLimit) { +super(k, visitLimit); +this.queue = new NeighborQueue(k, false); + } + + @Override + public boolean collect(int docId, float similarity) { +return queue.insertWithOverflow(docId, similarity); + } + + @Override + public float minCompetitiveSimilarity() { +return queue.size() >= k() ? queue.topScore() : Float.NEGATIVE_INFINITY; + } + + @Override + public TopDocs topDocs() { +while (queue.size() > k()) { + queue.pop(); +} +int i = 0; +ScoreDoc[] scoreDocs = new ScoreDoc[queue.size()]; +while (i < scoreDocs.length) { + int node = queue.topNode(); + float score = queue.topScore(); + queue.pop(); + scoreDocs[scoreDocs.length - ++i] = new ScoreDoc(node, score); +} +TotalHits.Relation relation = +earlyTerminated() +? TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO +: TotalHits.Relation.EQUAL_TO; +return new TopDocs(new TotalHits(visitedCount(), relation), scoreDocs); + } + + @Override + public String toString() { +return "TopKnnResults[" + queue.size() + "]"; Review Comment: ```suggestion return "TopKnnResults[k=" + k() + ", size=" + queue.size() + "]"; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12436: Move max vector dims limit to Codec
jpountz commented on code in PR #12436: URL: https://github.com/apache/lucene/pull/12436#discussion_r1275169909 ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws IOException { final Sort indexSort = indexWriterConfig.getIndexSort(); validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType); } +if (s.vectorDimension != 0) { + validateMaxVectorDimension( + pf.fieldName, + s.vectorDimension, + indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions()); +} Review Comment: I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it. What about making the codec responsible for checking the limit? Something like below: ```patch diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java index cb3e5ef8b10..6c365e53528 100644 --- a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java +++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java @@ -108,6 +108,9 @@ public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat { public static final int VERSION_START = 0; public static final int VERSION_CURRENT = VERSION_START; + /** A maximum number of vector dimensions supported by this codeс */ + public static final int MAX_DIMENSIONS = 1024; + /** * A maximum configurable maximum max conn. * @@ -177,7 +180,7 @@ public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat { @Override public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException { -return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth); +return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth, MAX_DIMENSIONS); } @Override diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java index 5358d66f16e..196f12a21ad 100644 --- a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java +++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java @@ -60,13 +60,15 @@ public final class Lucene95HnswVectorsWriter extends KnnVectorsWriter { private final IndexOutput meta, vectorData, vectorIndex; private final int M; private final int beamWidth; + private final int maxDimension; private final List> fields = new ArrayList<>(); private boolean finished; - Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth) throws IOException { + Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth, int maxDimension) throws IOException { this.M = M; this.beamWidth = beamWidth; +this.maxDimension = maxDimension; segmentWriteState = state; String metaFileName = IndexFileNames.segmentFileName( @@ -117,6 +119,9 @@ public final class Lucene95HnswVectorsWriter extends KnnVectorsWriter { @Override public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) throws IOException { +if (fieldInfo.getVectorDimension() > maxDimension) { + throw new IllegalArgumentException("Number of dimensions " + fieldInfo.getVectorDimension() + " for field " + fieldInfo.name + " exceeds the limit of " + maxDimension); +} FieldWriter newField = FieldWriter.create(fieldInfo, M, beamWidth, segmentWriteState.infoStream); fields.add(newField); ``` ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws IOException { final Sort indexSort = indexWriterConfig.getIndexSort(); validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType); } +if (s.vectorDimension != 0) { + validateMaxVectorDimension( + pf.fieldName, + s.vectorDimension, + indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions()); +} Review Comment: I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it. What about making the codec responsible for checking the limit? Something like below: ```patch diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
[GitHub] [lucene] jpountz closed pull request #12435: Remove sort for uniqueValues in NumericDocValues
jpountz closed pull request #12435: Remove sort for uniqueValues in NumericDocValues URL: https://github.com/apache/lucene/pull/12435 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12435: Remove sort for uniqueValues in NumericDocValues
jpountz commented on PR #12435: URL: https://github.com/apache/lucene/pull/12435#issuecomment-1652157003 Closing as per the above comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1275240534 ## lucene/core/src/java/org/apache/lucene/util/hnsw/TopKnnCollector.java: ## @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.hnsw; Review Comment: @jpountz I am not sure, other nearest neighbor algos (like ivfpq) require more parameters that we currently don't provide. If we were to do something other than HNSW, I think a different collector would be required providing separate options. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1275257249 ## lucene/core/src/java/org/apache/lucene/search/KnnCollector.java: ## @@ -0,0 +1,85 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * KnnCollector is a knn collector used for gathering kNN results and providing topDocs from the + * gathered neighbors + */ +public interface KnnCollector { + + /** + * If search visits too many documents, the results collector will terminate early. Usually, this + * is due to some restricted filter on the document set. + * + * When collection is earlyTerminated, the results are not a correct representation of k + * nearest neighbors. + * + * @return is the current result set marked as incomplete? + */ + boolean earlyTerminated(); + + /** + * @param count increments the visited vector count, must be greater than 0. + */ + void incVisitedCount(int count); + + /** + * @return the current visited vector count + */ + long visitedCount(); + + /** + * @return the visited vector limit + */ + long visitLimit(); + + /** + * @return the expected number of collected results + */ + int k(); + + /** + * Collect the provided docId and include in the result set. + * + * @param docId of the vector to collect + * @param similarity its calculated similarity + * @return true if the vector is collected + */ + boolean collect(int docId, float similarity); + + /** + * This method is utilized during search to ensure only competitive results are explored. + * + * Consequently, if this results collector wants to collect `k` results, this should return + * {@link Float#NEGATIVE_INFINITY} when not full. + * + * When full, the minimum score should be returned. + * + * @return the current minimum competitive similarity in the collection + */ + float minCompetitiveSimilarity(); + + /** + * This drains the collected nearest kNN results and returns them in a new {@link TopDocs} + * collection, ordered by score descending Review Comment: will do! ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnCollector.java: ## @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +/** + * AbstractKnnCollector is the default implementation for a knn collector used for gathering kNN + * results and providing topDocs from the gathered neighbors + */ +public abstract class AbstractKnnCollector implements KnnCollector { + + private long visitedCount; + private final long visitLimit; + private final int k; + + protected AbstractKnnCollector(int k, long visitLimit) { +this.visitLimit = visitLimit; +this.k = k; + } + + /** + * If search visits too many documents, the results collector will terminate early. Usually, this + * is due to some restricted filter on the document set. + * + * When collection is earlyTerminated, the results are not a correct representation of k + * nearest neighbors. + * + * @return is the current result set marked as incomplete? + */ + @Override + public final boolean earlyTerminated() { +return visitedCount >= visitLimit; + } + + /** + * @param count increments the visited vector count, must be greater than 0. + */ + @Override + public final void incVisitedCount(int count) { +assert count > 0; +
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1275258761 ## lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java: ## @@ -77,8 +78,16 @@ public KnnByteVectorQuery(String field, byte[] target, int k, Query filter) { @Override protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException { +FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field); +if (fi == null || fi.getVectorDimension() == 0) { + // The field does not exist or does not index vectors + return NO_RESULTS; +} +int k = Math.min(this.k, context.reader().getByteVectorValues(fi.name).size()); Review Comment: > that takes an int k, and call it here instead of the expert method that takes a KnnCollector? 🤔 I suppose? This way this queries use the `final` method from the leafReader and don't have to worry about this. ## lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java: ## @@ -77,8 +78,16 @@ public KnnByteVectorQuery(String field, byte[] target, int k, Query filter) { @Override protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException { +FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field); +if (fi == null || fi.getVectorDimension() == 0) { + // The field does not exist or does not index vectors + return NO_RESULTS; +} +int k = Math.min(this.k, context.reader().getByteVectorValues(fi.name).size()); Review Comment: > that takes an int k, and call it here instead of the expert method that takes a KnnCollector? 🤔 I suppose? This way this queries use the `final` method from the leafReader and don't have to worry about this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on a diff in pull request #12434: Add ParentJoin KNN support
benwtrent commented on code in PR #12434: URL: https://github.com/apache/lucene/pull/12434#discussion_r1275357936 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -79,14 +80,13 @@ protected KnnVectorsReader() {} * * @param field the vector field to search * @param target the vector-valued query - * @param k the number of docs to return + * @param knnResults a KnnResults collector and relevant settings for gathering vector results * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} * if they are all allowed to match. - * @param visitedLimit the maximum number of nodes that the search is allowed to visit * @return the k nearest neighbor documents, along with their (similarity-specific) scores. */ public abstract TopDocs search( - String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException; + String field, float[] target, KnnResults knnResults, Bits acceptDocs) throws IOException; Review Comment: Since this method accepts a collector, it should be void. Methods that do not accept a collector and instead create one itself continue to return `TopDocs` where appropriate. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jmazanec15 commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores
jmazanec15 commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1652318052 Thanks @searchivarius sounds good. Additonally, @benwtrent in regards to https://github.com/apache/lucene/issues/12342#issuecomment-1650293826, I ran the non-transformed reverse test when using `np.flip(np_flat_ds_sorted, axis=0)` and got pretty high recall numbers: ``` 0.7100.86 40 0 48 200 10 2135520 1.00 post-filter 0.9734.03 40 90 48 200 100 2150331 1.00 post-filter 0.9916.72 40 190 48 200 200 2151494 1.00 post-filter 0.998 14.23 40 490 48 200 500 2169778 1.00 post-filter ``` Did you change this for your tests? I can repeat the set of tests again, but want to just confirm everything looks good -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #12436: Move max vector dims limit to Codec
mayya-sharipova commented on code in PR #12436: URL: https://github.com/apache/lucene/pull/12436#discussion_r1275409313 ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws IOException { final Sort indexSort = indexWriterConfig.getIndexSort(); validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType); } +if (s.vectorDimension != 0) { + validateMaxVectorDimension( + pf.fieldName, + s.vectorDimension, + indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions()); +} Review Comment: @jpountz Thank you for the additional feedback. > I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it. This is not really a hot code path. We ask for `getCodec().knnVectorsFormat().getMaxDimensions` in the `initializeFieldInfo` function, that happens only once per a new field per segment. > What about making the codec responsible for checking the limit? I experimented with this idea, and encountered the following difficulty with it: - we need to create a new `FieldInfo` before passing it to `KnnFieldVectorsWriter addField(FieldInfo fieldInfo)`. - The way we create it is : `FieldInfo fi = fieldInfos.add(` by adding to the global fieldInfos. This means that if `FieldInfo` contains incorrect number of dimensions, it will be stored like this in the global fieldInfos, and we can't change it (for example with a second document with correct number of dims). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #12436: Move max vector dims limit to Codec
mayya-sharipova commented on code in PR #12436: URL: https://github.com/apache/lucene/pull/12436#discussion_r1275409313 ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws IOException { final Sort indexSort = indexWriterConfig.getIndexSort(); validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType); } +if (s.vectorDimension != 0) { + validateMaxVectorDimension( + pf.fieldName, + s.vectorDimension, + indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions()); +} Review Comment: @jpountz Thank you for the additional feedback. > I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it. This is not really a hot code path. We ask for `getCodec().knnVectorsFormat().getMaxDimensions` in the `initializeFieldInfo` function, that happens only once per a new field per segment. > What about making the codec responsible for checking the limit? Thanks for the suggestion, I experimented with this idea, and encountered the following difficulty with it: - we need to create a new `FieldInfo` before passing it to `KnnFieldVectorsWriter addField(FieldInfo fieldInfo)`. - The way we create it is : `FieldInfo fi = fieldInfos.add(` by adding to the global fieldInfos. This means that if `FieldInfo` contains incorrect number of dimensions, it will be stored like this in the global fieldInfos, and we can't change it (for example with a second document with correct number of dims). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] donnerpeter opened a new pull request, #12464: hunspell: make the hash table load factor customizable
donnerpeter opened a new pull request, #12464: URL: https://github.com/apache/lucene/pull/12464 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #12436: Move max vector dims limit to Codec
mayya-sharipova commented on code in PR #12436: URL: https://github.com/apache/lucene/pull/12436#discussion_r1275409313 ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws IOException { final Sort indexSort = indexWriterConfig.getIndexSort(); validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType); } +if (s.vectorDimension != 0) { + validateMaxVectorDimension( + pf.fieldName, + s.vectorDimension, + indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions()); +} Review Comment: @jpountz Thank you for the additional feedback. > I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it. This is not really a hot code path. We ask for `getCodec().knnVectorsFormat().getMaxDimensions` in the `initializeFieldInfo` function, that happens only once per a new field per segment. > What about making the codec responsible for checking the limit? Thanks for the suggestion, I experimented with this idea, and encountered the following difficulty with it: - we need to create a new `FieldInfo` before passing it to `KnnFieldVectorsWriter addField(FieldInfo fieldInfo)`. - The way we create it is : `FieldInfo fi = fieldInfos.add(` by adding to the global fieldInfos. This means that if `FieldInfo` contains incorrect number of dimensions, it will be stored like this in the global fieldInfos, and we can't change it (for example with a second document with correct number of dims). May be as an alternative we can do a validation as a separate method of `KnnVectorsWriter`: ```java public void validateFieldDims(int dims) ``` What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 opened a new pull request, #12465: Potential bug in IndexedDISI90#SPARSE->advanceExactWithinBlock
gautamworah96 opened a new pull request, #12465: URL: https://github.com/apache/lucene/pull/12465 While upgrading our Lucene to 9.7 from 9.6 we noticed that a few tests were failing. On digging deeper into the tests we realized that they were failing on a particular assert. This one: ``` if (disi.nextExistDocInBlock > targetInBlock) { assert !disi.exists; return false; } ``` This changes the behavior of this function as compared to the behavior in Lucene 9.6 (and before). It could be possible that we intentionally decided to change the behavior of this function, but to my eye, it feels like that was not the case? I attached a test that shows the assert failing. The test tries to call `advanceExact` which underneath it calls `advanceExactWithinBlock` on a document that is behind the current doc position. In the previous version of Lucene this would return a `false` and would exit. But with the upgrade, it fails mysteriously on an assert. I created this PR because IIUC the assert checks for something that is completely unrelated to the current method call? Why would we want to assert on the existence of a previous document when calling for `advanceExactWithinBlock` on a new document.. My preference here would be to remove just the assert statement and to revert to the previous behavior (of returning false when calling for a document that is behind the current cursor position). I could also be wrong though. Would appreciate a second look from some folks in the community.. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] LuXugang commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator
LuXugang commented on code in PR #12405: URL: https://github.com/apache/lucene/pull/12405#discussion_r1275760334 ## lucene/core/src/java/org/apache/lucene/util/NumericUtils.java: ## @@ -132,6 +132,52 @@ public static void add(int bytesPerDim, int dim, byte[] a, byte[] b, byte[] resu } } + /** + * Modify in-place the given bytes to the next value of the same length that compares greater than + * the current value. This returns false if, and only if, the value is currently equal to the + * maximum possible value. + */ + public static boolean nextUp(byte[] bytes) { Review Comment: `nextup` and `nextdown` only work well for int/long double/float case can't pass: ```java public void testNextUpDouble() { for (double i : new double[] {Double.MIN_VALUE, Double.MAX_VALUE}) { byte[] b = new byte[Long.BYTES]; long a = NumericUtils.doubleToSortableLong(i); NumericUtils.longToSortableBytes(a, b, 0); assertFalse(NumericUtils.nextUp(b)); assertEquals(i, NumericUtils.sortableLongToDouble(NumericUtils.sortableBytesToLong(b, 0)), 0.0); } } ``` Because `Double.MIN_VALUE`, `Double.MAX_VALUE`'s byte values are: [-128, 0, 0, 0, 0, 0, 0, 1] and [-1, -17, -1, -1, -1, -1, -1, -1] -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] LuXugang commented on a diff in pull request #12405: Skip docs with Docvalues in NumericLeafComparator
LuXugang commented on code in PR #12405: URL: https://github.com/apache/lucene/pull/12405#discussion_r1275760334 ## lucene/core/src/java/org/apache/lucene/util/NumericUtils.java: ## @@ -132,6 +132,52 @@ public static void add(int bytesPerDim, int dim, byte[] a, byte[] b, byte[] resu } } + /** + * Modify in-place the given bytes to the next value of the same length that compares greater than + * the current value. This returns false if, and only if, the value is currently equal to the + * maximum possible value. + */ + public static boolean nextUp(byte[] bytes) { Review Comment: @jpountz `nextup` and `nextdown` seems only work well for int/long double/float case can't pass: ```java public void testNextUpDouble() { for (double i : new double[] {Double.MIN_VALUE, Double.MAX_VALUE}) { byte[] b = new byte[Long.BYTES]; long a = NumericUtils.doubleToSortableLong(i); NumericUtils.longToSortableBytes(a, b, 0); if(i == Double.MIN_VALUE){ assertFalse(NumericUtils.nextDown(b)); }else { assertFalse(NumericUtils.nextUp(b)); } } } ``` Because `Double.MIN_VALUE`, `Double.MAX_VALUE`'s byte values are: [-128, 0, 0, 0, 0, 0, 0, 1] and [-1, -17, -1, -1, -1, -1, -1, -1] -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org