Re: [PR] Make TaskExecutor cx public and use TaskExecutor for concurrent HNSW graph build [lucene]
shubhamvishu commented on code in PR #12799: URL: https://github.com/apache/lucene/pull/12799#discussion_r1396959558

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswConcurrentMergeBuilder.java:

```
@@ -77,42 +75,17 @@ public OnHeapHnswGraph build(int maxOrd) throws IOException {
           HNSW_COMPONENT, "build graph from " + maxOrd + " vectors, with " + workers.length + " workers");
     }
-    List<Future<?>> futures = new ArrayList<>();
+    List<Callable<Void>> futures = new ArrayList<>();
     for (int i = 0; i < workers.length; i++) {
       int finalI = i;
       futures.add(
-          exec.submit(
-              () -> {
-                try {
-                  workers[finalI].run(maxOrd);
-                } catch (IOException e) {
-                  throw new RuntimeException(e);
-                }
-              }));
-    }
-    Throwable exc = null;
-    for (Future<?> future : futures) {
-      try {
-        future.get();
-      } catch (InterruptedException e) {
-        var newException = new ThreadInterruptedException(e);
-        if (exc == null) {
-          exc = newException;
-        } else {
-          exc.addSuppressed(newException);
-        }
-      } catch (ExecutionException e) {
-        if (exc == null) {
-          exc = e.getCause();
-        } else {
-          exc.addSuppressed(e.getCause());
-        }
-      }
-    }
-    if (exc != null) {
-      // The error handling was copied from TaskExecutor. should we just use TaskExecutor instead?
-      throw IOUtils.rethrowAlways(exc);
+          () -> {
+            workers[finalI].run(maxOrd);
+            return null;
+          });
     }
+    TaskExecutor taskExecutor = new TaskExecutor(exec);
```

Review Comment: I think with this we are only creating one TaskExecutor instance per KNN field, because the concurrent graph building happens in `HnswConcurrentMergeBuilder#build` (where we make use of concurrency), so moving the TaskExecutor creation up the hierarchy (to `ConcurrentHnswMerger`) shouldn't make any difference: we would still be creating one instance per KNN field. Or maybe I'm wrong? I'm happy to change it if that's not the case, or if you feel it's still a better approach to pass a TaskExecutor from above instead of an executor service. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
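For reference, the hand-rolled error aggregation that this PR replaces with `TaskExecutor` follows a common pattern: wait on every future, keep the first failure, and attach later failures via `addSuppressed`. Below is a self-contained sketch of that pattern using plain `java.util.concurrent` (not Lucene code; `awaitAll` is a hypothetical helper named for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AggregateErrors {

  // Hypothetical helper mirroring the hand-rolled pattern the PR deletes:
  // wait on all futures, keep the first failure, suppress the rest.
  static void awaitAll(List<Future<?>> futures) throws Exception {
    Throwable exc = null;
    for (Future<?> future : futures) {
      try {
        future.get();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        if (exc == null) exc = e; else exc.addSuppressed(e);
      } catch (ExecutionException e) {
        if (exc == null) exc = e.getCause(); else exc.addSuppressed(e.getCause());
      }
    }
    if (exc instanceof Exception) throw (Exception) exc;
    if (exc != null) throw new RuntimeException(exc);
  }

  public static void main(String[] args) throws Exception {
    ExecutorService exec = Executors.newFixedThreadPool(2);
    List<Future<?>> futures = new ArrayList<>();
    futures.add(exec.submit((Callable<Void>) () -> null)); // succeeds
    futures.add(
        exec.submit(
            (Callable<Void>) () -> {
              throw new RuntimeException("boom"); // fails
            }));
    Exception caught = null;
    try {
      awaitAll(futures);
    } catch (Exception e) {
      caught = e;
    }
    exec.shutdown();
    System.out.println("caught: " + (caught == null ? "nothing" : caught.getMessage()));
  }
}
```

Centralizing this in one class is exactly the deduplication the PR aims for: every call site that submits a batch of tasks otherwise has to repeat the suppress-and-rethrow dance.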
Re: [PR] LUCENE-9951: Add InfoStream to ReplicationService [lucene]
mikemccand commented on PR #124: URL: https://github.com/apache/lucene/pull/124#issuecomment-1816528624 OK thank you for bringing closure @ChristophKaser.
[PR] Dry up DirectReader implementations [lucene]
original-brownbear opened a new pull request, #12823: URL: https://github.com/apache/lucene/pull/12823 This can be written in a much drier way that shouldn't come at any performance cost, as far as I can see.
Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]
mikemccand commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-1816539511 Thanks @shubhamvishu and @dweiss and @bruno-roustant. Hashing is fun and hard :)
Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]
Shibi-bala commented on PR #12626: URL: https://github.com/apache/lucene/pull/12626#issuecomment-1816600417 @uschindler hey, thanks for the approval! I read the contributing guidelines, but I'm not entirely sure how to get permission to merge this PR.
Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]
uschindler commented on PR #12626: URL: https://github.com/apache/lucene/pull/12626#issuecomment-1816614768 You can't do it. Please add a CHANGES entry under the 9.9 section and commit it to the branch, and I will merge and backport your PR. I am just away from my computer at the moment, sorry for the delay.
Re: [PR] LUCENE-10241: Updating OpenNLP to 1.9.4. [lucene]
cpoerschke merged PR #448: URL: https://github.com/apache/lucene/pull/448
Re: [I] Update OpenNLP to 1.9.4 [LUCENE-10241] [lucene]
cpoerschke closed issue #11277: Update OpenNLP to 1.9.4 [LUCENE-10241] URL: https://github.com/apache/lucene/issues/11277
Re: [I] Update OpenNLP to 1.9.4 [LUCENE-10241] [lucene]
cpoerschke commented on issue #11277: URL: https://github.com/apache/lucene/issues/11277#issuecomment-1816701100 #448 is the merged `main` branch pull request and https://github.com/apache/lucene/commit/b8094d49aaf5e5cb5182c0307e25eafa2d332dda is the `branch_9x` commit. Thanks @jzonthemtn!
Re: [PR] Re-use information from graph traversal during exact search [lucene]
kaivalnp commented on PR #12820: URL: https://github.com/apache/lucene/pull/12820#issuecomment-1816720340 Thanks @jpountz! I realised something from your comment: My current implementation has a flaw, because it cannot handle the [`OrdinalTranslatedKnnCollector`](https://github.com/kaivalnp/lucene/blob/03624754eb3ccf9da114ff5fc358cc230466ea3f/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L232) correctly: The [`setVisited`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L253) call has the `BitSet visited` as packed ordinals, but the [`getVisited`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java#L91) call receives a `docId` (and not a `vectorId`) so we would need a reverse [`IntToIntFunction docIdToVectorOrdinal`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java#L31) to map it back to an ordinal This is straightforward for `DenseOffHeapVectorValues` or `EmptyOffHeapVectorValues` (because there is a 1-1 mapping between a doc and ordinal) but becomes a problem for `SparseOffHeapVectorValues` which has the `vectorOrdinaltoDocId` implemented as a [`DirectMonotonicReader`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapFloatVectorValues.java#L139) - which I think is docIds stored one after another - so getting the docId for an ordinal is as simple as a lookup at that offset. However, getting an inverse of this can become costly (binary search -> returning the index) as opposed to the current constant time lookup I wonder how costly it would be to maintain the set of visited docs at the `KnnCollector` like you mentioned (perhaps using a `SparseFixedBitSet`)? 
We [already create a `BitSet`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L121) of `maxDoc` length to hold the filtered docs. In the worst case, we would need another `BitSet` of the same length to store which docs were visited during graph search, then skip over those in `#exactSearch`. However, there may be a better opportunity here: since we want to go over docs that are "prefiltered but not visited", can we simply `#clear` the bits whenever we visit a node? We just need to find a way to do this cleanly.
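The clear-the-bits idea above can be sketched with a plain `java.util.BitSet` (a self-contained illustration, not Lucene's `FixedBitSet` API, and assuming the dense case where docId equals vector ordinal):

```java
import java.util.BitSet;

public class VisitedFilterSketch {
  public static void main(String[] args) {
    int maxDoc = 16;

    // Prefilter: docs accepted by the filter query.
    BitSet accepted = new BitSet(maxDoc);
    for (int doc : new int[] {1, 3, 5, 7, 9}) accepted.set(doc);

    // During graph search, clear each visited doc from the accepted set
    // (assumes docId == vector ordinal, i.e. the dense case).
    for (int visited : new int[] {3, 7, 12}) accepted.clear(visited);

    // Exact search then only scores "accepted but not visited" docs.
    StringBuilder remaining = new StringBuilder();
    for (int doc = accepted.nextSetBit(0); doc >= 0; doc = accepted.nextSetBit(doc + 1)) {
      remaining.append(doc).append(' ');
    }
    System.out.println(remaining.toString().trim());
  }
}
```

One tradeoff of mutating the filter bitset in place: the original accepted set is destroyed, so it cannot be reused afterwards (e.g. for a rewrite or a second pass), and the sparse case still needs a docId-to-ordinal translation before clearing.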
Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]
MarcusSorealheis commented on PR #12626: URL: https://github.com/apache/lucene/pull/12626#issuecomment-1816799469 @Shibi-bala It's here: https://github.com/apache/lucene/blob/c228e4bb66ca73c8150d8eaebe2bb999bcc6c9b1/lucene/CHANGES.txt#L147 You need to include your username and the GitHub user(s) that reviewed/assisted, as seen in the other entries.
Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]
Shibi-bala commented on PR #12626: URL: https://github.com/apache/lucene/pull/12626#issuecomment-1816818775 Made the changes. Thanks @uschindler @MarcusSorealheis @msfroh 😁
Re: [PR] Fix CheckIndex to detect major corruption with old (not the latest) commit point [lucene]
mikemccand merged PR #12530: URL: https://github.com/apache/lucene/pull/12530
Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]
uschindler merged PR #12626: URL: https://github.com/apache/lucene/pull/12626
Re: [I] segmentInfos.replace() doesn't set userData [lucene]
uschindler closed issue #12637: segmentInfos.replace() doesn't set userData URL: https://github.com/apache/lucene/issues/12637
Re: [I] CheckIndex cannot "fix" indexes that have individual segments with missing or corrupt .si files because sanity checks will fail trying to read the index initially. [LUCENE-6762] [lucene]
mikemccand commented on issue #7820: URL: https://github.com/apache/lucene/issues/7820#issuecomment-1816857195 I merged the first step in this issue -- detecting when this unique snowflake form of corruption strikes. Step 2 is to enable exorcism when there is an `_X.si` file missing, which is tricky because the exception thrown (while loading `SegmentInfos`) does not make it clear which segment(s) are broken. @gokaai I think you had some ideas on how to do this? If we improved the thrown exception to add a member recording which segment is broken, `CheckIndex` could catch that and know exactly which segment to exorcise.
Re: [PR] Use group-varint encoding for the tail of postings [lucene]
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1817146436 Thanks @easyice. I took some time to look into the benchmark and improve a few things, hopefully you don't mind. Here is the output of the benchmark on my machine now:

```
Benchmark                                    (numBytesPerInt)  (size)   Mode  Cnt   Score   Error   Units
GroupVIntBenchmark.byteArrayReadGroupVInt                   1      64  thrpt    5  24.483 ± 0.345  ops/us
GroupVIntBenchmark.byteArrayReadGroupVInt                   2      64  thrpt    5  23.346 ± 0.288  ops/us
GroupVIntBenchmark.byteArrayReadGroupVInt                   3      64  thrpt    5  16.318 ± 0.062  ops/us
GroupVIntBenchmark.byteArrayReadGroupVInt                   4      64  thrpt    5  24.748 ± 0.993  ops/us
GroupVIntBenchmark.byteArrayReadVInt                        1      64  thrpt    5  17.767 ± 0.081  ops/us
GroupVIntBenchmark.byteArrayReadVInt                        2      64  thrpt    5   7.256 ± 0.013  ops/us
GroupVIntBenchmark.byteArrayReadVInt                        3      64  thrpt    5   5.546 ± 0.449  ops/us
GroupVIntBenchmark.byteArrayReadVInt                        4      64  thrpt    5   4.475 ± 0.021  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                  1      64  thrpt    5  21.812 ± 0.485  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                  2      64  thrpt    5  20.623 ± 1.454  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                  3      64  thrpt    5  13.601 ± 0.299  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt                  4      64  thrpt    5  22.649 ± 0.662  ops/us
GroupVIntBenchmark.byteBufferReadVInt                       1      64  thrpt    5  22.147 ± 0.083  ops/us
GroupVIntBenchmark.byteBufferReadVInt                       2      64  thrpt    5   8.072 ± 0.116  ops/us
GroupVIntBenchmark.byteBufferReadVInt                       3      64  thrpt    5   4.554 ± 0.394  ops/us
GroupVIntBenchmark.byteBufferReadVInt                       4      64  thrpt    5   4.145 ± 0.674  ops/us
```

The benchmark used to read directly from the in-memory byte[] by calling `rewind()`; I changed that to force it to read from the directory, to make the comparison a bit more fair.
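For readers unfamiliar with the encoding being benchmarked: group-varint packs four integers behind a single selector byte (two bits per value give its byte length), trading varint's per-byte branching for one selector lookup per group. Here is a self-contained sketch of one common group-varint layout (illustrative only; this is not necessarily Lucene's `GroupVIntWriter`/`GroupVIntReader` wire format):

```java
import java.util.Arrays;

public class GroupVIntSketch {
  // Number of bytes (1..4) needed to hold v as an unsigned int.
  static int numBytes(int v) {
    if ((v & 0xFFFFFF00) == 0) return 1;
    if ((v & 0xFFFF0000) == 0) return 2;
    if ((v & 0xFF000000) == 0) return 3;
    return 4;
  }

  // Encode exactly four ints: one selector byte (2 bits per length-1),
  // then each value little-endian in its minimal number of bytes.
  static int encode(int[] values, byte[] out, int off) {
    int selector = 0;
    int pos = off + 1;
    for (int i = 0; i < 4; i++) {
      int n = numBytes(values[i]);
      selector |= (n - 1) << (i * 2);
      for (int b = 0; b < n; b++) out[pos++] = (byte) (values[i] >>> (8 * b));
    }
    out[off] = (byte) selector;
    return pos - off; // bytes written
  }

  static int decode(byte[] in, int off, int[] values) {
    int selector = in[off] & 0xFF;
    int pos = off + 1;
    for (int i = 0; i < 4; i++) {
      int n = ((selector >>> (i * 2)) & 0x3) + 1;
      int v = 0;
      for (int b = 0; b < n; b++) v |= (in[pos++] & 0xFF) << (8 * b);
      values[i] = v;
    }
    return pos - off; // bytes read
  }

  public static void main(String[] args) {
    int[] values = {5, 300, 70_000, 0x12345678}; // 1-, 2-, 3-, 4-byte values
    byte[] buf = new byte[17]; // worst case: 1 selector + 4 * 4 payload bytes
    int written = encode(values, buf, 0);
    int[] decoded = new int[4];
    int read = decode(buf, 0, decoded);
    if (written != read || !Arrays.equals(values, decoded)) throw new AssertionError();
    System.out.println("bytes=" + written + " decoded=" + Arrays.toString(decoded));
  }
}
```

The mixed-width group (1+2+3+4 payload bytes plus the selector) round-trips in 11 bytes, versus 10 bytes for plain varint but with far fewer branches per decoded value, which is where the benchmark speedups above come from.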
Re: [PR] Use group-varint encoding for the tail of postings [lucene]
jpountz commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1391047570 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntWriter.java: ## @@ -0,0 +1,97 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.codecs.lucene99; + +import java.io.IOException; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.util.ArrayUtil; + +/** + * Encode integers using group-varint. It uses VInt to encode tail values that are not enough for a + * group + */ +class GroupVIntWriter { + private int[] buffer = new int[4]; + private byte[] bytes = new byte[16]; + private int byteOffset = 0; + private int bufferOffset = 0; + private final IndexOutput output; + + public GroupVIntWriter(IndexOutput output) { +this.output = output; + } + + public void add(int v) throws IOException { +buffer = ArrayUtil.grow(buffer, bufferOffset + 1); +buffer[bufferOffset++] = v; + } + + public void reset(int numValues) { Review Comment: Let's remove the `numValues` parameter since it looks like we can't actually rely on it? 
## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntReader.java: ## @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.codecs.lucene99; + +import java.io.IOException; +import org.apache.lucene.store.DataInput; + +/** + * Decode integers using group-varint. It will fully read the bytes for the block, to avoid repeated + * expensive bounds checking per readBytes. + */ +public class GroupVIntReader { + DataInput in; + + public GroupVIntReader() {} + + /** Called when decode a new block. */ + public void reset(DataInput indexInput) throws IOException { +this.in = indexInput; + } + + /** only readValues or nextInt can be called after reset */ + public void readValues(long[] docs, int limit) throws IOException { Review Comment: I'm not sure we need both a reset() and readValues(), maybe `readValues()` could take a `DataInput` directly? ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/GroupVIntBenchmark.java: ## @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.benchmark.jmh; + +import java.io.IOException; +import java.nio.file.Files; +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.ThreadLocalRandom; +import java.util.concurrent.TimeUnit; +import org.apache.lucene.codecs.lucene99.GroupVIntReader; +import org.apache.lucene.codecs.lucene99.GroupVIntWriter; +import org.apache.lucene.store.ByteArrayDataInput; +import org.apache.l
Re: [PR] Log number of visited nodes in knn query [lucene]
jpountz commented on PR #12819: URL: https://github.com/apache/lucene/pull/12819#issuecomment-1817157857 Logging doesn't sound like a good fit for this; would it be better exposed, e.g., via the profiling query?
Re: [PR] Adding HumanReadableQuery with a descrition param, used for debugging print output [lucene]
jpountz commented on code in PR #12816: URL: https://github.com/apache/lucene/pull/12816#discussion_r1397929325 ## lucene/misc/src/java/org/apache/lucene/misc/search/HumanReadableQuery.java: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.search; + +import org.apache.lucene.search.IndexSearcher; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.QueryVisitor; +import org.apache.lucene.search.ScoreMode; +import org.apache.lucene.search.Weight; + +/** + * A simple query wrapper for debug purposes. Behaves like the given query, but when printing to a + * string, it will prepend the description parameter to the query output. 
+ */ +public final class HumanReadableQuery extends Query { + + private final Query in; + private final String description; + + /** + * Create a new HumanReadableQuery + * + * @param in the query to wrap + * @param description a human-readable description, used in toString() + */ + public HumanReadableQuery(Query in, String description) { +this.in = in; +this.description = description; + } + + /** + * @return the wrapped Query + */ + public Query getWrappedQuery() { +return in; + } + + /** + * @return the query description + */ + public String getDescription() { +return description; + } + + @Override + public Query rewrite(IndexSearcher indexSearcher) { +return in; + } + + @Override + public String toString(String field) { +return this.getDescription() + ":" + in.toString(field); + } + + @Override + public void visit(QueryVisitor visitor) { +in.visit(visitor); + } + + @Override + public boolean equals(Object other) { +return sameClassAs(other) && in.equals(((HumanReadableQuery) other).in); + } + + @Override + public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { +throw new UnsupportedOperationException("HumanReadableQuery does not support #createWeight()"); + } + + @Override + public int hashCode() { +return in.hashCode(); Review Comment: Nit: Ideally a query and the same query wrapped in a `HumanReadableQuery` would have different hashcodes to avoid collisions. ```suggestion return 31 * classHash() + in.hashCode(); ``` ## lucene/misc/src/test/org/apache/lucene/misc/search/TestHumanReadableQuery.java: ## @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.misc.search; + +import org.apache.lucene.document.Document; +import org.apache.lucene.document.Field; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.Term; +import org.apache.lucene.search.BooleanClause; +import org.apache.lucene.search.BooleanQuery; +import org.apache.lucene.search.IndexSearcher; +import org.apache.lucene.search.KnnFloatVectorQuery; +import org.apache.lucene.search.PhraseQuery; +import org.apache.lucene.search.TermQuery; +import org.apache.lucene.store.Directory; +import org.apache.lucene.tests.analysis.MockAnalyzer; +import org.apache.lucene.tests.analysis.MockTokenizer; +import org.apache.lucene.tests.index.RandomIndexWriter; +import org.apache.lucene.tests.search.CheckHits; +import org.apache.lucene.tests.util.English; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.TestUtil; +import org.junit.AfterClass; +import org.junit.BeforeClass; + +public
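The review's `31 * classHash() + in.hashCode()` suggestion mixes the wrapper's own class hash into the result, so a query and the same query wrapped hash differently. A minimal self-contained sketch of that idea (the class names and the `CLASS_HASH` constant are illustrative stand-ins, not Lucene's actual `Query.classHash()` implementation):

```java
public class WrapperHashSketch {
  // Stand-in for a per-class hash like Lucene's classHash(); any fixed
  // nonzero constant demonstrates the point.
  private static final int CLASS_HASH = 0x1f4e9a3b;

  // Stand-in for a wrapped query such as TermQuery.
  static final class Inner {
    final String term;
    Inner(String term) { this.term = term; }
    @Override public int hashCode() { return term.hashCode(); }
  }

  // Stand-in for HumanReadableQuery.
  static final class Wrapper {
    final Inner in;
    Wrapper(Inner in) { this.in = in; }
    // Delegating straight to in.hashCode() would collide with the wrapped
    // object; mixing in the wrapper's class hash keeps them distinct.
    @Override public int hashCode() { return 31 * CLASS_HASH + in.hashCode(); }
  }

  public static void main(String[] args) {
    Inner q = new Inner("lucene");
    Wrapper w = new Wrapper(q);
    System.out.println("distinct hashcodes: " + (w.hashCode() != q.hashCode()));
  }
}
```

Since 31 is odd (hence invertible modulo 2^32), `31 * CLASS_HASH` is nonzero whenever `CLASS_HASH` is, so the wrapper's hash can never equal the wrapped hash here.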
Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]
vigyasharma commented on code in PR #12794: URL: https://github.com/apache/lucene/pull/12794#discussion_r1397994430

## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:

@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;
   private final NeighborQueue queue;
+  private final float greediness;
+  private final NeighborQueue queueg;
+  private final MaxScoreAccumulator globalMinSimAcc;
+  private boolean kResultsCollected = false;
+  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;
+
+  // greediness of globally non-competitive search: [0,1]

   /**
    * @param k the number of neighbors to collect
    * @param visitLimit how many vector nodes the results are allowed to visit
+   * @param globalMinSimAcc the global minimum competitive similarity tracked across all segments
    */
-  public TopKnnCollector(int k, int visitLimit) {
+  public TopKnnCollector(int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc) {
+    super(k, visitLimit);
+    this.greediness = DEFAULT_GREEDINESS;
+    this.queue = new NeighborQueue(k, false);
+    int queuegSize = Math.max(1, Math.round((1 - greediness) * k));
+    this.queueg = new NeighborQueue(queuegSize, false);
+    this.globalMinSimAcc = globalMinSimAcc;
+  }
+
+  public TopKnnCollector(
+      int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc, float greediness) {
     super(k, visitLimit);
+    this.greediness = greediness;
     this.queue = new NeighborQueue(k, false);
+    this.queueg = new NeighborQueue(Math.round((1 - greediness) * k), false);
+    this.globalMinSimAcc = globalMinSimAcc;
   }

   @Override
   public boolean collect(int docId, float similarity) {
-    return queue.insertWithOverflow(docId, similarity);
+    boolean result = queue.insertWithOverflow(docId, similarity);
+    queueg.insertWithOverflow(docId, similarity);
+
+    boolean reachedKResults = (kResultsCollected == false && queue.size() == k());
+    if (reachedKResults) {
+      kResultsCollected = true;
+    }
+    if (globalMinSimAcc != null && kResultsCollected) {
+      // as we've collected k results, we can start exchanging globally
+      globalMinSimAcc.accumulate(queue.topNode(), queue.topScore());
+
+      // periodically update the local copy of global similarity
+      if (reachedKResults || (visitedCount & globalMinSimAcc.modInterval) == 0) {
+        MaxScoreAccumulator.DocAndScore docAndScore = globalMinSimAcc.get();
+        cachedGlobalMinSim = docAndScore.score;

Review Comment: We only peek at the global minimum similarity after we have accumulated `k` local results. For my understanding, what would happen if we were to start using `globalMinSim` as soon as any leaf has collected its top K results? I'm curious how the graph structure and entry point calculation affect this. Looking at `graphSearcher.findBestEntryPoint(...)`, I think we navigate to the closest node we can find to our query on layer-1 and use that as the starting point for layer-0. Now layer-0 is a lot denser than layer-1, and our best candidates are likely in the space between the entry point on layer-1 and all its neighbors. Is it always a good idea to at least look at k nodes around the entry point, regardless of the global min. competitive score?

## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:

@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;

Review Comment: So `greediness` is essentially how "greedy" our algorithm for picking top matches gets to be. At 1, we go with the global min competitive similarity. And by default, we continue to search the segment as long as we find neighbors that are better than the top 10% of our local picks from the segment `(1 - 0.9) * k`. Should we add some documentation around this param? My initial, naive impression was that higher greediness might mean we pick more nodes per segment, which is quite the opposite. :)
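The two calculations the review threads above turn on — how `greediness` sizes the secondary queue, and the cheap periodic check against `modInterval` — can be sketched in isolation. A minimal standalone sketch (the helper names `sizeForGreediness` and `shouldSyncGlobalMin` are hypothetical, not Lucene API; it assumes `modInterval` is a power of two minus one, so the bitwise AND works as a modulus test):

```java
// Standalone sketch of the queue-sizing and periodic-sync arithmetic discussed
// above. Hypothetical helper names, not the actual TopKnnCollector code.
public class GreedinessMath {

  // Size of the globally non-competitive queue: higher greediness means a
  // SMALLER queue, i.e. we give up on a segment sooner. Clamped to at least 1.
  static int sizeForGreediness(float greediness, int k) {
    return Math.max(1, Math.round((1 - greediness) * k));
  }

  // With modInterval = 2^n - 1, (visitedCount & modInterval) == 0 is a cheap
  // substitute for (visitedCount % 2^n == 0): sync the cached global minimum
  // similarity only every 2^n visited nodes.
  static boolean shouldSyncGlobalMin(long visitedCount, long modInterval) {
    return (visitedCount & modInterval) == 0;
  }

  public static void main(String[] args) {
    // Default greediness 0.9 with k=100 keeps a secondary queue of 10.
    System.out.println(sizeForGreediness(0.9f, 100));
    // Sync on the 1024th visited node when modInterval = 255? Only when the
    // low bits are all zero.
    System.out.println(shouldSyncGlobalMin(1024, 255));
  }
}
```

This also answers the naming confusion in the comment above: because the secondary queue is sized by `(1 - greediness) * k`, greediness of 1 collapses it to a single slot and the search leans entirely on the global minimum competitive similarity.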
Re: [PR] Speedup concurrent multi-segment HNSW graph search [lucene]
vigyasharma commented on PR #12794: URL: https://github.com/apache/lucene/pull/12794#issuecomment-1817274998

We seem to consistently see an improvement in recall between single-segment and multi-segment runs (both sequential and concurrent) on baseline. Is this because with multiple segments, we get multiple entry points into the overall graph? Whereas in a single merged segment, we only have access to a sparser set of nodes in layer-1 while finding the single best entry point?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speedup concurrent multi-segment HNSW graph search [lucene]
vigyasharma commented on PR #12794: URL: https://github.com/apache/lucene/pull/12794#issuecomment-1817282807

Do you have a mental model of what kinds of graphs would see minimal loss of recall between baseline and candidate? Is this change better with denser (higher-fanout) graphs? Would it depend on graph construction params like `beam-width`?
Re: [PR] Adding HumanReadableQuery with a description param, used for debugging print output [lucene]
slow-J commented on code in PR #12816: URL: https://github.com/apache/lucene/pull/12816#discussion_r1398045523

## lucene/misc/src/test/org/apache/lucene/misc/search/TestHumanReadableQuery.java:

@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.search;
+
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.BooleanClause;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.KnnFloatVectorQuery;
+import org.apache.lucene.search.PhraseQuery;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.tests.analysis.MockAnalyzer;
+import org.apache.lucene.tests.analysis.MockTokenizer;
+import org.apache.lucene.tests.index.RandomIndexWriter;
+import org.apache.lucene.tests.search.CheckHits;
+import org.apache.lucene.tests.util.English;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.TestUtil;
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+
+public class TestHumanReadableQuery extends LuceneTestCase {
+  private static IndexSearcher searcher;
+  private static IndexReader reader;
+  private static Directory directory;
+  private static final String field = "field";
+
+  @BeforeClass
+  public static void beforeClass() throws Exception {
+    directory = newDirectory();
+    RandomIndexWriter writer =
+        new RandomIndexWriter(
+            random(),
+            directory,
+            newIndexWriterConfig(new MockAnalyzer(random(), MockTokenizer.SIMPLE, true))
+                .setMaxBufferedDocs(TestUtil.nextInt(random(), 100, 1000))
+                .setMergePolicy(newLogMergePolicy()));
+    // writer.infoStream = System.out;
+    for (int i = 0; i < 1000; i++) {
+      Document doc = new Document();
+      doc.add(newTextField(field, English.intToEnglish(i), Field.Store.YES));
+      writer.addDocument(doc);
+    }
+    reader = writer.getReader();
+    searcher = newSearcher(reader);
+    writer.close();
+  }
+
+  @AfterClass
+  public static void afterClass() throws Exception {
+    reader.close();
+    directory.close();
+    searcher = null;
+    reader = null;
+    directory = null;
+  }
+
+  public void testHitsEqualPhraseQuery() throws Exception {
+    PhraseQuery pQuery = new PhraseQuery(field, "seventy", "seven");
+    CheckHits.checkHits(
+        random(),
+        pQuery,
+        field,
+        searcher,
+        new int[] {77, 177, 277, 377, 477, 577, 677, 777, 877, 977});
+
+    String description = "TestPhraseQuery";
+    HumanReadableQuery hQuery = new HumanReadableQuery(pQuery, description);
+    CheckHits.checkHits(
+        random(),
+        hQuery,
+        field,
+        searcher,
+        new int[] {77, 177, 277, 377, 477, 577, 677, 777, 877, 977});
+    assertEquals(hQuery.toString(), description + ":" + pQuery);

Review Comment: Thanks, switching the params now.
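The assertion in the test above expects the wrapper's `toString()` to be the description, a colon, then the wrapped query's own `toString()` (and the review comment about "switching the params" refers to JUnit's `assertEquals(expected, actual)` parameter order). A simplified standalone sketch of that `toString()` behavior — `DescribedQuery` and its `Object` field are stand-ins, not the actual `HumanReadableQuery` class:

```java
// Minimal sketch of a query wrapper whose toString() prepends a
// human-readable description, mirroring the "<description>:<query>" format
// asserted in the test above. Not the real lucene/misc HumanReadableQuery.
public class DescribedQuery {
  private final Object wrapped; // stands in for org.apache.lucene.search.Query
  private final String description;

  public DescribedQuery(Object wrapped, String description) {
    this.wrapped = wrapped;
    this.description = description;
  }

  @Override
  public String toString() {
    // Description first, then the wrapped query's own string form.
    return description + ":" + wrapped;
  }

  public static void main(String[] args) {
    DescribedQuery q = new DescribedQuery("body:\"seventy seven\"", "TestPhraseQuery");
    System.out.println(q); // TestPhraseQuery:body:"seventy seven"
  }
}
```

Delegating `toString()` this way keeps the wrapper transparent for scoring while making debug output traceable back to the call site that built the query.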
Re: [PR] Adding HumanReadableQuery with a description param, used for debugging print output [lucene]
slow-J commented on PR #12816: URL: https://github.com/apache/lucene/pull/12816#issuecomment-1817340280

> I left minor comments but it looks good to me otherwise!

Thanks for the feedback! I've made the changes.
Re: [PR] Fix Field.java documentation to refer to new IntField/FloatField/LongField/DoubleField #12125 [lucene]
jpountz commented on PR #12821: URL: https://github.com/apache/lucene/pull/12821#issuecomment-1817419020

Thanks for doing it! It looks like the PR includes unintended changes, though?
Re: [PR] Remove delayed seek optimization. [lucene]
jpountz merged PR #12815: URL: https://github.com/apache/lucene/pull/12815
Re: [PR] Adding HumanReadableQuery with a description param, used for debugging print output [lucene]
jpountz merged PR #12816: URL: https://github.com/apache/lucene/pull/12816
Re: [I] Can/should `KnnByte/FloatVectorQuery` carry some human-meaningful opaque `toString` fragment? [lucene]
jpountz closed issue #12487: Can/should `KnnByte/FloatVectorQuery` carry some human-meaningful opaque `toString` fragment? URL: https://github.com/apache/lucene/issues/12487