Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1895612417 > It's also important to check the order of execution. For instance what happens if all segments are executed serially (rather than in parallel), does it changes the recall? @jimczi The results are below --- Sequential run cohere: 768 dims; interval: 255 ### 1M vectors k=10, fanout=90 | |Avg visited nodes | QPS| Recall| | :--- |---: | ---: |---: | | Baseline Single segment | 2033| 1212|0.880| | Baseline 8 segments sequential | 13927| 175|0.974| | Candidate_with_queue sequential | 9407| 269|0.923| k=100, fanout=900 | |Avg visited nodes | QPS| Recall| | :--- |---: | ---: |---: | | Baseline Single segment | 13083| 167|0.964| | Baseline 8 segments sequential | 81824|28|0.997| | Candidate_with_queue sequential | 36851|62|0.983| ### 10M vectors k=10, fanout=90 | |Avg visited nodes | QPS| Recall| | :---|---: | ---: |---: | | Baseline Single segment | 2189| 984|0.929| | Baseline 19 segments sequential | 37656|59|0.951| | Candidate_with_queue sequential | 17748| 121|0.884| k=100, fanout=900 | |Avg visited nodes | QPS| Recall| | :--- |---: | ---: |---: | | Baseline Single segment | 15040| 122|0.950| | Baseline 19 segments sequential |229945| 9|0.990| | Candidate_with_queue sequential | 77910|27|0.964| -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix BWC test generation after mondernizing LineDocFile [lucene]
jpountz commented on code in PR #13021: URL: https://github.com/apache/lucene/pull/13021#discussion_r1455491681 ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java: ## @@ -219,22 +228,48 @@ public void testCreateMoreTermsIndex() throws Exception { new IndexWriterConfig(analyzer).setMergePolicy(mp).setUseCompoundFile(false); IndexWriter writer = new IndexWriter(dir, conf); LineFileDocs docs = new LineFileDocs(new Random(0)); +Field docIdDV = null; +Field titleDVField = null; for (int i = 0; i < 50; i++) { - writer.addDocument(docs.nextDoc()); + Document doc = docs.nextDoc(); + if (docIdDV == null) { +docIdDV = new NumericDocValuesField("docid_intDV", 0); +doc.add(docIdDV); + } Review Comment: Nit: this seems to assume that the `Document` instance that is returned is always the same, which feels a bit fragile. Maybe do a shallow copy of the document all the time? Or use `TestUtil.cloneDocument()`? ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java: ## @@ -204,11 +204,20 @@ private Path getIndexDir() { } public void testCreateMoreTermsIndex() throws Exception { - Path indexDir = getIndexDir().resolve("moreterms"); Files.deleteIfExists(indexDir); -Directory dir = newFSDirectory(indexDir); +try (Directory dir = newFSDirectory(indexDir)) { + createMoreTermsIndex(dir); +} + } + + public void testCreateMoreTermsIndexInternal() throws Exception { +try (Directory dir = newDirectory()) { + createMoreTermsIndex(dir); +} + } + public void createMoreTermsIndex(Directory dir) throws Exception { Review Comment: nit: it doesn't seem to need to be public? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix BWC test generation after mondernizing LineDocFile [lucene]
s1monw commented on code in PR #13021: URL: https://github.com/apache/lucene/pull/13021#discussion_r1455526064 ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java: ## @@ -219,22 +228,48 @@ public void testCreateMoreTermsIndex() throws Exception { new IndexWriterConfig(analyzer).setMergePolicy(mp).setUseCompoundFile(false); IndexWriter writer = new IndexWriter(dir, conf); LineFileDocs docs = new LineFileDocs(new Random(0)); +Field docIdDV = null; +Field titleDVField = null; for (int i = 0; i < 50; i++) { - writer.addDocument(docs.nextDoc()); + Document doc = docs.nextDoc(); + if (docIdDV == null) { +docIdDV = new NumericDocValuesField("docid_intDV", 0); +doc.add(docIdDV); + } Review Comment: yeah, so we are doing this in the other tests too bef0r my change. That's why i did it that way... I can clean this all up in a followup -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix BWC test generation after mondernizing LineDocFile [lucene]
jpountz commented on code in PR #13021: URL: https://github.com/apache/lucene/pull/13021#discussion_r1455533959 ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java: ## @@ -219,22 +228,48 @@ public void testCreateMoreTermsIndex() throws Exception { new IndexWriterConfig(analyzer).setMergePolicy(mp).setUseCompoundFile(false); IndexWriter writer = new IndexWriter(dir, conf); LineFileDocs docs = new LineFileDocs(new Random(0)); +Field docIdDV = null; +Field titleDVField = null; for (int i = 0; i < 50; i++) { - writer.addDocument(docs.nextDoc()); + Document doc = docs.nextDoc(); + if (docIdDV == null) { +docIdDV = new NumericDocValuesField("docid_intDV", 0); +doc.add(docIdDV); + } Review Comment: Sounds good! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix BWC test generation after mondernizing LineDocFile [lucene]
s1monw merged PR #13021: URL: https://github.com/apache/lucene/pull/13021 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
tveasey commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1455672480 ## lucene/core/src/java/org/apache/lucene/util/hnsw/BlockingFloatHeap.java: ## @@ -0,0 +1,190 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.hnsw; + +import java.util.concurrent.locks.ReentrantLock; + +/** + * A blocking bounded min heap that stores floats. The top element is the lowest value of the heap. + * + * A primitive priority queue that maintains a partial ordering of its elements such that the + * least element can always be found in constant time. Implementation is based on {@link + * org.apache.lucene.util.LongHeap} + * + * @lucene.internal + */ +public final class BlockingFloatHeap { + private final int maxSize; + private final float[] heap; + private final ReentrantLock lock; + private int size; + + public BlockingFloatHeap(int maxSize) { +this.maxSize = maxSize; +this.heap = new float[maxSize + 1]; +this.lock = new ReentrantLock(); +this.size = 0; + } + + /** + * Inserts a value into this heap. + * + * If the number of values would exceed the heap's maxSize, the least value is discarded + * + * @param value the value to add + * @return the new 'top' element in the queue. + */ + public float offer(float value) { +lock.lock(); +try { + if (size < maxSize) { +push(value); +return heap[1]; + } else { +if (value >= heap[1]) { + updateTop(value); +} +return heap[1]; + } +} finally { + lock.unlock(); +} + } + + /** + * Inserts array of values into this heap. + * + * Values are expected to be sorted in ascending order. + * + * @param values a set of values to insert + * @return the new 'top' element in the queue. + */ + public float offer(float[] values) { +lock.lock(); +try { + for (int i = values.length - 1; i >= 0; i--) { +if (size < maxSize) { + push(values[i]); +} else { + if (values[i] >= heap[1]) { Review Comment: But I mean we don't need to keep iterating the loop, i.e. we keep iterating all the way down to 0 but know we will never push anything, instead we can just exit immediately. This is all done under a lock so it seems best to avoid this work, even if it is just the fast path of the loop. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
tveasey commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1895920963 > This makes an interval of 255 a reasonable choice. I agree. This looks better to me. One thing I would be intrigued to try is the slight change in schedules as per [this](https://github.com/apache/lucene/pull/12962#discussion_r1453323532). Particularly, what happens if we delay using the information in `minCompetitiveSimilarity`. However, these results are already very good and we could push them out to a follow on PR. One last thing I think we should consider is exploring the variance we get in recall as a result of this change. Specifically, if we were to run with some random waits in the different segment searches what is the variation in the recalls we see? The danger is we get unlikely in ordering of searches and prune segment searches which contain the true nearest neighbours more aggressively sometimes. On average this shouldn't happen, but if we also see low variation in recall for the 1M test case in such a test case it would reassuring. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Cut over HNSW's neighbor lists to group-varint? [lucene]
easyice commented on issue #12871: URL: https://github.com/apache/lucene/issues/12871#issuecomment-1896075935 Hi Adrien, i tried the idea, this will got a ~10% speedup on JHM output, which is slightly less because it needs an extra loop to decode the delta. The benchmark using neighbor list = 32 (number of vints will be read), JMH output on java21: ``` Benchmark (baseline) Mode Cnt Score Error Units NeighborListsBenchMark.doSeektrue thrpt5 0.026 ± 0.002 ops/us NeighborListsBenchMark.doSeek false thrpt5 0.029 ± 0.001 ops/us ``` The POC code is [here](https://github.com/easyice/lucene/commit/81954223abcf787e85ac6c4a1184e922db29a3cc) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Enable CheckIndex to exorcise segments with missing segment infos (.si) (#7820) [lucene]
github-actions[bot] commented on PR #12872: URL: https://github.com/apache/lucene/pull/12872#issuecomment-1897533423 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org