[GitHub] [lucene] romseygeek commented on a diff in pull request #860: LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty.
romseygeek commented on code in PR #860:
URL: https://github.com/apache/lucene/pull/860#discussion_r863538587

## lucene/core/src/java/org/apache/lucene/search/WANDScorer.java:

```java
@@ -86,7 +86,6 @@ static int scalingFactor(float f) {
    * sure we do not miss any matches.
    */
   static long scaleMaxScore(float maxScore, int scalingFactor) {
-    assert scalingFactor(maxScore) >= scalingFactor;
```

Review Comment:
Does this assertion no longer hold? I'm a bit nervous about removing it, as its tripping was what alerted us to the bug in the first place.
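For context, here is a minimal sketch of the scaling idea behind `scaleMaxScore` (an illustration of the technique, not Lucene's exact implementation): float max scores are mapped into long space by multiplying with a power of two, rounding up so the scaled bound may overestimate but never underestimate the true maximum.

```java
// Sketch only: map a float max score into long space by scaling with
// 2^scalingFactor. Rounding up is safe (we may score a few extra docs);
// rounding down could miss matches. Edge cases like 0 and +Infinity are
// exactly what the linked PR is about.
static long scaleMaxScoreSketch(float maxScore, int scalingFactor) {
  assert Float.isNaN(maxScore) == false && maxScore >= 0;
  double scaled = Math.scalb((double) maxScore, scalingFactor); // maxScore * 2^scalingFactor
  return (long) Math.ceil(scaled); // saturates to Long.MAX_VALUE for +Infinity
}
```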
[GitHub] [lucene] mayya-sharipova commented on pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
mayya-sharipova commented on PR #792:
URL: https://github.com/apache/lucene/pull/792#issuecomment-1116030632

@LuXugang I've created a feature branch: https://github.com/apache/lucene/tree/vectors-disi-direct

From my side, we are good to merge this PR into it, but I wonder if @jtibshirani has any other comments.
[jira] [Commented] (LUCENE-10436) Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?
[ https://issues.apache.org/jira/browse/LUCENE-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531180#comment-17531180 ]

Alan Woodward commented on LUCENE-10436:
----------------------------------------

The backport of this has inadvertently made the previously public DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator() method package-private. I think it's still a useful method (or at least, we're using it in Elasticsearch), so it should probably stay public. Perhaps we should instead move it to the public DocValues utility class?

> Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-10436
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10436
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Assignee: Zach Chen
>            Priority: Minor
>             Fix For: 9.2
>
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Now that we require consistency across data structures, we could merge DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery together into a FieldExistsQuery that would require that the field indexes either norms, doc values or vectors?
[jira] [Created] (LUCENE-10555) avoid repeated NumericLeafComparator setScorer called
jianping weng created LUCENE-10555:
--------------------------------------

             Summary: avoid repeated NumericLeafComparator setScorer called
                 Key: LUCENE-10555
                 URL: https://issues.apache.org/jira/browse/LUCENE-10555
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: jianping weng

Elasticsearch uses CancellableBulkScorer to quickly cancel long-running query execution by dividing one segment's docs into many small splits. For every split, collector.setScorer(scorer) is called, which in turn calls NumericLeafComparator#setScorer. As a result, for one segment, NumericLeafComparator#setScorer is called many times.

Every time NumericLeafComparator#setScorer is called, NumericLeafComparator#iteratorCost is reset to Scorer.cost, which causes many unnecessary PointValues#intersect calls to find competitive docs.
[jira] [Updated] (LUCENE-10555) avoid repeated NumericLeafComparator setScorer calls
[ https://issues.apache.org/jira/browse/LUCENE-10555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jianping weng updated LUCENE-10555:
-----------------------------------
    Summary: avoid repeated NumericLeafComparator setScorer calls  (was: avoid repeated NumericLeafComparator setScorer called)

> avoid repeated NumericLeafComparator setScorer calls
> ----------------------------------------------------
>
>                 Key: LUCENE-10555
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10555
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: jianping weng
>            Priority: Major
>
> Elasticsearch uses CancellableBulkScorer to quickly cancel long-running query execution by dividing one segment's docs into many small splits. For every split, collector.setScorer(scorer) is called, which in turn calls NumericLeafComparator#setScorer. As a result, for one segment, NumericLeafComparator#setScorer is called many times.
> Every time NumericLeafComparator#setScorer is called, NumericLeafComparator#iteratorCost is reset to Scorer.cost, which causes many unnecessary PointValues#intersect calls to find competitive docs.
[GitHub] [lucene] wjp719 opened a new pull request, #864: LUCENE-10555: avoid repeated NumericLeafComparator setScorer calls
wjp719 opened a new pull request, #864:
URL: https://github.com/apache/lucene/pull/864

Elasticsearch uses `CancellableBulkScorer` to quickly cancel long-running query execution by splitting one segment's docs into many doc sets. For every doc set, `collector.setScorer(scorer)` is called, then `NumericLeafComparator#setScorer` is called. As a result, for one segment, `NumericLeafComparator#setScorer` is called many times.

Every time `NumericLeafComparator#setScorer` is called, `NumericLeafComparator#iteratorCost` is reset to Scorer.cost, which causes many unnecessary `PointValues#intersect` calls to find competitive docIds. This results in performance degradation.

This PR ensures `NumericLeafComparator#setScorer` is called only once.

# Checklist

Please review the following and check all that apply:

- [ ] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability.
- [ ] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
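A minimal standalone sketch of the idea behind the fix (field and method names are assumptions, not the actual patch): initialize the iterator cost only on the first `setScorer` call for a segment.

```java
// Hypothetical stand-in for NumericLeafComparator's cost handling,
// illustrating the "set once" idea discussed in this PR.
class OncePerSegmentCost {
  private long iteratorCost = -1; // -1 == not initialized yet

  // Called for every doc set by the collector; with CancellableBulkScorer
  // this can happen many times per segment.
  void setScorer(long scorerCost) {
    if (iteratorCost == -1) {
      // Seed the cost only once. Later calls keep the (possibly much
      // smaller) cost of the current competitive iterator, avoiding
      // repeated PointValues#intersect work to shrink it again.
      iteratorCost = scorerCost;
    }
  }

  long cost() {
    return iteratorCost;
  }
}
```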
[GitHub] [lucene] mocobeta commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader
mocobeta commented on PR #833:
URL: https://github.com/apache/lucene/pull/833#issuecomment-1116116020

Hi, I am very new to this issue and interested in adding a benchmark for ExitableDirectoryReader to luceneutil. It wouldn't be a "benchmark" in the strict sense, since it won't measure search performance, but it would check whether the search returns within the expected timeout with incomplete results; this should work as an assertion that the reader works as expected with real data/queries and some realistic timeouts. It may be helpful for catching any degradations?

It'll take some time for me, so I don't think we should hold this up for it.
[GitHub] [lucene] mocobeta commented on a diff in pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader
mocobeta commented on code in PR #833:
URL: https://github.com/apache/lucene/pull/833#discussion_r863835107

## lucene/core/src/test/org/apache/lucene/index/TestExitableDirectoryReader.java:

```java
@@ -428,6 +430,107 @@ public void testDocValues() throws IOException {
     directory.close();
   }

+  public void testVectorValues() throws IOException {
+    Directory directory = newDirectory();
+    IndexWriter writer =
+        new IndexWriter(directory, newIndexWriterConfig(new MockAnalyzer(random())));
+
+    int numDoc = atLeast(20);
+    int deletedDoc = atMost(5);
+    int dimension = atLeast(3);
+    boolean vectorIndexed = false;
+
+    for (int i = 0; i < numDoc; i++) {
+      Document doc = new Document();
+
+      if (random().nextBoolean()) {
+        vectorIndexed = true;
+        float[] value = new float[dimension];
+        for (int j = 0; j < dimension; j++) {
+          value[j] = random().nextFloat();
+        }
+        FieldType fieldType =
+            KnnVectorField.createFieldType(dimension, VectorSimilarityFunction.COSINE);
+        doc.add(new KnnVectorField("vector", value, fieldType));
+      }
+
+      doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
+      writer.addDocument(doc);
+    }
+
+    writer.forceMerge(1);
+    writer.commit();
+
+    for (int i = 0; i < deletedDoc; i++) {
+      writer.deleteDocuments(new Term("id", Integer.toString(i)));
+    }
+
+    writer.close();
+
+    QueryTimeout randomQueryTimeout;
```

Review Comment:
nit - at first impression, I thought this would represent some random time spans. Actually this is either immediate, infinite, or disabled. I wonder if we could simply name it `queryTimeout`?
[GitHub] [lucene] rmuir commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
rmuir commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r863840319

## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:

```java
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
     return mergeSpec;
   }

+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws IOException {
+    if (random.nextBoolean() == true) {
+      return super.findMerges(readers);
+    } else {
+      // create a oneMerge for each reader to let them get concurrently processed by addIndexes()
```

Review Comment:
> Thanks for reviewing this @rmuir, @uschindler. I understand your concern. Failures on concurrent code are already tough to reproduce; it gets much harder to do this from a randomized test seed.
>
> It is worth noting that the `findMerges(CodecReader[])` API modified in `MockRandomMergePolicy` is only invoked within the `addIndexes(CodecReader[])` API, which should contain its scope of impact. But I definitely don't want to add noise with random failures on distantly related tests. I'll remove this new policy from `MockRandomMergePolicy`, which is frequently used for `IndexWriterConfig` in multiple tests.

This is true, but also part of the risk: `addIndexes(CodecReader[])` is already pretty scary, correctness-wise. This is typically used to expertly "transform" an index, otherwise you wouldn't use this API. It is probably undertested today; I just don't want to mix concurrency randomly into the few tests it has.

> For my awareness, what is the best practice in Lucene for testing non-default, concurrency related code paths? Randomized tests help validate infrequently triggered code, but concurrency poses reproducibility challenges. For such cases, do we try to rely entirely on change-specific concurrency unit tests?

Well, we need to think about it. I actually hate that ConcurrentMergeScheduler gets used the way it does in the tests, for this same reason. And that's the actual problem here, I think, not your logic in the merge policy. So really my concerns are around the merge scheduler in tests preventing reproducibility. CMS is the default if someone tests with just `new IndexWriterConfig`, and it is almost always the default if they use `LuceneTestCase.newIndexWriterConfig`.

Maybe we can improve our tests to use `SerialMergeScheduler` and `ConcurrentMergeScheduler(n=1)` more often, so that reproducibility is improved across the board. We can have separate specific tests using `ConcurrentMergeScheduler(n>1)`. It really isn't anything new from your change here; I think I was just a bit frustrated the last time I tried to reproduce a seed recently (https://issues.apache.org/jira/browse/LUCENE-10529), so I want us to be careful about blanket application of concurrency to a ton of existing tests.
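For reference, a sketch of the two reproducibility-friendly configurations mentioned above (standard Lucene APIs; the class wrapper is just for illustration): run merges serially, or cap ConcurrentMergeScheduler at a single merge thread.

```java
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;

class ReproducibleMergeConfig {
  // Fully deterministic: merges run on the indexing thread, in order.
  static IndexWriterConfig serial() {
    return new IndexWriterConfig().setMergeScheduler(new SerialMergeScheduler());
  }

  // Still concurrent with indexing, but only one merge thread, which keeps
  // merge interleavings much easier to reproduce from a test seed.
  static IndexWriterConfig singleThreadedCms() {
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxMergesAndThreads(1, 1); // maxMergeCount, maxThreadCount
    return new IndexWriterConfig().setMergeScheduler(cms);
  }
}
```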
[GitHub] [lucene] mocobeta commented on a diff in pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader
mocobeta commented on code in PR #833:
URL: https://github.com/apache/lucene/pull/833#discussion_r863840972

## lucene/core/src/test/org/apache/lucene/index/TestExitableDirectoryReader.java:

```java
@@ -428,6 +430,107 @@ public void testDocValues() throws IOException {
...
+    writer.close();
+
+    QueryTimeout randomQueryTimeout;
```

Review Comment:
Or, I guess we can test all three timeout patterns in this test without randomness, like `testExitableFilterTermsIndexReader()`? Sorry if I'm missing any points.
[GitHub] [lucene] mocobeta commented on a diff in pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader
mocobeta commented on code in PR #833:
URL: https://github.com/apache/lucene/pull/833#discussion_r863855038

## lucene/core/src/test/org/apache/lucene/index/TestExitableDirectoryReader.java:

```java
@@ -428,6 +430,107 @@ public void testDocValues() throws IOException {
...
+    for (int i = 0; i < numDoc; i++) {
+      Document doc = new Document();
+
+      if (random().nextBoolean()) {
```

Review Comment:
I'm just curious - what's the point of this randomness? Wouldn't it be okay to add the vector values for all test docs?
[GitHub] [lucene] LuXugang opened a new pull request, #865: Lucene-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
LuXugang opened a new pull request, #865:
URL: https://github.com/apache/lucene/pull/865

Follow-up of https://github.com/apache/lucene/pull/792
[GitHub] [lucene] LuXugang commented on pull request #865: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
LuXugang commented on PR #865:
URL: https://github.com/apache/lucene/pull/865#issuecomment-1116261200

Hi @mayya-sharipova, the full PR is presented.
[GitHub] [lucene] jpountz commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader
jpountz commented on PR #833:
URL: https://github.com/apache/lucene/pull/833#issuecomment-1116331660

Actually I was thinking of search performance as something that we would like to measure, since these wrappers can induce some performance overhead. E.g. the baseline could run on the raw reader and the contender would wrap the reader with ExitableDirectoryReader and a very large timeout that's almost certainly not going to be hit, so that we could measure how much search performance we're trading by introducing these wrappers and timeout checks. Maybe luceneutil could have an option to do this.
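A sketch of the proposed baseline/contender setup (assuming the standard `ExitableDirectoryReader` and `QueryTimeoutImpl` APIs; the luceneutil wiring is out of scope here): same index and queries, but the contender wraps the reader with a timeout that will never fire, so any throughput difference is the cost of the wrapper's timeout checks.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.ExitableDirectoryReader;
import org.apache.lucene.index.QueryTimeoutImpl;
import org.apache.lucene.store.Directory;

class WrapperOverheadSetup {
  // Baseline: search against the raw reader.
  static DirectoryReader baseline(Directory dir) throws IOException {
    return DirectoryReader.open(dir);
  }

  // Contender: same reader wrapped with a one-day timeout that is almost
  // certainly never hit, so only the timeout-check overhead is measured.
  static DirectoryReader contender(Directory dir) throws IOException {
    return new ExitableDirectoryReader(
        DirectoryReader.open(dir), new QueryTimeoutImpl(TimeUnit.DAYS.toMillis(1)));
  }
}
```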
[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW
[ https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julie Tibshirani updated LUCENE-10527:
--------------------------------------
    Description:

Recently I was rereading the HNSW paper ([https://arxiv.org/pdf/1603.09320.pdf]) and noticed that they suggest using a different maxConn for the upper layers vs. the bottom one (which contains the full neighborhood graph). Specifically, they suggest using maxConn=M for upper layers and maxConn=2*M for the bottom. This differs from what we do, which is to use maxConn=M for all layers.

I tried updating our logic using a hacky patch, and noticed an improvement in latency for higher recall values (which is consistent with the paper's observation):

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100

!image-2022-04-20-14-53-58-484.png|width=400,height=367!

As we'd expect, indexing becomes a bit slower:
{code:java}
Baseline:  Indexed 1183514 documents in 733s
Candidate: Indexed 1183514 documents in 948s{code}

When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a big difference in recall for the same settings of M and efConstruction. (Even adding graph layers in LUCENE-10054 didn't really affect recall.) With this change, the recall is now very similar:

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
{code:java}
k     Approach                                             Recall   QPS
10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.563    4410.499
50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.798    1956.280
100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.862    1209.734
100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.958    341.428
800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.974    230.396
1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.980    188.757
10    hnswlib ({'M': 32, 'efConstruction': 100})           0.552    16745.433
50    hnswlib ({'M': 32, 'efConstruction': 100})           0.794    5738.468
100   hnswlib ({'M': 32, 'efConstruction': 100})           0.860    3336.386
500   hnswlib ({'M': 32, 'efConstruction': 100})           0.956    832.982
800   hnswlib ({'M': 32, 'efConstruction': 100})           0.973    541.097
1000  hnswlib ({'M': 32, 'efConstruction': 100})           0.979    442.163
{code}

I think it'd be a nice update to maxConn so that we faithfully implement the paper's algorithm. This is probably least surprising for users, and I don't see a strong reason to take a different approach from the paper? Let me know what you think!
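The paper's suggestion boils down to a per-layer connection cap; a minimal sketch (illustrative names, not Lucene's actual code):

```java
// Level 0 (the bottom layer) holds the full neighborhood graph and gets
// twice the connections of the upper layers, per the HNSW paper.
static int maxConnOnLevel(int level, int M) {
  return level == 0 ? 2 * M : M;
}
```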
[GitHub] [lucene] jtibshirani commented on pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
jtibshirani commented on PR #792:
URL: https://github.com/apache/lucene/pull/792#issuecomment-1116736228

Thanks for running those benchmarks! I'm working on a quick review before you merge it.

One thing I was surprised about with the benchmarks is that you find searches take 60ms or more. In https://issues.apache.org/jira/browse/LUCENE-10527 we see that on a similar sized dataset, the latency is always below 10ms. (It's reported in QPS, or queries per second, but it's sequential, so latency is just `1000 / QPS`.) Maybe you could give some information about your machine and benchmark set-up (was there a warmup?)
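To make the `1000 / QPS` conversion concrete: the slowest Lucene configuration reported on LUCENE-10527 above runs at 188.757 QPS, which works out to roughly 1000 / 188.757 ≈ 5.3 ms per query, still well under the 10ms mentioned here.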
[GitHub] [lucene] mocobeta commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader
mocobeta commented on PR #833:
URL: https://github.com/apache/lucene/pull/833#issuecomment-1116850417

> the baseline could run on the raw reader and the contender would wrap the reader with ExitableDirectoryReader and a very large timeout that's almost certainly not going to be hit, so that we could measure how much search performance we're trading by introducing these wrappers and timeout checks

This sounds good to me (with a benchmark, we could tune `DOCS_BETWEEN_TIMEOUT_CHECK`?). I'll take a look - I think it's feasible but could take some time for me :)
[GitHub] [lucene] mayya-sharipova commented on pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction
mayya-sharipova commented on PR #862:
URL: https://github.com/apache/lucene/pull/862#issuecomment-1116854352

@msokolov Thanks so much for your feedback. I've addressed it in 160904904c94ffd4d194fc2509124c0e2eb9c44a.

I've also done another set of benchmarking with the new changes, this time with higher `M` and `efConstruction` values: glove-100-angular, M:32, efConstruction:500. Happy to report that indexing took just slightly longer (3%), while searches showed slight gains in both recall and QPS:

**Indexing Speed**

baseline:  Indexed 1183514 documents in 3697s
candidate: Indexed 1183514 documents in 3814s

**Search**

| | baseline recall | baseline QPS | candidate recall | candidate QPS |
| --- | ---: | ---: | ---: | ---: |
| n_cands=10 | 0.634 | 2628.879 | 0.659 | 2659.589 |
| n_cands=20 | 0.696 | 2034.672 | 0.719 | 2186.015 |
| n_cands=40 | 0.762 | 1611.403 | 0.784 | 1666.735 |
| n_cands=80 | 0.824 | 1150.786 | 0.843 | 1161.111 |
| n_cands=120 | 0.855 | 878.114 | 0.874 | 890.783 |
| n_cands=200 | 0.892 | 610.891 | 0.907 | 621.173 |
| n_cands=400 | 0.931 | 343.649 | 0.944 | 359.924 |
| n_cands=600 | 0.949 | 239.217 | 0.960 | 256.863 |
| n_cands=800 | 0.961 | 189.523 | 0.970 | 199.867 |
[GitHub] [lucene] jtibshirani commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
jtibshirani commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r864338089

## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:

```java
@@ -320,13 +323,19 @@ private static class FieldEntry {
     final int numLevels;
     final int dimension;
     private final int size;
-    final int[] ordToDoc;
-    private final IntUnaryOperator ordToDocOperator;
     final int[][] nodesByLevel;
     // for each level the start offsets in vectorIndex file from where to read neighbours
     final long[] graphOffsetsByLevel;
-
-    FieldEntry(DataInput input, VectorSimilarityFunction similarityFunction) throws IOException {
+    final long docsWithFieldOffset;
```

Review Comment:
It'd be great to add short comments here explaining what these different values are used for.

## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:

```java
@@ -507,8 +515,90 @@ public BytesRef binaryValue(int targetOrd) throws IOException {
     }

     private void readValue(int targetOrd) throws IOException {
-      dataIn.seek((long) targetOrd * byteSize);
-      dataIn.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), byteSize);
+      slice.seek((long) targetOrd * byteSize);
+      slice.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), byteSize);
+    }
+
+    public IndexedDISI initDISI(IndexInput vectorData) throws IOException {
+      // dense
+      if (fieldEntry == null || fieldEntry.docsWithFieldOffset == -1) {
+        return null;
+      }
+      assert fieldEntry.docsWithFieldOffset != -2;
+      // sparse
+      return new IndexedDISI(
+          vectorData,
+          fieldEntry.docsWithFieldOffset,
+          fieldEntry.docsWithFieldLength,
+          fieldEntry.jumpTableEntryCount,
+          fieldEntry.denseRankPower,
+          fieldEntry.size);
+    }
+
+    public static final OffHeapVectorValues emptyOffHeapVectorValues(int dimension)
```

Review Comment:
I think we should avoid creating this new empty `OffHeapVectorValues` class to keep things simple. Instead we can just assert in `getOffHeapVectorValues` that the field is not empty and make sure to always return early if the field is empty.

## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:

```java
@@ -400,27 +400,42 @@ static class OffHeapVectorValues extends VectorValues
     private final int dimension;
     private final int size;
-    private final int[] ordToDoc;
-    private final IntUnaryOperator ordToDocOperator;
+    // dataIn was used to init a new IndexedDISI for #randomAccess()
     private final IndexInput dataIn;
+    private final IndexInput slice;
     private final BytesRef binaryValue;
     private final ByteBuffer byteBuffer;
     private final int byteSize;
     private final float[] value;
+    private final IndexedDISI disi;
+    private final FieldEntry fieldEntry;
+    final DirectMonotonicReader ordToDoc;
     private int ord = -1;
     private int doc = -1;

-    OffHeapVectorValues(int dimension, int size, int[] ordToDoc, IndexInput dataIn) {
+    OffHeapVectorValues(
+        int dimension, int size, FieldEntry fieldEntry, IndexInput dataIn, IndexInput slice)
+        throws IOException {
       this.dimension = dimension;
       this.size = size;
-      this.ordToDoc = ordToDoc;
-      ordToDocOperator = ordToDoc == null ? IntUnaryOperator.identity() : (ord) -> ordToDoc[ord];
+      this.fieldEntry = fieldEntry;
       this.dataIn = dataIn;
+      this.slice = slice;
+      this.disi = initDISI(dataIn);
```

Review Comment:
I think it'd be clearer if we separated out the sparse and dense cases into two different subclasses. Currently it feels pretty fragile/confusing, for example in `Lucene91HnswVectorsWriter` line 148 we create an `OffHeapVectorValues` that has null for one `dataIn` but not for `slice`. Instead we could just have this be its own class like `OffHeapVectorValues.Dense` that does not require the `dataIn` at all. I also think we can simplify the dense logic. We don't need different variables `ord` and `doc` because these are always equal in the dense case.

## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:

```java
@@ -335,23 +344,23 @@ private static class FieldEntry {
       dimension = input.readInt();
       size = input.readInt();

-      int denseSparseMarker = input.readByte();
-      if (denseSparseMarker == -1) {
-        ordToDoc = null; // each document has a vector value
+      docsWithFieldOffset = input.readLong();
+      docsWithFieldLength = input.readLong();
+      jumpTableEntryCount = input.readShort();
+      denseRankPower = input.readByte();
+
+      // sparse
+      if (docsWithFieldOffset != -1 && docsWithFieldOffset != -2) {
```

Review Comment:
I think it'd be clearer to just say `docsWithFieldOffset == 0` here.
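A rough standalone sketch of the reviewer's suggestion (class names and fields are illustrative, not the actual Lucene classes): separate the dense and sparse cases into subclasses so each only carries the state it actually needs.

```java
abstract class VectorValuesSketch {
  abstract int ordToDoc(int ord);

  /** Dense case: every doc has a vector, so ord == doc and no DISI is needed. */
  static final class Dense extends VectorValuesSketch {
    @Override
    int ordToDoc(int ord) {
      return ord;
    }
  }

  /** Sparse case: ords map to docs via a lookup table (a stand-in here for
   *  the DirectMonotonicReader, with iteration driven by an IndexedDISI). */
  static final class Sparse extends VectorValuesSketch {
    private final int[] ordToDoc;

    Sparse(int[] ordToDoc) {
      this.ordToDoc = ordToDoc;
    }

    @Override
    int ordToDoc(int ord) {
      return ordToDoc[ord];
    }
  }
}
```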
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction
mayya-sharipova commented on code in PR #862:
URL: https://github.com/apache/lucene/pull/862#discussion_r864407239

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:

```java
@@ -21,32 +21,64 @@
 /**
  * NeighborArray encodes the neighbors of a node and their mutual scores in the HNSW graph as a pair
- * of growable arrays.
+ * of growable arrays. Nodes are arranged in the sorted order of their scores in descending order
+ * (if scoresDescOrder is true), or in the ascending order of their scores (if scoresDescOrder is
+ * false)
  *
  * @lucene.internal
  */
 public class NeighborArray {
-
+  private final boolean scoresDescOrder;
   private int size;
   float[] score;
   int[] node;

-  public NeighborArray(int maxSize) {
+  public NeighborArray(int maxSize, boolean descOrder) {
     node = new int[maxSize];
     score = new float[maxSize];
+    this.scoresDescOrder = descOrder;
   }

+  /**
+   * Add a new node to the NeighborArray. The new node must be worse than all previously stored
+   * nodes.
+   */
   public void add(int newNode, float newScore) {
     if (size == node.length - 1) {
       node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
       score = ArrayUtil.growExact(score, node.length);
     }
+    if (size > 0) {
+      float previousScore = score[size - 1];
+      assert ((scoresDescOrder && (previousScore >= newScore))
+              || (scoresDescOrder == false && (previousScore <= newScore)))
+          : "Nodes are added in the incorrect order!";
+    }
     node[size] = newNode;
     score[size] = newScore;
     ++size;
   }

+  /** Add a new node to the NeighborArray into a correct sort position according to its score. */
+  public void addAndSort(int newNode, float newScore) {
+    if (size == node.length - 1) {
+      node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
+      score = ArrayUtil.growExact(score, node.length);
+    }
+    int insertionPoint =
+        scoresDescOrder
+            ? descSortFindRightMostInsertionPoint(newScore)
+            : ascSortFindRightMostInsertionPoint(newScore);
+    for (int i = size; i > insertionPoint; i--) {
```

Review Comment:
TIL that we can use System.arraycopy for the reverse case as well. Addressed in 160904904c94ffd4d194fc2509124c0e2eb9c44a.

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:

```java
+  /** Add a new node to the NeighborArray into a correct sort position according to its score. */
+  public void addAndSort(int newNode, float newScore) {
```

Review Comment:
Addressed in 160904904c94ffd4d194fc2509124c0e2eb9c44a.
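For reference, a small standalone sketch of the arraycopy-based shift mentioned above (NeighborArray's actual fields differ slightly): System.arraycopy is specified to behave as if copying through a temporary array, so overlapping source and destination ranges are safe and the shift-right loop can be a single call.

```java
// Open a slot at insertionPoint by shifting the tail one position to the
// right, then write the new entry. The overlapping ("reverse") copy is safe.
static void insertAt(int[] node, float[] score, int size, int insertionPoint,
                     int newNode, float newScore) {
  System.arraycopy(node, insertionPoint, node, insertionPoint + 1, size - insertionPoint);
  System.arraycopy(score, insertionPoint, score, insertionPoint + 1, size - insertionPoint);
  node[insertionPoint] = newNode;
  score[insertionPoint] = newScore;
}
```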
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction
mayya-sharipova commented on code in PR #862:
URL: https://github.com/apache/lucene/pull/862#discussion_r864407369

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:

```java
@@ -72,8 +104,38 @@ public void removeLast() {
     size--;
   }

+  public void removeIndex(int idx) {
+    for (int i = idx; i < (size - 1); i++) {
+      node[i] = node[i + 1];
+      score[i] = score[i + 1];
+    }
+    size--;
+  }
+
   @Override
   public String toString() {
     return "NeighborArray[" + size + "]";
   }
+
+  private int ascSortFindRightMostInsertionPoint(float newScore) {
+    int start = 0;
```

Review Comment:
Great comment, addressed in 160904904c94ffd4d194fc2509124c0e2eb9c44a.

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:

```java
+  public void removeIndex(int idx) {
+    for (int i = idx; i < (size - 1); i++) {
+      node[i] = node[i + 1];
```

Review Comment:
Great comment, addressed in 160904904c94ffd4d194fc2509124c0e2eb9c44a.
[GitHub] [lucene] LuXugang commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
LuXugang commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r864437136

## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:

```java
@@ -335,23 +344,23 @@ private static class FieldEntry {
       dimension = input.readInt();
       size = input.readInt();

-      int denseSparseMarker = input.readByte();
-      if (denseSparseMarker == -1) {
-        ordToDoc = null; // each document has a vector value
+      docsWithFieldOffset = input.readLong();
+      docsWithFieldLength = input.readLong();
+      jumpTableEntryCount = input.readShort();
+      denseRankPower = input.readByte();
+
+      // sparse
+      if (docsWithFieldOffset != -1 && docsWithFieldOffset != -2) {
```

Review Comment:
In the sparse case, `docsWithFieldOffset` is always a value greater than 0, so adding an assertion like `docsWithFieldOffset > 0` seems like it could make the code clearer.
[GitHub] [lucene] LuXugang commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
LuXugang commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r864438991

## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:

```java
@@ -335,23 +344,23 @@ private static class FieldEntry {
       dimension = input.readInt();
       size = input.readInt();

-      int denseSparseMarker = input.readByte();
-      if (denseSparseMarker == -1) {
-        ordToDoc = null; // each document has a vector value
+      docsWithFieldOffset = input.readLong();
+      docsWithFieldLength = input.readLong();
+      jumpTableEntryCount = input.readShort();
+      denseRankPower = input.readByte();
+
+      // sparse
+      if (docsWithFieldOffset != -1 && docsWithFieldOffset != -2) {
```

Review Comment:
Thanks @jtibshirani! In the sparse case, `docsWithFieldOffset` is always a value greater than 0, so the clearer code could look like:

```
// dense or empty
if (docsWithFieldOffset == -1 || docsWithFieldOffset == -2) {
  ...
} else {
  // sparse
  ...
}
```
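To summarize the sentinel convention discussed in this thread, a small sketch (mirroring the IndexedDISI/doc-values convention described above, not the literal Lucene code):

```java
// docsWithFieldOffset encodes how docs-with-vectors are stored:
//   -2  -> empty: no document has a vector
//   -1  -> dense: every document has a vector, no DISI needed
//   > 0 -> sparse: a real file offset where the IndexedDISI blocks start
static String docsWithFieldLayout(long docsWithFieldOffset) {
  if (docsWithFieldOffset == -2) {
    return "empty: no document has a vector";
  } else if (docsWithFieldOffset == -1) {
    return "dense: every document has a vector";
  } else {
    return "sparse: IndexedDISI blocks start at offset " + docsWithFieldOffset;
  }
}
```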
[GitHub] [lucene] mocobeta opened a new pull request, #866: Make CONTRIBUTING.md a bit more succinct
mocobeta opened a new pull request, #866:
URL: https://github.com/apache/lucene/pull/866

I think it'd be good to try to keep the contributing guide as succinct as possible so that it is helpful for experienced developers (the main targets of the document, I think).

- remove specific Gradle commands - instead, we can link to the corresponding help documents.
- remove the instruction for making a patch - I assume contributors who prefer patches to pull requests are already familiar with it.
- remove the explanation about Yetus - I don't think it is still working on the Jira Lucene project.
[GitHub] [lucene] mocobeta commented on pull request #866: Make CONTRIBUTING.md a bit more succinct
mocobeta commented on PR #866:
URL: https://github.com/apache/lucene/pull/866#issuecomment-1116946180

It's a trivial change in documentation. I'll wait until tomorrow, then merge it to main.
[GitHub] [lucene] LuXugang commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
LuXugang commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r864475729

## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:

```java
@@ -507,8 +515,90 @@ public BytesRef binaryValue(int targetOrd) throws IOException {
     }

     private void readValue(int targetOrd) throws IOException {
-      dataIn.seek((long) targetOrd * byteSize);
-      dataIn.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), byteSize);
+      slice.seek((long) targetOrd * byteSize);
+      slice.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), byteSize);
+    }
+
+    public IndexedDISI initDISI(IndexInput vectorData) throws IOException {
+      // dense
+      if (fieldEntry == null || fieldEntry.docsWithFieldOffset == -1) {
+        return null;
+      }
+      assert fieldEntry.docsWithFieldOffset != -2;
+      // sparse
+      return new IndexedDISI(
+          vectorData,
+          fieldEntry.docsWithFieldOffset,
+          fieldEntry.docsWithFieldLength,
+          fieldEntry.jumpTableEntryCount,
+          fieldEntry.denseRankPower,
+          fieldEntry.size);
+    }
+
+    public static final OffHeapVectorValues emptyOffHeapVectorValues(int dimension)
```

Review Comment:
> I think it'd be clearer if we separated out the sparse and dense cases into two different subclasses

Totally agreed! It makes the code much more readable. But I still want to keep `emptyOffHeapVectorValues`, which has been renamed to `EmptyOffHeapVectorValues` in the latest PR; I would like to keep the same pattern as `Lucene90NormsProducer#getNorms` or `Lucene90DocValuesProducer#getXXX`. Addressed in https://github.com/apache/lucene/pull/792/commits/49f67a57e8e294fbed2a63d4887169103f05073b .
[GitHub] [lucene] LuXugang commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
LuXugang commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r864476448

## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:

```java
@@ -335,23 +344,23 @@ private static class FieldEntry {
       dimension = input.readInt();
       size = input.readInt();

-      int denseSparseMarker = input.readByte();
-      if (denseSparseMarker == -1) {
-        ordToDoc = null; // each document has a vector value
+      docsWithFieldOffset = input.readLong();
+      docsWithFieldLength = input.readLong();
+      jumpTableEntryCount = input.readShort();
+      denseRankPower = input.readByte();
+
+      // sparse
+      if (docsWithFieldOffset != -1 && docsWithFieldOffset != -2) {
```

Review Comment:
Addressed in https://github.com/apache/lucene/pull/792/commits/e2931d90126375c488fa0060ff85619452081af7 .
[GitHub] [lucene] LuXugang commented on pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
LuXugang commented on PR #792:
URL: https://github.com/apache/lucene/pull/792#issuecomment-1116957549

> Maybe you could give some information about your machine and benchmark set-up (was there a warmup?)

Here is my benchmark test demo: https://github.com/LuXugang/Lucene-7.5.0/commit/b69ae6c70665878f95115a6a49715c84c760b4c6 . The dataset in the demo is `glove.6B` from https://github.com/mikemccand/luceneutil