[GitHub] [lucene] jpountz closed pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.
jpountz closed pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks. URL: https://github.com/apache/lucene/pull/227 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] shubhamvishu commented on a diff in pull request #11832: Added static factory method for loading VectorValues
shubhamvishu commented on code in PR #11832: URL: https://github.com/apache/lucene/pull/11832#discussion_r988712945 ## lucene/core/src/java/org/apache/lucene/index/SlowCodecReaderWrapper.java: ## @@ -163,7 +163,7 @@ private static KnnVectorsReader readerToVectorReader(LeafReader reader) { return new KnnVectorsReader() { @Override public VectorValues getVectorValues(String field) throws IOException { -return reader.getVectorValues(field); +return VectorValues.getVectorValues(reader, field); Review Comment: > The new method looks good, but it's no longer used anywhere under src/java anymore, which makes me wonder if it's actually helpful. One reasoning is we can have this to be consistent like how we have similar static functions in `DocValues`. There no code using it right now but this could be used maybe in future? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on issue #11839: gradle aggregate coverage report
rmuir commented on issue #11839: URL: https://github.com/apache/lucene/issues/11839#issuecomment-1269611646 @zhaih please take it! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #11832: Added static factory method for loading VectorValues
jpountz commented on PR #11832: URL: https://github.com/apache/lucene/pull/11832#issuecomment-1269629285 > One reasoning is we can have this to be consistent like how we have similar static functions in DocValues. There no code using it right now but this could be used maybe in future? I'm starting to have more second thoughts about whether we should actually add this new method. On doc values, I'm seeing value because it helps save a few annoying null checks and then everything should work as expected on the empty instances. Saving null checks applies for vectors too, but the `EMPTY` vectors instance also returns `0` for the `dimension()` method, which is something I could see some code paths to complain about if they end up seeing different values for the dimension across different segments of the same index. So it's not clear to me that it would prevent more bugs than it would introduce. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #11831: GITHUB-11761: Move minimum TieredMergePolicy delete percentage from 2…
jpountz commented on PR #11831: URL: https://github.com/apache/lucene/pull/11831#issuecomment-1269633237 Why did we have to update the number of allowed deletes back to 33% on some tests, did they fail otherwise? Is there another way how we could improve these tests to cope better with the new default? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #11831: GITHUB-11761: Move minimum TieredMergePolicy delete percentage from 2…
jpountz commented on PR #11831: URL: https://github.com/apache/lucene/pull/11831#issuecomment-1269635726 E.g. maybe it would be better to add more documents in `testNRTIsCurrentAfterDelete` in order for the single delete not to introduce more than 20% deletes, and keep the `TieredMergePolicy` defaults? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on issue #11830: Store HNSW graph connections more compactly
benwtrent commented on issue #11830: URL: https://github.com/apache/lucene/issues/11830#issuecomment-1270205020 I was able to replicate @jtibshirani results. Using the `glove-100-angular` from ann-benchmarks, here is the graph data size: ``` Baseline: 154M _0_Lucene94HnswVectorsFormat_0.vex PackedInts: 103M _0_Lucene94HnswVectorsFormat_0.vex ``` Now, the implementation for `PackedInts` followed closely to @jtibshirani "hacky patch". There are various changes required on read, and new information needs to be stored in regards to the PackedInts version used. In all (disregarding the fact that a new codec must be written), the code change is pretty small. Other good news is that when running ann-benchmark on `glove-100-angular`, there is no discernible difference in search performance. I am not sure how much traffic the off-heap reader gets in this benchmark. It seems to be it would be read once and stored in memory, though I might be misunderstanding something. @jtibshirani @mayya-sharipova what do y'all think? Any ideas on how to force off-heap for search for better numbers on search changes? Honestly, we may not care about slightly slower off-heap search performance as smaller sizes are more likely to be in the page cache. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on issue #11830: Store HNSW graph connections more compactly
jtibshirani commented on issue #11830: URL: https://github.com/apache/lucene/issues/11830#issuecomment-1270359029 @benwtrent thanks for looking into this! To clarify, `OffHeapHnswGraph` will always be used for searches. Usually it will be paged in from disk before heavy searching, but it's still the same data structure. The on-heap version is only used for indexing. Could you include a comparison of the search latency and recall numbers you see with this approach? Sometimes with our benchmarks it's easy to miss small differences in performance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on issue #11830: Store HNSW graph connections more compactly
benwtrent commented on issue #11830: URL: https://github.com/apache/lucene/issues/11830#issuecomment-1270415680 > Could you include a comparison of the search latency and recall numbers you see with this approach? Sometimes with our benchmarks it's easy to miss small differences in performance. Yes, I can provide some images and numbers. What I saw when running `glove-100-angular` batched is that there is no difference. Will run more tests for sure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on issue #11838: Adding concurrency to query rewrite?
jtibshirani commented on issue #11838: URL: https://github.com/apache/lucene/issues/11838#issuecomment-1270564173 This is a great question! To me it'd make sense to try to move costly steps into weight creation or scoring. It feels a little "off" to do the bulk of a query's work during rewrite. However, I understand why we chose to use rewrite for `KnnVectorQuery`. It allows us the query to match the global top `k` across segments, as opposed to matching `k` docs per segment. This means the number of documents it matches doesn't depend on the number of segments, and it's easier to combine with other queries. Maybe this is not the right trade-off though in terms of behavior vs. performance 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org