Re: [PR] Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index [lucene]

2024-07-08 Thread via GitHub
iverase commented on PR #13550: URL: https://github.com/apache/lucene/pull/13550#issuecomment-2216740642 @jpountz I added a new test and changed the title and description of the issue as we don't need to add a new Codec. -- This is an automated message from the Apache Git Service. To respo

Re: [PR] Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index [lucene]

2024-07-08 Thread via GitHub
iverase commented on code in PR #13550: URL: https://github.com/apache/lucene/pull/13550#discussion_r1669846808 ## lucene/test-framework/src/java/org/apache/lucene/tests/codecs/skipper/SkipperCodec.java: ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF

Re: [PR] Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index [lucene]

2024-07-08 Thread via GitHub
iverase commented on code in PR #13550: URL: https://github.com/apache/lucene/pull/13550#discussion_r1669846499 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseDocValuesFormatTestCase.java: ## @@ -157,6 +158,7 @@ public void testNumberMergeAwayAllValuesWithSk

Re: [PR] Binary search all terms. [lucene]

2024-07-08 Thread via GitHub
vsop-479 commented on PR #13192: URL: https://github.com/apache/lucene/pull/13192#issuecomment-2216245652 Or, we just supply this (maybe only for `non-allEqual` leaf blocks) as an option, so users can use it when their applications are not busy.

Re: [I] WordBreakSpellChecker.generateBreakUpSuggestions() should do breadth first search [lucene]

2024-07-08 Thread via GitHub
hossman commented on issue #12100: URL: https://github.com/apache/lucene/issues/12100#issuecomment-2215985880 I realized today that I had been working on branch_9x, so I've updated the patch to apply cleanly to main [WordBreakSpellChecker.breadthfirst.GH-12100.patch.txt](https://gith

Re: [PR] Use `IndexInput#prefetch` for terms dictionary lookups. [lucene]

2024-07-08 Thread via GitHub
github-actions[bot] commented on PR #13359: URL: https://github.com/apache/lucene/pull/13359#issuecomment-2215694348 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Fix quantized vector writer ram estimates [lucene]

2024-07-08 Thread via GitHub
benwtrent commented on PR #13553: URL: https://github.com/apache/lucene/pull/13553#issuecomment-2215645063 @gautamworah96 @msokolov this might be part of the reason for the OOMs, the estimates were completely ignoring the float[] vector sizes for fieldwriters 🤦 . I plan on iterating on this

Re: [PR] Reduce heap usage for knn index writers [lucene]

2024-07-08 Thread via GitHub
benwtrent commented on PR #13538: URL: https://github.com/apache/lucene/pull/13538#issuecomment-2215637537 Related: https://github.com/apache/lucene/pull/13553

Re: [PR] Fix quantized vector writer ram estimates [lucene]

2024-07-08 Thread via GitHub
benwtrent commented on code in PR #13553: URL: https://github.com/apache/lucene/pull/13553#discussion_r1669465330 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -299,9 +299,7 @@ public void finish() throws IOException {

Re: [PR] Fix quantized vector writer ram estimates [lucene]

2024-07-08 Thread via GitHub
benwtrent commented on code in PR #13553: URL: https://github.com/apache/lucene/pull/13553#discussion_r1669464616 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java: ## @@ -172,9 +172,6 @@ public void finish() throws IOException { public

[PR] Fix quantized vector writer ram estimates [lucene]

2024-07-08 Thread via GitHub
benwtrent opened a new pull request, #13553: URL: https://github.com/apache/lucene/pull/13553 I still need to write a test, but wanted to open this PR early. The scalar quantized vector writer's RAM usage estimates completely ignore the raw float vectors. Meaning, if you have flush based o
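
To give a sense of how much memory an estimate misses when it leaves out raw `float[]` vectors, here is a minimal back-of-the-envelope sketch. It is not the code from the PR: the class and method names are hypothetical, and it counts only the float payload, ignoring array headers and any other per-vector overhead.

```java
public class RawVectorRamSketch {
  // Hypothetical helper: payload bytes for numVectors float vectors of the given dimension.
  // Float.BYTES is 4, so each vector contributes dimension * 4 bytes of raw data.
  static long rawFloatVectorBytes(long numVectors, int dimension) {
    return numVectors * dimension * Float.BYTES;
  }

  public static void main(String[] args) {
    // Illustrative numbers only: one million 768-dimensional vectors.
    long bytes = rawFloatVectorBytes(1_000_000L, 768);
    System.out.println(bytes + " bytes of raw float data");
  }
}
```

Under these assumptions a single 768-dimensional vector carries 3072 bytes of float data, so an estimate that skips it can be off by gigabytes long before a RAM-based flush would trigger.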

Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]

2024-07-08 Thread via GitHub
uschindler commented on PR #13545: URL: https://github.com/apache/lucene/pull/13545#issuecomment-2215482308 I reverted the addition of the file to 9.x branch: 86d080a4e0b4e53e0c9a3f2e2b120bff204c7276

Re: [I] Significant drop in recall for 8 bit Scalar Quantizer [lucene]

2024-07-08 Thread via GitHub
MilindShyani commented on issue #13519: URL: https://github.com/apache/lucene/issues/13519#issuecomment-2215327156 @benwtrent Apologies for the late response! I am traveling (and marveling at 2000-year-old pyramids) right now. The transformation you wrote indeed matches mine. Thinking about th

Re: [I] Make HNSW merges faster [lucene]

2024-07-08 Thread via GitHub
benwtrent commented on issue #12440: URL: https://github.com/apache/lucene/issues/12440#issuecomment-2215292611 I had another idea, I wonder if we can initialize HNSW via coarse grained clusters. Depending on the clustering algorithm used, we can use clusters built from various segments to

Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]

2024-07-08 Thread via GitHub
zhaih merged PR #13548: URL: https://github.com/apache/lucene/pull/13548

Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub
stefanvodita commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1669100332 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -328,42 +336,65 @@ protected LeafSlice[] slices(List leaves) { /** Static method to

Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub
uschindler commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214871854 > * it should be a one-liner using `setPreload` to preload "*.vec" if we wanted to do it either from FSDirectory.open or by default in MMapDirectory It is trivial: ```ja

Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub
uschindler commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214868089 I don't think we should change anything here in MMapDirectory. It is all available and easy to do for anyone who wants to do this. Elasticsearch is doing this for some files, but w
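
As a rough illustration of the `setPreload` idea discussed in this thread, the sketch below shows only the file-name filter an application might supply. The class name, constant, and sample file names are hypothetical. The commented-out Lucene wiring assumes that `MMapDirectory#setPreload` accepts a predicate over file name and `IOContext`, which should be checked against the Lucene version in use.

```java
import java.util.function.BiPredicate;

public class VecPreloadSketch {
  // Hypothetical filter: select only files whose name ends in ".vec".
  // The second argument stands in for Lucene's IOContext and is unused here.
  static final BiPredicate<String, Object> PRELOAD_VEC_FILES =
      (fileName, ioContext) -> fileName.endsWith(".vec");

  public static void main(String[] args) {
    // Assumed wiring with Lucene on the classpath (not compiled in this sketch):
    //   MMapDirectory dir = new MMapDirectory(indexPath);
    //   dir.setPreload((name, ctx) -> name.endsWith(".vec"));
    System.out.println(PRELOAD_VEC_FILES.test("_0.vec", null));
    System.out.println(PRELOAD_VEC_FILES.test("_0.fdt", null));
  }
}
```

Keeping the decision in application code matches the point made in the comment: the capability is there for anyone who wants it, without changing the defaults.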

Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]

2024-07-08 Thread via GitHub
uschindler commented on code in PR #13545: URL: https://github.com/apache/lucene/pull/13545#discussion_r1669064660 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -212,6 +212,14 @@ public static int int4DotProductPacked(byte[] unpacked, byte[] packed) {

Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]

2024-07-08 Thread via GitHub
uschindler commented on PR #13545: URL: https://github.com/apache/lucene/pull/13545#issuecomment-2214846241 See: https://github.com/apache/lucene/commit/c8b4a76ecc93a98c779364b18f62c9b67552c192#diff-dd8d7417893f9b2fecaef29491b94d5daeaae6d496c4b21bb9633b4f7b060e59

Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]

2024-07-08 Thread via GitHub
uschindler commented on PR #13545: URL: https://github.com/apache/lucene/pull/13545#issuecomment-2214845553 Hi, in the backport to 9.x the benchmark file was wrongly merged. It landed in the test directory. In 9.x we have no benchmark-jmh module in Gradle, so the file should have been le

[I] Test TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce Failed [lucene]

2024-07-08 Thread via GitHub
aoli-al opened a new issue, #13552: URL: https://github.com/apache/lucene/issues/13552 ### Description I saw a flaky test, `TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce` caused by concurrency issues recently: ``` MockDirectoryWrapper: can

Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]

2024-07-08 Thread via GitHub
zhaih commented on code in PR #13548: URL: https://github.com/apache/lucene/pull/13548#discussion_r1669056420 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java: ## @@ -139,6 +142,54 @@ public int nextDoc() throws IOException { } } + /** + * Give

Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]

2024-07-08 Thread via GitHub
zhaih commented on code in PR #13548: URL: https://github.com/apache/lucene/pull/13548#discussion_r1669054287 ## lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java: ## @@ -356,7 +356,7 @@ BufferedUpdate next() throws IOException { } } -BytesRe

Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub
mikemccand commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214828022 Oh sorry I used the wrong term (thank you @rmuir for clarifying!): it's not a swap storm I'm seeing, it's a page storm. The OS has plenty of free ram (reported by `free`), and t

Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]

2024-07-08 Thread via GitHub
benwtrent commented on code in PR #13548: URL: https://github.com/apache/lucene/pull/13548#discussion_r1669043121 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java: ## @@ -139,6 +142,54 @@ public int nextDoc() throws IOException { } } + /** + *

Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub
msokolov commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1668890994 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -328,42 +336,65 @@ protected LeafSlice[] slices(List leaves) { /** Static method to seg

Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub
msokolov commented on PR #13542: URL: https://github.com/apache/lucene/pull/13542#issuecomment-2214811101 I wonder if we should tackle the issue with caching / cloning scorers? We have scorers/scorerSuppliers that do a lot of up-front work when created and we don't want to duplicate that wo

Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub
jpountz commented on PR #13542: URL: https://github.com/apache/lucene/pull/13542#issuecomment-2214643611 > The change in expectation should be reflected in the Collector API semantics though (rather than CollectorManager?), is that what you meant? I was referring to `CollectorManager

Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]

2024-07-08 Thread via GitHub
ChrisHegarty merged PR #13545: URL: https://github.com/apache/lucene/pull/13545

Re: [I] Significant drop in recall for 8 bit Scalar Quantizer [lucene]

2024-07-08 Thread via GitHub
benwtrent commented on issue #13519: URL: https://github.com/apache/lucene/issues/13519#issuecomment-2214591883 @jbhateja could you unpack how this would actually work when using dot-product with linear scale corrections? I would imagine we could switch to an "unsigned byte" compariso

Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub
jpountz commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214541782 It wouldn't solve the issue, only mitigate it, but hopefully cold start performance gets better when we start leveraging `IndexInput#prefetch` to load multiple vectors from disk con

Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub
rmuir commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214496576 > It's not easy to do -- you wouldn't know up front that the application will do KNN searching at all. And, maybe only certain vectors in the `.vec` will ever be accessed and so you n

Re: [PR] Introduce a SkipperCodec for testing docvalues skipper index [lucene]

2024-07-08 Thread via GitHub
jpountz commented on code in PR #13550: URL: https://github.com/apache/lucene/pull/13550#discussion_r1668890679 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseDocValuesFormatTestCase.java: ## @@ -157,6 +158,7 @@ public void testNumberMergeAwayAllValuesWithSk

[I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub
mikemccand opened a new issue, #13551: URL: https://github.com/apache/lucene/issues/13551 ### Description This is really a "discussion" issue. I'm not sure at all that the idea is feasible: I've been testing `luceneutil` with heavy KNN indexing (Cohere wikipedia `en` 768 dime

Re: [PR] TaskExecutor should not fork unnecessarily [lucene]

2024-07-08 Thread via GitHub
original-brownbear commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2213989662 Sure thing, on it! :) Sorry, could've done that right away, tired me just didn't realise it this morning.

Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]

2024-07-08 Thread via GitHub
uschindler commented on PR #13535: URL: https://github.com/apache/lucene/pull/13535#issuecomment-2213707284 There are some test failures due to strict thread checking. I think the mock input should only do this when it's in confined mode.

Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]

2024-07-08 Thread via GitHub
ChrisHegarty commented on PR #13535: URL: https://github.com/apache/lucene/pull/13535#issuecomment-2213644712 Thanks for the comments so far. I updated the PR to only check same-thread semantics for MSII clone and slice. And also added some basic thread checks to MockIndexInputWrapper. I

Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub
javanna commented on PR #13542: URL: https://github.com/apache/lucene/pull/13542#issuecomment-2213602344 Bulk reply to some of the feedback I got: hi @shubhamvishu , > I know it might be too early to ask(as changes are not yet consolidated), but curious if we have any early be

Re: [PR] Override single byte writes to OutputStreamIndexOutput to remove locking [lucene]

2024-07-08 Thread via GitHub
jpountz merged PR #13543: URL: https://github.com/apache/lucene/pull/13543

Re: [PR] Override single byte writes to OutputStreamIndexOutput to remove locking [lucene]

2024-07-08 Thread via GitHub
jpountz commented on code in PR #13543: URL: https://github.com/apache/lucene/pull/13543#discussion_r1668261240 ## lucene/core/src/java/org/apache/lucene/store/OutputStreamIndexOutput.java: ## @@ -135,5 +135,19 @@ void writeLong(long i) throws IOException { BitUtil.VH_LE_

Re: [PR] Optimize MaxScoreBulkScorer [lucene]

2024-07-08 Thread via GitHub
jpountz merged PR #13544: URL: https://github.com/apache/lucene/pull/13544

Re: [PR] Replace AtomicLong with LongAdder in HitsThresholdChecker [lucene]

2024-07-08 Thread via GitHub
jpountz merged PR #13546: URL: https://github.com/apache/lucene/pull/13546
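
For readers unfamiliar with the swap named in this PR title, the following is a minimal sketch of counting hits from several threads with `java.util.concurrent.atomic.LongAdder`. It is not the code from the PR: the class and method names are hypothetical, and the thread and hit counts are arbitrary.

```java
import java.util.concurrent.atomic.LongAdder;

public class HitCounterSketch {
  // Hypothetical helper: each worker thread increments a shared LongAdder.
  // LongAdder spreads updates over internal cells, which tends to reduce
  // contention compared with a single AtomicLong under heavy concurrent writes.
  static long countHits(int threads, int hitsPerThread) {
    LongAdder hits = new LongAdder();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < hitsPerThread; j++) {
          hits.increment();
        }
      });
      workers[i].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for workers", e);
      }
    }
    // sum() is exact here because all writers have finished.
    return hits.sum();
  }

  public static void main(String[] args) {
    System.out.println(countHits(4, 10_000));
  }
}
```

The trade-off is that `sum()` taken while writers are still running is only a moving snapshot, which is usually acceptable for a threshold check but not for exact accounting.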

Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1668225258 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -328,42 +336,65 @@ protected LeafSlice[] slices(List leaves) { /** Static method to segr

Re: [PR] Binary search all terms. [lucene]

2024-07-08 Thread via GitHub
vsop-479 commented on PR #13192: URL: https://github.com/apache/lucene/pull/13192#issuecomment-2213224505 > This many new allocations Maybe we can share these allocations (`suffixes`, `positions`, `positions`) from `searchers`, since they are just immutable and non-stateful data.

Re: [PR] TaskExecutor should not fork unnecessarily [lucene]

2024-07-08 Thread via GitHub
jpountz commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2213213760 I just pushed an annotation for this change: https://github.com/mikemccand/luceneutil/commit/a64ac17a9d1a935649837990f2accbace0b93262. Several queries got a bit faster with a low p

Re: [PR] TaskExecutor should not fork unnecessarily [lucene]

2024-07-08 Thread via GitHub
jpountz commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2213200920 @original-brownbear Would you like to work on a PR?