Re: [I] [DISCUSS] Could we have a different ANN algorithm for Learned Sparse Vectors? [lucene]

2024-10-14 Thread via GitHub
atris commented on issue #13675: URL: https://github.com/apache/lucene/issues/13675#issuecomment-2412918581 I have recently been interested in this direction and plan on spending a non-trivial amount of time on this over the next few weeks. Assuming we haven't started dev on this, I am assign

Re: [I] Add an S3-based directory. [lucene]

2024-10-14 Thread via GitHub
atris commented on issue #13868: URL: https://github.com/apache/lucene/issues/13868#issuecomment-2412914455 @jpountz Interestingly, I have been spending some time on making this work - if it's OK, I will self-assign this issue? -- This is an automated message from the Apache Git Service. T

Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]

2024-10-14 Thread via GitHub
vsop-479 commented on PR #12868: URL: https://github.com/apache/lucene/pull/12868#issuecomment-2412698518 BTW, why is `StringHelper` an abstract class? Can we make it final?

Re: [I] Make dynamic range facets value collection and sorting faster [lucene]

2024-10-14 Thread via GitHub
HoustonPutman commented on issue #13760: URL: https://github.com/apache/lucene/issues/13760#issuecomment-2412577287 @timgrein, I've posted a PR for my idea that @stefanvodita mentioned. If you have the JMH benchmark, I'd love to test it out on mine as well.

Re: [PR] Remove synchronization from IndexWriter.isClosed [lucene]

2024-10-14 Thread via GitHub
github-actions[bot] commented on PR #13834: URL: https://github.com/apache/lucene/pull/13834#issuecomment-2412576598 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

[PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2024-10-14 Thread via GitHub
HoustonPutman opened a new pull request, #13914: URL: https://github.com/apache/lucene/pull/13914 Resolves #13760 ### Description This is using a similar approach to how Solr used to compute multiple percentiles at a single time. Basically utilize the quick select metho

[I] Support multi-tenant RAM buffers for IndexWriter [lucene]

2024-10-14 Thread via GitHub
mdmarshmallow opened a new issue, #13913: URL: https://github.com/apache/lucene/issues/13913 ### Description This is related to https://github.com/apache/lucene/issues/13883. The idea is to allow users to specify the RAM usage once and it will be automatically spread across N IndexWr

Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

2024-10-14 Thread via GitHub
msokolov commented on PR #13910: URL: https://github.com/apache/lucene/pull/13910#issuecomment-2412421807 I wonder whether we should backport the fixes to the `Lucene90HnswVectorsReader`? I tend to think we ought to, although the usage might be tiny to nonexistent

Re: [PR] Fix 9.12.0 backcompat break (Lucene 9.12.0 cannot read 9.11.x indices written with quantized HNSW, `Lucene99HnswScalarQuantizedVectorsFormat`) [lucene]

2024-10-14 Thread via GitHub
javanna commented on PR #13874: URL: https://github.com/apache/lucene/pull/13874#issuecomment-2412365840 I did find above a diff that is exactly the same as the PR I opened to update AddBackcompatindices :) (#13911). I also opened a PR against main to cherry-pick this change and make the

[PR] Align TestGenerateBwcIndices.java with AddBackcompatindices.py [lucene]

2024-10-14 Thread via GitHub
javanna opened a new pull request, #13911: URL: https://github.com/apache/lucene/pull/13911 We updated TestGenerateBwcIndices to create int7 HNSW indices instead of int8 with #13874. The corresponding Python code that is part of the release wizard needs to be updated accordingly.

Re: [PR] Fix 9.12.0 backcompat break (Lucene 9.12.0 cannot read 9.11.x indices written with quantized HNSW, `Lucene99HnswScalarQuantizedVectorsFormat`) [lucene]

2024-10-14 Thread via GitHub
javanna commented on PR #13874: URL: https://github.com/apache/lucene/pull/13874#issuecomment-2412274304 Heya, I am working on generating the backwards indices after releasing Lucene 10, and here are my observations: - I think that we need to forward port this change to main as well?

Re: [PR] Avoid slicing memory segments unnecessarily [lucene]

2024-10-14 Thread via GitHub
uschindler commented on PR #13906: URL: https://github.com/apache/lucene/pull/13906#issuecomment-2411866037 P.P.S.: Just some background: The code was copied from ByteBufferIndexInput where the clones were necessary. With MemorySegment we no longer create `dup()`s, as MemorySegment is state

Re: [PR] Avoid slicing memory segments unnecessarily [lucene]

2024-10-14 Thread via GitHub
original-brownbear commented on PR #13906: URL: https://github.com/apache/lucene/pull/13906#issuecomment-2411864517 > One thing that is a bit awkward to me is that it makes clones cheaper than slices, so e.g. refactoring TermsEnum#postings to work on a slice that contains just the postings

[PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

2024-10-14 Thread via GitHub
msokolov opened a new pull request, #13910: URL: https://github.com/apache/lucene/pull/13910 While exploring some recall-related failures in another PR I went looking for a unit test that checks HNSW/KNN recall and couldn't find any. I think we used to have one but maybe we removed it becau

Re: [PR] Avoid slicing memory segments unnecessarily [lucene]

2024-10-14 Thread via GitHub
uschindler commented on code in PR #13906: URL: https://github.com/apache/lucene/pull/13906#discussion_r1799835249 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -563,6 +563,13 @@ public final MemorySegmentIndexInput slice(String sliceDesc

Re: [PR] Avoid slicing memory segments unnecessarily [lucene]

2024-10-14 Thread via GitHub
uschindler commented on code in PR #13906: URL: https://github.com/apache/lucene/pull/13906#discussion_r1799826647 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -563,6 +563,13 @@ public final MemorySegmentIndexInput slice(String sliceDesc

Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]

2024-10-14 Thread via GitHub
zkendall commented on code in PR #2686: URL: https://github.com/apache/lucene-solr/pull/2686#discussion_r1799751372 ## solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java: ## @@ -130,9 +131,26 @@ public SolrIndexSplitter(SplitIndexCommand cmd) { } routeFie

Re: [PR] Avoid slicing memory segments unnecessarily [lucene]

2024-10-14 Thread via GitHub
jpountz commented on code in PR #13906: URL: https://github.com/apache/lucene/pull/13906#discussion_r1799801328 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -563,6 +563,13 @@ public final MemorySegmentIndexInput slice(String sliceDescrip

Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]

2024-10-14 Thread via GitHub
dsmiley commented on code in PR #2686: URL: https://github.com/apache/lucene-solr/pull/2686#discussion_r1799803611 ## solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java: ## @@ -130,9 +131,26 @@ public SolrIndexSplitter(SplitIndexCommand cmd) { } routeFiel

Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]

2024-10-14 Thread via GitHub
stefanvodita commented on issue #13898: URL: https://github.com/apache/lucene/issues/13898#issuecomment-2411705472 @prudhvigodithi - I opened #13898 inspired by your PR, tested it on my fork, and it's working correctly. @javanna - I like that idea! Maybe we can do that as a follow-up.

[PR] Add changelog verifier [lucene]

2024-10-14 Thread via GitHub
stefanvodita opened a new pull request, #13909: URL: https://github.com/apache/lucene/pull/13909 ### Description This runs alongside the checks we already have for PR creation/update and warns us if there is no CHANGES.txt entry.

Re: [PR] Test changelog verifier [lucene]

2024-10-14 Thread via GitHub
stefanvodita commented on PR #13908: URL: https://github.com/apache/lucene/pull/13908#issuecomment-2411670171 Testing a GitHub workflow, not intended to merge.

Re: [PR] Test changelog verifier [lucene]

2024-10-14 Thread via GitHub
stefanvodita closed pull request #13908: Test changelog verifier URL: https://github.com/apache/lucene/pull/13908

[PR] Test changelog verifier [lucene]

2024-10-14 Thread via GitHub
stefanvodita opened a new pull request, #13908: URL: https://github.com/apache/lucene/pull/13908 test

Re: [I] Make dynamic range facets value collection and sorting faster [lucene]

2024-10-14 Thread via GitHub
timgrein commented on issue #13760: URL: https://github.com/apache/lucene/issues/13760#issuecomment-2411631641 Cool, I've found some performance improvements (~10-15%), which can be reproduced through a new `jmh` benchmark I've added. I'll open a PR in the next few days and tag you :)

Re: [PR] Added support for highlighting `IndexOrDocValuesQuery` [lucene]

2024-10-14 Thread via GitHub
prudhvigodithi commented on PR #13902: URL: https://github.com/apache/lucene/pull/13902#issuecomment-2411600165 Thanks for the review and approval @mkhludnev @jpountz. @getsaurabh02 @dblock

Re: [I] `IndexOrDocValuesQuery` does not support query highlighting [lucene]

2024-10-14 Thread via GitHub
prudhvigodithi commented on issue #12686: URL: https://github.com/apache/lucene/issues/12686#issuecomment-2411600774 Hey @harshavamsi, the PR to support `IndexOrDocValuesQuery` for query highlighting is now merged. Thanks. @getsaurabh02 @msfroh

Re: [PR] Dry up EverythingEnum and BlockDocsEnum in Lucene912PostingsReader [lucene]

2024-10-14 Thread via GitHub
original-brownbear merged PR #13901: URL: https://github.com/apache/lucene/pull/13901

Re: [PR] Dry up EverythingEnum and BlockDocsEnum in Lucene912PostingsReader [lucene]

2024-10-14 Thread via GitHub
original-brownbear commented on PR #13901: URL: https://github.com/apache/lucene/pull/13901#issuecomment-241157 Thanks Adrien!

Re: [I] Create a community metrics dashboard [lucene]

2024-10-14 Thread via GitHub
prudhvigodithi commented on issue #13896: URL: https://github.com/apache/lucene/issues/13896#issuecomment-2411592601 Thanks @stefanvodita, I would like to have the maintainers of the repo take the initial stab at this metrics dashboard (come up the server, app, choosing the database and

[PR] Only call madvise when necessary. [lucene]

2024-10-14 Thread via GitHub
jpountz opened a new pull request, #13907: URL: https://github.com/apache/lucene/pull/13907 This commit tries to save calls to `madvise` which are not necessary, either because they map to the OS' default, or because the advice would be overridden later on anyway. I have not noticed specifi

Re: [PR] Dry up EverythingEnum and BlockDocsEnum in Lucene912PostingsReader [lucene]

2024-10-14 Thread via GitHub
original-brownbear commented on code in PR #13901: URL: https://github.com/apache/lucene/pull/13901#discussion_r1799704584 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912PostingsReader.java: ## @@ -429,9 +403,44 @@ public PostingsEnum reset(IntBlockTermState

[PR] Avoid slicing memory segments unnecessarily [lucene]

2024-10-14 Thread via GitHub
original-brownbear opened a new pull request, #13906: URL: https://github.com/apache/lucene/pull/13906 No need to slice when it's a clone that is to be used with random access, we already enforce thread access rules anyway. Also, no point in copying the memory segment instance via noop slic

Re: [PR] Avoid allocating liveDocs for no soft-deletes (#13895) [lucene]

2024-10-14 Thread via GitHub
dnhatn merged PR #13903: URL: https://github.com/apache/lucene/pull/13903

Re: [I] Create a community metrics dashboard [lucene]

2024-10-14 Thread via GitHub
stefanvodita commented on issue #13896: URL: https://github.com/apache/lucene/issues/13896#issuecomment-2411304207 That sounds like a plan! I'd be thrilled if you were driving this @prudhvigodithi - thank you for offering your help. It would also be great to get feedback from others on wh

Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]

2024-10-14 Thread via GitHub
jpountz merged PR #12868: URL: https://github.com/apache/lucene/pull/12868

[PR] Remove redundant code in PointInSetQuery [lucene]

2024-10-14 Thread via GitHub
easyice opened a new pull request, #13905: URL: https://github.com/apache/lucene/pull/13905 Clean up unused variable `MergePointVisitor#sortedPackedPoints`

Re: [PR] Make MaxScoreBulkScorer repartition scorers when the min competitive increases. [lucene]

2024-10-14 Thread via GitHub
jpountz commented on PR #13800: URL: https://github.com/apache/lucene/pull/13800#issuecomment-243632 Here's the fix for conjunctions: https://github.com/apache/lucene/pull/13904.

Re: [PR] Dry up EverythingEnum and BlockDocsEnum in Lucene912PostingsReader [lucene]

2024-10-14 Thread via GitHub
jpountz commented on code in PR #13901: URL: https://github.com/apache/lucene/pull/13901#discussion_r1799409756 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912PostingsReader.java: ## @@ -429,9 +403,44 @@ public PostingsEnum reset(IntBlockTermState termState,

Re: [PR] Added support for highlighting `IndexOrDocValuesQuery` [lucene]

2024-10-14 Thread via GitHub
jpountz merged PR #13902: URL: https://github.com/apache/lucene/pull/13902

Re: [PR] Replace Map with IntObjectHashMap for KnnVectorsReader [lucene]

2024-10-14 Thread via GitHub
bugmakerr commented on PR #13763: URL: https://github.com/apache/lucene/pull/13763#issuecomment-2410922885 Hi @jpountz, since we have moved to Lucene 10, should we merge this and add back #13686?

Re: [PR] Use RandomAccessInput instead of seeking in Lucene90DocValuesProducer [lucene]

2024-10-14 Thread via GitHub
uschindler commented on PR #13894: URL: https://github.com/apache/lucene/pull/13894#issuecomment-2410803086 I was a bit afraid that this might have been caused by missing "prefetch()" implementation, but for MemorySegmentIndexInput also the random access slice uses the correct implementatio

Re: [PR] Use RandomAccessInput instead of seeking in Lucene90DocValuesProducer [lucene]

2024-10-14 Thread via GitHub
uschindler commented on PR #13894: URL: https://github.com/apache/lucene/pull/13894#issuecomment-2410749511 I don't think this is a real slowdown caused by the commit here. It is more caused by the Hotspot optimizer misinterpreting something. We should get the assembly code from the benchmar

Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]

2024-10-14 Thread via GitHub
shubhamvishu commented on PR #12868: URL: https://github.com/apache/lucene/pull/12868#issuecomment-2410748497 @jpountz I simplified the expression now. Let me know if the change looks good? Thanks!

Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]

2024-10-14 Thread via GitHub
shubhamvishu commented on code in PR #12868: URL: https://github.com/apache/lucene/pull/12868#discussion_r1799172633 ## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java: ## @@ -150,9 +149,10 @@ private FuzzySet(FixedBitSet filter, int bloomSize, int hashCount

Re: [PR] Better handle dynamic pruning when the leading clause has a single impact block. [lucene]

2024-10-14 Thread via GitHub
jpountz commented on PR #13904: URL: https://github.com/apache/lucene/pull/13904#issuecomment-2410663119 I confirmed that this makes things better when all clauses are `ConstantScoreQuery`s, luceneutil doesn't report a slowdown: ``` TaskQPS baseline

Re: [PR] Use RandomAccessInput instead of seeking in Lucene90DocValuesProducer [lucene]

2024-10-14 Thread via GitHub
original-brownbear commented on PR #13894: URL: https://github.com/apache/lucene/pull/13894#issuecomment-2410525472 Looks to me like a few inlining decisions changed (judging by the profiling and the relative weight of inlined versions vs non-inlined versions of some method). I'm not even e

[PR] Better handle dynamic pruning when the leading clause has a single impact block. [lucene]

2024-10-14 Thread via GitHub
jpountz opened a new pull request, #13904: URL: https://github.com/apache/lucene/pull/13904 `BlockMaxConjunctionBulkScorer` only checks if it can early exit based on impacts once per window, and windows are computed using impact blocks of the leading clause. So this logic is defeated if the

Re: [PR] PR 13757 follow-up: add missing with-discountOverlaps Similarity constructor variants, CHANGES.txt entries (#13845) [lucene]

2024-10-14 Thread via GitHub
javanna commented on code in PR #13891: URL: https://github.com/apache/lucene/pull/13891#discussion_r1798961357 ## lucene/CHANGES.txt: ## @@ -47,6 +52,9 @@ API Changes the entire segment should be scored. Subclasses that override the method should instead override its replac

Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]

2024-10-14 Thread via GitHub
javanna commented on issue #13898: URL: https://github.com/apache/lucene/issues/13898#issuecomment-2410385807 Yes to this! It would be great to combine this with setting the milestone :)

Re: [PR] Use RandomAccessInput instead of seeking in Lucene90DocValuesProducer [lucene]

2024-10-14 Thread via GitHub
jpountz commented on PR #13894: URL: https://github.com/apache/lucene/pull/13894#issuecomment-2410306410 Looks like there was a slowdown on some taxo facets tasks (which use binary doc values to store the taxonomy). `OrHighMedDayTaxoFacets`, `AndHighHighDayTaxoFacets` and `MedTermDayTaxoFac