Re: [I] New binary vector format doesn't perform well with small-dimension datasets [lucene]

2025-03-18 Thread via GitHub
benwtrent commented on issue #14342: URL: https://github.com/apache/lucene/issues/14342#issuecomment-2733531171 First, thank you @lpld for digging in and running these benchmarks! OK, I think I see the weirdness with the `mnist` data set. It's not about it being a transformer model, it

Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]

2025-03-18 Thread via GitHub
javanna commented on code in PR #14364: URL: https://github.com/apache/lucene/pull/14364#discussion_r2000898675 ## lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java: ## @@ -951,7 +951,16 @@ static IndexWriterConfig iwcWithSuggestField(Analyz

Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-18 Thread via GitHub
gf2121 commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2733285724 I'm seeing even results on `wikimediumall` ``` Task QPS baseline StdDev QPS my_modified_version StdDev Pct diff p-value

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-18 Thread via GitHub
dweiss commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2000465944 ## lucene/core/src/java/org/apache/lucene/util/automaton/Automata.java: ## @@ -608,7 +608,24 @@ public static Automaton makeStringUnion(Iterable utf8Strings) { if

Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-18 Thread via GitHub
gf2121 closed pull request #13521: Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree URL: https://github.com/apache/lucene/pull/13521 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-18 Thread via GitHub
gf2121 merged PR #14361: URL: https://github.com/apache/lucene/pull/14361

Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2025-03-18 Thread via GitHub
gf2121 commented on PR #13521: URL: https://github.com/apache/lucene/pull/13521#issuecomment-2732037641 Closing this in favor of https://github.com/apache/lucene/pull/14361.

Re: [I] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-18 Thread via GitHub
vigyasharma commented on issue #14362: URL: https://github.com/apache/lucene/issues/14362#issuecomment-2731919938 Hi @guojialiang92, Could you elaborate more on how you plan to use this capability? It's not immediately obvious why modifying `segmentInfos.counter` will help with peer recover

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-18 Thread via GitHub
rmuir commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2002272398 ## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java: ## @@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) { return alts;

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-18 Thread via GitHub
vigyasharma commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r2000345662 ## lucene/core/src/test/org/apache/lucene/index/TestMultiTenantMergeScheduler.java: ## @@ -0,0 +1,73 @@ +package org.apache.lucene.index; + +import org.apache.luce

Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]

2025-03-18 Thread via GitHub
stefanvodita commented on issue #13898: URL: https://github.com/apache/lucene/issues/13898#issuecomment-2733970610 Just to clarify - the restriction @dweiss mentioned applies to the `changelog-enforcer` action, but not to the `checkout` action we are using. @pseudo-nymous - I'm seeing

Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-18 Thread via GitHub
jpountz commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2734597577 Maybe we should stop only adding doc IDs to the `BulkAdder` if they are greater than the max collected doc so far. Skipping these doc IDs looks like it hurts vectorization, I played with

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-18 Thread via GitHub
msfroh commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2002184400 ## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java: ## @@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) { return alts;

Re: [I] New binary vector format doesn't perform well with small-dimension datasets [lucene]

2025-03-18 Thread via GitHub
benwtrent commented on issue #14342: URL: https://github.com/apache/lucene/issues/14342#issuecomment-2734567137 OK, a colleague and I spent some time digging into this and Option 0 (a bug) turned out to be the case. It's a 5-character change (like all good bugs), but here are the new recall

[PR] Fix for changelog verifier and milestone setter automation [lucene]

2025-03-18 Thread via GitHub
pseudo-nymous opened a new pull request, #14369: URL: https://github.com/apache/lucene/pull/14369 ### Description This pull request contains a fix for the changelog automation that was added recently. We have seen failures where either the diff calculation with respect to the base commit was wrong or base

Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]

2025-03-18 Thread via GitHub
pseudo-nymous commented on issue #13898: URL: https://github.com/apache/lucene/issues/13898#issuecomment-2735279420 @stefanvodita I have added a fix for it. Please take a look. https://github.com/apache/lucene/pull/14369

Re: [PR] Fix for changelog verifier and milestone setter automation [lucene]

2025-03-18 Thread via GitHub
pseudo-nymous commented on PR #14369: URL: https://github.com/apache/lucene/pull/14369#issuecomment-2735285383 We can fetch all history with the checkout action itself by setting the flag `fetch-depth: 0`, but that fetches the full history for all branches and tags, which is not required here.

Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]

2025-03-18 Thread via GitHub
gf2121 commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2735314110 Thanks for running the benchmark; the speedup is great! > Skipping these doc IDs looks like it hurts vectorization, I played with disabling these if statements locally and get a good s

[PR] Align comments with upgraded postings format [lucene]

2025-03-18 Thread via GitHub
amosbird opened a new pull request, #14368: URL: https://github.com/apache/lucene/pull/14368 ### Description Update prefetch heuristic comments to reflect that skip data is now inlined into postings lists.

Re: [PR] Align comments with upgraded postings format [lucene]

2025-03-18 Thread via GitHub
jpountz merged PR #14368: URL: https://github.com/apache/lucene/pull/14368

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-18 Thread via GitHub
vigyasharma commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r2001826532 ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,70 @@ +package org.apache.lucene.index; + +import java.util.concurren

Re: [PR] Align comments with upgraded postings format [lucene]

2025-03-18 Thread via GitHub
jpountz commented on PR #14368: URL: https://github.com/apache/lucene/pull/14368#issuecomment-2734638261 Good catch!