Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-03-14 Thread via GitHub
jpountz commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2724038977 I have some small concerns: - The fact that the 512 step is tied to the number of points per leaf, though it's not a big deal at all, postings are similar: their encoding logic is sp

Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-14 Thread via GitHub
jpountz commented on PR #14333: URL: https://github.com/apache/lucene/pull/14333#issuecomment-2724046501 I started looking at the code but you would know better: does this new encoding make it easier to know the length of leaf blocks while traversing the terms index so that we could prefetc

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724314718 Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton). -- This is an automated message from the Apache Git Service. To res

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-14 Thread via GitHub
DivyanshIITB commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2724394013 Just a gentle reminder

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724062564 I don't know Unicode as well as Rob so I can't say what these alternate case folding equivalence classes are... but they definitely don't have a "canonical" representation with rega

Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-14 Thread via GitHub
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r1994867386 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java: ## @@ -0,0 +1,486 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one o

Re: [PR] Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset [lucene]

2025-03-14 Thread via GitHub
thecoop commented on code in PR #14304: URL: https://github.com/apache/lucene/pull/14304#discussion_r1987194449 ## lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -907,4 +907,87 @@ public static long int4BitDotProduct128(byte[]

Re: [PR] Speed up scoring conjunctions a bit. [lucene]

2025-03-14 Thread via GitHub
jpountz commented on PR #14345: URL: https://github.com/apache/lucene/pull/14345#issuecomment-2724895262 Nightly benchmarks confirmed the speedup: https://benchmarks.mikemccandless.com/FilteredAndHighHigh.html. I'll push an annotation.

Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]

2025-03-14 Thread via GitHub
jpountz commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2724923272 Apologies I had missed your reply. > should this be a shared global pool across all IndexWriters, or should each writer have its own pool? It should be shared, we don't want

Re: [PR] Improve DenseConjunctionBulkScorer's sparse fallback. [lucene]

2025-03-14 Thread via GitHub
jpountz merged PR #14354: URL: https://github.com/apache/lucene/pull/14354

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725496380 Ok, fair enough.

[PR] removing constructor with deprecated attribute 'onlyLongestMatch' [lucene]

2025-03-14 Thread via GitHub
renatoh opened a new pull request, #14356: URL: https://github.com/apache/lucene/pull/14356 ### Description

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724585337 > Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton). I don't think this is currently a good option either: users wo

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725736846 It isn't a good idea. If the user wants to "erase case differences" then they should apply `foldcase(ch)`. That's what case-folding means. That CaseFolding class does everything, except, t

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724580292 This is why I recommended not using the Unicode function and starting simple. Then you have a potential way to get it working efficiently.

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725709282 This is kind of what I had in mind: ```java private static int canonicalize(int codePoint) { int[] alternatives = CaseFolding.lookupAlternates(codePoint); if (alte

Re: [PR] PointInSetQuery use reverse collection to improve performance [lucene]

2025-03-14 Thread via GitHub
hanbj commented on PR #14352: URL: https://github.com/apache/lucene/pull/14352#issuecomment-2724230306 Thank you for providing ideas. In scenarios with multiple dimensions, the internal nodes in the bkd tree can only be sorted according to a certain dimension. Different internal nodes may h

Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-03-14 Thread via GitHub
jpountz commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2726015514 Thanks for running benchmarks. So it looks like the JVM doesn't think these shorter loops (with step 128) are worth unrolling? This makes me wonder how something like that performs on y

Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-03-14 Thread via GitHub
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2725390772 > There must be something that happens with this 512 step that doesn't happen otherwise such as using different instructions, loop unrolling, better CPU pipelining or something else.

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2726097192 Hmm... I'm thinking of just requiring that input is lowercase (per `Character.toLowerCase(c)`), then check for collisions on uppercase versions when adding transitions, and throw an excepti
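
Editor's sketch of one possible reading of the comment above, since the snippet is cut off before the details. It is not code from the PR. The class `CaseCollisionSketch` and its methods are hypothetical names, and the single-state label map is a simplifying assumption that stands in for a real automaton builder. Only `Character.toLowerCase(int)` and `Character.toUpperCase(int)` are real JDK calls.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; this is not Lucene code. */
public class CaseCollisionSketch {
  // For one state: maps each transition label to the lowercase
  // code point that caused the label to be added.
  private final Map<Integer, Integer> labelToSource = new HashMap<>();

  /** Adds a transition for a lowercase code point and for its uppercase form. */
  public void addTransition(int codePoint) {
    // Assumed requirement from the comment: input must already be lowercase.
    if (Character.toLowerCase(codePoint) != codePoint) {
      throw new IllegalArgumentException(
          "input must be lowercase: U+" + Integer.toHexString(codePoint));
    }
    int upper = Character.toUpperCase(codePoint);
    // Check both labels before mutating, so a rejected call leaves no partial state.
    checkNoCollision(codePoint, codePoint);
    checkNoCollision(upper, codePoint);
    labelToSource.put(codePoint, codePoint);
    labelToSource.put(upper, codePoint);
  }

  // Throws if the label is already owned by a different lowercase code point.
  private void checkNoCollision(int label, int source) {
    Integer previous = labelToSource.get(label);
    if (previous != null && previous != source) {
      throw new IllegalArgumentException(
          "case collision on U+" + Integer.toHexString(label));
    }
  }

  /** Returns true if this state has a transition on the given code point. */
  public boolean accepts(int codePoint) {
    return labelToSource.containsKey(codePoint);
  }
}
```

Under these assumptions, adding `s` creates transitions on both `s` and `S`. Adding the long s (U+017F) afterwards is rejected, because its uppercase form is also `S` and that label is already owned by `s`.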