Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub
vsop-479 commented on code in PR #11888: URL: https://github.com/apache/lucene/pull/11888#discussion_r1542387588 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef targ

Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub
vsop-479 commented on code in PR #11888: URL: https://github.com/apache/lucene/pull/11888#discussion_r1542363416 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef targ

Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub
vsop-479 commented on code in PR #11888: URL: https://github.com/apache/lucene/pull/11888#discussion_r1542357624 ## lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99PostingsFormat.java: ## @@ -143,4 +141,13 @@ private void doTestImpactSerialization(List impact

Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub
vsop-479 commented on code in PR #11888: URL: https://github.com/apache/lucene/pull/11888#discussion_r1542233210 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef targ

Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub
vsop-479 commented on code in PR #11888: URL: https://github.com/apache/lucene/pull/11888#discussion_r1542231368 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef targ
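
The idea named in the PR title above can be illustrated in isolation. The following is a minimal sketch, not the code from PR #11888 or from `SegmentTermsEnumFrame`: the class name `FixedLengthSuffixSearch`, the method `search`, and the flat `byte[]` layout are all hypothetical. It only assumes that a leaf block stores its suffixes sorted, back to back, and all with the same length, so entry `i` starts at offset `i * suffixLength` and can be reached without scanning.

```java
import java.util.Arrays;

/** Hypothetical illustration; not Lucene's actual block tree code. */
public final class FixedLengthSuffixSearch {
  private FixedLengthSuffixSearch() {}

  /**
   * Binary searches sorted, equal-length suffixes packed back to back in {@code block}.
   *
   * @return the entry index if {@code target} is present, otherwise {@code -(insertionPoint) - 1}
   */
  public static int search(byte[] block, int entryCount, int suffixLength, byte[] target) {
    int lo = 0;
    int hi = entryCount - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      // Equal lengths make the offset of any entry a simple multiplication.
      int from = mid * suffixLength;
      // Unsigned byte-wise comparison of the entry against the target.
      int cmp =
          Arrays.compareUnsigned(block, from, from + suffixLength, target, 0, target.length);
      if (cmp < 0) {
        lo = mid + 1;
      } else if (cmp > 0) {
        hi = mid - 1;
      } else {
        return mid; // exact match
      }
    }
    return -(lo + 1); // not found; encodes where the target would be inserted
  }
}
```

With variable-length suffixes the offset of entry `i` is not known without reading the preceding lengths, which is why this shortcut is limited to blocks where all suffixes have the same length.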

Re: [PR] remove now-unnecessary snowball mojibake hack [lucene]

2024-03-27 Thread via GitHub
rmuir commented on PR #13231: URL: https://github.com/apache/lucene/pull/13231#issuecomment-2024165548 Give me some time for a couple more PRs to get this shell script doing less, and I think we'll be able to totally nuke it. I kept the `iconv` check here for UTF-8 correctness becaus

[PR] remove now-unnecessary snowball mojibake hack [lucene]

2024-03-27 Thread via GitHub
rmuir opened a new pull request, #13231: URL: https://github.com/apache/lucene/pull/13231 Remove this hack, to reduce more logic in this script. It is no longer needed as of https://github.com/snowballstem/snowball-website/commit/b934d6b565e268b3db080140cc145f532cd6e648 -- This is

Re: [PR] More consistently use a SEQUENTIAL ReadAdvice for merging. [lucene]

2024-03-27 Thread via GitHub
uschindler commented on code in PR #13229: URL: https://github.com/apache/lucene/pull/13229#discussion_r1542132603 ## lucene/core/src/java/org/apache/lucene/store/IOContext.java: ## @@ -88,4 +84,18 @@ public IOContext(MergeInfo mergeInfo) { // Merges read input segments seq

Re: [PR] More consistently use a SEQUENTIAL ReadAdvice for merging. [lucene]

2024-03-27 Thread via GitHub
uschindler commented on code in PR #13229: URL: https://github.com/apache/lucene/pull/13229#discussion_r1542131095 ## lucene/core/src/java/org/apache/lucene/store/IOContext.java: ## @@ -88,4 +84,18 @@ public IOContext(MergeInfo mergeInfo) { // Merges read input segments seq

Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-27 Thread via GitHub
uschindler commented on PR #13206: URL: https://github.com/apache/lucene/pull/13206#issuecomment-2024093146 Could you add a changes.txt entry in the 9.11 bugfix section? Will merge this PR tomorrow. -- This is an automated message from the Apache Git Service. To respond to the message, pl

Re: [PR] upgrade snowball to 34f3612e5e8c (round two) [lucene]

2024-03-27 Thread via GitHub
rmuir merged PR #13227: URL: https://github.com/apache/lucene/pull/13227 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apach

Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub
mikemccand commented on code in PR #11888: URL: https://github.com/apache/lucene/pull/11888#discussion_r1541971598 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef ta

Re: [PR] Add new pluggable vector similarity to field info [lucene]

2024-03-27 Thread via GitHub
benwtrent commented on PR #13200: URL: https://github.com/apache/lucene/pull/13200#issuecomment-2023902003 Tests still all fail, but now I think it compiles. Many deprecation warnings to go through and clean up still. One concern I had was on `FieldInfo`. Do we want to ask for a fully

Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub
mikemccand commented on code in PR #11888: URL: https://github.com/apache/lucene/pull/11888#discussion_r1541892891 ## lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99PostingsFormat.java: ## @@ -143,4 +141,13 @@ private void doTestImpactSerialization(List impa

Re: [PR] Mark TimeLimitingCollector as deprecated [lucene]

2024-03-27 Thread via GitHub
vigyasharma commented on PR #13220: URL: https://github.com/apache/lucene/pull/13220#issuecomment-2023475693 @jpountz I was checking for consensus. I'm aligned with deprecating.

Re: [PR] Clean up variable-gaps terms format. [lucene]

2024-03-27 Thread via GitHub
jpountz merged PR #13216: URL: https://github.com/apache/lucene/pull/13216

Re: [PR] Clean up variable-gaps terms format. [lucene]

2024-03-27 Thread via GitHub
jpountz commented on PR #13216: URL: https://github.com/apache/lucene/pull/13216#issuecomment-2023348250 I'm merging only to `main` for now.

Re: [PR] Test KNN query works seamlessly regardless of underlying format [lucene]

2024-03-27 Thread via GitHub
tteofili commented on code in PR #13225: URL: https://github.com/apache/lucene/pull/13225#discussion_r1541406253 ## lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java: ## @@ -949,4 +951,58 @@ public int hashCode() { return 31 * classHash() + doc

Re: [PR] upgrade snowball to 34f3612e5e8c (round two) [lucene]

2024-03-27 Thread via GitHub
rmuir commented on PR #13227: URL: https://github.com/apache/lucene/pull/13227#issuecomment-2023119889 I know, that seems crazy, but I think it would be the ultimate goal. Will have to find an "easy" / "dead-simple" way to achieve publishing snowball properly that avoids craziness of maven

Re: [PR] Use `IOContext#RANDOM` when appropriate. [lucene]

2024-03-27 Thread via GitHub
jpountz commented on PR #13222: URL: https://github.com/apache/lucene/pull/13222#issuecomment-2023049088 I opened #13229, is it what you had in mind?

Re: [PR] Test KNN query works seamlessly regardless of underlying format [lucene]

2024-03-27 Thread via GitHub
tteofili commented on code in PR #13225: URL: https://github.com/apache/lucene/pull/13225#discussion_r1541314263 ## lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java: ## @@ -949,4 +951,58 @@ public int hashCode() { return 31 * classHash() + doc

Re: [PR] Test KNN query works seamlessly regardless of underlying format [lucene]

2024-03-27 Thread via GitHub
mayya-sharipova commented on code in PR #13225: URL: https://github.com/apache/lucene/pull/13225#discussion_r1541305072 ## lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java: ## @@ -949,4 +951,58 @@ public int hashCode() { return 31 * classHash(

Re: [PR] upgrade snowball to 34f3612e5e8c (round two) [lucene]

2024-03-27 Thread via GitHub
rmuir commented on PR #13227: URL: https://github.com/apache/lucene/pull/13227#issuecomment-2022952918 > The only downside is that we can't detect anymore if snowball code calls unsafe shit like forgetting locales or charsets. And we also won't if it starts being published as jar and

Re: [PR] upgrade snowball to 34f3612e5e8c (round two) [lucene]

2024-03-27 Thread via GitHub
uschindler commented on PR #13227: URL: https://github.com/apache/lucene/pull/13227#issuecomment-2022951336 > thanks, i will do this. It isn't an option to call lucene ArrayUtil methods etc from snowball code. The only downside is that we can't detect anymore if snowball code calls u

Re: [PR] upgrade snowball to 34f3612e5e8c (round two) [lucene]

2024-03-27 Thread via GitHub
rmuir commented on PR #13227: URL: https://github.com/apache/lucene/pull/13227#issuecomment-2022934383 thanks, i will do this. It isn't an option to call lucene ArrayUtil methods etc from snowball code.

Re: [PR] Use singleton for single block, all-zeros DirectMonotonicReader.Meta [lucene]

2024-03-27 Thread via GitHub
benwtrent commented on PR #13224: URL: https://github.com/apache/lucene/pull/13224#issuecomment-2022932116 I agree with @jpountz 's concern here. But I think the `zero` checks can be done as things are read in and we can return the static object.

Re: [PR] upgrade snowball to 34f3612e5e8c (round two) [lucene]

2024-03-27 Thread via GitHub
uschindler commented on PR #13227: URL: https://github.com/apache/lucene/pull/13227#issuecomment-2022932644 > @uschindler How can i exclude the org.tartarus code from this check (or really all checks) without touching it? > > ``` > Forbidden method invocation: java.util.Arrays#copy

Re: [PR] Disjunction as CompetitiveIterator for numeric dynamic pruning [lucene]

2024-03-27 Thread via GitHub
gf2121 commented on PR #13221: URL: https://github.com/apache/lucene/pull/13221#issuecomment-2022926905 I ran 10 rounds on the wikimediumall index (normal index, no index sorting / force merge). The result looks positive in general, but we do see a slight regression for high-cardinality field in sev

[I] Enhance ContinuousIds optimisation to store the diff between docIds as a Vint [lucene]

2024-03-27 Thread via GitHub
expani opened a new issue, #13228: URL: https://github.com/apache/lucene/issues/13228 ### Description One of the optimisations introduced by [LUCENE-10233](https://issues.apache.org/jira/browse/LUCENE-10233) was to compress continuous doc Ids (strictly sorted) by only storing the sta

Re: [PR] upgrade snowball to 34f3612e5e8c (round two) [lucene]

2024-03-27 Thread via GitHub
rmuir commented on PR #13227: URL: https://github.com/apache/lucene/pull/13227#issuecomment-2022906420 @uschindler How can i exclude the org.tartarus code from this check (or really all checks) without touching it? ``` Forbidden method invocation: java.util.Arrays#copyOf(**) [Prefe

Re: [PR] Use `IOContext#RANDOM` when appropriate. [lucene]

2024-03-27 Thread via GitHub
jpountz commented on PR #13222: URL: https://github.com/apache/lucene/pull/13222#issuecomment-2022902471 > What do you think? I had similar thoughts in mind, so that sounds good to me. I'm still curious about how to fix the bigger issue wrt reader pooling. Should `getMergeInsta

Re: [PR] Mark TimeLimitingCollector as deprecated [lucene]

2024-03-27 Thread via GitHub
jpountz commented on PR #13220: URL: https://github.com/apache/lucene/pull/13220#issuecomment-2022882774 @vigyasharma I'd like to double check with you if you're good with deprecating before merging, I'm not sure if your previous comment was candidly checking for consensus, or if you were i

Re: [PR] upgrade snowball to 26db1ab9adbf437f37a6facd3ee2aad1da9eba03 [lucene]

2024-03-27 Thread via GitHub
rmuir merged PR #13209: URL: https://github.com/apache/lucene/pull/13209

Re: [PR] Use singleton for single block, all-zeros DirectMonotonicReader.Meta [lucene]

2024-03-27 Thread via GitHub
jpountz commented on PR #13224: URL: https://github.com/apache/lucene/pull/13224#issuecomment-2022824497 The idea makes sense to me, but the fact that we're checking for a specific `blockShift` of `16` looks fragile to me. If codecs change the value of `blockShift` tomorrow, this will break

[I] Segment with heavy deletes not picked for merge in TieredMergePolicy [lucene]

2024-03-27 Thread via GitHub
khushbr opened a new issue, #13226: URL: https://github.com/apache/lucene/issues/13226 ### Description We have a cluster, running on Lucene v8.7.0 and configured with `TieredMergePolicy`. We are seeing a peculiar behavior where segments with heavy deletes are not

[PR] Test KNN query works seamlessly regardless of underlying format [lucene]

2024-03-27 Thread via GitHub
tteofili opened a new pull request, #13225: URL: https://github.com/apache/lucene/pull/13225 ### Description This introduces just a test to check that a `KnnVectorQuery` runs the same when the same field is indexed with different `KnnVectorFormats` (e.g. `Lucene99HnswVectorsFormat` and

Re: [PR] Use singleton for single block, all-zeros DirectMonotonicReader.Meta [lucene]

2024-03-27 Thread via GitHub
original-brownbear commented on code in PR #13224: URL: https://github.com/apache/lucene/pull/13224#discussion_r1540997664 ## lucene/core/src/java/org/apache/lucene/util/packed/DirectMonotonicReader.java: ## @@ -39,6 +39,9 @@ public final class DirectMonotonicReader extends Long

Re: [PR] Expand scalar quantization with adding half-byte (int4) quantization [lucene]

2024-03-27 Thread via GitHub
benwtrent commented on code in PR #13197: URL: https://github.com/apache/lucene/pull/13197#discussion_r1540992353 ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/VectorUtilBenchmark.java: ## @@ -36,10 +36,12 @@ public class VectorUtilBenchmark { private byt

Re: [PR] Use singleton for single block, all-zeros DirectMonotonicReader.Meta [lucene]

2024-03-27 Thread via GitHub
benwtrent commented on code in PR #13224: URL: https://github.com/apache/lucene/pull/13224#discussion_r1540991260 ## lucene/core/src/java/org/apache/lucene/util/packed/DirectMonotonicReader.java: ## @@ -39,6 +39,9 @@ public final class DirectMonotonicReader extends LongValues i

Re: [PR] Add timeout support to AbstractKnnVectorQuery [lucene]

2024-03-27 Thread via GitHub
benwtrent commented on PR #13202: URL: https://github.com/apache/lucene/pull/13202#issuecomment-2022583444 Looking at the benchmarking, we are adding a 5% overhead to all vector operations when using float32. As vector operations get faster (consider hamming distance with exploring more vec

[PR] Use singleton for single block, all-zeros DirectMonotonicReader.Meta [lucene]

2024-03-27 Thread via GitHub
original-brownbear opened a new pull request, #13224: URL: https://github.com/apache/lucene/pull/13224 Having a single block of all zeros is a fairly common case that is using a lot of heap for duplicate instances in some use-cases in ES. => read a singleton for it to save the duplication

Re: [PR] Expand scalar quantization with adding half-byte (int4) quantization [lucene]

2024-03-27 Thread via GitHub
tteofili commented on code in PR #13197: URL: https://github.com/apache/lucene/pull/13197#discussion_r1540836373 ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/VectorUtilBenchmark.java: ## @@ -36,10 +36,12 @@ public class VectorUtilBenchmark { private byte

Re: [PR] Expand scalar quantization with adding half-byte (int4) quantization [lucene]

2024-03-27 Thread via GitHub
tteofili commented on PR #13197: URL: https://github.com/apache/lucene/pull/13197#issuecomment-2022409522 I tend to agree on being opinionated on a set of allowed configurations for what concerns the number of bits (4 and 7). Given the speed-space trade-off for packing, I think it's usefu

[PR] Recommend lowering the default mmap readahead. [lucene]

2024-03-27 Thread via GitHub
jpountz opened a new pull request, #13223: URL: https://github.com/apache/lucene/pull/13223 This is a follow-up of a discussion on #13219. `mmap` has a higher readahead than regular `read()` operations by default, e.g. 128kB instead of 16kB on my Linux box. On indexes that exceed the size o

Re: [PR] Use `IOContext#RANDOM` when appropriate. [lucene]

2024-03-27 Thread via GitHub
uschindler commented on PR #13222: URL: https://github.com/apache/lucene/pull/13222#issuecomment-2022362746 About the announced comment: When we merge we want to use sequential, as the kernel may free the pages earlier. But actually I am not sure if we really need this: After merging the f

Re: [PR] Use `IOContext#RANDOM` when appropriate. [lucene]

2024-03-27 Thread via GitHub
uschindler commented on PR #13222: URL: https://github.com/apache/lucene/pull/13222#issuecomment-2022357798 In addition, we should still look into getting the IOContexts correct when we merge. The current solution is not ideal, but somehow not really changeable. When you clone an indexinput

Re: [PR] Get better cost estimate on MultiTermQuery over few terms [lucene]

2024-03-27 Thread via GitHub
rquesada-tibco commented on code in PR #13201: URL: https://github.com/apache/lucene/pull/13201#discussion_r1540777517 ## lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java: ## @@ -292,7 +292,21 @@ public long cost() { }; }

Re: [PR] Disjunction as CompetitiveIterator for numeric dynamic pruning [lucene]

2024-03-27 Thread via GitHub
gf2121 commented on PR #13221: URL: https://github.com/apache/lucene/pull/13221#issuecomment-2022295704 I separately built two `wikimedium10m` indices, force merged and reverse sorted by `dayOfYear`/`lastMod`, and here is the result: **wikimedium10m.lucene_baseline.Lucene99.dvfield

Re: [PR] Use `IOContext#RANDOM` when appropriate. [lucene]

2024-03-27 Thread via GitHub
jpountz commented on PR #13222: URL: https://github.com/apache/lucene/pull/13222#issuecomment-2022301350 No worries, Uwe. Looking forward to your suggestions.

Re: [PR] Use `IOContext#RANDOM` when appropriate. [lucene]

2024-03-27 Thread via GitHub
uschindler commented on PR #13222: URL: https://github.com/apache/lucene/pull/13222#issuecomment-2022255027 Hi, I have a problem regarding merging with it, and a suggestion. Please hold off on merging.

Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub
vsop-479 commented on PR #11888: URL: https://github.com/apache/lucene/pull/11888#issuecomment-2022219035 @mikemccand Thanks for your review. I measured performance on `wikimediumall`: # iter1 TaskQPS baseline StdDevQPS my_modified_version StdDev

Re: [PR] Disjunction as CompetitiveIterator for numeric dynamic pruning [lucene]

2024-03-27 Thread via GitHub
gf2121 commented on PR #13221: URL: https://github.com/apache/lucene/pull/13221#issuecomment-2022184017 > A downside is that this approach may be less memory-efficient, since we store competitive docs as integers, never as a bit set like today. But we may be able to work around it by just s

Re: [I] Replace boolean flags on IOContext with an enum [lucene]

2024-03-27 Thread via GitHub
jpountz closed issue #13211: Replace boolean flags on IOContext with an enum URL: https://github.com/apache/lucene/issues/13211

Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-27 Thread via GitHub
jpountz merged PR #13219: URL: https://github.com/apache/lucene/pull/13219

Re: [PR] Break point estimate when threshold exceeded [lucene]

2024-03-27 Thread via GitHub
gf2121 commented on PR #13199: URL: https://github.com/apache/lucene/pull/13199#issuecomment-2022144257 Nightly benchmark: https://home.apache.org/~mikemccand/lucenebench/TermDTSort.html https://home.apache.org/~mikemccand/lucenebench/TermDayOfYearSort.html https://home.apache.org/~m