Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1822293603 There seems to be a speedup on [prefix queries](http://people.apache.org/~mikemccand/lucenebench/Prefix3.html) in nightly benchmarks, I'll add an annotation. -- This is an automated m

[PR] Skip decoding tail freqs when they are not needed. [lucene]

2023-11-22 Thread via GitHub
jpountz opened a new pull request, #12832: URL: https://github.com/apache/lucene/pull/12832 When we moved to group-varint for tail postings, we stopped interleaving docs and freqs and instead wrote all docs first, then all freqs. This means that we can now skip decoding frequencies when they a
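To illustrate why the docs-first layout lets a reader skip frequencies, here is a minimal sketch. It is not Lucene's actual on-disk format: for brevity it uses plain varints rather than group-varint, and the class and method names (`TailLayoutSketch`, `encode`, `decodeDocsOnly`, `readVInts`) are hypothetical. The point is only that once all doc deltas are written before all freqs, a consumer that does not need freqs can stop after reading the doc deltas.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Lucene code: doc deltas are written first, then freqs.
public class TailLayoutSketch {

  // Writes a non-negative int as a varint: 7 bits per byte, high bit set on all but the last byte.
  static void writeVInt(ByteArrayOutputStream out, int v) {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }

  // Layout: [all doc deltas][all freqs], with no interleaving.
  static byte[] encode(int[] docDeltas, int[] freqs) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int d : docDeltas) {
      writeVInt(out, d);
    }
    for (int f : freqs) {
      writeVInt(out, f);
    }
    return out.toByteArray();
  }

  // Reads `count` varints starting at pos[0] and advances pos[0] past them.
  static int[] readVInts(byte[] buf, int[] pos, int count) {
    int[] values = new int[count];
    for (int i = 0; i < count; i++) {
      int v = 0;
      int shift = 0;
      byte b;
      do {
        b = buf[pos[0]++];
        v |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      values[i] = v;
    }
    return values;
  }

  // A consumer that does not need freqs reads the doc deltas and never touches the freq bytes.
  static int[] decodeDocsOnly(byte[] buf, int count) {
    return readVInts(buf, new int[] {0}, count);
  }
}
```

With an interleaved layout, every freq would sit between two doc deltas, so the reader would have to decode (or at least step over) each freq to reach the next doc.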

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1822300926 Also the [size](http://people.apache.org/~mikemccand/lucenebench/indexing.html#FixedIndexSize) increase is hardly noticeable.

Re: [PR] Skip decoding tail freqs when they are not needed. [lucene]

2023-11-22 Thread via GitHub
easyice commented on PR #12832: URL: https://github.com/apache/lucene/pull/12832#issuecomment-1822307766 Great idea :)

Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-22 Thread via GitHub
gf2121 commented on code in PR #12810: URL: https://github.com/apache/lucene/pull/12810#discussion_r1401710042 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/LegacyMultiLevelSkipListReader.java: ## @@ -0,0 +1,263 @@ +/* + * Licensed to the Apache Software F

Re: [PR] Skip decoding tail freqs when they are not needed. [lucene]

2023-11-22 Thread via GitHub
jpountz merged PR #12832: URL: https://github.com/apache/lucene/pull/12832

Re: [PR] Skip decoding tail freqs when they are not needed. [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12832: URL: https://github.com/apache/lucene/pull/12832#issuecomment-1822375974 Thanks @easyice and @gf2121 for looking!

Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-22 Thread via GitHub
jpountz commented on code in PR #12810: URL: https://github.com/apache/lucene/pull/12810#discussion_r1401733088 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/LegacyMultiLevelSkipListReader.java: ## @@ -0,0 +1,263 @@ +/* + * Licensed to the Apache Software

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1822453957 For reference, I computed the most frequent `flag` values on wikibigall, which are the values that might be worth optimizing for: - 0x55 (4 2-byte ints): 29.6% - 0xaa (4 3-byte
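As a rough illustration of how a `flag` byte maps to value widths: in group-varint, each group of four ints is preceded by one flag byte holding four 2-bit fields, and each field stores the byte length of one value minus one. The sketch below assumes the first value of the group occupies the two highest bits of the flag; that ordering, and the names `GroupVIntFlagSketch`, `numBytes`, `flagFor` and `lengthsFromFlag`, are illustrative assumptions rather than a copy of Lucene's code.

```java
// Hypothetical sketch of group-varint flag bytes; not Lucene's implementation.
public class GroupVIntFlagSketch {

  // Number of bytes (1 to 4) needed to store v, treating v as unsigned.
  static int numBytes(int v) {
    return Math.max(1, 4 - (Integer.numberOfLeadingZeros(v) >>> 3));
  }

  // Builds the flag for a group of four values: 2 bits per value, each holding (byte length - 1).
  // Assumption: the first value lands in the highest two bits.
  static int flagFor(int[] group) {
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      flag = (flag << 2) | (numBytes(group[i]) - 1);
    }
    return flag;
  }

  // Recovers the four byte lengths from a flag byte.
  static int[] lengthsFromFlag(int flag) {
    int[] lengths = new int[4];
    for (int i = 0; i < 4; i++) {
      int shift = 6 - 2 * i;
      lengths[i] = ((flag >>> shift) & 0x3) + 1;
    }
    return lengths;
  }
}
```

Under this convention, 0x55 is binary 01 01 01 01, meaning four values of two bytes each, and 0xaa is binary 10 10 10 10, meaning four values of three bytes each. This is why a handful of flag values can dominate when most tail values fall in a narrow size range.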

[PR] Improve group-varint benchmark to reproduce value distribution of wikbigall. [lucene]

2023-11-22 Thread via GitHub
jpountz opened a new pull request, #12833: URL: https://github.com/apache/lucene/pull/12833 Instead of using a fixed number of bits per value, the group-varint benchmark now tries to reproduce the distribution of the number of bits per value that can be observed on tail postings of wikibig

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-22 Thread via GitHub
easyice commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1822504456 It's very important as a reference! Thanks a lot!

Re: [PR] Improve group-varint benchmark to reproduce value distribution of wikbigall. [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12833: URL: https://github.com/apache/lucene/pull/12833#issuecomment-1822504924 Here is the output of the benchmark on my machine: ``` Benchmark (size) Mode Cnt Score Error Units GroupVIntBenchmark.byteArrayRead

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1822506995 I opened a PR to feed some of this data into the micro benchmark to make it more realistic: https://github.com/apache/lucene/pull/12833.

Re: [PR] Improve group-varint benchmark to reproduce value distribution of wikbigall. [lucene]

2023-11-22 Thread via GitHub
gf2121 commented on code in PR #12833: URL: https://github.com/apache/lucene/pull/12833#discussion_r1401871396 ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/GroupVIntBenchmark.java: ## @@ -103,11 +127,16 @@ void initByteBufferInput(long[] docs) throws Exceptio

Re: [PR] Improve group-varint benchmark to reproduce value distribution of wikbigall. [lucene]

2023-11-22 Thread via GitHub
jpountz commented on code in PR #12833: URL: https://github.com/apache/lucene/pull/12833#discussion_r1401912046 ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/GroupVIntBenchmark.java: ## @@ -103,11 +127,16 @@ void initByteBufferInput(long[] docs) throws Excepti

[PR] hunspell: allow in-memory entry sorting for faster dictionary loading [lucene]

2023-11-22 Thread via GitHub
donnerpeter opened a new pull request, #12834: URL: https://github.com/apache/lucene/pull/12834 ### Description

Re: [PR] hunspell: allow in-memory entry sorting for faster dictionary loading [lucene]

2023-11-22 Thread via GitHub
donnerpeter commented on code in PR #12834: URL: https://github.com/apache/lucene/pull/12834#discussion_r1401913469 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/SortingStrategy.java: ## @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundatio

Re: [PR] Improve group-varint benchmark to reproduce value distribution of wikbigall. [lucene]

2023-11-22 Thread via GitHub
easyice commented on PR #12833: URL: https://github.com/apache/lucene/pull/12833#issuecomment-1822611956 Looks good to me, thank you @jpountz. Separately, I'm a bit curious why `byteArrayReadGroupVInt` is so much faster than `byteBufferReadGroupVInt`.

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-22 Thread via GitHub
mikemccand commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r1400933055 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -1190,4 +1176,63 @@ public void seekExact(long ord) { public lon

Re: [PR] Make FSTCompiler.Builder build() throw IOException [lucene]

2023-11-22 Thread via GitHub
mikemccand merged PR #12830: URL: https://github.com/apache/lucene/pull/12830

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-22 Thread via GitHub
mikemccand commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1402017454 ## lucene/core/src/java/org/apache/lucene/util/fst/ByteBuffersFSTReader.java: ## @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-22 Thread via GitHub
mikemccand commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1402022525 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -435,6 +433,13 @@ public FST(FSTMetadata metadata, DataInput in, Outputs outputs, FSTStore f

Re: [I] Remove the FST constructors with DataInput for metadata [lucene]

2023-11-22 Thread via GitHub
mikemccand closed issue #12822: Remove the FST constructors with DataInput for metadata URL: https://github.com/apache/lucene/issues/12822

Re: [PR] Remove FST constructors with DataInput for metadata [lucene]

2023-11-22 Thread via GitHub
mikemccand merged PR #12803: URL: https://github.com/apache/lucene/pull/12803

Re: [PR] Remove FST constructors with DataInput for metadata [lucene]

2023-11-22 Thread via GitHub
mikemccand commented on PR #12803: URL: https://github.com/apache/lucene/pull/12803#issuecomment-1822773144 Hmm trying to backport but `FSTTermsReader.java` had conflicts which I tried to resolve and then scary test failures and now I ran out of time for the moment! Will take it up again s

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-22 Thread via GitHub
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1402058230 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -435,6 +433,13 @@ public FST(FSTMetadata metadata, DataInput in, Outputs outputs, FSTStore f

Re: [PR] Improve group-varint benchmark to reproduce value distribution of wikbigall. [lucene]

2023-11-22 Thread via GitHub
jpountz merged PR #12833: URL: https://github.com/apache/lucene/pull/12833

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-22 Thread via GitHub
gf2121 commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r1402112063 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -1190,4 +1176,63 @@ public void seekExact(long ord) { public long or

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-22 Thread via GitHub
gf2121 commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r1402119508 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -1190,4 +1176,63 @@ public void seekExact(long ord) { public long or

Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]

2023-11-22 Thread via GitHub
easyice commented on issue #12826: URL: https://github.com/apache/lucene/issues/12826#issuecomment-1822845516 Sorry for the late reply. I got lost in a wrong JMH loop for a while, but now I have the correct result: `memorySegmentReadGroupVInt` is faster than `byteBufferReadGroupVInt` in

Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]

2023-11-22 Thread via GitHub
jpountz commented on issue #12826: URL: https://github.com/apache/lucene/issues/12826#issuecomment-1822851422 Cool!

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-22 Thread via GitHub
gf2121 commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r1402131137 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -104,13 +104,9 @@ public SegmentTermsEnumFrame(SegmentTermsEnum st

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-22 Thread via GitHub
gf2121 commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r1402136479 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -247,7 +243,7 @@ void rewind() { nextEnt = -1; hasTerms

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-22 Thread via GitHub
gf2121 commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r1402138507 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnumFrame.java: ## @@ -142,12 +138,20 @@ public void setState(int state) { } v

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-22 Thread via GitHub
gf2121 commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1822923162 Thanks for the review, @mikemccand! > but some head scratching -- hard to remember how these two crazy iterators work. Agreed that this is head-scratching... I made a chart to try

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-11-22 Thread via GitHub
jpountz commented on code in PR #12622: URL: https://github.com/apache/lucene/pull/12622#discussion_r1402306694 ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -3475,6 +3475,8 @@ public void addIndexesReaderMerge(MergePolicy.OneMerge merge) throws IOExce

Re: [PR] Add KeywordField and StringValueFacetCounts example [lucene]

2023-11-22 Thread via GitHub
msokolov merged PR #12817: URL: https://github.com/apache/lucene/pull/12817

Re: [PR] Add KeywordField and StringValueFacetCounts example [lucene]

2023-11-22 Thread via GitHub
msokolov commented on PR #12817: URL: https://github.com/apache/lucene/pull/12817#issuecomment-1823101661 It just occurred to me; perhaps we should add a CHANGE log entry? And it could be nice to backport to 9.x if you like

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-11-22 Thread via GitHub
jpountz commented on code in PR #12622: URL: https://github.com/apache/lucene/pull/12622#discussion_r1402394564 ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -5160,20 +5177,74 @@ public int length() { } mergeReaders.add(wrappedReader);

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12622: URL: https://github.com/apache/lucene/pull/12622#issuecomment-1823130379 @s1monw I pushed a commit that should address your feedback

Re: [PR] Add KeywordField and StringValueFacetCounts example [lucene]

2023-11-22 Thread via GitHub
stefanvodita commented on PR #12817: URL: https://github.com/apache/lucene/pull/12817#issuecomment-1823134763 Thank you @msokolov! I pushed a [commit](https://github.com/stefanvodita/lucene/commit/0b7498fe1af9ccb7a71df79655b3e3dbff3b253f) with a CHANGES entry. Do you need me to open a PR ag

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1823167735 In general, I like the idea of making block joins more of a first-class citizen. I have been thinking for a long time about changing how blocks are identified from using bitsets to using

Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-22 Thread via GitHub
jpountz commented on code in PR #12810: URL: https://github.com/apache/lucene/pull/12810#discussion_r1402438164 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/LegacyMultiLevelSkipListReader.java: ## @@ -0,0 +1,263 @@ +/* + * Licensed to the Apache Software

Re: [PR] hunspell: allow in-memory entry sorting for faster dictionary loading [lucene]

2023-11-22 Thread via GitHub
dweiss commented on code in PR #12834: URL: https://github.com/apache/lucene/pull/12834#discussion_r1402449022 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/SortingStrategy.java: ## @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (AS

Re: [PR] Dry up DirectReader implementations [lucene]

2023-11-22 Thread via GitHub
jpountz closed pull request #12823: Dry up DirectReader implementations URL: https://github.com/apache/lucene/pull/12823

Re: [PR] Dry up DirectReader implementations [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12823: URL: https://github.com/apache/lucene/pull/12823#issuecomment-1823222857 It got no additional feedback in the last couple days, so I'll default to closing if you don't mind. Thanks for contributing!

Re: [PR] hunspell: allow in-memory entry sorting for faster dictionary loading [lucene]

2023-11-22 Thread via GitHub
donnerpeter commented on code in PR #12834: URL: https://github.com/apache/lucene/pull/12834#discussion_r1402703007 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/SortingStrategy.java: ## @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundatio

Re: [PR] During concurrent slice searches in IndexSearcher stop other tasks if one throws an Exception [lucene]

2023-11-22 Thread via GitHub
stefanvodita commented on code in PR #12756: URL: https://github.com/apache/lucene/pull/12756#discussion_r1402701132 ## lucene/core/src/test/org/apache/lucene/search/TestIndexSearcher.java: ## @@ -293,4 +298,218 @@ public void testNullExecutorNonNullTaskExecutor() { IndexSe

Re: [PR] hunspell: allow in-memory entry sorting for faster dictionary loading [lucene]

2023-11-22 Thread via GitHub
dweiss commented on code in PR #12834: URL: https://github.com/apache/lucene/pull/12834#discussion_r1402711522 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/SortingStrategy.java: ## @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (AS

Re: [PR] hunspell: allow in-memory entry sorting for faster dictionary loading [lucene]

2023-11-22 Thread via GitHub
dweiss commented on code in PR #12834: URL: https://github.com/apache/lucene/pull/12834#discussion_r1402712071 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/SortingStrategy.java: ## @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (AS

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-11-22 Thread via GitHub
msokolov commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1823517625 I like @jpountz's idea to make the value of this field be the number of children. It is simple and makes sense, and is pretty close to having the degree of flexibility that the current

Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-22 Thread via GitHub
gf2121 commented on code in PR #12810: URL: https://github.com/apache/lucene/pull/12810#discussion_r1402985278 ## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListReader.java: ## @@ -63,7 +63,7 @@ public abstract class MultiLevelSkipListReader implements Closeabl

[PR] Improve DirectReader java doc [lucene]

2023-11-22 Thread via GitHub
gf2121 opened a new pull request, #12835: URL: https://github.com/apache/lucene/pull/12835 (no comment)