Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2024-02-22 Thread via GitHub
wjp719 commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1959222579 > Have you got specific errors? Could you give a detailed message? Thanks! I have no errors; I didn't realize the new format was used. Thanks. -- This is an automated messag

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2024-02-22 Thread via GitHub
easyice commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1958968291 > @easyice Hi, I suspect that data encoded with group-varint differs from data encoded the old way, so is this change compatible with the old index format? This

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2024-02-21 Thread via GitHub
wjp719 commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1958873159 @easyice Hi, I suspect that data encoded with group-varint differs from data encoded the old way, so is this change compatible with the old index format?

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1822506995 I opened a PR to feed some of this data into the micro benchmark to make it more realistic: https://github.com/apache/lucene/pull/12833.

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-22 Thread via GitHub
easyice commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1822504456 It's very important as a reference! Thanks a lot!

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1822453957 For reference, I computed the most frequent `flag` values on wikibigall, which are the values that might be worth optimizing for: - 0x55 (4 2-byte ints): 29.6% - 0xaa (4 3-byte
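
To make the `flag` values above easier to read, here is a minimal sketch of how a flag byte maps to per-integer byte lengths. It assumes the common group-varint layout in which each 2-bit field stores (byte length - 1) and the first integer of the group occupies the two highest bits. `GroupVIntFlagSketch` and `lengths` are hypothetical names for illustration, not Lucene's API.

```java
/** Hypothetical helper, not part of Lucene: splits a group-varint flag byte into four lengths. */
final class GroupVIntFlagSketch {

  /**
   * Returns the byte length (1 to 4) of each of the four integers in a group.
   * Assumption: each 2-bit field holds (length - 1), first integer in the top two bits.
   */
  static int[] lengths(int flag) {
    int[] lengths = new int[4];
    for (int i = 0; i < 4; i++) {
      int shift = 6 - 2 * i; // 6, 4, 2, 0
      lengths[i] = ((flag >>> shift) & 0x03) + 1;
    }
    return lengths;
  }
}
```

Under this layout, `lengths(0x55)` yields four 2-byte integers and `lengths(0xaa)` yields four 3-byte integers, since a flag byte has room for exactly four 2-bit fields.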

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1822300926 Also, the [size](http://people.apache.org/~mikemccand/lucenebench/indexing.html#FixedIndexSize) increase is hardly noticeable.

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-22 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1822293603 There seems to be a speedup on [prefix queries](http://people.apache.org/~mikemccand/lucenebench/Prefix3.html) in nightly benchmarks; I'll add an annotation.

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-20 Thread via GitHub
jpountz merged PR #12782: URL: https://github.com/apache/lucene/pull/12782 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-20 Thread via GitHub
easyice commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1398894042 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntWriter.java: ## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-20 Thread via GitHub
easyice commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1398890215 ## lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestGroupVInt.java: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-20 Thread via GitHub
jpountz commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1398826440 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntWriter.java: ## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-20 Thread via GitHub
jpountz commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1398811234 ## lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestGroupVInt.java: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-19 Thread via GitHub
easyice commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1818157148 I ran a few rounds of wikimediumall (there is sometimes noise), and it looks like a slight speedup. `.doc` files were 0.4% larger overall (5.45GB to 5.47GB). Round 1 ```

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-18 Thread via GitHub
easyice commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1398242374 ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/GroupVIntBenchmark.java: ## @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-18 Thread via GitHub
easyice commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1817536690 Wow, what an incredible speedup! I would not have expected bulk decoding that reads directly to be so much faster than reading from an array. Thank you for your time, and I'm sorry I didn't try t

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-17 Thread via GitHub
jpountz commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1391047570 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntWriter.java: ## @@ -0,0 +1,97 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-17 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1817146436 Thanks @easyice. I took some time to look into the benchmark and improve a few things, hopefully you don't mind. Here is the output of the benchmark on my machine now: ``` Benc

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-14 Thread via GitHub
easyice commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1810036730 Thank you @jpountz, I pushed the benchmark code and added a new comparison between `ByteArrayDataInput` and `ByteBufferIndexInput`. For `readVInt`, the `ByteBufferIndexInput` is a bit

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-13 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1808314262 At least in theory, group varint could be made faster than vints even with single-byte integers, because a single check on `flag == 0` would tell us that all 4 integers have a single byt
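
The `flag == 0` idea above can be sketched as follows. This is an illustrative decoder under stated assumptions, not the code from the PR: it assumes a flag byte with four 2-bit (length - 1) fields, first integer in the top two bits, and little-endian value bytes. `GroupVIntFastPathSketch` and `readGroup` are hypothetical names.

```java
/** Hypothetical decoder sketch, not Lucene's implementation. */
final class GroupVIntFastPathSketch {

  /**
   * Decodes one group of four ints starting at src[pos] into dst[0..3]
   * and returns the position of the next unread byte.
   */
  static int readGroup(byte[] src, int pos, int[] dst) {
    int flag = src[pos++] & 0xFF;
    if (flag == 0) {
      // Fast path: one comparison proves that all four ints are single bytes,
      // so no per-value length computation or continuation-bit check is needed.
      dst[0] = src[pos] & 0xFF;
      dst[1] = src[pos + 1] & 0xFF;
      dst[2] = src[pos + 2] & 0xFF;
      dst[3] = src[pos + 3] & 0xFF;
      return pos + 4;
    }
    // General path: read each length from the flag, then that many bytes.
    for (int i = 0; i < 4; i++) {
      int numBytes = ((flag >>> (6 - 2 * i)) & 0x03) + 1;
      int value = 0;
      for (int b = 0; b < numBytes; b++) {
        value |= (src[pos++] & 0xFF) << (8 * b); // little-endian (assumption)
      }
      dst[i] = value;
    }
    return pos;
  }
}
```

By contrast, a vint decoder has to test the continuation bit of every byte, so four single-byte values cost four data-dependent checks instead of one.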

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-13 Thread via GitHub
jpountz commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1808043672 Could you check in your benchmark under `lucene/benchmark-jmh` so that we could play with it?

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-11 Thread via GitHub
easyice commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1806834098 @jpountz You are right, recomputing the length is faster than a table lookup. Here is the benchmark when reading the ints, where each value takes 4 bytes: ``` GroupVInt.readGroup

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-09 Thread via GitHub
easyice commented on PR #12782: URL: https://github.com/apache/lucene/pull/12782#issuecomment-1805059427 @jpountz @rmuir Thanks for your suggestions, they're very helpful to me! I will run the benchmark for recomputing the length vs. table lookup.

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-09 Thread via GitHub
easyice commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1388845597 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVintReader.java: ## @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-09 Thread via GitHub
easyice commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1388843109 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVintWriter.java: ## @@ -0,0 +1,97 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-09 Thread via GitHub
rmuir commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1388595300 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVintReader.java: ## @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one o

Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-09 Thread via GitHub
jpountz commented on code in PR #12782: URL: https://github.com/apache/lucene/pull/12782#discussion_r1388273135 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVintWriter.java: ## @@ -0,0 +1,97 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

[PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-08 Thread via GitHub
easyice opened a new pull request, #12782: URL: https://github.com/apache/lucene/pull/12782 As discussed in issue https://github.com/apache/lucene/issues/12717, the read performance of group-varint is 14-30% faster than vint; the `Mode` values 16-248 are the number of ints that will be read.
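
For readers unfamiliar with the format, the write side of group-varint can be sketched like this. It is a minimal illustration under assumptions, not the `GroupVIntWriter` from the PR: four ints per group, one flag byte holding four 2-bit (length - 1) fields with the first integer in the top two bits, and little-endian value bytes. `GroupVIntEncodeSketch`, `numBytes` and `encodeGroup` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical encoder sketch, not Lucene's implementation. */
final class GroupVIntEncodeSketch {

  /** Number of bytes (1 to 4) needed to store v; `v | 1` avoids special-casing zero. */
  static int numBytes(int v) {
    return 4 - (Integer.numberOfLeadingZeros(v | 1) >>> 3);
  }

  /** Encodes exactly four ints as one flag byte followed by the value bytes. */
  static byte[] encodeGroup(int[] values) {
    if (values.length != 4) {
      throw new IllegalArgumentException("a group holds exactly 4 ints");
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      flag |= (numBytes(values[i]) - 1) << (6 - 2 * i);
    }
    out.write(flag);
    for (int v : values) {
      int n = numBytes(v);
      for (int b = 0; b < n; b++) {
        out.write(v >>> (8 * b)); // write(int) keeps only the low 8 bits
      }
    }
    return out.toByteArray();
  }
}
```

Because all four lengths live in a single leading byte, a reader can determine the layout of the whole group up front rather than inspecting a continuation bit on every byte, as vint decoding requires.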