[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977532#comment-16977532 ]
ASF subversion and git services commented on LUCENE-9027: --------------------------------------------------------- Commit cb1a72ad1642400888683e1f735a0316334b2484 in lucene-solr's branch refs/heads/branch_8x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=cb1a72a ] LUCENE-9027: Use SIMD instructions to decode postings. (#973) > SIMD-based decoding of postings lists > ------------------------------------- > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Adrien Grand > Priority: Minor > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as > a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the > current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java > recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 > compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or > 2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt 5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt 5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt 5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt 5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt 5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt 5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt 5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt 5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt 5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 BE > thrpt 5 11.226 ± 0.088 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 6 LE > thrpt 5 9.791 ± 0.305 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 6 BE > thrpt 5 9.403 ± 3.598 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 7 LE > thrpt 5 10.256 ± 0.211 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 7 BE > thrpt 5 10.314 ± 0.382 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 8 LE > thrpt 5 16.516 ± 0.380 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 8 BE > thrpt 5 16.375 ± 0.427 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 9 LE > thrpt 5 9.067 ± 0.066 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 9 BE > thrpt 5 9.078 ± 0.178 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 10 LE > thrpt 5 8.913 ± 0.074 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 10 BE > thrpt 5 8.893 ± 0.101 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 11 LE > thrpt 5 7.908 ± 0.118 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 11 BE > thrpt 5 7.864 ± 0.097 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 12 LE > thrpt 5 9.220 ± 0.103 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 12 BE > thrpt 5 9.186 ± 0.241 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 13 LE > thrpt 5 7.119 ± 0.071 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 13 BE > thrpt 5 7.066 ± 0.059 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 14 LE > thrpt 5 12.483 ± 0.171 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 14 BE > thrpt 5 12.473 ± 0.117 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 15 LE > thrpt 5 6.202 ± 0.192 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 15 BE > thrpt 5 6.187 ± 0.399 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 16 LE > thrpt 5 12.798 ± 0.249 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 16 BE > thrpt 5 12.987 ± 0.208 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 1 LE > thrpt 5 7.248 ± 0.096 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 1 BE > thrpt 5 7.292 ± 0.114 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 2 LE > thrpt 5 8.923 ± 0.099 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 2 BE > thrpt 5 8.899 ± 0.028 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 3 LE > thrpt 5 9.192 ± 0.082 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 3 BE > thrpt 5 9.090 ± 0.066 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 4 LE > thrpt 5 7.947 ± 0.039 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 4 BE > thrpt 5 7.809 ± 0.298 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 5 LE > thrpt 5 8.342 ± 0.568 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 5 BE > thrpt 5 8.259 ± 0.572 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 6 LE > thrpt 5 15.594 ± 0.149 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 6 BE > thrpt 5 14.012 ± 0.160 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 7 LE > thrpt 5 12.686 ± 0.271 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 7 BE > thrpt 5 12.806 ± 0.160 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 8 LE > thrpt 5 13.571 ± 0.135 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 8 BE > thrpt 5 13.312 ± 0.110 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 9 LE > thrpt 5 11.812 ± 0.108 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 9 BE > thrpt 5 12.874 ± 0.168 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 10 LE > thrpt 5 12.882 ± 0.114 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 10 BE > thrpt 5 12.142 ± 0.091 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 11 LE > thrpt 5 12.302 ± 0.111 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 11 BE > thrpt 5 10.762 ± 0.250 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 12 LE > thrpt 5 12.505 ± 0.070 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 12 BE > thrpt 5 12.149 ± 0.083 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 13 LE > thrpt 5 11.159 ± 0.341 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 13 BE > thrpt 5 10.395 ± 0.222 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 14 LE > thrpt 5 11.004 ± 0.058 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 14 BE > thrpt 5 10.312 ± 0.369 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 15 LE > thrpt 5 11.236 ± 0.117 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 15 BE > thrpt 5 9.792 ± 0.202 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 16 LE > thrpt 5 10.607 ± 0.105 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromLongs 16 BE > thrpt 5 10.340 ± 0.070 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 1 LE > thrpt 5 20.925 ± 0.368 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 1 BE > thrpt 5 13.396 ± 0.485 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 2 LE > thrpt 5 20.628 ± 0.494 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 2 BE > thrpt 5 13.584 ± 0.194 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 3 LE > thrpt 5 19.932 ± 1.609 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 3 BE > thrpt 5 13.296 ± 0.095 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 4 LE > thrpt 5 21.065 ± 0.767 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 4 BE > thrpt 5 13.557 ± 0.051 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 5 LE > thrpt 5 19.630 ± 0.067 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 5 BE > thrpt 5 12.916 ± 0.186 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 6 LE > thrpt 5 20.253 ± 0.701 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 6 BE > thrpt 5 12.820 ± 0.048 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 7 LE > thrpt 5 18.944 ± 0.160 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 7 BE > thrpt 5 12.562 ± 0.128 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 8 LE > thrpt 5 22.778 ± 2.023 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 8 BE > thrpt 5 13.658 ± 0.158 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 9 LE > thrpt 5 18.527 ± 0.169 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 9 BE > thrpt 5 12.045 ± 0.111 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 10 LE > thrpt 5 16.610 ± 0.997 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 10 BE > thrpt 5 11.208 ± 0.087 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 11 LE > thrpt 5 17.961 ± 0.080 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 11 BE > thrpt 5 11.594 ± 0.084 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 12 LE > thrpt 5 16.980 ± 2.372 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 12 BE > thrpt 5 11.135 ± 0.050 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 13 LE > thrpt 5 17.592 ± 0.269 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 13 BE > thrpt 5 11.132 ± 0.227 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 14 LE > thrpt 5 16.964 ± 0.423 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 14 BE > thrpt 5 10.953 ± 0.326 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 15 LE > thrpt 5 17.972 ± 0.572 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 15 BE > thrpt 5 10.872 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 16 LE > thrpt 5 24.152 ± 0.213 ops/us > PackedIntsDecodeBenchmark.decodeSIMD 16 BE > thrpt 5 12.984 ± 0.348 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 1 LE > thrpt 5 14.567 ± 0.714 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 1 BE > thrpt 5 10.541 ± 0.079 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 2 LE > thrpt 5 15.395 ± 0.687 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 2 BE > thrpt 5 11.142 ± 0.052 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 3 LE > thrpt 5 15.802 ± 0.623 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 3 BE > thrpt 5 10.656 ± 0.278 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 4 LE > thrpt 5 17.732 ± 0.276 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 4 BE > thrpt 5 11.289 ± 0.209 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 5 LE > thrpt 5 16.230 ± 0.389 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 5 BE > thrpt 5 10.216 ± 0.184 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 6 LE > thrpt 5 16.478 ± 0.682 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 6 BE > thrpt 5 10.379 ± 0.157 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 8 LE > thrpt 5 18.222 ± 0.388 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 8 BE > thrpt 5 11.153 ± 0.619 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 10 LE > thrpt 5 15.138 ± 0.321 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 10 BE > thrpt 5 9.384 ± 0.671 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 16 LE > thrpt 5 20.776 ± 0.397 ops/us > PackedIntsDecodeBenchmark.decodeSimpleSIMD 16 BE > thrpt 5 10.199 ± 0.146 ops/us > PackedIntsDecodeBenchmark.readLongs 1 LE > thrpt 5 30.220 ± 0.652 ops/us > PackedIntsDecodeBenchmark.readLongs 1 BE > thrpt 5 16.324 ± 0.226 ops/us > PackedIntsDecodeBenchmark.readLongs 2 LE > thrpt 5 30.952 ± 0.329 ops/us > PackedIntsDecodeBenchmark.readLongs 2 BE > thrpt 5 16.492 ± 0.397 ops/us > PackedIntsDecodeBenchmark.readLongs 3 LE > thrpt 5 30.156 ± 0.979 ops/us > PackedIntsDecodeBenchmark.readLongs 3 BE > thrpt 5 16.273 ± 0.441 ops/us > PackedIntsDecodeBenchmark.readLongs 4 LE > thrpt 5 29.925 ± 0.718 ops/us > PackedIntsDecodeBenchmark.readLongs 4 BE > thrpt 5 15.930 ± 0.350 ops/us > PackedIntsDecodeBenchmark.readLongs 5 LE > thrpt 5 29.773 ± 0.979 ops/us > PackedIntsDecodeBenchmark.readLongs 5 BE > thrpt 5 15.775 ± 0.257 ops/us > PackedIntsDecodeBenchmark.readLongs 6 LE > thrpt 5 29.591 ± 1.285 ops/us > PackedIntsDecodeBenchmark.readLongs 6 BE > thrpt 5 15.732 ± 0.226 ops/us > PackedIntsDecodeBenchmark.readLongs 7 LE > thrpt 5 29.708 ± 0.909 ops/us > PackedIntsDecodeBenchmark.readLongs 7 BE > thrpt 5 15.433 ± 0.562 ops/us > PackedIntsDecodeBenchmark.readLongs 8 LE > thrpt 5 29.828 ± 0.689 ops/us > PackedIntsDecodeBenchmark.readLongs 8 BE > thrpt 5 15.390 ± 0.188 ops/us > PackedIntsDecodeBenchmark.readLongs 9 LE > thrpt 5 29.127 ± 0.309 ops/us > PackedIntsDecodeBenchmark.readLongs 9 BE > thrpt 5 15.180 ± 0.199 ops/us > PackedIntsDecodeBenchmark.readLongs 10 LE > thrpt 5 29.085 ± 0.538 ops/us > PackedIntsDecodeBenchmark.readLongs 10 BE > thrpt 5 14.887 ± 1.687 ops/us > PackedIntsDecodeBenchmark.readLongs 11 LE > thrpt 5 28.904 ± 0.329 ops/us > PackedIntsDecodeBenchmark.readLongs 11 BE > thrpt 5 14.936 ± 0.119 ops/us > PackedIntsDecodeBenchmark.readLongs 12 LE > thrpt 5 29.025 ± 0.299 ops/us > PackedIntsDecodeBenchmark.readLongs 12 BE > thrpt 5 14.685 ± 0.154 ops/us > PackedIntsDecodeBenchmark.readLongs 13 LE > thrpt 5 28.963 ± 0.244 ops/us > PackedIntsDecodeBenchmark.readLongs 13 BE > thrpt 5 14.569 ± 0.100 ops/us > PackedIntsDecodeBenchmark.readLongs 14 LE > thrpt 5 28.584 ± 1.409 ops/us > PackedIntsDecodeBenchmark.readLongs 14 BE > thrpt 5 14.340 ± 0.594 ops/us > PackedIntsDecodeBenchmark.readLongs 15 LE > thrpt 5 28.744 ± 0.314 ops/us > PackedIntsDecodeBenchmark.readLongs 15 BE > thrpt 5 14.222 ± 0.105 ops/us > PackedIntsDecodeBenchmark.readLongs 16 LE > thrpt 5 26.638 ± 0.452 ops/us > PackedIntsDecodeBenchmark.readLongs 16 BE > thrpt 5 13.906 ± 0.604 ops/us > {noformat} > The thing that is a bit frustrating is that the best throughputs are obtained > on a ByteBuffer that is configured to use the little endian byte order (which > is the native byte order of my machine) while Java/Lucene default to big > endian. So if we want that kind of throughput we'll probably need to add ways > to read data in the native byte order in the IndexInput API. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org