Adrien Grand created LUCENE-9027:
------------------------------------

             Summary: SIMD-based decoding of postings lists
                 Key: LUCENE-9027
                 URL: https://issues.apache.org/jira/browse/LUCENE-9027
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


[~rcmuir] has been mentioning the idea for quite some time that we might be 
able to write the decoding logic in such a way that Java would use SIMD 
instructions. More recently [~paul.masurel] wrote a [blog 
post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
Lucene could still do decode multiple ints at once in a single instruction by 
packing two ints in a long and we've had some discussions about what we could 
try in Lucene to speed up the decoding of postings. This made me want to look a 
bit deeper at what we could do.

Our current decoding logic reads data in a byte[] and decodes packed integers 
from it. Unfortunately it doesn't make use of SIMD instructions and looks like 
[this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].

I confirmed by looking at the generated assembly that if I take an array of 
integers and shift them all by the same number of bits then Java will use SIMD 
instructions to shift multiple integers at once. This led me to writing this 
[implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
 that tries as much as possible to shift long sequences of ints by the same 
number of bits to speed up decoding. It is indeed faster than the current logic 
we have, up to about 2x faster for some numbers of bits per value.

Currently the best 
[implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
 I've been able to come up with combines the above idea with the idea that Paul 
mentioned in his blog that consists of emulating SIMD from Java by packing 
multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a bit harder 
to read but gives another speedup on top of the above implementation.

I have a [JMH benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] 
available in case someone would like to play with this and maybe even come up 
with an even faster implementation. It is 2-2.5x faster than our current 
implementation for most numbers of bits per value. I'm copying results here:
{noformat}
 * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves as
   a baseline.
 * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
   current Lucene codec does.
 * `decodeNaiveFromLongs` decodes from longs on the fly.
 * `decodeSimpleSIMD` is a simple implementation that relies on how Java
   recognizes some patterns and uses SIMD instructions.
 * `decodeSIMD` is a more complex implementation that both relies on the C2
   compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
   2 ints in a long in order to decompress multiple values at once.

Benchmark                                       (bitsPerValue)  (byteOrder)   
Mode  Cnt   Score   Error   Units
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               1           LE  
thrpt    5  12.912 ± 0.393  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               1           BE  
thrpt    5  12.862 ± 0.395  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               2           LE  
thrpt    5  13.040 ± 1.162  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               2           BE  
thrpt    5  13.027 ± 0.270  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               3           LE  
thrpt    5  12.409 ± 0.637  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               3           BE  
thrpt    5  12.268 ± 0.947  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               4           LE  
thrpt    5  14.177 ± 2.263  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               4           BE  
thrpt    5  11.457 ± 0.150  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               5           LE  
thrpt    5  10.988 ± 1.179  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               5           BE  
thrpt    5  11.226 ± 0.088  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               6           LE  
thrpt    5   9.791 ± 0.305  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               6           BE  
thrpt    5   9.403 ± 3.598  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               7           LE  
thrpt    5  10.256 ± 0.211  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               7           BE  
thrpt    5  10.314 ± 0.382  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               8           LE  
thrpt    5  16.516 ± 0.380  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               8           BE  
thrpt    5  16.375 ± 0.427  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               9           LE  
thrpt    5   9.067 ± 0.066  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes               9           BE  
thrpt    5   9.078 ± 0.178  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              10           LE  
thrpt    5   8.913 ± 0.074  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              10           BE  
thrpt    5   8.893 ± 0.101  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              11           LE  
thrpt    5   7.908 ± 0.118  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              11           BE  
thrpt    5   7.864 ± 0.097  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              12           LE  
thrpt    5   9.220 ± 0.103  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              12           BE  
thrpt    5   9.186 ± 0.241  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              13           LE  
thrpt    5   7.119 ± 0.071  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              13           BE  
thrpt    5   7.066 ± 0.059  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              14           LE  
thrpt    5  12.483 ± 0.171  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              14           BE  
thrpt    5  12.473 ± 0.117  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              15           LE  
thrpt    5   6.202 ± 0.192  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              15           BE  
thrpt    5   6.187 ± 0.399  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              16           LE  
thrpt    5  12.798 ± 0.249  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromBytes              16           BE  
thrpt    5  12.987 ± 0.208  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               1           LE  
thrpt    5   7.248 ± 0.096  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               1           BE  
thrpt    5   7.292 ± 0.114  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               2           LE  
thrpt    5   8.923 ± 0.099  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               2           BE  
thrpt    5   8.899 ± 0.028  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               3           LE  
thrpt    5   9.192 ± 0.082  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               3           BE  
thrpt    5   9.090 ± 0.066  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               4           LE  
thrpt    5   7.947 ± 0.039  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               4           BE  
thrpt    5   7.809 ± 0.298  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               5           LE  
thrpt    5   8.342 ± 0.568  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               5           BE  
thrpt    5   8.259 ± 0.572  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               6           LE  
thrpt    5  15.594 ± 0.149  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               6           BE  
thrpt    5  14.012 ± 0.160  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               7           LE  
thrpt    5  12.686 ± 0.271  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               7           BE  
thrpt    5  12.806 ± 0.160  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               8           LE  
thrpt    5  13.571 ± 0.135  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               8           BE  
thrpt    5  13.312 ± 0.110  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               9           LE  
thrpt    5  11.812 ± 0.108  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs               9           BE  
thrpt    5  12.874 ± 0.168  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              10           LE  
thrpt    5  12.882 ± 0.114  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              10           BE  
thrpt    5  12.142 ± 0.091  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              11           LE  
thrpt    5  12.302 ± 0.111  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              11           BE  
thrpt    5  10.762 ± 0.250  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              12           LE  
thrpt    5  12.505 ± 0.070  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              12           BE  
thrpt    5  12.149 ± 0.083  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              13           LE  
thrpt    5  11.159 ± 0.341  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              13           BE  
thrpt    5  10.395 ± 0.222  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              14           LE  
thrpt    5  11.004 ± 0.058  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              14           BE  
thrpt    5  10.312 ± 0.369  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              15           LE  
thrpt    5  11.236 ± 0.117  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              15           BE  
thrpt    5   9.792 ± 0.202  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              16           LE  
thrpt    5  10.607 ± 0.105  ops/us
PackedIntsDecodeBenchmark.decodeNaiveFromLongs              16           BE  
thrpt    5  10.340 ± 0.070  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         1           LE  
thrpt    5  20.925 ± 0.368  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         1           BE  
thrpt    5  13.396 ± 0.485  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         2           LE  
thrpt    5  20.628 ± 0.494  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         2           BE  
thrpt    5  13.584 ± 0.194  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         3           LE  
thrpt    5  19.932 ± 1.609  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         3           BE  
thrpt    5  13.296 ± 0.095  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         4           LE  
thrpt    5  21.065 ± 0.767  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         4           BE  
thrpt    5  13.557 ± 0.051  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         5           LE  
thrpt    5  19.630 ± 0.067  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         5           BE  
thrpt    5  12.916 ± 0.186  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         6           LE  
thrpt    5  20.253 ± 0.701  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         6           BE  
thrpt    5  12.820 ± 0.048  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         7           LE  
thrpt    5  18.944 ± 0.160  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         7           BE  
thrpt    5  12.562 ± 0.128  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         8           LE  
thrpt    5  22.778 ± 2.023  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         8           BE  
thrpt    5  13.658 ± 0.158  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         9           LE  
thrpt    5  18.527 ± 0.169  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                         9           BE  
thrpt    5  12.045 ± 0.111  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        10           LE  
thrpt    5  16.610 ± 0.997  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        10           BE  
thrpt    5  11.208 ± 0.087  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        11           LE  
thrpt    5  17.961 ± 0.080  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        11           BE  
thrpt    5  11.594 ± 0.084  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        12           LE  
thrpt    5  16.980 ± 2.372  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        12           BE  
thrpt    5  11.135 ± 0.050  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        13           LE  
thrpt    5  17.592 ± 0.269  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        13           BE  
thrpt    5  11.132 ± 0.227  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        14           LE  
thrpt    5  16.964 ± 0.423  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        14           BE  
thrpt    5  10.953 ± 0.326  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        15           LE  
thrpt    5  17.972 ± 0.572  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        15           BE  
thrpt    5  10.872 ± 0.150  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        16           LE  
thrpt    5  24.152 ± 0.213  ops/us
PackedIntsDecodeBenchmark.decodeSIMD                        16           BE  
thrpt    5  12.984 ± 0.348  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   1           LE  
thrpt    5  14.567 ± 0.714  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   1           BE  
thrpt    5  10.541 ± 0.079  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   2           LE  
thrpt    5  15.395 ± 0.687  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   2           BE  
thrpt    5  11.142 ± 0.052  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   3           LE  
thrpt    5  15.802 ± 0.623  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   3           BE  
thrpt    5  10.656 ± 0.278  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   4           LE  
thrpt    5  17.732 ± 0.276  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   4           BE  
thrpt    5  11.289 ± 0.209  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   5           LE  
thrpt    5  16.230 ± 0.389  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   5           BE  
thrpt    5  10.216 ± 0.184  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   6           LE  
thrpt    5  16.478 ± 0.682  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   6           BE  
thrpt    5  10.379 ± 0.157  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   8           LE  
thrpt    5  18.222 ± 0.388  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                   8           BE  
thrpt    5  11.153 ± 0.619  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                  10           LE  
thrpt    5  15.138 ± 0.321  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                  10           BE  
thrpt    5   9.384 ± 0.671  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                  16           LE  
thrpt    5  20.776 ± 0.397  ops/us
PackedIntsDecodeBenchmark.decodeSimpleSIMD                  16           BE  
thrpt    5  10.199 ± 0.146  ops/us
PackedIntsDecodeBenchmark.readLongs                          1           LE  
thrpt    5  30.220 ± 0.652  ops/us
PackedIntsDecodeBenchmark.readLongs                          1           BE  
thrpt    5  16.324 ± 0.226  ops/us
PackedIntsDecodeBenchmark.readLongs                          2           LE  
thrpt    5  30.952 ± 0.329  ops/us
PackedIntsDecodeBenchmark.readLongs                          2           BE  
thrpt    5  16.492 ± 0.397  ops/us
PackedIntsDecodeBenchmark.readLongs                          3           LE  
thrpt    5  30.156 ± 0.979  ops/us
PackedIntsDecodeBenchmark.readLongs                          3           BE  
thrpt    5  16.273 ± 0.441  ops/us
PackedIntsDecodeBenchmark.readLongs                          4           LE  
thrpt    5  29.925 ± 0.718  ops/us
PackedIntsDecodeBenchmark.readLongs                          4           BE  
thrpt    5  15.930 ± 0.350  ops/us
PackedIntsDecodeBenchmark.readLongs                          5           LE  
thrpt    5  29.773 ± 0.979  ops/us
PackedIntsDecodeBenchmark.readLongs                          5           BE  
thrpt    5  15.775 ± 0.257  ops/us
PackedIntsDecodeBenchmark.readLongs                          6           LE  
thrpt    5  29.591 ± 1.285  ops/us
PackedIntsDecodeBenchmark.readLongs                          6           BE  
thrpt    5  15.732 ± 0.226  ops/us
PackedIntsDecodeBenchmark.readLongs                          7           LE  
thrpt    5  29.708 ± 0.909  ops/us
PackedIntsDecodeBenchmark.readLongs                          7           BE  
thrpt    5  15.433 ± 0.562  ops/us
PackedIntsDecodeBenchmark.readLongs                          8           LE  
thrpt    5  29.828 ± 0.689  ops/us
PackedIntsDecodeBenchmark.readLongs                          8           BE  
thrpt    5  15.390 ± 0.188  ops/us
PackedIntsDecodeBenchmark.readLongs                          9           LE  
thrpt    5  29.127 ± 0.309  ops/us
PackedIntsDecodeBenchmark.readLongs                          9           BE  
thrpt    5  15.180 ± 0.199  ops/us
PackedIntsDecodeBenchmark.readLongs                         10           LE  
thrpt    5  29.085 ± 0.538  ops/us
PackedIntsDecodeBenchmark.readLongs                         10           BE  
thrpt    5  14.887 ± 1.687  ops/us
PackedIntsDecodeBenchmark.readLongs                         11           LE  
thrpt    5  28.904 ± 0.329  ops/us
PackedIntsDecodeBenchmark.readLongs                         11           BE  
thrpt    5  14.936 ± 0.119  ops/us
PackedIntsDecodeBenchmark.readLongs                         12           LE  
thrpt    5  29.025 ± 0.299  ops/us
PackedIntsDecodeBenchmark.readLongs                         12           BE  
thrpt    5  14.685 ± 0.154  ops/us
PackedIntsDecodeBenchmark.readLongs                         13           LE  
thrpt    5  28.963 ± 0.244  ops/us
PackedIntsDecodeBenchmark.readLongs                         13           BE  
thrpt    5  14.569 ± 0.100  ops/us
PackedIntsDecodeBenchmark.readLongs                         14           LE  
thrpt    5  28.584 ± 1.409  ops/us
PackedIntsDecodeBenchmark.readLongs                         14           BE  
thrpt    5  14.340 ± 0.594  ops/us
PackedIntsDecodeBenchmark.readLongs                         15           LE  
thrpt    5  28.744 ± 0.314  ops/us
PackedIntsDecodeBenchmark.readLongs                         15           BE  
thrpt    5  14.222 ± 0.105  ops/us
PackedIntsDecodeBenchmark.readLongs                         16           LE  
thrpt    5  26.638 ± 0.452  ops/us
PackedIntsDecodeBenchmark.readLongs                         16           BE  
thrpt    5  13.906 ± 0.604  ops/us
{noformat}
The thing that is a bit frustrating is that the best throughputs are obtained 
on a ByteBuffer that is configured to use the little endian byte order (which 
is the native byte order of my machine) while Java/Lucene default to big 
endian. So if we want that kind of throughput we'll probably need to add ways 
to read data in the native byte order in the IndexInput API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to