costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-972988872


   I've tested `Packed64LongByte` multiple times against `TestPackedInts` and 
it seems to pass. Note that I've focused mainly on the `get` and `set` methods 
and removed the specialized fill and bulk get/set methods due to slightly 
different encoding.
   
   tl;dr - `Packed64` has best performance, followed by 
`Packed64VarHandleLongLong` and lastly by `Packed64LongByte` with a significant 
gap. My guess is the long[] array is more cache friendly while the byte[] isn't 
- by sliding across the byte[] to read longs, the caches are thrashed which is 
much more expensive than doing bitwise operations.
   The data in case of byte[] support this - performance is great for the 1 and 
64 case (where the sliding is minimal since new positions belong to the same 
byte block), while for the rest it falls to the ground.
   
   I've included the benchmark in the commit; the tests are running on latest 
JDK 17.0.1 on an Ryzen 5900x.
   
   ```
   openjdk version "17.0.1" 2021-10-19
   OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
   OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (build 17.0.1+12, mixed mode, 
sharing)
   ```
   
   ```
   # JMH version: 1.33
   # VM version: JDK 17.0.1, OpenJDK 64-Bit Server VM, 17.0.1+12
   
   Benchmark                                       (bpv)  (size)   Mode  Cnt    
  Score      Error  Units
   Packed64Benchmark.packed64                          1   10240  thrpt    3  
80129.482 ± 2603.161  ops/s
   Packed64Benchmark.packed64                          4   10240  thrpt    3  
87680.030 ± 2725.245  ops/s
   Packed64Benchmark.packed64                          5   10240  thrpt    3  
56768.849 ±  633.855  ops/s
   Packed64Benchmark.packed64                          8   10240  thrpt    3  
91869.977 ± 1579.788  ops/s
   Packed64Benchmark.packed64                         11   10240  thrpt    3  
71802.824 ± 2124.577  ops/s
   Packed64Benchmark.packed64                         16   10240  thrpt    3  
94892.804 ± 1876.298  ops/s
   Packed64Benchmark.packed64                         23   10240  thrpt    3  
65401.053 ±  690.302  ops/s
   Packed64Benchmark.packed64                         25   10240  thrpt    3  
69623.043 ±  679.354  ops/s
   Packed64Benchmark.packed64                         31   10240  thrpt    3  
66732.829 ± 1788.781  ops/s
   Packed64Benchmark.packed64                         32   10240  thrpt    3  
99938.841 ±  197.846  ops/s
   Packed64Benchmark.packed64                         47   10240  thrpt    3  
52174.874 ± 2369.976  ops/s
   Packed64Benchmark.packed64                         59   10240  thrpt    3  
55240.166 ±  165.920  ops/s
   Packed64Benchmark.packed64                         61   10240  thrpt    3  
54909.784 ± 5064.028  ops/s
   Packed64Benchmark.packed64                         64   10240  thrpt    3  
99261.182 ± 2746.589  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      1   10240  thrpt    3  
56909.631 ± 1986.897  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      4   10240  thrpt    3  
31584.303 ± 1063.071  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      5   10240  thrpt    3  
23703.179 ± 2127.908  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      8   10240  thrpt    3  
17958.177 ±  825.106  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     11   10240  thrpt    3  
17810.271 ±  379.489  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     16   10240  thrpt    3  
19043.959 ± 2185.749  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     23   10240  thrpt    3  
17784.221 ±  824.539  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     25   10240  thrpt    3  
17728.510 ± 1063.605  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     31   10240  thrpt    3  
17742.034 ±  198.159  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     32   10240  thrpt    3  
21951.514 ± 7648.112  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     47   10240  thrpt    3  
17741.987 ±  866.598  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     59   10240  thrpt    3  
19481.548 ±  446.086  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     61   10240  thrpt    3  
18999.020 ±  949.164  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     64   10240  thrpt    3  
74178.395 ± 1219.876  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         1   10240  thrpt    3  
73824.221 ± 3210.970  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         4   10240  thrpt    3  
82908.298 ± 8099.224  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         5   10240  thrpt    3  
54176.175 ± 2005.872  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         8   10240  thrpt    3  
83556.558 ± 1658.929  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        11   10240  thrpt    3  
56512.522 ±  380.595  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        16   10240  thrpt    3  
83820.820 ± 2000.138  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        23   10240  thrpt    3  
62570.823 ±  604.178  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        25   10240  thrpt    3  
63497.723 ± 2172.188  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        31   10240  thrpt    3  
62699.661 ± 3982.879  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        32   10240  thrpt    3  
84795.726 ± 7409.939  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        47   10240  thrpt    3  
55820.366 ± 1247.217  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        59   10240  thrpt    3  
49562.673 ± 1131.955  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        61   10240  thrpt    3  
53540.899 ± 1268.365  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        64   10240  thrpt    3  
85549.020 ± 3293.070  ops/s
   ```
   
   I'm sure the `Packed64LongByte` (p64lb) implementation could be improved 
however the gap in performance is significant.
   p64lb is somewhat reasonable only when dealing with pbv of 64 and 1 - in all 
other cases performance simply drops.
   Packed64LongLong (p64ll) tracks the performance of `Packed64` (p64) pretty 
well (which is expected).
   
   To understand where the gap cap comes from, I've done more tests for the 
best bpv of 64 and worse case 23 and it boils down to this line
   
https://github.com/apache/lucene/blob/6bd5c14bf3c6113244773d2e1eb39233dc33ad71/lucene/core/src/java/org/apache/lucene/util/packed/Packed64.java#L73
   
   The bigger the shift, the better the performance - not because of the shift 
itself but rather because the higher the chances data falls in the same data 
block which for consecutive operations is going to benefit from caching.
   To test this out, I've changed the code in `Packed64LongAndByte` from :
   ```
   final int elementPos = (int) (majorBitPos >>> 3);     // /8
   ```
   to 
   
   ```
   final int elementPos = (int) (majorBitPos >>> 6);     // /64
   ```
   
   which almost doubled the performance of pb64lb (which is still half of 
packed64):
   
   ```
   Benchmark                                       (bpv)  (size)   Mode  Cnt    
  Score      Error  Units
   Packed64Benchmark.packed64                         23   10240  thrpt    3  
65188.193 ± 3034.465  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     23   10240  thrpt    3  
30850.017 ±  422.149  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        23   10240  thrpt    3  
56759.947 ± 1055.780  ops/s
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to