RamakrishnaChilaka opened a new pull request, #15198:
URL: https://github.com/apache/lucene/pull/15198

   This PR optimizes the expand8 routine by leveraging the JDK Vector API. 
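
For readers unfamiliar with the routine, here is a minimal scalar sketch of what an expand8-style step does: each 32-bit int carries four 8-bit values, and the routine spreads them out so that every output slot holds a single value. The class name `Expand8Sketch`, the `quarter` parameter, and the exact output layout are assumptions made for illustration; this is not the code from this PR or from Lucene. A vectorized variant would apply the same shift-and-mask to several ints per instruction.

```java
import java.util.Arrays;

public final class Expand8Sketch {
  // Hypothetical helper (not Lucene's API): unpack four 8-bit values from each
  // of the first `quarter` ints, in place. arr.length must be >= 4 * quarter.
  public static void expand8(int[] arr, int quarter) {
    for (int i = 0; i < quarter; ++i) {
      int packed = arr[i]; // read before overwriting slot i
      arr[i] = (packed >>> 24) & 0xFF;               // highest byte
      arr[quarter + i] = (packed >>> 16) & 0xFF;     // second byte
      arr[2 * quarter + i] = (packed >>> 8) & 0xFF;  // third byte
      arr[3 * quarter + i] = packed & 0xFF;          // lowest byte
    }
  }

  public static void main(String[] args) {
    int[] block = new int[8];
    block[0] = 0x01020304;
    block[1] = 0x0A0B0C0D;
    expand8(block, 2);
    System.out.println(Arrays.toString(block));
  }
}
```

The in-place update is safe because iteration `i` only writes slot `i` and slots at index `quarter` or above, so packed inputs that have not been read yet are never overwritten.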
   
   ### Benchmarks
   I validated performance using a standalone benchmark (see [postings_expand_benchmark](https://github.com/RamakrishnaChilaka/-postings_expand_benchmark)) with a block size of 256. The key takeaways are as follows.
   
   | Benchmark                     | Mode | Cnt |   Score   |  Error  | Units |
   |-------------------------------|------|-----|-----------|---------|-------|
   | expand16 (Scalar)             | thrpt|  5  | 112.842   | ± 0.221 | ops/us |
   | expand16 (Vector)             | thrpt|  5  | 105.594   | ± 1.307 | ops/us |
   | expand8 (Scalar)              | thrpt|  5  |  66.726   | ± 0.452 | ops/us |
   | expand8 (Vector)              | thrpt|  5  | 105.821   | ± 0.272 | ops/us |
   
   * **expand8**: The vectorized version is ~59% faster than scalar (66.7 → 105.8 ops/us).
   * **expand16**: Scalar slightly outperforms vector, by ~7% (112.8 vs 105.6 ops/us).
   
   ### Lucene Microbenchmarks
   
   ```
   baseline
   Benchmark                                (bpv)   Mode  Cnt   Score   Error   Units
   PostingIndexInputBenchmark.decode            2  thrpt   15  35.409 ± 0.120  ops/us
   PostingIndexInputBenchmark.decode            3  thrpt   15  29.128 ± 0.017  ops/us
   PostingIndexInputBenchmark.decode            4  thrpt   15  41.492 ± 0.305  ops/us
   PostingIndexInputBenchmark.decode            5  thrpt   15  32.205 ± 0.350  ops/us
   PostingIndexInputBenchmark.decode            6  thrpt   15  31.237 ± 0.245  ops/us
   PostingIndexInputBenchmark.decode            7  thrpt   15  29.984 ± 0.582  ops/us
   PostingIndexInputBenchmark.decode            8  thrpt   15  56.366 ± 0.134  ops/us
   PostingIndexInputBenchmark.decode            9  thrpt   15  22.802 ± 0.077  ops/us
   PostingIndexInputBenchmark.decode           10  thrpt   15  23.502 ± 0.037  ops/us
   PostingIndexInputBenchmark.decodeVector      2  thrpt   15  53.151 ± 0.070  ops/us
   PostingIndexInputBenchmark.decodeVector      3  thrpt   15  48.863 ± 1.455  ops/us
   PostingIndexInputBenchmark.decodeVector      4  thrpt   15  54.284 ± 2.195  ops/us
   PostingIndexInputBenchmark.decodeVector      5  thrpt   15  39.302 ± 0.659  ops/us
   PostingIndexInputBenchmark.decodeVector      6  thrpt   15  38.414 ± 0.830  ops/us
   PostingIndexInputBenchmark.decodeVector      7  thrpt   15  39.609 ± 0.551  ops/us
   PostingIndexInputBenchmark.decodeVector      8  thrpt   15  56.373 ± 0.118  ops/us
   PostingIndexInputBenchmark.decodeVector      9  thrpt   15  27.295 ± 0.351  ops/us
   PostingIndexInputBenchmark.decodeVector     10  thrpt   15  30.058 ± 0.172  ops/us

   contender
   Benchmark                                (bpv)   Mode  Cnt   Score   Error   Units
   PostingIndexInputBenchmark.decode            2  thrpt   15  35.238 ± 0.209  ops/us
   PostingIndexInputBenchmark.decode            3  thrpt   15  29.214 ± 0.098  ops/us
   PostingIndexInputBenchmark.decode            4  thrpt   15  41.559 ± 0.580  ops/us
   PostingIndexInputBenchmark.decode            5  thrpt   15  32.543 ± 0.175  ops/us
   PostingIndexInputBenchmark.decode            6  thrpt   15  31.323 ± 0.061  ops/us
   PostingIndexInputBenchmark.decode            7  thrpt   15  29.525 ± 0.315  ops/us
   PostingIndexInputBenchmark.decode            8  thrpt   15  52.348 ± 0.079  ops/us
   PostingIndexInputBenchmark.decode            9  thrpt   15  24.919 ± 0.056  ops/us
   PostingIndexInputBenchmark.decode           10  thrpt   15  26.581 ± 0.049  ops/us
   PostingIndexInputBenchmark.decodeVector      2  thrpt   15  71.223 ± 6.921  ops/us
   PostingIndexInputBenchmark.decodeVector      3  thrpt   15  53.237 ± 1.962  ops/us
   PostingIndexInputBenchmark.decodeVector      4  thrpt   15  73.437 ± 0.284  ops/us
   PostingIndexInputBenchmark.decodeVector      5  thrpt   15  41.201 ± 2.067  ops/us
   PostingIndexInputBenchmark.decodeVector      6  thrpt   15  46.622 ± 0.289  ops/us
   PostingIndexInputBenchmark.decodeVector      7  thrpt   15  45.505 ± 1.044  ops/us
   PostingIndexInputBenchmark.decodeVector      8  thrpt   15  58.368 ± 0.977  ops/us
   PostingIndexInputBenchmark.decodeVector      9  thrpt   15  27.243 ± 0.358  ops/us
   PostingIndexInputBenchmark.decodeVector     10  thrpt   15  30.059 ± 0.105  ops/us
   ```
   
   ### Summary
   bpv 9 and 10 use a primitive size of 16 rather than 8, so this change does not affect their performance.
   
   |   bpv | baseline vector (ops/μs) | contender vector (ops/μs) |          Δ |
   | ----: | -----------------------: | ------------------------: | ---------: |
   |     2 |                     53.2 |                      71.2 |    +33.8 % |
   |     3 |                     48.9 |                      53.2 |     +8.8 % |
   |     4 |                     54.3 |                      73.4 |    +35.2 % |
   |     5 |                     39.3 |                      41.2 |     +4.8 % |
   |     6 |                     38.4 |                      46.6 |    +21.4 % |
   |     7 |                     39.6 |                      45.5 |    +14.9 % |
   |     8 |                     56.4 |                      58.4 |     +3.5 % |
   |     9 |                     27.3 |                      27.2 |     –0.4 % |
   |    10 |                     30.1 |                      30.1 |      0.0 % |
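
The Δ column can be reproduced with a small helper. The sketch below assumes Δ is the relative change of the contender over the baseline, computed from the rounded scores shown in the table; the class and method names are hypothetical and are not part of the PR.

```java
import java.util.Locale;

public final class DeltaSketch {
  // Relative change of contender versus baseline, in percent.
  public static double percentChange(double baseline, double contender) {
    return (contender - baseline) / baseline * 100.0;
  }

  public static void main(String[] args) {
    // bpv 4 row from the summary table: 54.3 -> 73.4 ops/us
    double delta = percentChange(54.3, 73.4);
    System.out.println(String.format(Locale.ROOT, "bpv 4: %+.1f %%", delta));
  }
}
```

Computing Δ from the unrounded JMH scores instead gives slightly different figures, for example about +34.0 % rather than +33.8 % for bpv 2.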
   
   

