msfroh commented on PR #13521:
URL: https://github.com/apache/lucene/pull/13521#issuecomment-2315907724

   Okay -- I was able to speed up the SIMD implementation a fair bit. Honestly, my main mistake was that I hadn't declared `LONG_SPECIES` as `static final`, which probably kept the JIT from treating the species as a constant and inlining the vector operations.
   
   I also removed the per-call array allocations by reusing scratch arrays held in fields, as well as the scalar operations within the vector loop.
   
   ```
           private static final VectorSpecies<Long> LONG_SPECIES = LongVector.SPECIES_MAX;
           private final long[] inputScratch = new long[512 / 3]; // We know that count is <= 512
           private final long[] outputScratch = new long[inputScratch.length * 3];

           @Override
           public void decode(IndexInput in, int start, int count, int[] docIDs) throws IOException {
               int i = 0;
   
               int bound = LONG_SPECIES.loopBound(count / 3) * 3;
               for (int j = 0; j < bound / 3; j++) {
                   inputScratch[j] = in.readLong();
               }
   
               int inc = LONG_SPECIES.length() * 3;
               for (; i < bound; i += inc) {
                   LongVector longVector = LongVector.fromArray(LONG_SPECIES, inputScratch, i / 3);
                   longVector.lanewise(VectorOperators.LSHR, 42)
                           .intoArray(outputScratch, i);
                   longVector.lanewise(VectorOperators.AND, 0x000003FFFFE00000L)
                           .lanewise(VectorOperators.LSHR, 21)
                           .intoArray(outputScratch, i + LONG_SPECIES.length());
                   longVector.lanewise(VectorOperators.AND, 0x001FFFFFL)
                           .intoArray(outputScratch, i + LONG_SPECIES.length() * 2);
               }
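               // Interleave the three output chunks of each block back into doc ID order:
               // lane k of a block yields docIDs[j + 3k], docIDs[j + 3k + 1], docIDs[j + 3k + 2].
               for (int j = 0; j < bound; j += LONG_SPECIES.length() * 3) {
                   for (int k = 0; k < LONG_SPECIES.length(); k++) {
                       docIDs[j + 3 * k] = (int) outputScratch[j + k];
                       docIDs[j + 3 * k + 1] = (int) outputScratch[j + k + LONG_SPECIES.length()];
                       docIDs[j + 3 * k + 2] = (int) outputScratch[j + k + LONG_SPECIES.length() * 2];
                   }
               }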
               for (; i < count - 2; i += 3) {
                   long packedLong = in.readLong();
                   docIDs[i] = (int) (packedLong >>> 42);
                   docIDs[i + 1] = (int) ((packedLong & 0x000003FFFFE00000L) >>> 21);
                   docIDs[i + 2] = (int) (packedLong & 0x001FFFFFL);
               }
               for (; i < count; i++) {
                   docIDs[i] = in.readInt();
               }
           }
   ```
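
For anyone following the bit manipulation, here is a minimal scalar sketch of the layout the decoder above appears to assume: three 21-bit doc IDs per long, with the first ID in the highest bits. `Bit21LayoutSketch`, `pack`, and `decodeBlock` are hypothetical names for illustration, not code from the PR. `decodeBlock` emulates one vector-width block with plain arrays, so it runs without the incubating Vector API, and it writes the results in the same order as the scalar tail loop.

```java
import java.util.Arrays;

/** Scalar illustration of the 21-bit layout; every name in this class is hypothetical. */
final class Bit21LayoutSketch {
    static final long MIDDLE_MASK = 0x000003FFFFE00000L; // bits 21..41
    static final long LOW_MASK = 0x001FFFFFL;            // bits 0..20

    /** Assumed encoder counterpart: packs three doc IDs, each below 2^21, into one long. */
    static long pack(int a, int b, int c) {
        return ((long) a << 42) | ((long) b << 21) | (long) c;
    }

    /**
     * Emulates one block of the vector loop with `lanes` longs.
     * The three "lanewise" results go into three contiguous chunks of the
     * scratch array, and are then interleaved back into doc ID order.
     */
    static void decodeBlock(long[] packed, int lanes, int[] docIDs) {
        long[] scratch = new long[lanes * 3];
        for (int k = 0; k < lanes; k++) {
            scratch[k] = packed[k] >>> 42;                          // high 21 bits
            scratch[k + lanes] = (packed[k] & MIDDLE_MASK) >>> 21;  // middle 21 bits
            scratch[k + lanes * 2] = packed[k] & LOW_MASK;          // low 21 bits
        }
        for (int k = 0; k < lanes; k++) {
            docIDs[3 * k] = (int) scratch[k];
            docIDs[3 * k + 1] = (int) scratch[k + lanes];
            docIDs[3 * k + 2] = (int) scratch[k + lanes * 2];
        }
    }

    public static void main(String[] args) {
        long[] packed = {pack(1, 2, 3), pack(4, 5, 6)};
        int[] docIDs = new int[6];
        decodeBlock(packed, 2, docIDs);
        System.out.println(Arrays.toString(docIDs)); // [1, 2, 3, 4, 5, 6]
    }
}
```

The separate interleaving pass is the scalar work that remains after the vector loop, since each `intoArray` call stores one field for all lanes contiguously.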
   
   It's still slower than the scalar implementations (roughly 22% behind `Bit21With3StepsEncoder`), but it's a lot closer:
   
   ```
   Benchmark                               (encoderName)  Mode  Cnt     Score    Error  Units
   DocIdEncodingBenchmark.decode    Bit21WithSimdEncoder  avgt    5  1032.151 ±  9.343  ms/op
   DocIdEncodingBenchmark.decode  Bit21With3StepsEncoder  avgt    5   845.505 ±  5.924  ms/op
   DocIdEncodingBenchmark.decode  Bit21With2StepsEncoder  avgt    5   851.975 ±  1.618  ms/op
   DocIdEncodingBenchmark.decode            Bit24Encoder  avgt    5   913.055 ± 79.916  ms/op
   ```


