msfroh commented on PR #13521:
URL: https://github.com/apache/lucene/pull/13521#issuecomment-2313906920

   I tried modifying the loop to process 4 longs per iteration and noticed no 
difference on my Xeon host, which is unsurprising since there was no difference 
between 1 and 3.
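
For illustration only, a loop that handles 4 longs per iteration could look roughly like the sketch below. It is not the code from the PR: it assumes the packed values are already in a `long[]` rather than read from `IndexInput`, and the class and method names are hypothetical.

```java
/** Hypothetical sketch of a 4-longs-per-iteration decode; not the PR's code. */
final class UnrolledBit21Decode {

    /** Decodes packed.length * 3 doc IDs from a long[] (assumed stand-in for IndexInput). */
    static void decode(long[] packed, int[] docIDs) {
        int in = 0;
        int out = 0;
        // Main loop: four packed longs (twelve doc IDs) per iteration.
        for (; in + 4 <= packed.length; in += 4, out += 12) {
            unpack(packed[in], docIDs, out);
            unpack(packed[in + 1], docIDs, out + 3);
            unpack(packed[in + 2], docIDs, out + 6);
            unpack(packed[in + 3], docIDs, out + 9);
        }
        // Tail loop: whatever longs are left, one at a time.
        for (; in < packed.length; in++, out += 3) {
            unpack(packed[in], docIDs, out);
        }
    }

    /** Splits one long into three 21-bit values, highest bits first. */
    private static void unpack(long packedLong, int[] docIDs, int out) {
        docIDs[out] = (int) (packedLong >>> 42);
        docIDs[out + 1] = (int) ((packedLong & 0x000003FFFFE00000L) >>> 21);
        docIDs[out + 2] = (int) (packedLong & 0x001FFFFFL);
    }
}
```

Whether the JIT already unrolls the simpler loop on its own is not something this sketch shows; it only makes the shape of the change concrete.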
   
   I also tried the following SIMD implementation of `decode`:
   
   ```
           @Override
           public void decode(IndexInput in, int start, int count, int[] docIDs) throws IOException {
               int i = 0;
   
               long[] inputScratch = new long[LONG_SPECIES.length()];
               long[] outputScratch = new long[LONG_SPECIES.length() * 3];
               int bound = LONG_SPECIES.loopBound(count / 3) * 3;
   
               for (; i < bound; i += outputScratch.length) {
                   for (int j = 0; j < LONG_SPECIES.length(); j++) {
                       inputScratch[j] = in.readLong();
                   }
                   LongVector longVector = LongVector.fromArray(LONG_SPECIES, inputScratch, 0);
                   longVector.lanewise(VectorOperators.LSHR, 42)
                           .intoArray(outputScratch, 0);
                   longVector.lanewise(VectorOperators.AND, 0x000003FFFFE00000L)
                           .lanewise(VectorOperators.LSHR, 21)
                           .intoArray(outputScratch, LONG_SPECIES.length());
                   longVector.lanewise(VectorOperators.AND, 0x001FFFFFL)
                           .intoArray(outputScratch, LONG_SPECIES.length() * 2);
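                   for (int j = 0; j < LONG_SPECIES.length(); j++) {
                       // Each packed long yields three consecutive doc IDs, so the
                       // output index advances by 3 per lane to avoid overlapping writes.
                       docIDs[i + 3 * j] = (int) outputScratch[j];
                       docIDs[i + 3 * j + 1] = (int) outputScratch[j + LONG_SPECIES.length()];
                       docIDs[i + 3 * j + 2] = (int) outputScratch[j + LONG_SPECIES.length() * 2];
                   }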
               }
               for (; i < count - 2; i += 3) {
                   long packedLong = in.readLong();
                   docIDs[i] = (int) (packedLong >>> 42);
                   docIDs[i + 1] = (int) ((packedLong & 0x000003FFFFE00000L) >>> 21);
                   docIDs[i + 2] = (int) (packedLong & 0x001FFFFFL);
               }
               for (; i < count; i++) {
                   docIDs[i] = in.readInt();
               }
           }
   ```
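
The shifts and masks above imply a layout of three 21-bit doc IDs per long: the first in bits 42..62, the second in bits 21..41, and the third in bits 0..20. A minimal round-trip sketch of that layout follows. The `pack` side is an assumption inferred from the decode logic (the encoder is not shown here), and `Bit21Packing` is a hypothetical name.

```java
/** Hypothetical helper illustrating the 3 x 21-bit layout; not part of the PR. */
final class Bit21Packing {

    /**
     * Packs three doc IDs into one long. Assumes each value is non-negative
     * and below 2^21; nothing here checks that.
     */
    static long pack(int a, int b, int c) {
        return ((long) a << 42) | ((long) b << 21) | (long) c;
    }

    /** Unpacks one long into out[offset..offset + 2], using the same masks as decode. */
    static void unpack(long packedLong, int[] out, int offset) {
        out[offset] = (int) (packedLong >>> 42);
        out[offset + 1] = (int) ((packedLong & 0x000003FFFFE00000L) >>> 21);
        out[offset + 2] = (int) (packedLong & 0x001FFFFFL);
    }

    public static void main(String[] args) {
        int[] out = new int[3];
        unpack(pack(1, 2_000_000, 2_097_151), out, 0);
        // prints [1, 2000000, 2097151]
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

Because one long always expands to three adjacent output slots, any vectorized variant has to interleave its per-lane results with a stride of 3 when writing to `docIDs`.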
   
   Unfortunately, it performs noticeably worse than the other implementations, taking roughly 2.5x as long per operation:
   
   ```
   Benchmark                               (encoderName)  Mode  Cnt     Score    Error  Units
   DocIdEncodingBenchmark.decode    Bit21WithSimdEncoder  avgt    5  2191.040 ± 14.913  ms/op
   DocIdEncodingBenchmark.decode  Bit21With3StepsEncoder  avgt    5   850.331 ±  4.576  ms/op
   DocIdEncodingBenchmark.decode  Bit21With2StepsEncoder  avgt    5   859.980 ±  4.567  ms/op
   DocIdEncodingBenchmark.decode            Bit24Encoder  avgt    5   912.914 ±  5.488  ms/op
   ```
   
   Maybe I'm doing it wrong 🤷 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

