msfroh commented on PR #13521: URL: https://github.com/apache/lucene/pull/13521#issuecomment-2315907724
Okay -- I was able to speed up the SIMD implementation a fair bit. Honestly, my main stupid mistake was that I hadn't declared `LONG_SPECIES` as `static final`, which probably prevented some inlining. I also removed the array allocations in each call, as well as the scalar operations within the vector loop.

```
import java.io.IOException;
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;
import org.apache.lucene.store.IndexInput;

// Members of the encoder class:
private static final VectorSpecies<Long> LONG_SPECIES = LongVector.SPECIES_MAX;
private final long[] inputScratch = new long[512 / 3]; // We know that count is <= 512
private final long[] outputScratch = new long[inputScratch.length * 3];

@Override
public void decode(IndexInput in, int start, int count, int[] docIDs) throws IOException {
  int i = 0;
  int lanes = LONG_SPECIES.length();
  // Number of doc IDs (a multiple of 3 * lanes) handled by the vector loop.
  int bound = LONG_SPECIES.loopBound(count / 3) * 3;
  for (int j = 0; j < bound / 3; j++) {
    inputScratch[j] = in.readLong();
  }
  int inc = lanes * 3;
  for (; i < bound; i += inc) {
    LongVector longVector = LongVector.fromArray(LONG_SPECIES, inputScratch, i / 3);
    // Write the three 21-bit fields as three consecutive blocks of `lanes` values.
    longVector.lanewise(VectorOperators.LSHR, 42)
        .intoArray(outputScratch, i);
    longVector.lanewise(VectorOperators.AND, 0x000003FFFFE00000L)
        .lanewise(VectorOperators.LSHR, 21)
        .intoArray(outputScratch, i + lanes);
    longVector.lanewise(VectorOperators.AND, 0x001FFFFFL)
        .intoArray(outputScratch, i + lanes * 2);
  }
  // De-interleave: lane k of each block holds doc IDs 3k, 3k + 1 and 3k + 2 of the batch.
  for (int j = 0; j < bound; j += inc) {
    for (int k = 0; k < lanes; k++) {
      docIDs[j + 3 * k] = (int) outputScratch[j + k];
      docIDs[j + 3 * k + 1] = (int) outputScratch[j + k + lanes];
      docIDs[j + 3 * k + 2] = (int) outputScratch[j + k + lanes * 2];
    }
  }
  // Scalar tail: remaining full triples, then any leftover doc IDs stored as ints.
  for (; i < count - 2; i += 3) {
    long packedLong = in.readLong();
    docIDs[i] = (int) (packedLong >>> 42);
    docIDs[i + 1] = (int) ((packedLong & 0x000003FFFFE00000L) >>> 21);
    docIDs[i + 2] = (int) (packedLong & 0x001FFFFFL);
  }
  for (; i < count; i++) {
    docIDs[i] = in.readInt();
  }
}
```

It's still slower than the scalar implementation (about 21-22% behind the two scalar 21-bit encoders), but it's a lot closer:

```
Benchmark                      (encoderName)           Mode  Cnt     Score    Error  Units
DocIdEncodingBenchmark.decode  Bit21WithSimdEncoder    avgt    5  1032.151 ±  9.343  ms/op
DocIdEncodingBenchmark.decode  Bit21With3StepsEncoder  avgt    5   845.505 ±  5.924  ms/op
DocIdEncodingBenchmark.decode  Bit21With2StepsEncoder  avgt    5   851.975 ±  1.618  ms/op
DocIdEncodingBenchmark.decode  Bit24Encoder            avgt    5   913.055 ± 79.916  ms/op
```
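
For anyone following the bit layout, here is a minimal scalar sketch of the 21-bit packing that the shifts and masks above imply: three doc IDs per long, with the first in bits 42..62, the second in bits 21..41, and the third in bits 0..20. The class and method names (`Bit21PackingSketch`, `pack`, `unpack`, `decodeBatch`) are made up for illustration and are not from the PR, and `pack` is only an assumption about the encoder side, inferred from the decode path. `decodeBatch` emulates the vector path with plain arrays (no Vector API), to show the block layout of the scratch buffer and the de-interleave step.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the Lucene PR.
final class Bit21PackingSketch {
    static final long LOW_MASK = 0x001FFFFFL;         // bits 0..20
    static final long MID_MASK = 0x000003FFFFE00000L; // bits 21..41

    /** Packs three doc IDs into one long. Assumes each ID is in [0, 2^21). */
    static long pack(int a, int b, int c) {
        return ((long) a << 42) | ((long) b << 21) | (long) c;
    }

    /** Scalar decode of one packed long into docIDs[offset .. offset + 2]. */
    static void unpack(long packed, int[] docIDs, int offset) {
        docIDs[offset] = (int) (packed >>> 42);
        docIDs[offset + 1] = (int) ((packed & MID_MASK) >>> 21);
        docIDs[offset + 2] = (int) (packed & LOW_MASK);
    }

    /**
     * Emulates one vector iteration over `lanes` packed longs using plain arrays:
     * each field is first written as a contiguous block of `lanes` values, and
     * the three blocks are then de-interleaved into doc ID order.
     */
    static void decodeBatch(long[] packed, int lanes, int[] docIDs) {
        long[] scratch = new long[lanes * 3];
        for (int k = 0; k < lanes; k++) {
            scratch[k] = packed[k] >>> 42;
            scratch[k + lanes] = (packed[k] & MID_MASK) >>> 21;
            scratch[k + 2 * lanes] = packed[k] & LOW_MASK;
        }
        for (int k = 0; k < lanes; k++) {
            docIDs[3 * k] = (int) scratch[k];
            docIDs[3 * k + 1] = (int) scratch[k + lanes];
            docIDs[3 * k + 2] = (int) scratch[k + 2 * lanes];
        }
    }

    public static void main(String[] args) {
        long[] packed = {pack(10, 11, 12), pack(20, 21, 22)};
        int[] docIDs = new int[6];
        decodeBatch(packed, 2, docIDs);
        System.out.println(Arrays.toString(docIDs));
    }
}
```

The de-interleave loop is the part that stays scalar in the vector version: lane `k` of the three blocks maps to output positions `3k`, `3k + 1`, and `3k + 2`.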