msfroh commented on PR #13521:
URL: https://github.com/apache/lucene/pull/13521#issuecomment-2315907724
Okay -- I was able to speed up the SIMD implementation a fair bit. Honestly,
my main stupid mistake was that I hadn't declared `LONG_SPECIES` as `static
final`, which probably prevented some inlining.
I removed the array allocations in each call, as well as the scalar
operations within the vector loop.
```
private static final VectorSpecies<Long> LONG_SPECIES = LongVector.SPECIES_MAX;

// Scratch buffers reused across calls. We know that count is <= 512.
private final long[] inputScratch = new long[512 / 3];
private final long[] outputScratch = new long[inputScratch.length * 3];

@Override
public void decode(IndexInput in, int start, int count, int[] docIDs) throws IOException {
  int i = 0;
  // Number of docIDs covered by whole vectors (each long packs three docIDs).
  int bound = LONG_SPECIES.loopBound(count / 3) * 3;
  for (int j = 0; j < bound / 3; j++) {
    inputScratch[j] = in.readLong();
  }
  int inc = LONG_SPECIES.length() * 3;
  for (; i < bound; i += inc) {
    LongVector longVector = LongVector.fromArray(LONG_SPECIES, inputScratch, i / 3);
    // Each block of outputScratch holds all first values, then all second
    // values, then all third values of the longs in this vector.
    longVector.lanewise(VectorOperators.LSHR, 42)
        .intoArray(outputScratch, i);
    longVector.lanewise(VectorOperators.AND, 0x000003FFFFE00000L)
        .lanewise(VectorOperators.LSHR, 21)
        .intoArray(outputScratch, i + LONG_SPECIES.length());
    longVector.lanewise(VectorOperators.AND, 0x001FFFFFL)
        .intoArray(outputScratch, i + LONG_SPECIES.length() * 2);
  }
  // Narrow to int and interleave: lane k of a block holds docIDs 3k, 3k + 1 and 3k + 2.
  for (int j = 0; j < bound; j += LONG_SPECIES.length() * 3) {
    for (int k = 0; k < LONG_SPECIES.length(); k++) {
      docIDs[j + 3 * k] = (int) outputScratch[j + k];
      docIDs[j + 3 * k + 1] = (int) outputScratch[j + k + LONG_SPECIES.length()];
      docIDs[j + 3 * k + 2] = (int) outputScratch[j + k + LONG_SPECIES.length() * 2];
    }
  }
  // Scalar tail: remaining full triples that did not fill a whole vector.
  for (; i < count - 2; i += 3) {
    long packedLong = in.readLong();
    docIDs[i] = (int) (packedLong >>> 42);
    docIDs[i + 1] = (int) ((packedLong & 0x000003FFFFE00000L) >>> 21);
    docIDs[i + 2] = (int) (packedLong & 0x001FFFFFL);
  }
  // Last one or two docIDs are stored as plain ints.
  for (; i < count; i++) {
    docIDs[i] = in.readInt();
  }
}
```
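To make the bit layout and the scratch-buffer layout easier to follow, here is a minimal plain-Java sketch that avoids the Vector API altogether. It assumes each long holds three consecutive 21-bit docIDs, as the scalar tail loop above implies. The class `Bit21LayoutSketch`, the helpers `pack`, `decodeBlock` and `roundTrip`, and the `lanes` parameter (standing in for `LONG_SPECIES.length()`) are all hypothetical names for illustration and are not part of the PR.

```java
import java.util.Arrays;

final class Bit21LayoutSketch {
  // Pack three 21-bit values into one long: a in bits 42..62, b in bits 21..41, c in bits 0..20.
  static long pack(int a, int b, int c) {
    return ((long) a << 42) | ((long) b << 21) | (long) c;
  }

  // Plain-array stand-in for one vector block of `lanes` longs.
  static void decodeBlock(long[] input, int inOffset, int lanes, int[] docIDs, int outOffset) {
    long[] scratch = new long[lanes * 3];
    // "Lane-wise" step: all first values, then all second values, then all third values.
    for (int k = 0; k < lanes; k++) {
      long packed = input[inOffset + k];
      scratch[k] = packed >>> 42;
      scratch[k + lanes] = (packed & 0x000003FFFFE00000L) >>> 21;
      scratch[k + lanes * 2] = packed & 0x001FFFFFL;
    }
    // Interleave back into docID order: lane k holds docIDs 3k, 3k + 1 and 3k + 2.
    for (int k = 0; k < lanes; k++) {
      docIDs[outOffset + 3 * k] = (int) scratch[k];
      docIDs[outOffset + 3 * k + 1] = (int) scratch[k + lanes];
      docIDs[outOffset + 3 * k + 2] = (int) scratch[k + lanes * 2];
    }
  }

  // Encode then decode; docs.length must be a multiple of 3 * lanes in this sketch.
  static int[] roundTrip(int[] docs, int lanes) {
    long[] packed = new long[docs.length / 3];
    for (int i = 0; i < packed.length; i++) {
      packed[i] = pack(docs[3 * i], docs[3 * i + 1], docs[3 * i + 2]);
    }
    int[] out = new int[docs.length];
    for (int j = 0; j < docs.length; j += 3 * lanes) {
      decodeBlock(packed, j / 3, lanes, out, j);
    }
    return out;
  }

  public static void main(String[] args) {
    int[] docs = {1, 2, 3, 2_097_151, 0, 42, 7, 8, 9, 100, 200, 300};
    System.out.println(Arrays.equals(docs, roundTrip(docs, 4)));
  }
}
```

The second loop in `decodeBlock` is the part that stays scalar in the SIMD version: the values come out of the vector operations grouped by field, so they still have to be narrowed to `int` and interleaved one element at a time.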
It's still slower than the scalar implementations (about 22% slower than `Bit21With3StepsEncoder`, the fastest), but it's a lot closer:
```
Benchmark                      (encoderName)           Mode  Cnt     Score    Error  Units
DocIdEncodingBenchmark.decode  Bit21WithSimdEncoder    avgt    5  1032.151 ±  9.343  ms/op
DocIdEncodingBenchmark.decode  Bit21With3StepsEncoder  avgt    5   845.505 ±  5.924  ms/op
DocIdEncodingBenchmark.decode  Bit21With2StepsEncoder  avgt    5   851.975 ±  1.618  ms/op
DocIdEncodingBenchmark.decode  Bit24Encoder            avgt    5   913.055 ± 79.916  ms/op
```