gsmiller commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1634807010
OK, one last follow-up on my idea of using tighter lanes to get more concurrency for shift+mask while decoding: I educated myself a little more and explored using a scatter instruction to copy the output vector into a `byte[]` padded with 0s, so that each decoded value occupies 4 bytes instead of 1. My thought was that we could then throw a `ByteBuffer` on top of it and interpret the data as integers. From some micro-benchmarks I ran, it looks like the scatter is very costly and tanks performance, so I'm convinced at this point that the "tighter lane" idea isn't worth pursuing further. Here's essentially what I was doing, just to close the loop (I was experimenting with 64 values at a time instead of 128, and a bit width of 2, to keep things simple and only load data into a vector register once):

```java
// Maps each of the 16 byte lanes to the last byte of a 4-byte slot in the
// output, leaving the other 3 bytes of each slot as 0 padding.
private static final int[] scatterMap = new int[16];
static {
  int upto = 3;
  for (int i = 0; i < 16; i++) {
    scatterMap[i] = upto;
    upto += 4;
  }
}

public static void unpackSimd2(byte[] in, byte[] out) {
  ByteVector inVec = ByteVector.fromArray(BYTE_SPECIES_128, in, 0);
  ByteVector outVec;
  int upto = 0;
  for (int shift = 0; shift < 8; shift += 2) {
    // Shift + mask to extract the next 2-bit value from each byte lane, then
    // scatter the 16 decoded bytes into 4-byte-wide slots in `out`.
    outVec = inVec.lanewise(VectorOperators.LSHR, shift).and(BYTE_MASK);
    outVec.intoArray(out, upto, scatterMap, 0);
    upto += 64;
  }
}
```
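
To spell out the "throw a ByteBuffer on top of this" step, here's a rough sketch of how the padded output could be read back as ints (a hypothetical `reinterpretAsInts` helper for illustration, not part of the benchmark code above):

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Sketch only: view the padded byte[] produced by unpackSimd2 as ints.
// The scatter map writes each decoded byte to offset 3 within its 4-byte slot,
// i.e. the least-significant position under ByteBuffer's default big-endian
// order, so each int comes out equal to the decoded 2-bit value.
static int[] reinterpretAsInts(byte[] out) {
  IntBuffer ints = ByteBuffer.wrap(out).asIntBuffer();
  int[] values = new int[ints.remaining()];
  ints.get(values);
  return values;
}
```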