kaivalnp commented on code in PR #14863: URL: https://github.com/apache/lucene/pull/14863#discussion_r2254730354
########## lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ########## @@ -530,7 +566,41 @@ private int dotProductBody512Int4Packed(byte[] unpacked, byte[] packed, int limi return sum; } - private int dotProductBody256Int4Packed(byte[] unpacked, byte[] packed, int limit) { + private static int dotProductBody512Int4PackedPacked( + ByteVectorLoader a, ByteVectorLoader b, int limit) { + int sum = 0; + // iterate in chunks of 1024 items to ensure we don't overflow the short accumulator + for (int i = 0; i < limit; i += 4096) { + ShortVector acc0 = ShortVector.zero(ShortVector.SPECIES_512); + ShortVector acc1 = ShortVector.zero(ShortVector.SPECIES_512); + int innerLimit = Math.min(limit - i, 4096); + for (int j = 0; j < innerLimit; j += ByteVector.SPECIES_256.length()) { + // packed + var vb8 = b.load(ByteVector.SPECIES_256, i + j); + // packed + var va8 = a.load(ByteVector.SPECIES_256, i + j); + + // upper + ByteVector prod8 = vb8.and((byte) 0x0F).mul(va8.and((byte) 0x0F)); + Vector<Short> prod16 = prod8.convertShape(ZERO_EXTEND_B2S, ShortVector.SPECIES_512, 0); Review Comment: > This is still doing a `convertShape` operation In this conversion, the bit sizes are the same (`ShortVector.SPECIES_256` -> `IntVector.SPECIES_256`) so it is not shape-changing, and should be cheap? I saw this in the Javadocs: ``` If the old and new species have the same shape, the behavior is exactly the same as the simpler, shape-invariant method convert(). In such cases, the simpler method convert() should be used, to make code easier to reason about. Otherwise, this is a shape-changing operation, and may have special implementation costs ``` Also, we're only doing the `convertShape` outside the inner loop (once per "chunk") > is there a way to extract the accumulated values from the short accs without converting to an int first? We could probably reinterpret it as integers, and do some bit manipulation. I changed the original accumulation logic from: ```java IntVector intAcc0 = acc0.convertShape(S2I, IntVector.SPECIES_256, 0).reinterpretAsInts(); IntVector intAcc1 = acc0.convertShape(S2I, IntVector.SPECIES_256, 1).reinterpretAsInts(); IntVector intAcc2 = acc1.convertShape(S2I, IntVector.SPECIES_256, 0).reinterpretAsInts(); IntVector intAcc3 = acc1.convertShape(S2I, IntVector.SPECIES_256, 1).reinterpretAsInts(); sum += intAcc0.add(intAcc1).add(intAcc2).add(intAcc3).reduceLanes(ADD); ``` ..to: ```java IntVector acc0i = acc0.reinterpretAsInts(); IntVector acc1i = acc1.reinterpretAsInts(); IntVector intAcc0 = acc0i.and(0xFFFF); // retain lower short IntVector intAcc1 = acc0i.lanewise(LSHR, 16); // retain upper short IntVector intAcc2 = acc1i.and(0xFFFF); IntVector intAcc3 = acc1i.lanewise(LSHR, 16); sum += intAcc0.add(intAcc1).add(intAcc2).add(intAcc3).reduceLanes(ADD); ``` ..and the results are: ``` VectorUtilBenchmark.binaryHalfByteVectorPackedPacked 1024 thrpt 15 12.941 ± 0.194 ops/us ``` The performance is almost similar. I can use `convert` instead of `convertShape` for better readability (and enforcing that this operation is not shape-changing) as the Javadocs suggest -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org