thecoop commented on code in PR #14863:
URL: https://github.com/apache/lucene/pull/14863#discussion_r2254223052


##########
lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -530,7 +566,41 @@ private int dotProductBody512Int4Packed(byte[] unpacked, byte[] packed, int limi
     return sum;
   }
 
-  private int dotProductBody256Int4Packed(byte[] unpacked, byte[] packed, int limit) {
+  private static int dotProductBody512Int4PackedPacked(
+      ByteVectorLoader a, ByteVectorLoader b, int limit) {
+    int sum = 0;
+    // iterate in chunks of 1024 items to ensure we don't overflow the short accumulator
+    for (int i = 0; i < limit; i += 4096) {
+      ShortVector acc0 = ShortVector.zero(ShortVector.SPECIES_512);
+      ShortVector acc1 = ShortVector.zero(ShortVector.SPECIES_512);
+      int innerLimit = Math.min(limit - i, 4096);
+      for (int j = 0; j < innerLimit; j += ByteVector.SPECIES_256.length()) {
+        // packed
+        var vb8 = b.load(ByteVector.SPECIES_256, i + j);
+        // packed
+        var va8 = a.load(ByteVector.SPECIES_256, i + j);
+
+        // upper
+        ByteVector prod8 = vb8.and((byte) 0x0F).mul(va8.and((byte) 0x0F));
+        Vector<Short> prod16 = prod8.convertShape(ZERO_EXTEND_B2S, ShortVector.SPECIES_512, 0);

Review Comment:
   Yes, that kind of thing. The inner and/mul operations could be done in a loop rather than unrolled - probably won't have an effect?
   
   This is still doing a `convertShape` operation - is there a way to extract the accumulated values from the short accs without converting to an int first?
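
   For reference on the overflow question, here is a scalar sketch of the arithmetic, not the PR's code. It assumes each short lane receives one int4 product (at most 15 * 15 = 225) per 32-byte step, which the truncated hunk does not confirm; the class and method names are hypothetical. Under that assumption a 4096-byte chunk stays within a short per lane, but summing the 32 lanes in short arithmetic wraps, which is why a widening step before the horizontal reduction looks hard to avoid (as far as I know, `ShortVector.reduceLanes` also returns a `short`).

```java
import java.util.Arrays;

/** Scalar illustration only: hypothetical names, not part of Lucene or the PR. */
final class Int4ShortAccumulatorSketch {
  // Largest product of two unsigned 4-bit values.
  static final int MAX_PRODUCT = 15 * 15; // 225

  /**
   * Worst-case value held by one short lane after a chunk, ASSUMING one
   * product is added to each lane per step of bytesPerStep bytes.
   */
  static int worstCaseLaneSum(int chunkBytes, int bytesPerStep) {
    int steps = chunkBytes / bytesPerStep;
    return steps * MAX_PRODUCT;
  }

  /** Horizontal sum kept in short arithmetic; wraps modulo 2^16 on overflow. */
  static short reduceAsShort(short[] lanes) {
    short sum = 0;
    for (short lane : lanes) {
      sum += lane; // compound assignment narrows the int result back to short
    }
    return sum;
  }

  /** Horizontal sum after widening each lane to int; does not wrap here. */
  static int reduceAsInt(short[] lanes) {
    int sum = 0;
    for (short lane : lanes) {
      sum += lane;
    }
    return sum;
  }

  public static void main(String[] args) {
    // 4096 bytes / 32 bytes per step = 128 steps; 128 * 225 = 28800 <= Short.MAX_VALUE
    int perLane = worstCaseLaneSum(4096, 32);
    short[] lanes = new short[32]; // 32 short lanes in a 512-bit vector
    Arrays.fill(lanes, (short) perLane);

    System.out.println(perLane);              // 28800
    System.out.println(reduceAsInt(lanes));   // 921600
    System.out.println(reduceAsShort(lanes)); // 4096, i.e. 921600 mod 65536
  }
}
```

   So the per-lane accumulation is safe at this chunk size under the stated assumption, and only the cross-lane reduction needs the wider type. Doubling the chunk to 8192 bytes would give 256 * 225 = 57600 per lane, which no longer fits in a signed short.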