Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

via GitHub Wed, 31 Jul 2024 18:04:53 -0700


rmuir commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1699259519



##########
lucene/core/src/c/dotProduct.c:
##########
@@ -0,0 +1,143 @@
+// dotProduct.c
+
+#include <stdio.h>
+#include <arm_neon.h>
+
+#ifdef __ARM_ACLE
+#include <arm_acle.h>
+#endif
+
+#if (defined(__ARM_FEATURE_SVE) && !defined(__APPLE__)) 
+#include <arm_sve.h>
+/*
+ * Unrolled and vectorized int8 dotProduct implementation using SVE 
instructions
+ * NOTE: Clang 15.0 compiler on Apple M3 Max compiles the code below 
sucessfully 
+ * with '-march=native+sve' option but throws "Illegal Hardware Instruction" 
error
+ * Looks like Apple M3 does not implement SVE and Apple's official 
documentation
+ * is not explicit about this or at least I could not find it. 
+ * 
+ */
+int32_t vdot8s_sve(int8_t *vec1, int8_t *vec2, int32_t limit) {
+    int32_t result = 0;
+    int32_t i = 0;
+    // Vectors of 8-bit signed integers
+    svint8_t va1, va2, va3, va4;
+    svint8_t vb1, vb2, vb3, vb4;
+    // Init accumulators
+    svint32_t acc1 = svdup_n_s32(0);
+    svint32_t acc2 = svdup_n_s32(0);
+    svint32_t acc3 = svdup_n_s32(0);
+    svint32_t acc4 = svdup_n_s32(0);
+
+    // Number of 8-bits elements in the SVE vector
+    int32_t vec_length = svcntb();
+
+    // Manually unroll the loop
+    for (i = 0; i + 4 * vec_length <= limit; i += 4 * vec_length) {
+       // Load vectors into the Z registers which can range from 128-bit to 
2048-bit wide
+       // The predicate register - P determines which bytes are active
+       // svptrue_b8() returns a predictae in which every element is true
+       //
+        va1 = svld1_s8(svptrue_b8(), vec1 + i);
+        vb1 = svld1_s8(svptrue_b8(), vec2 + i);
+
+        va2 = svld1_s8(svptrue_b8(), vec1 + i + vec_length);
+        vb2 = svld1_s8(svptrue_b8(), vec2 + i + vec_length);
+
+        va3 = svld1_s8(svptrue_b8(), vec1 + i + 2 * vec_length);
+        vb3 = svld1_s8(svptrue_b8(), vec2 + i + 2 * vec_length);
+
+        va4 = svld1_s8(svptrue_b8(), vec1 + i + 3 * vec_length);
+        vb4 = svld1_s8(svptrue_b8(), vec2 + i + 3 * vec_length);
+
+       // Dot product using SDOT instruction on Z vectors
+       acc1 = svdot_s32(acc1, va1, vb1);
+       acc2 = svdot_s32(acc2, va2, vb2);
+       acc3 = svdot_s32(acc3, va3, vb3);
+       acc4 = svdot_s32(acc4, va4, vb4);
+    }       
+    // Add correspponding active elements in each of the vectors 
+    acc1 = svadd_s32_x(svptrue_b8() , acc1, acc2);
+    acc3 = svadd_s32_x(svptrue_b8() , acc3, acc4);
+    acc1 = svadd_s32_x(svptrue_b8(), acc1, acc3);
+    
+    // REDUCE: Add every vector element in target and write result to scalar
+    result = svaddv_s32(svptrue_b8(), acc1);
+
+    // Scalar tail. TODO: Use FMA
+    for (; i < limit; i++) {
+        result += vec1[i] * vec2[i];
+    }

Review Comment:
   for any "tails" like this where we manually unroll and vectorize the main 
loop, we can add pragmas to prevent GCC/LLVM from trying to unroll and 
vectorize the tail. It is not strictly necessary but will lead to tighter code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

Reply via email to