Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

via GitHub Fri, 25 Oct 2024 14:23:14 -0700


goankur commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1817385236



##########
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/VectorUtilBenchmark.java:
##########
@@ -84,6 +91,76 @@ public void init() {
       floatsA[i] = random.nextFloat();
       floatsB[i] = random.nextFloat();
     }
+    // Java 21+ specific initialization
+    final int runtimeVersion = Runtime.version().feature();
+    if (runtimeVersion >= 21) {
+      // Reflection based code to eliminate the use of Preview classes in JMH benchmarks
+      try {
+        final Class<?> vectorUtilSupportClass = VectorUtil.getVectorUtilSupportClass();
+        final var className = "org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport";
+        if (vectorUtilSupportClass.getName().equals(className) == false) {
+          nativeBytesA = null;
+          nativeBytesB = null;
+        } else {
+          MethodHandles.Lookup lookup = MethodHandles.lookup();
+          final var MemorySegment = "java.lang.foreign.MemorySegment";
+          final var methodType =
+              MethodType.methodType(lookup.findClass(MemorySegment), byte[].class);
+          MethodHandle nativeMemorySegment =
+              lookup.findStatic(vectorUtilSupportClass, "nativeMemorySegment", methodType);
+          byte[] a = new byte[size];

Review Comment:
   Yes, this is the setup code for the benchmark. Setup runs once per 
`iteration`, for a total of `15` iterations across `3` forks (5 iterations per 
fork) for each `size` being tested. Each fork starts with 3 warm-up 
iterations.
   So before **each** iteration we generate random numbers in the range 
[0, 127] in two on-heap `byte[]` arrays, allocate off-heap memory segments, and 
populate them with the contents of the `byte[]` arrays. These off-heap memory 
segments are passed to the `VectorUtil.NATIVE_DOT_PRODUCT` method handle.
   
   (Code snippet below for reference)
   
   ```
   @Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
   int size;
   
   @Setup(Level.Iteration)
   public void init() {
   ...
   }
   ```
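
To make the per-iteration setup concrete, here is a minimal sketch of the same steps in plain Java. It is not the PR's code: the class and method names are hypothetical, `ByteBuffer.allocateDirect` stands in for the off-heap `MemorySegment` so the sketch runs without the foreign-memory API, and a scalar loop stands in for the native dot product.

```java
import java.nio.ByteBuffer;
import java.util.Random;

// Hypothetical illustration class; not part of the PR.
class OffHeapSetupSketch {

  /** Fills a fresh on-heap array with random values in [0, 127]. */
  static byte[] randomVector(Random random, int size) {
    byte[] vector = new byte[size];
    for (int i = 0; i < size; i++) {
      vector[i] = (byte) random.nextInt(128); // upper bound is exclusive
    }
    return vector;
  }

  /** Copies an on-heap array into newly allocated off-heap (direct) memory. */
  static ByteBuffer toOffHeap(byte[] onHeap) {
    ByteBuffer direct = ByteBuffer.allocateDirect(onHeap.length);
    direct.put(onHeap);
    direct.flip(); // reset position to 0, limit to the number of bytes written
    return direct;
  }

  /** Scalar dot product over on-heap arrays. */
  static int dotProduct(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Scalar dot product over off-heap buffers (assumed stand-in for the native call). */
  static int dotProduct(ByteBuffer a, ByteBuffer b) {
    int sum = 0;
    for (int i = 0; i < a.limit(); i++) {
      sum += a.get(i) * b.get(i); // absolute reads, position is not modified
    }
    return sum;
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    for (int size : new int[] {1, 128, 207, 256, 300, 512, 702, 1024}) {
      byte[] a = randomVector(random, size);
      byte[] b = randomVector(random, size);
      boolean same = dotProduct(a, b) == dotProduct(toOffHeap(a), toOffHeap(b));
      System.out.println("size=" + size + " heap and off-heap results match: " + same);
    }
  }
}
```

Because both copies hold the same bytes, the on-heap and off-heap results agree, which is what lets the benchmark compare implementations on identical inputs.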
   
   > I wonder if we would see something different if we generated a large 
number of vectors and randomized which ones we compare on each run. Also would 
performance vary if the vectors are sequential in their buffer (ie vector 0 
starts at 0, vector 1 starts at size...)
   
   I guess the question you are hinting at is how performance varies when 
the two candidate vectors are further apart in memory (L1 cache / L2 cache / 
L3 cache / main memory). Do the gains from the native implementation become 
insignificant with increasing distance? It's an interesting question, and I 
propose that we add benchmark method(s) to answer it in a follow-up PR. Does 
that sound reasonable?
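
As a starting point for that follow-up, here is a rough sketch of the layout described in the quoted question: many vectors stored sequentially in one buffer (vector 0 at offset 0, vector 1 at offset `size`, and so on), with the pair to compare chosen at random on each run. The class name, pool size, and seeds are all hypothetical, and the scalar loop is only a placeholder for whichever dot-product implementation the benchmark would measure.

```java
import java.util.Random;

// Hypothetical illustration class; not part of the PR.
class VectorPoolSketch {
  private final byte[] pool; // all vectors stored back to back
  private final int size;    // dimensions per vector

  VectorPoolSketch(int count, int size, long seed) {
    this.size = size;
    this.pool = new byte[count * size];
    Random random = new Random(seed);
    for (int i = 0; i < pool.length; i++) {
      pool[i] = (byte) random.nextInt(128); // values in [0, 127]
    }
  }

  /** Start offset of a vector: vector 0 starts at 0, vector 1 at size, ... */
  int offsetOf(int ordinal) {
    return ordinal * size;
  }

  /** Scalar dot product of two vectors addressed by ordinal. */
  int dotProduct(int ordinalA, int ordinalB) {
    int offsetA = offsetOf(ordinalA);
    int offsetB = offsetOf(ordinalB);
    int sum = 0;
    for (int i = 0; i < size; i++) {
      sum += pool[offsetA + i] * pool[offsetB + i];
    }
    return sum;
  }

  public static void main(String[] args) {
    int count = 10_000; // assumed pool size, chosen to exceed typical L1/L2 capacity
    int size = 1024;
    VectorPoolSketch vectors = new VectorPoolSketch(count, size, 42L);
    Random picker = new Random(7L);
    long checksum = 0;
    for (int run = 0; run < 1_000; run++) {
      // A random pair each run, so the two operands are rarely adjacent in memory.
      checksum += vectors.dotProduct(picker.nextInt(count), picker.nextInt(count));
    }
    System.out.println("checksum = " + checksum);
  }
}
```

Varying `count` would change how much of the pool fits in each cache level, which is the distance effect the question is asking about.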
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

