goankur commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1817385236
########## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/VectorUtilBenchmark.java:
##########
@@ -84,6 +91,76 @@ public void init() {
      floatsA[i] = random.nextFloat();
      floatsB[i] = random.nextFloat();
    }
+    // Java 21+ specific initialization
+    final int runtimeVersion = Runtime.version().feature();
+    if (runtimeVersion >= 21) {
+      // Reflection based code to eliminate the use of Preview classes in JMH benchmarks
+      try {
+        final Class<?> vectorUtilSupportClass = VectorUtil.getVectorUtilSupportClass();
+        final var className = "org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport";
+        if (vectorUtilSupportClass.getName().equals(className) == false) {
+          nativeBytesA = null;
+          nativeBytesB = null;
+        } else {
+          MethodHandles.Lookup lookup = MethodHandles.lookup();
+          final var MemorySegment = "java.lang.foreign.MemorySegment";
+          final var methodType =
+              MethodType.methodType(lookup.findClass(MemorySegment), byte[].class);
+          MethodHandle nativeMemorySegment =
+              lookup.findStatic(vectorUtilSupportClass, "nativeMemorySegment", methodType);
+          byte[] a = new byte[size];

Review Comment:
   We do not compare the same two vectors on each iteration. Setup runs once per iteration, for a total of `15` measured iterations across `3` forks (5 iterations per fork) for each `size` being tested; each fork is preceded by 3 warm-up iterations. Before **each** iteration we generate random numbers in the range [0, 127] in two on-heap `byte[]` arrays, allocate off-heap memory segments, and populate them with the contents of those arrays. These off-heap memory segments are then provided to the `VectorUtil.NATIVE_DOT_PRODUCT` method handle. (Code snippet below for reference)
   ```
   @Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
   int size;

   @Setup(Level.Iteration)
   public void init() {
     ...
   }
   ```
   > I wonder if we would see something different if we generated a large number of vectors and randomized which ones we compare on each run.
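   For reference, the per-iteration setup described above can be sketched as a stand-alone program. This is an illustrative sketch, not the PR's actual code: it uses direct `ByteBuffer`s as a stand-in for `java.lang.foreign.MemorySegment` (so it compiles on older JDKs), and the class and method names here are hypothetical.

   ```java
   import java.nio.ByteBuffer;
   import java.util.Random;

   public class OffHeapVectorSetupSketch {

     // Copy an on-heap byte[] into a freshly allocated off-heap (direct) buffer,
     // mirroring how the benchmark hands off-heap segments to the native dot-product.
     static ByteBuffer toOffHeap(byte[] src) {
       ByteBuffer seg = ByteBuffer.allocateDirect(src.length);
       seg.put(src);
       seg.flip(); // make the copied contents readable from position 0
       return seg;
     }

     public static void main(String[] args) {
       int size = 128; // one of the benchmarked @Param sizes
       Random random = new Random();
       byte[] a = new byte[size];
       byte[] b = new byte[size];
       for (int i = 0; i < size; i++) {
         a[i] = (byte) random.nextInt(128); // values in [0, 127]
         b[i] = (byte) random.nextInt(128);
       }
       ByteBuffer nativeA = toOffHeap(a);
       ByteBuffer nativeB = toOffHeap(b);
       System.out.println(nativeA.isDirect() && nativeB.isDirect()); // prints: true
       System.out.println(nativeA.get(0) == a[0]);                   // prints: true
     }
   }
   ```

   In the real benchmark the equivalent of `toOffHeap` is reached reflectively (via the `nativeMemorySegment` method handle shown in the diff) so that the JMH module does not depend on preview classes.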
   > Also would performance vary if the vectors are sequential in their buffer (ie vector 0 starts at 0, vector 1 starts at size...)

   I guess the question you are hinting at is: how does performance vary when the two candidate vectors are further apart in memory (L1 cache / L2 cache / L3 cache / main memory)? Do the gains from the native implementation become insignificant with increasing distance? It's an interesting question, and I propose that we add benchmark method(s) to answer it in a follow-up PR. Does that sound reasonable?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@lucene.apache.org