msokolov commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1825833969


##########
lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -291,25 +296,125 @@ private float squareDistanceBody(float[] a, float[] b, int limit) {
     return res1.add(res2).reduceLanes(ADD);
   }
 
-  // Binary functions, these all follow a general pattern like this:
-  //
-  //   short intermediate = a * b;
-  //   int accumulator = (int)accumulator + (int)intermediate;
-  //
-  // 256 or 512 bit vectors can process 64 or 128 bits at a time, respectively
-  // intermediate results use 128 or 256 bit vectors, respectively
-  // final accumulator uses 256 or 512 bit vectors, respectively
-  //
-  // We also support 128 bit vectors, going 32 bits at a time.
-  // This is slower but still faster than not vectorizing at all.
-
+  /**
+   * This method SHOULD NOT be used directly when the native dot-product is enabled, because it
+   * allocates off-heap memory on the hot code-path and copies the input byte-vectors into it.
+   * This is necessary even with the Panama APIs because native code is given a pointer to a
+   * memory address, and if the content at that address can be moved at any time (by the Java GC,
+   * in the case of heap-allocated memory) the safety of the native computation cannot be
+   * guaranteed. If we try to pass a MemorySegment backed by on-heap memory to native code we get
+   * "java.lang.UnsupportedOperationException: Not a native address".
+   *
+   * <p>Stack overflow thread:

Review Comment:
   Thanks this is helpful, but let's not permanently record the SO thread in 
the code comments
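
   As a concrete illustration of the pattern the quoted Javadoc describes (copying on-heap bytes into an Arena-allocated segment so that native code sees a stable address), here is a minimal sketch; the class and method names are illustrative, not from the PR:

   ```java
   import java.lang.foreign.Arena;
   import java.lang.foreign.MemorySegment;
   import java.lang.foreign.ValueLayout;

   class OffHeapCopySketch {
     // MemorySegment.ofArray(bytes) wraps a Java array that the GC may move, so it
     // has no native address and cannot be passed to a downcall. Copying into an
     // Arena-allocated (off-heap) segment yields a stable pointer for native code.
     static MemorySegment toOffHeap(Arena arena, byte[] bytes) {
       MemorySegment offHeap = arena.allocate(bytes.length, ValueLayout.JAVA_BYTE.byteAlignment());
       MemorySegment.copy(bytes, 0, offHeap, ValueLayout.JAVA_BYTE, 0, bytes.length);
       return offHeap;
     }
   }
   ```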



##########
lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -291,25 +296,125 @@ private float squareDistanceBody(float[] a, float[] b, int limit) {
     return res1.add(res2).reduceLanes(ADD);
   }
 
-  // Binary functions, these all follow a general pattern like this:
-  //
-  //   short intermediate = a * b;
-  //   int accumulator = (int)accumulator + (int)intermediate;
-  //
-  // 256 or 512 bit vectors can process 64 or 128 bits at a time, respectively
-  // intermediate results use 128 or 256 bit vectors, respectively
-  // final accumulator uses 256 or 512 bit vectors, respectively
-  //
-  // We also support 128 bit vectors, going 32 bits at a time.
-  // This is slower but still faster than not vectorizing at all.
-
+  /**
+   * This method SHOULD NOT be used directly when the native dot-product is enabled, because it
+   * allocates off-heap memory on the hot code-path and copies the input byte-vectors into it.
+   * This is necessary even with the Panama APIs because native code is given a pointer to a
+   * memory address, and if the content at that address can be moved at any time (by the Java GC,
+   * in the case of heap-allocated memory) the safety of the native computation cannot be
+   * guaranteed. If we try to pass a MemorySegment backed by on-heap memory to native code we get
+   * "java.lang.UnsupportedOperationException: Not a native address".
+   *
+   * <p>Stack overflow thread:
+   * https://stackoverflow.com/questions/69521289/jep-412-pass-a-on-heap-byte-array-to-native-code-getting-unsupportedoperatione
+   * explains the issue in more detail.
+   *
+   * <p>Q1. Why did we enable the native code path here if it's inefficient? A1. So that it can be
+   * exercised in <b>unit-tests</b> to test the correctness of the underlying vectorized
+   * implementation in C without directly relying on preview Panama APIs. Without this, unit tests
+   * would need to use reflection to 1) get a method handle of {@link #dotProduct(MemorySegment,
+   * MemorySegment)}, as it is not part of {@link VectorUtilSupport}, 2) get a method handle of a
+   * to-be-defined method that would copy byte[] to off-heap {@link MemorySegment}s and return
+   * them as Objects, and 3) invoke the method handle retrieved in 1) with the Objects from 2).
+   * All doable, but it adds unnecessary complexity to the unit-test suite.
+   *
+   * <p>Q2. Which method should HNSW scoring components use for computing the dot-product in
+   * native code on byte-vectors? A2. They should use {@link #dotProduct(MemorySegment,
+   * MemorySegment)} directly and control the creation and reuse of off-heap MemorySegments, as
+   * dotProduct is on the hot code path for both indexing and searching. Stored vectors residing
+   * in memory-mapped files can simply be accessed using {@link MemorySegment#asSlice(long, long,
+   * long)}, which does not allocate new memory and does not require copying vector data onto the
+   * JVM heap. For the target vector, copying to off-heap memory will still be needed, but the
+   * allocation can happen once per scorer.
+   *
+   * <p>Q3. Should JMH benchmarks measure the performance of this method? A3. No, because they
+   * would be measuring the combined cost of memory allocation and native method invocation.
+   * Instead, JMH benchmarks should measure the performance of {@link #dotProduct(MemorySegment,
+   * MemorySegment)}, which will be used directly by HNSW scoring components. This means that
+   * benchmarking code would need to rely on the reflection-based method invocation described in
+   * A1. above to avoid the use of preview Panama APIs. Adding this complexity to the benchmarking
+   * suite should be OK to get an accurate read on performance.
+   *
+   * @param a first byte vector
+   * @param b second byte vector
+   * @return int dot-product score
+   */
   @Override
   public int dotProduct(byte[] a, byte[] b) {
-    return dotProduct(MemorySegment.ofArray(a), MemorySegment.ofArray(b));
+    MemorySegment memSegA;
+    MemorySegment memSegB;
+
+    if (NATIVE_DOT_PRODUCT_ENABLED && TEST_NATIVE_DOT_PRODUCT) {
+      memSegA = Arena.ofAuto().allocate(a.length, ValueLayout.JAVA_BYTE.byteAlignment());
+      memSegB = Arena.ofAuto().allocate(b.length, ValueLayout.JAVA_BYTE.byteAlignment());
+      MemorySegment.copy(a, 0, memSegA, JAVA_BYTE, 0, a.length);
+      MemorySegment.copy(b, 0, memSegB, JAVA_BYTE, 0, b.length);
+    } else {
+      memSegA = MemorySegment.ofArray(a);
+      memSegB = MemorySegment.ofArray(b);
+    }
+    return dotProduct(memSegA, memSegB);
   }
 
+  /**
+   * For use in JMH benchmarking.
+   *
+   * @param numBytes Dimension of the byte vector
+   * @return off-heap memory segment filled with random byte values
+   */
+  public static MemorySegment offHeapByteVector(int numBytes) {
+    ThreadLocalRandom random = ThreadLocalRandom.current();
+    MemorySegment seg = Arena.ofAuto().allocate(numBytes, JAVA_BYTE.byteAlignment());
+    for (int i = 0; i < numBytes; i++) {
+      seg.setAtIndex(JAVA_BYTE, i, (byte) random.nextInt(Byte.MAX_VALUE + 1));
+    }
+    return seg;
+  }
+
+  /**
+   * Only used in JMH benchmarks. Helpful for benchmarking the performance of the compiler
+   * auto-vectorized method (NativeMethodHandles.SIMPLE_DOT_PRODUCT_IMPL) vs the manually unrolled
+   * and vectorized implementation (NativeMethodHandles.DOT_PRODUCT_IMPL).
+   *
+   * @param a MemorySegment of vector a
+   * @param b MemorySegment of vector b
+   * @return int dot-product score
+   */
+  public static int simpleNativeDotProduct(MemorySegment a, MemorySegment b) {
+    assert a.byteSize() == b.byteSize();
+    try {
+      int limit = (int) a.byteSize();
+      return (int) NativeMethodHandles.SIMPLE_DOT_PRODUCT_IMPL.invokeExact(a, b, limit);
+    } catch (Throwable ex$) {
+      throw new AssertionError("should not reach here", ex$);
+    }
+  }
+
+  /**
+   * Computes the dot-product between two byte vectors, each provided as a MemorySegment. The
+   * MemorySegments may be on-heap or off-heap. If both MemorySegments are off-heap, the
+   * computation is delegated to native code.
+   *
+   * <p>Binary functions like this one all follow a general pattern:
+   *
+   * <pre>
+   *   short intermediate = a * b;
+   *   int accumulator = (int) accumulator + (int) intermediate;
+   * </pre>
+   *
+   * <p>256 or 512 bit vectors can process 64 or 128 bits at a time, respectively; intermediate
+   * results use 128 or 256 bit vectors, respectively; the final accumulator uses 256 or 512 bit
+   * vectors, respectively. We also support 128 bit vectors, going 32 bits at a time. This is
+   * slower but still faster than not vectorizing at all.
+   *
+   * @param a MemorySegment of byte vector a
+   * @param b MemorySegment of byte vector b
+   * @return dot-product score
+   */
   public static int dotProduct(MemorySegment a, MemorySegment b) {
     assert a.byteSize() == b.byteSize();
+    // Use native dot-product implementation if both vectors are off-heap
+    if (a.isNative() && b.isNative()) {
+      try {
+        int limit = (int) a.byteSize();
+        return (int) NativeMethodHandles.DOT_PRODUCT_IMPL.invokeExact(a, b, limit);
+      } catch (Throwable ex$) {

Review Comment:
   what are we trying to catch here?
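
   Is it just that `MethodHandle.invokeExact` is declared `throws Throwable`, so the compiler forces a handler even though the C function cannot throw? If so, a comment saying that would help, e.g. (a sketch, assuming the handle's type is exactly `(MemorySegment, MemorySegment, int)int`):

   ```java
   import java.lang.foreign.MemorySegment;
   import java.lang.invoke.MethodHandle;

   final class DowncallSketch {
     static int nativeDotProduct(MethodHandle handle, MemorySegment a, MemorySegment b, int limit) {
       try {
         // invokeExact is declared "throws Throwable", so a handler is required by
         // the compiler even though a downcall to C is not expected to throw.
         return (int) handle.invokeExact(a, b, limit);
       } catch (Throwable t) {
         throw new AssertionError("native dot-product downcall failed unexpectedly", t);
       }
     }
   }
   ```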



##########
lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -291,25 +296,125 @@ private float squareDistanceBody(float[] a, float[] b, int limit) {
     return res1.add(res2).reduceLanes(ADD);
   }
 
-  // Binary functions, these all follow a general pattern like this:
-  //
-  //   short intermediate = a * b;
-  //   int accumulator = (int)accumulator + (int)intermediate;
-  //
-  // 256 or 512 bit vectors can process 64 or 128 bits at a time, respectively
-  // intermediate results use 128 or 256 bit vectors, respectively
-  // final accumulator uses 256 or 512 bit vectors, respectively
-  //
-  // We also support 128 bit vectors, going 32 bits at a time.
-  // This is slower but still faster than not vectorizing at all.
-
+  /**
+   * This method SHOULD NOT be used directly when the native dot-product is enabled, because it
+   * allocates off-heap memory on the hot code-path and copies the input byte-vectors into it.
+   * This is necessary even with the Panama APIs because native code is given a pointer to a
+   * memory address, and if the content at that address can be moved at any time (by the Java GC,
+   * in the case of heap-allocated memory) the safety of the native computation cannot be
+   * guaranteed. If we try to pass a MemorySegment backed by on-heap memory to native code we get
+   * "java.lang.UnsupportedOperationException: Not a native address".
+   *
+   * <p>Stack overflow thread:
+   * https://stackoverflow.com/questions/69521289/jep-412-pass-a-on-heap-byte-array-to-native-code-getting-unsupportedoperatione
+   * explains the issue in more detail.
+   *
+   * <p>Q1. Why did we enable the native code path here if it's inefficient? A1. So that it can be
+   * exercised in <b>unit-tests</b> to test the correctness of the underlying vectorized
+   * implementation in C without directly relying on preview Panama APIs. Without this, unit tests
+   * would need to use reflection to 1) get a method handle of {@link #dotProduct(MemorySegment,
+   * MemorySegment)}, as it is not part of {@link VectorUtilSupport}, 2) get a method handle of a
+   * to-be-defined method that would copy byte[] to off-heap {@link MemorySegment}s and return
+   * them as Objects, and 3) invoke the method handle retrieved in 1) with the Objects from 2).
+   * All doable, but it adds unnecessary complexity to the unit-test suite.
+   *
+   * <p>Q2. Which method should HNSW scoring components use for computing the dot-product in
+   * native code on byte-vectors? A2. They should use {@link #dotProduct(MemorySegment,
+   * MemorySegment)} directly and control the creation and reuse of off-heap MemorySegments, as
+   * dotProduct is on the hot code path for both indexing and searching. Stored vectors residing
+   * in memory-mapped files can simply be accessed using {@link MemorySegment#asSlice(long, long,
+   * long)}, which does not allocate new memory and does not require copying vector data onto the
+   * JVM heap. For the target vector, copying to off-heap memory will still be needed, but the
+   * allocation can happen once per scorer.
+   *
+   * <p>Q3. Should JMH benchmarks measure the performance of this method? A3. No, because they
+   * would be measuring the combined cost of memory allocation and native method invocation.
+   * Instead, JMH benchmarks should measure the performance of {@link #dotProduct(MemorySegment,
+   * MemorySegment)}, which will be used directly by HNSW scoring components. This means that
+   * benchmarking code would need to rely on the reflection-based method invocation described in
+   * A1. above to avoid the use of preview Panama APIs. Adding this complexity to the benchmarking
+   * suite should be OK to get an accurate read on performance.
+   *
+   * @param a first byte vector
+   * @param b second byte vector
+   * @return int dot-product score
+   */
   @Override
   public int dotProduct(byte[] a, byte[] b) {
-    return dotProduct(MemorySegment.ofArray(a), MemorySegment.ofArray(b));
+    MemorySegment memSegA;
+    MemorySegment memSegB;
+
+    if (NATIVE_DOT_PRODUCT_ENABLED && TEST_NATIVE_DOT_PRODUCT) {
+      memSegA = Arena.ofAuto().allocate(a.length, ValueLayout.JAVA_BYTE.byteAlignment());
+      memSegB = Arena.ofAuto().allocate(b.length, ValueLayout.JAVA_BYTE.byteAlignment());
+      MemorySegment.copy(a, 0, memSegA, JAVA_BYTE, 0, a.length);
+      MemorySegment.copy(b, 0, memSegB, JAVA_BYTE, 0, b.length);
+    } else {
+      memSegA = MemorySegment.ofArray(a);
+      memSegB = MemorySegment.ofArray(b);
+    }
+    return dotProduct(memSegA, memSegB);
   }
 
+  /**
+   * For use in JMH benchmarking.
+   *
+   * @param numBytes Dimension of the byte vector
+   * @return off-heap memory segment filled with random byte values
+   */
+  public static MemorySegment offHeapByteVector(int numBytes) {

Review Comment:
   maybe put "random" in the name so we can easily tell this is creating a 
random value
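
   Separately, the once-per-scorer reuse described in the Q2 answer above could look something like this (a sketch, all names hypothetical):

   ```java
   import java.lang.foreign.Arena;
   import java.lang.foreign.MemorySegment;
   import java.lang.foreign.ValueLayout;

   final class ByteVectorScorerSketch {
     private final Arena arena = Arena.ofConfined();
     private final MemorySegment query; // target vector, copied off-heap once per scorer
     private final MemorySegment vectorData; // e.g. a memory-mapped slice of the vectors file
     private final int dims;

     ByteVectorScorerSketch(byte[] queryVector, MemorySegment vectorData) {
       this.dims = queryVector.length;
       this.query = arena.allocate(dims, ValueLayout.JAVA_BYTE.byteAlignment());
       MemorySegment.copy(queryVector, 0, query, ValueLayout.JAVA_BYTE, 0, dims);
       this.vectorData = vectorData;
     }

     int score(int ord) {
       // asSlice allocates no memory and copies nothing; both segments stay
       // off-heap, so dotProduct can take the native path.
       MemorySegment stored = vectorData.asSlice((long) ord * dims, dims);
       return PanamaVectorUtilSupport.dotProduct(query, stored);
     }
   }
   ```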



##########
lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -291,25 +296,125 @@ private float squareDistanceBody(float[] a, float[] b, int limit) {
     return res1.add(res2).reduceLanes(ADD);
   }
 
-  // Binary functions, these all follow a general pattern like this:
-  //
-  //   short intermediate = a * b;
-  //   int accumulator = (int)accumulator + (int)intermediate;
-  //
-  // 256 or 512 bit vectors can process 64 or 128 bits at a time, respectively
-  // intermediate results use 128 or 256 bit vectors, respectively
-  // final accumulator uses 256 or 512 bit vectors, respectively
-  //
-  // We also support 128 bit vectors, going 32 bits at a time.
-  // This is slower but still faster than not vectorizing at all.
-
+  /**
+   * This method SHOULD NOT be used directly when the native dot-product is enabled, because it
+   * allocates off-heap memory on the hot code-path and copies the input byte-vectors into it.
+   * This is necessary even with the Panama APIs because native code is given a pointer to a
+   * memory address, and if the content at that address can be moved at any time (by the Java GC,
+   * in the case of heap-allocated memory) the safety of the native computation cannot be
+   * guaranteed. If we try to pass a MemorySegment backed by on-heap memory to native code we get
+   * "java.lang.UnsupportedOperationException: Not a native address".
+   *
+   * <p>Stack overflow thread:
+   * https://stackoverflow.com/questions/69521289/jep-412-pass-a-on-heap-byte-array-to-native-code-getting-unsupportedoperatione
+   * explains the issue in more detail.
+   *
+   * <p>Q1. Why did we enable the native code path here if it's inefficient? A1. So that it can be

Review Comment:
   I don't think the unit-testing use case is a good one -- unit tests should 
be testing the actual production code path. I see that the required test setup 
would be complex, but IMO it's better to have simpler production code and 
complex unit tests than complex unused production code and simple unit tests!
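
   For the record, the reflective setup that A1 describes could look like this from a test, without the test source ever referencing the preview java.lang.foreign types (a sketch; assumes the test's lookup has access to the class):

   ```java
   import java.lang.invoke.MethodHandle;
   import java.lang.invoke.MethodHandles;
   import java.lang.invoke.MethodType;

   final class NativeDotProductTestSketch {
     static int dotProductViaReflection(Object offHeapA, Object offHeapB) throws Throwable {
       // Load the foreign-memory types by name so this compiles without preview APIs.
       Class<?> segClass = Class.forName("java.lang.foreign.MemorySegment");
       Class<?> implClass =
           Class.forName("org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport");
       MethodHandle dotProduct =
           MethodHandles.lookup()
               .findStatic(
                   implClass, "dotProduct", MethodType.methodType(int.class, segClass, segClass));
       // invoke (unlike invokeExact) adapts the Object-typed arguments at call time.
       return (int) dotProduct.invoke(offHeapA, offHeapB);
     }
   }
   ```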



##########
lucene/core/src/java21/org/apache/lucene/internal/vectorization/Lucene99MemorySegmentScalarQuantizedVectorScorer.java:
##########
@@ -0,0 +1,407 @@
+/*

Review Comment:
   lmk



##########
lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -291,25 +296,125 @@ private float squareDistanceBody(float[] a, float[] b, int limit) {
     return res1.add(res2).reduceLanes(ADD);
   }
 
-  // Binary functions, these all follow a general pattern like this:
-  //
-  //   short intermediate = a * b;
-  //   int accumulator = (int)accumulator + (int)intermediate;
-  //
-  // 256 or 512 bit vectors can process 64 or 128 bits at a time, respectively
-  // intermediate results use 128 or 256 bit vectors, respectively
-  // final accumulator uses 256 or 512 bit vectors, respectively
-  //
-  // We also support 128 bit vectors, going 32 bits at a time.
-  // This is slower but still faster than not vectorizing at all.
-
+  /**
+   * This method SHOULD NOT be used directly when the native dot-product is enabled, because it
+   * allocates off-heap memory on the hot code-path and copies the input byte-vectors into it.
+   * This is necessary even with the Panama APIs because native code is given a pointer to a
+   * memory address, and if the content at that address can be moved at any time (by the Java GC,
+   * in the case of heap-allocated memory) the safety of the native computation cannot be
+   * guaranteed. If we try to pass a MemorySegment backed by on-heap memory to native code we get
+   * "java.lang.UnsupportedOperationException: Not a native address".
+   *
+   * <p>Stack overflow thread:
+   * https://stackoverflow.com/questions/69521289/jep-412-pass-a-on-heap-byte-array-to-native-code-getting-unsupportedoperatione
+   * explains the issue in more detail.
+   *
+   * <p>Q1. Why did we enable the native code path here if it's inefficient? A1. So that it can be
+   * exercised in <b>unit-tests</b> to test the correctness of the underlying vectorized
+   * implementation in C without directly relying on preview Panama APIs. Without this, unit tests
+   * would need to use reflection to 1) get a method handle of {@link #dotProduct(MemorySegment,
+   * MemorySegment)}, as it is not part of {@link VectorUtilSupport}, 2) get a method handle of a
+   * to-be-defined method that would copy byte[] to off-heap {@link MemorySegment}s and return
+   * them as Objects, and 3) invoke the method handle retrieved in 1) with the Objects from 2).
+   * All doable, but it adds unnecessary complexity to the unit-test suite.
+   *
+   * <p>Q2. Which method should HNSW scoring components use for computing the dot-product in
+   * native code on byte-vectors? A2. They should use {@link #dotProduct(MemorySegment,
+   * MemorySegment)} directly and control the creation and reuse of off-heap MemorySegments, as
+   * dotProduct is on the hot code path for both indexing and searching. Stored vectors residing
+   * in memory-mapped files can simply be accessed using {@link MemorySegment#asSlice(long, long,
+   * long)}, which does not allocate new memory and does not require copying vector data onto the
+   * JVM heap. For the target vector, copying to off-heap memory will still be needed, but the
+   * allocation can happen once per scorer.
+   *
+   * <p>Q3. Should JMH benchmarks measure the performance of this method? A3. No, because they

Review Comment:
   These comments are helpful for the PR, but I think in the actual code we 
would want to simplify and maybe say something like: do not call in production. 
Indeed we could possibly even add an `assert false: "inefficient 
implementation, do not use"` ? And in production fall back to the non-native 
impl
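
   Concretely, one possible reading of that suggestion (a sketch, not a committed change):

   ```java
   @Override
   public int dotProduct(byte[] a, byte[] b) {
     if (NATIVE_DOT_PRODUCT_ENABLED) {
       // This copying path should never be hit when the native impl is available:
       // callers should use dotProduct(MemorySegment, MemorySegment) instead.
       assert false : "inefficient implementation, do not use";
     }
     // Fall back to the non-native impl: heap-backed segments are not native, so
     // the Panama vectorized path is taken.
     return dotProduct(MemorySegment.ofArray(a), MemorySegment.ofArray(b));
   }
   ```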



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

