gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609199774



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##########
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
       in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
     }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+    System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+    // This loop gets auto-vectorized
+    for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+      longs[i] += base;
+    }
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+    for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+      longs[i] = (i + 1) * val + base;

Review comment:
       Thanks, I'll have a look at that link and see about setting up a 
different instance type.
   
   In the meantime, for what it's worth, I used `PrintAssembly` and was able to 
get the assembly for both `prefixSumOf` and `prefixSumOfOnes` on my linux box. 
Same thing I did on my MBP, but on the linux machine. In `prefixSumOfOnes`, I'm 
seeing this block, which looks an awful lot like vectorized instructions:
   ```
     0x00007f62379d6831: vmovq  %rbx,%xmm0
     0x00007f62379d6836: vpunpcklqdq %xmm0,%xmm0,%xmm0
     0x00007f62379d683a: vinserti128 $0x1,%xmm0,%ymm0,%ymm0
                                                   ;*aload_0 {reexecute=0 
rethrow=0 return_oop=0}
                                                   ; - 
jpountz.PForDeltaDecoder::prefixSumOfOnes@21 (line 23)
   
     0x00007f62379d6840: vpaddq 0x10(%r13,%r10,8),%ymm0,%ymm1
     0x00007f62379d6847: vmovdqu %ymm1,0x10(%r13,%r10,8)
     0x00007f62379d684e: vpaddq 0x30(%r13,%r10,8),%ymm0,%ymm1
     0x00007f62379d6855: vmovdqu %ymm1,0x30(%r13,%r10,8)
     0x00007f62379d685c: vpaddq 0x50(%r13,%r10,8),%ymm0,%ymm1
     0x00007f62379d6863: vmovdqu %ymm1,0x50(%r13,%r10,8)
     0x00007f62379d686a: vpaddq 0x70(%r13,%r10,8),%ymm0,%ymm1
     0x00007f62379d6871: vmovdqu %ymm1,0x70(%r13,%r10,8)
     0x00007f62379d6878: vpaddq 0x90(%r13,%r10,8),%ymm0,%ymm1
     0x00007f62379d6882: vmovdqu %ymm1,0x90(%r13,%r10,8)
     0x00007f62379d688c: vpaddq 0xb0(%r13,%r10,8),%ymm0,%ymm1
     0x00007f62379d6896: vmovdqu %ymm1,0xb0(%r13,%r10,8)
     0x00007f62379d68a0: vpaddq 0xd0(%r13,%r10,8),%ymm0,%ymm1
     0x00007f62379d68aa: vmovdqu %ymm1,0xd0(%r13,%r10,8)
     0x00007f62379d68b4: vpaddq 0xf0(%r13,%r10,8),%ymm0,%ymm1
     0x00007f62379d68be: vmovdqu %ymm1,0xf0(%r13,%r10,8)  ;*lastore 
{reexecute=0 rethrow=0 return_oop=0}
                                                   ; - 
jpountz.PForDeltaDecoder::prefixSumOfOnes@27 (line 23)
   ```
   
   Unfortunately, I'm not seeing the same thing in `prefixSumOf` (for the 
following implementation):
   ```
       private static void prefixSumOf(long val, long[] arr, long base) {
           for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
               arr[i] = IDENTITY_PLUS_ONE[i] * val + base;
           }
       }
   ```
   
   So in trying to answer the original question here of whether-or-not this 
code is getting auto-vectorized, I'm guessing it's not based on what I'm 
seeing, but I'd still like to get `perfasm` working. It's entirely possible I'm 
misinterpreting the assembly as well. I can post the full decompiled assembly 
of both methods here if interested. It sure will be nice when we can just ask 
Java to do this explicitly...
   
   Finally, I also tried rewriting this to:
   ```
       private static void prefixSumOf(long val, long[] arr, long base) {
           System.arraycopy(IDENTITY_PLUS_ONE, 0, arr, 0, ForUtil.BLOCK_SIZE);
           for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
               arr[i] *= val;
           }
           for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
               arr[i] += base;
           }
       }
   ```
   With this implementation, it looks like the second loop is getting 
auto-vectorized but it doesn't appear to me like the first loop is. I ran this 
"two loop" approach against the original "single loop" implementation in a 
microbench, and it seemed to perform slightly worse (but not by a lot).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to