[GitHub] [lucene] gsmiller commented on pull request #12417: forutil add vectorized and scalar code

via GitHub Sat, 08 Jul 2023 13:53:39 -0700


gsmiller commented on PR #12417:
URL: https://github.com/apache/lucene/pull/12417#issuecomment-1627487242


   [Disclaimer: I'm pretty new to working with the incubator vector code, so 
please excuse wonky terminology or silly ideas]
   
   @ChrisHegarty have you experimented at all with narrower vector lanes for 
lower bpv? I poked around your bitpacking repo this morning and experimented 
with the 2bpv unpacking case. If I'm understanding correctly, you essentially 
have four 32-bit lanes involved in the shift/mask operations, so you're getting 
4x parallelism, right? (i.e., you're "generating" four output ints in each 
vectorized op?) I wonder if there's any way to use 8-bit lanes and get 16x 
parallelism in each shift/mask op. I messed about with this a little, and if I 
unpack into a byte[] instead of an int[], this produced a pretty nice 
improvement in the benchmarks[1]. What I'm not sure of is how to efficiently 
get from there into an int[] for output. Maybe that's the killer? Anyway, I 
just wanted to float this question out there as a newbie to the space who is 
curious and trying to learn more.
   
   [1] here's what I experimented with:
   ```
       private static final VectorSpecies<Byte> BYTE_SPECIES_128 = ByteVector.SPECIES_128;
       [...]
       byte[] input = [...] // input is in bytes in this case, not int[]
       byte[] byteOut = new byte[128];
       ByteVector inVec = ByteVector.fromArray(BYTE_SPECIES_128, input, 0);
       ByteVector outVec;
       int outOff = 0;
       final byte mask = (1 << 2) - 1; // low two bits: 0b11
   
       // bits 0-1 of each of the 16 input bytes
       outVec = inVec.and(mask);
       outVec.intoArray(byteOut, outOff);
   
       // bits 2-3; LSHR is a logical (unsigned) shift within each 8-bit lane
       outVec = inVec.lanewise(VectorOperators.LSHR, 2).and(mask);
       outVec.intoArray(byteOut, outOff += 16);
   
       // bits 4-5
       outVec = inVec.lanewise(VectorOperators.LSHR, 4).and(mask);
       outVec.intoArray(byteOut, outOff += 16);
   
       [... you get the idea ...]
   ```
   
   ```
   Benchmark                    Mode  Cnt   Score   Error   Units
   Benchmark.decode2SimdPack   thrpt    5  58.689 ± 0.191  ops/us
   Benchmark.decode2SimdPack2  thrpt    5  85.847 ± 1.800  ops/us
   ```
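   
   For sanity-checking the byte-lane output, a plain scalar mirror of the 
same layout might look like the sketch below. It is only an illustration: the 
class and method names (`Unpack2BpvScalar`, `unpack2`, `widen`) are made up, 
and it assumes a single 16-byte input vector producing 64 two-bit values, 
where field k of input byte i lands at `out[16 * k + i]`. The `widen` loop is 
just the most obvious way to reach an int[], not a claim about the fastest one.
   
   ```java
   final class Unpack2BpvScalar {
       // Assumed: 128-bit species with 8-bit lanes, i.e. 16 lanes per vector.
       static final int LANES = 16;
   
       // Scalar mirror of the byte-lane unpacking: the k-th 2-bit field of
       // input byte i is written to out[LANES * k + i], matching the
       // "outOff += 16" steps of the vectorized version.
       static byte[] unpack2(byte[] input) {
           final int mask = (1 << 2) - 1; // 0b11
           byte[] out = new byte[LANES * 4];
           for (int k = 0; k < 4; k++) {
               for (int i = 0; i < LANES; i++) {
                   // "& 0xFF" first, so the shift is logical on the 8-bit value
                   out[LANES * k + i] = (byte) (((input[i] & 0xFF) >>> (2 * k)) & mask);
               }
           }
           return out;
       }
   
       // Simplest possible widening to int[]; the values are 0..3, so there
       // are no sign-extension surprises.
       static int[] widen(byte[] bytes) {
           int[] ints = new int[bytes.length];
           for (int i = 0; i < bytes.length; i++) {
               ints[i] = bytes[i];
           }
           return ints;
       }
   
       public static void main(String[] args) {
           byte[] in = new byte[LANES];
           in[0] = (byte) 0xE4; // 0b11_10_01_00
           int[] out = widen(unpack2(in));
           System.out.println(out[0] + " " + out[16] + " " + out[32] + " " + out[48]);
       }
   }
   ```
   
   Comparing `unpack2(input)` against the first 64 entries of `byteOut` would 
be one way to confirm the shifts and masks line up.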
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


