On Tue, 1 Jul 2025 06:41:32 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

>> Ping again! Thanks in advance!
>
>> @XiaohongGong I'm a little busy at the moment, and soon going on a summer
>> vacation, so I cannot promise a full review soon. Feel free to ask someone
>> else to have a look.
>>
>> I quickly looked through your new benchmark results you published after
>> integration of #25539. There seem to still be a few cases where `Gain < 1`.
>> Especially:
>>
>> ```
>> GatherOperationsBenchmark.microShortGather512_MASK   256  thrpt   30  ops/ms  11587.465  10674.598  0.92
>> GatherOperationsBenchmark.microShortGather512_MASK  1024  thrpt   30  ops/ms   2902.731   2629.739  0.90
>> GatherOperationsBenchmark.microShortGather512_MASK  4096  thrpt   30  ops/ms    741.546    671.124  0.90
>> ```
>>
>> and
>>
>> ```
>> GatherOperationsBenchmark.microShortGather256_MASK   256  thrpt   30  ops/ms  11339.217  10951.141  0.96
>> GatherOperationsBenchmark.microShortGather256_MASK  1024  thrpt   30  ops/ms   2840.081   2718.823  0.95
>> GatherOperationsBenchmark.microShortGather256_MASK  4096  thrpt   30  ops/ms    725.334    696.343  0.96
>> ```
>>
>> and
>>
>> ```
>> GatherOperationsBenchmark.microByteGather512_MASK     64  thrpt   30  ops/ms  50588.210  48220.741  0.95
>> ```
>>
>> Do you know what happens in those cases?
>
> Thanks for your input! Yes, I spent some time analyzing these small
> regressions. They seem to come from hardware effects such as cache misses or
> code alignment. When I tried a larger loop alignment (32), the performance
> improved and the regressions went away. Since I'm not very familiar with x86
> architectures, I'm not sure of the exact cause. Any suggestions on that?
>
> @XiaohongGong Maybe someone from Intel (@jatin-bhateja @sviswa7) can help you
> with the x86-specific issues. You could always use hardware counters to
> measure cache misses. Also, if the vectors are not cache-line aligned, there
> may be split loads or stores; that can be measured with hardware counters as
> well. Maybe the benchmark needs to be improved somehow, to account for
> alignment issues.

I also tried to measure cache misses with `perf` on my x86 machine, and I
noticed that the cache-miss count increases. The generated code layout of the
test/benchmark changes with my Java-side changes, so I suspect the loop
alignment is different from before. To verify this, I ran with the VM option
`-XX:OptoLoopAlignment=32`, and the performance improved significantly compared
with the version without my change. So I think the patch itself may be
acceptable even though we see these minor regressions.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022195040
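
For readers following along, below is a minimal sketch of the kind of masked subword gather that the `microShortGather*_MASK` benchmarks exercise. It is not the actual `GatherOperationsBenchmark` source; the class, method, and array names are invented for illustration, and the Vector API is still incubating, so compiling it needs `--add-modules jdk.incubator.vector`.

```java
// Illustrative only: a masked ShortVector gather loop in the spirit of the
// benchmarks above. SIZE, src, index and maskBits are assumed names.
import jdk.incubator.vector.ShortVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class MaskedShortGatherSketch {
    static final VectorSpecies<Short> SPECIES = ShortVector.SPECIES_512;
    static final int SIZE = 1024;  // must be a multiple of SPECIES.length()

    // index[i] must be a valid index into src; maskBits selects active lanes.
    static short[] gather(short[] src, int[] index, boolean[] maskBits) {
        short[] dst = new short[SIZE];
        for (int i = 0; i < SIZE; i += SPECIES.length()) {
            VectorMask<Short> m = VectorMask.fromArray(SPECIES, maskBits, i);
            // Masked gather: inactive lanes are loaded as zero.
            ShortVector v = ShortVector.fromArray(SPECIES, src, 0, index, i, m);
            // Masked store: inactive lanes of dst are left untouched.
            v.intoArray(dst, i, m);
        }
        return dst;
    }
}
```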
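And here is a hedged sketch of one way the follow-up measurement could be scripted with JMH's Java API: rerunning the affected benchmarks with the Linux `perf`-normalized profiler to see cache misses per operation, and with `-XX:OptoLoopAlignment=32` appended to the forked JVM's arguments to test the code-alignment hypothesis. The include pattern, fork count, and module flag are assumptions, not the configuration actually used for the numbers quoted above, and the profiler requires Linux `perf` to be installed.

```java
// Hypothetical JMH driver for re-measuring the regressing gather benchmarks.
import org.openjdk.jmh.profile.LinuxPerfNormProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class GatherRegressionCheck {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                // Assumed pattern: only the masked short-gather cases above.
                .include("GatherOperationsBenchmark.microShortGather(256|512)_MASK")
                // Reports cache-misses, instructions, etc. per benchmark op.
                .addProfiler(LinuxPerfNormProfiler.class)
                // Larger loop alignment to probe the code-layout theory.
                .jvmArgsAppend("--add-modules=jdk.incubator.vector",
                               "-XX:OptoLoopAlignment=32")
                .forks(3)
                .build();
        new Runner(opts).run();
    }
}
```

Comparing the perfnorm counters from a run with and without `-XX:OptoLoopAlignment=32` should show whether the extra cache misses track the loop alignment rather than the patch itself.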