On Tue, 1 Jul 2025 06:41:32 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

>> Ping again! Thanks in advance!
>
>> @XiaohongGong I'm a little busy at the moment, and soon going on a summer
>> vacation, so I cannot promise a full review soon. Feel free to ask someone
>> else to have a look.
>>
>> I quickly looked through your new benchmark results you published after
>> integration of #25539. There seem to still be a few cases where `Gain < 1`.
>> Especially:
>>
>> ```
>> GatherOperationsBenchmark.microShortGather512_MASK   256  thrpt   30  ops/ms  11587.465  10674.598  0.92
>> GatherOperationsBenchmark.microShortGather512_MASK  1024  thrpt   30  ops/ms   2902.731   2629.739  0.90
>> GatherOperationsBenchmark.microShortGather512_MASK  4096  thrpt   30  ops/ms    741.546    671.124  0.90
>> ```
>>
>> and
>>
>> ```
>> GatherOperationsBenchmark.microShortGather256_MASK   256  thrpt   30  ops/ms  11339.217  10951.141  0.96
>> GatherOperationsBenchmark.microShortGather256_MASK  1024  thrpt   30  ops/ms   2840.081   2718.823  0.95
>> GatherOperationsBenchmark.microShortGather256_MASK  4096  thrpt   30  ops/ms    725.334    696.343  0.96
>> ```
>>
>> and
>>
>> ```
>> GatherOperationsBenchmark.microByteGather512_MASK     64  thrpt   30  ops/ms  50588.210  48220.741  0.95
>> ```
>>
>> Do you know what happens in those cases?
>
> Thanks for your input! Yes, I spent some time analyzing these small
> regressions. They seem to come from hardware effects such as cache misses or
> code alignment. When I tried a larger loop alignment (32), the performance
> improved and the regressions went away. Since I'm not very familiar with x86
> architectures, I'm not sure of the exact cause. Any suggestions on that?
>
> @XiaohongGong Maybe someone from Intel (@jatin-bhateja @sviswa7) can help you
> with the x86-specific issues. You could always use hardware counters to
> measure cache misses. Also, if the vectors are not cache-line aligned, there
> may be split loads or stores; that can be measured with hardware counters as
> well. Maybe the benchmark needs to be improved somehow, to account for
> alignment issues.

I also tried to measure cache misses with `perf` on my x86 machine, and I
noticed that the cache-miss count increases. The generated code layout of the
test/benchmark changes with my Java-side changes, so I suspect the loop
alignment is different from before. To verify this, I ran with the VM option
`-XX:OptoLoopAlignment=32`, and the performance improved significantly compared
with the version without my change. So I think the patch itself may be
acceptable even though we see these minor regressions.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022195040
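
For readers following along, below is a minimal sketch of the kind of masked subword gather that the `microShortGather*_MASK` benchmarks exercise. It is not the actual `GatherOperationsBenchmark` source; the class, method, and array names are invented for illustration, and the Vector API is still incubating, so compiling it needs `--add-modules jdk.incubator.vector`.

```java
// Illustrative only: a masked ShortVector gather loop in the spirit of the
// benchmarks above. SIZE, src, index and maskBits are assumed names.
import jdk.incubator.vector.ShortVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class MaskedShortGatherSketch {
    static final VectorSpecies<Short> SPECIES = ShortVector.SPECIES_512;
    static final int SIZE = 1024;  // must be a multiple of SPECIES.length()

    // index[i] must be a valid index into src; maskBits selects active lanes.
    static short[] gather(short[] src, int[] index, boolean[] maskBits) {
        short[] dst = new short[SIZE];
        for (int i = 0; i < SIZE; i += SPECIES.length()) {
            VectorMask<Short> m = VectorMask.fromArray(SPECIES, maskBits, i);
            // Masked gather: inactive lanes are loaded as zero.
            ShortVector v = ShortVector.fromArray(SPECIES, src, 0, index, i, m);
            // Masked store: inactive lanes of dst are left untouched.
            v.intoArray(dst, i, m);
        }
        return dst;
    }
}
```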
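And here is a hedged sketch of one way the follow-up measurement could be scripted with JMH's Java API: rerunning the affected benchmarks with the Linux `perf`-normalized profiler to see cache misses per operation, and with `-XX:OptoLoopAlignment=32` appended to the forked JVM's arguments to test the code-alignment hypothesis. The include pattern, fork count, and module flag are assumptions, not the configuration actually used for the numbers quoted above, and the profiler requires Linux `perf` to be installed.

```java
// Hypothetical JMH driver for re-measuring the regressing gather benchmarks.
import org.openjdk.jmh.profile.LinuxPerfNormProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class GatherRegressionCheck {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                // Assumed pattern: only the masked short-gather cases above.
                .include("GatherOperationsBenchmark.microShortGather(256|512)_MASK")
                // Reports cache-misses, instructions, etc. per benchmark op.
                .addProfiler(LinuxPerfNormProfiler.class)
                // Larger loop alignment to probe the code-layout theory.
                .jvmArgsAppend("--add-modules=jdk.incubator.vector",
                               "-XX:OptoLoopAlignment=32")
                .forks(3)
                .build();
        new Runner(opts).run();
    }
}
```

Comparing the perfnorm counters from a run with and without `-XX:OptoLoopAlignment=32` should show whether the extra cache misses track the loop alignment rather than the patch itself.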