https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582
--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
It is mysterious. I was looking into why in some cases the gather is a win in a micro-benchmark and a loss in a real benchmark. Indeed, the distribution of indices makes a difference.

If I make the indices random then the performance effect is neutral:

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

            454.77 msec task-clock:u               # 0.999 CPUs utilized
                 0      context-switches:u         # 0.000 /sec
                 0      cpu-migrations:u           # 0.000 /sec
               663      page-faults:u              # 1.458 K/sec
     1,854,500,227      cycles:u                   # 4.078 GHz
         4,788,337      stalled-cycles-frontend:u  # 0.26% frontend cycles idle
       651,597,070      instructions:u             # 0.35 insn per cycle
                                                   # 0.01 stalled cycles per insn
        58,222,408      branches:u                 # 128.027 M/sec
            60,269      branch-misses:u            # 0.10% of all branches

       0.455155383 seconds time elapsed

       0.455154000 seconds user
       0.000000000 seconds sys

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out
  401212:       62 f2 7d 4a 92 04 8d    vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out':

            448.84 msec task-clock:u               # 0.999 CPUs utilized
                 0      context-switches:u         # 0.000 /sec
                 0      cpu-migrations:u           # 0.000 /sec
               663      page-faults:u              # 1.477 K/sec
     1,834,437,666      cycles:u                   # 4.087 GHz
         4,522,424      stalled-cycles-frontend:u  # 0.25% frontend cycles idle
       160,137,040      instructions:u             # 0.09 insn per cycle
                                                   # 0.03 stalled cycles per insn
        27,502,394      branches:u                 # 61.274 M/sec
            60,328      branch-misses:u            # 0.22% of all branches

       0.449240415 seconds time elapsed

       0.449224000 seconds user
       0.000000000 seconds sys

If I make the stride 8 then it is a win:

#include <stdlib.h>
#define M 1024*1024
int indices[M];
T a[M], b[M];
__attribute__ ((noipa))
void
test ()
{
  for (int i = 0; i < 1024* 16; i++)
    a[i] += b[indices[i]];
}
int
main()
{
  /* M is not parenthesized, so the right-hand side expands to
     ((i * 8)%1024)*1024.  */
  for (int i = 0 ; i < M; i++)
    indices[i] = (i * 8)%M;
  for (int i = 0 ; i < 10000; i++)
    test ();
  return 0;
}

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

          5,827.78 msec task-clock:u               # 1.000 CPUs utilized
                 0      context-switches:u         # 0.000 /sec
                 0      cpu-migrations:u           # 0.000 /sec
               222      page-faults:u              # 38.093 /sec
    23,975,482,386      cycles:u                   # 4.114 GHz
       784,362,546      stalled-cycles-frontend:u  # 3.27% frontend cycles idle
       576,680,806      instructions:u             # 0.02 insn per cycle
                                                   # 1.36 stalled cycles per insn
        41,523,290      branches:u                 # 7.125 M/sec
            53,461      branch-misses:u            # 0.13% of all branches

       5.828522527 seconds time elapsed

       5.828224000 seconds user
       0.000000000 seconds sys

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out
  401252:       62 f2 7d 4a 92 04 8d    vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out':

            825.60 msec task-clock:u               # 0.999 CPUs utilized
                 0      context-switches:u         # 0.000 /sec
                 0      cpu-migrations:u           # 0.000 /sec
               220      page-faults:u              # 266.472 /sec
     3,398,434,808      cycles:u                   # 4.116 GHz
        76,631,825      stalled-cycles-frontend:u  # 2.25% frontend cycles idle
        85,196,162      instructions:u             # 0.03 insn per cycle
                                                   # 0.90 stalled cycles per insn
        10,790,219      branches:u                 # 13.069 M/sec
           458,624      branch-misses:u            # 4.25% of all branches

       0.826054290 seconds time elapsed

       0.826079000 seconds user
       0.000000000 seconds sys

If I make it fully sequential:

#include <stdlib.h>
#define M 1024*1024
int indices[M];
T a[M], b[M];
__attribute__ ((noipa))
void
test ()
{
  for (int i = 0; i < 1024* 16; i++)
    a[i] += b[indices[i]];
}
int
main()
{
  for (int i = 0 ; i < M; i++)
    indices[i] = i;
  for (int i = 0 ; i < 10000; i++)
    test ();
  return 0;
}

then it is a loss:

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

             36.87 msec task-clock:u               # 0.990 CPUs utilized
                 0      context-switches:u         # 0.000 /sec
                 0      cpu-migrations:u           # 0.000 /sec
               174      page-faults:u              # 4.720 K/sec
       145,666,993      cycles:u                   # 3.951 GHz
           764,475      stalled-cycles-frontend:u  # 0.52% frontend cycles idle
       576,521,394      instructions:u             # 3.96 insn per cycle
                                                   # 0.00 stalled cycles per insn
        41,508,228      branches:u                 # 1.126 G/sec
            23,851      branch-misses:u            # 0.06% of all branches

       0.037254683 seconds time elapsed

       0.037398000 seconds user
       0.000000000 seconds sys

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out
  401252:       62 f2 7d 4a 92 04 8d    vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out':

             59.06 msec task-clock:u               # 0.994 CPUs utilized
                 0      context-switches:u         # 0.000 /sec
                 0      cpu-migrations:u           # 0.000 /sec
               172      page-faults:u              # 2.912 K/sec
       236,520,114      cycles:u                   # 4.005 GHz
           879,734      stalled-cycles-frontend:u  # 0.37% frontend cycles idle
        85,061,573      instructions:u             # 0.36 insn per cycle
                                                   # 0.01 stalled cycles per insn
        10,788,320      branches:u                 # 182.674 M/sec
            24,354      branch-misses:u            # 0.23% of all branches

       0.059442906 seconds time elapsed

       0.059527000 seconds user
       0.000000000 seconds sys

If I make it inverse sequential then it is also a loss:

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

             36.84 msec task-clock:u               # 0.985 CPUs utilized
                 0      context-switches:u         # 0.000 /sec
                 0      cpu-migrations:u           # 0.000 /sec
               158      page-faults:u              # 4.289 K/sec
       146,386,127      cycles:u                   # 3.974 GHz
           778,318      stalled-cycles-frontend:u  # 0.53% frontend cycles idle
       576,521,379      instructions:u             # 3.94 insn per cycle
                                                   # 0.00 stalled cycles per insn
        41,508,213      branches:u                 # 1.127 G/sec
            23,784      branch-misses:u            # 0.06% of all branches

       0.037391272 seconds time elapsed

       0.037422000 seconds user
       0.000000000 seconds sys

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out
  401252:       62 f2 7d 4a 92 04 8d    vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out':

             59.14 msec task-clock:u               # 0.993 CPUs utilized
                 0      context-switches:u         # 0.000 /sec
                 0      cpu-migrations:u           # 0.000 /sec
               160      page-faults:u              # 2.705 K/sec
       236,706,166      cycles:u                   # 4.002 GHz
           921,999      stalled-cycles-frontend:u  # 0.39% frontend cycles idle
        85,061,576      instructions:u             # 0.36 insn per cycle
                                                   # 0.01 stalled cycles per insn
        10,788,314      branches:u                 # 182.419 M/sec
            24,412      branch-misses:u            # 0.23% of all branches

       0.059586646 seconds time elapsed

       0.059648000 seconds user
       0.000000000 seconds sys

Stride 2 is neutral; strides 4, 8 and 16 are wins. I checked that this is reproducible.

On zen5 gather is a 5-10% loss on parest. I suppose the CPU's prefetching logic behaves very differently when gathers are used.
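
For anyone wanting to retry the index distributions discussed above, the initialization loops might be sketched as follows. Only the strided and sequential loops appear in the comment; the random and inverse-sequential variants are not shown there, so the helper names (fill_stride, fill_reverse, fill_random), the use of rand (), and the exact form of the "inverse" pattern are assumptions. M is also parenthesized here, unlike the `#define M 1024*1024` in the posted testcase, where `(i * 8)%M` expands to `((i * 8)%1024)*1024`; the timings above were taken with the unparenthesized form, so this is not a drop-in reproduction of those runs.

```c
#include <stdlib.h>

/* Parenthesized so that "x % M" means x % 1048576.  */
#define M (1024 * 1024)

/* Hypothetical helper: indices[i] = (i * stride) % M.
   stride == 1 gives the fully sequential pattern.  */
static void
fill_stride (int *indices, int stride)
{
  for (int i = 0; i < M; i++)
    indices[i] = (int) (((long long) i * stride) % M);
}

/* Hypothetical helper, assumed form of the "inverse sequential" case:
   walk b[] from the last element down to the first.  */
static void
fill_reverse (int *indices)
{
  for (int i = 0; i < M; i++)
    indices[i] = M - 1 - i;
}

/* Hypothetical helper, assumed form of the "random" case.  Every index
   is in [0, M); how much of b[] is covered depends on RAND_MAX.  */
static void
fill_random (int *indices, unsigned seed)
{
  srand (seed);
  for (int i = 0; i < M; i++)
    indices[i] = rand () % M;
}
```

Calling one of these on indices[] in main () in place of the inline loop, and then running test () 10000 times as before, should be enough to compare the patterns with and without -mtune-ctrl=use_gather.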