https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
It is mysterious.  I was looking into why gather is in some cases a win in a
micro-benchmark and a loss in a real benchmark.  Indeed, the distribution of
indices makes a difference.

If I make the indices random then the performance effect is neutral:

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c \
    -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts \
    -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; \
    objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

            454.77 msec task-clock:u                     #    0.999 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               663      page-faults:u                    #    1.458 K/sec
     1,854,500,227      cycles:u                         #    4.078 GHz
         4,788,337      stalled-cycles-frontend:u        #    0.26% frontend cycles idle
       651,597,070      instructions:u                   #    0.35  insn per cycle
                                                  #    0.01  stalled cycles per insn
        58,222,408      branches:u                       #  128.027 M/sec
            60,269      branch-misses:u                  #    0.10% of all branches

       0.455155383 seconds time elapsed

       0.455154000 seconds user
       0.000000000 seconds sys


jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c \
    -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts \
    -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; \
    objdump -d a.out | grep gather ; perf stat ./a.out
  401212:       62 f2 7d 4a 92 04 8d    vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out':

            448.84 msec task-clock:u                     #    0.999 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               663      page-faults:u                    #    1.477 K/sec
     1,834,437,666      cycles:u                         #    4.087 GHz
         4,522,424      stalled-cycles-frontend:u        #    0.25% frontend cycles idle
       160,137,040      instructions:u                   #    0.09  insn per cycle
                                                  #    0.03  stalled cycles per insn
        27,502,394      branches:u                       #   61.274 M/sec
            60,328      branch-misses:u                  #    0.22% of all branches

       0.449240415 seconds time elapsed

       0.449224000 seconds user
       0.000000000 seconds sys


If I make the stride 8 then it is a win:

#include <stdlib.h>
#define M 1024*1024
int indices[M];
T a[M], b[M];
__attribute__ ((noipa))
void
test ()
{
  for (int i = 0; i < 1024* 16; i++)
    a[i] += b[indices[i]];
}
int
main()
{
  for (int i = 0 ; i < M; i++)
    indices[i] = (i * 8)%M;
  for (int i = 0 ; i < 10000; i++)
    test ();
  return 0;
}

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c \
    -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts \
    -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; \
    objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

          5,827.78 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               222      page-faults:u                    #   38.093 /sec
    23,975,482,386      cycles:u                         #    4.114 GHz
       784,362,546      stalled-cycles-frontend:u        #    3.27% frontend cycles idle
       576,680,806      instructions:u                   #    0.02  insn per cycle
                                                  #    1.36  stalled cycles per insn
        41,523,290      branches:u                       #    7.125 M/sec
            53,461      branch-misses:u                  #    0.13% of all branches

       5.828522527 seconds time elapsed

       5.828224000 seconds user
       0.000000000 seconds sys


jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c \
    -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts \
    -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; \
    objdump -d a.out | grep gather ; perf stat ./a.out
  401252:       62 f2 7d 4a 92 04 8d    vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out':

            825.60 msec task-clock:u                     #    0.999 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               220      page-faults:u                    #  266.472 /sec
     3,398,434,808      cycles:u                         #    4.116 GHz
        76,631,825      stalled-cycles-frontend:u        #    2.25% frontend cycles idle
        85,196,162      instructions:u                   #    0.03  insn per cycle
                                                  #    0.90  stalled cycles per insn
        10,790,219      branches:u                       #   13.069 M/sec
           458,624      branch-misses:u                  #    4.25% of all branches

       0.826054290 seconds time elapsed

       0.826079000 seconds user
       0.000000000 seconds sys


If I make it fully sequential:

#include <stdlib.h>
#define M 1024*1024
int indices[M];
T a[M], b[M];
__attribute__ ((noipa))
void
test ()
{
  for (int i = 0; i < 1024* 16; i++)
    a[i] += b[indices[i]];
}
int
main()
{
  for (int i = 0 ; i < M; i++)
    indices[i] = i;
  for (int i = 0 ; i < 10000; i++)
    test ();
  return 0;
}

then it is a loss:
jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c \
    -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts \
    -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; \
    objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

             36.87 msec task-clock:u                     #    0.990 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               174      page-faults:u                    #    4.720 K/sec
       145,666,993      cycles:u                         #    3.951 GHz
           764,475      stalled-cycles-frontend:u        #    0.52% frontend cycles idle
       576,521,394      instructions:u                   #    3.96  insn per cycle
                                                  #    0.00  stalled cycles per insn
        41,508,228      branches:u                       #    1.126 G/sec
            23,851      branch-misses:u                  #    0.06% of all branches

       0.037254683 seconds time elapsed

       0.037398000 seconds user
       0.000000000 seconds sys


jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c \
    -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts \
    -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; \
    objdump -d a.out | grep gather ; perf stat ./a.out
  401252:       62 f2 7d 4a 92 04 8d    vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out':

             59.06 msec task-clock:u                     #    0.994 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               172      page-faults:u                    #    2.912 K/sec
       236,520,114      cycles:u                         #    4.005 GHz
           879,734      stalled-cycles-frontend:u        #    0.37% frontend cycles idle
        85,061,573      instructions:u                   #    0.36  insn per cycle
                                                  #    0.01  stalled cycles per insn
        10,788,320      branches:u                       #  182.674 M/sec
            24,354      branch-misses:u                  #    0.23% of all branches

       0.059442906 seconds time elapsed

       0.059527000 seconds user
       0.000000000 seconds sys

If I make it inverse sequential then it is also a loss:
jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c \
    -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts \
    -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; \
    objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

             36.84 msec task-clock:u                     #    0.985 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               158      page-faults:u                    #    4.289 K/sec
       146,386,127      cycles:u                         #    3.974 GHz
           778,318      stalled-cycles-frontend:u        #    0.53% frontend cycles idle
       576,521,379      instructions:u                   #    3.94  insn per cycle
                                                  #    0.00  stalled cycles per insn
        41,508,213      branches:u                       #    1.127 G/sec
            23,784      branch-misses:u                  #    0.06% of all branches

       0.037391272 seconds time elapsed

       0.037422000 seconds user
       0.000000000 seconds sys


jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c \
    -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts \
    -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; \
    objdump -d a.out | grep gather ; perf stat ./a.out
  401252:       62 f2 7d 4a 92 04 8d    vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out':

             59.14 msec task-clock:u                     #    0.993 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               160      page-faults:u                    #    2.705 K/sec
       236,706,166      cycles:u                         #    4.002 GHz
           921,999      stalled-cycles-frontend:u        #    0.39% frontend cycles idle
        85,061,576      instructions:u                   #    0.36  insn per cycle
                                                  #    0.01  stalled cycles per insn
        10,788,314      branches:u                       #  182.419 M/sec
            24,412      branch-misses:u                  #    0.23% of all branches

       0.059586646 seconds time elapsed

       0.059648000 seconds user
       0.000000000 seconds sys

Stride 2 is neutral; strides 4, 8 and 16 are wins.
I checked that this is reproducible.

On Zen 5, gather is a 5-10% loss on parest.  I suppose the CPU's prefetching
logic behaves quite differently when gathers are used.
