https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Because store data races are allowed with -Ofast, masked stores are not used,
so we instead get

  vect__ifc__80.24_114 = VEC_COND_EXPR <mask__58.15_104, vect__45.20_109, vect__ifc__78.23_113>;
  _ifc__80 = _58 ? _45 : _ifc__78;
  MEM <vector(8) double> [(double *)vectp_c.25_116] = vect__ifc__80.24_114;

which somehow is later turned into masked stores? In fact we expand from

  vect__43.18_107 = MEM <vector(8) double> [(double *)&a + ivtmp.75_134 * 1];
  vect__ifc__78.23_113 = MEM <vector(8) double> [(double *)&c + 8B + ivtmp.75_134 * 1];
  _97 = .COND_FMA (mask__58.15_104, vect_pretmp_36.14_102, vect_pretmp_36.14_102, vect__43.18_107, vect__ifc__78.23_113);
  MEM <vector(8) double> [(double *)&c + 8B + ivtmp.75_134 * 1] = _97;
  vect__38.29_121 = MEM <vector(8) double> [(double *)&c + ivtmp.75_134 * 1];
  vect__39.32_124 = MEM <vector(8) double> [(double *)&e + ivtmp.75_134 * 1];
  _98 = vect__35.11_99 >= { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
  _100 = .COND_FMA (_98, vect_pretmp_36.14_102, vect__39.32_124, vect__38.29_121, vect__43.18_107);
  MEM <vector(8) double> [(double *)&a + ivtmp.75_134 * 1] = _100;

The vectorizer has optimize_mask_stores (), which is supposed to replace
.MASK_STORE with

  if (mask != { 0, 0, 0, ... })
    <code depending on the mask store>

and thus optimize the mask == 0 case. But that only triggers for .MASK_STORE.

You can see this when you force .MASK_STORE via -O3 -ffast-math (without
-fallow-store-data-races); then you get this effect:

.L2:
        vcmppd  $13, %zmm1, %zmm0, %k1
        kortestb        %k1, %k1
        jne     .L33
.L3:
        addq    $64, %rax
        cmpq    $255936, %rax
        je      .L34
.L4:
        vmovapd b(%rax), %zmm0
        vmovapd d(%rax), %zmm2
        vcmppd  $1, %zmm1, %zmm0, %k1
        kortestb        %k1, %k1
        je      .L2
        vmovapd %zmm2, %zmm3
        vfmadd213pd     a(%rax), %zmm2, %zmm3
        vmovupd %zmm3, c+8(%rax){%k1}
        vcmppd  $13, %zmm1, %zmm0, %k1
        kortestb        %k1, %k1
        je      .L3
        .p2align 4
        .p2align 3
.L33:
        vmovapd c(%rax), %zmm3
        vfmadd132pd     e(%rax), %zmm3, %zmm2
        vmovapd %zmm2, a(%rax){%k1}
        addq    $64, %rax
        cmpq    $255936, %rax
        jne     .L4
.L34:
        kortestb        %k3, %k3
        jne     .L35

Maybe you can benchmark with that? Still, it shouldn't be 40 times slower,
but maybe that's the cache effect of using 4 arrays instead of 3: at 255936
bytes (about 250kB) each, 4 arrays need about 1MB while 3 fit into roughly
750kB. L1 is exactly 1MB, so we might run into aliasing issues there with
the 4 arrays.
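
For reference, the scalar loop under discussion seems to have roughly the
following shape. This is a hypothetical reconstruction from the GIMPLE and
assembly above, not the PR's actual testcase: the bound comes from
"cmpq $255936" (255936 bytes / 8 bytes per double = 31992 elements), the
d*d+a and d*e+c operands from the two .COND_FMA calls, and the conditions on
b[i] from the vcmppd predicates ($1 = LT, $13 = GE) against %zmm1, which
presumably holds 0.0 given the >= { 0.0, ... } compare in the GIMPLE:

  /* Hypothetical reconstruction -- names and conditions are inferred,
     not copied from the original testcase.  */
  #define N 31992                    /* 255936 bytes / sizeof (double) */
  double a[N], b[N], c[N + 1], d[N], e[N];

  void
  foo (void)
  {
    for (int i = 0; i < N; i++)
      {
        if (b[i] < 0.0)              /* first mask: vcmppd $1 (LT) */
          c[i + 1] = d[i] * d[i] + a[i];
        if (b[i] >= 0.0)             /* second mask: vcmppd $13 (GE) */
          a[i] = d[i] * e[i] + c[i];
      }
  }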
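
To make the -fallow-store-data-races point concrete: with store data races
allowed, if-conversion does not need a masked store, because it may write
back the old value in the inactive lanes. A minimal sketch of what the first
dump above corresponds to at the C level, using the names from the sketch
above (again an assumption, mirroring "_ifc__80 = _58 ? _45 : _ifc__78"
followed by the plain vector store):

  void
  foo_ifconverted (void)
  {
    for (int i = 0; i < N; i++)
      {
        /* The selects make both stores unconditional -- this is the data
           race -Ofast permits, and why no .MASK_STORE is generated.  */
        c[i + 1] = b[i] < 0.0 ? d[i] * d[i] + a[i] : c[i + 1];
        a[i] = b[i] >= 0.0 ? d[i] * e[i] + c[i] : a[i];
      }
  }

With -O3 -ffast-math instead (no -fallow-store-data-races), if-conversion
has to keep the stores conditional, which yields the .MASK_STORE calls that
optimize_mask_stores () can then guard with the mask != 0 test shown above.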