https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Because store data races are allowed with -Ofast, masked stores are not used,
so we instead get

  vect__ifc__80.24_114 = VEC_COND_EXPR <mask__58.15_104, vect__45.20_109, vect__ifc__78.23_113>;
  _ifc__80 = _58 ? _45 : _ifc__78;
  MEM <vector(8) double> [(double *)vectp_c.25_116] = vect__ifc__80.24_114;

which somehow is later turned into masked stores? In fact we expand from

  vect__43.18_107 = MEM <vector(8) double> [(double *)&a + ivtmp.75_134 * 1];
  vect__ifc__78.23_113 = MEM <vector(8) double> [(double *)&c + 8B + ivtmp.75_134 * 1];
  _97 = .COND_FMA (mask__58.15_104, vect_pretmp_36.14_102, vect_pretmp_36.14_102, vect__43.18_107, vect__ifc__78.23_113);
  MEM <vector(8) double> [(double *)&c + 8B + ivtmp.75_134 * 1] = _97;
  vect__38.29_121 = MEM <vector(8) double> [(double *)&c + ivtmp.75_134 * 1];
  vect__39.32_124 = MEM <vector(8) double> [(double *)&e + ivtmp.75_134 * 1];
  _98 = vect__35.11_99 >= { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
  _100 = .COND_FMA (_98, vect_pretmp_36.14_102, vect__39.32_124, vect__38.29_121, vect__43.18_107);
  MEM <vector(8) double> [(double *)&a + ivtmp.75_134 * 1] = _100;

The vectorizer has optimize_mask_stores (), which is supposed to replace
.MASK_STORE with

  if (mask != { 0, 0, 0, ... })
    <code depending on the mask store>

and thus optimize the mask == 0 case. But that only triggers for .MASK_STORE.

You can see this when you force .MASK_STORE via -O3 -ffast-math (without
-fallow-store-data-races); then you get this effect:

.L2:
        vcmppd  $13, %zmm1, %zmm0, %k1
        kortestb        %k1, %k1
        jne     .L33
.L3:
        addq    $64, %rax
        cmpq    $255936, %rax
        je      .L34
.L4:
        vmovapd b(%rax), %zmm0
        vmovapd d(%rax), %zmm2
        vcmppd  $1, %zmm1, %zmm0, %k1
        kortestb        %k1, %k1
        je      .L2
        vmovapd %zmm2, %zmm3
        vfmadd213pd     a(%rax), %zmm2, %zmm3
        vmovupd %zmm3, c+8(%rax){%k1}
        vcmppd  $13, %zmm1, %zmm0, %k1
        kortestb        %k1, %k1
        je      .L3
        .p2align 4
        .p2align 3
.L33:
        vmovapd c(%rax), %zmm3
        vfmadd132pd     e(%rax), %zmm3, %zmm2
        vmovapd %zmm2, a(%rax){%k1}
        addq    $64, %rax
        cmpq    $255936, %rax
        jne     .L4
.L34:
        kortestb        %k3, %k3
        jne     .L35

Maybe you can benchmark with that? Still, it shouldn't be 40 times slower,
but maybe that's the cache effect of using 4 arrays instead of 3: at 255936
bytes (about 250kB) each, 4 arrays need about 1MB while 3 fit into roughly
750kB. L1 is exactly 1MB, so we might run into aliasing issues there with
the 4 arrays.
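
For reference, the scalar loop under discussion seems to have roughly the
following shape. This is a hypothetical reconstruction from the GIMPLE and
assembly above, not the PR's actual testcase: the bound comes from
"cmpq $255936" (255936 bytes / 8 bytes per double = 31992 elements), the
d*d+a and d*e+c operands from the two .COND_FMA calls, and the conditions on
b[i] from the vcmppd predicates ($1 = LT, $13 = GE) against %zmm1, which
presumably holds 0.0 given the >= { 0.0, ... } compare in the GIMPLE:

  /* Hypothetical reconstruction -- names and conditions are inferred,
     not copied from the original testcase.  */
  #define N 31992                    /* 255936 bytes / sizeof (double) */
  double a[N], b[N], c[N + 1], d[N], e[N];

  void
  foo (void)
  {
    for (int i = 0; i < N; i++)
      {
        if (b[i] < 0.0)              /* first mask: vcmppd $1 (LT) */
          c[i + 1] = d[i] * d[i] + a[i];
        if (b[i] >= 0.0)             /* second mask: vcmppd $13 (GE) */
          a[i] = d[i] * e[i] + c[i];
      }
  }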
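
To make the -fallow-store-data-races point concrete: with store data races
allowed, if-conversion does not need a masked store, because it may write
back the old value in the inactive lanes. A minimal sketch of what the first
dump above corresponds to at the C level, using the names from the sketch
above (again an assumption, mirroring "_ifc__80 = _58 ? _45 : _ifc__78"
followed by the plain vector store):

  void
  foo_ifconverted (void)
  {
    for (int i = 0; i < N; i++)
      {
        /* The selects make both stores unconditional -- this is the data
           race -Ofast permits, and why no .MASK_STORE is generated.  */
        c[i + 1] = b[i] < 0.0 ? d[i] * d[i] + a[i] : c[i + 1];
        a[i] = b[i] >= 0.0 ? d[i] * e[i] + c[i] : a[i];
      }
  }

With -O3 -ffast-math instead (no -fallow-store-data-races), if-conversion
has to keep the stores conditional, which yields the .MASK_STORE calls that
optimize_mask_stores () can then guard with the mask != 0 test shown above.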