https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-12
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
As far as I can see a[] is all zeros.  AOCC basically preserves the loop
control flow when if (a[i] < 0.) for all elements processed in the iteration,
likewise for if (b[i] > a[i]) but GCC if-converts this all down to combined
masking of the guarded code.

I think the testcase as-is is too artificial to be relevant.  GCC has code
to do such thing to convert masked stores, but in this case we are not
using masked stores or masked loads:

.L3:
        vmovaps a(%rax), %ymm3
        vmovaps b(%rax), %ymm4
        vmovaps c(%rax), %ymm7
        addq    $32, %rax
        vmovaps c-32(%rax), %ymm0
        vmovaps e-32(%rax), %ymm5
        vcmpps  $1, %ymm1, %ymm3, %k1
        vcmpps  $14, %ymm3, %ymm4, %k1{%k1}
        vfmadd231ps     d-32(%rax), %ymm5, %ymm0{%k1}
        vfmadd231ps     d-32(%rax), %ymm5, %ymm0
        vblendmps       %ymm0, %ymm7, %ymm0{%k1}
        vmovaps %ymm0, c-32(%rax)
        cmpq    $128000, %rax
        jne     .L3

I suspect if you do a less optimal initialization of a/b then the AOCC code
will be slower.

Note GCC applies unroll-and-jam to the loop (the outer iteration is visibly
redundant, so we are eventually doing half of the work as AOCC ;))

Confirmed for us not vectorizing control flow but if-converting.
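
For readers without the testcase at hand, the difference described above can be sketched in scalar C. This is a reconstruction and not the reporter's code: the loop body is inferred from the two conditions quoted in the comment and from the c/d/e fused multiply-add in the assembly, and the function names, the array length N and the block width VL (8 floats per 256-bit vector) are hypothetical. kernel_blockskip only models the general idea of keeping a branch around a vector-sized block; it is not a claim about the code AOCC actually emits.

```c
#include <assert.h>
#include <stddef.h>

#define N  16   /* hypothetical array length for this sketch */
#define VL 8    /* floats per 256-bit vector (assumed block width) */

/* Source-level form, reconstructed from the conditions quoted in the comment. */
static void kernel_branchy(const float *a, const float *b, float *c,
                           const float *d, const float *e, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0.0f)
            if (b[i] > a[i])
                c[i] += d[i] * e[i];
}

/* If-converted form: both conditions fold into one mask, the update is
   computed for every element, and a select (blend) keeps or discards it. */
static void kernel_ifconverted(const float *a, const float *b, float *c,
                               const float *d, const float *e, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int mask = (a[i] < 0.0f) & (b[i] > a[i]);
        float updated = c[i] + d[i] * e[i];
        c[i] = mask ? updated : c[i];
    }
}

/* Control flow kept per block: if no element of a VL-wide block passes the
   guards, the loads of d/e, the multiply-add and the store are all skipped. */
static void kernel_blockskip(const float *a, const float *b, float *c,
                             const float *d, const float *e, size_t n)
{
    size_t i = 0;
    for (; i + VL <= n; i += VL) {
        int any = 0;
        for (size_t j = 0; j < VL; j++)
            any |= (a[i + j] < 0.0f) & (b[i + j] > a[i + j]);
        if (!any)
            continue;                       /* whole block is inactive */
        for (size_t j = 0; j < VL; j++)
            if ((a[i + j] < 0.0f) & (b[i + j] > a[i + j]))
                c[i + j] += d[i + j] * e[i + j];
    }
    for (; i < n; i++)                      /* scalar tail */
        if (a[i] < 0.0f && b[i] > a[i])
            c[i] += d[i] * e[i];
}

/* Runs all three variants on the same input and asserts identical results.
   all_zero_a != 0 mimics the "a[] is all zeros" case from the comment. */
static void check_kernels_agree(int all_zero_a)
{
    float a[N], b[N], d[N], e[N], c0[N], c1[N], c2[N];
    for (size_t i = 0; i < N; i++) {
        a[i] = (!all_zero_a && i % 3 == 0) ? -1.0f : 0.0f;
        b[i] = (i % 2 == 0) ? 1.0f : -2.0f;
        d[i] = 2.0f;
        e[i] = 3.0f;
        c0[i] = c1[i] = c2[i] = (float)i;
    }
    kernel_branchy(a, b, c0, d, e, N);
    kernel_ifconverted(a, b, c1, d, e, N);
    kernel_blockskip(a, b, c2, d, e, N);
    for (size_t i = 0; i < N; i++) {
        assert(c0[i] == c1[i]);
        assert(c0[i] == c2[i]);
    }
}
```

With a[] all zeros the first guard is never true, so kernel_blockskip skips every block while kernel_ifconverted still does the full multiply-add and blend on each element. That matches the remark that a less favourable initialization of a/b would likely narrow or reverse the gap. The input values are small integers so that the results are exact whether or not the compiler contracts the multiply-add into an FMA.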