[Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than gcc at zen4

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 12 Jan 2023 02:34:30 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-12
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
As far as I can see a[] is all zeros.  AOCC basically preserves the
loop control flow, branching on if (a[i] < 0.) when the condition has
the same outcome for all elements processed in the vector iteration,
and likewise for if (b[i] > a[i]).  GCC instead if-converts all of this
down to combined masking of the guarded code.
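
To make the two strategies concrete, here is a rough scalar C sketch.
The loop body is my assumption of what the s1279 kernel looks like, based
on the two conditions quoted above.  The function names, the LANES
constant and the chunked structure are purely illustrative; this is not
the code either compiler actually generates.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8 /* eight floats per 256-bit ymm register */

/* Assumed scalar shape of the kernel: a doubly guarded multiply-add. */
void s1279_scalar(size_t n, const float *a, const float *b, float *c,
                  const float *d, const float *e)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0.f)
            if (b[i] > a[i])
                c[i] += d[i] * e[i];
}

/* If-converted form: every chunk loads, computes and stores; the
   combined mask only selects which lanes keep the new value. */
void s1279_ifcvt(size_t n, const float *a, const float *b, float *c,
                 const float *d, const float *e)
{
    assert(n % LANES == 0); /* sketch only: no epilogue loop */
    for (size_t i = 0; i < n; i += LANES)
        for (size_t l = 0; l < LANES; l++) {
            int mask = (a[i + l] < 0.f) & (b[i + l] > a[i + l]);
            float updated = c[i + l] + d[i + l] * e[i + l];
            c[i + l] = mask ? updated : c[i + l];
        }
}

/* Control flow preserved per chunk: skip all remaining work when a
   guard fails in every lane of the chunk. */
void s1279_branchy(size_t n, const float *a, const float *b, float *c,
                   const float *d, const float *e)
{
    assert(n % LANES == 0); /* sketch only: no epilogue loop */
    for (size_t i = 0; i < n; i += LANES) {
        int any = 0;
        for (size_t l = 0; l < LANES; l++)
            any |= (a[i + l] < 0.f);
        if (!any)
            continue; /* outer guard false in every lane */
        any = 0;
        for (size_t l = 0; l < LANES; l++)
            any |= (a[i + l] < 0.f) & (b[i + l] > a[i + l]);
        if (!any)
            continue; /* inner guard false in every lane */
        for (size_t l = 0; l < LANES; l++)
            if ((a[i + l] < 0.f) & (b[i + l] > a[i + l]))
                c[i + l] += d[i + l] * e[i + l];
    }
}
```

With a[] all zeros the branchy variant leaves after the first test in
every chunk and never touches b, c, d or e, while the if-converted
variant always does the full load, multiply-add and store.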

I think the testcase as-is is too artificial to be relevant.  GCC
has code to do such a thing for masked stores, but in this case
we are not using masked stores or masked loads:

.L3:
        vmovaps a(%rax), %ymm3
        vmovaps b(%rax), %ymm4
        vmovaps c(%rax), %ymm7
        addq    $32, %rax
        vmovaps c-32(%rax), %ymm0
        vmovaps e-32(%rax), %ymm5
        vcmpps  $1, %ymm1, %ymm3, %k1
        vcmpps  $14, %ymm3, %ymm4, %k1{%k1}
        vfmadd231ps     d-32(%rax), %ymm5, %ymm0{%k1}
        vfmadd231ps     d-32(%rax), %ymm5, %ymm0
        vblendmps       %ymm0, %ymm7, %ymm0{%k1}
        vmovaps %ymm0, c-32(%rax)
        cmpq    $128000, %rax
        jne     .L3

I suspect if you do a less optimal initialization of a/b then the AOCC
code will be slower.

Note GCC applies unroll-and-jam to the loop (the outer iteration is
visibly redundant, so in the end we are doing half of the work AOCC does ;))

Confirmed: we are not vectorizing the control flow but if-converting it.
