https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91934

--- Comment #4 from Dmitrii Tochanskii <tochansky at tochlab dot net> ---
I'm not an AVX specialist, so to me this looks like loop unrolling, or maybe
very long data preparation. For example:

=========
        vmovups ymm3, YMMWORD PTR [r8+r9]
        vmovups ymm5, YMMWORD PTR [rax]
        vmovups ymm8, YMMWORD PTR [r9+32+r8]
        vfmadd132ps     ymm3, ymm5, YMMWORD PTR [rcx+r9]
        vmovups ymm5, YMMWORD PTR [rax+32]
        add     rax, 64
        vfmadd132ps     ymm8, ymm5, YMMWORD PTR [r9+32+rcx]
        vmovups YMMWORD PTR [rax-64], ymm3
        vmovups YMMWORD PTR [rax-32], ymm8
        vmovups ymm2, YMMWORD PTR [r11+r9]
        vmovups ymm7, YMMWORD PTR [r11+32+r9]
        vmovups ymm4, YMMWORD PTR [r10+32+r9]
        vmovups ymm1, YMMWORD PTR [r10+r9]
        vshufps ymm6, ymm2, ymm7, 136
        vperm2f128      ymm5, ymm6, ymm6, 3
        vshufps ymm3, ymm3, ymm8, 136
        vshufps ymm0, ymm6, ymm5, 68
        vshufps ymm5, ymm6, ymm5, 238
        vinsertf128     ymm0, ymm0, xmm5, 1
        vperm2f128      ymm5, ymm3, ymm3, 3
        vshufps ymm6, ymm3, ymm5, 68
        vshufps ymm5, ymm3, ymm5, 238
        vinsertf128     ymm6, ymm6, xmm5, 1
        vshufps ymm5, ymm1, ymm4, 136
        vperm2f128      ymm3, ymm5, ymm5, 3
        vshufps ymm8, ymm5, ymm3, 68
        vshufps ymm3, ymm5, ymm3, 238
        vinsertf128     ymm3, ymm8, xmm3, 1
        vfmadd132ps     ymm0, ymm6, ymm3
        vshufps ymm2, ymm2, ymm7, 221
        vperm2f128      ymm3, ymm2, ymm2, 3
        vshufps ymm1, ymm1, ymm4, 221
.....
=====
As far as I understand, there are far too many moves, shuffles, and so on per
actual multiply-and-add. A sketch of the kind of loop that tends to produce
this pattern is below.
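
This is a hypothetical sketch only, not the actual test case of this PR: a
fused multiply-add over interleaved (stride-2) data. To vectorize the strided
accesses, the vectorizer has to de-interleave the even and odd elements, which
is what the vshufps (immediates 136 and 221) plus vperm2f128/vinsertf128
sequences in the dump above appear to be doing, so only one vfmadd is issued
per long run of shuffles. Built e.g. with -O3 -mavx2 -mfma:

=========
/* Hypothetical sketch -- assumed shape of a loop that triggers
   shuffle-heavy code like the dump above; NOT taken from this PR.
   The a[2*i]/a[2*i+1] accesses force the vectorizer to de-interleave
   even and odd lanes before each vfmadd.  */
void
fma_strided (float *restrict dst, const float *restrict a,
             const float *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = a[2*i] * b[2*i] + a[2*i+1] * b[2*i+1];
}
=========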

I don't have GCC 8.2 at hand right now, but according to Godbolt it generates
more reasonable code.
