https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91934
--- Comment #4 from Dmitrii Tochanskii <tochansky at tochlab dot net> ---
I'm not an AVX specialist, so I just see something that looks like loop
unrolling, or maybe very long data preparation. For example:
=========
vmovups ymm3, YMMWORD PTR [r8+r9]
vmovups ymm5, YMMWORD PTR [rax]
vmovups ymm8, YMMWORD PTR [r9+32+r8]
vfmadd132ps ymm3, ymm5, YMMWORD PTR [rcx+r9]
vmovups ymm5, YMMWORD PTR [rax+32]
add rax, 64
vfmadd132ps ymm8, ymm5, YMMWORD PTR [r9+32+rcx]
vmovups YMMWORD PTR [rax-64], ymm3
vmovups YMMWORD PTR [rax-32], ymm8
vmovups ymm2, YMMWORD PTR [r11+r9]
vmovups ymm7, YMMWORD PTR [r11+32+r9]
vmovups ymm4, YMMWORD PTR [r10+32+r9]
vmovups ymm1, YMMWORD PTR [r10+r9]
vshufps ymm6, ymm2, ymm7, 136
vperm2f128 ymm5, ymm6, ymm6, 3
vshufps ymm3, ymm3, ymm8, 136
vshufps ymm0, ymm6, ymm5, 68
vshufps ymm5, ymm6, ymm5, 238
vinsertf128 ymm0, ymm0, xmm5, 1
vperm2f128 ymm5, ymm3, ymm3, 3
vshufps ymm6, ymm3, ymm5, 68
vshufps ymm5, ymm3, ymm5, 238
vinsertf128 ymm6, ymm6, xmm5, 1
vshufps ymm5, ymm1, ymm4, 136
vperm2f128 ymm3, ymm5, ymm5, 3
vshufps ymm8, ymm5, ymm3, 68
vshufps ymm3, ymm5, ymm3, 238
vinsertf128 ymm3, ymm8, xmm3, 1
vfmadd132ps ymm0, ymm6, ymm3
vshufps ymm2, ymm2, ymm7, 221
vperm2f128 ymm3, ymm2, ymm2, 3
vshufps ymm1, ymm1, ymm4, 221
.....
=====
As far as I understand, there are too many moves, shuffles, and so on per
actual multiply-and-add. I don't have GCC 8.2 at hand, but according to
Godbolt it generates more adequate code.
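
For reference, the shufps immediates above (136 = 0x88 and 221 = 0xDD) select
the even and the odd 32-bit lanes respectively, i.e. a deinterleave of
stride-2 data, and the vperm2f128/vinsertf128 pairs are the cross-lane fixups
that shufps on ymm registers requires. A minimal C sketch of a loop shape
that typically triggers this kind of code is below; it is an illustrative
assumption, not the actual testcase attached to this PR:
=========
/* Illustrative sketch only (assumed loop shape, not the PR's testcase):
   stride-2 loads make the vectorizer deinterleave the even lanes with
   shufps/vperm2f128/vinsertf128 before each vfmadd132ps can issue. */
void fma_strided(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        c[i] += a[2 * i] * b[2 * i];   /* even lanes of a and b */
}
=====
Built with something like gcc -O3 -mavx2 -mfma, the ratio of shuffles to FMAs
in the vectorized loop can then be compared across compiler versions on
Compiler Explorer, which is presumably where the 8.2 output mentioned above
came from.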