https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Michael_S from comment #3)
> (In reply to Alexander Monakov from comment #2)
> > Richard, though register moves are resolved by renaming, they still occupy a
> > uop in all stages except execution, and since renaming is one of the
> > narrowest points in the pipeline (only up to 4 uops/cycle on Intel),
> > reducing number of uops generally helps.
> >
> > In Michael's the actual memory address has two operands:
> >
> > < vmovapd %ymm1, %ymm10
> > < vmovapd %ymm1, %ymm11
> > < vfnmadd213pd (%rdx,%rax), %ymm9, %ymm10
> > < vfnmadd213pd (%rcx,%rax), %ymm7, %ymm11
> > ---
> > > vmovupd (%rdx,%rax), %ymm10
> > > vmovupd (%rcx,%rax), %ymm11
> > > vfnmadd231pd %ymm1, %ymm9, %ymm10
> > > vfnmadd231pd %ymm1, %ymm7, %ymm11
> >

We can add a peephole2 pattern for this particular situation (assuming the
transformation won't hurt performance when the instructions are outside of
inner loops), but I am not sure whether GCC could handle it in a *global view*
(i.e. handle them differently inside and outside of a loop).
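
For reference, the two quoted sequences compute the same value; only the instruction mix differs (register move + FMA with a memory operand, versus load + register-only FMA). Below is a minimal scalar sketch of one lane, assuming the usual AT&T operand order for the 213 and 231 forms. The function names `fnmadd`, `seq_before` and `seq_after` are hypothetical and exist only for this illustration. It says nothing about performance, which is the actual question in this bug.

```python
from fractions import Fraction


def fnmadd(a, b, c):
    """Model one lane of a fused negated multiply-add: -(a*b) + c.

    Exact rational arithmetic followed by a single conversion to float
    mimics the single rounding of a hardware FMA (finite inputs only).
    """
    return float(Fraction(c) - Fraction(a) * Fraction(b))


def seq_before(ymm1, ymm9, mem):
    """Sequence marked '<' in the quoted diff (one lane, scalar model)."""
    # vmovapd %ymm1, %ymm10
    ymm10 = ymm1
    # vfnmadd213pd (mem), %ymm9, %ymm10  ->  ymm10 = -(ymm9 * ymm10) + mem
    ymm10 = fnmadd(ymm9, ymm10, mem)
    return ymm10


def seq_after(ymm1, ymm9, mem):
    """Sequence marked '>' in the quoted diff (one lane, scalar model)."""
    # vmovupd (mem), %ymm10
    ymm10 = mem
    # vfnmadd231pd %ymm1, %ymm9, %ymm10  ->  ymm10 = -(ymm9 * ymm1) + ymm10
    ymm10 = fnmadd(ymm9, ymm1, ymm10)
    return ymm10
```

Both functions return `-(ymm9 * ymm1) + mem`, so a peephole2 that rewrites one form into the other would preserve the result; whether it is profitable depends on where the instructions sit, as discussed above.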