https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Michael_S from comment #3)
> (In reply to Alexander Monakov from comment #2)
> > Richard, though register moves are resolved by renaming, they still occupy a
> > uop in all stages except execution, and since renaming is one of the
> > narrowest points in the pipeline (only up to 4 uops/cycle on Intel),
> > reducing number of uops generally helps.
> > 
> > In Michael's the actual memory address has two operands:
> > 
> > <   vmovapd %ymm1, %ymm10
> > <   vmovapd %ymm1, %ymm11
> > <   vfnmadd213pd    (%rdx,%rax), %ymm9, %ymm10
> > <   vfnmadd213pd    (%rcx,%rax), %ymm7, %ymm11
> > ---
> > >   vmovupd (%rdx,%rax), %ymm10
> > >   vmovupd (%rcx,%rax), %ymm11
> > >   vfnmadd231pd    %ymm1, %ymm9, %ymm10
> > >   vfnmadd231pd    %ymm1, %ymm7, %ymm11
> > 

We can add a peephole2 pattern for this particular situation (assuming the
transformation won't hurt performance when the instructions are outside of
inner loops), but I'm not sure whether GCC could handle it with a *global view*
(i.e. handle them differently inside and outside of a loop).
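
For reference, a minimal scalar sketch of why the two sequences quoted above
are interchangeable as far as the result goes. This is not GCC code and not a
peephole2 pattern; the function names are hypothetical, and plain doubles
stand in for the ymm registers. It only models the operand roles of the 213
and 231 forms:

```c
/* Form A: copy the register, then fold the load into the FMA.
 * Models:  vmovapd      %ymm1, %ymm10
 *          vfnmadd213pd (mem), %ymm9, %ymm10   ; ymm10 = -(ymm9 * ymm10) + mem
 * (hypothetical helper, for illustration only) */
static double fnmadd_copy_then_memop(double reg, double mul, const double *mem)
{
    double dst = reg;           /* register-to-register copy */
    dst = -(mul * dst) + *mem;  /* load folded into the FMA as the addend */
    return dst;
}

/* Form B: separate load, then a register-only FMA.
 * Models:  vmovupd      (mem), %ymm10
 *          vfnmadd231pd %ymm1, %ymm9, %ymm10   ; ymm10 = -(ymm9 * ymm1) + ymm10
 * (hypothetical helper, for illustration only) */
static double fnmadd_load_then_regop(double reg, double mul, const double *mem)
{
    double dst = *mem;          /* explicit load into the destination */
    dst = -(mul * reg) + dst;   /* FMA accumulates into the loaded value */
    return dst;
}
```

Both helpers compute -(mul * reg) + *mem, so a peephole2 rewrite from one form
to the other would only change which uops are issued (a register copy plus a
folded load versus a standalone load), not the value produced.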
