https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #3 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Alexander Monakov from comment #2)
> Richard, though register moves are resolved by renaming, they still occupy a
> uop in all stages except execution, and since renaming is one of the
> narrowest points in the pipeline (only up to 4 uops/cycle on Intel),
> reducing number of uops generally helps.
> 
> In Michael's code the memory address has two operands:
> 
> <     vmovapd %ymm1, %ymm10
> <     vmovapd %ymm1, %ymm11
> <     vfnmadd213pd    (%rdx,%rax), %ymm9, %ymm10
> <     vfnmadd213pd    (%rcx,%rax), %ymm7, %ymm11
> ---
> >     vmovupd (%rdx,%rax), %ymm10
> >     vmovupd (%rcx,%rax), %ymm11
> >     vfnmadd231pd    %ymm1, %ymm9, %ymm10
> >     vfnmadd231pd    %ymm1, %ymm7, %ymm11
> 
> The "uop" that carries operands of vfnmadd213pd gets "unlaminated" before
> renaming (because otherwise there would be too many operands to handle).
> Hence the original code has 4 uops after decoding, 6 uops before renaming,
> and the transformed code has 4 uops before renaming. Execution handles 4
> uops in both cases.
> 
> FMA unlamination is mentioned in
> https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes
> 

That's pretty much what I assumed.
Moreover, the gcc variant occupies 2 reservation-station entries (2 fused uOps)
vs 4 entries for the de-transformed sequence. But reservation-station capacity
is not a bottleneck here (and is generally very rarely a bottleneck in tight FP
loops), while the rename stage is one of the bottlenecks, supposedly together
with decode and L2 read throughput; and the rename limits are more severe than
the other two.

Anyway, as I mentioned in the original post, the question of *why* the clever
gcc transformation makes execution slower is of lesser interest. What matters
is the fact that it makes execution slower.
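For reference, here is a minimal C sketch of a loop with roughly that shape (an
illustration of the pattern only, not the code from my original post):

void fnma2(double *restrict d0, double *restrict d1,
           const double *restrict s0, const double *restrict s1,
           const double *restrict x, double f0, double f1, long n)
{
    /* Hypothetical kernel, for illustration. With -O3 -mfma the
       vectorizer keeps x[i] and the broadcast factors f0/f1 in
       registers; s0[i]/s1[i] can then end up either as memory operands
       of vfnmadd213pd (after a vmovapd copy of the shared register) or
       as separate vmovupd loads feeding vfnmadd231pd - the two
       variants compared above. */
    for (long i = 0; i < n; i++) {
        d0[i] = s0[i] - f0 * x[i];
        d1[i] = s1[i] - f1 * x[i];
    }
}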

> Michael, you can probably measure it for yourself with
> 
>    perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots

I am playing with this code on msys2, and I strongly suspect that 'perf stat'
is not going to work there. Also, repeating myself, I am not too interested in
*why* the clever code is slower than the simple one. For me it's enough to know
that it *is* slower.
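Under msys2 the kind of comparison I can do is plain timing, e.g. a sketch like
the one below (reusing the hypothetical fnma2 kernel above): it only confirms
*that* one variant is slower, with no uop counts to say *why*.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(), available in GCC/mingw-w64 */

extern void fnma2(double *restrict d0, double *restrict d1,
                  const double *restrict s0, const double *restrict s1,
                  const double *restrict x, double f0, double f1, long n);

int main(void)
{
    enum { N = 4096, REP = 200000 };
    static double d0[N], d1[N], s0[N], s1[N], x[N];
    for (long i = 0; i < N; i++)
        s0[i] = s1[i] = x[i] = 1.0 + (double)i;

    /* Take the best of many runs to reduce noise. __rdtsc counts
       reference-clock ticks, not core cycles, so compare ratios
       between the two builds rather than absolute numbers. */
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < REP; r++) {
        uint64_t t0 = __rdtsc();
        fnma2(d0, d1, s0, s1, x, 0.5, 0.25, N);
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("best: %llu ticks, %.3f ticks/iteration\n",
           (unsigned long long)best, (double)best / N);
    return 0;
}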

[generalization mode on]
As a rule of thumb, I'd say that in inner loops [both on Skylake and on
Haswell/Broadwell] FMA3 with a memory source operand is almost never a
measurable win relative to a RISCy load+FMA3 sequence, even when the number of
uOps going into the rename stage is the same. Most of the time it's a wash, and
sometimes a loss. So a compiler that works overtime to produce such complex
instructions can spend its effort better elsewhere. IMHO.
Outside of inner loops it's different. There [in SSE/AVX code] the main
bottleneck is most of the time instruction fetch. So the optimizer should try
to emit the shortest sequence, as measured in bytes, and does not have to pay
much attention to anything else.
[generalization mode off]
