https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
--- Comment #3 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Alexander Monakov from comment #2)
> Richard, though register moves are resolved by renaming, they still occupy a
> uop in all stages except execution, and since renaming is one of the
> narrowest points in the pipeline (only up to 4 uops/cycle on Intel),
> reducing the number of uops generally helps.
>
> In Michael's code the actual memory address has two operands:
>
> < vmovapd %ymm1, %ymm10
> < vmovapd %ymm1, %ymm11
> < vfnmadd213pd (%rdx,%rax), %ymm9, %ymm10
> < vfnmadd213pd (%rcx,%rax), %ymm7, %ymm11
> ---
> > vmovupd (%rdx,%rax), %ymm10
> > vmovupd (%rcx,%rax), %ymm11
> > vfnmadd231pd %ymm1, %ymm9, %ymm10
> > vfnmadd231pd %ymm1, %ymm7, %ymm11
>
> The "uop" that carries the operands of vfnmadd213pd gets "unlaminated" before
> renaming (because otherwise there would be too many operands to handle).
> Hence the original code has 4 uops after decoding and 6 uops before renaming,
> while the transformed code has 4 uops before renaming. Execution handles 4
> uops in both cases.
>
> FMA unlamination is mentioned in
> https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes

That's pretty much what I assumed. More so, the gcc variant occupies 2
reservation-station entries (2 fused uops) vs. 4 entries for the
de-transformed sequence. But reservation-station capacity is not the
bottleneck here (and is generally very rarely a bottleneck in tight FP loops),
while the rename stage is one of the bottlenecks, supposedly together with
decode and L2 read throughput; the rename limit is more severe than the other
two.

Anyway, as I mentioned in the original post, the question of *why* the clever
gcc transformation makes execution slower is of lesser interest. What matters
is the fact that it makes execution slower.

> Michael, you can probably measure it for yourself with
>
> perf stat -e
> cycles,instructions,uops_retired.all,uops_retired.retire_slots

I am playing with this code on msys2.
Strongly suspect that 'perf stat' is not going to work here. Also, repeating
myself, I am not too interested in *why* the clever code is slower than the
simple one. For me it's enough to know that it *is* slower.

[generalization mode on]
As a rule of thumb, I'd say that in inner loops [both on Skylake and on
Haswell/Broadwell] FMA3 with a memory source operand is almost never a
measurable win relative to a RISCy load+FMA3 sequence, even when the number
of uops going into the rename stage is the same. Most of the time it's a wash
and sometimes a loss. So a compiler that works overtime to produce such
complex instructions could spend its effort better elsewhere, IMHO.

Outside of inner loops it's different. There [in SSE/AVX code] most of the
time the main bottleneck is instruction fetch. So the optimizer should try to
emit the shortest sequence, as measured in bytes, and does not have to pay
much attention to anything else.
[generalization mode off]