https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #21 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Michael_S from comment #19)
> > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> > IVs for the two arrays actually helps on this testcase.
> 
> Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
> %ymm3, %ymm0' would be turned into 2 uops.

The difference is at which point in the pipeline: the latter goes through
renaming as one fused uop.

> Misuse of load+op is far bigger problem in this particular test case than
> sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns
> loop that can potentially run at 3 clocks per iteration into loop of 4+
> clocks per iteration.

Sorry, which assembler output does this refer to?

> But I consider it a separate issue. I reported similar issue in 97127, but
> here it is more serious. It looks to me that the issue is not soluble within
> existing gcc optimization framework. The only chance is if you accept my old
> and simple advice - within inner loops pretend that AVX is RISC, i.e.
> generate code as if load-op form of AVX instructions weren't existing.

In bug 97127, the best explanation we have so far is that we do not optimally
handle the case where the non-memory inputs of an fma are reused, so we cannot
combine a load with an fma without causing an extra register copy (PR 97127
comment 16 demonstrates what I mean). I cannot imagine such trouble arising
with more common commutative operations like mul/add, especially with the
non-destructive VEX encoding. If you hit such examples, I would suggest
reporting them as well, because their root cause might be different.

In general, load-op combining should be very helpful on x86: it reduces the
number of uops flowing through the renaming stage, which is one of the
narrowest points in the pipeline.
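
To make the unlamination point concrete, here is a minimal sketch of the two
inner-loop shapes. This is not the actual compiler output for the testcase;
registers, labels and the reduction form are made up for illustration, and it
assumes a reduction over two double arrays:

# shared index register: the fma memory operand is base+index, so on
# Skylake the load-op uop is unlaminated (split into 2 uops) before rename
.L_shared_index:
        vmovupd      (%rsi,%rax), %ymm1          # load x[i..i+3]
        vfnmadd231pd (%rdx,%rax), %ymm1, %ymm0   # acc -= x*y, indexed operand
        addq         $32, %rax
        cmpq         %rcx, %rax
        jne          .L_shared_index

# independent IVs (one pointer per array): the fma memory operand is
# base only, so the load stays micro-fused with the fma through rename,
# at the cost of one extra pointer increment
.L_independent_ivs:
        vmovupd      (%rsi), %ymm1               # load x[i..i+3]
        vfnmadd231pd (%rdx), %ymm1, %ymm0        # acc -= x*y, stays one fused uop
        addq         $32, %rsi
        addq         $32, %rdx
        cmpq         %rcx, %rdx
        jne          .L_independent_ivs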
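
On the fma/register-copy point: FMA3 is destructive, the destination is always
one of the three register sources, and a folded memory operand can never be
the overwritten operand. So when both register inputs are still live after the
fma, folding the load forces a copy, whereas a separate load gives the fma a
dead register to overwrite. A rough sketch of the kind of situation I mean
(made-up registers, not the exact code from PR 97127 comment 16):

# want r = x*y + z, where y comes from memory and both x (%ymm1) and
# z (%ymm2) are reused later, so neither may be clobbered

# load folded into the fma: the destination must be one of the live
# registers, so a copy is needed to preserve z
        vmovapd      %ymm2, %ymm4               # copy z
        vfmadd231pd  (%rdi), %ymm1, %ymm4       # ymm4 = x*y + z

# separate load: the freshly loaded y is dead after the fma, so it can
# be the overwritten operand and no extra copy is needed
        vmovapd      (%rdi), %ymm4              # y
        vfmadd213pd  %ymm2, %ymm1, %ymm4        # ymm4 = x*y + z

Both variants are 2 uops, i.e. folding the load buys nothing here because the
copy cancels the saved uop; with a plain VEX mul/add the non-destructive
3-operand form avoids the copy, which is why the fma case is special.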