https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org
           Keywords|EH                          |

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is that we fail to sink

  d_29 = {t_28, t_28, t_28, t_28};

we compute a good place in select_best_block but then since it is at
the same loop depth as the original place we apply

  /* If BEST_BB is at the same nesting level, then require it to have
     significantly lower execution frequency to avoid gratuitous movement.  */
  if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
      /* If result of comparsion is unknown, prefer EARLY_BB.
         Thus use !(...>=..) rather than (...<...)  */
      && !(best_bb->count * 100 >= early_bb->count * threshold))
    return best_bb;

and fail to sink.  I'm not exactly sure why we do the above - we probably
should when best_bb post-dominates early_bb, also if the sunk stmt possibly
(or provably) will enlarge lifetime of its uses (but that's also hard to
guess since we process sinking of the defs of the uses only afterwards).

In this case we have a single use and a single def so sinking shouldn't
make things worse.  We could also weight in spilling class of a reg here.

In our case we have the dominated block with a higher(!) count than the
dominating block which means the profile is corrupt.

With --param sink-frequency-threshold we sink the ctor and the feeding
division but still get

.L5:
        movq    (%rbx), %rax
        pxor    %xmm1, %xmm1
        leaq    0(%rbp,%rax), %rdx
        .p2align 4,,10
        .p2align 3
.L4:
        movaps  (%rsp), %xmm0
        addps   (%rax), %xmm0
        addq    $16, %rax
        movaps  %xmm0, -16(%rax)
        addps   %xmm0, %xmm1
        cmpq    %rax, %rdx
        jne     .L4
        movaps  %xmm1, %xmm0
        movhlps %xmm1, %xmm0
        addps   %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        shufps  $85, %xmm1, %xmm0
        addps   %xmm1, %xmm0
.LEHB1:
        call    _Z1gf
        addq    $8, %rbx
        cmpq    %rbx, %r12
        jne     .L5

because we (rightfully so) refuse to sink into the outer loop.

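To make the same-depth condition quoted from select_best_block easier to reason about, here is a minimal stand-alone model of just that comparison. It is a sketch, not GCC code: `Block`, `same_depth_check_selects_best` and `example_same_depth_check` are hypothetical names, plain integers stand in for GCC's profile counts (so the "unknown" comparison case is not modelled), and the counts and the threshold of 75 are illustrative values rather than numbers from this bug.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for a basic block (not GCC's basic_block).
struct Block {
    int loop_depth;       // loop nesting depth of the block
    std::uint64_t count;  // execution count; plain integer, never "unknown"
};

// Mirrors only the quoted condition: true when both blocks sit at the same
// loop depth and best_bb's count is below threshold percent of early_bb's.
// A false result only means this particular condition did not select best_bb.
bool same_depth_check_selects_best(const Block& best_bb, const Block& early_bb,
                                   std::uint64_t threshold) {
    return best_bb.loop_depth == early_bb.loop_depth
        && !(best_bb.count * 100 >= early_bb.count * threshold);
}

// Illustrative numbers only; none of them are taken from the bug report.
inline void example_same_depth_check() {
    const Block early_bb{1, 100};

    // Dominated block with a higher count than the dominating one
    // (the inconsistent-profile situation): the check rejects the sink.
    const Block hotter_best{1, 120};
    assert(!same_depth_check_selects_best(hotter_best, early_bb, 75));

    // Clearly colder candidate at the same depth: the check accepts it.
    const Block colder_best{1, 50};
    assert(same_depth_check_selects_best(colder_best, early_bb, 75));
}
```

With a count on the candidate block that is not lower than the original block's count, the left-hand side of the comparison wins for any threshold up to 100 in this model, which is consistent with the sink being rejected at the default setting.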
What we fail to do is hoist the reload out of the inner loop (I suppose
clang does exactly that).  We don't have any pass after reload that would
perform loop invariant motion.  I'm not sure how this situation is handled
in general in RA - is a post-RA pass optimizing the spill/reload placement
"globally" usually done?
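As a source-level analogy of the missed transformation, the sketch below contrasts re-reading a spilled value on every inner-loop iteration (like the `movaps (%rsp), %xmm0` inside .L4) with reading it once before the loop. It is only an illustration under assumptions: `Vec4`, `spill_slot` and all function names are hypothetical, the loop body is a scalar approximation of the vector code, and a real compiler would treat this C++ however it likes; the actual decision is made on RTL after register allocation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Models the emitted inner loop: the spilled value is re-read from its
// stack slot on every iteration.
Vec4 inner_reload_each_iteration(const Vec4& spill_slot, Vec4* data, std::size_t n) {
    Vec4 acc{};
    for (std::size_t i = 0; i < n; ++i) {
        const Vec4 d = spill_slot;           // reload inside the loop
        for (std::size_t k = 0; k < 4; ++k) {
            data[i][k] += d[k];              // addps + store back
            acc[k] += data[i][k];            // running sum
        }
    }
    return acc;
}

// Models the hoisted form: one reload before the inner loop, which needs a
// register that stays live for the whole loop.
Vec4 inner_reload_hoisted(const Vec4& spill_slot, Vec4* data, std::size_t n) {
    const Vec4 d = spill_slot;               // single reload, loop invariant
    Vec4 acc{};
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < 4; ++k) {
            data[i][k] += d[k];
            acc[k] += data[i][k];
        }
    }
    return acc;
}

// Both forms compute the same result; only the placement of the reload differs.
inline void example_hoist() {
    const Vec4 slot{1.0f, 2.0f, 3.0f, 4.0f};
    Vec4 a[2] = {Vec4{1.0f, 1.0f, 1.0f, 1.0f}, Vec4{2.0f, 2.0f, 2.0f, 2.0f}};
    Vec4 b[2] = {a[0], a[1]};
    assert(inner_reload_each_iteration(slot, a, 2) == inner_reload_hoisted(slot, b, 2));
    assert(a[0] == b[0] && a[1] == b[1]);
}
```

The value still has to live in memory across the call to `g`, since the x86-64 System V ABI treats all xmm registers as caller-saved, so the hoisted form would reload once per outer iteration rather than never.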