On Thu, Apr 16, 2026 at 12:58 PM Richard Biener <[email protected]> wrote:
>
> On Thu, 16 Apr 2026, Uros Bizjak wrote:
>
> > Hello!
> >
> > After pass_reorder_blocks, there remain some propagating opportunities
> > for late_combine.  Looking at gcc.target/i386/pr90178.c, we get a
> > trivial sequence of:
> >
> > gcc -O2 -mavx -mvzeroupper -m32:
> >
> > .L5:
> >     xorl    %ecx, %ecx
> >     ...
> >     movl    %ecx, %eax
> >     ret
> >
> > Putting another instance of pass_late_combine after
> > pass_reorder_blocks improves the assembly in a non-trivial way:
> >
> >  @@ -28,10 +28,8 @@
> >      cmpl    %edx, %ebx
> >      je    .L5
> >  .L4:
> > -    movl    %eax, %ecx
> >      cmpl    %esi, (%eax)
> >      jne    .L11
> > -    movl    %ecx, %eax
> >      popl    %ebx
> >      .cfi_remember_state
> >      .cfi_restore 3
> > @@ -44,17 +42,16 @@
> >      .p2align 3
> >  .L5:
> >      .cfi_restore_state
> > -    xorl    %ecx, %ecx
> > +    xorl    %eax, %eax
> >      popl    %ebx
> >      .cfi_restore 3
> >      .cfi_def_cfa_offset 8
> >      popl    %esi
> >      .cfi_restore 6
> >      .cfi_def_cfa_offset 4
> > -    movl    %ecx, %eax
> >      ret
> >      .cfi_endproc
> >  .LFE0:
> >      .size    find_ptr, .-find_ptr
> >
> > which looks like it is worth putting a new pass here.
> >
> > A comparison of sizes of default x86_64 linux build shows noticeable
> > code size improvement:
> >
> > $ size vmlinux-old.o vmlinux-new.o
> >   text    data     bss     dec     hex filename
> > 29432351        4932443  754228 35119022        217dfae vmlinux-old.o
> > 29415516        4932443  754228 35102187        2179deb vmlinux-new.o
> >
> > which shows a code size reduction of 16835 bytes.
> >
> > Any thoughts?
>
> Did you check other places to schedule the pass?

I was interested in exercising the opportunities exposed by the bbro
pass (as mentioned in [1]), so the natural place for the new pass is
after the bbro pass:

On x86_32, IRA zeroes %ecx, which is later copied to %eax in the
terminal basic block:

   12: NOTE_INSN_BASIC_BLOCK 3
    7: cx:SI=0
      REG_EQUAL 0
   45: pc=L36
   ...
   36: L36:
   39: NOTE_INSN_BASIC_BLOCK 7
   37: ax:SI=cx:SI
   38: use ax:SI

This sequence is reordered by the bbro pass to:

   28: L28:
   12: NOTE_INSN_BASIC_BLOCK 7
   69: {cx:SI=0;clobber flags:CC;}
      REG_UNUSED flags:CC
   71: ax:SI=cx:SI
      REG_DEAD cx:SI
   72: use ax:SI
   73: NOTE_INSN_EPILOGUE_BEG
   74: bx:SI=[sp:SI++]
      REG_CFA_ADJUST_CFA sp:SI=sp:SI+0x4
      REG_CFA_RESTORE bx:SI
   75: si:SI=[sp:SI++]
      REG_CFA_ADJUST_CFA sp:SI=sp:SI+0x4
      REG_CFA_RESTORE si:SI
   76: simple_return
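
To illustrate the kind of folding that becomes possible once insns 69
and 71 sit next to each other, here is a toy sketch in Python. It is
not how late_combine is implemented: the (dest, src) tuples, the
fold_copies name and the restriction to adjacent pairs inside one
block are all assumptions made for brevity.

```python
# Toy model only: NOT GCC's late_combine implementation.
# An instruction is a (dest, src) pair; src is a register name (str)
# or an integer constant. All names here are hypothetical.

def fold_copies(block, live_out):
    """Fold `tmp = value; dest = tmp` into `dest = value` when the two
    instructions are adjacent and tmp is not needed afterwards."""
    out = []
    i = 0
    while i < len(block):
        dest, src = block[i]
        if i + 1 < len(block):
            next_dest, next_src = block[i + 1]
            # Conservative: any later read of tmp blocks the fold.
            later_reads = {s for _, s in block[i + 2:]}
            tmp_dead = dest not in live_out and dest not in later_reads
            if next_src == dest and next_dest != dest and tmp_dead:
                out.append((next_dest, src))
                i += 2
                continue
        out.append((dest, src))
        i += 1
    return out


# Block 7 after bbro: cx = 0; ax = cx, with only ax live at the end.
after_bbro = [("cx", 0), ("ax", "cx")]
print(fold_copies(after_bbro, live_out={"ax"}))
```

With only ax live at the end of the block, the pair collapses to a
single ax = 0, which corresponds to the xorl %eax, %eax in the diff
quoted above.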

> Did you try moving the existing postreload late_combine later?

While moving the pass results in the same code for the above trivial
testcase, it regresses Linux code size considerably, by 51529 bytes of
text relative to the unmodified build:

$ size *.o
  text    data     bss     dec     hex filename
29483880        4932443  754228 35170551        218a8f7 vmlinux-moved.o
29415516        4932443  754228 35102187        2179deb vmlinux-new.o
29432351        4932443  754228 35119022        217dfae vmlinux-old.o
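
For reference, the text-size differences between the three builds can
be recomputed with a few lines of Python. The dictionary keys and the
delta helper are just labels chosen for this sketch; the numbers are
copied from the table above.

```python
# Text-segment sizes in bytes, copied from the `size` output above.
text = {"old": 29432351, "new": 29415516, "moved": 29483880}

def delta(a, b):
    """Return text size of build `a` minus text size of build `b`."""
    return text[a] - text[b]

print(delta("new", "old"))    # additional pass after bbro vs. baseline: -16835
print(delta("moved", "old"))  # moved pass vs. baseline: 51529
print(delta("moved", "new"))  # moved pass vs. additional pass: 68364
```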

The postreload late_combine pass apparently frees some registers, so
the follow-up passes can use them. This is no longer the case when the
pass is simply moved to the new location. The instructions are still
combined when the pass runs after the bbro pass, but the registers
involved are "dead", i.e. unavailable, to the optimization passes that
now precede it.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2026-April/713048.html

Uros.
