[Bug target/40977] [4.7/4.8/4.9 regression] problem with code like this: res = ((uint64_t)resh << 32) | resl;

law at redhat dot com Thu, 06 Feb 2014 23:06:12 -0800

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40977


Jeffrey A. Law <law at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-02-07
                 CC|                            |law at redhat dot com
      Known to work|                            |
           Assignee|unassigned at gcc dot gnu.org      |law at redhat dot com
     Ever confirmed|0                           |1
      Known to fail|                            |

--- Comment #8 from Jeffrey A. Law <law at redhat dot com> ---
The current trunk looks better than gcc-4.4, but it's still not as good as
gcc-3.4.

After reload, the key insns look like this:

(insn 25 24 28 6 (set (reg:DI 0 %d0 [orig:47 D.1386 ] [47])
        (ashift:DI (zero_extend:DI (reg/v:SI 8 %a0 [orig:31 resh ] [31]))
            (const_int 32 [0x20]))) l.c:54 302 {ashldi_extsi}
     (nil))
(note 28 25 43 6 NOTE_INSN_DELETED)
(insn 43 28 44 6 (set (reg:SI 0 %d0)
        (reg:SI 0 %d0 [ D.1386 ])) l.c:57 39 {*movsi_m68k2}
     (nil))
(insn 44 43 36 6 (set (reg:SI 1 %d1 [orig:0+4 ] [0])
        (reg:SI 6 %d6 [orig:44 resl ] [44])) l.c:57 39 {*movsi_m68k2}
     (nil))

You can safely ignore insn 43, it'll get zapped because it's a NOP.

The key here is to realize that insn 25 generates two instructions, one which
sets d0, the other sets d1.  The instruction setting d1 is dead as that value
will be overwritten by the instruction generated for insn 44.  But GCC is
particularly bad at discovering and exploiting these kinds of situations.
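
To make the dead store concrete, here is a minimal C sketch of the same
computation done one 32-bit word at a time.  The struct and function names
are hypothetical and only model the %d0/%d1 register pair; this is an
illustration of the situation, not compiler code.

```c
#include <assert.h>
#include <stdint.h>

/* The source-level pattern from the bug title. */
static uint64_t combine(uint32_t resh, uint32_t resl)
{
    return ((uint64_t)resh << 32) | resl;
}

/* Hypothetical model of a 64-bit value held in a register pair
   (on m68k, %d0 holds the high word and %d1 the low word). */
struct reg_pair {
    uint32_t d0; /* high word */
    uint32_t d1; /* low word  */
};

/* Word-by-word view of the same computation. */
static struct reg_pair combine_wordwise(uint32_t resh, uint32_t resl)
{
    struct reg_pair r;
    r.d0 = resh; /* high half of the shifted value (from insn 25) */
    r.d1 = 0;    /* low half of the shifted value: a dead store */
    r.d1 = resl; /* insn 44: OR with zero is just a move, so it overwrites d1 */
    return r;
}

/* Both views must agree for any pair of inputs. */
static void check_combine(uint32_t resh, uint32_t resl)
{
    struct reg_pair r = combine_wordwise(resh, resl);
    assert((((uint64_t)r.d0 << 32) | r.d1) == combine(resh, resl));
}
```

Written this way, the `r.d1 = 0` store is obviously removable.  While both
word stores are hidden inside the single insn 25, the compiler has no easy
way to see that.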

This can be fixed by changing ashldi_extsi from a define_insn into a suitable
define_insn_and_split which will decompose the insn into its component parts. 
That gets us something like this:

(insn 49 24 50 6 (set (reg:SI 0 %d0 [ D.1386 ])
        (reg/v:SI 8 %a0 [orig:31 resh ] [31])) l.c:54 38 {*movsi_m68k}
     (nil))
(insn 50 49 28 6 (set (reg:SI 1 %d1 [orig:47 D.1386+4 ] [47])
        (const_int 0 [0])) l.c:54 36 {*movsi_const0_68040_60}
     (nil))
(note 28 50 44 6 NOTE_INSN_DELETED)
(insn 44 28 36 6 (set (reg:SI 1 %d1 [orig:0+4 ] [0])
        (reg:SI 6 %d6 [orig:44 resl ] [44])) l.c:57 39 {*movsi_m68k2}
     (nil))

Now the double-word set originally associated with insn 25 is represented by
insns 49 and 50.  And we're in a form that the DCE code can easily digest and
determine that insn 50 is dead.  This results in:

(insn 49 24 28 6 (set (reg:SI 0 %d0 [ D.1386 ])
        (reg/v:SI 8 %a0 [orig:31 resh ] [31])) l.c:54 38 {*movsi_m68k}
     (expr_list:REG_DEAD (reg/v:SI 8 %a0 [orig:31 resh ] [31])
        (nil)))
(note 28 49 44 6 NOTE_INSN_DELETED)
(insn 44 28 36 6 (set (reg:SI 1 %d1 [orig:0+4 ] [0])
        (reg:SI 6 %d6 [orig:44 resl ] [44])) l.c:57 39 {*movsi_m68k2}
     (expr_list:REG_DEAD (reg:SI 6 %d6 [orig:44 resl ] [44])
        (nil)))


Which is much better.

The final assembly code looks like:

MUL64:
        movem.l #15872,-(%sp)
        move.l 24(%sp),%a1
        move.l 28(%sp),%d5
#APP
| 47 "l.c" 1
        | Inlined umul_ppmm
        move.l  %a1,%d0
        move.l  %d5,%d1
        move.l  %d0,%d2
        swap    %d0
        move.l  %d1,%d3
        swap    %d1
        move.w  %d2,%d4
        mulu    %d3,%d4
        mulu    %d1,%d2
        mulu    %d0,%d3
        mulu    %d0,%d1
        move.l  %d4,%d0
        eor.w   %d0,%d0
        swap    %d0
        add.l   %d0,%d2
        add.l   %d3,%d2
        jcc     1f
        add.l   #65536,%d1
1:      swap    %d2
        moveq   #0,%d0
        move.w  %d2,%d0
        move.w  %d4,%d2
        move.l  %d2,%d6
        add.l   %d1,%d0
        move.l  %d0,%a0
#NO_APP
        tst.l %a1
        jlt .L6
        tst.l %d5
        jlt .L7
.L3:
        move.l %a0,%d0
        move.l %d6,%d1
        movem.l (%sp)+,#124
        rts
.L7:
        sub.l %a1,%a0
        move.l %a0,%d0
        move.l %d6,%d1
        movem.l (%sp)+,#124
        rts
.L6:
        sub.l %d5,%a0
        tst.l %d5
        jge .L3
        jra .L7


This should be as good as or better than the gcc-3.4 code, with the possible
exception of code size.  The compiler has tried to optimize the most likely
path through the function (neither argument is negative), and as a result we
have a bit of tail duplication.
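
The source file l.c is not shown in this comment.  Reading the assembly back,
the function appears to do an unsigned 32x32->64 multiply and then correct the
high word when either argument is negative.  The C below is a sketch under
that assumption; the name mul64_sketch is hypothetical, and a plain 64-bit
multiply stands in for the inlined umul_ppmm asm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reconstruction of what the MUL64 assembly computes. */
static int64_t mul64_sketch(int32_t a, int32_t b)
{
    /* Stand-in for the inlined umul_ppmm: unsigned 32x32->64 product. */
    uint64_t p = (uint64_t)(uint32_t)a * (uint32_t)b;
    uint32_t resh = (uint32_t)(p >> 32);
    uint32_t resl = (uint32_t)p;

    /* Signed correction of the high word, matching .L6 and .L7. */
    if (a < 0)
        resh -= (uint32_t)b; /* .L6: sub.l %d5,%a0 */
    if (b < 0)
        resh -= (uint32_t)a; /* .L7: sub.l %a1,%a0 */

    /* The pattern from the bug title. */
    uint64_t res = ((uint64_t)resh << 32) | resl;

    /* Conversion of out-of-range values is implementation-defined in C;
       GCC wraps modulo 2^64, which is what this sketch relies on. */
    return (int64_t)res;
}
```

When neither argument is negative both corrections are skipped, which
corresponds to the fall-through path into .L3 in the assembly.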

[Bug target/40977] [4.7/4.8/4.9 regression] problem with code like this: res = ((uint64_t)resh << 32) | resl;