------- Comment #12 from dave at hiauly1 dot hia dot nrc dot ca  2006-09-10 18:39 -------
Subject: Re:  g++.dg/tree-ssa/ivopts-1.C fails

> The linux output doesn't have the -4 offset:
> 
>        ldi 1,%r28
>        stw %r2,-20(%r30)
> .LCFI0:
>        copy %r28,%r21
>        ldo 128(%r30),%r30
> .LCFI1:
>        ldi 4,%r19
>        stw %r28,-120(%r30)
>        ldo -120(%r30),%r26
>        ldi 16,%r20
> .L2:
>        addl %r19,%r26,%r28
>        ldo 4(%r19),%r19
>        comb,<> %r20,%r19,.L2
>        stw %r21,0(%r28)
>        bl _Z3fooR3Foo,%r2
>        nop

It seems like the code generated for the hppa-unknown-linux-gnu target
has been getting worse in each release since 4.0.0.  Here is the 4.0.0
assembler code:

        stw %r2,-20(%r30)
.LCFI0:
        ldi 1,%r19
        ldo 128(%r30),%r30
.LCFI1:
        ldo -120(%r30),%r28
        ldo -104(%r30),%r20
        stw %r19,0(%r28)
.L8:
        ldo 4(%r28),%r28
        comb,<>,n %r28,%r20,.L8
        stw %r19,0(%r28)
        bl _Z3fooR3Foo,%r2
        ldo -120(%r30),%r26

In this code, the function prologue and loop setup are only six instructions.
The loop is three instructions.  The delay slot for the call to _Z3fooR3Foo
is filled.  The value loaded into r26 is the same as loaded into r28 before
the loop.  I would say this code is close to optimal, aside from unrolling
the loop completely.
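
For readers who don't read PA-RISC assembler, here is a rough C++
rendering of the shape of the 4.0.0 loop.  It is only a sketch: the
struct Foo, the element count and the function name are assumptions
made for illustration (the testcase source isn't quoted here), and the
comments map each statement to the instruction it is meant to mirror.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the object being initialised; the testcase
// source is not quoted in this comment.
struct Foo { int z; };

// Assumed element count: (-104) - (-120) = 16 bytes = 4 ints on hppa.
const std::size_t kCount = 4;

// Pointer-bump loop in the shape of the 4.0.0 output.
void init_pointer_iv(Foo *x)
{
    assert(x);
    Foo *p = x;                 // ldo -120(%r30),%r28
    Foo *end = x + kCount;      // ldo -104(%r30),%r20
    p->z = 1;                   // stw %r19,0(%r28)  (store before .L8)
    for (;;) {
        ++p;                    // ldo 4(%r28),%r28
        if (p == end)           // comb,<>,n %r28,%r20,.L8
            break;              // not taken: delay-slot store is nullified
        p->z = 1;               // stw %r19,0(%r28)  (delay slot)
    }
}
```

The pointer itself is the induction variable, so no separate address
add is needed inside the loop.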

Here is the code generated by Debian 4.1.1-13:

        stw %r2,-20(%r30)
.LCFI0:
        ldi 0,%r19
        ldo 128(%r30),%r30
.LCFI1:
        ldi 1,%r21
        ldo -120(%r30),%r26
        ldi 16,%r20
.L2:
        addl %r19,%r26,%r28
        ldo 4(%r19),%r19
        comb,<> %r20,%r19,.L2
        stw %r21,0(%r28)
        bl _Z3fooR3Foo,%r2
        nop

The prologue and loop setup are still six instructions.  However,
the loop is now four instructions per iteration and we iterate one
more time.  Thus, we have a significant performance regression
relative to 4.0.0 in the handling of this loop.
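
Again as a rough C++ rendering, the 4.1.1 loop has the following
shape.  This is a sketch under the same assumptions as the assembler
suggests (four ints, 4-byte stride); the function and variable names
are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// 16 on hppa, matching "ldi 16,%r20" (assumes four int-sized elements).
const std::size_t kBytes = 4 * sizeof(int);

// Offset-counter loop in the shape of the 4.1.1 output.
void init_offset_iv(int *base)
{
    assert(base);
    char *bytes = reinterpret_cast<char *>(base);  // ldo -120(%r30),%r26
    std::size_t off = 0;                           // ldi 0,%r19
    do {
        // Extra add on every pass to form the store address.
        int *slot = reinterpret_cast<int *>(bytes + off);  // addl %r19,%r26,%r28
        off += sizeof(int);                        // ldo 4(%r19),%r19
        *slot = 1;                                 // stw %r21,0(%r28)  (delay slot)
    } while (off != kBytes);                       // comb,<> %r20,%r19,.L2
}
```

The induction variable is now a byte offset, so the address has to be
recomputed from the base on each pass, which is where the fourth
instruction comes from.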

Possibly, this results from the compiler trying to avoid loading the
address "-120(%r30)" twice.  However, the cost for loading small offsets
is the same as doing a register copy.  A register copy is just a ldo
instruction with an offset of 0.

The 4.2.0 code is worse than the 4.1.1 code in that the prologue and
loop setup have now grown to eight instructions.  However, we are
back to three iterations.  It seems like the compiler may be trying
to avoid non-zero offsets.  I don't know why this would be.
hppa_address_cost is pretty simple.

As far as the difference between linux and hpux goes, the main
difference is that long doubles are 64 bits on linux and 128 bits
on hpux.  This would affect the sizes of the cost arrays in expmed.c.
I'm wondering if somehow some of the values in these tables are
getting corrupted.

There's one puzzling difference between the linux and hpux tree dumps.
In the hpux dump, some variables are "long unsigned int" whereas in
the linux dump they are "unsigned int".
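
A small probe along these lines could be built on each target to
compare the two properties mentioned above.  It is only a sketch: the
function names are made up, and the guess that the differing dump
types come from size_t is an assumption, not something the dumps
establish.

```cpp
#include <climits>
#include <cstddef>
#include <type_traits>

// Storage size of long double in bits (may include padding on some targets).
int long_double_bits()
{
    return static_cast<int>(sizeof(long double) * CHAR_BIT);
}

// Underlying type of size_t, spelled the way GCC's tree dumps print it.
const char *size_t_spelling()
{
    if (std::is_same<std::size_t, unsigned int>::value)
        return "unsigned int";
    if (std::is_same<std::size_t, unsigned long>::value)
        return "long unsigned int";
    return "other";
}
```

Printing both values on linux and on hpux would show whether the
targets differ in these two respects as described.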

Dave


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27707
