------- Comment #12 from dave at hiauly1 dot hia dot nrc dot ca 2006-09-10 18:39 -------
Subject: Re: g++.dg/tree-ssa/ivopts-1.C fails
> The linux output doesn't have the -4 offset:
>
> ldi 1,%r28
> stw %r2,-20(%r30)
> .LCFI0:
> copy %r28,%r21
> ldo 128(%r30),%r30
> .LCFI1:
> ldi 4,%r19
> stw %r28,-120(%r30)
> ldo -120(%r30),%r26
> ldi 16,%r20
> .L2:
> addl %r19,%r26,%r28
> ldo 4(%r19),%r19
> comb,<> %r20,%r19,.L2
> stw %r21,0(%r28)
> bl _Z3fooR3Foo,%r2
> nop
It seems like the code generated for the hppa-unknown-linux-gnu target
has been getting worse in each release since 4.0.0. Here is the 4.0.0
assembler code:
stw %r2,-20(%r30)
.LCFI0:
ldi 1,%r19
ldo 128(%r30),%r30
.LCFI1:
ldo -120(%r30),%r28
ldo -104(%r30),%r20
stw %r19,0(%r28)
.L8:
ldo 4(%r28),%r28
comb,<>,n %r28,%r20,.L8
stw %r19,0(%r28)
bl _Z3fooR3Foo,%r2
ldo -120(%r30),%r26
In this code, the function prologue and loop setup are only six instructions.
The loop is three instructions. The delay slot for the call to _Z3fooR3Foo
is filled. The value loaded into r26 is the same as loaded into r28 before
the loop. I would say this code is close to optimal, aside from unrolling
the loop completely.
Here is the code generated by Debian 4.1.1-13:
stw %r2,-20(%r30)
.LCFI0:
ldi 0,%r19
ldo 128(%r30),%r30
.LCFI1:
ldi 1,%r21
ldo -120(%r30),%r26
ldi 16,%r20
.L2:
addl %r19,%r26,%r28
ldo 4(%r19),%r19
comb,<> %r20,%r19,.L2
stw %r21,0(%r28)
bl _Z3fooR3Foo,%r2
nop
The prologue and loop setup are still six instructions. However,
the loop is now four instructions per iteration and we iterate one
more time. Thus, we have a significant performance regression
relative to 4.0.0 in the handling of this loop.
Possibly, this results from the compiler trying to avoid loading the
address "-120(%r30)" twice. However, the cost for loading small offsets
is the same as doing a register copy. A register copy is just a ldo
instruction with an offset of 0.
The 4.2.0 code is worse than the 4.1.1 code in that the prologue and
loop setup have now grown to eight instructions. However, we are
back to three iterations. It seems like the compiler may be trying
to avoid non-zero offsets. I don't know why this would be.
hppa_address_cost is pretty simple.
As far as the difference between linux and hpux goes, the main
difference is that long doubles are 64 bits on linux and 128 bits
on hpux. This would affect the sizes of the cost arrays in expmed.c.
I'm wondering if somehow some of the values in these tables are
getting corrupted.
There's one puzzling difference in the linux and hpux tree dumps.
In the hpux dump, some variables are "long unsigned int" whereas in
the linux dump they are "unsigned int".
Dave
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27707