------- Comment #12 from dave at hiauly1 dot hia dot nrc dot ca 2006-09-10 18:39 -------
Subject: Re: g++.dg/tree-ssa/ivopts-1.C fails
> The linux output doesn't have the -4 offset:
>
>         ldi 1,%r28
>         stw %r2,-20(%r30)
> .LCFI0:
>         copy %r28,%r21
>         ldo 128(%r30),%r30
> .LCFI1:
>         ldi 4,%r19
>         stw %r28,-120(%r30)
>         ldo -120(%r30),%r26
>         ldi 16,%r20
> .L2:
>         addl %r19,%r26,%r28
>         ldo 4(%r19),%r19
>         comb,<> %r20,%r19,.L2
>         stw %r21,0(%r28)
>         bl _Z3fooR3Foo,%r2
>         nop

It seems like the code generated for the hppa-unknown-linux-gnu target
has been getting worse in each release since 4.0.0. Here is the 4.0.0
assembler code:

        stw %r2,-20(%r30)
.LCFI0:
        ldi 1,%r19
        ldo 128(%r30),%r30
.LCFI1:
        ldo -120(%r30),%r28
        ldo -104(%r30),%r20
        stw %r19,0(%r28)
.L8:
        ldo 4(%r28),%r28
        comb,<>,n %r28,%r20,.L8
        stw %r19,0(%r28)
        bl _Z3fooR3Foo,%r2
        ldo -120(%r30),%r26

In this code, the function prologue and loop setup are only six
instructions. The loop is three instructions. The delay slot for
the call to _Z3fooR3Foo is filled. The value loaded into r26 is the
same as the one loaded into r28 before the loop. I would say this code
is close to optimal aside from unrolling the loop completely.

Here is the code generated by Debian 4.1.1-13:

        stw %r2,-20(%r30)
.LCFI0:
        ldi 0,%r19
        ldo 128(%r30),%r30
.LCFI1:
        ldi 1,%r21
        ldo -120(%r30),%r26
        ldi 16,%r20
.L2:
        addl %r19,%r26,%r28
        ldo 4(%r19),%r19
        comb,<> %r20,%r19,.L2
        stw %r21,0(%r28)
        bl _Z3fooR3Foo,%r2
        nop

The prologue and loop setup are still six instructions. However, the
loop is now four instructions per iteration and we iterate one more
time. Thus, we have a significant performance regression relative to
4.0.0 in the handling of this loop. Possibly, this results from the
compiler trying to avoid loading the address "-120(%r30)" twice.
However, the cost for loading small offsets is the same as doing a
register copy. A register copy is just a ldo instruction with an
offset of 0.

The 4.2.0 code is worse than the 4.1.1 code in that the prologue and
loop setup have now grown to eight instructions. However, we are back
to three iterations. It seems like the compiler may be trying to avoid
non-zero offsets.
I don't know why this would be. hppa_address_cost is pretty simple.

As far as the difference between linux and hpux goes, the main
difference is that long doubles are 64 bits on linux and 128 bits on
hpux. This would affect the sizes of the cost arrays in expmed.c.
I'm wondering if somehow some of the values in these tables are getting
corrupted.

There's one puzzling difference in the linux and hpux tree dumps. In
the hpux dump, some variables are "long unsigned int" whereas in the
linux dump they are "unsigned int".

Dave

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27707
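
For a rough sense of the cost, the counts quoted in this comment can be tallied as below. This is only a sketch: the struct, function and variable names are made up for illustration, the setup and iteration counts are the ones stated above, and the four-instruction loop body for 4.2.0 is read off the quoted listing rather than stated. It counts instructions issued up to the call and ignores the call sequence, delay-slot nullification and any pipeline effects.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper type: the shape of one listing from this comment.
struct LoopShape {
    const char *release;
    int setup;       // prologue + loop setup instructions
    int body;        // instructions per loop iteration
    int iterations;  // number of times the loop body runs
};

// Counts as given in the comment (4.2.0 body size read from the listing).
const LoopShape gcc_4_0_0 = {"4.0.0", 6, 3, 3};
const LoopShape gcc_4_1_1 = {"4.1.1", 6, 4, 4};
const LoopShape gcc_4_2_0 = {"4.2.0", 8, 4, 3};

// Instructions issued before the call to foo: setup plus all iterations.
inline int dynamic_count(const LoopShape &s) {
    assert(s.setup >= 0 && s.body >= 0 && s.iterations >= 0);
    return s.setup + s.body * s.iterations;
}

// Print one line per release so the three totals can be compared.
inline void print_counts() {
    const LoopShape all[] = {gcc_4_0_0, gcc_4_1_1, gcc_4_2_0};
    for (const LoopShape &s : all)
        std::printf("%s: %d + %d*%d = %d\n", s.release, s.setup, s.body,
                    s.iterations, dynamic_count(s));
}
```

Calling print_counts() from a main() gives the three totals side by side. On this crude measure 4.2.0 comes out slightly ahead of 4.1.1 because of the lower iteration count, even though its setup is longer, and both remain well behind 4.0.0.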