http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295
--- Comment #5 from Oleg Endo <olegendo at gcc dot gnu.org> 2013-03-05 12:28:22 UTC ---
(In reply to comment #4)
>
> Why is a new ABI important?

Because currently there is no way to pass something like

  struct { float x, y, z, w; };

as function arguments in registers, although the default SH ABI could allow
passing up to 3 of such vectors.  The same applies to

  typedef float v4sf __attribute__ ((vector_size (16)));

or

  std::array<float, 4>

However, code that does that will be incompatible with existing calling
conventions etc., thus a new (additional and optional) ABI.

> 4.9? That sounds like it could be years off... :(

4.8 is about to be released soon.  4.9 should follow at around the same time
next year.  Of course you can still grab the current development version and
use it anytime.

>
> I'm not sure what you mean by 'inline-asm style intrinsics'?

Something like:

  static inline void* get_gbr (void) throw ()
  {
    void* retval;
    __asm__ volatile ("stc gbr, %0" : "=r" (retval));
    return retval;
  }

> Last time I used inline-asm blocks in GCC it totally broke the optimisation.
> It wouldn't reorder across inline-asm blocks, and it couldn't eliminate any
> redundant load/stores appearing within the block in the event the value was
> already resident.
>
> Can you give me a small demonstration of what you mean?
> I found whenever I touch inline-asm, the block just grows and grows in scope
> upwards until my whole tight routine is written in asm... but that was some
> years back, GCC3 era.

Yes, there are some limits to what the compiler can do with an asm block.  It
won't analyze the contents of the asm block, only the placeholders.  Thus it
usually can't eliminate redundant loads/stores.

>
> I'll report examples here as I find compelling situations.
>
> But on a tangent, can you explain this behaviour?
> It's really ruining my code:
>
> float testfunc(float v, float v2)
> {
>   return v*v2 + v;
> }
>
> Compiled with: -O3 -mfused-madd
>
> testfunc:
> .LFB1:
>     .cfi_startproc
>     mov.l   .L3,r1       ;
>     lds.l   @r1+,fpscr   ; <- why does it mess with fpscr?
>     add     #-4,r1
>     fmov    fr5,fr0
>     add     #4,r1        ; <- +4 after -4... redundant?
>     fmac    fr0,fr4,fr0
>     rts
>     lds.l   @r1+,fpscr
> .L4:
>     .align 2
> .L3:
>     .long   __fpscr_values
>     .cfi_endproc
>
> There's a lot of rubbish in there... I expect:
>
> testfunc:
> .LFB1:
>     .cfi_startproc
>     fmov    fr5,fr0
>     fmac    fr0,fr4,fr0
>     rts
>     .cfi_endproc

The fpscr value is changed because its default setting is to operate on
double-precision float values.  This is the default configuration of the
compiler.  You can change it by using e.g. -m4-single, which will assume that
the FPSCR setting is configured for single precision at function entry/return.
The +4 -4 thing is a known problem and stems from the fact that the FPSCR
load/store insns are available only as post-inc/pre-dec.

> I'm also noticing that -ffast-math is inhibiting fmac emission in some
> cases:
>
> Compiled with: -O3 -mfused-madd -ffast-math
>
> testfunc:
> .LFB1:
>     .cfi_startproc
>     mov.l   .L3,r1
>     lds.l   @r1+,fpscr
>     fldi1   fr0          ; what is a 1.0 doing here?
>     add     #-4,r1
>     add     #4,r1
>     fadd    fr4,fr0      ; v+1 ??
>     fmul    fr5,fr0      ; (v+1)*v2 ?? That's not what the code does...
>     rts
>     lds.l   @r1+,fpscr
>
> What's going on there? That doesn't even look correct...

The transformation is legitimate, although unlucky, since using fmac would be
better in this case.  The original expression 'v*v2 + v' is converted to
'(1 + v2)*v', and that's what the code does.  Probably you compiled for little
endian and got confused by the floating-point register ordering for arguments.
It goes like

  ...
  fr5 = arg 0
  fr4 = arg 1
  fr7 = arg 2
  fr6 = arg 3
  ...

This is another reason for adding a new ABI, BTW.