http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295
--- Comment #4 from Manu Evans 2013-03-05 01:55:08 UTC ---
(In reply to comment #3)
> (In reply to comment #2)
> > +1
> >
> > I'm seeing the same pattern.
> > In fact, I'm noticing a lot of my maths code seems to be performing a lot of
> > redundant moves.
>
> Some examples would be great regarding this matter, although I can already
> imagine what the code looks like. One of the problems is the auto-inc-dec pass
> (see PR 50749). A long time ago the rule of thumb for SH4 programmers was
> "read float values with post-inc addressing in your C code, and write float
> values with pre-dec addressing". This does not work anymore, since all memory
> accesses are turned into array like index based addresses internally in the
> compiler. Then the auto-inc-dec RTL pass is supposed to find post-inc and
> pre-dec addressing mode opportunities, but it fails to do so in most cases.
> I have started writing a replacement RTL pass that would try to optimize
> addressing mode selections. I hope to get it in for GCC 4.9.
>
> Anyway, if you have some example code that you can share, it would be really
> appreciated and helpful during development for testing purposes.
>
> > Are there actually any builtins/intrinsics available for the SH4?
> > How do I access the awesome vector operations without breaking out the inline
> > asm?
>
> There aren't that many HW vector ops on SH4, just fipr and ftrv. At the
> moment, there are no builtins for those, so you'd have to use inline asm
> intrinsics. Like I mentioned in comment #1, I'd rather make the compiler
> figure out opportunities from portable generic code. Although for ftrv the
> patterns might be a bit complicated, also because the compiler then has to
> manage the 2nd FPU regs bank...
>
> > It would be nice to have some intrinsics that understand vectors as sequences
> > of 4 float regs, and automate a sequential (vector) load.
>
> That would be the job of the address-mode-selection RTL pass. It would also
> improve overall code quality on SH. The fastest way to load 4 float vectors is
> to use 2x fmov.d. The compiler could also do that automatically, but this
> requires FPSCR switching, which unfortunately also needs some rework (e.g. see
> PR 53513, PR 6526).
>
> And on top of that, we also have PR 13423. It seems that the proper fix for
> this is a new reworked (vector) ABI for SH.
Well I hope you find the time for all this, the (small) sh4 community will love
you! :)
Why is a new ABI important?
> > Also, the ftrv opcode doesn't seem to be accessible either.
>
> True. I really hope that I'll find enough time to brush up SH FPU code
> generation for GCC 4.9. Until then, I'd suggest to use inline-asm style
> intrinsics.
4.9? That sounds like it could be years off... :(
I'm not sure what you mean by 'inline-asm style intrinsics'?
Last time I used inline-asm blocks in GCC, it totally broke optimisation: it
wouldn't reorder across the inline-asm blocks, and it couldn't eliminate
redundant loads/stores inside a block when the value was already resident.
Can you give me a small demonstration of what you mean?
I found whenever I touch inline-asm, the block just grows and grows in scope
upwards until my whole tight routine is written in asm... but that was some
years back, GCC3 era.
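
To make the question concrete, here is my guess at what an "inline-asm style
intrinsic" could look like: a small static inline wrapper around a single
instruction, with operand constraints so the compiler can still schedule around
it. This is only a sketch. The name dot4 is made up, the register pinning to
fr0..fr7 and the use of the "f" constraint are assumptions on my part, the SH4
branch is untested and assumes the FPU is in single-precision mode, and the
plain C fallback is only there so it builds on other targets.

```c
/* Hypothetical wrapper name; a sketch, not a confirmed GCC/SH idiom. */
static inline float dot4(const float a[4], const float b[4])
{
#if defined(__GNUC__) && defined(__SH4__)
    /* Assumption: fipr works on whole vectors, so pin the operands to
       fv0 (fr0..fr3) and fv4 (fr4..fr7) with explicit register variables. */
    register float a0 __asm__("fr0") = a[0];
    register float a1 __asm__("fr1") = a[1];
    register float a2 __asm__("fr2") = a[2];
    register float a3 __asm__("fr3") = a[3];
    register float b0 __asm__("fr4") = b[0];
    register float b1 __asm__("fr5") = b[1];
    register float b2 __asm__("fr6") = b[2];
    register float b3 __asm__("fr7") = b[3];

    /* fipr fv4,fv0 leaves the inner product in fr3, so a3 is read-write.
       No "volatile" and no "memory" clobber: the asm only touches the
       listed registers, which leaves the optimiser free to move it. */
    __asm__("fipr fv4,fv0"
            : "+f"(a3)
            : "f"(a0), "f"(a1), "f"(a2),
              "f"(b0), "f"(b1), "f"(b2), "f"(b3));
    return a3;
#else
    /* Portable fallback for non-SH4 builds. */
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

If that is roughly the idea, the part I'd want confirmed is whether the
compiler really keeps values in registers across such a wrapper instead of
spilling around it.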
I'll report examples here as I find compelling situations.
But on a tangent, can you explain this behaviour? It's really ruining my code:
float testfunc(float v, float v2)
{
    return v*v2 + v;
}
Compiled with: -O3 -mfused-madd
testfunc:
.LFB1:
        .cfi_startproc
        mov.l   .L3,r1
        lds.l   @r1+,fpscr      ; <- why does it mess with fpscr?
        add     #-4,r1
        fmov    fr5,fr0
        add     #4,r1           ; <- +4 after -4... redundant?
        fmac    fr0,fr4,fr0
        rts
        lds.l   @r1+,fpscr
.L4:
        .align 2
.L3:
        .long   __fpscr_values
        .cfi_endproc
There's a lot of rubbish in there... I expect:
testfunc:
.LFB1:
        .cfi_startproc
        fmov    fr5,fr0
        fmac    fr0,fr4,fr0
        rts
        .cfi_endproc
I'm also noticing that -ffast-math is inhibiting fmac emission in some cases:
Compiled with: -O3 -mfused-madd -ffast-math
testfunc:
.LFB1:
        .cfi_startproc
        mov.l   .L3,r1
        lds.l   @r1+,fpscr
        fldi1   fr0             ; what is a 1.0 doing here?
        add     #-4,r1
        add     #4,r1
        fadd    fr4,fr0         ; v+1 ??
        fmul    fr5,fr0         ; (v+1)*v2 ?? That's not what the code does...
        rts
        lds.l   @r1+,fpscr
What's going on there? That doesn't even look correct...
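
One reading worth checking, offered as a sketch rather than a diagnosis: going
by the first listing (fmov fr5,fr0 then fmac fr0,fr4,fr0), v appears to arrive
in fr5 and v2 in fr4. If that assumption holds, the -ffast-math listing
computes (v2 + 1) * v, which is algebraically the same as v*v2 + v, and
-ffast-math permits that kind of reassociation. The two function names below
are made up for illustration.

```c
/* What the source says: v*v2 + v. */
static float madd_form(float v, float v2)
{
    return v * v2 + v;
}

/* What the -ffast-math listing seems to compute, ASSUMING fr5 = v and
   fr4 = v2: fldi1 loads 1.0, fadd adds v2, fmul multiplies by v. */
static float factored_form(float v, float v2)
{
    return (v2 + 1.0f) * v;
}
```

The two forms can still round differently for some inputs, which would explain
why the factoring only shows up with -ffast-math. It doesn't explain why fmac
is dropped in favour of fadd + fmul, though.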
Cheers!