On Wed, Jun 05, 2013 at 10:06:08PM +0200, Segher Boessenkool wrote:
> >I also wonder whether it would be useful to have 32-bit do the
> >vector logical
> >ops in gprs as well. At the moment, the patches don't allow it
> >(vector types
> >must be done in the altivec/vsx registers, an TImode is done by
> >splitting the
> >operation into 4 separate categories). On the 64-bit side, having
> >__int128_t
> >passed in GPRs, means you want to avoid ping-ponging between the
> >GPRs and VSX
> >registers. In addition, the atomic quad word support (patch #7)
> >has to run in
> >GPRs, so we need add/subtract/logical to have versions that run in
> >GPRs.
>
> It might work better if you added a mode V1TI for TI in vector
> regs, and then used plain TI only for GPRs. It certainly will
> make things a lot more regular; whether it actually works better,
> I have no idea.
>
> The way you have things now, only after reload the vector patterns
> are split to GPR patterns; much too late to do most optimisations
> on it. On the other hand, deciding early what register set some
> op should go to isn't too pleasant either; is it always the best
> choice to use the vector regs when possible?
It depends. For example, consider:
#ifndef TYPE
#define TYPE __int128_t
#endif
TYPE a_and (TYPE p, TYPE q) { return p & q; }
void p_and (TYPE *p, TYPE *q, TYPE *r) { *p = *q & *r; }
In a_and, p and q are passed in GPRs, so you want to use the GPR-based
instructions. In p_and, the operands come from memory, so it is simpler to do
the operation in the VSX registers.
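As a rough illustration of why the GPR form is cheap when the operands already
sit in a register pair, here is a minimal C sketch of a 128-bit AND done as two
independent 64-bit ANDs. The names u128_pair, split128, join128 and
and128_by_halves are hypothetical and exist only for this example; they are not
part of the patch. It assumes a compiler with the __int128 extension.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value held as two 64-bit halves,
   loosely mirroring how __int128_t occupies a pair of GPRs. */
typedef struct { uint64_t hi, lo; } u128_pair;

/* Split a 128-bit value into its high and low 64-bit halves. */
static u128_pair split128(unsigned __int128 v)
{
    u128_pair r = { (uint64_t)(v >> 64), (uint64_t)v };
    return r;
}

/* Reassemble a 128-bit value from its two halves. */
static unsigned __int128 join128(u128_pair p)
{
    return ((unsigned __int128)p.hi << 64) | p.lo;
}

/* 128-bit AND as two independent 64-bit ANDs: no information
   crosses between the halves, so each half needs one instruction. */
static unsigned __int128 and128_by_halves(unsigned __int128 p,
                                          unsigned __int128 q)
{
    u128_pair a = split128(p), b = split128(q), r;
    r.hi = a.hi & b.hi;
    r.lo = a.lo & b.lo;
    return join128(r);
}
```

For any p and q, and128_by_halves(p, q) should equal p & q, which is why the
GPR version of a_and needs only two and instructions.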
This is what my code from patch 4 generates:
.L.a_and:
and 3,3,5
and 4,4,6
blr
.L.p_and:
lxvd2x 12,0,4
lxvd2x 0,0,5
xxland 0,12,0
stxvd2x 0,0,3
blr
Unfortunately, when I added support for TImode in VSX registers, I didn't
notice this, and the current code generates:
.L.a_and:
addi 9,1,-16
std 3,0(9)
std 4,8(9)
ori 2,2,0
lxvd2x 12,0,9
std 5,0(9)
std 6,8(9)
ori 2,2,0
lxvd2x 0,0,9
xxland 0,12,0
stxvd2x 0,0,9
ori 2,2,0
ld 3,0(9)
ld 4,8(9)
blr
.L.p_and:
lxvd2x 12,0,4
lxvd2x 0,0,5
xxland 0,12,0
stxvd2x 0,0,3
blr
Previous versions (and -mno-vsx-timode) generate:
.L.a_and:
and 3,3,5
and 4,4,6
blr
.L.p_and:
ld 10,0(4)
ld 9,0(5)
and 9,10,9
std 9,0(3)
ld 10,8(4)
ld 9,8(5)
and 9,10,9
std 9,8(3)
blr
Note that the scheduler does not interleave the loads and the and's; instead,
it does ld/ld/and/std for each half.
This bouncing back and forth will get somewhat worse when the support for doing
__int128_t add/subtract in the vector registers is added. We don't want to
hard-wire doing all of TImode in vector registers, because this breaks the
16-byte atomic fetch_and_add functions (unless we use an UNSPEC to hide the
add).
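
To make the add/subtract case concrete: unlike the logical operations, a
128-bit add done in GPRs cannot treat the two halves independently, because
the carry out of the low half feeds the high half. The following is only a
sketch in C, not code from the patch; u128_halves and add128_by_halves are
hypothetical names used for illustration.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t hi, lo; } u128_halves;

/* 128-bit add built from 64-bit adds. The low halves are added first;
   unsigned wraparound tells us whether a carry occurred, and that
   carry is then added into the high half. */
static u128_halves add128_by_halves(u128_halves a, u128_halves b)
{
    u128_halves r;
    r.lo = a.lo + b.lo;
    uint64_t carry = (r.lo < a.lo);   /* 1 if the low add wrapped */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```

For example, adding {hi = 1, lo = 0xFFFFFFFFFFFFFFFF} and {hi = 2, lo = 1}
wraps the low half to 0 and carries into the high half, giving
{hi = 4, lo = 0}.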
--
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460, USA
email: [email protected], phone: +1 (978) 899-4797