Re: [PATCH, rs6000] power8 patches, patch #4 (revised), new power8 builtins

Michael Meissner Wed, 05 Jun 2013 13:25:48 -0700

On Wed, Jun 05, 2013 at 10:06:08PM +0200, Segher Boessenkool wrote:
> >I also wonder whether it would be useful to have 32-bit do the vector logical
> >ops in gprs as well.  At the moment, the patches don't allow it (vector types
> >must be done in the altivec/vsx registers, and TImode is done by splitting the
> >operation into 4 separate categories).  On the 64-bit side, having __int128_t
> >passed in GPRs, means you want to avoid ping-ponging between the GPRs and VSX
> >registers.  In addition, the atomic quad word support (patch #7) has to run in
> >GPRs, so we need add/subtract/logical to have versions that run in GPRs.
> 
> It might work better if you added a mode V1TI for TI in vector
> regs, and then used plain TI only for GPRs.  It certainly will
> make things a lot more regular; whether it actually works better,
> I have no idea.
> 
> The way you have things now, only after reload the vector patterns
> are split to GPR patterns; much too late to do most optimisations
> on it.  On the other hand, deciding early what register set some
> op should go to isn't too pleasant either; is it always the best
> choice to use the vector regs when possible?


It depends.  For example consider:

#ifndef TYPE
#define TYPE __int128_t
#endif

TYPE a_and (TYPE p, TYPE q) { return p & q; }
void p_and (TYPE *p, TYPE *q, TYPE *r) { *p = *q & *r; }

In a_and, p and q are passed in GPRs, so you want to use the GPR-based
instructions.  In p_and, the operands come from memory, so it is simpler to do
the operation in the VSX registers.
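
Whichever register class the compiler picks, the two functions have to compute
the same value.  A minimal sketch of a check for that, assuming the default
TYPE of __int128_t and a compiler that provides __uint128_t; the helpers
make128 and and_variants_agree are hypothetical names that are not part of the
patch:

```c
#include <stdint.h>

#ifndef TYPE
#define TYPE __int128_t
#endif

/* The two functions from the example above, unchanged. */
TYPE a_and (TYPE p, TYPE q) { return p & q; }
void p_and (TYPE *p, TYPE *q, TYPE *r) { *p = *q & *r; }

/* Hypothetical helper: build a 128-bit value from two 64-bit halves.
   Only meaningful when TYPE is a 128-bit integer type.  */
TYPE make128 (uint64_t hi, uint64_t lo)
{
  return (TYPE) (((__uint128_t) hi << 64) | lo);
}

/* Hypothetical check: returns 1 when the by-value and the by-pointer
   variants produce the same result for the given operands.  */
int and_variants_agree (uint64_t hi1, uint64_t lo1,
                        uint64_t hi2, uint64_t lo2)
{
  TYPE x = make128 (hi1, lo1);
  TYPE y = make128 (hi2, lo2);
  TYPE via_pointer;

  p_and (&via_pointer, &x, &y);
  return a_and (x, y) == via_pointer;
}
```

This checks results only; it says nothing about which registers are used, which
still has to be read from the generated assembly.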

This is what my code from patch 4 generates:

.L.a_and:
        and 3,3,5
        and 4,4,6
        blr

.L.p_and:
        lxvd2x 12,0,4
        lxvd2x 0,0,5
        xxland 0,12,0
        stxvd2x 0,0,3
        blr

Unfortunately, when I added support for TImode in VSX registers, I didn't
notice this, and the current code generates:

.L.a_and:
        addi 9,1,-16
        std 3,0(9)
        std 4,8(9)
        ori 2,2,0
        lxvd2x 12,0,9
        std 5,0(9)
        std 6,8(9)
        ori 2,2,0
        lxvd2x 0,0,9
        xxland 0,12,0
        stxvd2x 0,0,9
        ori 2,2,0
        ld 3,0(9)
        ld 4,8(9)
        blr

.L.p_and:
        lxvd2x 12,0,4
        lxvd2x 0,0,5
        xxland 0,12,0
        stxvd2x 0,0,3
        blr

Previous versions (and -mno-vsx-timode) generate:

.L.a_and:
        and 3,3,5
        and 4,4,6
        blr

.L.p_and:
        ld 10,0(4)
        ld 9,0(5)
        and 9,10,9
        std 9,0(3)
        ld 10,8(4)
        ld 9,8(5)
        and 9,10,9
        std 9,8(3)
        blr

Note that the scheduler does not interleave the loads and the ands; instead
it does ld/ld/and/std for each doubleword.
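
For reference, the GPR sequence corresponds to splitting the 128-bit AND into
two independent 64-bit halves.  A rough C sketch of that shape, with the
statements in the same ld/ld/and/std order as the assembly; the ti_pair type
and the p_and_split name are hypothetical and only used for illustration:

```c
#include <stdint.h>

/* Hypothetical two-doubleword view of a 128-bit value in memory:
   dw0 is at offset 0, dw1 is at offset 8.  */
typedef struct { uint64_t dw0; uint64_t dw1; } ti_pair;

/* 128-bit AND done as two 64-bit ANDs, one doubleword at a time.  */
void p_and_split (ti_pair *p, const ti_pair *q, const ti_pair *r)
{
  /* Doubleword at offset 0: ld, ld, and, std.  */
  uint64_t a = q->dw0;
  uint64_t b = r->dw0;
  p->dw0 = a & b;

  /* Doubleword at offset 8: ld, ld, and, std.  */
  a = q->dw1;
  b = r->dw1;
  p->dw1 = a & b;
}
```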

This bouncing back and forth will get somewhat worse when the support for doing
__int128_t add/subtract in the vector registers is added.  We don't want to
hard-wire doing all of TImode in vector registers, because that would break the
8-byte atomic fetch_and_add functions unless an UNSPEC is used to hide the
add.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460, USA
email: [email protected], phone: +1 (978) 899-4797
