On 02/12/11 12:42, Andrew Stubbs wrote:
> Hi All,
>
> I'm trying to implement DImode shifts using ARM NEON instructions. This
> wouldn't be difficult in itself, but making it play nice with the
> existing implementation is causing me problems. I'd like a few
> suggestions/pointers/comments to help me get this right, please.
>
> The existing shift mechanisms must be kept, partly because the NEON unit
> is optional, and partly because it does not permit the full range of
> DImode operations, so sometimes it's more efficient to do 64-bit
> operations in core-registers, rather than copy all the values over to
> NEON, do the operation, and move the result back. Which set of patterns
> are used is determined by the register allocator and its costs mechanism.
>
> The late decision means that the patterns may only use the post-reload
> splitter, and so cannot rely on many of the usual passes to sort out
> inefficiencies. In particular, the lack of combine makes it hard to
> detect and optimize extend-and-copy sequences.
>
> So, I've attached two patches. The first is neon-shifts.patch, and does
> most of the work. The second is extendsidi2_neon.patch, and is intended
> to aid moving the shift amount from SImode registers, but doesn't go as
> far as I'd like.
>
> I've not actually tested any of the output code just yet, so there may
> be logic errors, but those are easily fixed later, and what I'm trying
> to get right here is the GCC machine description.
>
> Given this testcase:
>
>   void
>   f (long long *a, int b)
>   {
>     *a = *a << b;
>   }
>
> Without any patches, GCC gives this output, using only ARM core
> registers (in thumb2 mode):
>
>   f:
>         ldr     r2, [r0, #0]
>         ldr     r3, [r0, #4]
>         push    {r4, r5, r6}
>         rsb     r6, r1, #32
>         sub     r4, r1, #32
>         lsrs    r6, r2, r6
>         lsls    r5, r2, r4
>         lsls    r3, r3, r1
>         lsls    r1, r2, r1
>         orrs    r3, r3, r6
>         str     r1, [r0, #0]
>         ands    r4, r3, r4, asr #32
>         it      cc
>         movcc   r4, r5
>         str     r4, [r0, #4]
>         pop     {r4, r5, r6}
>         bx      lr
>
> With just neon-shifts.patch, we get this output, now with NEON shifts:
>
>   f:
>         fldd    d17, [r0, #0]   @ int
>         mov     r2, r1
>         movs    r3, #0
>         push    {r4, r5}
>         fmdrr   d18, r2, r3     @ int
>         vshl.i64  d16, d17, d18
>         fstd    d16, [r0, #0]   @ int
>         pop     {r4, r5}
>         bx      lr
>
> As you can see, the shift is much improved, but the shift amount is
> first extended into two SImode registers, and then moved to a NEON
> DImode register, which increases core-register pressure unnecessarily.
>
> With both patches, we now get this:
>
>   f:
>         fldd    d17, [r0, #0]   @ int
>         vdup.32 d16, r1
>         vshr.u64  d16, d16, #32   <-- still unnecessary
>         vshl.i64  d16, d17, d16
>         fstd    d16, [r0, #0]   @ int
>         bx      lr
>
> Now the value is copied and then extended. I have chosen to use vdup.32
> instead of vmov.i32 because the latter can only target half the DImode
> registers. The right shift is necessary for a general zero-extend, but
> is not useful in this case as only the bottom 8 bits are interesting,
> and vdup has already done the right thing.
>
> Note that the examples I've given are for left shifts. Right shifts are
> also implemented, but are a little more complicated (in the
> shift-by-register case) because the shift must be implemented as a left
> shift by a negative amount, and so an unspec is used to prevent the
> compiler doing anything 'clever'. Apart from an extra negation, the end
> result is much the same, but the patterns look different.
>
> All this is a nice improvement, but I'm not happy:
>
> 1. The post-reload split means that I've had to add a clobber for CC to
> all the patterns, even though only some of them really need it. I think
> I've convinced myself that this is ok because it doesn't matter before
> scheduling, and after splitting the clobbers are only retained if
> they're really needed, but it still feels wrong.
>
> 2. The extend optimization is fine for general case extends, but it can
> be improved for the shift-amount case because we actually only need the
> bottom 8 bits, as indicated above. The problem is that there's no
> obvious way to achieve this:
>   - there's no combine pass after this point, so a pattern that
> recognises and re-splits the extend, move and shift can't be used.
>   - I don't believe there can be a pattern that uses SImode for the
> shift amount because the value needs to be in a DImode register
> eventually, and that means one needs to have been allocated before it
> gets split, and that means the extend needs to be separate.
>
> 3. The type of the shift-amount is determined by the type used in the
> ashldi3 pattern, and that uses SImode. This is fine for values already
> in SImode registers (probably the common case), but means that values
> already in DImode registers will have to get truncated and then
> re-extended, and this is not an operation that can generally be
> optimized away once introduced.
>   - I've considered using a DImode shift-amount for the ashldi3
> pattern, and that would solve this problem - extend and truncate *can*
> be optimized away, but since it doesn't get split until post reload, the
> register allocator would already have allocated two SImode registers
> before we have any chance to make it go away.
>
> 4. I'm not sure, but I think the general-case shift in core registers is
> sufficiently long-winded that it might be worthwhile completely
> discarding that option (i.e. it might be cheaper to just always use neon
> shifts, when neon is available, of course). I'd keep the
> shift-by-constant-amount variants though. Does anybody have any comments
> on that?
>
> 5. The left and right shift patterns couldn't be unified because I
> couldn't find a way to do match_operand with unspecs, and anyway, the
> patterns are a slightly different shape.
>
> 6. Same with the logical and arithmetic right shifts; I couldn't find a
> way to unify those patterns either, even though the only difference is
> the unspec index number.
>
> Any help would be appreciated. I've probably implemented this backwards,
> or something ...
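For reference, here is a rough C model of what the generic core-register
sequence quoted above has to compute: a 64-bit left shift split across two
32-bit halves. This is only an illustrative sketch (the function and variable
names are mine, and it is not GCC's actual expansion):

    #include <stdint.h>

    /* Shift the 64-bit value held in {hi:lo} left by n, for 0 <= n <= 63,
       one 32-bit half at a time.  The two cases correspond to the
       "shft < 32" and "shft >= 32" columns in the sequences below.  */
    static uint64_t
    shl64 (uint32_t lo, uint32_t hi, unsigned n)
    {
      uint32_t new_lo, new_hi;
      if (n == 0)
        {
          new_lo = lo;
          new_hi = hi;
        }
      else if (n < 32)
        {
          new_lo = lo << n;
          new_hi = (hi << n) | (lo >> (32 - n));
        }
      else
        {
          new_lo = 0;
          new_hi = lo << (n - 32);
        }
      return ((uint64_t) new_hi << 32) | new_lo;
    }

The branchy form is easy to check against the testcase; the interesting part,
as the sequences below show, is getting the same result without branches or a
clobbered condition code.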
So it looks like the code generated for core registers with thumb2 is
pretty rubbish (no real surprise there) -- to get the best code you need
to make use of the fact that on ARM a shift by a small negative number
(< -128) will give zero.  This gives us sequences like:

For ARM state it's something like (untested)

                                        @ shft < 32  , shft >= 32
__ashldi3_v3:
        sub     r3, r2, #32             @ -ve        , shft - 32
        lsl     ah, ah, r2              @ ah << shft , 0
        rsb     ip, r2, #32             @ 32 - shft  , -ve
        orr     ah, ah, al, lsl r3      @ ah << shft , al << shft - 32
        orr     ah, ah, al, lsr ip      @ ah << shft | al >> 32 - shft , al << shft - 32
        lsl     al, al, r2              @ al << shft , 0

For Thumb2 (where there is no orr with register shift)

        lsls    ah, ah, r2              @ ah << shft , 0
        sub     r3, r2, #32             @ -ve        , shft - 32
        lsl     ip, al, r3              @ 0          , al << shft - 32
        negs    r3, r3                  @ 32 - shft  , -ve
        orr     ah, ah, ip              @ ah << shft , al << shft - 32
        lsr     r3, al, r3              @ al >> 32 - shft , 0
        orrs    ah, ah, r3              @ ah << shft | al >> 32 - shft , al << shft - 32
        lsls    al, al, r2              @ al << shft , 0

Neither of which needs the condition flags during execution (and indeed
is probably better in both cases than the code currently in
lib1funcs.asm for a modern core).  The flag clobbering behaviour in the
thumb2 variant is only for code size saving; that would normally be
added by a late optimization pass.

None of this directly helps with your neon usage, but it does show that
we really don't need to clobber the condition code register to get an
efficient sequence.

R.
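To make the two comment columns above concrete, here is a rough C model of the
ARM-state sequence (untested, like the asm; the names are mine). It assumes
the ARM register-controlled shift behaviour the trick relies on: the amount is
taken from the bottom byte of the register, and any amount of 32 or more gives
a zero result for LSL/LSR:

    #include <stdint.h>

    /* Model of an ARM register-controlled shift: only the bottom byte of
       the amount matters, and amounts of 32 or more produce zero.  */
    static uint32_t
    arm_lsl (uint32_t x, uint32_t amt)
    {
      amt &= 0xff;
      return amt < 32 ? x << amt : 0;
    }

    static uint32_t
    arm_lsr (uint32_t x, uint32_t amt)
    {
      amt &= 0xff;
      return amt < 32 ? x >> amt : 0;
    }

    /* The ARM-state sequence, line for line, for 0 <= shft <= 63.
       When shft < 32, r3 wraps to a "negative" amount, so the
       al << (shft - 32) term contributes nothing; when shft >= 32, the
       ah << shft and al >> (32 - shft) terms drop out instead (shft == 32
       still works, since both surviving terms then equal al).  */
    static uint64_t
    ashldi3_model (uint32_t al, uint32_t ah, uint32_t shft)
    {
      uint32_t r3 = shft - 32;          /* sub  r3, r2, #32        */
      ah = arm_lsl (ah, shft);          /* lsl  ah, ah, r2         */
      uint32_t ip = 32 - shft;          /* rsb  ip, r2, #32        */
      ah |= arm_lsl (al, r3);           /* orr  ah, ah, al, lsl r3 */
      ah |= arm_lsr (al, ip);           /* orr  ah, ah, al, lsr ip */
      al = arm_lsl (al, shft);          /* lsl  al, al, r2         */
      return ((uint64_t) ah << 32) | al;
    }

For example, ashldi3_model (1, 0, 33) gives 0x200000000, i.e. only the
al << (shft - 32) term survives in the high word. The Thumb2 variant computes
the same values, just in a different order and via an extra scratch register.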