Re: RFC: ARM 64-bit shifts in NEON

Andrew Stubbs Mon, 12 Dec 2011 08:29:13 -0800

On 07/12/11 13:42, Richard Earnshaw wrote:

So it looks like the code generated for core registers with thumb2 is
pretty rubbish (no real surprise there). To get the best code you need
to make use of the fact that on ARM a register-specified shift by a small
negative number (magnitude less than 128) will give zero.  This gives us
sequences like:


For ARM state it's something like (untested)

                                        @ shft < 32                     , shft >= 32
__ashldi3_v3:
        sub     r3, r2, #32             @ -ve                           , shft - 32
        lsl     ah, ah, r2              @ ah << shft                    , 0
        rsb     ip, r2, #32             @ 32 - shft                     , -ve
        orr     ah, ah, al, lsl r3      @ ah << shft                    , al << shft - 32
        orr     ah, ah, al, lsr ip      @ ah << shft | al >> 32 - shft  , al << shft - 32
        lsl     al, al, r2              @ al << shft                    , 0
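
The ARM-state sequence above can be checked with a small C model. This is
only a sketch under stated assumptions: arm_lsl, arm_lsr, ashldi3_model and
model_matches_reference are hypothetical names that do not appear in the
original message, and the helpers assume that a register-specified shift
uses only the low byte of the shift register and yields zero for amounts of
32 or more.

```c
#include <stdint.h>

/* Assumed model of a register-specified LSL: only the low byte of the
   shift register counts, and amounts of 32 or more produce zero. */
static uint32_t arm_lsl(uint32_t value, uint32_t shift_reg)
{
    uint32_t amount = shift_reg & 0xffu;
    return amount >= 32 ? 0 : value << amount;
}

/* Same assumption for a register-specified LSR. */
static uint32_t arm_lsr(uint32_t value, uint32_t shift_reg)
{
    uint32_t amount = shift_reg & 0xffu;
    return amount >= 32 ? 0 : value >> amount;
}

/* Hypothetical model of the six-instruction ARM-state sequence.
   x is split into ah:al; shft plays the role of r2 (0..63). */
uint64_t ashldi3_model(uint64_t x, uint32_t shft)
{
    uint32_t al = (uint32_t)x;
    uint32_t ah = (uint32_t)(x >> 32);

    uint32_t r3 = shft - 32u;      /* sub r3, r2, #32        */
    ah = arm_lsl(ah, shft);        /* lsl ah, ah, r2         */
    uint32_t ip = 32u - shft;      /* rsb ip, r2, #32        */
    ah |= arm_lsl(al, r3);         /* orr ah, ah, al, lsl r3 */
    ah |= arm_lsr(al, ip);         /* orr ah, ah, al, lsr ip */
    al = arm_lsl(al, shft);        /* lsl al, al, r2         */

    return ((uint64_t)ah << 32) | al;
}

/* Returns 1 if the model agrees with a plain 64-bit left shift of x
   for every shift amount from 0 to 63, and 0 otherwise. */
int model_matches_reference(uint64_t x)
{
    for (uint32_t shft = 0; shft < 64; shft++) {
        if (ashldi3_model(x, shft) != (x << shft))
            return 0;
    }
    return 1;
}
```

Calling model_matches_reference with a few bit patterns exercises both
columns of the comments: for shft < 32 the "lsl r3" term drops out, and
for shft >= 32 the "lsr ip" term does.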

For Thumb2 (where there is no orr with register shift)

        lsls    ah, ah, r2              @ ah << shft                    , 0
        sub     r3, r2, #32             @ -ve                           , shft - 32
        lsl     ip, al, r3              @ 0                             , al << shft - 32
        negs    r3, r3                  @ 32 - shft                     , -ve
        orr     ah, ah, ip              @ ah << shft                    , al << shft - 32
        lsr     r3, al, r3              @ al >> 32 - shft               , 0
        orrs    ah, ah, r3              @ ah << shft | al >> 32 - shft  , al << shft - 32
        lsls    al, al, r2              @ al << shft                    , 0

Neither of which needs the condition flags during execution (and indeed
is probably better in both cases than the code currently in lib1funcs.asm
for a modern core).  The flag clobbering behaviour in the thumb2 variant
is only for code size saving; that would normally be added by a late
optimization pass.

None of this directly helps with your neon usage, but it does show that we
really don't need to clobber the condition code register to get an
efficient sequence.

Unfortunately, both these sequences use two scratch registers, as shown,
and that's worse than clobbering CC.

Now, I can implement this for non-Neon easily enough, I think, and that
would be a win, but I'm trying to figure out how best to do it for both
that case and the case where neon is available but the compiler chooses
not to do it.

The problem is that when there is no neon available, this can be
converted at expand or split1 time, but when neon *is* available we have
to wait until a post-reload split, and then we'd be forced to expand
this in early-clobber mode, which is far less optimal.

Any suggestions how to do this without pessimizing the code in the case
that neon is available but not used?

In fact, is the general shift operation sufficiently expensive that I
should just abandon the fall back alternatives and *always* use Neon
when available? In this case, what about A8 vs. A9?


Thanks

Andrew
