On 02/12/11 12:42, Andrew Stubbs wrote:
> Hi All,
>
> I'm trying to implement DImode shifts using ARM NEON instructions. This
> wouldn't be difficult in itself, but making it play nice with the
> existing implementation is causing me problems. I'd like a few
> suggestions/pointers/comments to help me get this right, please.
>
> The existing shift mechanisms must be kept, partly because the NEON unit
> is optional, and partly because it does not permit the full range of
> DImode operations, so sometimes it's more efficient to do 64-bit
> operations in core-registers, rather than copy all the values over to
> NEON, do the operation, and move the result back. Which set of patterns
> are used is determined by the register allocator and its costs mechanism.
>
> The late decision means that the patterns may only use the post-reload
> splitter, and so cannot rely on many of the usual passes to sort out
> inefficiencies. In particular, the lack of combine makes it hard to
> detect and optimize extend-and-copy sequences.
>
> So, I've attached two patches. The first is neon-shifts.patch, and does
> most of the work. The second is extendsidi2_neon.patch, and is intended
> to aid moving the shift amount from SImode registers, but doesn't go as
> far as I'd like.
>
> I've not actually tested any of the output code just yet, so there may
> be logic errors, but those are easily fixed later, and what I'm trying
> to get right here is the GCC machine description.
>
> Given this testcase:
>
>   void
>   f (long long *a, int b)
>   {
>     *a = *a << b;
>   }
>
> Without any patches, GCC gives this output, using only ARM core
> registers (in thumb2 mode):
>
>   f:
>         ldr     r2, [r0, #0]
>         ldr     r3, [r0, #4]
>         push    {r4, r5, r6}
>         rsb     r6, r1, #32
>         sub     r4, r1, #32
>         lsrs    r6, r2, r6
>         lsls    r5, r2, r4
>         lsls    r3, r3, r1
>         lsls    r1, r2, r1
>         orrs    r3, r3, r6
>         str     r1, [r0, #0]
>         ands    r4, r3, r4, asr #32
>         it      cc
>         movcc   r4, r5
>         str     r4, [r0, #4]
>         pop     {r4, r5, r6}
>         bx      lr
>
> With just neon-shifts.patch, we get this output, now with NEON shifts:
>
>   f:
>         fldd    d17, [r0, #0]   @ int
>         mov     r2, r1
>         movs    r3, #0
>         push    {r4, r5}
>         fmdrr   d18, r2, r3     @ int
>         vshl.i64  d16, d17, d18
>         fstd    d16, [r0, #0]   @ int
>         pop     {r4, r5}
>         bx      lr
>
> As you can see, the shift is much improved, but the shift amount is
> first extended into two SImode registers, and then moved to a NEON
> DImode register, which increases core-register pressure unnecessarily.
>
> With both patches, we now get this:
>
>   f:
>         fldd    d17, [r0, #0]   @ int
>         vdup.32 d16, r1
>         vshr.u64  d16, d16, #32   <-- still unnecessary
>         vshl.i64  d16, d17, d16
>         fstd    d16, [r0, #0]   @ int
>         bx      lr
>
> Now the value is copied and then extended. I have chosen to use vdup.32
> instead of vmov.i32 because the latter can only target half the DImode
> registers. The right shift is necessary for a general zero-extend, but
> is not useful in this case as only the bottom 8 bits are interesting,
> and vdup has already done the right thing.
>
> Note that the examples I've given are for left shifts. Right shifts are
> also implemented, but are a little more complicated (in the
> shift-by-register case) because the shift must be implemented as a left
> shift by a negative amount, and so an unspec is used to prevent the
> compiler doing anything 'clever'. Apart from an extra negation, the end
> result is much the same, but the patterns look different.
>
> All this is a nice improvement, but I'm not happy:
>
> 1. The post-reload split means that I've had to add a clobber for CC to
> all the patterns, even though only some of them really need it. I think
> I've convinced myself that this is ok because it doesn't matter before
> scheduling, and after splitting the clobbers are only retained if
> they're really needed, but it still feels wrong.
>
> 2. The extend optimization is fine for general case extends, but it can
> be improved for the shift-amount case because we actually only need the
> bottom 8 bits, as indicated above. The problem is that there's no
> obvious way to achieve this:
>   - there's no combine pass after this point, so a pattern that
> recognises and re-splits the extend, move and shift can't be used.
>   - I don't believe there can be a pattern that uses SImode for the
> shift amount because the value needs to be in a DImode register
> eventually, and that means one needs to have been allocated before it
> gets split, and that means the extend needs to be separate.
>
> 3. The type of the shift-amount is determined by the type used in the
> ashldi3 pattern, and that uses SImode. This is fine for values already
> in SImode registers (probably the common case), but means that values
> already in DImode registers will have to get truncated and then
> re-extended, and this is not an operation that can generally be
> optimized away once introduced.
>   - I've considered using a DImode shift-amount for the ashldi3
> pattern, and that would solve this problem - extend and truncate *can*
> be optimized away, but since it doesn't get split until post reload, the
> register allocator would already have allocated two SImode registers
> before we have any chance to make it go away.
>
> 4. I'm not sure, but I think the general-case shift in core registers is
> sufficiently long-winded that it might be worthwhile completely
> discarding that option (i.e. it might be cheaper to just always use neon
> shifts, when neon is available, of course). I'd keep the
> shift-by-constant-amount variants though. Does anybody have any comments
> on that?
>
> 5. The left and right shift patterns couldn't be unified because I
> couldn't find a way to do match_operand with unspecs, and anyway, the
> patterns are a slightly different shape.
>
> 6. Same with the logical and arithmetic right shifts; I couldn't find a
> way to unify those patterns either, even though the only difference is
> the unspec index number.
>
> Any help would be appreciated. I've probably implemented this backwards,
> or something ...
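For reference, here is a rough C model of what the generic core-register
sequence quoted above has to compute: a 64-bit left shift split across two
32-bit halves. This is only an illustrative sketch (the function and variable
names are mine, and it is not GCC's actual expansion):

    #include <stdint.h>

    /* Shift the 64-bit value held in {hi:lo} left by n, for 0 <= n <= 63,
       one 32-bit half at a time.  The two cases correspond to the
       "shft < 32" and "shft >= 32" columns in the sequences below.  */
    static uint64_t
    shl64 (uint32_t lo, uint32_t hi, unsigned n)
    {
      uint32_t new_lo, new_hi;
      if (n == 0)
        {
          new_lo = lo;
          new_hi = hi;
        }
      else if (n < 32)
        {
          new_lo = lo << n;
          new_hi = (hi << n) | (lo >> (32 - n));
        }
      else
        {
          new_lo = 0;
          new_hi = lo << (n - 32);
        }
      return ((uint64_t) new_hi << 32) | new_lo;
    }

The branchy form is easy to check against the testcase; the interesting part,
as the sequences below show, is getting the same result without branches or a
clobbered condition code.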
So it looks like the code generated for core registers with thumb2 is
pretty rubbish (no real surprise there) -- to get the best code you need
to make use of the fact that on ARM a shift by a small negative number
(< -128) will give zero.  This gives us sequences like:

For ARM state it's something like (untested)

                                        @ shft < 32  , shft >= 32
__ashldi3_v3:
        sub     r3, r2, #32             @ -ve        , shft - 32
        lsl     ah, ah, r2              @ ah << shft , 0
        rsb     ip, r2, #32             @ 32 - shft  , -ve
        orr     ah, ah, al, lsl r3      @ ah << shft , al << shft - 32
        orr     ah, ah, al, lsr ip      @ ah << shft | al >> 32 - shft , al << shft - 32
        lsl     al, al, r2              @ al << shft , 0

For Thumb2 (where there is no orr with register shift)

        lsls    ah, ah, r2              @ ah << shft , 0
        sub     r3, r2, #32             @ -ve        , shft - 32
        lsl     ip, al, r3              @ 0          , al << shft - 32
        negs    r3, r3                  @ 32 - shft  , -ve
        orr     ah, ah, ip              @ ah << shft , al << shft - 32
        lsr     r3, al, r3              @ al >> 32 - shft , 0
        orrs    ah, ah, r3              @ ah << shft | al >> 32 - shft , al << shft - 32
        lsls    al, al, r2              @ al << shft , 0

Neither of which needs the condition flags during execution (and indeed
is probably better in both cases than the code currently in
lib1funcs.asm for a modern core).  The flag clobbering behaviour in the
thumb2 variant is only for code size saving; that would normally be
added by a late optimization pass.

None of this directly helps with your neon usage, but it does show that
we really don't need to clobber the condition code register to get an
efficient sequence.

R.
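To make the two comment columns above concrete, here is a rough C model of the
ARM-state sequence (untested, like the asm; the names are mine). It assumes
the ARM register-controlled shift behaviour the trick relies on: the amount is
taken from the bottom byte of the register, and any amount of 32 or more gives
a zero result for LSL/LSR:

    #include <stdint.h>

    /* Model of an ARM register-controlled shift: only the bottom byte of
       the amount matters, and amounts of 32 or more produce zero.  */
    static uint32_t
    arm_lsl (uint32_t x, uint32_t amt)
    {
      amt &= 0xff;
      return amt < 32 ? x << amt : 0;
    }

    static uint32_t
    arm_lsr (uint32_t x, uint32_t amt)
    {
      amt &= 0xff;
      return amt < 32 ? x >> amt : 0;
    }

    /* The ARM-state sequence, line for line, for 0 <= shft <= 63.
       When shft < 32, r3 wraps to a "negative" amount, so the
       al << (shft - 32) term contributes nothing; when shft >= 32, the
       ah << shft and al >> (32 - shft) terms drop out instead (shft == 32
       still works, since both surviving terms then equal al).  */
    static uint64_t
    ashldi3_model (uint32_t al, uint32_t ah, uint32_t shft)
    {
      uint32_t r3 = shft - 32;          /* sub  r3, r2, #32        */
      ah = arm_lsl (ah, shft);          /* lsl  ah, ah, r2         */
      uint32_t ip = 32 - shft;          /* rsb  ip, r2, #32        */
      ah |= arm_lsl (al, r3);           /* orr  ah, ah, al, lsl r3 */
      ah |= arm_lsr (al, ip);           /* orr  ah, ah, al, lsr ip */
      al = arm_lsl (al, shft);          /* lsl  al, al, r2         */
      return ((uint64_t) ah << 32) | al;
    }

For example, ashldi3_model (1, 0, 33) gives 0x200000000, i.e. only the
al << (shft - 32) term survives in the high word. The Thumb2 variant computes
the same values, just in a different order and via an extra scratch register.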