> On Mar 2, 2016, at 4:05 PM, Christophe Lyon <christophe.l...@linaro.org> 
> wrote:
> 
> On 2 March 2016 at 12:35, Edward Nevill <edward.nev...@linaro.org> wrote:
>> Hi,
>> 
>> I have just switched to gcc 5.2 from 4.9.2 and the code quality does seem to 
>> have improved significantly. For example, it now seems much better at using 
>> ldp/stp, and it seems to have stopped making gratuitous use of the SIMD registers.
> 
> Hi Ed,
> 
> Thanks for the feedback.
> 
> Can you be more specific on the GCC versions you are using?
> Do you mean Linaro TCWG releases?
> 
> 
>> 
>> However, I still have a few whinges :-)
>> 
>> See attached copy.c / copy.s (This is a performance critical function from 
>> OpenJDK)
>> 
>> pd_disjoint_words:
>>        cmp     x2, 8           <<< (1)
>>        sub     sp, sp, #64     <<< (2)
>>        bhi     .L2
>>        cmp     w2, 8           <<< (1)
>>        bls     .L15
>> .L2:
>>        add     sp, sp, 64      <<< (2)
>> 
>> (1) If count as a 64-bit unsigned is <= 8, then it is certainly still <= 8 as 
>> a 32-bit unsigned, so the second compare is redundant.
>> 
>> (2) Nowhere in the function does it store anything on the stack, so why
>> drop and restore the stack pointer every time? Also, a minor quibble with the
>> disassembly: why does sub use '#64' whereas add uses just '64'? (I appreciate
>> this is probably binutils, not gcc.)
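
For context, since the attached copy.c is not reproduced in the thread:
pd_disjoint_words in OpenJDK is essentially a fall-through switch over small
counts with a memcpy fallback. Below is a sketch of that shape, assuming
HeapWord is a 64-bit word on AArch64; it is an illustration, not the attached
file. It accounts for the code above: the 64-bit compare guards the default
case, the 32-bit compare is the switch's own bounds check, and the byte table
at .L4 dispatches the cases.

    /* Illustrative sketch only, not the attached copy.c. */
    #include <string.h>
    #include <stddef.h>

    typedef unsigned long HeapWord;   /* assumption: 8 bytes on AArch64 */

    static void pd_disjoint_words(HeapWord *from, HeapWord *to, size_t count)
    {
      switch (count) {
      case 8: to[7] = from[7]; /* fall through */
      case 7: to[6] = from[6]; /* fall through */
      case 6: to[5] = from[5]; /* fall through */
      case 5: to[4] = from[4]; /* fall through */
      case 4: to[3] = from[3]; /* fall through */
      case 3: to[2] = from[2]; /* fall through */
      case 2: to[1] = from[1]; /* fall through */
      case 1: to[0] = from[0]; /* fall through */
      case 0: break;
      default: memcpy(to, from, count * sizeof(HeapWord)); break;
      }
    }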
>> 
>> .L15:
>>        adrp    x3, .L4
>>        add     x3, x3, :lo12:.L4
>>        ldrb    w2, [x3,w2,uxtw]     <<< (3)
>>        adr     x3, .Lrtx4
>>        add     x2, x3, w2, sxtb #2
>>        br      x2
>> 
>> (3) Why use a byte table? This is not some sort of embedded system. Use
>> a word table and this becomes:
>> 
>> .L15:
>>        adrp    x3, .L4
>>        add     x3, x3, :lo12:.L4
>>        ldr     x2, [x3, x2, lsl #3]
>>        br      x2
>> 
>> An aligned word load takes exactly the same time as a byte load, and we
>> save the faffing about calculating the branch target address.
>> 
>> .L10:
>>        ldp     x6, x7, [x0]
>>        ldp     x4, x5, [x0, 16]
>>        ldp     x2, x3, [x0, 32]  <<< (4)
>>        stp     x2, x3, [x1, 32]  <<< (4)
>>        stp     x6, x7, [x1]
>>        stp     x4, x5, [x1, 16]
>> 
>> (4) There seems to be something wrong with the load scheduler here: why not
>> move the stp x2, x3 to the end? It does this repeatedly.
>> 
>> Unfortunately, as this function is performance critical, I will probably
>> end up writing it in inline assembler, which is time consuming, error
>> prone and non-portable.
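
For illustration, a sketch of the kind of inline-assembler version meant here:
a fixed 8-word disjoint copy written with GCC extended asm, using ldp/stp
pairs. The function name and the fixed count are illustrative assumptions, not
code from the attached file.

    #include <stddef.h>

    /* Illustrative sketch: copy exactly 8 64-bit words; the regions are
       assumed not to overlap. */
    static inline void copy_8_words(unsigned long *to, const unsigned long *from)
    {
      unsigned long t0, t1, t2, t3, t4, t5, t6, t7;
      __asm__ ("ldp %[t0], %[t1], [%[from]]\n\t"
               "ldp %[t2], %[t3], [%[from], #16]\n\t"
               "ldp %[t4], %[t5], [%[from], #32]\n\t"
               "ldp %[t6], %[t7], [%[from], #48]\n\t"
               "stp %[t0], %[t1], [%[to]]\n\t"
               "stp %[t2], %[t3], [%[to], #16]\n\t"
               "stp %[t4], %[t5], [%[to], #32]\n\t"
               "stp %[t6], %[t7], [%[to], #48]"
               : [t0] "=&r" (t0), [t1] "=&r" (t1), [t2] "=&r" (t2),
                 [t3] "=&r" (t3), [t4] "=&r" (t4), [t5] "=&r" (t5),
                 [t6] "=&r" (t6), [t7] "=&r" (t7)
               : [to] "r" (to), [from] "r" (from)
               : "memory");
    }

The early-clobber "=&r" constraints keep the loaded values live across the
stores, and the "memory" clobber orders the copy against surrounding accesses.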
>> 
>> * Whinge mode off
>> 
> 
> I've just tried with our 5.3-2015.12 snapshot, and all your comments
> except (4) are still valid.

Edward,

This is very useful information. Would you please copy-paste this into one or 
more bugzilla entries at bugs.linaro.org? TCWG will then evaluate which of the 
issues are still present in GCC trunk and put them into our optimization 
pipeline.

> 
> The same is true with today's trunk. FWIW, (4) now looks like this:
> .L10:
>        ldp     x6, x7, [x0]
>        ldp     x4, x5, [x0, 16]
>        ldp     x2, x3, [x0, 32]
>        stp     x6, x7, [x1]
>        stp     x4, x5, [x1, 16]
>        stp     x2, x3, [x1, 32]

Right, this is fixed in GCC 6 (and backported to Linaro 5.3, AFAIK), where the 
compiler tries to sort memory references in order of increasing address to 
accommodate CPUs with a hardware cache auto-prefetcher.

Thank you,

--
Maxim Kuvyrkov
www.linaro.org


