On Tue, Jan 5, 2016 at 5:52 AM, Xiaofeng Ren <xiaofeng....@nxp.com> wrote:
>> Gcc-5.1:
>>   40110c:       3dc00c6c        ldr     q12, [x3,#48]

>> Gcc-4.8:
>>   40135c:       4cdf78af        ld1     {v15.4s}, [x5], #16

The ld1 and ldr instructions are effectively equivalent here; both
load a 16-byte value into an fp/simd register.

I see a difference in the scheduling though.  The gcc-4.8 output has a
series of shift/add/store instructions while the gcc-5.1 output has a
series of shift instructions followed by a series of store
instructions.  The gcc-5.1 output will serialize the code as these are
simd shifts which can only execute one at a time, and stores can only
execute one at a time.  I see that gcc-4.8 has no cortex-a53 pipeline
description, so we appear to be getting good code by accident.
gcc-5.1 does have a cortex-a53 scheduler, but it doesn't handle simd
instructions, so it isn't scheduling them correctly.  I see that there
was a change added in November
   https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00025.html
that adds a new a53 pipeline description, and this one does handle
simd instructions.  With current sources, I see some shifts,
alternating shifts and stores, and then the last of the stores.  This
should give better performance than the gcc-5.1 code.  I haven't tried
testing it on hardware.

Jim
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
