Re: ARM GCC 8.x Performance Dropping Compared to Linaro GCC 7.x

2019-08-21 Thread Maxim Kuvyrkov
Hi Yupeng,

Great testcase, thanks!

I've investigated this, and there are two separate changes between GCC 7 and 
GCC 8, each causing about half of the regression.

The first regression is due to the compiler making unlucky decisions.  Before 
the regression, the compiler just got lucky, and I'll look into bringing these 
lucky decisions back.

The second regression is due to a change in the compiler's tuning.  The new 
setting is better on average, and this testcase happens to regress.  I don't 
know whether we'll manage to fix that.

--
Maxim Kuvyrkov
www.linaro.org



> On Aug 21, 2019, at 9:55 AM, Yupeng Chang wrote:
> 
> Hi Maxim,
> Attached is the testcase.
> 
> Please follow these steps to test:
> 1. download GCC 8.3 from: 
> https://developer.arm.com/-/media/Files/downloads/gnu-a/8.3-2019.03/binrel/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu.tar.xz?revision=2e88a73f-d233-4f96-b1f4-d8b36e9bb0b9&la=en
> 2. download GCC 7.4 from: 
> https://releases.linaro.org/components/toolchain/binaries/latest-7/aarch64-linux-gnu/gcc-linaro-7.4.1-2019.02-x86_64_aarch64-linux-gnu.tar.xz
> 
> Please extract the toolchain into /usr/local
> 
> 3. Extract the attached package arm-performance.tar.xz into the home folder 
> of your Linux machine, cd into arm-performance, and run "make" to generate 
> the test program and the ASM code dumped from the .o file.
> 
> 
> On Tue, Aug 20, 2019 at 9:15 PM Maxim Kuvyrkov wrote:
> Hi Yupeng,
> 
> There are many changes from Linaro GCC 7.x to ARM GCC 8.x, so it is difficult 
> to guess what may be going wrong.
> 
> Do you have a testcase that you can share?  With a testcase we can 
> investigate the problem and, possibly, fix it.
> 
> Regards,
> 
> --
> Maxim Kuvyrkov
> www.linaro.org
> 
> 
> 
> > On Aug 19, 2019, at 7:14 AM, Yupeng Chang wrote:
> > 
> > Hi Dear Linaro Team,
> > I recently found a very strange issue regarding code performance.
> > I have a loop written with GCC NEON intrinsics.
> > The binary generated for this code by Linaro GCC 7.x is much faster than
> > the one generated by ARM GCC 8.x.
> > 
> > My CPU is an ARM Cortex-A53 (AArch64).
> > The compile options are:
> > -Wall -O3 -mcpu=cortex-a53+crypto
> > 
> > The code looks like this:
> > for (uint32 c = 0; c < channels; c += 16, roi_result += 16) {
> > int32x4_t   S1, S2, S3, S4;
> > int16x4_t   DT;
> > 
> > DT = vld1_s16(feature1 + c + 0);
> > S1 = vmull_lane_s16(DT, SZ, 0);
> > DT = vld1_s16(feature1 + c + 4);
> > S2 = vmull_lane_s16(DT, SZ, 0);
> > DT = vld1_s16(feature1 + c + 8);
> > S3 = vmull_lane_s16(DT, SZ, 0);
> > DT = vld1_s16(feature1 + c + 12);
> > S4 = vmull_lane_s16(DT, SZ, 0);
> > 
> > DT = vld1_s16(feature2 + c + 0);
> > S1 = vmlal_lane_s16(S1, DT, SZ, 1);
> > DT = vld1_s16(feature2 + c + 4);
> > S2 = vmlal_lane_s16(S2, DT, SZ, 1);
> > DT = vld1_s16(feature2 + c + 8);
> > S3 = vmlal_lane_s16(S3, DT, SZ, 1);
> > DT = vld1_s16(feature2 + c + 12);
> > S4 = vmlal_lane_s16(S4, DT, SZ, 1);
> > 
> > DT = vld1_s16(feature3 + c + 0);
> > S1 = vmlal_lane_s16(S1, DT, SZ, 2);
> > DT = vld1_s16(feature3 + c + 4);
> > S2 = vmlal_lane_s16(S2, DT, SZ, 2);
> > DT = vld1_s16(feature3 + c + 8);
> > S3 = vmlal_lane_s16(S3, DT, SZ, 2);
> > DT = vld1_s16(feature3 + c + 12);
> > S4 = vmlal_lane_s16(S4, DT, SZ, 2);
> > 
> > DT = vld1_s16(feature4 + c + 0);
> > S1 = vmlal_lane_s16(S1, DT, SZ, 3);
> > DT = vld1_s16(feature4 + c + 4);
> > S2 = vmlal_lane_s16(S2, DT, SZ, 3);
> > DT = vld1_s16(feature4 + c + 8);
> > S3 = vmlal_lane_s16(S3, DT, SZ, 3);
> > DT = vld1_s16(feature4 + c + 12);
> > S4 = vmlal_lane_s16(S4, DT, SZ, 3);
> > 
> > DT = vrshrn_n_s32(S1, Q_VALUE);
> > vst1_s16(roi_result + 0, DT);
> > DT = vrshrn_n_s32(S2, Q_VALUE);
> > vst1_s16(roi_result + 4, DT);
> > DT = vrshrn_n_s32(S3, Q_VALUE);
> > vst1_s16(roi_result + 8, DT);
> > DT = vrshrn_n_s32(S4, Q_VALUE);
> > vst1_s16(roi_result + 12, DT);
> > }
> > 
> > Code generated by GCC7:
> >  294:   6b10031f    cmp     w24, w16
> >  298:   fc606959    ldr     d25, [x10, x0]
> >  29c:   fc686922    ldr     d2, [x9, x8]
> >  2a0:   fc676921    ldr     d1, [x9, x7]
> >  2a4:   fc666920    ldr     d0, [x9, x6]
> >  2a8:   fc686958    ldr     d24, [x10, x8]
> >  2ac:   fc676957    ldr     d23, [x10, x7]
> >  2b0:   fc666956    ldr     d22, [x10, x6]
> >  2b4:   fc606855    ldr     d21, [x2, x0]
> >  2b8:   fc686854    ldr     d20, [x2, x8]
> >  2bc:   fc676853    ldr     d19, [x2, x7]
> >  2c0:   fc666852    ldr     d18, [x2, x6]
> >  2c4:   fc606891    ldr     d17, [x4, x0]
> >  2c8:   fc686890    ldr     d16, [x4, x8]
> >  2cc:   fc676887    ldr     d7, [x4, x7]
> >  2d0:   fc666885    ldr     d5, [x4, x6]
> >  2d4:   0f44a063    smull   v3.4s, v3.4h, v4.h[0]
> >  2d8:   0f44a042    smull   v2.4s, v2.4h, v4.h[0]
> >  2dc:   0f44a021    smull 
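
For readers who want to check the output of either toolchain, the arithmetic of the quoted NEON loop can be written as a scalar reference. This is only a sketch under assumptions: the function names, the `sz` array (standing in for the four lanes of `SZ`) and the `q_value` parameter (standing in for `Q_VALUE`, whose value is not given in the thread) are hypothetical, and the accumulator is assumed to stay within the 32-bit range that the NEON code uses.

```c
#include <assert.h>
#include <stdint.h>

/* One output element: four-term weighted sum followed by a rounding right
 * shift narrowed to 16 bits, the scalar counterpart of vrshrn_n_s32.
 * Assumes the sum fits in int32_t, as it must in the NEON version, and
 * that >> on a negative value is an arithmetic shift (true for GCC). */
static int16_t weighted_sum_q(int16_t a, int16_t b, int16_t c, int16_t d,
                              const int16_t sz[4], unsigned q_value)
{
    assert(q_value >= 1 && q_value <= 16);   /* range accepted by vrshrn_n_s32 */

    int64_t acc = (int64_t)a * sz[0] + (int64_t)b * sz[1]
                + (int64_t)c * sz[2] + (int64_t)d * sz[3];

    acc += (int64_t)1 << (q_value - 1);      /* add the rounding constant */
    return (int16_t)(acc >> q_value);        /* shift, then narrow */
}

/* Scalar reference for the whole loop. The quoted loop advances 16
 * channels per iteration, so channels is assumed to be a multiple of 16. */
void roi_scalar_reference(const int16_t *feature1, const int16_t *feature2,
                          const int16_t *feature3, const int16_t *feature4,
                          const int16_t sz[4], int16_t *roi_result,
                          uint32_t channels, unsigned q_value)
{
    for (uint32_t i = 0; i < channels; i++) {
        roi_result[i] = weighted_sum_q(feature1[i], feature2[i],
                                       feature3[i], feature4[i],
                                       sz, q_value);
    }
}
```

Comparing the output of this function against the NEON loop on the same inputs is one way to confirm that a build from a different compiler version still computes the same values before comparing timings.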

Re: ARM GCC 8.x Performance Dropping Compared to Linaro GCC 7.x

2019-08-21 Thread Yupeng Chang
Hi Maxim,
Thank you very much for looking into this!
I hope you can fix this regression and bring the performance back in GCC! :D

Yupeng Chang
Aug 22 2019