Re: ARM GCC 8.x Performance Dropping Compared to Linaro GCC 7.x
Hi Yupeng,

Great testcase, thanks!

I've investigated this, and there are two separate changes between GCC 7 and GCC 8, each causing half of the regression.

The first regression is due to the compiler making unlucky decisions. Before the regression the compiler just got lucky, and I'll look into bringing these lucky decisions back.

The second regression is due to changed tuning of the compiler. The new setting is better on average, and this testcase happens to regress. I don't know whether we'll manage to fix that.

--
Maxim Kuvyrkov
www.linaro.org

> On Aug 21, 2019, at 9:55 AM, Yupeng Chang wrote:
>
> Hi Maxim,
> Attached is the testcase.
>
> Please follow these steps to test:
> 1. download GCC 8.3 from:
> https://developer.arm.com/-/media/Files/downloads/gnu-a/8.3-2019.03/binrel/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu.tar.xz?revision=2e88a73f-d233-4f96-b1f4-d8b36e9bb0b9&la=en
> 2. download GCC 7.4 from:
> https://releases.linaro.org/components/toolchain/binaries/latest-7/aarch64-linux-gnu/gcc-linaro-7.4.1-2019.02-x86_64_aarch64-linux-gnu.tar.xz
>
> Please extract the toolchain into /usr/local
>
> 3. extract the attached package arm-performance.tar.xz into your linux
> machine's home folder
> cd into arm-performance
> run "make" to generate the test program and the ASM code dumped from .o file
>
>
> On Tue, Aug 20, 2019 at 9:15 PM Maxim Kuvyrkov wrote:
> Hi Yupeng,
>
> There are many changes from Linaro GCC 7.x to ARM GCC 8.x, so it is difficult
> to guess what may be going wrong.
>
> Do you have a testcase that you can share? With a testcase we can
> investigate the problem and, possibly, fix it.
>
> Regards,
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>
>
> > On Aug 19, 2019, at 7:14 AM, Yupeng Chang wrote:
> >
> > Hi Dear Linaro Team,
> > I recently found a very strange issue regarding the code performance.
> > I have a loop written with GCC NEON intrinsics.
> > The binary of this code generated by Linaro GCC 7.x is much faster than
> > the one generated by ARM GCC 8.x
> >
> > My CPU is ARM Cortex-A53 AARCH64.
> > The compile option is:
> > -Wall -O3 -mcpu=cortex-a53+crypto
> >
> > the code is like below:
> > for (uint32 c = 0; c < channels; c += 16, roi_result += 16) {
> >     int32x4_t S1, S2, S3, S4;
> >     int16x4_t DT;
> >
> >     DT = vld1_s16(feature1 + c + 0);
> >     S1 = vmull_lane_s16(DT, SZ, 0);
> >     DT = vld1_s16(feature1 + c + 4);
> >     S2 = vmull_lane_s16(DT, SZ, 0);
> >     DT = vld1_s16(feature1 + c + 8);
> >     S3 = vmull_lane_s16(DT, SZ, 0);
> >     DT = vld1_s16(feature1 + c + 12);
> >     S4 = vmull_lane_s16(DT, SZ, 0);
> >
> >     DT = vld1_s16(feature2 + c + 0);
> >     S1 = vmlal_lane_s16(S1, DT, SZ, 1);
> >     DT = vld1_s16(feature2 + c + 4);
> >     S2 = vmlal_lane_s16(S2, DT, SZ, 1);
> >     DT = vld1_s16(feature2 + c + 8);
> >     S3 = vmlal_lane_s16(S3, DT, SZ, 1);
> >     DT = vld1_s16(feature2 + c + 12);
> >     S4 = vmlal_lane_s16(S4, DT, SZ, 1);
> >
> >     DT = vld1_s16(feature3 + c + 0);
> >     S1 = vmlal_lane_s16(S1, DT, SZ, 2);
> >     DT = vld1_s16(feature3 + c + 4);
> >     S2 = vmlal_lane_s16(S2, DT, SZ, 2);
> >     DT = vld1_s16(feature3 + c + 8);
> >     S3 = vmlal_lane_s16(S3, DT, SZ, 2);
> >     DT = vld1_s16(feature3 + c + 12);
> >     S4 = vmlal_lane_s16(S4, DT, SZ, 2);
> >
> >     DT = vld1_s16(feature4 + c + 0);
> >     S1 = vmlal_lane_s16(S1, DT, SZ, 3);
> >     DT = vld1_s16(feature4 + c + 4);
> >     S2 = vmlal_lane_s16(S2, DT, SZ, 3);
> >     DT = vld1_s16(feature4 + c + 8);
> >     S3 = vmlal_lane_s16(S3, DT, SZ, 3);
> >     DT = vld1_s16(feature4 + c + 12);
> >     S4 = vmlal_lane_s16(S4, DT, SZ, 3);
> >
> >     DT = vrshrn_n_s32(S1, Q_VALUE);
> >     vst1_s16(roi_result + 0, DT);
> >     DT = vrshrn_n_s32(S2, Q_VALUE);
> >     vst1_s16(roi_result + 4, DT);
> >     DT = vrshrn_n_s32(S3, Q_VALUE);
> >     vst1_s16(roi_result + 8, DT);
> >     DT = vrshrn_n_s32(S4, Q_VALUE);
> >     vst1_s16(roi_result + 12, DT);
> > }
> >
> > Code generated by GCC7:
> > 294: 6b10031f cmp w24, w16
> > 298: fc606959 ldr d25, [x10, x0]
> > 29c: fc686922 ldr d2, [x9, x8]
> > 2a0: fc676921 ldr d1, [x9, x7]
> > 2a4: fc666920 ldr d0, [x9, x6]
> > 2a8: fc686958 ldr d24, [x10, x8]
> > 2ac: fc676957 ldr d23, [x10, x7]
> > 2b0: fc666956 ldr d22, [x10, x6]
> > 2b4: fc606855 ldr d21, [x2, x0]
> > 2b8: fc686854 ldr d20, [x2, x8]
> > 2bc: fc676853 ldr d19, [x2, x7]
> > 2c0: fc666852 ldr d18, [x2, x6]
> > 2c4: fc606891 ldr d17, [x4, x0]
> > 2c8: fc686890 ldr d16, [x4, x8]
> > 2cc: fc676887 ldr d7, [x4, x7]
> > 2d0: fc666885 ldr d5, [x4, x6]
> > 2d4: 0f44a063 smull v3.4s, v3.4h, v4.h[0]
> > 2d8: 0f44a042 smull v2.4s, v2.4h, v4.h[0]
> > 2dc: 0f44a021 smull
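
When comparing binaries from two toolchains, it can help to first confirm that both produce the same results before timing them. The quoted loop computes, per channel, a four-term multiply-accumulate against the lanes of SZ followed by a rounding right shift that narrows to 16 bits. Below is a minimal scalar sketch of that arithmetic in portable C. The names `rshrn_s32` and `roi_scalar` are hypothetical and not part of the attached testcase; `sz` stands in for the four lanes of SZ and `q` for Q_VALUE. It assumes the usual GCC behaviour of arithmetic right shift on negative values and wrapping on narrowing conversions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane of vrshrn_n_s32: add the rounding constant,
 * shift right by q, keep the low 16 bits (no saturation).
 * Assumes >> on a negative value is an arithmetic shift (true for GCC). */
static int16_t rshrn_s32(int32_t acc, int q)
{
    assert(q >= 1 && q <= 16); /* valid immediate range for vrshrn_n_s32 */
    int64_t rounded = ((int64_t)acc + ((int64_t)1 << (q - 1))) >> q;
    return (int16_t)(uint16_t)rounded;
}

/* Hypothetical scalar reference for the quoted loop: one output per channel.
 * The accumulator is unsigned so that it wraps modulo 2^32, mirroring the
 * 32-bit lanes used by vmull_lane_s16 / vmlal_lane_s16. */
static void roi_scalar(const int16_t *f1, const int16_t *f2,
                       const int16_t *f3, const int16_t *f4,
                       const int16_t sz[4], int q,
                       int16_t *out, size_t channels)
{
    for (size_t c = 0; c < channels; ++c) {
        uint32_t acc = (uint32_t)((int32_t)f1[c] * sz[0]);
        acc += (uint32_t)((int32_t)f2[c] * sz[1]);
        acc += (uint32_t)((int32_t)f3[c] * sz[2]);
        acc += (uint32_t)((int32_t)f4[c] * sz[3]);
        out[c] = rshrn_s32((int32_t)acc, q);
    }
}
```

Running the NEON loop built with each compiler and comparing its output against `roi_scalar` on the same inputs would show whether the two builds differ only in speed and not in results.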
Re: ARM GCC 8.x Performance Dropping Compared to Linaro GCC 7.x
Hi Maxim,

Thank you very much for looking into this!
Hope you can fix this regression and bring performance back to GCC! :D

Yupeng Chang
Aug 22 2019

On Wed, Aug 21, 2019 at 10:21 PM Maxim Kuvyrkov wrote:

> Hi Yupeng,
>
> Great testcase, thanks!
>
> I've investigated this, and there are two separate changes between GCC 7
> and GCC 8, each causing half of the regression.
>
> The first regression is due to the compiler making unlucky decisions. Before
> the regression the compiler just got lucky, and I'll look into bringing these
> lucky decisions back.
>
> The second regression is due to changed tuning of the compiler. The new
> setting is better on average, and this testcase happens to regress. I
> don't know whether we'll manage to fix that.
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>
>
> > On Aug 21, 2019, at 9:55 AM, Yupeng Chang wrote:
> >
> > Hi Maxim,
> > Attached is the testcase.
> >
> > Please follow these steps to test:
> > 1. download GCC 8.3 from:
> > https://developer.arm.com/-/media/Files/downloads/gnu-a/8.3-2019.03/binrel/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu.tar.xz?revision=2e88a73f-d233-4f96-b1f4-d8b36e9bb0b9&la=en
> > 2. download GCC 7.4 from:
> > https://releases.linaro.org/components/toolchain/binaries/latest-7/aarch64-linux-gnu/gcc-linaro-7.4.1-2019.02-x86_64_aarch64-linux-gnu.tar.xz
> >
> > Please extract the toolchain into /usr/local
> >
> > 3. extract the attached package arm-performance.tar.xz into your linux
> > machine's home folder
> > cd into arm-performance
> > run "make" to generate the test program and the ASM code dumped from .o file
> >
> >
> > On Tue, Aug 20, 2019 at 9:15 PM Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> wrote:
> > Hi Yupeng,
> >
> > There are many changes from Linaro GCC 7.x to ARM GCC 8.x, so it is
> > difficult to guess what may be going wrong.
> >
> > Do you have a testcase that you can share? With a testcase we can
> > investigate the problem and, possibly, fix it.