Hi Maxim,
I applied this patch to ARM GCC 8.3 2019.03, and it works!
GCC 8.3 with this patch generates much faster code than GCC 8.3
without it.

However, the code is still slightly slower than that generated by GCC 7.x.

I'll run more tests to see whether there are other regressions.
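
For anyone repeating the comparison, here is a minimal sketch of a check-and-time harness. It is not the code in arm-performance.tar.xz: kernel_fn, blend_ref, fill_inputs, matches_ref, seconds_per_call and CHANNELS are names made up for illustration, and Q_VALUE = 13 is only inferred from the "#13" in the rshrn instructions quoted below. The idea is to build the same file with each toolchain, confirm that the kernel under test agrees with a scalar reference, and then time it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define Q_VALUE 13            /* assumption: inferred from "rshrn ..., #13" */
enum { CHANNELS = 64 };       /* assumption: any multiple of 16 would do */

/* Hypothetical signature shared by the reference and the kernel under test. */
typedef void (*kernel_fn)(const int16_t *f1, const int16_t *f2,
                          const int16_t *f3, const int16_t *f4,
                          const int16_t sz[4], int16_t *out, uint32_t channels);

/* Scalar model of one lane of vrshrn_n_s32: add the rounding constant,
   shift right, keep the low 16 bits.  Relies on arithmetic right shift of
   negative values, which GCC provides. */
static int16_t rshrn_s32(int32_t v, int shift)
{
    assert(shift >= 1 && shift <= 16);
    int64_t rounded = ((int64_t)v + ((int64_t)1 << (shift - 1))) >> shift;
    return (int16_t)rounded;
}

/* Scalar reference for the loop: four multiply-accumulates per channel,
   then a rounding narrowing shift.  Inputs are assumed small enough that
   the 32-bit sum cannot overflow. */
static void blend_ref(const int16_t *f1, const int16_t *f2,
                      const int16_t *f3, const int16_t *f4,
                      const int16_t sz[4], int16_t *out, uint32_t channels)
{
    for (uint32_t c = 0; c < channels; ++c) {
        int32_t s = (int32_t)f1[c] * sz[0] + (int32_t)f2[c] * sz[1]
                  + (int32_t)f3[c] * sz[2] + (int32_t)f4[c] * sz[3];
        out[c] = rshrn_s32(s, Q_VALUE);
    }
}

/* Deterministic, small test data so every build sees the same inputs. */
static void fill_inputs(int16_t f[4][CHANNELS], int16_t sz[4])
{
    for (int i = 0; i < 4; ++i) {
        sz[i] = (int16_t)(1000 * (i + 1));
        for (int c = 0; c < CHANNELS; ++c)
            f[i][c] = (int16_t)((c * 7 + i * 13) % 200 - 100);
    }
}

/* Returns 1 when kernel k reproduces the reference output exactly. */
static int matches_ref(kernel_fn k)
{
    int16_t f[4][CHANNELS], sz[4], want[CHANNELS], got[CHANNELS];
    fill_inputs(f, sz);
    blend_ref(f[0], f[1], f[2], f[3], sz, want, CHANNELS);
    k(f[0], f[1], f[2], f[3], sz, got, CHANNELS);
    return memcmp(want, got, sizeof want) == 0;
}

/* Average CPU seconds per call of k over reps calls, measured with clock(). */
static double seconds_per_call(kernel_fn k, int reps)
{
    int16_t f[4][CHANNELS], sz[4], out[CHANNELS];
    volatile int16_t sink;    /* keeps the result observably used */
    fill_inputs(f, sz);
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        k(f[0], f[1], f[2], f[3], sz, out, CHANNELS);
    clock_t t1 = clock();
    sink = out[0];
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}
```

The NEON loop from the original report would be wrapped in a function with the kernel_fn signature and passed to matches_ref() and seconds_per_call(). clock() measures CPU time at coarse resolution, so a large reps value is assumed.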

Yupeng Chang
Aug 23 2019

On Fri, Aug 23, 2019 at 2:14 PM Maxim Kuvyrkov <maxim.kuvyr...@linaro.org>
wrote:

> Hi Yupeng,
>
> Thanks for the offer.  Attached is a patch against trunk, and it fixes
> about 2/3 of the regression.  You should be able to apply it to
> arm-8-branch or gcc-8-branch as well (the only important part is the changes
> to autopref_rank_for_schedule).
>
> The other 1/3 will require much more work -- NEON intrinsics need to be
> converted from inline asm to GCC builtins, so that we can attach scheduler
> descriptions to them.
>
> Please let me know the performance results and, especially, whether this patch
> regresses any of your other testcases.
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>
>
> > On Aug 23, 2019, at 8:34 AM, Yupeng Chang <chang...@gmail.com> wrote:
> >
> > Hi Maxim,
> > If you have any patches that need to be tested, you can send them to me.
> > I can help you test this.
> >
> > Yupeng Chang
> > Aug 23 2019
> >
> > On Wed, Aug 21, 2019 at 10:21 PM Maxim Kuvyrkov <maxim.kuvyr...@linaro.org>
> > wrote:
> > Hi Yupeng,
> >
> > Great testcase, thanks!
> >
> > I've investigated this, and there are two separate changes between GCC 7
> > and GCC 8, each causing half of the regression.
> >
> > The first regression is due to the compiler making unlucky decisions.  Before
> > the regression the compiler just got lucky, and I'll look into bringing these
> > lucky decisions back.
> >
> > The second regression is due to changed tuning of the compiler.  The new
> > setting is better on average, but this testcase happens to regress.  I
> > don't know whether we'll manage to fix that.
> >
> > --
> > Maxim Kuvyrkov
> > www.linaro.org
> >
> >
> >
> > > On Aug 21, 2019, at 9:55 AM, Yupeng Chang <chang...@gmail.com> wrote:
> > >
> > > Hi Maxim,
> > > Attached is the testcase.
> > >
> > > Please follow these steps to test:
> > > 1. download GCC 8.3 from:
> > > https://developer.arm.com/-/media/Files/downloads/gnu-a/8.3-2019.03/binrel/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu.tar.xz?revision=2e88a73f-d233-4f96-b1f4-d8b36e9bb0b9&la=en
> > > 2. download GCC 7.4 from:
> > > https://releases.linaro.org/components/toolchain/binaries/latest-7/aarch64-linux-gnu/gcc-linaro-7.4.1-2019.02-x86_64_aarch64-linux-gnu.tar.xz
> > >
> > > Please extract the toolchain into /usr/local
> > >
> > > 3. extract the attached package arm-performance.tar.xz into the home
> > > folder of your Linux machine
> > > cd into arm-performance
> > > run "make" to build the test program and the assembly dumped from the
> > > .o file
> > >
> > >
> > > On Tue, Aug 20, 2019 at 9:15 PM Maxim Kuvyrkov <maxim.kuvyr...@linaro.org>
> > > wrote:
> > > Hi Yupeng,
> > >
> > > There are many changes from Linaro GCC 7.x to ARM GCC 8.x, so it is
> > > difficult to guess what may be going wrong.
> > >
> > > Do you have a testcase that you can share?  With a testcase we can
> > > investigate the problem and, possibly, fix it.
> > >
> > > Regards,
> > >
> > > --
> > > Maxim Kuvyrkov
> > > www.linaro.org
> > >
> > >
> > >
> > > > On Aug 19, 2019, at 7:14 AM, Yupeng Chang <chang...@gmail.com>
> > > > wrote:
> > > >
> > > > Dear Linaro Team,
> > > > I recently found a very strange code performance issue.
> > > > I have a loop written with GCC NEON intrinsics.
> > > > The binary generated from this code by Linaro GCC 7.x is much faster
> > > > than the one generated by ARM GCC 8.x.
> > > >
> > > > My CPU is an ARM Cortex-A53 (AArch64).
> > > > The compile options are:
> > > > -Wall -O3 -mcpu=cortex-a53+crypto
> > > >
> > > > The code looks like this:
> > > >     for (uint32 c = 0; c < channels; c += 16, roi_result += 16) {
> > > >         int32x4_t       S1, S2, S3, S4;
> > > >         int16x4_t       DT;
> > > >
> > > >         DT = vld1_s16(feature1 + c + 0);
> > > >         S1 = vmull_lane_s16(DT, SZ, 0);
> > > >         DT = vld1_s16(feature1 + c + 4);
> > > >         S2 = vmull_lane_s16(DT, SZ, 0);
> > > >         DT = vld1_s16(feature1 + c + 8);
> > > >         S3 = vmull_lane_s16(DT, SZ, 0);
> > > >         DT = vld1_s16(feature1 + c + 12);
> > > >         S4 = vmull_lane_s16(DT, SZ, 0);
> > > >
> > > >         DT = vld1_s16(feature2 + c + 0);
> > > >         S1 = vmlal_lane_s16(S1, DT, SZ, 1);
> > > >         DT = vld1_s16(feature2 + c + 4);
> > > >         S2 = vmlal_lane_s16(S2, DT, SZ, 1);
> > > >         DT = vld1_s16(feature2 + c + 8);
> > > >         S3 = vmlal_lane_s16(S3, DT, SZ, 1);
> > > >         DT = vld1_s16(feature2 + c + 12);
> > > >         S4 = vmlal_lane_s16(S4, DT, SZ, 1);
> > > >
> > > >         DT = vld1_s16(feature3 + c + 0);
> > > >         S1 = vmlal_lane_s16(S1, DT, SZ, 2);
> > > >         DT = vld1_s16(feature3 + c + 4);
> > > >         S2 = vmlal_lane_s16(S2, DT, SZ, 2);
> > > >         DT = vld1_s16(feature3 + c + 8);
> > > >         S3 = vmlal_lane_s16(S3, DT, SZ, 2);
> > > >         DT = vld1_s16(feature3 + c + 12);
> > > >         S4 = vmlal_lane_s16(S4, DT, SZ, 2);
> > > >
> > > >         DT = vld1_s16(feature4 + c + 0);
> > > >         S1 = vmlal_lane_s16(S1, DT, SZ, 3);
> > > >         DT = vld1_s16(feature4 + c + 4);
> > > >         S2 = vmlal_lane_s16(S2, DT, SZ, 3);
> > > >         DT = vld1_s16(feature4 + c + 8);
> > > >         S3 = vmlal_lane_s16(S3, DT, SZ, 3);
> > > >         DT = vld1_s16(feature4 + c + 12);
> > > >         S4 = vmlal_lane_s16(S4, DT, SZ, 3);
> > > >
> > > >         DT = vrshrn_n_s32(S1, Q_VALUE);
> > > >         vst1_s16(roi_result + 0, DT);
> > > >         DT = vrshrn_n_s32(S2, Q_VALUE);
> > > >         vst1_s16(roi_result + 4, DT);
> > > >         DT = vrshrn_n_s32(S3, Q_VALUE);
> > > >         vst1_s16(roi_result + 8, DT);
> > > >         DT = vrshrn_n_s32(S4, Q_VALUE);
> > > >         vst1_s16(roi_result + 12, DT);
> > > >     }
> > > >
> > > > Code generated by GCC7:
> > > >  294:   6b10031f    cmp w24, w16
> > > >  298:   fc606959    ldr d25, [x10, x0]
> > > >  29c:   fc686922    ldr d2, [x9, x8]
> > > >  2a0:   fc676921    ldr d1, [x9, x7]
> > > >  2a4:   fc666920    ldr d0, [x9, x6]
> > > >  2a8:   fc686958    ldr d24, [x10, x8]
> > > >  2ac:   fc676957    ldr d23, [x10, x7]
> > > >  2b0:   fc666956    ldr d22, [x10, x6]
> > > >  2b4:   fc606855    ldr d21, [x2, x0]
> > > >  2b8:   fc686854    ldr d20, [x2, x8]
> > > >  2bc:   fc676853    ldr d19, [x2, x7]
> > > >  2c0:   fc666852    ldr d18, [x2, x6]
> > > >  2c4:   fc606891    ldr d17, [x4, x0]
> > > >  2c8:   fc686890    ldr d16, [x4, x8]
> > > >  2cc:   fc676887    ldr d7, [x4, x7]
> > > >  2d0:   fc666885    ldr d5, [x4, x6]
> > > >  2d4:   0f44a063    smull   v3.4s, v3.4h, v4.h[0]
> > > >  2d8:   0f44a042    smull   v2.4s, v2.4h, v4.h[0]
> > > >  2dc:   0f44a021    smull   v1.4s, v1.4h, v4.h[0]
> > > >  2e0:   0f44a000    smull   v0.4s, v0.4h, v4.h[0]
> > > >  2e4:   0f542323    smlal   v3.4s, v25.4h, v4.h[1]
> > > >  2e8:   0f542302    smlal   v2.4s, v24.4h, v4.h[1]
> > > >  2ec:   0f5422e1    smlal   v1.4s, v23.4h, v4.h[1]
> > > >  2f0:   0f5422c0    smlal   v0.4s, v22.4h, v4.h[1]
> > > >  2f4:   0f6422a3    smlal   v3.4s, v21.4h, v4.h[2]
> > > >  2f8:   0f642282    smlal   v2.4s, v20.4h, v4.h[2]
> > > >  2fc:   0f642261    smlal   v1.4s, v19.4h, v4.h[2]
> > > >  300:   0f642240    smlal   v0.4s, v18.4h, v4.h[2]
> > > >  304:   0f742223    smlal   v3.4s, v17.4h, v4.h[3]
> > > >  308:   0f742202    smlal   v2.4s, v16.4h, v4.h[3]
> > > >  30c:   0f7420e1    smlal   v1.4s, v7.4h, v4.h[3]
> > > >  310:   0f7420a0    smlal   v0.4s, v5.4h, v4.h[3]
> > > >  314:   0f138c63    rshrn   v3.4h, v3.4s, #13
> > > >  318:   0f138c42    rshrn   v2.4h, v2.4s, #13
> > > >  31c:   0f138c21    rshrn   v1.4h, v1.4s, #13
> > > >  320:   0f138c00    rshrn   v0.4h, v0.4s, #13
> > > >  324:   6d3e0a63    stp d3, d2, [x19, #-32]
> > > >  328:   6d3f0261    stp d1, d0, [x19, #-16]
> > > >
> > > > Code generated by GCC8:
> > > >
> > > >  26c:   6b0b02ff    cmp w23, w11
> > > >  270:   fc606922    ldr d2, [x9, x0]
> > > >  274:   fc666941    ldr d1, [x10, x6]
> > > >  278:   fc666920    ldr d0, [x9, x6]
> > > >  27c:   0f44a000    smull   v0.4s, v0.4h, v4.h[0]
> > > >  280:   0f542020    smlal   v0.4s, v1.4h, v4.h[1]
> > > >  284:   fc6668e1    ldr d1, [x7, x6]
> > > >  288:   0f642020    smlal   v0.4s, v1.4h, v4.h[2]
> > > >  28c:   fc646945    ldr d5, [x10, x4]
> > > >  290:   fc666901    ldr d1, [x8, x6]
> > > >  294:   0f742020    smlal   v0.4s, v1.4h, v4.h[3]
> > > >  298:   fc646921    ldr d1, [x9, x4]
> > > >  29c:   0f44a021    smull   v1.4s, v1.4h, v4.h[0]
> > > >  2a0:   0f5420a1    smlal   v1.4s, v5.4h, v4.h[1]
> > > >  2a4:   fc626945    ldr d5, [x10, x2]
> > > >  2a8:   0f138c03    rshrn   v3.4h, v0.4s, #13
> > > >  2ac:   fc626920    ldr d0, [x9, x2]
> > > >  2b0:   0f44a000    smull   v0.4s, v0.4h, v4.h[0]
> > > >  2b4:   0f5420a0    smlal   v0.4s, v5.4h, v4.h[1]
> > > >  2b8:   fc606945    ldr d5, [x10, x0]
> > > >  2bc:   0f44a042    smull   v2.4s, v2.4h, v4.h[0]
> > > >  2c0:   0f5420a2    smlal   v2.4s, v5.4h, v4.h[1]
> > > >  2c4:   fc6468e5    ldr d5, [x7, x4]
> > > >  2c8:   0f6420a1    smlal   v1.4s, v5.4h, v4.h[2]
> > > >  2cc:   fc6268e5    ldr d5, [x7, x2]
> > > >  2d0:   0f6420a0    smlal   v0.4s, v5.4h, v4.h[2]
> > > >  2d4:   fc6068e5    ldr d5, [x7, x0]
> > > >  2d8:   0f6420a2    smlal   v2.4s, v5.4h, v4.h[2]
> > > >  2dc:   fc646905    ldr d5, [x8, x4]
> > > >  2e0:   0f7420a1    smlal   v1.4s, v5.4h, v4.h[3]
> > > >  2e4:   fc626905    ldr d5, [x8, x2]
> > > >  2e8:   0f138c21    rshrn   v1.4h, v1.4s, #13
> > > >  2ec:   0f7420a0    smlal   v0.4s, v5.4h, v4.h[3]
> > > >  2f0:   0f138c00    rshrn   v0.4h, v0.4s, #13
> > > >  2f4:   fc606905    ldr d5, [x8, x0]
> > > >  2f8:   0f7420a2    smlal   v2.4s, v5.4h, v4.h[3]
> > > >  2fc:   0f138c42    rshrn   v2.4h, v2.4s, #13
> > > >  300:   6d000e62    stp d2, d3, [x19]
> > > >  304:   6d010261    stp d1, d0, [x19, #16]
> > > >  308:   91008273    add x19, x19, #0x20
> > > >
> > > > I tested different compile options and found that with
> > > > "-fschedule-insns" GCC7 generates code that runs faster; if I disable
> > > > it, GCC7 generates the same code as GCC8.
> > > > However, this option does not seem to work on GCC8: if I enable
> > > > "-fschedule-insns" with GCC8, the generated code is even slower. If
> > > > I disable "-fschedule-insns" with GCC8, the generated code simply
> > > > follows the order of the C source.
> > > >
> > > > I compiled my code with -O3, which means -fschedule-insns is enabled
> > > > by default.
> > > >
> > > > With this option enabled, GCC7 reschedules the instructions and seems
> > > > to group instructions of the same kind together, whereas GCC8 either
> > > > does not do that or schedules them in a worse way.
> > > >
> > > > My question is: is this behavior expected in GCC8, GCC9 and future
> > > > versions?
> > > > Is this change in GCC instruction scheduling related to the Spectre
> > > > mitigation fixes?
> > > >
> > > > If I want the GCC7 instruction scheduling behavior in GCC8, what can
> > > > I do?
> > > >
> > > > Thank you for looking into this.
> > > >
> > > > Looking forward to your reply!
> > > >
> > > > Tomas Chang
> > > > Aug 19, 2019
> > > > _______________________________________________
> > > > linaro-toolchain mailing list
> > > > linaro-toolchain@lists.linaro.org
> > > > https://lists.linaro.org/mailman/listinfo/linaro-toolchain
> > >
> > > <arm-performance.tar.xz>
> >
>