Re: ARM GCC 8.x Performance Dropping Compared to Linaro GCC 7.x

Yupeng Chang Thu, 29 Aug 2019 22:55:01 -0700

Hi Maxim,
This patch does fix part of the regression.
I don't see any regressions in other software with this patch.


I'm waiting for more updates / fixes from you.

And thank you very much for your work!

Yupeng Chang
Aug 30 2019

On Thu, Aug 29, 2019 at 10:51 PM Maxim Kuvyrkov <maxim.kuvyr...@linaro.org>
wrote:

> Hi Yupeng,
>
> Any update?
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>
>
> > On Aug 23, 2019, at 12:09 PM, Yupeng Chang <chang...@gmail.com> wrote:
> >
> > Hi Maxim,
> > I applied this patch to ARM GCC 8.3 2019.03, and it works!
> > GCC 8.3 with this patch generates much faster code than GCC 8.3
> > without this patch.
> >
> > But the code is still slightly slower than that generated by GCC 7.x.
> >
> > I'll do more tests to see if there are other regressions.
> >
> > Yupeng Chang
> > Aug 23 2019
> >
> > On Fri, Aug 23, 2019 at 2:14 PM Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> wrote:
> > Hi Yupeng,
> >
> > Thanks for the offer.  Attached is a patch against trunk, and it fixes
> > about 2/3 of the regression.  You should be able to apply it to
> > arm-8-branch or gcc-8-branch as well (the only important part is the
> > changes to autopref_rank_for_schedule).
> >
> > The other 1/3 will require much more work -- NEON intrinsics need to be
> > converted from inline asm to GCC builtins, so that we can attach scheduler
> > descriptions to them.
> >
> > Please let me know performance results and, especially, whether this
> > patch regresses any of your other testcases.
> >
> > --
> > Maxim Kuvyrkov
> > www.linaro.org
> >
> >
> >
> > > On Aug 23, 2019, at 8:34 AM, Yupeng Chang <chang...@gmail.com> wrote:
> > >
> > > Hi Maxim,
> > > If you have any patches that need to be tested, you can send them to
> > > me.
> > > I can help you test them.
> > >
> > > Yupeng Chang
> > > Aug 23 2019
> > >
> > > On Wed, Aug 21, 2019 at 10:21 PM Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> wrote:
> > > Hi Yupeng,
> > >
> > > Great testcase, thanks!
> > >
> > > I've investigated this, and there are two separate changes between GCC
> > > 7 and GCC 8, each causing half of the regression.
> > >
> > > The first regression is due to the compiler making unlucky decisions.
> > > Before the regression the compiler just got lucky, and I'll look into
> > > bringing these lucky decisions back.
> > >
> > > The second regression is due to changed tuning of the compiler.  The
> > > new setting is better on average, but this testcase happens to regress.
> > > I don't know whether we'll manage to fix that.
> > >
> > > --
> > > Maxim Kuvyrkov
> > > www.linaro.org
> > >
> > >
> > >
> > > > > On Aug 21, 2019, at 9:55 AM, Yupeng Chang <chang...@gmail.com> wrote:
> > > >
> > > > Hi Maxim,
> > > > Attached is the testcase.
> > > >
> > > > Please follow these steps to test:
> > > > 1. Download GCC 8.3 from:
> > > >    https://developer.arm.com/-/media/Files/downloads/gnu-a/8.3-2019.03/binrel/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu.tar.xz?revision=2e88a73f-d233-4f96-b1f4-d8b36e9bb0b9&la=en
> > > > 2. Download GCC 7.4 from:
> > > >    https://releases.linaro.org/components/toolchain/binaries/latest-7/aarch64-linux-gnu/gcc-linaro-7.4.1-2019.02-x86_64_aarch64-linux-gnu.tar.xz
> > > >
> > > > Please extract both toolchains into /usr/local.
> > > >
> > > > 3. Extract the attached package arm-performance.tar.xz into your
> > > >    Linux machine's home folder, cd into arm-performance, and run
> > > >    "make" to generate the test program and the ASM code dumped from
> > > >    the .o file.
> > > >
> > > >
> > > > On Tue, Aug 20, 2019 at 9:15 PM Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> wrote:
> > > > Hi Yupeng,
> > > >
> > > > There are many changes from Linaro GCC 7.x to ARM GCC 8.x, so it is
> > > > difficult to guess what may be going wrong.
> > > >
> > > > Do you have a testcase that you can share?  With a testcase we can
> > > > investigate the problem and, possibly, fix it.
> > > >
> > > > Regards,
> > > >
> > > > --
> > > > Maxim Kuvyrkov
> > > > www.linaro.org
> > > >
> > > >
> > > >
> > > > > > On Aug 19, 2019, at 7:14 AM, Yupeng Chang <chang...@gmail.com> wrote:
> > > > >
> > > > > Dear Linaro Team,
> > > > > I recently found a very strange issue regarding code performance.
> > > > > I have a loop written with GCC NEON intrinsics.
> > > > > The binary generated from this code by Linaro GCC 7.x is much faster
> > > > > than the one generated by ARM GCC 8.x.
> > > > >
> > > > > My CPU is an ARM Cortex-A53 (AArch64).
> > > > > The compile options are:
> > > > > -Wall -O3 -mcpu=cortex-a53+crypto
> > > > >
> > > > > The code looks like this:
> > > > >     for (uint32 c = 0; c < channels; c += 16, roi_result += 16) {
> > > > >         int32x4_t       S1, S2, S3, S4;
> > > > >         int16x4_t       DT;
> > > > >
> > > > >         DT = vld1_s16(feature1 + c + 0);
> > > > >         S1 = vmull_lane_s16(DT, SZ, 0);
> > > > >         DT = vld1_s16(feature1 + c + 4);
> > > > >         S2 = vmull_lane_s16(DT, SZ, 0);
> > > > >         DT = vld1_s16(feature1 + c + 8);
> > > > >         S3 = vmull_lane_s16(DT, SZ, 0);
> > > > >         DT = vld1_s16(feature1 + c + 12);
> > > > >         S4 = vmull_lane_s16(DT, SZ, 0);
> > > > >
> > > > >         DT = vld1_s16(feature2 + c + 0);
> > > > >         S1 = vmlal_lane_s16(S1, DT, SZ, 1);
> > > > >         DT = vld1_s16(feature2 + c + 4);
> > > > >         S2 = vmlal_lane_s16(S2, DT, SZ, 1);
> > > > >         DT = vld1_s16(feature2 + c + 8);
> > > > >         S3 = vmlal_lane_s16(S3, DT, SZ, 1);
> > > > >         DT = vld1_s16(feature2 + c + 12);
> > > > >         S4 = vmlal_lane_s16(S4, DT, SZ, 1);
> > > > >
> > > > >         DT = vld1_s16(feature3 + c + 0);
> > > > >         S1 = vmlal_lane_s16(S1, DT, SZ, 2);
> > > > >         DT = vld1_s16(feature3 + c + 4);
> > > > >         S2 = vmlal_lane_s16(S2, DT, SZ, 2);
> > > > >         DT = vld1_s16(feature3 + c + 8);
> > > > >         S3 = vmlal_lane_s16(S3, DT, SZ, 2);
> > > > >         DT = vld1_s16(feature3 + c + 12);
> > > > >         S4 = vmlal_lane_s16(S4, DT, SZ, 2);
> > > > >
> > > > >         DT = vld1_s16(feature4 + c + 0);
> > > > >         S1 = vmlal_lane_s16(S1, DT, SZ, 3);
> > > > >         DT = vld1_s16(feature4 + c + 4);
> > > > >         S2 = vmlal_lane_s16(S2, DT, SZ, 3);
> > > > >         DT = vld1_s16(feature4 + c + 8);
> > > > >         S3 = vmlal_lane_s16(S3, DT, SZ, 3);
> > > > >         DT = vld1_s16(feature4 + c + 12);
> > > > >         S4 = vmlal_lane_s16(S4, DT, SZ, 3);
> > > > >
> > > > >         DT = vrshrn_n_s32(S1, Q_VALUE);
> > > > >         vst1_s16(roi_result + 0, DT);
> > > > >         DT = vrshrn_n_s32(S2, Q_VALUE);
> > > > >         vst1_s16(roi_result + 4, DT);
> > > > >         DT = vrshrn_n_s32(S3, Q_VALUE);
> > > > >         vst1_s16(roi_result + 8, DT);
> > > > >         DT = vrshrn_n_s32(S4, Q_VALUE);
> > > > >         vst1_s16(roi_result + 12, DT);
> > > > >     }
> > > > >
> > > > > Code generated by GCC7:
> > > > >  294:   6b10031f    cmp w24, w16
> > > > >  298:   fc606959    ldr d25, [x10, x0]
> > > > >  29c:   fc686922    ldr d2, [x9, x8]
> > > > >  2a0:   fc676921    ldr d1, [x9, x7]
> > > > >  2a4:   fc666920    ldr d0, [x9, x6]
> > > > >  2a8:   fc686958    ldr d24, [x10, x8]
> > > > >  2ac:   fc676957    ldr d23, [x10, x7]
> > > > >  2b0:   fc666956    ldr d22, [x10, x6]
> > > > >  2b4:   fc606855    ldr d21, [x2, x0]
> > > > >  2b8:   fc686854    ldr d20, [x2, x8]
> > > > >  2bc:   fc676853    ldr d19, [x2, x7]
> > > > >  2c0:   fc666852    ldr d18, [x2, x6]
> > > > >  2c4:   fc606891    ldr d17, [x4, x0]
> > > > >  2c8:   fc686890    ldr d16, [x4, x8]
> > > > >  2cc:   fc676887    ldr d7, [x4, x7]
> > > > >  2d0:   fc666885    ldr d5, [x4, x6]
> > > > >  2d4:   0f44a063    smull   v3.4s, v3.4h, v4.h[0]
> > > > >  2d8:   0f44a042    smull   v2.4s, v2.4h, v4.h[0]
> > > > >  2dc:   0f44a021    smull   v1.4s, v1.4h, v4.h[0]
> > > > >  2e0:   0f44a000    smull   v0.4s, v0.4h, v4.h[0]
> > > > >  2e4:   0f542323    smlal   v3.4s, v25.4h, v4.h[1]
> > > > >  2e8:   0f542302    smlal   v2.4s, v24.4h, v4.h[1]
> > > > >  2ec:   0f5422e1    smlal   v1.4s, v23.4h, v4.h[1]
> > > > >  2f0:   0f5422c0    smlal   v0.4s, v22.4h, v4.h[1]
> > > > >  2f4:   0f6422a3    smlal   v3.4s, v21.4h, v4.h[2]
> > > > >  2f8:   0f642282    smlal   v2.4s, v20.4h, v4.h[2]
> > > > >  2fc:   0f642261    smlal   v1.4s, v19.4h, v4.h[2]
> > > > >  300:   0f642240    smlal   v0.4s, v18.4h, v4.h[2]
> > > > >  304:   0f742223    smlal   v3.4s, v17.4h, v4.h[3]
> > > > >  308:   0f742202    smlal   v2.4s, v16.4h, v4.h[3]
> > > > >  30c:   0f7420e1    smlal   v1.4s, v7.4h, v4.h[3]
> > > > >  310:   0f7420a0    smlal   v0.4s, v5.4h, v4.h[3]
> > > > >  314:   0f138c63    rshrn   v3.4h, v3.4s, #13
> > > > >  318:   0f138c42    rshrn   v2.4h, v2.4s, #13
> > > > >  31c:   0f138c21    rshrn   v1.4h, v1.4s, #13
> > > > >  320:   0f138c00    rshrn   v0.4h, v0.4s, #13
> > > > >  324:   6d3e0a63    stp d3, d2, [x19, #-32]
> > > > >  328:   6d3f0261    stp d1, d0, [x19, #-16]
> > > > >
> > > > > Code generated by GCC8:
> > > > >
> > > > >  26c:   6b0b02ff    cmp w23, w11
> > > > >  270:   fc606922    ldr d2, [x9, x0]
> > > > >  274:   fc666941    ldr d1, [x10, x6]
> > > > >  278:   fc666920    ldr d0, [x9, x6]
> > > > >  27c:   0f44a000    smull   v0.4s, v0.4h, v4.h[0]
> > > > >  280:   0f542020    smlal   v0.4s, v1.4h, v4.h[1]
> > > > >  284:   fc6668e1    ldr d1, [x7, x6]
> > > > >  288:   0f642020    smlal   v0.4s, v1.4h, v4.h[2]
> > > > >  28c:   fc646945    ldr d5, [x10, x4]
> > > > >  290:   fc666901    ldr d1, [x8, x6]
> > > > >  294:   0f742020    smlal   v0.4s, v1.4h, v4.h[3]
> > > > >  298:   fc646921    ldr d1, [x9, x4]
> > > > >  29c:   0f44a021    smull   v1.4s, v1.4h, v4.h[0]
> > > > >  2a0:   0f5420a1    smlal   v1.4s, v5.4h, v4.h[1]
> > > > >  2a4:   fc626945    ldr d5, [x10, x2]
> > > > >  2a8:   0f138c03    rshrn   v3.4h, v0.4s, #13
> > > > >  2ac:   fc626920    ldr d0, [x9, x2]
> > > > >  2b0:   0f44a000    smull   v0.4s, v0.4h, v4.h[0]
> > > > >  2b4:   0f5420a0    smlal   v0.4s, v5.4h, v4.h[1]
> > > > >  2b8:   fc606945    ldr d5, [x10, x0]
> > > > >  2bc:   0f44a042    smull   v2.4s, v2.4h, v4.h[0]
> > > > >  2c0:   0f5420a2    smlal   v2.4s, v5.4h, v4.h[1]
> > > > >  2c4:   fc6468e5    ldr d5, [x7, x4]
> > > > >  2c8:   0f6420a1    smlal   v1.4s, v5.4h, v4.h[2]
> > > > >  2cc:   fc6268e5    ldr d5, [x7, x2]
> > > > >  2d0:   0f6420a0    smlal   v0.4s, v5.4h, v4.h[2]
> > > > >  2d4:   fc6068e5    ldr d5, [x7, x0]
> > > > >  2d8:   0f6420a2    smlal   v2.4s, v5.4h, v4.h[2]
> > > > >  2dc:   fc646905    ldr d5, [x8, x4]
> > > > >  2e0:   0f7420a1    smlal   v1.4s, v5.4h, v4.h[3]
> > > > >  2e4:   fc626905    ldr d5, [x8, x2]
> > > > >  2e8:   0f138c21    rshrn   v1.4h, v1.4s, #13
> > > > >  2ec:   0f7420a0    smlal   v0.4s, v5.4h, v4.h[3]
> > > > >  2f0:   0f138c00    rshrn   v0.4h, v0.4s, #13
> > > > >  2f4:   fc606905    ldr d5, [x8, x0]
> > > > >  2f8:   0f7420a2    smlal   v2.4s, v5.4h, v4.h[3]
> > > > >  2fc:   0f138c42    rshrn   v2.4h, v2.4s, #13
> > > > >  300:   6d000e62    stp d2, d3, [x19]
> > > > >  304:   6d010261    stp d1, d0, [x19, #16]
> > > > >  308:   91008273    add x19, x19, #0x20
> > > > >
> > > > > I ran some tests with different compile options and found that
> > > > > "-fschedule-insns" makes GCC7 generate code that runs faster. If I
> > > > > disable it, GCC7 generates the same code as GCC8.
> > > > > However, this option does not seem to work on GCC8: if I enable
> > > > > "-fschedule-insns" with GCC8, the generated code is even slower. If
> > > > > I disable "-fschedule-insns" with GCC8, the generated code simply
> > > > > follows the order of the statements in the C code.
> > > > >
> > > > > I compiled my code with -O3, which means -fschedule-insns is enabled
> > > > > by default.
> > > > >
> > > > > With this option enabled, GCC7 reschedules instructions and seems to
> > > > > group instructions of the same kind together, whereas GCC8 either
> > > > > doesn't do that or reschedules instructions in a worse way.
> > > > >
> > > > > My question is: is this behavior expected in GCC8, GCC9 and future
> > > > > versions?
> > > > > Is this change in GCC instruction scheduling related to the "Spectre"
> > > > > mitigation fixes?
> > > > >
> > > > > If I want the same instruction scheduling behavior in GCC8, what can
> > > > > I do?
> > > > >
> > > > > Thank you for looking into this.
> > > > >
> > > > > Looking forward to your reply!
> > > > >
> > > > > Tomas Chang
> > > > > Aug 19, 2019
> > > > > _______________________________________________
> > > > > linaro-toolchain mailing list
> > > > > linaro-toolchain@lists.linaro.org
> > > > > https://lists.linaro.org/mailman/listinfo/linaro-toolchain
> > > >
> > > > <arm-performance.tar.xz>
> > >
>
>
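
For anyone comparing binaries from the two compilers, it may help to first confirm that both produce the same results before looking at timing. Below is a minimal scalar sketch of what the NEON loop in the original report appears to compute: four 16-bit features are multiplied by the four lanes of SZ, accumulated in 32 bits, then rounded, shifted right and narrowed to 16 bits. The names roi_scalar and rshrn_s32 are hypothetical, and Q_VALUE = 13 is an assumption read off the "#13" immediate in the disassembly. The sketch also relies on GCC's behavior for right shifts of negative values (arithmetic) and for narrowing conversions (modular).

```c
#include <assert.h>
#include <stdint.h>

#define Q_VALUE 13 /* assumption: taken from the "#13" immediate of rshrn */

/* Scalar model of vrshrn_n_s32: add the rounding constant, shift right,
 * and keep the low 16 bits (no saturation). */
static int16_t rshrn_s32(int32_t acc, int shift)
{
    assert(shift >= 1 && shift <= 16);
    /* 64-bit intermediate so adding the rounding constant cannot overflow.
     * The >> on a negative value is arithmetic on GCC. */
    int64_t rounded = ((int64_t)acc + ((int64_t)1 << (shift - 1))) >> shift;
    return (int16_t)(uint16_t)rounded;
}

/* Hypothetical scalar reference for the loop body: one output per channel.
 * feature[k] plays the role of feature1..feature4, sz[k] the lanes of SZ. */
static void roi_scalar(const int16_t *feature[4], const int16_t sz[4],
                       int16_t *roi_result, uint32_t channels)
{
    for (uint32_t c = 0; c < channels; c++) {
        /* Unsigned accumulator: wraps modulo 2^32 like smull/smlal,
         * without signed-overflow undefined behavior. */
        uint32_t acc = 0;
        for (int k = 0; k < 4; k++)
            acc += (uint32_t)((int32_t)feature[k][c] * (int32_t)sz[k]);
        roi_result[c] = rshrn_s32((int32_t)acc, Q_VALUE);
    }
}
```

Calling roi_scalar with the same four feature pointers, the SZ coefficients and the channel count, and comparing its output buffer with the one written by the NEON loop, gives a quick correctness check that is independent of how either compiler schedules the instructions.
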
