Hi Maxim, This patch does fix part of the regression. I don't see other regression on other software with this patch.
I'm waiting for more updates / fixes from you. And thank you very much for your work! Yupeng Chang Aug 30 2019 On Thu, Aug 29, 2019 at 10:51 PM Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> wrote: > Hi Yupeng, > > Any update? > > -- > Maxim Kuvyrkov > www.linaro.org > > > > > On Aug 23, 2019, at 12:09 PM, Yupeng Chang <chang...@gmail.com> wrote: > > > > Hi Maxim, > > I applied this patch to ARM GCC 8.3 2019.03, and it works! > > GCC 8.3 with this patch can generate code much faster than the GCC 8.3 > without this patch. > > > > But the code is still slightly slower than it generated by GCC 7.x > > > > I'll do more test to see if there are other regressions. > > > > Yupeng Chang > > Aug 23 2019 > > > > On Fri, Aug 23, 2019 at 2:14 PM Maxim Kuvyrkov < > maxim.kuvyr...@linaro.org> wrote: > > Hi Yupeng, > > > > Thanks for the offer. Attached is a patch against trunk, and it fixes > about 2/3 of the regression. You should be able to apply it to > arm-8-branch or gcc-8-branch as well (the only important part are changes > to autopref_rank_for_schedule). > > > > The other 1/3 will require much more work -- neon intrinsics needs to be > converted from inline asms to GCC builtins, so that we can attach scheduler > descriptions to them. > > > > Please let me know performance results and, especially, whether this > patch regresses any of your other testcases. > > > > -- > > Maxim Kuvyrkov > > www.linaro.org > > > > > > > > > On Aug 23, 2019, at 8:34 AM, Yupeng Chang <chang...@gmail.com> wrote: > > > > > > Hi Maxim, > > > If you have any patches that need to be tested, you can send them to > me. > > > I can help you test this. > > > > > > Yupeng Chang > > > Aug 23 2019 > > > > > > On Wed, Aug 21, 2019 at 10:21 PM Maxim Kuvyrkov < > maxim.kuvyr...@linaro.org> wrote: > > > Hi Yupeng, > > > > > > Great testcase, thanks! > > > > > > I've investigated this, and there are two separate changes between GCC > 7 and GCC 8 each causing half of the regression. > > > > > > The first regression is due compiler making unlucky decisions. Before > the regression compiler just got lucky, and I'll look into bringing these > lucky decisions back. > > > > > > The second regression is due to changed tuning of the compiler. New > setting is better on average, and this testcase happens to regress. I > don't know whether we'll manage to fix that. > > > > > > -- > > > Maxim Kuvyrkov > > > www.linaro.org > > > > > > > > > > > > > On Aug 21, 2019, at 9:55 AM, Yupeng Chang <chang...@gmail.com> > wrote: > > > > > > > > Hi Maxim, > > > > Attached is the testcase. > > > > > > > > Please follow these steps to test: > > > > 1. download GCC 8.3 from: > https://developer.arm.com/-/media/Files/downloads/gnu-a/8.3-2019.03/binrel/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu.tar.xz?revision=2e88a73f-d233-4f96-b1f4-d8b36e9bb0b9&la=en > > > > 2. download GCC 7.4 from: > https://releases.linaro.org/components/toolchain/binaries/latest-7/aarch64-linux-gnu/gcc-linaro-7.4.1-2019.02-x86_64_aarch64-linux-gnu.tar.xz > > > > > > > > Please extract the toolchain into /usr/local > > > > > > > > 3. extract the attached package arm-performance.tar.xz into your > linux machine's home folder > > > > cd into arm-performance > > > > run "make" to generate the test program and the ASM code dumped from > .o file > > > > > > > > > > > > On Tue, Aug 20, 2019 at 9:15 PM Maxim Kuvyrkov < > maxim.kuvyr...@linaro.org> wrote: > > > > Hi Yupeng, > > > > > > > > There are many changes from Linaro GCC 7.x to ARM GCC 8.x, so it is > difficult to guess what may be going wrong. > > > > > > > > Do you have a testcase that you can share? With a testcase we can > investigate the problem and, possibly, fix it. > > > > > > > > Regards, > > > > > > > > -- > > > > Maxim Kuvyrkov > > > > www.linaro.org > > > > > > > > > > > > > > > > > On Aug 19, 2019, at 7:14 AM, Yupeng Chang <chang...@gmail.com> > wrote: > > > > > > > > > > Hi Dear Linaro Team, > > > > > I recently found a very strange issue regarding the code > performance. > > > > > I have a loop written in GCC NEON. > > > > > The binary of this coded generated by Linaro GCC 7.x is much > faster than it > > > > > generated by ARM GCC 8.x > > > > > > > > > > My CPU is ARM Cortex-A53 AARCH64. > > > > > The compile option is: > > > > > -Wall -O3 -mcpu=cortex-a53+crypto > > > > > > > > > > the code is like below: > > > > > for (uint32 c = 0; c < channels; c += 16, roi_result += 16) { > > > > > int32x4_t S1, S2, S3, S4; > > > > > int16x4_t DT; > > > > > > > > > > DT = vld1_s16(feature1 + c + 0); > > > > > S1 = vmull_lane_s16(DT, SZ, 0); > > > > > DT = vld1_s16(feature1 + c + 4); > > > > > S2 = vmull_lane_s16(DT, SZ, 0); > > > > > DT = vld1_s16(feature1 + c + 8); > > > > > S3 = vmull_lane_s16(DT, SZ, 0); > > > > > DT = vld1_s16(feature1 + c + 12); > > > > > S4 = vmull_lane_s16(DT, SZ, 0); > > > > > > > > > > DT = vld1_s16(feature2 + c + 0); > > > > > S1 = vmlal_lane_s16(S1, DT, SZ, 1); > > > > > DT = vld1_s16(feature2 + c + 4); > > > > > S2 = vmlal_lane_s16(S2, DT, SZ, 1); > > > > > DT = vld1_s16(feature2 + c + 8); > > > > > S3 = vmlal_lane_s16(S3, DT, SZ, 1); > > > > > DT = vld1_s16(feature2 + c + 12); > > > > > S4 = vmlal_lane_s16(S4, DT, SZ, 1); > > > > > > > > > > DT = vld1_s16(feature3 + c + 0); > > > > > S1 = vmlal_lane_s16(S1, DT, SZ, 2); > > > > > DT = vld1_s16(feature3 + c + 4); > > > > > S2 = vmlal_lane_s16(S2, DT, SZ, 2); > > > > > DT = vld1_s16(feature3 + c + 8); > > > > > S3 = vmlal_lane_s16(S3, DT, SZ, 2); > > > > > DT = vld1_s16(feature3 + c + 12); > > > > > S4 = vmlal_lane_s16(S4, DT, SZ, 2); > > > > > > > > > > DT = vld1_s16(feature4 + c + 0); > > > > > S1 = vmlal_lane_s16(S1, DT, SZ, 3); > > > > > DT = vld1_s16(feature4 + c + 4); > > > > > S2 = vmlal_lane_s16(S2, DT, SZ, 3); > > > > > DT = vld1_s16(feature4 + c + 8); > > > > > S3 = vmlal_lane_s16(S3, DT, SZ, 3); > > > > > DT = vld1_s16(feature4 + c + 12); > > > > > S4 = vmlal_lane_s16(S4, DT, SZ, 3); > > > > > > > > > > DT = vrshrn_n_s32(S1, Q_VALUE); > > > > > vst1_s16(roi_result + 0, DT); > > > > > DT = vrshrn_n_s32(S2, Q_VALUE); > > > > > vst1_s16(roi_result + 4, DT); > > > > > DT = vrshrn_n_s32(S3, Q_VALUE); > > > > > vst1_s16(roi_result + 8, DT); > > > > > DT = vrshrn_n_s32(S4, Q_VALUE); > > > > > vst1_s16(roi_result + 12, DT); > > > > > } > > > > > > > > > > Code generated by GCC7: > > > > > 294: 6b10031f cmp w24, w16 > > > > > 298: fc606959 ldr d25, [x10, x0] > > > > > 29c: fc686922 ldr d2, [x9, x8] > > > > > 2a0: fc676921 ldr d1, [x9, x7] > > > > > 2a4: fc666920 ldr d0, [x9, x6] > > > > > 2a8: fc686958 ldr d24, [x10, x8] > > > > > 2ac: fc676957 ldr d23, [x10, x7] > > > > > 2b0: fc666956 ldr d22, [x10, x6] > > > > > 2b4: fc606855 ldr d21, [x2, x0] > > > > > 2b8: fc686854 ldr d20, [x2, x8] > > > > > 2bc: fc676853 ldr d19, [x2, x7] > > > > > 2c0: fc666852 ldr d18, [x2, x6] > > > > > 2c4: fc606891 ldr d17, [x4, x0] > > > > > 2c8: fc686890 ldr d16, [x4, x8] > > > > > 2cc: fc676887 ldr d7, [x4, x7] > > > > > 2d0: fc666885 ldr d5, [x4, x6] > > > > > 2d4: 0f44a063 smull v3.4s, v3.4h, v4.h[0] > > > > > 2d8: 0f44a042 smull v2.4s, v2.4h, v4.h[0] > > > > > 2dc: 0f44a021 smull v1.4s, v1.4h, v4.h[0] > > > > > 2e0: 0f44a000 smull v0.4s, v0.4h, v4.h[0] > > > > > 2e4: 0f542323 smlal v3.4s, v25.4h, v4.h[1] > > > > > 2e8: 0f542302 smlal v2.4s, v24.4h, v4.h[1] > > > > > 2ec: 0f5422e1 smlal v1.4s, v23.4h, v4.h[1] > > > > > 2f0: 0f5422c0 smlal v0.4s, v22.4h, v4.h[1] > > > > > 2f4: 0f6422a3 smlal v3.4s, v21.4h, v4.h[2] > > > > > 2f8: 0f642282 smlal v2.4s, v20.4h, v4.h[2] > > > > > 2fc: 0f642261 smlal v1.4s, v19.4h, v4.h[2] > > > > > 300: 0f642240 smlal v0.4s, v18.4h, v4.h[2] > > > > > 304: 0f742223 smlal v3.4s, v17.4h, v4.h[3] > > > > > 308: 0f742202 smlal v2.4s, v16.4h, v4.h[3] > > > > > 30c: 0f7420e1 smlal v1.4s, v7.4h, v4.h[3] > > > > > 310: 0f7420a0 smlal v0.4s, v5.4h, v4.h[3] > > > > > 314: 0f138c63 rshrn v3.4h, v3.4s, #13 > > > > > 318: 0f138c42 rshrn v2.4h, v2.4s, #13 > > > > > 31c: 0f138c21 rshrn v1.4h, v1.4s, #13 > > > > > 320: 0f138c00 rshrn v0.4h, v0.4s, #13 > > > > > 324: 6d3e0a63 stp d3, d2, [x19, #-32] > > > > > 328: 6d3f0261 stp d1, d0, [x19, #-16] > > > > > > > > > > Code generated by GCC8: > > > > > > > > > > 26c: 6b0b02ff cmp w23, w11 > > > > > 270: fc606922 ldr d2, [x9, x0] > > > > > 274: fc666941 ldr d1, [x10, x6] > > > > > 278: fc666920 ldr d0, [x9, x6] > > > > > 27c: 0f44a000 smull v0.4s, v0.4h, v4.h[0] > > > > > 280: 0f542020 smlal v0.4s, v1.4h, v4.h[1] > > > > > 284: fc6668e1 ldr d1, [x7, x6] > > > > > 288: 0f642020 smlal v0.4s, v1.4h, v4.h[2] > > > > > 28c: fc646945 ldr d5, [x10, x4] > > > > > 290: fc666901 ldr d1, [x8, x6] > > > > > 294: 0f742020 smlal v0.4s, v1.4h, v4.h[3] > > > > > 298: fc646921 ldr d1, [x9, x4] > > > > > 29c: 0f44a021 smull v1.4s, v1.4h, v4.h[0] > > > > > 2a0: 0f5420a1 smlal v1.4s, v5.4h, v4.h[1] > > > > > 2a4: fc626945 ldr d5, [x10, x2] > > > > > 2a8: 0f138c03 rshrn v3.4h, v0.4s, #13 > > > > > 2ac: fc626920 ldr d0, [x9, x2] > > > > > 2b0: 0f44a000 smull v0.4s, v0.4h, v4.h[0] > > > > > 2b4: 0f5420a0 smlal v0.4s, v5.4h, v4.h[1] > > > > > 2b8: fc606945 ldr d5, [x10, x0] > > > > > 2bc: 0f44a042 smull v2.4s, v2.4h, v4.h[0] > > > > > 2c0: 0f5420a2 smlal v2.4s, v5.4h, v4.h[1] > > > > > 2c4: fc6468e5 ldr d5, [x7, x4] > > > > > 2c8: 0f6420a1 smlal v1.4s, v5.4h, v4.h[2] > > > > > 2cc: fc6268e5 ldr d5, [x7, x2] > > > > > 2d0: 0f6420a0 smlal v0.4s, v5.4h, v4.h[2] > > > > > 2d4: fc6068e5 ldr d5, [x7, x0] > > > > > 2d8: 0f6420a2 smlal v2.4s, v5.4h, v4.h[2] > > > > > 2dc: fc646905 ldr d5, [x8, x4] > > > > > 2e0: 0f7420a1 smlal v1.4s, v5.4h, v4.h[3] > > > > > 2e4: fc626905 ldr d5, [x8, x2] > > > > > 2e8: 0f138c21 rshrn v1.4h, v1.4s, #13 > > > > > 2ec: 0f7420a0 smlal v0.4s, v5.4h, v4.h[3] > > > > > 2f0: 0f138c00 rshrn v0.4h, v0.4s, #13 > > > > > 2f4: fc606905 ldr d5, [x8, x0] > > > > > 2f8: 0f7420a2 smlal v2.4s, v5.4h, v4.h[3] > > > > > 2fc: 0f138c42 rshrn v2.4h, v2.4s, #13 > > > > > 300: 6d000e62 stp d2, d3, [x19] > > > > > 304: 6d010261 stp d1, d0, [x19, #16] > > > > > 308: 91008273 add x19, x19, #0x20 > > > > > > > > > > I did some tests on different compile options, and found that > option > > > > > "-fschedule-insns" on GCC 7 will generate code that runs faster, > if I > > > > > disable schedule-insns, GCC7 will generate the same code as GCC8. > > > > > However, this option seems don't work on GCC8, if I enable > > > > > "-fschedule-insns" with GCC8, the code generated by GCC8 is even > slower. If > > > > > I disable "-fschedule-insns" with GCC8, the generated code is just > like the > > > > > sequence as in C code. > > > > > > > > > > I compiled my code with -O3, which means -fschedule-insns will be > enabled > > > > > by default. > > > > > > > > > > With this option enabled, GCC7 will reschedule instructions, and > it seems > > > > > that GCC7 will arrange the same instructions all together, but > GCC8 doesn't > > > > > do that, or GCC8 will reschedule instructions in a worse way. > > > > > > > > > > My question is, is this behavior expected in GCC8, GCC9 and the > future > > > > > version? > > > > > Is this change in GCC code scheduling related to the fix of > "spectre and > > > > > mitigation" ? > > > > > > > > > > If I want the same instruction scheduling mechanism in GCC8, what > can I do ? > > > > > > > > > > Thank you for looking into this. > > > > > > > > > > Looking forward to your reply! > > > > > > > > > > Tomas Chang > > > > > Aug 19, 2019 > > > > > _______________________________________________ > > > > > linaro-toolchain mailing list > > > > > linaro-toolchain@lists.linaro.org > > > > > https://lists.linaro.org/mailman/listinfo/linaro-toolchain > > > > > > > > <arm-performance.tar.xz> > > > > > _______________________________________________ linaro-toolchain mailing list linaro-toolchain@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-toolchain