https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102652
Bug ID: 102652
Summary: Unnecessary zeroing out of local ARM NEON arrays
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: decio at decpp dot net
Target Milestone: ---

Created attachment 51567
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51567&action=edit
Testcase to reproduce the bug. Sorry for gzipping it, but if uncompressed, it exceeds the 1 MB file size limit.

This is my first time reporting a compiler bug, so please be kind to me if I made any mistakes. In particular, I'm not sure if tree-optimization is the correct component.

Consider the attached code, briefly reproduced next, which is a minimal testcase obtained from many instances of more complex code in use in an application of mine:

/* START CODE */
#include <arm_neon.h>

void bug(int8_t *out, const int8_t *in)
{
    for (int i = 0; i < 2; i++)
    {
        int8x16x4_t x;
        x.val[0] = vld1q_s8(&in[16 * i]);
        x.val[1] = x.val[2] = x.val[3] = vshrq_n_s8(x.val[0], 7);
        vst4q_s8(&out[64 * i], x);
    }
}
/* END CODE */

This is the assembly output of this code:

0000000000000000 <bug>:
   0:   d10203ff    sub   sp, sp, #0x80
   4:   d2800009    mov   x9, #0x0                  // #0
   8:   d2800008    mov   x8, #0x0                  // #0
   c:   d2800007    mov   x7, #0x0                  // #0
  10:   d2800006    mov   x6, #0x0                  // #0
  14:   d2800005    mov   x5, #0x0                  // #0
  18:   d2800004    mov   x4, #0x0                  // #0
  1c:   d2800003    mov   x3, #0x0                  // #0
  20:   a90023e9    stp   x9, x8, [sp]
  24:   d2800002    mov   x2, #0x0                  // #0
  28:   a9011be7    stp   x7, x6, [sp, #16]
  2c:   a90213e5    stp   x5, x4, [sp, #32]
  30:   f9001be3    str   x3, [sp, #48]
  34:   3dc00020    ldr   q0, [x1]
  38:   a903a7e2    stp   x2, x9, [sp, #56]
  3c:   a9049fe8    stp   x8, x7, [sp, #72]
  40:   4f090404    sshr  v4.16b, v0.16b, #7
  44:   3d8003e0    str   q0, [sp]
  48:   4c4023e0    ld1   {v0.16b-v3.16b}, [sp]
  4c:   a90597e6    stp   x6, x5, [sp, #88]
  50:   4ea41c81    mov   v1.16b, v4.16b
  54:   a9068fe4    stp   x4, x3, [sp, #104]
  58:   4ea41c82    mov   v2.16b, v4.16b
  5c:   f9003fe2    str   x2, [sp, #120]
  60:   4ea41c83    mov   v3.16b, v4.16b
  64:   4c9f0000    st4   {v0.16b-v3.16b}, [x0], #64
  68:   3dc00424    ldr   q4, [x1, #16]
  6c:   910103e1    add   x1, sp, #0x40
  70:   3d8013e4    str   q4, [sp, #64]
  74:   4f090484    sshr  v4.16b, v4.16b, #7
  78:   4c402020    ld1   {v0.16b-v3.16b}, [x1]
  7c:   4ea41c81    mov   v1.16b, v4.16b
  80:   4ea41c82    mov   v2.16b, v4.16b
  84:   4ea41c83    mov   v3.16b, v4.16b
  88:   4c000000    st4   {v0.16b-v3.16b}, [x0]
  8c:   910203ff    add   sp, sp, #0x80
  90:   d65f03c0    ret

It can be seen that the generated code attempts to zero out the variable "x", which I understand is, first of all, uncalled for (seeing as it's local to function bug and not in the global scope), and even if it were necessary, it has no effect anyway since these variables are initialized later.

Many registers are redundantly zeroed (at addresses 4-1c and 24) and then stored on the stack (at addresses 20, 28-30, 38, 3c, 4c, 54 and 5c). None of these instructions needed to be generated.

The zeroed-out values are loaded at addresses 48 and 78, but 3 out of the 4 registers (v1, v2, v3) are immediately overwritten, at addresses 50, 58 and 60 for the first load, and 7c-84 for the second load.

For the remaining register that is loaded (v0), an unnecessary and redundant trip to memory is performed: for the first iteration of the loop, q0 is loaded at address 34, stored at address 44 and reloaded with the same value at address 48. The second and third instructions could just be removed.

For the second iteration, a load is performed at address 68, followed by a store at address 70 and another load at address 78. Again, the second and third instructions could be removed, so long as the destination register of the instruction at address 68, and the source register of the instruction at address 74, were both changed to q0.

In total, it appears that 24 out of 37 instructions could be removed from the generated code without any change of behavior, many of which are fairly expensive as they involve trips to memory. Thus, I estimate a speedup on the order of 3x if this issue were fixed.
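
For reference, the behavior expected from the testcase can be written in plain scalar C, which makes it easy to see that every byte of "x" is overwritten before it is stored. This is only a sketch: the name bug_reference is hypothetical and not part of the attached testcase, and it assumes the usual semantics of the intrinsics, namely that vshrq_n_s8(v, 7) yields 0 for non-negative lanes and -1 for negative lanes, and that vst4q_s8 interleaves the four vectors so that lane j of x.val[k] lands at out[4 * j + k].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar equivalent of bug() from the testcase.
 * Reads 32 bytes from in and writes 128 bytes to out. */
void bug_reference(int8_t *out, const int8_t *in)
{
    assert(out != NULL && in != NULL);

    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 16; j++) {
            int8_t v = in[16 * i + j];
            /* Arithmetic shift right by 7 on an int8 lane: 0 or -1. */
            int8_t sign = (int8_t)(v < 0 ? -1 : 0);
            /* ST4 interleaving: val[0], val[1], val[2], val[3] per lane. */
            int8_t *dst = &out[64 * i + 4 * j];
            dst[0] = v;
            dst[1] = sign;
            dst[2] = sign;
            dst[3] = sign;
        }
    }
}
```

Comparing the output of bug() against this reference on the target board is one way to confirm that removing the zeroing stores would not change the result.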

Note that the "-mcpu=native" and "-mtune=native" options do not make the issue go away.

This issue only appears to happen for small loops that can be fully unrolled. If the loop iteration count is unknown at compile time, or if a larger iteration count such as 32 is used, the issue goes away, as seen in the following assembly output:

0000000000000000 <bug>:
   0:   91080022    add   x2, x1, #0x200
   4:   d503201f    nop
   8:   3cc10424    ldr   q4, [x1], #16
   c:   4ea41c80    mov   v0.16b, v4.16b
  10:   4f090484    sshr  v4.16b, v4.16b, #7
  14:   4ea41c81    mov   v1.16b, v4.16b
  18:   4ea41c82    mov   v2.16b, v4.16b
  1c:   4ea41c83    mov   v3.16b, v4.16b
  20:   4c9f0000    st4   {v0.16b-v3.16b}, [x0], #64
  24:   eb02003f    cmp   x1, x2
  28:   54ffff01    b.ne  8 <bug+0x8>  // b.any
  2c:   d65f03c0    ret

However, even this code could be improved to something like this (manually written, untested modification):

0000000000000000 <bug>:
   0:   add   x2, x1, #0x200
   4:   nop
   8:   ldr   q0, [x1], #16
   c:   sshr  v1.16b, v0.16b, #7
  10:   mov   v2.16b, v1.16b
  14:   mov   v3.16b, v1.16b
  18:   st4   {v0.16b-v3.16b}, [x0], #64
  1c:   cmp   x1, x2
  20:   b.ne  8 <bug+0x8>  // b.any
  24:   ret

It appears gcc is trying to avoid using the v0-v3 registers elsewhere, i.e. in the load and shift instructions.

For completeness, here is the output of "gcc-11 -v -save-temps -O3 -c -o bug.o bug.c":

Using built-in specs.
COLLECT_GCC=gcc
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 10.3.0-1ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-10 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1)
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o' '-mlittle-endian' '-mabi=lp64'
 /usr/lib/gcc/aarch64-linux-gnu/10/cc1 -E -quiet -v -imultiarch aarch64-linux-gnu bug.c -mlittle-endian -mabi=lp64 -O3 -fpch-preprocess -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -o bug.i
ignoring nonexistent directory "/usr/local/include/aarch64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/aarch64-linux-gnu/10/include-fixed"
ignoring nonexistent directory "/usr/lib/gcc/aarch64-linux-gnu/10/../../../../aarch64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/aarch64-linux-gnu/10/include
 /usr/local/include
 /usr/include/aarch64-linux-gnu
 /usr/include
End of search list.
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o' '-mlittle-endian' '-mabi=lp64'
 /usr/lib/gcc/aarch64-linux-gnu/10/cc1 -fpreprocessed bug.i -quiet -dumpbase bug.c -mlittle-endian -mabi=lp64 -auxbase-strip bug.o -O3 -version -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -o bug.s
GNU C17 (Ubuntu 10.3.0-1ubuntu1) version 10.3.0 (aarch64-linux-gnu)
        compiled by GNU C version 10.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.0, isl version isl-0.23-GMP
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
GNU C17 (Ubuntu 10.3.0-1ubuntu1) version 10.3.0 (aarch64-linux-gnu)
        compiled by GNU C version 10.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.0, isl version isl-0.23-GMP
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: af83b0a86657149dda0e3a20e47571e2
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o' '-mlittle-endian' '-mabi=lp64'
 as -v -EL -mabi=lp64 -o bug.o bug.s
GNU assembler version 2.37 (aarch64-linux-gnu) using BFD version (GNU Binutils for Ubuntu) 2.37
COMPILER_PATH=/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/:/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/10/../../../aarch64-linux-gnu/:/usr/lib/gcc/aarch64-linux-gnu/10/../../../../lib/:/lib/aarch64-linux-gnu/:/lib/../lib/:/usr/lib/aarch64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/aarch64-linux-gnu/10/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o' '-mlittle-endian' '-mabi=lp64'

System information:

Raspberry Pi 4 Model B board with 4 GB of RAM. CPU: Broadcom BCM2711 with 4 x Cortex-A72 CPUs.
Output of "uname -a":
Linux rpi4 5.11.0-1019-raspi #20-Ubuntu SMP PREEMPT Tue Sep 21 15:23:42 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

Jetson Nano 2 GB board. CPU: Nvidia Tegra X1 with 4 x Cortex-A57 CPUs.
Output of "uname -a":
Linux jetson-nano 4.9.140-tegra #1 SMP PREEMPT Tue Oct 27 21:02:37 PDT 2020 aarch64 aarch64 aarch64 GNU/Linux

Versions of gcc with which I tried this on the Raspberry Pi 4:

Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/7/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 7.5.0-6ubuntu4' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-6ubuntu4)

Using built-in specs.
COLLECT_GCC=gcc-9
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/9/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.3.0-23ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 9.3.0 (Ubuntu 9.3.0-23ubuntu2)

Using built-in specs.
COLLECT_GCC=gcc-10
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/10/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 10.3.0-1ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-10 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1)

Using built-in specs.
COLLECT_GCC=gcc-11
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/11/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.2.0-7ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.2.0 (Ubuntu 11.2.0-7ubuntu2)

Versions of gcc with which I tried this on the Jetson Nano:

Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/7/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)

Using built-in specs.
COLLECT_GCC=gcc-8
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 8.4.0-1ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --disable-libphobos --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 8.4.0 (Ubuntu/Linaro 8.4.0-1ubuntu1~18.04)