https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869
Bug ID: 87869
Summary: Unrolled loop leads to excessive code bloat with -Os on ARC EM.
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: nbowler at draconx dot ca
Target Milestone: ---

Consider the following code:

  % cat >test.c <<'EOF'
  #include <stdint.h>

  void do_stuff_12iter(void)
  {
      volatile uint32_t *blah = (void *)0xf0000000;
      unsigned i;

      for (i = 0; i < 12; i++) {
          blah[i] = 3;
      }
  }

  void do_stuff_11iter(void)
  {
      volatile uint32_t *blah = (void *)0xf0000000;
      unsigned i;

      for (i = 0; i < 11; i++) {
          blah[i] = 3;
      }
  }
  EOF

When I compile this with gcc:

  % arc-unknown-elf-gcc -v
  Using built-in specs.
  COLLECT_GCC=/usr/x86_64-pc-linux-gnu/arc-unknown-elf/gcc-bin/8.2.0/arc-unknown-elf-gcc
  COLLECT_LTO_WRAPPER=/usr/libexec/gcc/arc-unknown-elf/8.2.0/lto-wrapper
  Target: arc-unknown-elf
  Configured with: /var/tmp/portage/cross-arc-unknown-elf/gcc-8.2.0-r3/work/gcc-8.2.0/configure --host=x86_64-pc-linux-gnu --target=arc-unknown-elf --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/arc-unknown-elf/gcc-bin/8.2.0 --includedir=/usr/lib/gcc/arc-unknown-elf/8.2.0/include --datadir=/usr/share/gcc-data/arc-unknown-elf/8.2.0 --mandir=/usr/share/gcc-data/arc-unknown-elf/8.2.0/man --infodir=/usr/share/gcc-data/arc-unknown-elf/8.2.0/info --with-gxx-include-dir=/usr/lib/gcc/arc-unknown-elf/8.2.0/include/g++-v8 --with-python-dir=/share/gcc-data/arc-unknown-elf/8.2.0/python --enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 8.2.0-r3' --disable-esp --enable-libstdcxx-time --enable-poison-system-directories --disable-libstdcxx-time --with-sysroot=/usr/arc-unknown-elf --disable-bootstrap --with-newlib --enable-multilib --disable-altivec --disable-fixed-point --disable-libgomp --disable-libmudflap --disable-libssp --disable-libmpx --disable-systemtap --disable-vtable-verify --disable-libvtv --disable-libquadmath --enable-lto --without-isl --disable-libsanitizer --disable-default-pie --enable-default-ssp
  Thread model: single
  gcc version 8.2.0 (Gentoo 8.2.0-r3)

  % arc-unknown-elf-gcc -c -Os -mcpu=arcem -mno-sdata -mcode-density -mq-class -mbarrel-shifter -mmpy-option=3 -mswap test.c

The 11-iteration loop gets fully unrolled with pretty horrible results:

  00000018 <do_stuff_11iter>:
    18: 730c                  mov_s r0,3
    1a: 1e00 7000 f000 0000   st    r0,[0xf0000000]
    22: 1e00 7000 f000 0004   st    r0,[0xf0000004]
    2a: 1e00 7000 f000 0008   st    r0,[0xf0000008]
    32: 1e00 7000 f000 000c   st    r0,[0xf000000c]
    3a: 1e00 7000 f000 0010   st    r0,[0xf0000010]
    42: 1e00 7000 f000 0014   st    r0,[0xf0000014]
    4a: 1e00 7000 f000 0018   st    r0,[0xf0000018]
    52: 1e00 7000 f000 001c   st    r0,[0xf000001c]
    5a: 1e00 7000 f000 0020   st    r0,[0xf0000020]
    62: 1e00 7000 f000 0024   st    r0,[0xf0000024]
    6a: 1e00 7000 f000 0028   st    r0,[0xf0000028]
    72: 7ee0                  j_s   [blink]

That's nearly four times the size (92 bytes versus 24) of the 12-iteration one, which didn't get unrolled:

  00000000 <do_stuff_12iter>:
     0: 41c3 f000 0000        mov_s r1,0xf0000000
     6: 734c                  mov_s r2,3
     8: d80c                  mov_s r0,0xc
     a: 240a 7000             mov   lp_count,r0
     e: 20a8 0140             lp    10 ;16 <do_stuff_12iter+0x16>
    12: 1904 0090             st.ab r2,[r1,4]
    16: 7ee0                  j_s   [blink]

That one's pretty good. This specific example could be a _tiny_ bit better, because the constant values moved to r2 and r0 could be immediates in the instructions where those registers are used, but I'm not bothered by that.

Since I requested size optimizations, it would be nice if my code size didn't nearly quadruple like this.
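
As a possible way to narrow this down (a sketch only, not tested on ARC), the loop can be marked with the loop-specific pragma that GCC documents since version 8; a factor of 0 or 1 is documented to block unrolling of the loop that follows. Whether it actually restores the compact zero-overhead loop for the 11-iteration case with this toolchain is an assumption. The function name fill_11 and its pointer parameter are hypothetical, introduced so the same code can also be run on a host.

```c
#include <stdint.h>

/* Hypothetical variant of do_stuff_11iter: the base address is passed
   in, so the function does not depend on the 0xf0000000 mapping. */
void fill_11(volatile uint32_t *blah)
{
    unsigned i;

    /* "#pragma GCC unroll n" applies only to the loop that follows;
       values of 0 and 1 are documented to block unrolling.
       Assumption: this also prevents the full unrolling shown above. */
#pragma GCC unroll 1
    for (i = 0; i < 11; i++) {
        blah[i] = 3;
    }
}
```

On the target, this would be called as fill_11((void *)0xf0000000). Comparing the disassembly with and without the pragma should show whether the unrolling decision alone accounts for the size difference.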