[Bug tree-optimization/47059] compiler fails to coalesce loads/stores
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47059

Denis Vlasenko changed:
           What    |Removed |Added
                CC |        |vda.linux at googlemail dot com

--- Comment #3 from Denis Vlasenko ---
I encountered this behavior with 4.8.0:

        struct pollfd pfd[3];
        ...
        pfd[2].events = POLLOUT;
        pfd[2].revents = 0;

This compiled to:

        movw    $4, 44(%rsp)    #, pfd[2].events
        movw    $0, 46(%rsp)    #, pfd[2].revents
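For reference, the coalescing being asked for would replace the two 16-bit
stores with one 32-bit store, since .events and .revents are adjacent and both
values are compile-time constants. A minimal sketch, assuming the Linux
struct pollfd layout and POLLOUT == 4 as in the listing above; the "expected"
asm in the comment is my illustration, not compiler output:

        #include <poll.h>

        void use(struct pollfd *pfd);

        void set_pollout(void)
        {
                struct pollfd pfd[3];

                /* two adjacent 16-bit stores, as in the report */
                pfd[2].events = POLLOUT;        /* POLLOUT == 4 on Linux */
                pfd[2].revents = 0;
                use(pfd);
                /*
                 * Since events and revents are adjacent 16-bit fields, the
                 * two movw stores could be merged into one 32-bit store of
                 * the little-endian constant 0x00000004:
                 *
                 *     movl    $4, 44(%rsp)    #  events=4, revents=0
                 */
        }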
[Bug middle-end/66240] RFE: extend -falign-xyz syntax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240 --- Comment #7 from Denis Vlasenko --- Patch v8 https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00792.html https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00793.html https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00794.html https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00795.html
[Bug target/45996] -falign-functions=X does not work
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45996 Denis Vlasenko changed: What|Removed |Added CC||vda.linux at googlemail dot com --- Comment #8 from Denis Vlasenko --- See bug 66240
[Bug rtl-optimization/21182] gcc can use registers but uses stack instead
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182

--- Comment #6 from Denis Vlasenko 2013-01-18 00:48:23 UTC ---
Created attachment 29200
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29200
Updated testcase, build helper, and results of testing with different gcc
versions

Tarball contains:

serpent.c: the original testcase, only with "#ifdef NAIL_REGS" instead of
"#if 0", which allows test compiles without editing it. Basically,
"gcc -DNAIL_REGS serpent.c" will try to force gcc to use only registers
instead of stack.

gencode.sh: builds serpent.c with -O2 and -O3, with and without -DNAIL_REGS.
The object file names contain gcc version and used options. Then they are
objdump'ed and the output saved. Tweakable by setting $PREFIX and/or $CC.
No -fomit-frame-pointer used: the testcase can be compiled so that stack is
not used even without that option.

Disassembly:
    serpent-O2-3.4.3.asm
    serpent-O2-4.2.1.asm
    serpent-O2-4.6.3.asm
    serpent-O2-DNAIL_REGS-3.4.3.asm
    serpent-O2-DNAIL_REGS-4.2.1.asm
    serpent-O2-DNAIL_REGS-4.6.3.asm
    serpent-O3-3.4.3.asm
    serpent-O3-4.2.1.asm
    serpent-O3-4.6.3.asm
    serpent-O3-DNAIL_REGS-3.4.3.asm
    serpent-O3-DNAIL_REGS-4.2.1.asm
    serpent-O3-DNAIL_REGS-4.6.3.asm

Object files:
   text    data     bss     dec     hex filename
   3260       0       0    3260     cbc serpent-O2-DNAIL_REGS-3.4.3.o
   3260       0       0    3260     cbc serpent-O3-DNAIL_REGS-3.4.3.o
   3292       0       0    3292     cdc serpent-O3-3.4.3.o
   3536       0       0    3536     dd0 serpent-O2-4.6.3.o
   3536       0       0    3536     dd0 serpent-O3-4.6.3.o
   3845       0       0    3845     f05 serpent-O2-DNAIL_REGS-4.6.3.o
   3845       0       0    3845     f05 serpent-O3-DNAIL_REGS-4.6.3.o
   3877       0       0    3877     f25 serpent-O2-4.2.1.o
   3877       0       0    3877     f25 serpent-O3-4.2.1.o
   4302       0       0    4302    10ce serpent-O2-3.4.3.o
   4641       0       0    4641    1221 serpent-O2-DNAIL_REGS-4.2.1.o
   4641       0       0    4641    1221 serpent-O3-DNAIL_REGS-4.2.1.o

Take a look inside the serpent-O2-DNAIL_REGS-3.4.3.asm file. This is what I
want to get without asm hacks: the smallest code, uses no stack.

gcc-3.4.3 -O3 comes close: it does spill a few words to stack (search for
(%ebp)), but is generally good code (close to ideal?).

All other attempts fare worse:
gcc-3.4.3 -O2: code is significantly worse than -O3.
gcc-4.2.1 -O2/-O3: code is better than gcc-3.4.3 -O2, worse than gcc-4.6.3.
gcc-4.6.3 -O2/-O3: six instances of spills to stack. Code is still not as good
as gcc-3.4.3 -O3. (-DNAIL_REGS only confuses it more, unlike 3.4.3).

Stack usage summary:

$ grep 'sub.*,%esp' *.asm | grep -v DNAIL_REGS
serpent-O2-3.4.3.asm:   6: 81 ec 00 01 00 00    sub    $0x100,%esp
serpent-O2-4.2.1.asm:   6: 83 ec 78             sub    $0x78,%esp
serpent-O2-4.6.3.asm:   4: 83 ec 04             sub    $0x4,%esp
serpent-O3-4.2.1.asm:   6: 83 ec 78             sub    $0x78,%esp
serpent-O3-4.6.3.asm:   4: 83 ec 04             sub    $0x4,%esp

(serpent-O3-3.4.3.asm is not listed, but it allocates and uses one word on
stack by a push insn).
Modules with best (= minimal) stack usage:

$ grep -F -e '(%esp)' -e '(%ebp)' serpent-O2-DNAIL_REGS-3.4.3.asm
   6:  8b 75 08       mov    0x8(%ebp),%esi
   9:  8b 7d 10       mov    0x10(%ebp),%edi
 ca9:  8b 75 0c       mov    0xc(%ebp),%esi

$ grep -F -e '(%esp)' -e '(%ebp)' serpent-O3-3.4.3.asm
   7:  8b 7d 08       mov    0x8(%ebp),%edi
   a:  8b 4d 10       mov    0x10(%ebp),%ecx
 18c:  89 7d f0       mov    %edi,-0x10(%ebp)
 1dd:  8b 45 f0       mov    -0x10(%ebp),%eax
 23b:  8b 75 f0       mov    -0x10(%ebp),%esi
 299:  8b 7d f0       mov    -0x10(%ebp),%edi
 432:  8b 55 f0       mov    -0x10(%ebp),%edx
 4a0:  8b 4d f0       mov    -0x10(%ebp),%ecx
 50e:  8b 7d f0       mov    -0x10(%ebp),%edi
 84f:  8b 45 f0       mov    -0x10(%ebp),%eax
 8b9:  8b 75 f0       mov    -0x10(%ebp),%esi
 923:  8b 7d f0       mov    -0x10(%ebp),%edi
 cb6:  8b 55 0c       mov    0xc(%ebp),%edx

$ grep -F -e '(%esp)' -e '(%ebp)' serpent-O3-4.6.3.asm
   7:  8b 4c 24 20    mov    0x20(%esp),%ecx
   b:  8b 44 24 18    mov    0x18(%esp),%eax
 22e:  89 0c 24       mov    %ecx,(%esp)
 239:  23 3c 24       and    (%esp),%edi
 588:  89 0c 24       mov    %ecx,(%esp)
 58f:  23 3c 24       and    (%esp),%edi
 8f4:  89 0c 24       mov    %ecx,(%esp)
 8fd:  23 3c 24       and    (%esp),%edi
 c60:  89 0c 24       mov    %ecx,(%esp)
 c6b:  23 3c 24       and    (%esp),%edi
 d37:  89 14 24       mov    %edx,(%esp)
 d5a:  8b 44 24 1c    mov    0x1c(%esp),%eax
 d5e:  33 14 24
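The -DNAIL_REGS hack referred to above lives in the attached serpent.c, which
is not reproduced here. For readers who want the general shape of the trick, a
minimal sketch using GCC local register variables follows; the variable names,
register choices, and arithmetic are illustrative only, not taken from the
testcase:

        /* Pin values to specific hard registers so the compiler cannot
         * spill them to the stack; this is the general form of the
         * NAIL_REGS hack. */
        unsigned mix(unsigned a, unsigned b)
        {
        #ifdef NAIL_REGS
                register unsigned r0 asm("eax") = a;
                register unsigned r1 asm("ebx") = b;
        #else
                unsigned r0 = a;
                unsigned r1 = b;
        #endif
                r0 ^= r1;
                r1 = (r1 << 3) | (r1 >> 29);    /* rotate-by-3, serpent-style mixing */
                return r0 + r1;
        }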
[Bug rtl-optimization/21182] gcc can use registers but uses stack instead
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182 Denis Vlasenko changed: What|Removed |Added CC||vda.linux at googlemail dot ||com --- Comment #7 from Denis Vlasenko 2013-01-18 00:51:01 UTC --- "gcc-4.6.3 got better a bit, still not as good as gcc-4.6.3 -O3." I meant: gcc-4.6.3 got better a bit, still not as good as gcc-3.4.3 -O3 used to be.
[Bug rtl-optimization/21182] gcc can use registers but uses stack instead
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182

--- Comment #8 from Denis Vlasenko 2013-01-18 00:55:37 UTC ---
Grrr, another mistake. Correcting again. Conclusion:

gcc-3.4.3 -O3 was close to ideal.
gcc-4.2.1 is worse.
gcc-4.6.3 got better a bit, still not as good as gcc-3.4.3 -O3 used to be.
[Bug target/30354] -Os doesn't optimize a/CONST even if it saves size.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30354

--- Comment #16 from Denis Vlasenko 2013-01-18 10:29:12 UTC ---
(In reply to comment #15)
> Honza, did you find time to have a look?
>
> I think this regressed alot in 4.6

Not really - it's just the .eh_frame section. I re-ran the tests with the two
gcc's I have here and sizes look like this:

   text    data     bss     dec     hex filename
 257731       0       0  257731   3eec3 divmod-4.2.1-Os.o
 242787       0       0  242787   3b463 divmod-4.6.3-Os.o

Stock (unpatched) gcc improved, juggles registers better. For example:

int ib_100_x(int x) { return (100 / x) ^ (100 % x); }

   0:  b8 64 00 00 00     mov    $0x64,%eax
   5:  99                 cltd
   6:  f7 7c 24 04        idivl  0x4(%esp)
-  a:  31 c2              xor    %eax,%edx
-  c:  89 d0              mov    %edx,%eax
-  e:  c3                 ret
+  a:  31 d0              xor    %edx,%eax
+  c:  c3                 ret

I believe my patch would improve things still - it is orthogonal to register
allocation.

BTW, just so that we are all on the same page wrt compiler options: here's the
script I use to compile, disassemble, and extract function sizes from the test
program in comment 3. Tweakable by setting $PREFIX and/or $CC:

gencode.sh
==========
#!/bin/sh

#PREFIX="i686-"
test "$PREFIX" || PREFIX=""
test "$CC" || CC="${PREFIX}gcc"
test "$OBJDUMP" || OBJDUMP="${PREFIX}objdump"
test "$NM" || NM="${PREFIX}nm"

CC_VER=`$CC --version | sed -n 's/[^ ]* [^ ]* \([3-9]\.[1-9][^ ]*\).*/\1/p'`
test "$CC_VER" || exit 1

build() {
        opt=$1
        bname=divmod-$CC_VER${opt}${nail}
        # -ffunction-sections makes disasm easier to understand
        # (insn offsets start from 0 within every function).
        # -fno-exceptions -fno-asynchronous-unwind-tables: die, .eh_frame, die!
        $CC \
                -m32 \
                -fomit-frame-pointer \
                -ffunction-sections \
                -fno-exceptions \
                -fno-asynchronous-unwind-tables \
                ${opt} t.c -c -o $bname.o \
        && $OBJDUMP -dr $bname.o >$bname.asm \
        && $NM --size-sort $bname.o | sort -k3 >$bname.nm
}

build -Os
#build -O2      #not interesting
#build -O3      #not interesting

size *.o | tee SIZES
[Bug rtl-optimization/21150] Suboptimal byte extraction from 64bits
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150 Denis Vlasenko changed: What|Removed |Added CC||vda.linux at googlemail dot ||com --- Comment #6 from Denis Vlasenko 2013-01-18 11:12:18 UTC --- Guess this can be closed now. All four cases look good: $ cat helper-4.6.3-O2.asm helper-4.6.3-O2.o: file format elf32-i386 ... : 0:0f b6 05 2d 00 00 00 movzbl 0x2d,%eax 7:32 05 24 00 00 00xor0x24,%al d:32 05 00 00 00 00xor0x0,%al 13:32 05 36 00 00 00xor0x36,%al 19:32 05 3f 00 00 00xor0x3f,%al 1f:32 05 09 00 00 00xor0x9,%al 25:32 05 12 00 00 00xor0x12,%al 2b:32 05 1b 00 00 00xor0x1b,%al 31:c3 ret Disassembly of section .text.b: : 0:0f b6 05 12 00 00 00 movzbl 0x12,%eax 7:32 05 09 00 00 00xor0x9,%al d:32 05 00 00 00 00xor0x0,%al 13:32 05 1b 00 00 00xor0x1b,%al 19:32 05 24 00 00 00xor0x24,%al 1f:32 05 2d 00 00 00xor0x2d,%al 25:32 05 36 00 00 00xor0x36,%al 2b:32 05 3f 00 00 00xor0x3f,%al 31:c3 ret Disassembly of section .text.c: : 0:0f b6 05 09 00 00 00 movzbl 0x9,%eax 7:32 05 00 00 00 00xor0x0,%al d:32 05 12 00 00 00xor0x12,%al 13:32 05 1b 00 00 00xor0x1b,%al 19:32 05 24 00 00 00xor0x24,%al 1f:32 05 2d 00 00 00xor0x2d,%al 25:32 05 36 00 00 00xor0x36,%al 2b:32 05 3f 00 00 00xor0x3f,%al 31:c3 ret Disassembly of section .text.d: : 0:0f b6 05 12 00 00 00 movzbl 0x12,%eax 7:32 05 09 00 00 00xor0x9,%al d:32 05 00 00 00 00xor0x0,%al 13:32 05 1b 00 00 00xor0x1b,%al 19:32 05 24 00 00 00xor0x24,%al 1f:32 05 2d 00 00 00xor0x2d,%al 25:32 05 36 00 00 00xor0x36,%al 2b:32 05 3f 00 00 00xor0x3f,%al 31:c3 ret Curiously, -Os manages to squeeze two more bytes out of it. helper-4.6.3-Os.o: file format elf32-i386 : 0: a0 2d 00 00 00 mov0x2d,%al ^^ ^^^ better than movzbl 5: 33 05 24 00 00 00 xor0x24,%eax << why %eax? oh well... b: 33 05 00 00 00 00 xor0x0,%eax 11: 32 05 36 00 00 00 xor0x36,%al 17: 32 05 3f 00 00 00 xor0x3f,%al 1d: 32 05 09 00 00 00 xor0x9,%al 23: 32 05 12 00 00 00 xor0x12,%al 29: 32 05 1b 00 00 00 xor0x1b,%al 2f: c3 ret
[Bug rtl-optimization/21141] [3.4 Regression] excessive stack usage
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141

Denis Vlasenko changed:
           What    |Removed |Added
                CC |        |vda.linux at googlemail dot com

--- Comment #9 from Denis Vlasenko 2013-01-18 16:01:52 UTC ---
Current gcc seems to be doing fine:

$ grep 'sub.*,%esp' *.asm; size *.o
whirlpool-4.2.1-O2.asm: 81 ec 84 01 00 00    sub    $0x184,%esp
whirlpool-4.2.1-O3.asm: 81 ec 4c 01 00 00    sub    $0x14c,%esp
whirlpool-4.2.1-Os.asm: 81 ec 84 01 00 00    sub    $0x184,%esp
whirlpool-4.6.3-O2.asm: 81 ec 4c 01 00 00    sub    $0x14c,%esp
whirlpool-4.6.3-O3.asm: 81 ec 4c 01 00 00    sub    $0x14c,%esp
whirlpool-4.6.3-Os.asm: 81 ec 4c 01 00 00    sub    $0x14c,%esp
   text    data     bss     dec     hex filename
   6223       0       0    6223    184f whirlpool-4.2.1-O2.o
   5663       0       0    5663    161f whirlpool-4.2.1-O3.o
   6194       0       0    6194    1832 whirlpool-4.2.1-Os.o
   5655       0       0    5655    1617 whirlpool-4.6.3-O2.o
   5703       0       0    5703    1647 whirlpool-4.6.3-O3.o
   5570       0       0    5570    15c2 whirlpool-4.6.3-Os.o
[Bug rtl-optimization/21141] [3.4 Regression] excessive stack usage
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141

--- Comment #10 from Denis Vlasenko 2013-01-18 16:03:37 UTC ---
BTW, the testcase needs a small fix:

-static const u64 C0[256];
+u64 C0[256];

or else gcc will optimize it almost to nothing :)
[Bug rtl-optimization/21182] [4.6/4.7/4.8 Regression] gcc can use registers but uses stack instead
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182

--- Comment #11 from Denis Vlasenko 2013-01-20 14:39:42 UTC ---
(In reply to comment #10)
> 4.4.7 and 4.5.4 generate the same code (no stack use) for -D/-UNAIL_REGS.
> With 4.6.3, the -DNAIL_REGS code regresses very much (IRA ...), the
> -UNAIL_REGS code is nearly perfect but less good than 4.4/4.5 (if you
> only consider grep esp serpent.s | wc -l). Same behavior with 4.7.2.
>
> Trunk got somewhat worse with -UNAIL_REGS but better with -DNAIL_REGS (at
> -O2):
>
>          -UNAIL_REGS  -DNAIL_REGS
> 4.5.4    3            3
> 4.6.3    15           101

This matches what I see with 4.6.3 - 15 insns with %esp (and no %ebp):

$ grep '%esp' serpent-4.6.3-O2.asm
   4:  83 ec 04       sub    $0x4,%esp
   7:  8b 4c 24 20    mov    0x20(%esp),%ecx
   b:  8b 44 24 18    mov    0x18(%esp),%eax
 22e:  89 0c 24       mov    %ecx,(%esp)
 239:  23 3c 24       and    (%esp),%edi
 588:  89 0c 24       mov    %ecx,(%esp)
 58f:  23 3c 24       and    (%esp),%edi
 8f4:  89 0c 24       mov    %ecx,(%esp)
 8fd:  23 3c 24       and    (%esp),%edi
 c60:  89 0c 24       mov    %ecx,(%esp)
 c6b:  23 3c 24       and    (%esp),%edi
 d37:  89 14 24       mov    %edx,(%esp)
 d5a:  8b 44 24 1c    mov    0x1c(%esp),%eax
 d5e:  33 14 24       xor    (%esp),%edx
 d70:  83 c4 04       add    $0x4,%esp

> The most important thing to fix is the -UNAIL_REGS case of course.

Sure. NAIL_REGS is only a hack meant to demonstrate that regs *can* be
allocated optimally.
[Bug c/70646] Corrupt truncated function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646

Denis Vlasenko changed:
           What    |Removed |Added
                CC |        |vda.linux at googlemail dot com

--- Comment #3 from Denis Vlasenko ---
I can reproduce it with:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/5.3.1/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --disable-libgcj --with-isl --enable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 5.3.1 20160406 (Red Hat 5.3.1-6) (GCC)

No fancy compiler flags are necessary to trigger it. Without
"-fno-omit-frame-pointer", the function loses its two remaining insns, and I
see an empty body:

        .type   qla2x00_get_host_fabric_name, @function
qla2x00_get_host_fabric_name:
.LFB4504:
        .cfi_startproc
        .cfi_endproc
.LFE4504:
        .size   qla2x00_get_host_fabric_name, .-qla2x00_get_host_fabric_name

A simple "gcc -Os qla_attr.i.c -S" would do. gcc -O2 produces a normal-looking
function.
[Bug c/70646] Corrupt truncated function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646 --- Comment #4 from Denis Vlasenko --- Shorter reproducer: typedef __signed__ char __s8; typedef unsigned char __u8; typedef __signed__ short __s16; typedef unsigned short __u16; typedef __signed__ int __s32; typedef unsigned int __u32; __extension__ typedef __signed__ long long __s64; __extension__ typedef unsigned long long __u64; typedef signed char s8; typedef unsigned char u8; typedef signed short s16; typedef unsigned short u16; typedef signed int s32; typedef unsigned int u32; typedef signed long long s64; typedef unsigned long long u64; typedef __u64 __be64; static inline __attribute__((no_instrument_function)) __attribute__((__const__)) __u64 __fswab64(__u64 val) { return __builtin_bswap64(val); } static inline __attribute__((no_instrument_function)) __attribute__((always_inline)) __u64 __swab64p(const __u64 *p) { return (__builtin_constant_p((__u64)(*p)) ? ((__u64)( (((__u64)(*p) & (__u64)0x00ffULL) << 56) | (((__u64)(*p) & (__u64)0xff00ULL) << 40) | (((__u64)(*p) & (__u64)0x00ffULL) << 24) | (((__u64)(*p) & (__u64)0xff00ULL) << 8) | (((__u64)(*p) & (__u64)0x00ffULL) >> 8) | (((__u64)(*p) & (__u64)0xff00ULL) >> 24) | (((__u64)(*p) & (__u64)0x00ffULL) >> 40) | (((__u64)(*p) & (__u64)0xff00ULL) >> 56))) : __fswab64(*p)); } static inline __attribute__((no_instrument_function)) __attribute__((always_inline)) __u64 __be64_to_cpup(const __be64 *p) { return __swab64p((__u64 *)p); } static inline __attribute__((no_instrument_function)) __attribute__((always_inline)) u64 get_unaligned_be64(const void *p) { return __be64_to_cpup((__be64 *)p); } static inline __attribute__((no_instrument_function)) u64 wwn_to_u64(u8 *wwn) { return get_unaligned_be64(wwn); } struct Scsi_Host { unsigned long base; unsigned long io_port; unsigned char n_io_port; unsigned char dma_channel; unsigned int irq; void *shost_data; unsigned long hostdata[0] __attribute__ ((aligned (sizeof(unsigned long; }; static inline __attribute__((no_instrument_function)) void *shost_priv(struct Scsi_Host *shost) { return (void *)shost->hostdata; } typedef struct scsi_qla_host { u8 fabric_node_name[8]; u32 device_flags; } scsi_qla_host_t; struct fc_host_attrs { u64 node_name; u64 port_name; u64 permanent_port_name; u32 supported_classes; u8 supported_fc4s[32]; u32 supported_speeds; u32 maxframe_size; u16 max_npiv_vports; char serial_number[80]; char manufacturer[80]; char model[256]; char model_description[256]; char hardware_version[64]; char driver_version[64]; char firmware_version[64]; char optionrom_version[64]; u32 port_id; u8 active_fc4s[32]; u32 speed; u64 fabric_name; }; static void qla2x00_get_host_fabric_name(struct Scsi_Host *shost) { scsi_qla_host_t *vha = shost_priv(shost); u8 node_name[8] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF}; u64 fabric_name = wwn_to_u64(node_name); if (vha->device_flags & 0x1) fabric_name = wwn_to_u64(vha->fabric_node_name); (((struct fc_host_attrs *)(shost)->shost_data)->fabric_name) = fabric_name; } void *get_host_fabric_name = qla2x00_get_host_fabric_name;
[Bug c/70646] Corrupt truncated function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646 --- Comment #5 from Denis Vlasenko --- Even smaller reproducer. Bug disappears if "__attribute__((always_inline))" is removed everywhere. typedef unsigned char u8; typedef unsigned int u32; typedef unsigned long long u64; static inline __attribute__((__const__)) u64 __fswab64(u64 val) { return __builtin_bswap64(val); } static inline __attribute__((always_inline)) u64 __swab64p(const u64 *p) { return (__builtin_constant_p((u64)(*p)) ? ((u64)( (((u64)(*p) & (u64)0x00ffULL) << 56) | (((u64)(*p) & (u64)0xff00ULL) << 40) | (((u64)(*p) & (u64)0x00ffULL) << 24) | (((u64)(*p) & (u64)0xff00ULL) << 8) | (((u64)(*p) & (u64)0x00ffULL) >> 8) | (((u64)(*p) & (u64)0xff00ULL) >> 24) | (((u64)(*p) & (u64)0x00ffULL) >> 40) | (((u64)(*p) & (u64)0xff00ULL) >> 56))) : __fswab64(*p)); } static inline __attribute__((always_inline)) u64 __be64_to_cpup(const u64 *p) { return __swab64p((u64 *)p); } static inline __attribute__((always_inline)) u64 get_unaligned_be64(const void *p) { return __be64_to_cpup((u64 *)p); } static inline u64 wwn_to_u64(u8 *wwn) { return get_unaligned_be64(wwn); } struct Scsi_Host { void *shost_data; unsigned long hostdata[0]; }; static inline void *shost_priv(struct Scsi_Host *shost) { return (void *)shost->hostdata; } typedef struct scsi_qla_host { u8 fabric_node_name[8]; u32 device_flags; } scsi_qla_host_t; struct fc_host_attrs { u64 fabric_name; }; static void qla2x00_get_host_fabric_name(struct Scsi_Host *shost) { scsi_qla_host_t *vha = shost_priv(shost); u8 node_name[8] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF}; u64 fabric_name = wwn_to_u64(node_name); if (vha->device_flags & 0x1) fabric_name = wwn_to_u64(vha->fabric_node_name); (((struct fc_host_attrs *)(shost)->shost_data)->fabric_name) = fabric_name; } void *get_host_fabric_name = qla2x00_get_host_fabric_name;
[Bug c/70646] Corrupt truncated function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646 --- Comment #6 from Denis Vlasenko --- I can collapse the chain of inlines down to this and still see the bug. Removing "__attribute__((always_inline))", or merging __swab64p() and wwn_to_u64(), makes bug disappear. typedef unsigned char u8; typedef unsigned int u32; typedef unsigned long long u64; static inline __attribute__((always_inline)) u64 __swab64p(const u64 *p) { return (__builtin_constant_p((u64)(*p)) ? ((u64)( (((u64)(*p) & (u64)0x00ffULL) << 56) | (((u64)(*p) & (u64)0xff00ULL) << 40) | (((u64)(*p) & (u64)0x00ffULL) << 24) | (((u64)(*p) & (u64)0xff00ULL) << 8) | (((u64)(*p) & (u64)0x00ffULL) >> 8) | (((u64)(*p) & (u64)0xff00ULL) >> 24) | (((u64)(*p) & (u64)0x00ffULL) >> 40) | (((u64)(*p) & (u64)0xff00ULL) >> 56))) : __builtin_bswap64(*p)); } static inline u64 wwn_to_u64(void *wwn) { return __swab64p(wwn); } struct Scsi_Host { void *shost_data; unsigned long hostdata[0]; }; static inline void *shost_priv(struct Scsi_Host *shost) { return (void *)shost->hostdata; } typedef struct scsi_qla_host { u8 fabric_node_name[8]; u32 device_flags; } scsi_qla_host_t; struct fc_host_attrs { u64 fabric_name; }; static void qla2x00_get_host_fabric_name(struct Scsi_Host *shost) { scsi_qla_host_t *vha = shost_priv(shost); u8 node_name[8] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF}; u64 fabric_name = wwn_to_u64(node_name); if (vha->device_flags & 0x1) fabric_name = wwn_to_u64(vha->fabric_node_name); (((struct fc_host_attrs *)(shost)->shost_data)->fabric_name) = fabric_name; } void *get_host_fabric_name = qla2x00_get_host_fabric_name;
[Bug rtl-optimization/21150] Suboptimal byte extraction from 64bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150

--- Comment #7 from Denis Vlasenko ---
Fixed at least in 4.7.2, maybe earlier.
With -m32 -fomit-frame-pointer -O2, all four cases look good:

a:
        movzbl  v+45, %eax
        xorb    v+36, %al
        xorb    v, %al
        xorb    v+54, %al
        xorb    v+63, %al
        xorb    v+9, %al
        xorb    v+18, %al
        xorb    v+27, %al
        ret
b:
        movzbl  v+18, %eax
        xorb    v+9, %al
        xorb    v, %al
        xorb    v+27, %al
        xorb    v+36, %al
        xorb    v+45, %al
        xorb    v+54, %al
        xorb    v+63, %al
        ret
c:
        movzbl  v+9, %eax
        xorb    v, %al
        xorb    v+18, %al
        xorb    v+27, %al
        xorb    v+36, %al
        xorb    v+45, %al
        xorb    v+54, %al
        xorb    v+63, %al
        ret
d:
        movzbl  v+18, %eax
        xorb    v+9, %al
        xorb    v, %al
        xorb    v+27, %al
        xorb    v+36, %al
        xorb    v+45, %al
        xorb    v+54, %al
        xorb    v+63, %al
        ret

With the same flags but -Os, my only complaint is that word-sized XORs are
needlessly adding partial register update stalls:

d:
        movb    v+18, %al
        xorb    v+9, %al
        xorl    v, %eax
        xorb    v+27, %al
        xorl    v+36, %eax
        xorb    v+45, %al
        xorb    v+54, %al
        xorb    v+63, %al
        ret

but overall it looks much better. Feel free to close this BZ.
[Bug middle-end/66240] RFE: extend -falign-xyz syntax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

--- Comment #4 from Denis Vlasenko ---
Created attachment 38293
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38293&action=edit
Proposed patch

This patch implements -falign-functions=N[,M] for now, with an eye to easy
extension to the other -falign options.

I tested that with -falign-functions=N (tried 8, 15, 16, 17...) the alignment
directives are the same before and after the patch:
-falign-functions=8 generates ".p2align 3,,7" before and after.
-falign-functions=17 generates ".p2align 5,,16" before and after.

I tested that -falign-functions=N,N (two equal parameters) works exactly like
-falign-functions=N.

The patch drops the currently performed forced alignment to 8 if the requested
alignment is higher than 8: before the patch, -falign-functions=9 was
generating

        .p2align 4,,8
        .p2align 3

which means "Align to 16 if the skip is 8 bytes or less; else align to 8".
After the patch, ".p2align 3" is not emitted.

I drop that because I ultimately want to do something like
-falign-functions=64,8 - IOW, I want to align functions to 64 bytes, but only
if that entails a skip of less than 8 bytes - otherwise I want **no alignment
at all**. The forced ".p2align 3" interferes with that intention.

This is an RFC patch, IOW: I don't insist on removal of ".p2align 3"
generation. I imagine that it should be retained for compat, and yet another
option should be added to suppress it if desired (how about
"-mno-8byte-code-subalign"? Argh...)
[Bug middle-end/70703] New: Regression in register usage on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70703

            Bug ID: 70703
           Summary: Regression in register usage on x86
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vda.linux at googlemail dot com
  Target Milestone: ---

$ cat bad.c
unsigned ud_x_641_mul(unsigned x) {
        /* optimized version of x / 641 */
        return ((unsigned long long)x * 0x663d81) >> 32;
}

With gcc from current svn:

$ gcc -m32 -fomit-frame-pointer -O2 bad.c -S && cat bad.s
...
ud_x_641_mul:
        .cfi_startproc
        movl    $6700417, %ecx
        movl    %ecx, %eax
        mull    4(%esp)
        movl    %edx, %ecx
        movl    %ecx, %eax
        ret

Same result with -Os. Note two pointless mov insns.

gcc 5.3.1 is "better", it adds only one unnecessary insn:

ud_x_641_mul:
        .cfi_startproc
        movl    $6700417, %ecx
        movl    %ecx, %eax
        mull    4(%esp)
        movl    %edx, %eax
        ret

gcc 4.4.x and 4.7.2 were generating this code, which looks optimal:

ud_x_641_mul:
        .cfi_startproc
        movl    $6700417, %eax
        mull    4(%esp)
        movl    %edx, %eax
        ret

I did not test other versions of gcc yet.
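A side note on the constant (the bug itself is only about the extra mov
insns): 0x663d81 is 6700417, and 641 * 6700417 = 2^32 + 1, which is why taking
the high 32 bits of the product is an exact replacement for division by 641.
A small self-check, my addition rather than part of the report:

        #include <assert.h>

        static unsigned ud_x_641_mul(unsigned x)
        {
                /* same transformation as in bad.c: high 32 bits of x * 6700417 */
                return ((unsigned long long)x * 0x663d81) >> 32;
        }

        int main(void)
        {
                /* 641 * 6700417 == 2^32 + 1, so the multiply-high equals
                   x / 641 for every 32-bit x; spot-check a few values. */
                unsigned t[] = { 0, 1, 640, 641, 642, 100000, 0x7fffffffu, 0xffffffffu };
                for (unsigned i = 0; i < sizeof(t) / sizeof(t[0]); i++)
                        assert(ud_x_641_mul(t[i]) == t[i] / 641);
                return 0;
        }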
[Bug target/30354] -Os doesn't optimize a/CONST even if it saves size.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30354

--- Comment #17 from Denis Vlasenko ---
Any chance of this being finally done?

I proposed a simple, working patch in 2007; it's 2016 now, and all these years
users of -Os have suffered from slow divisions in important cases such as
"signed_int / 16" and "unsigned_int / 10".

I understand your desire to do it "better", to make gcc count the size of
div/idiv more accurately, without having to lie to it in the insn size table.
But with you guys constantly distracted by other, more important issues, what
happened here is that _nothing_ was done...

I retested the patch with current svn (the future 7.0.0), using the test
program with 15000 divisions from comment 3:

Bumping division cost up to 8 is no longer enough; this only makes gcc better
towards some (not all) 2^N divisors. Bumping div cost to 9..12 helps with most
of the remaining 2^N divisor cases, and with the two exceptional cases of
x / 641 and x / 6700417. Only bumping div cost to 13, namely, changing div
costs as follows:

const struct processor_costs ix86_size_cost = {/* costs for tuning for size */
  ...
  {COSTS_N_BYTES (13),   /* cost of a divide/mod for QI */
   COSTS_N_BYTES (13),   /*                          HI */
   COSTS_N_BYTES (13),   /*                          SI */
   COSTS_N_BYTES (13),   /*                          DI */
   COSTS_N_BYTES (15)},  /*                       other */

makes it work as it used to in 4.4.x days: out of 15000 cases in t.c, 975
cases are optimized so that they don't use "div" anymore.

This should have made it smaller too... but did not, because meanwhile gcc has
regressed in another area. Now it inserts superfluous register moves. See bug
70703 which I just filed. Essentially, instead of

        movl    $6700417, %eax
        mull    4(%esp)
        movl    %edx, %eax
        ret

gcc generates:

        movl    $6700417, %ecx
        movl    %ecx, %eax
        mull    4(%esp)
        movl    %edx, %ecx
        movl    %ecx, %eax
        ret

Sizes of compiled testcases (objN denotes cost of "div"; A...D correspond to
costs of 10..13):

   text    data     bss     dec     hex filename
 242787       0       0  242787   3b463 gcc.obj3/divmod-7.0.0-Os.o
 242813       0       0  242813   3b47d gcc.obj8/divmod-7.0.0-Os.o
 242838       0       0  242838   3b496 gcc.obj9/divmod-7.0.0-Os.o
 242844       0       0  242844   3b49c gcc.objA/divmod-7.0.0-Os.o
 242844       0       0  242844   3b49c gcc.objB/divmod-7.0.0-Os.o
 242844       0       0  242844   3b49c gcc.objC/divmod-7.0.0-Os.o
 247573       0       0  247573   3c715 gcc.objD/divmod-7.0.0-Os.o

So. Any chance of this patch being accepted sometime before 2100? ;)
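For readers unfamiliar with the transformation whose cost is being tuned here:
for the "signed_int / 16" case mentioned above, the div-free replacement is a
bias plus an arithmetic shift. A minimal sketch of that standard idiom (my
illustration of the general technique, not output of the patch; it relies on
GCC's arithmetic right shift of negative values):

        #include <assert.h>

        static int sdiv16(int x)
        {
                /* add divisor-1 for negative x so the shift truncates toward zero */
                int bias = (x >> 31) & 15;      /* 15 if x < 0, else 0 */
                return (x + bias) >> 4;
        }

        int main(void)
        {
                for (int x = -1000000; x <= 1000000; x += 7)
                        assert(sdiv16(x) == x / 16);
                return 0;
        }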
[Bug target/30354] -Os doesn't optimize a/CONST even if it saves size.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30354

--- Comment #18 from Denis Vlasenko ---
Created attachment 38297
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38297&action=edit
Comparison of generated code with 7.0.0.svn on i86

With div cost of 3:

-   0:  8b 44 24 04      mov    0x4(%esp),%eax
-   4:  b9 64 00 00 00   mov    $0x64,%ecx
-   9:  31 d2            xor    %edx,%edx
-   b:  f7 f1            div    %ecx
-   d:  c3               ret

With div cost of 13:

+   0:  b9 1f 85 eb 51   mov    $0x51eb851f,%ecx
+   5:  89 c8            mov    %ecx,%eax
+   7:  f7 64 24 04      mull   0x4(%esp)
+   b:  89 d1            mov    %edx,%ecx
+   d:  89 c8            mov    %ecx,%eax
+   f:  c1 e8 05         shr    $0x5,%eax
+  12:  c3               ret
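The "div cost of 13" sequence above is the reciprocal-multiply form of
x / 100: 0x51eb851f is ceil(2^37 / 100), and the mull plus shr $5 computes
(x * 0x51eb851f) >> 37. A small equivalence check, my addition rather than
part of the attachment:

        #include <assert.h>
        #include <stdint.h>

        /* Mirrors the generated code: 32x32->64 multiply (mull), keep the
         * high half (%edx), then shr $5 -- a total right shift of 37. */
        static uint32_t udiv100(uint32_t x)
        {
                uint32_t hi = (uint32_t)(((uint64_t)x * 0x51EB851Fu) >> 32);
                return hi >> 5;
        }

        int main(void)
        {
                for (uint64_t x = 0; x <= 0xFFFFFFFFu; x += 9973)
                        assert(udiv100((uint32_t)x) == (uint32_t)x / 100);
                return 0;
        }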
[Bug middle-end/66240] RFE: extend -falign-xyz syntax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240 --- Comment #6 from Denis Vlasenko --- Patches v7 are posted: https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00720.html https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00721.html https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00722.html https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00723.html
[Bug c/77966] Corrupt function with -fsanitize-coverage=trace-pc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77966 Denis Vlasenko changed: What|Removed |Added CC||vda.linux at googlemail dot com --- Comment #1 from Denis Vlasenko --- Simplified a bit: - spinlock_t is not essential - mempool_t is not essential - snic_log_q_error_err_status variable is not necessary - __attribute__ ((__aligned__)) can be dropped too - struct vnic_wq can be folded OTOH: - struct vnic_wq_ctrl wrapping of int variable is necessary - wq_lock[1] unused member is necessary (makes gcc "know for sure" that wq[1] is 1-element array) - each of -O2 -fno-reorder-blocks -fsanitize-coverage=trace-pc are necessary extern unsigned int ioread32(void *); struct vnic_wq_ctrl { unsigned int error_status; }; struct snic { unsigned int wq_count; struct vnic_wq_ctrl *wq[1]; int wq_lock[1]; }; void snic_log_q_error(struct snic *snic) { unsigned int i; for (i = 0; i < snic->wq_count; i++) ioread32(&snic->wq[i]->error_status); } : 0: 53 push %rbx 1: 48 89 fbmov%rdi,%rbx 4: e8 00 00 00 00 callq __sanitizer_cov_trace_pc 9: 8b 03 mov(%rbx),%eax b: 85 c0 test %eax,%eax # snic->wq_count==0? d: 75 09 jne18 f: 5b pop%rbx # yes, 0 10: e9 00 00 00 00 jmpq __sanitizer_cov_trace_pc #tail call 15: 0f 1f 00nopl (%rax) 18: e8 00 00 00 00 callq __sanitizer_cov_trace_pc 1d: 48 8b 7b 08 mov0x8(%rbx),%rdi 21: e8 00 00 00 00 callq ioread32 26: 83 3b 01cmpl $0x1,(%rbx) # snic->wq_count<=1? 29: 76 e4 jbef 2b: e8 00 00 00 00 callq __sanitizer_cov_trace_pc Looks like gcc thinks that the loop can execute only zero or one time (or else we run off wq[1]). So when it iterated once: 21: e8 00 00 00 00 callq ioread32 it checks that snic->wq_count <= 1 26: 83 3b 01cmpl $0x1,(%rbx) 29: 76 e4 jbef and if not, we are in "impossible" land and just stop codegen. -fsanitize-coverage=trace-pc generator twitches one last time: 2b: e8 00 00 00 00 callq __sanitizer_cov_trace_pc
[Bug c/77966] Corrupt function with -fsanitize-coverage=trace-pc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77966

--- Comment #2 from Denis Vlasenko ---
Without -fsanitize-coverage=trace-pc, the second, redundant check
"snic->wq_count <= 1?" is not generated. This eliminates the hanging
"impossible" code path:

   0:  8b 07            mov    (%rdi),%eax
   2:  85 c0            test   %eax,%eax
   4:  74 09            je     f
   6:  48 8b 7f 08      mov    0x8(%rdi),%rdi
   a:  e9 00 00 00 00   jmpq   ioread32
   f:  c3               retq
[Bug target/77966] Corrupt function with -fsanitize-coverage=trace-pc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77966 --- Comment #4 from Denis Vlasenko --- This confuses object code sanity analysis tools which check that every function ends "properly", i.e. with a return or jump (possibly padded with nops). Can gcc get an option like -finsert-stop-insn-when-unreachable[=insn], making bad programs crash if they do reach "impossible" code, rather than happily running off and executing random stuff? For x86, one-byte INT3, INT1, HLT or two-byte UD2 insn would be a good choice.
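No such option exists as of this report; as a per-site workaround, a
source-level __builtin_trap() on the path the programmer considers impossible
does make gcc emit a trapping insn (ud2 on x86) instead of letting codegen
simply stop. A sketch of that workaround applied to the reduced testcase (my
addition, not something the report proposes as a fix):

        extern unsigned int ioread32(void *);

        struct vnic_wq_ctrl { unsigned int error_status; };
        struct snic {
                unsigned int wq_count;
                struct vnic_wq_ctrl *wq[1];
                int wq_lock[1];
        };

        void snic_log_q_error(struct snic *snic)
        {
                unsigned int i;

                for (i = 0; i < snic->wq_count; i++) {
                        if (i >= 1)
                                __builtin_trap();  /* reached only on the path gcc
                                                      already treats as unreachable;
                                                      emits ud2 instead of nothing */
                        ioread32(&snic->wq[i]->error_status);
                }
        }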
[Bug c/65410] New: "Short local string array" optimization doesn't happen if string has NULs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65410

            Bug ID: 65410
           Summary: "Short local string array" optimization doesn't happen
                    if string has NULs
           Product: gcc
           Version: 4.7.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vda.linux at googlemail dot com

void f(char *);
void g()
{
        char buf[12] = "1234567890";
        f(buf);
}

In the above example, "gcc -O2" creates buf[12] with immediate stores:

        subq    $24, %rsp
        movabsq $4050765991979987505, %rax
        movq    %rsp, %rdi
        movq    %rax, (%rsp)
        movl    $12345, 8(%rsp)
        call    f
        addq    $24, %rsp
        ret

But if the buf[] definition has \0 anywhere (for example, at the end, where it
does not even change the semantics of the code), the optimization is not
happening: gcc allocates a constant string and copies it into buf[]:

void f(char *);
void g()
{
        char buf[12] = "1234567890\0";
        f(buf);
}

        .section        .rodata
.LC0:
        .string "1234567890"
        .string ""
        .text
g:
        subq    $24, %rsp
        movq    .LC0(%rip), %rax
        movq    %rsp, %rdi
        movq    %rax, (%rsp)
        movl    .LC0+8(%rip), %eax
        movl    %eax, 8(%rsp)
        call    f
        addq    $24, %rsp
        ret
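Note that the two initializers produce identical buf contents: array elements
not covered by the string literal are zero-filled anyway, so the explicit "\0"
only changes which literal the front end sees, not the semantics. A small
check, added here for illustration:

        #include <assert.h>
        #include <string.h>

        void check(void)
        {
                /* remaining elements of a partially initialized array are
                   zeroed, so these two buffers are byte-for-byte identical */
                char a[12] = "1234567890";
                char b[12] = "1234567890\0";
                assert(memcmp(a, b, sizeof(a)) == 0);
        }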
[Bug c/66122] New: Bad uninlining decisions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

            Bug ID: 66122
           Summary: Bad uninlining decisions
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vda.linux at googlemail dot com
  Target Milestone: ---

On a linux kernel build, I found thousands of cases where functions which are
expected (by the programmer) to be inlined aren't actually inlined.

The following script is used to find them:

nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn

It actually finds functions which have the same name and size, and occur more
than once. There are a few false positives, but the vast majority of them are
functions which were supposed to be inlined, but weren't:

(Count) (size) (name)
    473 000b t spin_unlock_irqrestore
    449 005f t rcu_read_unlock
    355 0009 t atomic_inc
    353 006e t rcu_read_lock
    350 0075 t rcu_read_lock_sched_held
    291 000b t spin_unlock
    266 0019 t arch_local_irq_restore
    215 000b t spin_lock
    180 0011 t kzalloc
    165 0012 t list_add_tail
    161 0019 t arch_local_save_flags
    153 0016 t test_and_set_bit
    134 000b t spin_unlock_irq
    134 0009 t atomic_dec
    130 000b t spin_unlock_bh
    122 0010 t brelse
    120 0016 t test_and_clear_bit
    120 000b t spin_lock_irq
    119 001e t get_dma_ops
    117 0053 t cpumask_next
    116 0036 t kref_get
    114 001a t schedule_work
    106 000b t spin_lock_bh
    103 0019 t arch_local_irq_disable
     98 0014 t atomic_dec_and_test
     83 0020 t sg_page
     81 0037 t cpumask_check
     79 0036 t pskb_may_pull
     72 0044 t perf_fetch_caller_regs
     70 002f t cpumask_next
     68 0036 t clk_prepare_enable
     65 0018 t pci_write_config_byte
     65 0013 t tasklet_schedule
     61 0023 t init_completion
     60 002b t trace_handle_return
     59 0043 t nlmsg_trim
     59 0019 t pci_read_config_dword
     59 000c t slow_down_io
    ...
    ...

Note the tiny sizes of some functions. Let's take a look at atomic_inc:

static inline void atomic_inc(atomic_t *v)
{
        asm volatile(LOCK_PREFIX "incl %0"
                     : "+m" (v->counter));
}

You would imagine that this won't ever be deinlined, right? It's one assembly
instruction. Well, it isn't always inlined. Here's the disassembly of vmlinux:

81003000 <atomic_inc>:
81003000:       55                      push   %rbp
81003001:       48 89 e5                mov    %rsp,%rbp
81003004:       f0 ff 07                lock incl (%rdi)
81003007:       5d                      pop    %rbp
81003008:       c3                      retq

This can be fixed using __always_inline, but kernel developers hesitate to
slap thousands of __always_inline everywhere; the mood is that this is a
compiler's fault and it should not be accommodated, but fixed.

This happens quite easily with -Os (IOW: with a CC_OPTIMIZE_FOR_SIZE=y kernel
build), but -O2 is not immune either. I found a file which exhibits an example
of bad deinlining for both -O2 and -Os and I'm going to attach it.
[Bug c/66122] Bad uninlining decisions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122 --- Comment #1 from Denis Vlasenko --- Created attachment 35528 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35528&action=edit Preprocessed example exhibiting a bug This is a preprocessed kernel/locking/mutex.c file from kernel source. When built with either -O2 or -Os, it wrongly deinlines spin_lock() and spin_unlock(): $ gcc -O2 -c mutex.preprocessed.c -o mutex.preprocessed.o $ objdump -dr mutex.preprocessed.o mutex.preprocessed.o: file format elf64-x86-64 Disassembly of section .text: : 0: 80 07 01addb $0x1,(%rdi) 3: c3 retq 4: 66 66 66 2e 0f 1f 84data32 data32 nopw %cs:0x0(%rax,%rax,1) b: 00 00 00 00 00 0010 <__mutex_init>: ... 0040 : 40: e9 00 00 00 00 jmpq 45 41: R_X86_64_PC32 _raw_spin_lock-0x4 45: 66 66 2e 0f 1f 84 00data32 nopw %cs:0x0(%rax,%rax,1) 4c: 00 00 00 00 These functions are defined as: static inline __attribute__((no_instrument_function)) void spin_unlock(spinlock_t *lock) { __raw_spin_unlock(&lock->rlock); } static inline __attribute__((no_instrument_function)) void spin_lock(spinlock_t *lock) { _raw_spin_lock(&lock->rlock); } and programmer's intent was that they will always be inlined. This is with gcc-4.7.2
[Bug c/66122] Bad uninlining decisions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122 --- Comment #2 from Denis Vlasenko --- Tested with gcc-4.9.2. The attached testcase doesn't exhibit the bug, but compiling the same kernel tree, with the same .config, and then running nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn reveals that now other functions get wrongly deinlined: 8 0028 t acpi_os_allocate_zeroed 7 0011 t dst_output_sk 7 000b t hweight_long 5 0023 t umask_show 5 000f t init_once 4 0047 t uni2char 4 0028 t cmask_show 4 0025 t inv_show 4 0025 t edge_show 4 0020 t char2uni 4 001f t event_show 4 001d t acpi_node 4 0012 t t_stop 4 0012 t dst_discard 4 0011 t kzalloc 4 000b t udp_lib_close 4 0006 t udp_lib_hash 3 0059 t get_expiry 3 0025 t __uncore_inv_show 3 0025 t __uncore_edge_show 3 0023 t __uncore_umask_show 3 0023 t name_show 3 0022 t acpi_os_allocate 3 001f t __uncore_event_show 3 000d t cpumask_set_cpu 3 000a t nofill ... ... For example, hweight_long: static inline unsigned long hweight_long(unsigned long w) { return sizeof(w) == 4 ? hweight32(w) : hweight64(w); } wasn't expected by programmer to be deinlined. But it was: 81009c40 : 81009c40: 55 push %rbp 81009c41: e8 da eb 31 00 callq 81328820 <__sw_hweight64> 81009c46: 48 89 e5mov%rsp,%rbp 81009c49: 5d pop%rbp 81009c4a: c3 retq 81009c4b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) I'm going to find and attach a file which deinlines hweight_long.
[Bug c/66122] Bad uninlining decisions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

--- Comment #3 from Denis Vlasenko ---
Created attachment 35530
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35530&action=edit
Preprocessed example exhibiting a bug on gcc-4.9.2

This is a preprocessed kernel/pid.c file from the kernel source.
When built with -O2, it wrongly deinlines hweight_long.

$ gcc -O2 -c pid.preprocessed.c -o kernel.pid.o
$ objdump -dr kernel.pid.o | grep -A3 hweight_long
<hweight_long>:
   0:   e8 00 00 00 00          callq  5
                        1: R_X86_64_PC32        __sw_hweight64-0x4
   5:   c3                      retq

$ gcc -v 2>&1 | tail -1
gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC)
[Bug ipa/65740] spectacularly bad inlinining decisions with -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65740

Denis Vlasenko changed:
           What    |Removed |Added
                CC |        |vda.linux at googlemail dot com

--- Comment #3 from Denis Vlasenko ---
Bug 66122 contains more information, and a recipe for finding many examples
using a linux kernel build. For one, this is not limited to -Os (though it
does happen with -Os way more easily).
[Bug c/66122] Bad uninlining decisions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122 --- Comment #6 from Denis Vlasenko --- Got a hold on a machine with gcc version 5.1.1 20150422 (Red Hat 5.1.1-1) Pulled current Linus kernel tree and built it with this config: http://busybox.net/~vda/kernel_config2 Note that "CONFIG_CC_OPTIMIZE_FOR_SIZE is not set", i.e. it's a -O2 build. Selecting duplicate functions still shows a number of tiny uninlined functions: $ nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn 83 008a t rcu_read_lock_sched_held 48 001b t sd_driver_init 48 0012 t sd_driver_exit 48 0008 t __initcall_sd_driver_init6 47 0020 t usb_serial_module_init 47 0012 t usb_serial_module_exit 47 0008 t __initcall_usb_serial_module_init6 45 0057 t uni2char 45 0025 t char2uni 43 001f t sd_probe 40 006a t rcu_read_unlock 29 005a t cpumask_next 27 007a t rcu_read_lock 27 0011 t kzalloc 24 0022 t arch_local_save_flags 23 0041 t cpumask_check 19 0017 t phy_module_init 19 0017 t phy_module_exit 19 0008 t __initcall_phy_module_init6 18 006c t spi_write 18 003f t show_alarm 18 000b t bitmap_weight 15 0037 t show_alarms 15 0014 t init_once 14 0603 t init_engine 14 0354 t pcm_trigger 14 033b t pcm_open 14 00f8 t stop_transport 14 00db t pcm_close 14 00c8 t set_meters_on 14 00b5 t write_dsp 14 00b5 t pcm_hw_free 14 0091 t pcm_pointer 14 0090 t hw_rule_playback_channels_by_format 14 008d t send_vector 14 004f t snd_echo_vumeters_info 14 0042 t hw_rule_sample_rate 14 003e t snd_echo_vumeters_switch_put 14 0034 t audiopipe_free 14 002b t snd_echo_channels_info_info 14 0024 t snd_echo_remove 14 001b t echo_driver_init 14 0019 t pcm_analog_out_hw_params 14 0019 t arch_local_irq_restore 14 0014 t snd_echo_dev_free 14 0012 t echo_driver_exit 14 0008 t __initcall_echo_driver_init6 13 0127 t pcm_analog_out_open 13 0127 t pcm_analog_in_open 13 0039 t qdisc_peek_dequeued 13 0037 t cpumask_check 13 0022 t arch_local_irq_restore 13 001c t pcm_analog_in_hw_params 13 0006 t bcma_host_soc_unregister_driver 12 0053 t nlmsg_trim ... Such as: 811a42e0 : 811a42e0: 55 push %rbp 811a42e1: 81 ce 00 80 00 00 or $0x8000,%esi 811a42e7: 48 89 e5mov%rsp,%rbp 811a42ea: e8 f1 92 1a 00 callq <__kmalloc> 811a42ef: 5d pop%rbp 811a42f0: c3 retq 810792d0 : 810792d0: 55 push %rbp 810792d1: 48 89 e5mov%rsp,%rbp 810792d4: e8 37 a8 b7 00 callq <__bitmap_weight> 810792d9: 5d pop%rbp 810792da: c3 retq and even 88566c9b : 88566c9b: 55 push %rbp 88566c9c: 48 89 e5mov%rsp,%rbp 88566c9f: 5d pop%rbp 88566ca0: c3 retq This is an *empty function* from drivers/bcma/bcma_private.h:103 uninlined: static inline void __exit bcma_host_soc_unregister_driver(void) { } BTW it doesn't even have any callers in vmlinux. It should have been optimized out.
[Bug ipa/66122] Bad uninlining decisions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

--- Comment #8 from Denis Vlasenko ---
If you try to reproduce this with a kernel build, be sure to not select
CONFIG_OPTIMIZE_INLINING (it forces inlining by making all inline functions
__always_inline).

I didn't mention it before, but the recent (as of this writing)
gcc 5.1.1 20150422 (Red Hat 5.1.1-1) with -Os easily triggers this behavior
(more than a thousand *.o modules with spurious deinlines during a kernel
build).
[Bug ipa/66122] Bad uninlining decisions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

--- Comment #10 from Denis Vlasenko ---
(In reply to Jakub Jelinek from comment #9)
> If you expect that all functions with inline keyword must be always inlined,
> then you really should use __always_inline__ attribute. Otherwise, inline
> keyword is primarily an optimization hint to the compiler that it might be
> desirable to inline it. So, talking about uninlining or deinlining makes
> absolutely no sense,

Jakub, are you saying that compiling

static inline void spin_unlock(spinlock_t *lock)
{
        __raw_spin_unlock(&lock->rlock);
}

, where __raw_spin_unlock is a function (not a macro), to a deinlined function

spin_unlock:
        call __raw_spin_unlock
        ret

and then callers doing

        call spin_unlock

*can ever* make sense? That's ridiculous.

How about this?

static inline void atomic_inc(atomic_t *v)
{
        asm volatile(LOCK_PREFIX "incl %0"
                     : "+m" (v->counter));
}

You think it's okay to not inline one insn?

Kernel people did not take my patch which tries to fix this by
__always_inlining locking ops. Basically, they think that the compiler should
not do stupid things.
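For completeness, the forcing attribute discussed in this thread looks roughly
like the sketch below; the macro name follows the kernel's convention, but the
kernel's actual definition and its LOCK_PREFIX handling may differ in detail:

        /* Force inlining regardless of the optimizer's cost estimate. */
        #define __always_inline inline __attribute__((always_inline))

        typedef struct { volatile int counter; } atomic_t;

        static __always_inline void atomic_inc(atomic_t *v)
        {
                /* "lock" written out instead of the kernel's LOCK_PREFIX macro */
                asm volatile("lock; incl %0" : "+m" (v->counter));
        }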
[Bug c/66240] New: RFE: extend -falign-xyz syntax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

            Bug ID: 66240
           Summary: RFE: extend -falign-xyz syntax
           Product: gcc
           Version: 5.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vda.linux at googlemail dot com
  Target Milestone: ---

Experimentally, compilation with
-O2 -falign-functions=17 -falign-loops=17 -falign-jumps=17 -falign-labels=17
results in the following:
- functions are aligned using the ".p2align 5,,16" asm directive
- loops/jumps/labels are aligned using ".p2align 5"

-Os -falign-functions=17 -falign-loops=17 -falign-jumps=17 -falign-labels=17
results in the following:
- functions are not aligned
- loops/jumps/labels are aligned using ".p2align 5"

Can this be improved so that in all cases, ".p2align 5,,16" is used?
Shouldn't be that hard...

Next step (what this RFE is all about). -falign-functions=N is too simplistic.

Ingo Molnar ran some tests and it looks like on the latest x86 CPUs, 64-byte
alignment runs fastest (he tried many other possibilities). However,
developers are less than thrilled by the idea of slam-dunk 64-byte aligning
everything. Too much waste:

On 05/20/2015 02:47 AM, Linus Torvalds wrote:
> At the same time, I have to admit that I abhor a 64-byte function
> alignment, when we have a fair number of functions that are (much)
> smaller than that.
>
> Is there some way to get gcc to take the size of the function into
> account? Because aligning a 16-byte or 32-byte function on a 64-byte
> alignment is just criminally nasty and wasteful.

I propose the following: align functions to 64-byte boundaries *IF* this does
not introduce a huge amount of padding.

GNU as already has support for this:

.align N1,FILL,N3
"The third expression is also absolute, and is also optional. If it is
present, it is the maximum number of bytes that should be skipped by this
alignment directive."

So, what we want is to put something like ".align 64,,7" before every
function. 98% of functions in a typical linux kernel have a first instruction
7 or fewer bytes long. Thus, with ".align 64,,7", calling any function will at
a minimum be able to fetch one insn in one L1 read, not two. And this would be
achieved with only ~3.5 bytes per function wasted to padding on average,
whereas ".align 64" would waste 31 bytes on average.

Please extend the -falign-foo=N syntax to, say, -falign-foo=N,M, which
generates ".align M,,N-1" or equivalent.
[Bug c/66240] RFE: extend -falign-xyz syntax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240 --- Comment #2 from Denis Vlasenko --- (In reply to Josh Triplett from comment #1) > Another alternative discussed in that thread, which seems near-ideal: align > functions to a given size (for instance, 64 bytes), pack them into that size > if they fit, but avoid splitting a function across that boundary unless it's > larger than that boundary. Josh, I would be more than happy to see gcc/ld becoming clever enough to pack functions intelligently (say, align big ones to cacheline boundaries, and fit tiny ones into the resulting padding "holes"). I'm afraid in the current state of gcc code, that'll be a very tall order to fulfil. In this BZ, I'm asking for something easy-ish to be done.
[Bug rtl-optimization/64907] New: Suboptimal code (saving rbx on stack in order to save another reg in rbx)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64907

            Bug ID: 64907
           Summary: Suboptimal code (saving rbx on stack in order to save
                    another reg in rbx)
           Product: gcc
           Version: 4.7.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vda.linux at googlemail dot com

void put_16bit(unsigned short v);

void put_32bit(unsigned v)
{
        put_16bit(v);
        put_16bit(v >> 16);
}

With gcc 4.7.2 the above compiles to the following assembly:

put_32bit:
        pushq   %rbx
        movl    %edi, %ebx
        andl    $65535, %edi
        call    put_16bit
        movl    %ebx, %edi
        popq    %rbx
        shrl    $16, %edi
        jmp     put_16bit

The code saves %rbx on stack only in order to save %edi to %ebx. A simpler
alternative is to just save %rdi on stack:

put_32bit:
        pushq   %rdi
        andl    $65535, %edi
        call    put_16bit
        popq    %rdi
        shrl    $16, %edi
        jmp     put_16bit
[Bug middle-end/66240] RFE: extend -falign-xyz syntax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240 --- Comment #5 from Denis Vlasenko --- Patches v3 posted to the mailing list: https://gcc.gnu.org/ml/gcc-patches/2016-08/msg02073.html https://gcc.gnu.org/ml/gcc-patches/2016-08/msg02074.html
[Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320 Bug ID: 100320 Summary: regression: 32-bit x86 memcpy is suboptimal Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: vda.linux at googlemail dot com Target Milestone: --- Bug 21329 has returned. 32-bit x86 memory block moves are using "movl $LEN,%ecx; rep movsl" insns. However, for fixed short blocks it is more efficient to just repeat a few "movsl" insns - this allows to drop "mov $LEN,%ecx" insn. It's shorter, and more importantly, "rep movsl" are slow-start microcoded insns (they are faster than moves using general-purpose registers only on blocks larger than 100-200 bytes) - OTOH, bare "movsl" are not microcoded and take ~4 cycles to execute. 21329 was closed with it fixed: CVSROOT:/cvs/gcc Module name:gcc Branch: gcc-4_0-rhl-branch Changes by: ja...@gcc.gnu.org 2005-05-18 19:08:44 Modified files: gcc: ChangeLog gcc/config/i386: i386.c Log message: 2005-05-06 Denis Vlasenko Jakub Jelinek PR target/21329 * config/i386/i386.c (ix86_expand_movmem): Don't use rep; movsb for -Os if (movsl;)*(movsw;)?(movsb;)? sequence is shorter. Don't use rep; movs{l,q} if the repetition count is really small, instead use a sequence of movs{l,q} instructions. (the above is commit 95935e2db5c45bef5631f51538d1e10d8b5b7524 in gcc.gnu.org/git/gcc.git, seems that code was largely replaced by: commit 8c996513856f2769aee1730cb211050fef055fb5 Author: Jan Hubicka Date: Mon Nov 27 17:00:26 2006 +010 expr.c (emit_block_move_via_libcall): Export. ) With gcc version 11.0.0 20210210 (Red Hat 11.0.0-0) (GCC) I see "rep movsl"s again: void *f(void *d, const void *s) { return memcpy(d, s, 16); } $ gcc -Os -m32 -fomit-frame-pointer -c -o z.o z.c && objdump -drw z.o z.o: file format elf32-i386 Disassembly of section .text: : 0: 57 push %edi 1: b9 04 00 00 00 mov$0x4,%ecx 6: 56 push %esi 7: 8b 44 24 0c mov0xc(%esp),%eax b: 8b 74 24 10 mov0x10(%esp),%esi f: 89 c7 mov%eax,%edi 11: f3 a5 rep movsl %ds:(%esi),%es:(%edi) 13: 5e pop%esi 14: 5f pop%edi 15: c3 ret The expected code would not have "mov $0x4,%ecx" and would have "rep movsl" replaced by "movsl;movsl;movsl;movsl". The testcase from 21329 with implicit block moves via struct copies, from here https://gcc.gnu.org/bugzilla/attachment.cgi?id=8790 also demonstrates it: $ gcc -Os -m32 -fomit-frame-pointer -c -o z1.o z1.c && objdump -drw z1.o z1.o: file format elf32-i386 Disassembly of section .text: : 0: a1 00 00 00 00 mov0x0,%eax 1: R_386_32 w10 5: a3 00 00 00 00 mov%eax,0x0 6: R_386_32 t10 a: c3 ret 000b : b: a1 00 00 00 00 mov0x0,%eax c: R_386_32 w20 10: 8b 15 04 00 00 00 mov0x4,%edx 12: R_386_32w20 16: a3 00 00 00 00 mov%eax,0x0 17: R_386_32t20 1b: 89 15 04 00 00 00 mov%edx,0x4 1d: R_386_32t20 21: c3 ret 0022 : 22: 57 push %edi 23: b9 09 00 00 00 mov$0x9,%ecx 28: bf 00 00 00 00 mov$0x0,%edi29: R_386_32t21 2d: 56 push %esi 2e: be 00 00 00 00 mov$0x0,%esi2f: R_386_32w21 33: f3 a4 rep movsb %ds:(%esi),%es:(%edi) 35: 5e pop%esi 36: 5f pop%edi 37: c3 ret 0038 : 38: 57 push %edi 39: b9 0a 00 00 00 mov$0xa,%ecx 3e: bf 00 00 00 00 mov$0x0,%edi3f: R_386_32t22 43: 56 push %esi 44: be 00 00 00 00 mov$0x0,%esi45: R_386_32w22 49: f3 a4 rep movsb %ds:(%esi),%es:(%edi) 4b: 5e pop%esi 4c: 5f pop%edi 4d: c3 ret 004e : 4e: 57 push %edi 4f: b9 0b 00 00 00 mov$0xb,%ecx 54: bf 00 00 00 00 mov$0x0,%edi55: R_386_32t23 59: 56 push %esi 5a: be 00 00 00 00 mov$0x0,%esi5b: R_386_32w23 5f: f3 a4 rep movsb
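To make the two code shapes concrete, here is a sketch of both as x86
inline-asm helpers; these are my illustration of the expansions being
compared, not code from the report or from gcc:

        /* the shape gcc 11 emits: load a count, then microcoded "rep movsl" */
        static void copy16_rep(void *d, const void *s)
        {
                unsigned long cnt = 4;          /* 4 dwords = 16 bytes */
                asm volatile ("rep movsl"
                              : "+D" (d), "+S" (s), "+c" (cnt)
                              :
                              : "memory");
        }

        /* the shape the report asks for: no count register, four bare "movsl" */
        static void copy16_unrolled(void *d, const void *s)
        {
                asm volatile ("movsl; movsl; movsl; movsl"
                              : "+D" (d), "+S" (s)
                              :
                              : "memory");
        }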
[Bug target/100320] [8/9/10/11/12 Regression] 32-bit x86 memcpy is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320 --- Comment #2 from Denis Vlasenko --- The relevant code in current git seems to be: static void expand_set_or_cpymem_via_rep (rtx destmem, rtx srcmem, rtx destptr, rtx srcptr, rtx value, rtx orig_value, rtx count, machine_mode mode, bool issetmem) { rtx destexp; rtx srcexp; rtx countreg; HOST_WIDE_INT rounded_count; /* If possible, it is shorter to use rep movs. TODO: Maybe it is better to move this logic to decide_alg. */ if (mode == QImode && CONST_INT_P (count) && !(INTVAL (count) & 3) && !TARGET_PREFER_KNOWN_REP_MOVSB_STOSB && (!issetmem || orig_value == const0_rtx)) mode = SImode; if (destptr != XEXP (destmem, 0) || GET_MODE (destmem) != BLKmode) destmem = adjust_automodify_address_nv (destmem, BLKmode, destptr, 0); countreg = ix86_zero_extend_to_Pmode (scale_counter (count, GET_MODE_SIZE (mode))); if (mode != QImode) { destexp = gen_rtx_ASHIFT (Pmode, countreg, GEN_INT (exact_log2 (GET_MODE_SIZE (mode; destexp = gen_rtx_PLUS (Pmode, destexp, destptr); } else destexp = gen_rtx_PLUS (Pmode, destptr, countreg); if ((!issetmem || orig_value == const0_rtx) && CONST_INT_P (count)) { rounded_count = ROUND_DOWN (INTVAL (count), (HOST_WIDE_INT) GET_MODE_SIZE (mode)); destmem = shallow_copy_rtx (destmem); set_mem_size (destmem, rounded_count); } else if (MEM_SIZE_KNOWN_P (destmem)) clear_mem_size (destmem); if (issetmem) { value = force_reg (mode, gen_lowpart (mode, value)); emit_insn (gen_rep_stos (destptr, countreg, destmem, value, destexp)); } else { if (srcptr != XEXP (srcmem, 0) || GET_MODE (srcmem) != BLKmode) srcmem = adjust_automodify_address_nv (srcmem, BLKmode, srcptr, 0); if (mode != QImode) { srcexp = gen_rtx_ASHIFT (Pmode, countreg, GEN_INT (exact_log2 (GET_MODE_SIZE (mode; srcexp = gen_rtx_PLUS (Pmode, srcexp, srcptr); } else srcexp = gen_rtx_PLUS (Pmode, srcptr, countreg); if (CONST_INT_P (count)) { rounded_count = ROUND_DOWN (INTVAL (count), (HOST_WIDE_INT) GET_MODE_SIZE (mode)); srcmem = shallow_copy_rtx (srcmem); set_mem_size (srcmem, rounded_count); } else { if (MEM_SIZE_KNOWN_P (srcmem)) clear_mem_size (srcmem); } emit_insn (gen_rep_mov (destptr, destmem, srcptr, srcmem, countreg, destexp, srcexp)); } }
[Bug c/115875] New: -Oz optimization of "push IMM; pop REG" is used incorrectly for 64-bit constants with 31th bit set
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115875

            Bug ID: 115875
           Summary: -Oz optimization of "push IMM; pop REG" is used
                    incorrectly for 64-bit constants with 31th bit set
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vda.linux at googlemail dot com
  Target Milestone: ---

void sp_256_sub_8_p256_mod(unsigned long *r)
{
        unsigned long reg, ooff;
        asm volatile (
        "\n             subq    $0x, (%0)"
        "\n             sbbq    %1, 1*8(%0)"
        "\n             sbbq    $0, 2*8(%0)"
        "\n             movq    3*8(%0), %2"
        "\n             sbbq    $0, %2"
        "\n             addq    %1, %2"
        "\n             movq    %2, 3*8(%0)"
                : "=r" (r), "=r" (ooff), "=r" (reg)
                : "0" (r), "1" (0x)
                : "memory");
}

"gcc -fomit-frame-pointer -Oz -S tls_sp_c32.c" generates this:

        pushq   $-1
        popq    %rax    # BUG!!! gcc thinks %rax = 0x
                        # but, of course, it loads 0x instead!
        subq    $0x, (%rdi)
        sbbq    %rax, 1*8(%rdi)
        sbbq    $0, 2*8(%rdi)
        movq    3*8(%rdi), %rdx
        sbbq    $0, %rdx
        addq    %rax, %rdx
        movq    %rdx, 3*8(%rdi)
        ret

Looks like either gcc thinks "pushq $-1" truncates the value to 32 bits (in
reality, it is sign-extended), or it thinks it uses a "pop %eax" insn (no such
insn exists in 64-bit mode, only 64-bit register pops are possible).

Code generated with -Os is correct:

        orl     $-1, %eax       # zero-extended to 64 bits, correct result in %rax
        subq    $0x, (%rdi)
        sbbq    %rax, 1*8(%rdi)
        sbbq    $0, 2*8(%rdi)
        movq    3*8(%rdi), %rdx
        sbbq    $0, %rdx
        addq    %rax, %rdx
        movq    %rdx, 3*8(%rdi)
        ret

In fact, in this case "push IMM + pop REG" is 3 bytes and "orl $-1, %eax" is
3 bytes too (8-bit immediate form), so the -Oz optimization is not a win here
(same size, slower code).

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/14/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,m2,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-libstdcxx-backtrace --with-libstdcxx-zoneinfo=/usr/share/zoneinfo --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-14.0.1-20240328/obj-x86_64-redhat-linux/isl-install --enable-offload-targets=nvptx-none,amdgcn-amdhsa --enable-offload-defaulted --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux --with-build-config=bootstrap-lto --enable-link-serialization=1
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 14.0.1 20240328 (Red Hat 14.0.1-0) (GCC)
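The underlying x86-64 behavior - a pushed immediate is sign-extended to 64
bits, while a 32-bit register write is zero-extended - can be demonstrated
directly. A small sketch (my addition, x86-64 inline asm, not part of the
report):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long a, b;

                /* pushq $-1 pushes sign-extended 0xffffffffffffffff,
                   so the pop sees all 64 bits set... */
                asm ("pushq $-1; popq %0" : "=r" (a));

                /* ...while a 32-bit register write zero-extends into the
                   upper half, which is what the -Os "orl $-1, %eax" form
                   relies on */
                asm ("movl $-1, %k0" : "=r" (b));

                printf("push/pop gives:  %#llx\n", a);  /* 0xffffffffffffffff */
                printf("orl-style gives: %#llx\n", b);  /* 0xffffffff */
                return 0;
        }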
[Bug target/115875] -Oz optimization of "push IMM; pop REG" is used incorrectly for 64-bit constants with 31th bit set
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115875

--- Comment #2 from Denis Vlasenko ---
0xUL works, although it uses

        b8 ff ff ff ff          mov    $0x,%eax

instead of

        83 c8 ff                or     $0x,%eax