[Bug c/86680] New: possible gcc optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

            Bug ID: 86680
           Summary: possible gcc optimization
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: florian.laroche at googlemail dot com
  Target Milestone: ---

Created attachment 4
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=4&action=edit
testcase

I can see this on x86_64 and aarch64. The first function is compiled to
much bigger code. It seems that the alignment to 8 bytes, and thus the
fact that the size is a multiple of 8, is forgotten in some optimization
step.

best regards,

Florian La Roche

$ aarch64-linux-gnu-gcc-8 -O2 -c test.c
$ aarch64-linux-gnu-objdump -d test.o

test.o:     file format elf64-littleaarch64

Disassembly of section .text:

0000000000000000 <clear_bss1>:
   0:   90000001    adrp    x1, 0 <__bss_start1>
   4:   90000000    adrp    x0, 0 <__bss_end1>
   8:   f9400022    ldr     x2, [x1]
   c:   f9400000    ldr     x0, [x0]
  10:   eb00005f    cmp     x2, x0
  14:   54000142    b.cs    3c  // b.hs, b.nlast
  18:   d1000401    sub     x1, x0, #0x1
  1c:   aa0203e0    mov     x0, x2
  20:   cb020021    sub     x1, x1, x2
  24:   927df021    and     x1, x1, #0xfffffffffffffff8
  28:   91002021    add     x1, x1, #0x8
  2c:   8b020021    add     x1, x1, x2
  30:   f800841f    str     xzr, [x0], #8
  34:   eb01001f    cmp     x0, x1
  38:   54ffffc1    b.ne    30  // b.any
  3c:   d65f03c0    ret

0000000000000040 <clear_bss2>:
  40:   90000000    adrp    x0, 0 <__bss_start2>
  44:   90000001    adrp    x1, 0 <__bss_end2>
  48:   f9400000    ldr     x0, [x0]
  4c:   f9400021    ldr     x1, [x1]
  50:   f9400000    ldr     x0, [x0]
  54:   f9400021    ldr     x1, [x1]
  58:   eb01001f    cmp     x0, x1
  5c:   54000082    b.cs    6c  // b.hs, b.nlast
  60:   f800841f    str     xzr, [x0], #8
  64:   eb01001f    cmp     x0, x1
  68:   54ffffc3    b.cc    60  // b.lo, b.ul, b.last
  6c:   d65f03c0    ret

Please note how the second function is compiled to much smaller code.
In the first function, the instructions from "18" to "2c" should
basically be optimized away.
Compiling with -Os also gives a much better result:

$ aarch64-linux-gnu-gcc-8 -Os -c test.c
$ aarch64-linux-gnu-objdump -d test.o

test.o:     file format elf64-littleaarch64

Disassembly of section .text:

0000000000000000 <clear_bss1>:
   0:   90000000    adrp    x0, 0 <__bss_start1>
   4:   90000001    adrp    x1, 0 <__bss_end1>
   8:   f9400000    ldr     x0, [x0]
   c:   f9400021    ldr     x1, [x1]
  10:   eb01001f    cmp     x0, x1
  14:   54000043    b.cc    1c  // b.lo, b.ul, b.last
  18:   d65f03c0    ret
  1c:   f800841f    str     xzr, [x0], #8
  20:   17fffffc    b       10

0000000000000024 <clear_bss2>:
  24:   90000000    adrp    x0, 0 <__bss_start2>
  28:   90000001    adrp    x1, 0 <__bss_end2>
  2c:   f9400000    ldr     x0, [x0]
  30:   f9400021    ldr     x1, [x1]
  34:   f9400000    ldr     x0, [x0]
  38:   f9400021    ldr     x1, [x1]
  3c:   eb00003f    cmp     x1, x0
  40:   54000048    b.hi    48  // b.pmore
  44:   d65f03c0    ret
  48:   f800841f    str     xzr, [x0], #8
  4c:   17fffffc    b       3c

The problem also shows up on x86_64, from "13" to "22":

$ gcc -O2 -c test.c
$ objdump -d test.o

test.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <clear_bss1>:
   0:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 7
   7:   48 8d 15 00 00 00 00    lea    0x0(%rip),%rdx        # e
   e:   48 39 d0                cmp    %rdx,%rax
  11:   73 25                   jae    38
  13:   48 8d 48 08             lea    0x8(%rax),%rcx
  17:   48 83 c2 07             add    $0x7,%rdx
  1b:   48 29 ca                sub    %rcx,%rdx
  1e:   48 83 e2 f8             and    $0xfffffffffffffff8,%rdx
  22:   48 01 ca                add    %rcx,%rdx
  25:   0f 1f 00                nopl   (%rax)
  28:   48 c7 00 00 00 00 00    movq   $0x0,(%rax)
  2f:   48 83 c0 08             add    $0x8,%rax
  33:   48 39 d0                cmp    %rdx,%rax
  36:   75 f0                   jne    28
  38:   f3 c3                   repz retq
  3a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000000040 <clear_bss2>:
  40:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 47
  47:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 4e
  4e:   48 39 d0                cmp    %rdx,%rax
  51:   73 16                   jae    69
  53:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  58:   48 83 c0 08             add    $0x8,%rax
  5c:   48 c7 40 f8 00 00 00    movq   $0x0,-0x8(%rax)
  63:   00
  64:   48 39 d0                cmp    %rdx,%rax
  67:   72 ef                   jb     58
  69:   f3 c3                   repz retq
[Bug c/86680] possible gcc optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #3 from Florian La Roche ---
Hello Martin,

I would expect the two functions clear_bss1() and clear_bss2(), which
work on identically aligned data, to produce similar assembler output.
Yet looking at the assembler output, the first function produces many
more assembler lines. "-Os" also keeps the number of assembler lines
pretty small.

In the first assembler listing, "18" to "2c" should be removed; in the
last listing, "13" to "22" should be removed.

Here is another output from gcc, in which the additional pseudocode
shows up after optimization. The lines with the pseudo variables "_13"
to "_20" should not be produced at all.

;; Function clear_bss1 (clear_bss1, funcdef_no=0, decl_uid=3118, cgraph_uid=0, symbol_order=0)

Removing basic block 6
Removing basic block 7
Removing basic block 8
clear_bss1 ()
{
  unsigned long ivtmp.9;
  void * _11;
  unsigned long _12;
  unsigned long _13;
  unsigned long _16;
  unsigned long _17;
  unsigned long _18;
  unsigned long _19;
  unsigned long _20;

  [15.00%]:
  if (&__bss_start1 < &__bss_end1)
    goto ; [85.00%]
  else
    goto ; [15.00%]

  [12.75%]:
  ivtmp.9_7 = (unsigned long) &MEM[(void *)&__bss_start1 + 8B];
  _12 = (unsigned long) &__bss_end1;
  _13 = _12 + 7;
  _16 = _13 - ivtmp.9_7;
  _17 = _16 & 18446744073709551608;
  _18 = (unsigned long) &__bss_start1;
  _19 = _18 + 16;
  _20 = _17 + _19;

  [85.00%]:
  # ivtmp.9_10 = PHI
  _11 = (void *) ivtmp.9_10;
  MEM[base: _11, offset: -8B] = 0;
  ivtmp.9_1 = ivtmp.9_10 + 8;
  if (ivtmp.9_1 != _20)
    goto ; [85.00%]
  else
    goto ; [15.00%]

  [15.00%]:
  return;

}

;; Function clear_bss2 (clear_bss2, funcdef_no=1, decl_uid=3127, cgraph_uid=1, symbol_order=1)

Removing basic block 5
Removing basic block 6
Removing basic block 7
Removing basic block 8
clear_bss2 ()
{
  long unsigned int * bss;
  long unsigned int * __bss_end2.2_10;

  [15.00%]:
  bss_5 = __bss_start2;
  __bss_end2.2_10 = __bss_end2;
  if (bss_5 < __bss_end2.2_10)
    goto ; [85.00%]
  else
    goto ; [15.00%]

  [85.00%]:
  # bss_11 = PHI
  bss_6 = bss_11 + 8;
  MEM[base: bss_6, offset: -8B] = 0;
  if (bss_6 < __bss_end2.2_10)
    goto ; [85.00%]
  else
    goto ; [15.00%]

  [15.00%]:
  return;

}

Does this help to explain my bug entry?

best regards,

Florian La Roche
[Bug c/86680] possible gcc optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #4 from Florian La Roche ---
Right, compiling with "-O2 -fno-ivopts" resolves my issues.

best regards,

Florian La Roche
[Bug middle-end/86680] possible gcc optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #7 from Florian La Roche ---
Hello Andrew Pinski,

shouldn't the compiler see that both must be aligned to 8 bytes and thus
that their difference must also be a multiple of 8 bytes?

I haven't looked into the gcc sources, but maybe this information could
be exploited for additional optimization.

best regards,

Florian La Roche
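The alignment argument can be checked numerically. What follows is only
a minimal sketch, not GCC code: ivopts_bound() is a hypothetical helper
that mirrors the arithmetic of the pseudo variables "_12" to "_20" from
the clear_bss1 dump in comment #3, using plain integers in place of the
linker-provided addresses. Under the assumption that both bounds are
multiples of 8 and start < end, the masked expression collapses to
end + 8, so the subtract/mask/add sequence adds nothing beyond the end
address itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: reproduces _12.._20 of the clear_bss1 dump on
   plain integers. "start" and "end" stand in for &__bss_start1 and
   &__bss_end1. */
uint64_t ivopts_bound(uint64_t start, uint64_t end)
{
    /* The dump only reaches this arithmetic after the start < end check. */
    assert(start < end);

    uint64_t ivtmp = start + 8;         /* ivtmp.9_7 */
    uint64_t t13 = end + 7;             /* _13 = _12 + 7 */
    uint64_t t16 = t13 - ivtmp;         /* _16 = _13 - ivtmp.9_7 */
    uint64_t t17 = t16 & ~(uint64_t)7;  /* _17 = _16 & 0xfffffffffffffff8 */
    uint64_t t19 = start + 16;          /* _19 = _18 + 16 */

    return t17 + t19;                   /* _20, the loop exit value */
}
```

For example, ivopts_bound(0x1000, 0x1040) yields 0x1048, which is
end + 8. The mask only changes the result for an unaligned end, such as
ivopts_bound(0x1000, 0x1003), where the bound is rounded up to 0x1010.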
[Bug middle-end/86680] possible gcc optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #8 from Florian La Roche ---
I've found something the compiler optimizes quite nicely:
(Good for the compiler, but I'd be happy to stay with the original code,
which was much easier for humans to read.)

extern unsigned long __bss_start[];
extern unsigned long __bss_end[];
//extern unsigned long __bss_size;

void clear_bss(void)
{
    unsigned long *bss = __bss_start;
    unsigned long i, end = __bss_end - __bss_start;
    //unsigned long i = __bss_size;

    for (i = 0; i < end; i += sizeof (unsigned long))
        *bss++ = 0UL;
}

On aarch64 this results in the following code:

0000000000000000 <clear_bss>:
   0:   90000001    adrp    x1, 0 <__bss_end>
   4:   90000002    adrp    x2, 0 <__bss_start>
   8:   f9400021    ldr     x1, [x1]
   c:   f9400042    ldr     x2, [x2]
  10:   cb020021    sub     x1, x1, x2
  14:   9343fc21    asr     x1, x1, #3
  18:   b40000c1    cbz     x1, 30
  1c:   d2800000    mov     x0, #0x0    // #0
  20:   f822681f    str     xzr, [x0, x2]
  24:   91002000    add     x0, x0, #0x8
  28:   eb00003f    cmp     x1, x0
  2c:   54ffffa8    b.hi    20  // b.pmore
  30:   d65f03c0    ret

Jakub, your example code also resulted in pretty large code (but I've
only tested 8.0.1, not the newest release).

Thanks a lot,
best regards,

Florian La Roche
[Bug middle-end/86680] possible gcc optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #9 from Florian La Roche ---
Phew, I even introduced an error here. This one works, but it is getting
complex compared to the original code:

extern unsigned long __bss_start[];
extern unsigned long __bss_end[];

void clear_bss(void)
{
    unsigned long *bss = __bss_start;
    unsigned long i, end = (__bss_end - __bss_start) * sizeof (unsigned long);

    for (i = 0; i < end; i += sizeof (unsigned long))
        *bss++ = 0UL;
}

best regards,

Florian La Roche
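To make the difference between the loop bound in comment #8 and the one
in comment #9 concrete, here is a small sketch that counts the stores
each variant performs. The function names are hypothetical, and the
region is passed in as a pair of pointers purely so that the example
can run on a host; the code in the comments uses the linker-provided
__bss_start and __bss_end instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bound as in comment #8. The pointer difference
   is an element count, but i advances by sizeof(unsigned long), so the
   loop stops early. Returns the number of words written. */
size_t clear_with_element_bound(unsigned long *start, unsigned long *end)
{
    assert(start <= end);

    unsigned long *p = start;
    unsigned long i, limit = (unsigned long)(end - start);
    size_t stores = 0;

    for (i = 0; i < limit; i += sizeof(unsigned long)) {
        *p++ = 0UL;
        stores++;
    }
    return stores;
}

/* Hypothetical helper: bound as in comment #9, scaled to bytes so that
   it matches the step of i. Returns the number of words written. */
size_t clear_with_byte_bound(unsigned long *start, unsigned long *end)
{
    assert(start <= end);

    unsigned long *p = start;
    unsigned long i, limit = (unsigned long)(end - start) * sizeof(unsigned long);
    size_t stores = 0;

    for (i = 0; i < limit; i += sizeof(unsigned long)) {
        *p++ = 0UL;
        stores++;
    }
    return stores;
}
```

On a 16-word buffer, and assuming an 8-byte unsigned long, the first
variant performs only 2 stores, while the second performs all 16.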
[Bug middle-end/86680] possible gcc optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #10 from Florian La Roche ---
In my opinion, the result for
"end = (__bss_end - __bss_start) * sizeof (unsigned long)"
in my last testcase shows that the compiler should be able to optimize
the originally submitted test code.
(Of course, it is still completely unclear whether this makes sense to
implement.)

best regards,

Florian La Roche
[Bug middle-end/86680] possible gcc optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #11 from Florian La Roche ---
Below is my current code, which disables the ivopts optimization for
this one function and thus generates code of acceptable length.

best regards,

Florian La Roche

#if __GNUC__ > 4
#define __gcc_no_ivopts __attribute__ ((optimize("no-ivopts")))
#else
#define __gcc_no_ivopts
#endif

extern unsigned long __bss_start[], __bss_end[];

void __gcc_no_ivopts clear_bss(void)
{
    unsigned long *bss = __bss_start;

#if 1
    while (bss < __bss_end)
        *bss++ = 0UL;
#else
    unsigned long i, end = (__bss_end - __bss_start) * sizeof(unsigned long);

    for (i = 0; i < end; i += sizeof(unsigned long))
        *bss++ = 0UL;
#endif
}