[Bug c/86680] New: possible gcc optimization

2018-07-26 Thread florian.laroche at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

            Bug ID: 86680
           Summary: possible gcc optimization
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: florian.laroche at googlemail dot com
  Target Milestone: ---

Created attachment 4
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=4&action=edit
testcase

I can see this on x86_64 and aarch64. The first function compiles to much
larger code than the second. It seems that the 8-byte alignment of both
symbols, and hence the fact that their distance is a multiple of 8, is
lost in some optimization step.

best regards,

Florian La Roche




$ aarch64-linux-gnu-gcc-8 -O2 -c test.c
$ aarch64-linux-gnu-objdump -d test.o 

test.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <clear_bss1>:
   0:   90000001    adrp    x1, 0 <__bss_start1>
   4:   90000000    adrp    x0, 0 <__bss_end1>
   8:   f9400022    ldr     x2, [x1]
   c:   f9400000    ldr     x0, [x0]
  10:   eb00005f    cmp     x2, x0
  14:   54000142    b.cs    3c   // b.hs, b.nlast
  18:   d1000401    sub     x1, x0, #0x1
  1c:   aa0203e0    mov     x0, x2
  20:   cb020021    sub     x1, x1, x2
  24:   927df021    and     x1, x1, #0xfffffffffffffff8
  28:   91002021    add     x1, x1, #0x8
  2c:   8b020021    add     x1, x1, x2
  30:   f800841f    str     xzr, [x0], #8
  34:   eb01001f    cmp     x0, x1
  38:   54ffffc1    b.ne    30   // b.any
  3c:   d65f03c0    ret

0000000000000040 <clear_bss2>:
  40:   90000000    adrp    x0, 0 <__bss_start2>
  44:   90000001    adrp    x1, 0 <__bss_end2>
  48:   f9400000    ldr     x0, [x0]
  4c:   f9400021    ldr     x1, [x1]
  50:   f9400000    ldr     x0, [x0]
  54:   f9400021    ldr     x1, [x1]
  58:   eb01001f    cmp     x0, x1
  5c:   54000082    b.cs    6c   // b.hs, b.nlast
  60:   f800841f    str     xzr, [x0], #8
  64:   eb01001f    cmp     x0, x1
  68:   54ffffc3    b.cc    60   // b.lo, b.ul, b.last
  6c:   d65f03c0    ret



Note how much smaller the code for the second function is. In the first
function, the instructions from "18" to "2c" should essentially be optimized away.
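
As a sketch of what is being asked for: read off the -O2 aarch64 listing
above, instructions "18" to "2c" appear to compute the loop bound
start + ((end - 1 - start) & ~7) + 8, that is, the start address plus the
byte length rounded up to a multiple of 8. The C below only restates that
arithmetic. The helper name rounded_bound and its parameters are made up
for illustration and are not part of the testcase.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the loop bound as instructions 18..2c of the
 * -O2 aarch64 listing seem to compute it.
 *   sub x1, x0, #1 ; sub x1, x1, x2 ; and x1, x1, #~7 ;
 *   add x1, x1, #8 ; add x1, x1, x2
 * with x2 = start and x0 = end on entry. */
static uintptr_t rounded_bound(uintptr_t start, uintptr_t end)
{
    /* The listing only reaches this code when start < end. */
    assert(start < end);
    return start + ((end - 1 - start) & ~(uintptr_t)7) + 8;
}
```

Whenever start < end and their distance is a multiple of 8,
rounded_bound(start, end) is simply end, so the loop could compare against
end directly, as the code for the second function does.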


Compiling with -Os is also much better:
$ aarch64-linux-gnu-gcc-8 -Os -c test.c
$ aarch64-linux-gnu-objdump -d test.o 

test.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <clear_bss1>:
   0:   90000000    adrp    x0, 0 <__bss_start1>
   4:   90000001    adrp    x1, 0 <__bss_end1>
   8:   f9400000    ldr     x0, [x0]
   c:   f9400021    ldr     x1, [x1]
  10:   eb01001f    cmp     x0, x1
  14:   54000043    b.cc    1c   // b.lo, b.ul, b.last
  18:   d65f03c0    ret
  1c:   f800841f    str     xzr, [x0], #8
  20:   17fffffc    b       10

0000000000000024 <clear_bss2>:
  24:   90000000    adrp    x0, 0 <__bss_start2>
  28:   90000001    adrp    x1, 0 <__bss_end2>
  2c:   f9400000    ldr     x0, [x0]
  30:   f9400021    ldr     x1, [x1]
  34:   f9400000    ldr     x0, [x0]
  38:   f9400021    ldr     x1, [x1]
  3c:   eb00003f    cmp     x1, x0
  40:   54000048    b.hi    48   // b.pmore
  44:   d65f03c0    ret
  48:   f800841f    str     xzr, [x0], #8
  4c:   17fffffc    b       3c







The problem also shows up on x86_64 from "13" to "22":
$ gcc -O2 -c test.c
$ objdump -d test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <clear_bss1>:
   0:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 7
   7:   48 8d 15 00 00 00 00    lea    0x0(%rip),%rdx        # e
   e:   48 39 d0                cmp    %rdx,%rax
  11:   73 25                   jae    38
  13:   48 8d 48 08             lea    0x8(%rax),%rcx
  17:   48 83 c2 07             add    $0x7,%rdx
  1b:   48 29 ca                sub    %rcx,%rdx
  1e:   48 83 e2 f8             and    $0xfffffffffffffff8,%rdx
  22:   48 01 ca                add    %rcx,%rdx
  25:   0f 1f 00                nopl   (%rax)
  28:   48 c7 00 00 00 00 00    movq   $0x0,(%rax)
  2f:   48 83 c0 08             add    $0x8,%rax
  33:   48 39 d0                cmp    %rdx,%rax
  36:   75 f0                   jne    28
  38:   f3 c3                   repz retq
  3a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000000040 <clear_bss2>:
  40:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 47
  47:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 4e
  4e:   48 39 d0                cmp    %rdx,%rax
  51:   73 16                   jae    69
  53:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  58:   48 83 c0 08             add    $0x8,%rax
  5c:   48 c7 40 f8 00 00 00    movq   $0x0,-0x8(%rax)
  63:   00
  64:   48 39 d0                cmp    %rdx,%rax
  67:   72 ef                   jb     58
  69:   f3 c3                   repz retq

[Bug c/86680] possible gcc optimization

2018-07-26 Thread florian.laroche at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #3 from Florian La Roche  ---
Hello Martin,

I would expect the two functions clear_bss1() and clear_bss2() to operate on
identically aligned data and to produce similar assembler output.
Yet looking at the assembler output, the first function produces
many more instructions. With "-Os" the output also stays fairly small.

In the first assembler listing, "18" to "2c" should be removed; in the last
listing, "13" to "22" should be removed.



Here is another dump from gcc in which the additional pseudocode shows up
after optimization. The statements computing the pseudo variables "_13" to
"_20" should not be generated at all.



;; Function clear_bss1 (clear_bss1, funcdef_no=0, decl_uid=3118, cgraph_uid=0,
symbol_order=0)

Removing basic block 6
Removing basic block 7
Removing basic block 8
clear_bss1 ()
{
  unsigned long ivtmp.9;
  void * _11;
  unsigned long _12;
  unsigned long _13;
  unsigned long _16;
  unsigned long _17;
  unsigned long _18;
  unsigned long _19;
  unsigned long _20;

  <bb 2> [15.00%]:
  if (&__bss_start1 < &__bss_end1)
    goto <bb 3>; [85.00%]
  else
    goto <bb 5>; [15.00%]

  <bb 3> [12.75%]:
  ivtmp.9_7 = (unsigned long) &MEM[(void *)&__bss_start1 + 8B];
  _12 = (unsigned long) &__bss_end1;
  _13 = _12 + 7;
  _16 = _13 - ivtmp.9_7;
  _17 = _16 & 18446744073709551608;
  _18 = (unsigned long) &__bss_start1;
  _19 = _18 + 16;
  _20 = _17 + _19;

  <bb 4> [85.00%]:
  # ivtmp.9_10 = PHI <ivtmp.9_7(3), ivtmp.9_1(4)>
  _11 = (void *) ivtmp.9_10;
  MEM[base: _11, offset: -8B] = 0;
  ivtmp.9_1 = ivtmp.9_10 + 8;
  if (ivtmp.9_1 != _20)
    goto <bb 4>; [85.00%]
  else
    goto <bb 5>; [15.00%]

  <bb 5> [15.00%]:
  return;

}


;; Function clear_bss2 (clear_bss2, funcdef_no=1, decl_uid=3127, cgraph_uid=1,
symbol_order=1)

Removing basic block 5
Removing basic block 6
Removing basic block 7
Removing basic block 8
clear_bss2 ()
{
  long unsigned int * bss;
  long unsigned int * __bss_end2.2_10;

  <bb 2> [15.00%]:
  bss_5 = __bss_start2;
  __bss_end2.2_10 = __bss_end2;
  if (bss_5 < __bss_end2.2_10)
    goto <bb 3>; [85.00%]
  else
    goto <bb 4>; [15.00%]

  <bb 3> [85.00%]:
  # bss_11 = PHI <bss_5(2), bss_6(3)>
  bss_6 = bss_11 + 8;
  MEM[base: bss_6, offset: -8B] = 0;
  if (bss_6 < __bss_end2.2_10)
    goto <bb 3>; [85.00%]
  else
    goto <bb 4>; [15.00%]

  <bb 4> [15.00%]:
  return;

}





Does this help to explain my bug entry?


best regards,

Florian La Roche

[Bug c/86680] possible gcc optimization

2018-07-26 Thread florian.laroche at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #4 from Florian La Roche  ---
Right, compiling with "-O2 -fno-ivopts" resolves my issues.

best regards,

Florian La Roche

[Bug middle-end/86680] possible gcc optimization

2018-07-27 Thread florian.laroche at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #7 from Florian La Roche  ---
Hello Andrew Pinski,

Shouldn't the compiler see that both must be aligned to 8 bytes
and thus that their difference must also be a multiple of 8 bytes?

I haven't looked into the gcc sources, but maybe this information could
be exploited for additional optimization.

best regards,

Florian La Roche

[Bug middle-end/86680] possible gcc optimization

2018-07-27 Thread florian.laroche at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #8 from Florian La Roche  ---
I've found a variant that the compiler optimizes quite nicely.
(Good for the compiler, but I would prefer to stay with the original code,
which is much easier for humans to read.)



extern unsigned long __bss_start[];
extern unsigned long __bss_end[];
//extern unsigned long __bss_size;

void clear_bss(void)
{
    unsigned long *bss = __bss_start;
    unsigned long i, end = __bss_end - __bss_start;
    //unsigned long i = __bss_size;
    for (i = 0; i < end; i += sizeof (unsigned long))
        *bss++ = 0UL;
}




On aarch64 this results in the following code:
0000000000000000 <clear_bss>:
   0:   90000001    adrp    x1, 0 <__bss_end>
   4:   90000002    adrp    x2, 0 <__bss_start>
   8:   f9400021    ldr     x1, [x1]
   c:   f9400042    ldr     x2, [x2]
  10:   cb020021    sub     x1, x1, x2
  14:   9343fc21    asr     x1, x1, #3
  18:   b40000c1    cbz     x1, 30
  1c:   d2800000    mov     x0, #0x0    // #0
  20:   f822681f    str     xzr, [x0, x2]
  24:   91002000    add     x0, x0, #0x8
  28:   eb00003f    cmp     x1, x0
  2c:   54ffffa8    b.hi    20   // b.pmore
  30:   d65f03c0    ret


Jakub, your example code also resulted in fairly large code
(but I have only tested it with 8.0.1, not with the newest release).


Thanks a lot,
best regards,

Florian La Roche

[Bug middle-end/86680] possible gcc optimization

2018-07-27 Thread florian.laroche at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #9 from Florian La Roche  ---
Phew, I even introduced an error there. This version works, but it is
getting complex compared to the original code:



extern unsigned long __bss_start[];
extern unsigned long __bss_end[];

void clear_bss(void)
{
    unsigned long *bss = __bss_start;
    unsigned long i, end = (__bss_end - __bss_start) * sizeof (unsigned long);
    for (i = 0; i < end; i += sizeof (unsigned long))
        *bss++ = 0UL;
}
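
To illustrate the difference between the loops in Comment #8 and Comment #9,
here is a small sketch that takes the region bounds as parameters instead of
linker symbols, so that it can be run on an ordinary array. The function
names and the returned store count are illustrative additions; only the loop
shapes follow the two comments.

```c
#include <assert.h>
#include <stddef.h>

/* Loop shape from Comment #8: `end` is an element count, but `i` advances
 * by sizeof(unsigned long), so the loop stops long before the region ends.
 * Returns the number of words actually cleared. */
static size_t clear_with_element_count(unsigned long *start, unsigned long *stop)
{
    assert(start <= stop);
    unsigned long *bss = start;
    unsigned long i, end = (unsigned long)(stop - start);
    size_t cleared = 0;
    for (i = 0; i < end; i += sizeof(unsigned long)) {
        *bss++ = 0UL;
        cleared++;
    }
    return cleared;
}

/* Loop shape from Comment #9: `end` is a byte count, matching the step
 * of `i`, so every word in the region is cleared. */
static size_t clear_with_byte_count(unsigned long *start, unsigned long *stop)
{
    assert(start <= stop);
    unsigned long *bss = start;
    unsigned long i, end = (unsigned long)(stop - start) * sizeof(unsigned long);
    size_t cleared = 0;
    for (i = 0; i < end; i += sizeof(unsigned long)) {
        *bss++ = 0UL;
        cleared++;
    }
    return cleared;
}
```

With a 16-word region on an LP64 target, the first variant stores only
2 zeros while the second stores all 16.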


best regards,

Florian La Roche

[Bug middle-end/86680] possible gcc optimization

2018-07-27 Thread florian.laroche at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #10 from Florian La Roche  ---
In my opinion, the result of
"end = (__bss_end - __bss_start) * sizeof (unsigned long)"
in my last testcase shows that the compiler should be
able to optimize the originally submitted code as well.

(Of course, it is still completely unclear whether this makes sense to implement.)

best regards,

Florian La Roche

[Bug middle-end/86680] possible gcc optimization

2018-11-29 Thread florian.laroche at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680

--- Comment #11 from Florian La Roche  ---
Below is my current code, which disables ivopts for this one function and thus
generates code of acceptable size.

best regards,

Florian La Roche




#if __GNUC__ > 4
#define __gcc_no_ivopts __attribute__ ((optimize("no-ivopts")))
#else
#define __gcc_no_ivopts
#endif

extern unsigned long __bss_start[], __bss_end[];

void __gcc_no_ivopts clear_bss(void)
{
    unsigned long *bss = __bss_start;
#if 1
    while (bss < __bss_end)
        *bss++ = 0UL;
#else
    unsigned long i, end = (__bss_end - __bss_start) * sizeof(unsigned long);
    for (i = 0; i < end; i += sizeof(unsigned long))
        *bss++ = 0UL;
#endif
}