https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680
--- Comment #8 from Florian La Roche <florian.laroche at googlemail dot com> ---
I've found something the compiler optimized quite nicely:
(Good for the compiler, but I'd prefer to keep the original code,
which was much easier for humans to read.)
extern unsigned long __bss_start[];
extern unsigned long __bss_end[];
//extern unsigned long __bss_size;

void clear_bss(void)
{
    unsigned long *bss = __bss_start;
    unsigned long i, end = __bss_end - __bss_start;
    //unsigned long i = __bss_size;

    for (i = 0; i < end; i += sizeof (unsigned long))
        *bss++ = 0UL;
}
On aarch64 this compiles to the following code:
0000000000000000 <clear_bss>:
   0:   90000001    adrp    x1, 0 <__bss_end>
   4:   90000002    adrp    x2, 0 <__bss_start>
   8:   f9400021    ldr     x1, [x1]
   c:   f9400042    ldr     x2, [x2]
  10:   cb020021    sub     x1, x1, x2
  14:   9343fc21    asr     x1, x1, #3
  18:   b40000c1    cbz     x1, 30 <clear_bss+0x30>
  1c:   d2800000    mov     x0, #0x0                // #0
  20:   f822681f    str     xzr, [x0, x2]
  24:   91002000    add     x0, x0, #0x8
  28:   eb00003f    cmp     x1, x0
  2c:   54ffffa8    b.hi    20 <clear_bss+0x20>     // b.pmore
  30:   d65f03c0    ret
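
For readers following the disassembly, the loop can be read back into C
roughly as follows. This is only a sketch: the name clear_like_asm is
hypothetical, and the two bounds are passed as parameters instead of being
linker symbols so that the function can be run in a hosted test. The
compiler has folded the pointer increment into a single byte offset (x0)
that doubles as the loop counter. Because the bound is an element count
while the step is sizeof (unsigned long), the loop stops after about
count / sizeof (unsigned long) stores.

```c
#include <assert.h>

/* Hypothetical helper mirroring the generated aarch64 loop.
 * Register mapping: x2 = start, x1 = count, x0 = off. */
static void clear_like_asm(unsigned long *start, unsigned long *end)
{
    /* Precondition of this sketch only, not part of the generated code. */
    assert(start <= end);

    /* sub + asr #3: pointer difference, i.e. a count of elements. */
    unsigned long count = (unsigned long)(end - start);
    unsigned long off = 0;                  /* mov x0, #0x0 */

    if (count == 0)                         /* cbz x1, <ret> */
        return;

    do {
        /* str xzr, [x0, x2]: store zero at byte offset off from start. */
        *(unsigned long *)((char *)start + off) = 0UL;
        off += sizeof (unsigned long);      /* add x0, x0, #0x8 */
    } while (count > off);                  /* cmp x1, x0; b.hi (unsigned) */
}
```

Calling it on a 64-element array clears only the first
64 / sizeof (unsigned long) elements and leaves the rest untouched, which
matches the behaviour of the source loop.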
Jakub, your example code also resulted in fairly large code
(though I've only tested it with 8.0.1, not the newest release).
Thanks a lot,
best regards,
Florian La Roche