Hello all,
I'm not sure whether this has been posted before, but gcc generates
slightly inefficient code for 64-bit integers in several cases:
unsigned long long val;
void example1() {
    val += 0x800000000000ULL;
}
On x86 this results in the following assembly:
addl $0, val
adcl $32768, val+4
ret
The first add is unnecessary: adding 0 neither modifies val nor sets
the carry flag, so the second instruction could be a plain addl.
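To make the point concrete, here is a minimal C sketch (not gcc's
output). It assumes, as the assembly above implies, that the low 32
bits of the constant are zero and that its high word is 0x8000. The
names add_full and add_high_only are hypothetical, chosen only for
illustration.

```c
#include <stdint.h>

/* Reference: the plain 64-bit addition the compiler is asked to do. */
uint64_t add_full(uint64_t v, uint64_t k)
{
    return v + k;
}

/* Hypothetical helper: when the low 32 bits of the constant are zero,
 * only the high word needs to be touched. The low word is copied
 * through unchanged and no carry can come out of it. */
uint64_t add_high_only(uint64_t v, uint32_t k_hi)
{
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);

    hi += k_hi;                       /* wraps mod 2^32, like addl */
    return ((uint64_t)hi << 32) | lo;
}
```

For any v, add_high_only(v, 0x8000) should equal
add_full(v, (uint64_t)0x8000 << 32), which is why a single addl on
val+4 would be enough.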
This isn't too bad, but compiling for something like AVR results in
eight single-byte loads, followed by three additions (of the high
bytes), followed by eight single-byte stores.
The compiler doesn't recognize that 5 of those loads and 5 of those
stores are unnecessary.
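As a rough illustration of what an 8-bit target would ideally do, the
following C sketch performs the same addition byte by byte and only
modifies bytes 5 to 7. It is not AVR code and not what gcc emits. It
assumes the constant implied by the x86 assembly above (0x8000 in the
high word, so byte 5 is 0x80 and all other bytes are zero), and
add_via_high_bytes is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical helper: add 0x8000 << 32 to v the way a byte-oriented
 * target could, touching only bytes 5, 6 and 7 (byte 0 is the lowest). */
uint64_t add_via_high_bytes(uint64_t v)
{
    uint8_t b[8];
    unsigned sum;
    int i;

    /* Split into bytes only so the sketch is testable on any host. */
    for (i = 0; i < 8; i++)
        b[i] = (uint8_t)(v >> (8 * i));

    /* Bytes 0..4 of the constant are zero: they need no load, add
     * or store, and they cannot produce a carry. */
    sum = b[5] + 0x80u;          /* add: byte 5 of the constant is 0x80 */
    b[5] = (uint8_t)sum;
    sum = b[6] + (sum >> 8);     /* add with carry, constant byte is 0 */
    b[6] = (uint8_t)sum;
    sum = b[7] + (sum >> 8);     /* add with carry, constant byte is 0 */
    b[7] = (uint8_t)sum;

    /* Reassemble the 64-bit value from its bytes. */
    v = 0;
    for (i = 0; i < 8; i++)
        v |= (uint64_t)b[i] << (8 * i);
    return v;
}
```

Only three bytes are read and written by the arithmetic itself, which
matches the three loads, three additions and three stores one would
hope for.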
Replacing the addition with a bitwise or/xor also produces an
unnecessary instruction on x86, but produces optimal code on AVR.
Here is another inefficiency for x86:
unsigned long long val = 0;
unsigned long small = 0;
unsigned long long example1() {
    return val | small;
}
unsigned long long example2() {
    return val & small;
}
For example1 this produces (bad):
movl small, %ecx
movl val, %eax
movl val+4, %edx
pushl %ebx
xorl %ebx, %ebx
orl %ecx, %eax
orl %ebx, %edx
popl %ebx
ret
For example2 (good):
movl small, %eax
xorl %edx, %edx
andl val, %eax
ret
The RTL generated for example1 and example2 is very similar until
the fwprop1 pass.
Since the largest word size on 32-bit x86 is 4 bytes, each operation
is actually split into two.
small is zero-extended, so its upper 4 bytes are zero.
The forward propagator correctly realizes that ANDing the upper 4
bytes with zero results in zero.
However, it doesn't seem to recognize that ORing the upper 4 bytes
with zero should simply return val's high word.
This problem also occurs with the xor operation, and when
subtracting (val - small).
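The simplifications I would expect after the split can be sketched in
C as follows. This is only an illustration of the arithmetic, not of
the RTL. It assumes small is 32 bits wide (as unsigned long is on
32-bit x86), and the *_split names are hypothetical. Note that for
the subtraction the high word still has to absorb the borrow, so only
the zero operand can be dropped there.

```c
#include <stdint.h>

/* val | small: the high word is val_hi | 0, i.e. just val_hi. */
uint64_t or_split(uint64_t val, uint32_t small)
{
    uint32_t lo = (uint32_t)val | small;
    uint32_t hi = (uint32_t)(val >> 32);     /* "| 0" dropped */
    return ((uint64_t)hi << 32) | lo;
}

/* val & small: the high word is val_hi & 0, i.e. always 0. */
uint64_t and_split(uint64_t val, uint32_t small)
{
    uint32_t lo = (uint32_t)val & small;
    return (uint64_t)lo;                     /* high word is zero */
}

/* val ^ small: the high word is val_hi ^ 0, i.e. just val_hi. */
uint64_t xor_split(uint64_t val, uint32_t small)
{
    uint32_t lo = (uint32_t)val ^ small;
    uint32_t hi = (uint32_t)(val >> 32);     /* "^ 0" dropped */
    return ((uint64_t)hi << 32) | lo;
}

/* val - small: the high word is val_hi - 0 - borrow. */
uint64_t sub_split(uint64_t val, uint32_t small)
{
    uint32_t val_lo = (uint32_t)val;
    uint32_t lo = val_lo - small;            /* wraps mod 2^32 */
    uint32_t borrow = val_lo < small;        /* 1 if the low word wrapped */
    uint32_t hi = (uint32_t)(val >> 32) - borrow;
    return ((uint64_t)hi << 32) | lo;
}
```

In the or/xor cases no register has to be zeroed for the high word at
all, which is what makes the pushl/xorl/popl sequence in example1
look avoidable.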
All programs were compiled with "-O2 -Wall", although I also tried -O3
and -Os with the same result.
Thanks for any help.