Hi,
The code produced by GCC for the RL78 target is around twice as large as that
produced by IAR and I've been trying to find out why.
The project I'm working on uses an RL78/F12 with 16KB of code flash. As I have
to get a bootloader and an application into that, I have to pay close attention
to how large the code is becoming.
Looking at the assembler output for some simple examples, the problem seems to
be plainly 'bloated' code rather than a failure to squeeze out every last byte
with ingenious optimization tricks.
I've managed to build GCC myself so that I could experiment a bit but as this
is my first foray into compiler internals, I'm struggling to work out how
things fit together and what affects what.
My initial impression is that significant gains could be made by clearing away
some low-hanging fruit, but without understanding what caused that code to be
generated in the first place, it's hard to do anything about it.
In particular, I'd be interested to know what is caused (or could be improved)
by the RL78-specific code, and what comes from the generic part of GCC.
Here's an example extracted from one of the functions in our project:
--------
unsigned short gOrTest;
#define SOE0 (*(volatile unsigned short *)0xF012A)
void orTest(void)
{
    SOE0 |= 3;
    /* gOrTest |= 3; */
}
--------
This produces the following code (using -Os):
0000 C9 F2 2A 01  movw r10, #298
0004 AD F2        movw ax, r10
0006 16           movw hl, ax
0007 AB           movw ax, [hl]
0008 BD F4        movw r12, ax
000a 60           mov a, x
000b 6C 03        or a, #3
000d 9D F0        mov r8, a
000f 8D F5        mov a, r13
0011 9D F1        mov r9, a
0013 AD F2        movw ax, r10
0015 12           movw bc, ax
0016 AD F0        movw ax, r8
0018 78 00 00     movw [bc], ax
001b D7           ret
There's a lot of unnecessary register shuffling going on there. For example,
#298 could go straight into HL, and why does the same value end up in BC when
HL hasn't been touched?
Commenting out the 'SOE0' line and bringing the 'gOrTest' line back in
generates better code (though it still leaves room for optimization):
0000 8F 00 00     mov a, !_gOrTest
0003 6C 03        or a, #3
0005 9F 00 00     mov !_gOrTest, a
0008 8F 00 00     mov a, !_gOrTest+1
000b 6C 00        or a, #0
000d 9F 00 00     mov !_gOrTest+1, a
0010 D7           ret
What causes that code to be generated when using a variable instead of a fixed
memory address?
Even keeping the unnecessary 'or a, #0' and sticking to a 16-bit access, the
'SOE0' operation can be done in half the space of what GCC generated for it
(14 bytes instead of 28):
0000 36 2A 01     movw hl, #298
0003 AB           movw ax, [hl]
0004 75           mov d, a
0005 60           mov a, x
0006 6C 03        or a, #3
0008 70           mov x, a
0009 65           mov a, d
000a 6C 00        or a, #0
000c BB           movw [hl], ax
000d D7           ret
And, of course, that could be optimized further.
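To make the byte-wise split in the listings concrete, here is a minimal
host-side C sketch (not RL78 code, and not anything GCC itself does). The
names or16_bytewise and or16_selfcheck are made up for illustration. It models
AX as a 16-bit value with X as the low byte and A as the high byte, ORs each
half separately, and checks that for a mask of 3 the high-byte OR changes
nothing:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the hand-written sequence: one 16-bit load,
   an 8-bit OR on each half, one 16-bit store. */
uint16_t or16_bytewise(uint16_t ax, uint16_t mask)
{
    uint8_t x = (uint8_t)(ax & 0xFFu);  /* low byte, as in 'mov a, x' */
    uint8_t a = (uint8_t)(ax >> 8);     /* high byte, parked in D above */

    x |= (uint8_t)(mask & 0xFFu);       /* or a, #3 */
    a |= (uint8_t)(mask >> 8);          /* or a, #0: a no-op for mask 3 */

    return (uint16_t)(((unsigned)a << 8) | x);
}

/* Self-check over every 16-bit input: the split OR matches a plain
   16-bit OR, and for a mask of 3 the high byte is never changed. */
void or16_selfcheck(void)
{
    uint16_t v = 0;
    do {
        uint16_t r = or16_bytewise(v, 3);
        assert(r == (uint16_t)(v | 3));
        assert((r >> 8) == (v >> 8));
        v++;
    } while (v != 0);
}
```

Since the high half comes out unchanged for this mask, the 'or a, #0' and the
copies through D that exist only to serve it look like the first candidates
for removal.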
Excessive register copying and an apparent preference for R8 onwards over the
B, C, D, E, H and L registers seem to be among the main causes of the 'bloated'
code. Using the latter could save a byte on every 'mov': 'mov r8, a' above
takes two bytes, whereas 'mov d, a' takes one.
So, I guess my question is how much of the bloat comes from inefficiencies in
the hardware-specific code? I saw a comment in the RL78 code about performing
CSE optimization but it's not clear to me where or how that would be done. I
tried to look at the code for some other processors to get an idea but it's
hard to find things when you don't know what you're looking for :)
Any help would be gratefully received!
Regards,
Richard Hulme