[RL78] Questions about code-generation

peper03 Mon, 10 Mar 2014 08:31:16 -0700

Hi,

The code produced by GCC for the RL78 target is around twice as large as that 
produced by IAR and I've been trying to find out why.


The project I'm working on uses an RL78/F12 with 16KB of code flash.  As I have 
to get a bootloader and an application into that, I have to pay close attention 
to how large the code is becoming.

Looking at the assembler output for some simple examples, the problem seems to 
be plainly 'bloated' code rather than a failure to squeeze out every last byte 
with ingenious optimization tricks.

I've managed to build GCC myself so that I could experiment a bit but as this 
is my first foray into compiler internals, I'm struggling to work out how 
things fit together and what affects what.

My initial impression is that significant gains could be made by clearing away 
some low-hanging fruit, but without understanding what caused that code to be 
generated in the first place, it's hard to do anything about it.

In particular, I'd be interested to know what is caused (or could be improved) 
by the RL78-specific code, and what comes from the generic part of GCC.

Here's an example extracted from one of the functions in our project:

--------

unsigned short gOrTest;
#define SOE0 (*(volatile unsigned short *)0xF012A)

void orTest()
{
   SOE0 |= 3;
   /* gOrTest |= 3; */
}

--------

This produces the following code (using -Os):

  29 0000 C9 F2 2A 01                  movw  r10, #298
  30 0004 AD F2                        movw  ax, r10
  31 0006 16                           movw  hl, ax
  32 0007 AB                           movw  ax, [hl]
  33 0008 BD F4                        movw  r12, ax
  34 000a 60                           mov   a, x
  35 000b 6C 03                        or    a, #3
  36 000d 9D F0                        mov   r8, a
  37 000f 8D F5                        mov   a, r13
  38 0011 9D F1                        mov   r9, a
  39 0013 AD F2                        movw  ax, r10
  40 0015 12                           movw  bc, ax
  41 0016 AD F0                        movw  ax, r8
  42 0018 78 00 00                     movw  [bc], ax
  43 001b D7                           ret

There's a lot of unnecessary register shuffling going on there.  For example, 
#298 could be loaded straight into HL, and the same address is copied into BC 
for the store even though HL hasn't been touched.

Commenting out the 'SOE0' line and bringing the 'gOrTest' line back in 
generates better code (though it still leaves room for improvement):

  29 0000 8F 00 00                     mov   a, !_gOrTest
  30 0003 6C 03                        or    a, #3
  31 0005 9F 00 00                     mov   !_gOrTest, a
  32 0008 8F 00 00                     mov   a, !_gOrTest+1
  33 000b 6C 00                        or    a, #0
  34 000d 9F 00 00                     mov   !_gOrTest+1, a
  35 0010 D7                           ret

What causes that code to be generated when using a variable instead of a fixed 
memory address?
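
The two cases differ in more than one way (the volatile qualifier and the 
constant address), so one way to narrow this down might be to compile a few 
variants with -Os and compare the listings.  The sketch below is only a 
suggested experiment: the names gPlain, gVolatile, orPlainVar, orVolatileVar 
and orVolatilePtr are made up for illustration, and it makes no claim about 
what the RL78 backend actually does with each one.

```c
/* Hypothetical variants for comparing listings; all names are illustrative. */

unsigned short gPlain;              /* same shape as the gOrTest case      */
volatile unsigned short gVolatile;  /* identical, but qualified volatile   */

/* Non-volatile global: the compiler is free to split or narrow the access. */
void orPlainVar(void)
{
    gPlain |= 3;
}

/* Volatile global: the compiler has less freedom to change the access. */
void orVolatileVar(void)
{
    gVolatile |= 3;
}

/* Volatile access through a pointer, so the address is not a constant. */
void orVolatilePtr(volatile unsigned short *p)
{
    *p |= 3;
}
```

If orVolatileVar looks like the SOE0 listing, the volatile qualifier would be 
the likely trigger; if only the constant-address form does, the address 
handling would be the place to look.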

Even keeping the unnecessary 'or a, #0' and sticking to a 16-bit access, the 
same operation can be performed in half the space of the original SOE0 code 
(14 bytes instead of 28):


  29 0000 36 2A 01                     movw hl, #298
  30 0003 AB                           movw ax, [hl]
  31 0004 75                           mov  d, a
  32 0005 60                           mov  a, x
  33 0006 6C 03                        or   a, #3
  34 0008 70                           mov  x, a
  35 0009 65                           mov  a, d
  36 000a 6C 00                        or   a, #0
  37 000c BB                           movw [hl], ax
  38 000d D7                           ret

And, of course, that could be optimized further.
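
As a sanity check on the hand-written sequence, here is a minimal C model of 
it, operating on a plain value rather than on the SFR.  The function names 
are hypothetical, and the variables x and a merely stand in for the X (low) 
and A (high) halves of AX.  It shows that OR-ing the low byte with 3 and the 
high byte with 0 gives the same result as a single 16-bit OR, which is why 
the 'or a, #0' step is redundant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models the byte-wise sequence on a plain value. */
uint16_t or3_bytewise(uint16_t word)
{
    uint8_t x = (uint8_t)(word & 0xFFu);  /* low byte, standing in for X  */
    uint8_t a = (uint8_t)(word >> 8);     /* high byte, standing in for A */

    x |= 3u;  /* corresponds to 'or a, #3' applied to the low byte */
    a |= 0u;  /* corresponds to 'or a, #0': leaves the high byte unchanged */

    return (uint16_t)(((uint16_t)a << 8) | x);
}

/* Exhaustive check: the byte-wise form equals a plain 16-bit OR with 3. */
void or3_check_all(void)
{
    for (uint32_t v = 0; v <= 0xFFFFu; ++v) {
        assert(or3_bytewise((uint16_t)v) == (uint16_t)(v | 3u));
    }
}
```

Calling or3_check_all() on a host machine runs through all 65536 inputs; it 
says nothing about volatile semantics on the real register, only about the 
arithmetic.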

Excessive register copying and an apparent preference for R8 onwards over the 
B, C, D, E, H and L registers (which could save a byte on every 'mov') seem to 
be among the main causes of the 'bloated' code.

So, I guess my question is how much of the bloat comes from inefficiencies in 
the hardware-specific code?  I saw a comment in the RL78 code about performing 
CSE optimization but it's not clear to me where or how that would be done.  I 
tried to look at the code for some other processors to get an idea but it's 
hard to find things when you don't know what you're looking for :)

Any help would be gratefully received!

Regards,

Richard Hulme
