Re: [RL78] Questions about code-generation

2014-03-10 Thread Richard Hulme

On 10/03/14 22:37, DJ Delorie wrote:

I've managed to build GCC myself so that I could experiment a bit
but as this is my first foray into compiler internals, I'm
struggling to work out how things fit together and what affects
what.


The key thing to know about the RL78 backend is that it has two
"targets" it uses.  For the first part of the compilation, up until
after reload, the model uses 16 virtual registers (R8 through R23) and
a virtual machine to give gcc an orthogonal model that it can generate
code for.  After reload, there's a "devirtualization" pass in the RL78
backend that maps the virtual model to the real model (R0 through R7),
which means copying values in and out of the real registers according
to which addressing modes are needed.  Then GCC continues optimizing,
which gets rid of most of the unneeded instructions.

The problem you're probably running into is that deciding which real
registers to use for each virtual one is a very tricky task, and the
post-reload optimizers aren't expecting the code to look the way it
does.


What causes that code to be generated when using a variable instead
of a fixed memory address?


The use of "volatile" disables many of GCC's optimizations.  I
consider this a bug in GCC, but at the moment it needs to be "fixed"
in the backends on a case-by-case basis.


Ah, that certainly explains a lot.  How exactly would the fixing be 
done?  Is there an example I could look at for one of the other processors?


It's certainly unfortunate, since an awful lot of bit-twiddling goes on 
with the memory-mapped hardware registers (which obviously generally 
need to be declared volatile).


Just to get a feel for the potential gains, I've removed the volatile 
keyword from all the declarations and rebuilt the project.  That change 
alone reduces the code size by 3.7%.  I wouldn't want to risk running 
that code but the gain is certainly significant.


I calculated a week or two ago that we could make a code-saving of 
around 8% by using near or relative branches and near calls instead of 
always generating far calls.  I changed rl78-real.md to use near 
addressing and got about 5%.  That's probably about right.  I tried to 
generate relative branches too but I'm guessing that the 'length' 
attribute needs to be set for all instructions to get that working properly.


Obviously near/far addressing would need to be controlled by an external 
switch to allow for processors with more than 64KB code-flash.


A few small gains can be had elsewhere (using 'clrb a' in 
zero_extendqihi2_real, possibly optimizing addsi3_internal_real to avoid 
addw ax,#0 etc.).  These don't save much space in our project (about 
30-40 bytes perhaps) but it'll obviously vary from project to project.


Regards,

Richard


Re: [RL78] Questions about code-generation

2014-03-16 Thread Richard Hulme

On 10/03/14 22:37, DJ Delorie wrote:


The use of "volatile" disables many of GCC's optimizations.  I
consider this a bug in GCC, but at the moment it needs to be "fixed"
in the backends on a case-by-case basis.



Hi,

I've looked into the differences between the steps taken when using a 
variable declared volatile and when it isn't, but I'm getting a bit stuck.


Taking the following code as an example:
--
typedef struct
{
   unsigned char no0 :1;
   unsigned char no1 :1;
   unsigned char no2 :1;
   unsigned char no3 :1;
   unsigned char no4 :1;
   unsigned char no5 :1;
   unsigned char no6 :1;
   unsigned char no7 :1;
} __BITS8;

union un_if0h
{
   unsigned char if0h;
   __BITS8 BIT;
};

#define IF0H (*(volatile union un_if0h *)0xFFFE1).if0h
#define IF0H_bit (*(volatile union un_if0h *)0xFFFE1).BIT

void test(void)
{
   IF0H_bit.no5 = 1;
}

--

and compiling it with -Os and -da once as-is and once with IF0H_bit not 
declared volatile.
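
Incidentally, I don't think the bitfield is essential to reproduce this;
my assumption is that the blocker is the volatile access itself, so a
plain byte-wide OR through the IF0H macro ought to go down the same path
(although I've only actually checked the bitfield form):

void test_byte(void)
{
   IF0H |= 0x20;   /* bit 5, i.e. the same bit as IF0H_bit.no5 = 1 */
}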


The generated RTL is basically the same until the 'combine' stage

non-volatile start
Trying 5 -> 7:
Failed to match this instruction:
(parallel [
(set (reg:QI 45 [ MEM[(union un_if0h *)65505B].BIT.no5 ])
(mem/j:QI (const_int -31 [0xffe1]) [0 
MEM[(union un_if0h *)65505B].BIT.no5+0 S1 A8]))

(set (reg/f:HI 43)
(const_int -31 [0xffe1]))
])
Failed to match this instruction:
(parallel [
(set (reg:QI 45 [ MEM[(union un_if0h *)65505B].BIT.no5 ])
(mem/j:QI (const_int -31 [0xffe1]) [0 
MEM[(union un_if0h *)65505B].BIT.no5+0 S1 A8]))

(set (reg/f:HI 43)
(const_int -31 [0xffe1]))
])

Trying 7 -> 8:
Successfully matched this instruction:
(set (reg:QI 46)
(ior:QI (mem/j:QI (reg/f:HI 43) [0 MEM[(union un_if0h 
*)65505B].BIT.no5+0 S1 A8])

(const_int 32 [0x20])))
deferring deletion of insn with uid = 7.
modifying insn i3 8: r46:QI=[r43:HI]|0x20
deferring rescan insn with uid = 8.
-non-volatile end-

--volatile start--
Trying 5 -> 7:
Failed to match this instruction:
(parallel [
(set (reg:QI 45 [ MEM[(volatile union un_if0h *)65505B].BIT.no5 ])
(mem/v/j:QI (const_int -31 [0xffe1]) [0 
MEM[(volatile union un_if0h *)65505B].BIT.no5+0 S1 A8]))

(set (reg/f:HI 43)
(const_int -31 [0xffe1]))
])
Failed to match this instruction:
(parallel [
(set (reg:QI 45 [ MEM[(volatile union un_if0h *)65505B].BIT.no5 ])
(mem/v/j:QI (const_int -31 [0xffe1]) [0 
MEM[(volatile union un_if0h *)65505B].BIT.no5+0 S1 A8]))

(set (reg/f:HI 43)
(const_int -31 [0xffe1]))
])

Trying 7 -> 8:
Failed to match this instruction:
(set (reg:QI 46)
(ior:QI (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_if0h 
*)65505B].BIT.no5+0 S1 A8])

(const_int 32 [0x20])))
---volatile end---

Bearing in mind that I'm new to all this and may be missing something 
blindingly obvious, what would cause 7->8 to fail when declared volatile 
and not when not?  Does something need adding to rl78-virt.md to allow 
it to match?


It doesn't seem like this is due to missing an optimization step that 
combines insns (hmm, "combine?") but rather to not recognizing that a 
single, existing insn is possible and so splitting the operation up into 
multiple steps.


The 'Failed to match' string comes after calling 'recog' but I'm either 
too blind or too stupid to find the implementation.


The result of this (as I mentioned in my first post) is that this is 
produced:


  28                    _test:
  29 0000 C9 F2 E1 FF       movw    r10, #-31
  30 0004 AD F2             movw    ax, r10
  31 0006 16                movw    hl, ax
  32 0007 8B                mov     a, [hl]
  33 0008 6C 20             or      a, #32
  34 000a 9B                mov     [hl], a
  35 000b D7                ret

instead of this:
  28                    _test:
  29 0000 71 5A E1          set1    0xfffe1.5
  30 0003 D7                ret

Surely the optimized code is also valid for a volatile variable?  In 
fact, I would have thought it *more* valid as it performs the entire 
operation in a single instruction instead of splitting it into a very 
definite read-modify-write sequence?


Since operations on memory-mapped hardware registers are your 
bread-and-butter on a microcontroller, 'curing' this would bring 
significant gains.


Am I missing something (non-)obvious?

Regards,

Richard



Re: [RL78] Questions about code-generation

2014-03-21 Thread Richard Hulme

On 11/03/14 01:40, DJ Delorie wrote:

I'm curious.  Have you tried out other approaches before you decided
to go with the virtual registers?


Yes.  Getting GCC to understand the "unusual" addressing modes the
RL78 uses was too much for the register allocator to handle.  Even
when the addressing modes are limited to "usual" ones, GCC doesn't
have a good way to do regalloc and reload when there are limits on
what registers you can use in an address expression, and it's worse
when there are dependencies between operands, or limited numbers of
address registers.


Is it possible that the virtual pass causes inefficiencies in some cases 
by sticking with r8-r31 when one of the 'normal' registers would be better?


For example, I'm having a devil of a time convincing the compiler that 
an immediate value can be stored directly in any of the normal 16-bit 
registers (e.g. 'movw hl, #123').  I'm beginning to wonder whether it's 
the unoptimized code being fed in that's causing problems.


Taking a slight variation on my original test code (removing the 
'volatile' keyword and accessing an 8-bit memory location):




#define SOE0L (*(unsigned char *)0xF012A)

void orTest()
{
   SOE0L |= 3;
}



produces (with -O0)

  28                    _test:
  29 0000 C9 F0 2A 01       movw    r8, #298
  30 0004 C9 F2 2A 01       movw    r10, #298
  31 0008 AD F2             movw    ax, r10
  32 000a BD F4             movw    r12, ax
  33 000c FA F4             movw    hl, r12
  34 000e 8B                mov     a, [hl]
  35 000f 9D F2             mov     r10, a
  36 0011 6A F2 03          or      r10, #3
  37 0014 AD F0             movw    ax, r8
  38 0016 BD F4             movw    r12, ax
  39 0018 DA F4             movw    bc, r12
  40 001a 8D F2             mov     a, r10
  41 001c 48 00 00          mov     [bc], a
  42 001f D7                ret

In some cases, the normal optimization steps remove a lot, if not all, 
of the unnecessary register passing, but not always.


The conditions on the movhi_real insn allow an immediate value to be 
stored in (for example) HL directly, and yet I cannot find a single 
instance in my project where it isn't in the form of


movw    r8, #298
movw    ax, r10
movw    hl, ax

and no manner of re-arranging the conditions (that I've found) will 
cause the correct code to be generated.  It's determined to put the 
immediate value into rX, and then copy that into ax (which is also 
unnecessary).


I see the same problem with 'cmp' when the value to be compared is in 
the A register:


mov r8, a
cmp r8, #3

The A register is the one register that can be almost guaranteed to be 
usable with any instruction, and copying it to R8 (or wherever) to 
perform the comparison not only wastes two bytes for the move but also 
makes the cmp instruction a byte longer, so five bytes are used instead 
of two.


I looked at the code produced for IA64 and ARM targets, and although I'm 
not as familiar with those instruction sets, they didn't appear to do as 
much needless copying, which strengthens my suspicion that it's 
something in the RL78 backend that needs 'tweaking'.


The suggestions made regarding 'volatile' were very helpful and I've 
made some good savings elsewhere by adding support for different 
addressing modes and more efficient instructions but there are still a 
number of (theoretically) easy pickings that should (I feel) be possible 
before more complicated optimizations need to be looked at.


As ever, any suggestions are very gratefully received.  I hope to be 
able to post some patches once I'm comfortable that I haven't missed 
anything obvious or done something stupid.


Regards,

Richard.



Re: [RL78] Questions about code-generation

2014-03-22 Thread Richard Hulme

On 22/03/14 01:47, Jeff Law wrote:

On 03/21/14 18:35, DJ Delorie wrote:


I've found that "removing unneeded moves through registers" is
something gcc does poorly in the post-reload optimizers.  I've written
my own on some occasions (for rl78 too).  Perhaps this is a good
starting point to look at?


much needless copying, which strengthens my suspicion that it's
something in the RL78 backend that needs 'tweaking'.


Of course it is, I've said that before I think.  The RL78 uses a
virtual model until reload, then converts each virtual instruction
into multiple real instructions, then optimizes the result.  This is
going to be worse than if the real model had been used throughout
(like arm or x86), but in this case, the real model *can't* be used
throughout, because gcc can't understand it well enough to get through
regalloc and reload.  The RL78 is just too "weird" to be modelled
as-is.

I keep hoping that gcc's own post-reload optimizers would do a better
job, though.  Combine should be able to combine, for example, the "mov
r8,ax; cmp r8,#4" types of insns together.

The virtual register file was the only way I could see to make RL78
work.  I can't recall the details, but when you described the situation
to me the virtual register file was the only way I could see to make the
RL78 work in the IRA+reload world.

What would be quite interesting to try would be to continue to use the
virtualized register set, but instead use the IRA+LRA path.  Presumably
that wouldn't be terribly hard to try and there's a reasonable chance
that'll improve the code in a noticeable way.


Looking at how that's done by other backends, as far as I can tell, I 
just need to add something like:


#undef  TARGET_LRA_P
#define TARGET_LRA_P rl78_enable_lra

static bool
rl78_enable_lra (void)
{
  return true;
}

to rl78.c?  At least in theory, even if other work is needed elsewhere 
to make things run smoothly.


Unfortunately, that function never seems to be called.

How does TARGET_LRA_P get used, anyway?  I can't find anything that 
tries to use it, only places where it gets set.  Is there some funky 
preprocessor stuff going on that's stopping me grepping for it?



The next obvious thing to try, and it's probably a lot more work, would
be to see if IRA+LRA is smart enough (or can be made so with a
reasonable amount of work) to eliminate the virtual register file
completely.

Just to be clear, I'm not planning to work on this; my participation and
interest in the RL78 was limited to providing a few tips to DJ.


And from my side, I'm not trying to get anyone to work on it (though 
obviously I'm not averse to it).  I'm just looking for hints and tips so 
that I can try to understand the causes (and hopefully find some solutions).


Regards,

Richard.


Re: [RL78] Questions about code-generation

2014-03-22 Thread Richard Hulme

On 22/03/14 01:35, DJ Delorie wrote:

Is it possible that the virtual pass causes inefficiencies in some
cases by sticking with r8-r31 when one of the 'normal' registers
would be better?


That's not a fair question to ask, since the virtual pass can *only*
use r8-r31.  The first bank has to be left alone else the
devirtualizer becomes a few orders of magnitude harder, if not
impossible, to make work correctly.


What I meant was that because the virtual pass can only use r8-r31, it's 
causing unnecessary register moves to be generated because it chooses, 
say, r8 as the register for a byte compare.  Because r8 is a *valid* 
register to use with a byte compare, it sticks with it come what may and 
then causes additional instructions to be generated to make sure that 
the result to be compared definitely ends up in r8, even if the register 
the result was in is also valid for a byte compare operation.



much needless copying, which strengthens my suspicion that it's
something in the RL78 backend that needs 'tweaking'.


Of course it is, I've said that before I think.  The RL78 uses a
virtual model until reload, then converts each virtual instruction
into multiple real instructions, then optimizes the result.  This is


It may be obvious to you and everyone else on this list that it's the 
backend that needs tweaking but I've only been looking at the compiler 
internals for a couple of weeks, so unfortunately it's not obvious to me.


I'm not complaining or pointing fingers or anything like that.  I'm just 
trying to understand the reasons why things are the way they are - what 
things are happening in the backend, what's happening in the 'generic' 
part and the interactions between them.


I understand that it's easy to say 'This is what the compiler's 
generating.  That's stupid.  It should be generating this', which is why 
I'm trying to understand the reasons that cause the compiler to generate 
what it's generating.



going to be worse than if the real model had been used throughout
(like arm or x86), but in this case, the real model *can't* be used
throughout, because gcc can't understand it well enough to get through
regalloc and reload.  The RL78 is just too "weird" to be modelled
as-is.


Can you explain what is too weird about it in particular?  It certainly 
has restrictions on which registers can be used with various 
instructions, but I wouldn't have thought they were so complicated that 
they couldn't be described using the normal constraints?


Regards,

Richard.


Re: [RL78] Questions about code-generation

2014-03-24 Thread Richard Hulme

On 24/03/14 04:44, Jeff Law wrote:

On 03/22/14 05:29, Richard Hulme wrote:

On 22/03/14 01:47, Jeff Law wrote:

On 03/21/14 18:35, DJ Delorie wrote:


I've found that "removing unneeded moves through registers" is
something gcc does poorly in the post-reload optimizers.  I've written
my own on some occasions (for rl78 too).  Perhaps this is a good
starting point to look at?


much needless copying, which strengthens my suspicion that it's
something in the RL78 backend that needs 'tweaking'.


Of course it is, I've said that before I think.  The RL78 uses a
virtual model until reload, then converts each virtual instruction
into multiple real instructions, then optimizes the result.  This is
going to be worse than if the real model had been used throughout
(like arm or x86), but in this case, the real model *can't* be used
throughout, because gcc can't understand it well enough to get through
regalloc and reload.  The RL78 is just too "weird" to be modelled
as-is.

I keep hoping that gcc's own post-reload optimizers would do a better
job, though.  Combine should be able to combine, for example, the "mov
r8,ax; cmp r8,#4" types of insns together.

The virtual register file was the only way I could see to make RL78
work.  I can't recall the details, but when you described the situation
to me the virtual register file was the only way I could see to make the
RL78 work in the IRA+reload world.

What would be quite interesting to try would be to continue to use the
virtualized register set, but instead use the IRA+LRA path.  Presumably
that wouldn't be terribly hard to try and there's a reasonable chance
that'll improve the code in a noticeable way.


Looking at how that's done by other backends, as far as I can tell, I
just need to add something like:

#undef  TARGET_LRA_P
#define TARGET_LRA_P rl78_enable_lra

static bool
rl78_enable_lra (void)
{
   return true;
}

to rl78.c?  At least in theory, even if other work is needed elsewhere
to make things run smoothly.

Unfortunately, that function never seems to be called.

How does TARGET_LRA_P get used, anyway?  I can't find anything that
tries to use it, only places where it gets set.  Is there some funky
preprocessor stuff going on that's stopping me grepping for it?

That should be enough to switch to the LRA path.   It's a target hook.
Grep for "targetm.lra_p"


Hi Jeff,

Ok, I figured out what was wrong eventually.  I'd added the lines above 
*after* the declaration of the targetm variable.
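
For anyone else who trips over the same thing: the ordering matters
because TARGET_INITIALIZER expands the TARGET_* macros at the point
where targetm is defined, so (roughly, as a sketch rather than the
exact contents of rl78.c) the overrides have to come first:

static bool
rl78_enable_lra (void)
{
  return true;
}

#undef  TARGET_LRA_P
#define TARGET_LRA_P rl78_enable_lra

/* ... any other #undef/#define TARGET_* overrides ... */

/* Must come after all of the TARGET_* overrides, because
   TARGET_INITIALIZER picks them up here.  */
struct gcc_target targetm = TARGET_INITIALIZER;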


Activating LRA alone is certainly not the answer.  Whilst I can see that 
*some* of the "to me, to you" register passing has been eliminated, LRA 
seems to have an intense dislike of indirect memory addressing with an
offset.  So instead of something like:


mov   a, [sp+4]

it's now producing:

movw   ax, sp
addw   ax, #4
movw   hl, ax
mov    a, [hl]

which takes 7 bytes (compared to 4).  Overall I've got a code-size 
increase of about 31%.


I don't know why it's avoiding the indirect-with-offset addressing mode. 
It *does* generate code using it, but seemingly only as a last resort.


Something else to track down, I guess.

Regards,

Richard.



RL78 sim?

2014-03-29 Thread Richard Hulme

Hi,

So far I've been testing with hardware but I'm pretty sure I read 
somewhere about an RL78 simulator, which would be a useful addition. 
Does this simulator exist, and if so, how do I run the tests against it?


I tried 'make -k check RUNTESTFLAGS="--target_board=rl78-sim"' but 
amongst the errors I see 'ERROR: couldn't load description file for 
rl78-sim', so either the board file has a different name or I'm missing 
something on my system (a quick search didn't seem to find anything, but 
I don't really know what I'm looking for).


Regards,

Richard.


Forcing REG_DEAD?

2014-04-06 Thread Richard Hulme

Hi,

Is there a way to force the compiler to consider an operand dead?

Specifically, I've got the RL78 backend to generate SET1 and CLR1 
instructions to set and clear individual bits.  These instructions can 
either work directly on a specific memory address, or indirectly by 
putting the address into the HL register.


If more than one bit in a given byte should be set or cleared, the 
compiler uses the indirect alternative, but in most cases this actually 
leads to larger code, especially if not all bit operations on a given 
memory address are performed sequentially (e.g. 'clear bit 3 of address 
X, set bit 6 of address Y, set bit 1 of address X'; there's a small 
sketch of that case at the end of this mail).



typedef struct {
   unsigned char no0 :1;
   unsigned char no1 :1;
   unsigned char no2 :1;
   unsigned char no3 :1;
   unsigned char no4 :1;
   unsigned char no5 :1;
   unsigned char no6 :1;
   unsigned char no7 :1;
} __BITS8;

#define MEMREG (*(volatile __BITS8*)0xFFF0C)

void test()
{
   MEMREG.no1 = 1;
   MEMREG.no2 = 0;
}


Produces:

  28                    _test:
  29 0000 36 0C FF          movw    hl, #-244
  30 0003 71 92             set1    [hl].1
  31 0005 71 A3             clr1    [hl].2
  32 0007 D7                ret

Where this would be more efficient (and in real-world situations much 
more so):


  28                    _test:
  29 0000 71 1A 0C          set1    0xfff0c.1
  30 0003 71 2B 0C          clr1    0xfff0c.2
  31 0006 D7                ret



The problem seems to be during the combine phase.  With the second 
MEMREG line commented out:



Trying 9 -> 10:
Successfully matched this instruction:
(set (mem/v/j:QI (reg/f:HI 44) [3 MEM[(volatile struct __BITS8 
*)65292B].no1+0 S1 A16])
(ior:QI (mem/v/j:QI (reg/f:HI 44) [3 MEM[(volatile struct __BITS8 
*)65292B].no1+0 S1 A16])

(const_int 2 [0x2])))
deferring deletion of insn with uid = 9.
modifying insn i3  10: [r44:HI]=[r44:HI]|0x2
  REG_DEAD r44:HI
deferring rescan insn with uid = 10.

Trying 6 -> 10:
Successfully matched this instruction:
(set (mem/v/j:QI (const_int -244 [0xff0c]) [3 MEM[(volatile 
struct __BITS8 *)65292B].no1+0 S1 A16])
(ior:QI (mem/v/j:QI (const_int -244 [0xff0c]) [3 
MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])

(const_int 2 [0x2])))
deferring deletion of insn with uid = 6.
modifying insn i3  10: [0xff0c]=[0xff0c]|0x2
deferring rescan insn with uid = 10.
starting the processing of deferred insns
rescanning insn with uid = 10.
ending the processing of deferred insns


With both lines active:


Trying 9 -> 10:
Successfully matched this instruction:
(set (mem/v/j:QI (reg/f:HI 44) [3 MEM[(volatile struct __BITS8 
*)65292B].no1+0 S1 A16])
(ior:QI (mem/v/j:QI (reg/f:HI 44) [3 MEM[(volatile struct __BITS8 
*)65292B].no1+0 S1 A16])

(const_int 2 [0x2])))
deferring deletion of insn with uid = 9.
modifying insn i3  10: [r44:HI]=[r44:HI]|0x2
deferring rescan insn with uid = 10.

Trying 6 -> 10:
Failed to match this instruction:
(parallel [
(set (mem/v/j:QI (const_int -244 [0xff0c]) [3 
MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
(ior:QI (mem/v/j:QI (const_int -244 [0xff0c]) 
[3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])

(const_int 2 [0x2])))
(set (reg/f:HI 44)
(const_int -244 [0xff0c]))
])
Failed to match this instruction:
(parallel [
(set (mem/v/j:QI (const_int -244 [0xff0c]) [3 
MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
(ior:QI (mem/v/j:QI (const_int -244 [0xff0c]) 
[3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])

(const_int 2 [0x2])))
(set (reg/f:HI 44)
(const_int -244 [0xff0c]))
])


The second example leaves the destination operand 'alive', and fails to 
find a match for the direct-addressing alternative.


Is there any way of preventing the compiler going with the indirect 
alternative?  Can a 'parallel' match be defined in the machine 
description that indicates the '(set (reg/f:HI...' should be discarded?
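
For reference, the interleaved case I mentioned at the top would look 
something like this (the addresses and names are purely illustrative, 
reusing the __BITS8 struct from above):

#define REGX (*(volatile __BITS8*)0xFFF0C)
#define REGY (*(volatile __BITS8*)0xFFF0D)

void interleaved(void)
{
   REGX.no3 = 0;   /* clr1 on address X */
   REGY.no6 = 1;   /* set1 on address Y */
   REGX.no1 = 1;   /* set1 on address X again */
}

Here loading HL with one address doesn't help the other accesses at all, 
so I'd expect the direct forms to win even more clearly.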


Thanks in advance,

Richard.