Re: [RL78] Questions about code-generation
On 10/03/14 22:37, DJ Delorie wrote:
>> I've managed to build GCC myself so that I could experiment a bit but as this is my first foray into compiler internals, I'm struggling to work out how things fit together and what affects what.
>
> The key thing to know about the RL78 backend is that it has two "targets" it uses. For the first part of the compilation, up until after reload, the model uses 16 virtual registers (R8 through R23) and a virtual machine to give gcc an orthogonal model that it can generate code for. After reload, there's a "devirtualization" pass in the RL78 backend that maps the virtual model to the real model (R0 through R7), which means copying values in and out of the real registers according to which addressing modes are needed. Then GCC continues optimizing, which gets rid of most of the unneeded instructions.
>
> The problem you're probably running into is that deciding which real registers to use for each virtual one is a very tricky task, and the post-reload optimizers aren't expecting the code to look like what it does.
>
>> What causes that code to be generated when using a variable instead of a fixed memory address?
>
> The use of "volatile" disables many of GCC's optimizations. I consider this a bug in GCC, but at the moment it needs to be "fixed" in the backends on a case-by-case basis.

Ah, that certainly explains a lot. How exactly would the fixing be done? Is there an example I could look at for one of the other processors?

It's certainly unfortunate, since an awful lot of bit-twiddling goes on with the memory-mapped hardware registers (which obviously generally need to be declared volatile).

Just to get a feel for the potential gains, I've removed the volatile keyword from all the declarations and rebuilt the project. That change alone reduces the code size by 3.7%. I wouldn't want to risk running that code but the gain is certainly significant.
I calculated a week or two ago that we could make a code-saving of around 8% by using near or relative branches and near calls instead of always generating far calls. I changed rl78-real.md to use near addressing and got about 5%. That's probably about right. I tried to generate relative branches too but I'm guessing that the 'length' attribute needs to be set for all instructions to get that working properly. Obviously near/far addressing would need to be controlled by an external switch to allow for processors with more than 64KB code-flash. A few small gains can be had elsewhere (using 'clrb a' in zero_extendqihi2_real, possibly optimizing addsi3_internal_real to avoid addw ax,#0 etc.). These don't save much space in our project (about 30-40 bytes perhaps) but it'll obviously vary from project to project. Regards, Richard
Re: [RL78] Questions about code-generation
On 10/03/14 22:37, DJ Delorie wrote:
> The use of "volatile" disables many of GCC's optimizations. I consider this a bug in GCC, but at the moment it needs to be "fixed" in the backends on a case-by-case basis.

Hi,

I've looked into the differences between the steps taken when using a variable declared volatile, and when it isn't but I'm getting a bit stuck.

Taking the following code as an example:

--
    typedef struct {
        unsigned char no0 :1;
        unsigned char no1 :1;
        unsigned char no2 :1;
        unsigned char no3 :1;
        unsigned char no4 :1;
        unsigned char no5 :1;
        unsigned char no6 :1;
        unsigned char no7 :1;
    } __BITS8;

    union un_if0h {
        unsigned char if0h;
        __BITS8 BIT;
    };

    #define IF0H     (*(volatile union un_if0h *)0xFFFE1).if0h
    #define IF0H_bit (*(volatile union un_if0h *)0xFFFE1).BIT

    void test(void)
    {
        IF0H_bit.no5 = 1;
    }
--

and compiling it with -Os and -da once as-is and once with IF0H_bit not declared volatile.

The generated RTL is basically the same until the 'combine' stage.

--non-volatile start--
    Trying 5 -> 7:
    Failed to match this instruction:
    (parallel [
        (set (reg:QI 45 [ MEM[(union un_if0h *)65505B].BIT.no5 ])
            (mem/j:QI (const_int -31 [0xffe1]) [0 MEM[(union un_if0h *)65505B].BIT.no5+0 S1 A8]))
        (set (reg/f:HI 43)
            (const_int -31 [0xffe1]))
    ])
    Failed to match this instruction:
    (parallel [
        (set (reg:QI 45 [ MEM[(union un_if0h *)65505B].BIT.no5 ])
            (mem/j:QI (const_int -31 [0xffe1]) [0 MEM[(union un_if0h *)65505B].BIT.no5+0 S1 A8]))
        (set (reg/f:HI 43)
            (const_int -31 [0xffe1]))
    ])

    Trying 7 -> 8:
    Successfully matched this instruction:
    (set (reg:QI 46)
        (ior:QI (mem/j:QI (reg/f:HI 43) [0 MEM[(union un_if0h *)65505B].BIT.no5+0 S1 A8])
            (const_int 32 [0x20])))
    deferring deletion of insn with uid = 7.
    modifying insn i3     8: r46:QI=[r43:HI]|0x20
    deferring rescan insn with uid = 8.
--non-volatile end--

--volatile start--
    Trying 5 -> 7:
    Failed to match this instruction:
    (parallel [
        (set (reg:QI 45 [ MEM[(volatile union un_if0h *)65505B].BIT.no5 ])
            (mem/v/j:QI (const_int -31 [0xffe1]) [0 MEM[(volatile union un_if0h *)65505B].BIT.no5+0 S1 A8]))
        (set (reg/f:HI 43)
            (const_int -31 [0xffe1]))
    ])
    Failed to match this instruction:
    (parallel [
        (set (reg:QI 45 [ MEM[(volatile union un_if0h *)65505B].BIT.no5 ])
            (mem/v/j:QI (const_int -31 [0xffe1]) [0 MEM[(volatile union un_if0h *)65505B].BIT.no5+0 S1 A8]))
        (set (reg/f:HI 43)
            (const_int -31 [0xffe1]))
    ])

    Trying 7 -> 8:
    Failed to match this instruction:
    (set (reg:QI 46)
        (ior:QI (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_if0h *)65505B].BIT.no5+0 S1 A8])
            (const_int 32 [0x20])))
--volatile end--

Bearing in mind that I'm new to all this and may be missing something blindingly obvious, what would cause 7->8 to fail when declared volatile and not when not? Does something need adding to rl78-virt.md to allow it to match?

It doesn't seem like this is due to missing an optimization step that combines insns (hmm, "combine?") but rather to not recognizing that a single, existing insn is possible and so splitting the operation up into multiple steps.

The 'Failed to match' string comes after calling 'recog' but I'm either too blind or too stupid to find the implementation.

The result of this (as I mentioned in my first post) is that this is produced:

    _test:
    0000 C9 F2 E1 FF    movw r10, #-31
    0004 AD F2          movw ax, r10
    0006 16             movw hl, ax
    0007 8B             mov  a, [hl]
    0008 6C 20          or   a, #32
    000a 9B             mov  [hl], a
    000b D7             ret

instead of this:

    _test:
    0000 71 5A E1       set1 0xfffe1.5
    0003 D7             ret

Surely the optimized code is also valid for a volatile variable? In fact, I would have thought it *more* valid as it performs the entire operation in a single instruction instead of splitting it into a very definite read-modify-write sequence?
Since operations on memory-mapped hardware registers are your bread-and-butter on a microcontroller, 'curing' this would bring significant gains. Am I missing something (non-)obvious? Regards, Richard
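
For anyone wanting to poke at this without RL78 hardware, here is a minimal host-side sketch of the two forms of the operation. It is only an illustration under stated assumptions: `fake_if0h` is a hypothetical ordinary variable standing in for the register at 0xFFFE1, and the function names are invented for this example. Because bit-field layout is implementation-defined, the mask for `no5` is derived at run time rather than hard-coded (in the listing above it corresponds to the `#32` operand, i.e. 0x20).

```c
#include <assert.h>

typedef struct {
    unsigned char no0 : 1;
    unsigned char no1 : 1;
    unsigned char no2 : 1;
    unsigned char no3 : 1;
    unsigned char no4 : 1;
    unsigned char no5 : 1;
    unsigned char no6 : 1;
    unsigned char no7 : 1;
} __BITS8;

union un_if0h {
    unsigned char if0h;
    __BITS8 BIT;
};

/* Hypothetical stand-in for the hardware register so this runs on a host. */
static volatile union un_if0h fake_if0h;

/* Derive the byte mask that corresponds to the no5 bit-field. */
unsigned char no5_mask(void)
{
    union un_if0h u;
    u.if0h = 0;
    u.BIT.no5 = 1;
    /* Sanity check: exactly one bit of the byte is set. */
    assert(u.if0h != 0 && (u.if0h & (u.if0h - 1)) == 0);
    return u.if0h;
}

/* Bit-field form, as written in test(). */
unsigned char set_no5_bitfield(unsigned char initial)
{
    fake_if0h.if0h = initial;
    fake_if0h.BIT.no5 = 1;
    return fake_if0h.if0h;
}

/* Explicit read-modify-write form with a mask. */
unsigned char set_no5_mask(unsigned char initial)
{
    fake_if0h.if0h = initial;
    fake_if0h.if0h |= no5_mask();
    return fake_if0h.if0h;
}
```

Both functions should leave the same byte behind for any starting value, which is the sense in which a single-bit set instruction would be an equivalent translation of the volatile store. Whether the hardware cares about the difference between one access and a separate read and write is a property of the individual register, not something this sketch can show.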
Re: [RL78] Questions about code-generation
On 11/03/14 01:40, DJ Delorie wrote:
>> I'm curious. Have you tried out other approaches before you decided to go with the virtual registers?
>
> Yes. Getting GCC to understand the "unusual" addressing modes the RL78 uses was too much for the register allocator to handle. Even when the addressing modes are limited to "usual" ones, GCC doesn't have a good way to do regalloc and reload when there are limits on what registers you can use in an address expression, and it's worse when there are dependencies between operands, or limited numbers of address registers.

Is it possible that the virtual pass causes inefficiencies in some cases by sticking with r8-r31 when one of the 'normal' registers would be better?

For example, I'm having a devil of a time convincing the compiler that an immediate value can be stored directly in any of the normal 16-bit registers (e.g. 'movw hl, #123'). I'm beginning to wonder whether it's the unoptimized code being fed in that's causing problems.

Taking a slight variation on my original test code (removing the 'volatile' keyword and accessing an 8-bit memory location):

    #define SOE0L (*(unsigned char *)0xF012A)

    void orTest()
    {
        SOE0L |= 3;
    }

produces (with -O0)

    _test:
    0000 C9 F0 2A 01    movw r8, #298
    0004 C9 F2 2A 01    movw r10, #298
    0008 AD F2          movw ax, r10
    000a BD F4          movw r12, ax
    000c FA F4          movw hl, r12
    000e 8B             mov  a, [hl]
    000f 9D F2          mov  r10, a
    0011 6A F2 03       or   r10, #3
    0014 AD F0          movw ax, r8
    0016 BD F4          movw r12, ax
    0018 DA F4          movw bc, r12
    001a 8D F2          mov  a, r10
    001c 48 00 00       mov  [bc], a
    001f D7             ret

In some cases, the normal optimization steps remove a lot, if not all, of the unnecessary register passing, but not always.

The conditions on the movhi_real insn allow an immediate value to be stored in (for example) HL directly, and yet I cannot find a single instance in my project where it isn't in the form of

    movw r8, #298
    movw ax, r10
    movw hl, ax

and no manner of re-arranging the conditions (that I've found) will cause the correct code to be generated. It's determined to put the immediate value into rX, and then copy that into ax (which is also unnecessary).

I see the same problem with 'cmp' when the value to be compared is in the A register:

    mov r8, a
    cmp r8, #3

The A register is the one register that can be almost guaranteed to be usable with any instruction, and copying it to R8 (or wherever) to perform the comparison not only wastes two bytes for the move but also makes the cmp instruction a byte longer, so five bytes are used instead of two.

I looked at the code produced for IA64 and ARM targets, and although I'm not as familiar with those instruction sets, they didn't appear to do as much needless copying, which strengthens my suspicion that it's something in the RL78 backend that needs 'tweaking'.

The suggestions made regarding 'volatile' were very helpful and I've made some good savings elsewhere by adding support for different addressing modes and more efficient instructions but there are still a number of (theoretically) easy pickings that should (I feel) be possible before more complicated optimizations need to be looked at.

As ever, any suggestions are very gratefully received. I hope to be able to post some patches once I'm comfortable that I haven't missed anything obvious or done something stupid.

Regards,

Richard.
Re: [RL78] Questions about code-generation
On 22/03/14 01:47, Jeff Law wrote:
> On 03/21/14 18:35, DJ Delorie wrote:
>> I've found that "removing unneeded moves through registers" is something gcc does poorly in the post-reload optimizers. I've written my own on some occasions (for rl78 too). Perhaps this is a good starting point to look at?
>>
>>> much needless copying, which strengthens my suspicion that it's something in the RL78 backend that needs 'tweaking'.
>>
>> Of course it is, I've said that before I think. The RL78 uses a virtual model until reload, then converts each virtual instruction into multiple real instructions, then optimizes the result. This is going to be worse than if the real model had been used throughout (like arm or x86), but in this case, the real model *can't* be used throughout, because gcc can't understand it well enough to get through regalloc and reload. The RL78 is just too "weird" to be modelled as-is.
>>
>> I keep hoping that gcc's own post-reload optimizers would do a better job, though. Combine should be able to combine, for example, the "mov r8,ax; cmp r8,#4" types of insns together.
>
> The virtual register file was the only way I could see to make RL78 work. I can't recall the details, but when you described the situation to me the virtual register file was the only way I could see to make the RL78 work in the IRA+reload world.
>
> What would be quite interesting to try would be to continue to use the virtualized register set, but instead use the IRA+LRA path. Presumably that wouldn't be terribly hard to try and there's a reasonable chance that'll improve the code in a noticeable way.

Looking at how that's done by other backends, as far as I can tell, I just need to add something like:

    #undef  TARGET_LRA_P
    #define TARGET_LRA_P rl78_enable_lra

    static bool
    rl78_enable_lra (void)
    {
      return true;
    }

to rl78.c? At least in theory, even if other work is needed elsewhere to make things run smoothly.

Unfortunately, that function never seems to be called. How does TARGET_LRA_P get used, anyway? I can't find anything that tries to use it, only places where it gets set. Is there some funky preprocessor stuff going on that's stopping me grepping for it?

> The next obvious thing to try, and it's probably a lot more work, would be to see if IRA+LRA is smart enough (or can be made so with a reasonable amount of work) to eliminate the virtual register file completely.
>
> Just to be clear, I'm not planning to work on this; my participation and interest in the RL78 was limited to providing a few tips to DJ.

And from my side, I'm not trying to get anyone to work on it (though obviously I'm not averse to it). I'm just looking for hints and tips so that I can try to understand the causes (and hopefully find some solutions).

Regards,

Richard.
Re: [RL78] Questions about code-generation
On 22/03/14 01:35, DJ Delorie wrote:
>> Is it possible that the virtual pass causes inefficiencies in some cases by sticking with r8-r31 when one of the 'normal' registers would be better?
>
> That's not a fair question to ask, since the virtual pass can *only* use r8-r31. The first bank has to be left alone else the devirtualizer becomes a few orders of magnitude harder, if not impossible, to make work correctly.

What I meant was that because the virtual pass can only use r8-r31, it's causing unnecessary register moves to be generated because it chooses, say, r8 as the register for a byte compare. Because r8 is a *valid* register to use with a byte compare, it sticks with it come what may and then causes additional instructions to be generated to make sure that the result to be compared definitely ends up in r8, even if the register the result was in is also valid for a byte compare operation.

>> much needless copying, which strengthens my suspicion that it's something in the RL78 backend that needs 'tweaking'.
>
> Of course it is, I've said that before I think. The RL78 uses a virtual model until reload, then converts each virtual instruction into multiple real instructions, then optimizes the result. This is

It may be obvious to you and everyone else on this list that it's the backend that needs tweaking but I've only been looking at the compiler internals for a couple of weeks, so unfortunately it's not obvious to me. I'm not complaining or pointing fingers or anything like that. I'm just trying to understand the reasons why things are the way they are: what is happening in the backend, what is happening in the 'generic' part, and the interactions between them.

I understand that it's easy to say 'This is what the compiler's generating. That's stupid. It should be generating this', which is why I'm trying to understand the reasons that cause the compiler to generate what it's generating.

> going to be worse than if the real model had been used throughout (like arm or x86), but in this case, the real model *can't* be used throughout, because gcc can't understand it well enough to get through regalloc and reload. The RL78 is just too "weird" to be modelled as-is.

Can you explain what is too weird about it in particular? It certainly has restrictions on which registers can be used with various instructions but I wouldn't have thought they were so complicated that they couldn't be described using the normal constraints?

Regards,

Richard.
Re: [RL78] Questions about code-generation
On 24/03/14 04:44, Jeff Law wrote:
> On 03/22/14 05:29, Richard Hulme wrote:
>> On 22/03/14 01:47, Jeff Law wrote:
>>> On 03/21/14 18:35, DJ Delorie wrote:
>>>> I've found that "removing unneeded moves through registers" is something gcc does poorly in the post-reload optimizers. I've written my own on some occasions (for rl78 too). Perhaps this is a good starting point to look at?
>>>>
>>>>> much needless copying, which strengthens my suspicion that it's something in the RL78 backend that needs 'tweaking'.
>>>>
>>>> Of course it is, I've said that before I think. The RL78 uses a virtual model until reload, then converts each virtual instruction into multiple real instructions, then optimizes the result. This is going to be worse than if the real model had been used throughout (like arm or x86), but in this case, the real model *can't* be used throughout, because gcc can't understand it well enough to get through regalloc and reload. The RL78 is just too "weird" to be modelled as-is.
>>>>
>>>> I keep hoping that gcc's own post-reload optimizers would do a better job, though. Combine should be able to combine, for example, the "mov r8,ax; cmp r8,#4" types of insns together.
>>>
>>> The virtual register file was the only way I could see to make RL78 work. I can't recall the details, but when you described the situation to me the virtual register file was the only way I could see to make the RL78 work in the IRA+reload world.
>>>
>>> What would be quite interesting to try would be to continue to use the virtualized register set, but instead use the IRA+LRA path. Presumably that wouldn't be terribly hard to try and there's a reasonable chance that'll improve the code in a noticeable way.
>>
>> Looking at how that's done by other backends, as far as I can tell, I just need to add something like:
>>
>>     #undef  TARGET_LRA_P
>>     #define TARGET_LRA_P rl78_enable_lra
>>
>>     static bool
>>     rl78_enable_lra (void)
>>     {
>>       return true;
>>     }
>>
>> to rl78.c? At least in theory, even if other work is needed elsewhere to make things run smoothly.
>>
>> Unfortunately, that function never seems to be called. How does TARGET_LRA_P get used, anyway? I can't find anything that tries to use it, only places where it gets set. Is there some funky preprocessor stuff going on that's stopping me grepping for it?
>
> That should be enough to switch to the LRA path. It's a target hook. Grep for "targetm.lra_p"

Hi Jeff,

Ok, I figured out what was wrong eventually. I'd added the lines above *after* the declaration of the targetm variable.

Activating LRA alone is certainly not the answer. Whilst I can see that *some* of the "to me, to you" register passing has been eliminated, LRA seems to have an intense dislike of indirect memory addressing with an offset.

So instead of something like:

    mov a, [sp+4]

it's now producing:

    movw ax, sp
    addw ax, #4
    movw hl, ax
    mov  a, [hl]

which takes 7 bytes (compared to 2). Overall I've got a code increase of about 31%.

I don't know why it's avoiding the indirect with offset addressing mode. It *does* generate code using it but seemingly as a last resort. Something else to track down, I guess.

Regards,

Richard.
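
As a footnote on why the position of the #define mattered: the hook macros are only consulted at the point where the hook table is built from TARGET_INITIALIZER, so an override that appears later in the file is never seen. The sketch below is a cut-down mock and not GCC's real code: `struct mock_target`, `mock_default_lra_p` and the two table names are hypothetical, and the real table has many more members.

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for the target hook table. */
struct mock_target {
    bool (*lra_p)(void);
};

/* Stand-in for the default that the generic headers would supply. */
static bool mock_default_lra_p(void) { return false; }
#define TARGET_LRA_P mock_default_lra_p

/* The initializer refers to the hook macro by name, so it picks up
   whatever TARGET_LRA_P means at the point where it is expanded. */
#define TARGET_INITIALIZER { TARGET_LRA_P }

/* Table built BEFORE the backend override: it captures the default. */
struct mock_target targetm_before_override = TARGET_INITIALIZER;

/* The backend's override, as in the snippet quoted above. */
static bool rl78_enable_lra(void) { return true; }
#undef  TARGET_LRA_P
#define TARGET_LRA_P rl78_enable_lra

/* Table built AFTER the override: it captures rl78_enable_lra. */
struct mock_target targetm_after_override = TARGET_INITIALIZER;
```

Calling `targetm_before_override.lra_p()` goes to the default and `targetm_after_override.lra_p()` goes to the override, which mirrors the symptom of the function never being called. It also explains why grepping for TARGET_LRA_P only finds definitions: the callers go through the table member instead.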
RL78 sim?
Hi,

So far I've been testing with hardware but I'm pretty sure I read somewhere about an RL78 simulator, which would be a useful addition.

Does this simulator exist, and if so, how do I run the tests against it?

I tried 'make -k check RUNTESTFLAGS="--target_board=rl78-sim"' but in amongst the errors I see 'ERROR: couldn't load description file for rl78-sim', so either it has a different name or I'm missing something on my system (a quick search didn't find anything, but I don't really know what I'm looking for).

Regards,

Richard.
Forcing REG_DEAD?
Hi,

Is there a way to force the compiler to consider an operand dead?

Specifically, I've got the RL78 backend to generate SET1 and CLR1 instructions to set and clear individual bits. These instructions can either work on the contents of a specific memory address, or indirectly by putting the memory address into the HL register.

If more than one bit in a given byte should be set or cleared, the compiler uses the indirect alternative but in most cases this actually leads to larger code especially if not all bit operations on any given memory address are performed sequentially (e.g. 'clear bit 3 of address X, set bit 6 of address Y, set bit 1 of address X').

    typedef struct {
        unsigned char no0 :1;
        unsigned char no1 :1;
        unsigned char no2 :1;
        unsigned char no3 :1;
        unsigned char no4 :1;
        unsigned char no5 :1;
        unsigned char no6 :1;
        unsigned char no7 :1;
    } __BITS8;

    #define MEMREG (*(volatile __BITS8*)0xFFF0C)

    void test()
    {
        MEMREG.no1 = 1;
        MEMREG.no2 = 0;
    }

Produces:

    _test:
    0000 36 0C FF       movw hl, #-244
    0003 71 92          set1 [hl].1
    0005 71 A3          clr1 [hl].2
    0007 D7             ret

Where this would be more efficient (and in real-world situations much more so):

    _test:
    0000 71 1A 0C       set1 0xfff0c.1
    0003 71 2B 0C       clr1 0xfff0c.2
    0006 D7             ret

The problem seems to be during the combine phase.

With the second MEMREG line commented out:

    Trying 9 -> 10:
    Successfully matched this instruction:
    (set (mem/v/j:QI (reg/f:HI 44) [3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
        (ior:QI (mem/v/j:QI (reg/f:HI 44) [3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
            (const_int 2 [0x2])))
    deferring deletion of insn with uid = 9.
    modifying insn i3    10: [r44:HI]=[r44:HI]|0x2
          REG_DEAD r44:HI
    deferring rescan insn with uid = 10.

    Trying 6 -> 10:
    Successfully matched this instruction:
    (set (mem/v/j:QI (const_int -244 [0xff0c]) [3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
        (ior:QI (mem/v/j:QI (const_int -244 [0xff0c]) [3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
            (const_int 2 [0x2])))
    deferring deletion of insn with uid = 6.
    modifying insn i3    10: [0xff0c]=[0xff0c]|0x2
    deferring rescan insn with uid = 10.
    starting the processing of deferred insns
    rescanning insn with uid = 10.
    ending the processing of deferred insns

With both lines active:

    Trying 9 -> 10:
    Successfully matched this instruction:
    (set (mem/v/j:QI (reg/f:HI 44) [3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
        (ior:QI (mem/v/j:QI (reg/f:HI 44) [3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
            (const_int 2 [0x2])))
    deferring deletion of insn with uid = 9.
    modifying insn i3    10: [r44:HI]=[r44:HI]|0x2
    deferring rescan insn with uid = 10.

    Trying 6 -> 10:
    Failed to match this instruction:
    (parallel [
        (set (mem/v/j:QI (const_int -244 [0xff0c]) [3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
            (ior:QI (mem/v/j:QI (const_int -244 [0xff0c]) [3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
                (const_int 2 [0x2])))
        (set (reg/f:HI 44)
            (const_int -244 [0xff0c]))
    ])
    Failed to match this instruction:
    (parallel [
        (set (mem/v/j:QI (const_int -244 [0xff0c]) [3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
            (ior:QI (mem/v/j:QI (const_int -244 [0xff0c]) [3 MEM[(volatile struct __BITS8 *)65292B].no1+0 S1 A16])
                (const_int 2 [0x2])))
        (set (reg/f:HI 44)
            (const_int -244 [0xff0c]))
    ])

The second example leaves the destination operand 'alive', and fails to find a match for the direct-addressing alternative.

Is there any way of preventing the compiler going with the indirect alternative? Can a 'parallel' match be defined in the machine description that indicates the '(set (reg/f:HI...' should be discarded?

Thanks in advance,

Richard.
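
To put rough numbers on the size argument, the byte counts can be read straight off the two listings in this message: 3 bytes to load HL, 2 bytes per indirect bit operation, 3 bytes per direct one. The helper names below are invented for illustration, and the "reloaded" variant rests on an assumption that HL has to be loaded again before every operation when accesses to different addresses are interleaved, which is a worst case rather than something the dumps above demonstrate.

```c
/* Byte counts taken from the listings above. */
enum {
    LOAD_HL_BYTES        = 3, /* movw hl, #imm16  (36 0C FF) */
    BITOP_INDIRECT_BYTES = 2, /* set1/clr1 [hl].n (71 92)    */
    BITOP_DIRECT_BYTES   = 3  /* set1/clr1 addr.n (71 1A 0C) */
};

/* ops consecutive bit operations on one address, HL loaded once. */
unsigned indirect_bytes(unsigned ops)
{
    return ops ? LOAD_HL_BYTES + ops * BITOP_INDIRECT_BYTES : 0;
}

/* The same operations using direct addressing. */
unsigned direct_bytes(unsigned ops)
{
    return ops * BITOP_DIRECT_BYTES;
}

/* Assumed worst case: HL is reloaded before every single operation. */
unsigned indirect_bytes_reloaded(unsigned ops)
{
    return ops * (LOAD_HL_BYTES + BITOP_INDIRECT_BYTES);
}
```

With two operations this gives 7 bytes against 6, matching the position of `ret` in the two listings. The forms break even at three operations (9 bytes each), and the indirect form only wins from four consecutive operations on the same address onwards.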