Re: Instruction scheduling for the R5900's 2 integer pipelines
On 01/19/2016 05:04 AM, Woon yung Liu wrote:

> Hi, I am trying to complete support for the MIPS R5900 by adding support for its second integer multiplication/division pipe. GCC currently supports only the first one. My target at this moment is the public GCC v5.3.0 release.
>
> To get the 2nd pipeline supported, I've added the hi1 and lo1 registers to GCC, as well as constraints for them (wr for hi1lo1 and wl for lo1). The existing instructions in mips.md have been modified to use the new constraints as new alternatives. A new constraint modifier was added too, which appends a 1 to the instruction (i.e. changes mult to mult1) if it detects that the specified operand is for pipeline 1 instead of 0.
>
> The 2nd pipeline is utilized by using different instructions (i.e. mult1 instead of mult, as mult is for the 1st pipeline) and registers (i.e. lo1 and hi1, instead of lo and hi).
>
> Right now, I know that it is possible for GCC to output the new instructions for the 2nd pipeline if I manipulate the MD constraints for instructions like mult, but GCC never seems to use the 2nd pipeline on its own otherwise. I originally believed that this was because I hadn't added a pipeline description to my MD file (5900.md), but nothing seemed to change after I did that.
>
> I followed the documentation on the pipeline description, but I realized that I still don't understand how the automaton will tell GCC which alternative (and hence which integer pipe) to use, so I don't think there's a relationship between the automaton and the two different sets of multiplication/division instructions yet.
>
> Could somebody please advise me on how to get this going? Or at least tell me which other target has two integer pipelines that are used in this way, so that I will have something to refer to? AFAIK, no other MIPS processor has a 2nd pipeline design like the R5900's.
There was a time where GCC would generate mult/multiply-add instructions that would issue into the 2nd R5900 pipeline. It's been 15+ years since I looked at that problem, but IIRC I twiddled the old register allocator, along with the expected changes in the pipeline and constraints for the mul/mul-add insns in the mips backend to exploit the dual multiply pipes on the r5900. The key was to realize that because selection of the pipeline is static based on the registers used, you have to look at this as a register allocation problem. You might dig out the old Cygnus releases. They may provide clues, particularly on the register allocation tweak. Jeff
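To make the "register allocation problem" framing concrete, here is a minimal toy sketch in C, not code from GCC or the old Cygnus port. It assumes a hypothetical model in which each multiply occupies one accumulator pair from the point where it issues until its result is last read, and it greedily picks hi/lo or hi1/lo1 for each multiply. The names `mult_op`, `assign_pipes` and `mult_mnemonic` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a multiply writes an accumulator pair at `def`
   and the result stays live until the last read at `last_use`. */
struct mult_op {
    int def;       /* instruction index where the multiply issues */
    int last_use;  /* index of the last mflo/mfhi that reads the result */
    int pipe;      /* output: 0 = hi/lo, 1 = hi1/lo1, -1 = no pair free */
};

/* Greedy assignment over ops sorted by `def`: take the first accumulator
   pair whose previous value is dead before this multiply issues. */
void assign_pipes(struct mult_op *ops, size_t n)
{
    int busy_until[2] = { -1, -1 };  /* last_use of the value in each pair */

    for (size_t i = 0; i < n; i++) {
        ops[i].pipe = -1;
        for (int p = 0; p < 2; p++) {
            if (busy_until[p] < ops[i].def) {
                ops[i].pipe = p;
                busy_until[p] = ops[i].last_use;
                break;
            }
        }
    }
}

/* Once the pair is chosen, the mnemonic follows from it statically. */
const char *mult_mnemonic(int pipe)
{
    return pipe == 1 ? "mult1" : "mult";
}
```

Two multiplies whose results are live at the same time end up in different pairs, and therefore in different pipes; nothing in the scheduler's automaton is involved in that choice. A real port would express this through register classes and allocator preferences rather than a standalone pass like this one.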
Re: Implementing TI mode (128-bit) and the 2nd pipeline for the MIPS R5900
On 01/19/2016 04:59 AM, Woon yung Liu wrote:

> In my current attempt at adding support for the TI mode, the MMI definitions are added into a MD file for the R5900 and some functions (i.e. mips_output_move) were modified to allow certain moves for the TI mode of the R5900 target. However, while it seems like TI-mode integers can now be passed between functions and used with the MMI (within one 128-bit GPR), GCC still treats 128-bit moves as complex moves (split across two 64-bit registers); its built-in functions expect both $a0 and $a1 to be used if the first argument is a 128-bit value. To return a 128-bit value, both $v0 and $v1 are used.

You'll have to adjust FUNCTION_ARG and its counterpart for return values to describe how to pass these 128-bit values around.

> Otherwise, I believe that there are two solutions to the problem with the calling convention (but again, I have no idea which is better):
> 1. Keep the target as 64-bit. Support for MMI will be either compromised (i.e. made to assemble and split the 128-bit vectors upon entry/exit) or totally omitted. Perhaps omission would be best so that there will never be a compromise in performance.
> 2. Promote the word size of the R5900 to 128-bit. I think that SONY might have done this, as the code from their late games used lq/sq (quad-word load/store) to preserve registers. However, I think that this goes against the existing ABIs, doesn't it? Plus the MMI instruction set is proprietary and isn't used in any other MIPS.

Changing the word size to 128 bits should not be necessary.

Many ports define patterns for operations on data types that are larger than their native word mode.

You really need to add the new patterns for operating on 128-bit values to the machine description and adjust the parameter passing routines.

We did have to force the compiler to assume a 64-bit *host* datatype (long long). I don't recall the reasoning behind that.

> If I carry on with my current design, I suppose that I need to make it so that the hi1/lo1 registers are never used for other MIPS targets. I didn't find a RTL constraint that meant something like "nothing", so I made the new constraints define MD1_REGS (hi1/lo1) as their MD_REGS (hi/lo) equivalents if the target is not the R5900 (much like the DSP ACC register constraint, ka). But unlike the DSP ACC register constraint (ka), my constraints are used as alternatives alongside whatever (i.e. x for hi/lo or ka for hi/lo/acc) was originally there. Would this be acceptable, given that there will be two similar alternatives for some instructions when the target is not the R5900?

You define the registers & constraints normally. However, you make the registers conditional on the target in use, i.e. if you're not on an r5900 target, then mark those registers as fixed. That will prevent the compiler from trying to use them on things other than the r5900.

Again, you may want to find the old Cygnus releases of the r5900 toolchain. It had functional access to the second hi/lo register pair.

jeff
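The "mark those registers as fixed" advice is normally carried out in a port's TARGET_CONDITIONAL_REGISTER_USAGE hook. The following is a standalone C sketch of that idea only, under stated assumptions: the arrays are local stand-ins for GCC's `fixed_regs` and `call_used_regs`, the register numbers are made up, and `HI1_REGNUM`/`LO1_REGNUM` are hypothetical names that do not exist in the MIPS backend.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for GCC internals so the sketch compiles on its own.
   The register numbers below are hypothetical. */
enum { HI1_REGNUM = 176, LO1_REGNUM = 177, FIRST_PSEUDO_REGISTER = 178 };

char fixed_regs[FIRST_PSEUDO_REGISTER];
char call_used_regs[FIRST_PSEUDO_REGISTER];

/* Shaped like a conditional-register-usage hook: hide hi1/lo1 from the
   register allocator unless the selected target is the R5900. */
void conditional_register_usage(bool target_r5900)
{
    if (!target_r5900) {
        fixed_regs[HI1_REGNUM] = call_used_regs[HI1_REGNUM] = 1;
        fixed_regs[LO1_REGNUM] = call_used_regs[LO1_REGNUM] = 1;
    }
}
```

With the registers fixed, the extra hi1/lo1 alternatives in the patterns can stay in place for every MIPS target, since the allocator never has a register that could satisfy them.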
Re: Implementing TI mode (128-bit) and the 2nd pipeline for the MIPS R5900
On 19/01/16 14:42, Jeff Law wrote:
> On 01/19/2016 04:59 AM, Woon yung Liu wrote:
>>
>> In my current attempt at adding support for the TI mode, the MMI
>> definitions are added into a MD file for the R5900 and some functions
>> (i.e. mips_output_move) were modified to allow certain moves for the
>> TI mode of the R5900 target. However, while it seems like TI-mode
>> integers can now be passed between functions and used with the MMI
>> (within one 128-bit GPR), GCC still treats 128-bit moves as complex
>> moves (split across two 64-bit registers); its built-in functions
>> expect both $a0 and $a1 to be used if the first argument is a 128-bit
>> value. To return a 128-bit value, both $v0 and $v1 are used.
> You'll have to adjust FUNCTION_ARG and its counterpart for return values
> to describe how to pass these 128-bit values around.
>
>>
>>
>> Otherwise, I believe that there are two solutions to the problem with
>> the calling convention (but again, I have no idea which is better):
>> 1. Keep the target as 64-bit. Support for MMI will be either
>> compromised (i.e. made to assemble and split the 128-bit vectors upon
>> entry/exit) or totally omitted. Perhaps omission would be best so
>> that there will never be a compromise in performance.
>
>>
>> 2. Promote the word size of the R5900 to 128-bit. I think that SONY
>> might have done this, as the code from their late games used lq/sq
>> (quad-word load/store) to preserve registers. However, I think that
>> this goes against the existing ABIs, doesn't it? Plus the MMI
>> instruction set is proprietary and isn't used in any other MIPS.
> Changing the word size to 128 bits should not be necessary.
>
> Many ports define patterns for operations on data types that are larger
> than their native word mode.
>
> You really need to add the new patterns for operating on 128-bit values
> to the machine description and adjust the parameter passing routines.
>
> We did have to force the compiler to assume a 64bit *host* datatype
> (long long). I don't recall the reasoning behind that.
>

Probably because historically you needed CONST_DOUBLE (with VOIDmode) to handle 128-bit immediates (a pair of HOST_WIDE_INTs). It may not be necessary any more with the new wide integer types.

R.

>
>> If I carry on with my current design, I suppose that I need to make
>> it so that the hi1/lo1 registers are never used for other MIPS
>> targets. I didn't find a RTL constraint that meant something like
>> "nothing", so I made the new constraints define MD1_REGS (hi1/lo1) as
>> their MD_REGS (hi/lo) equivalents if the target is not the R5900
>> (much like the DSP ACC register constraint, ka). But unlike the DSP
>> ACC register constraint (ka), my constraints are used as alternatives
>> alongside whatever (i.e. x for hi/lo or ka for hi/lo/acc) that was
>> originally there. Would this be acceptable, given that there will be
>> two similar alternatives for some instructions when the target is not
>> the R5900?
> You define the registers & constraints normally. However, you make the
> registers conditional on the target in use. ie, if you're not on an
> r5900 target, then mark those registers as fixed. That will prevent the
> compiler from trying to use them on things other than the r5900.
>
> Again, you may want to find the old cygnus releases of the r5900
> toolchain. It had functional access to the second hi/lo register pair.
>
> jeff
>
RE: Implementing TI mode (128-bit) and the 2nd pipeline for the MIPS R5900
Jeff Law writes:
> On 01/19/2016 04:59 AM, Woon yung Liu wrote:
> >
> > In my current attempt at adding support for the TI mode, the MMI
> > definitions are added into a MD file for the R5900 and some functions
> > (i.e. mips_output_move) were modified to allow certain moves for the
> > TI mode of the R5900 target. However, while it seems like TI-mode
> > integers can now be passed between functions and used with the MMI
> > (within one 128-bit GPR), GCC still treats 128-bit moves as complex
> > moves (split across two 64-bit registers); its built-in functions
> > expect both $a0 and $a1 to be used if the first argument is a 128-bit
> > value. To return a 128-bit value, both $v0 and $v1 are used.
> You'll have to adjust FUNCTION_ARG and its counterpart for return values
> to describe how to pass these 128 bit values around.

I'm generally against modified calling conventions, especially given the number of them that MIPS already has. We opted against using new wider registers for arguments/returns in MSA, instead choosing to consider it as an optimised convention rather than the standard.

What environment are you looking to support this in? Linux, bare metal, BSD, other? There's a reasonable amount of housekeeping to consider for context switching and debug depending on the environment.

On the topic of TImode... Do you ever truly end up with TImode data with the R5900 extensions, or is it all vector types? We initially had TImode in various places for MSA and removed it all in favour of the vector modes, which made everything a lot cleaner. If there truly is TImode support then things get a little ugly, based on what I remember from untangling MSA from TImode, mainly because of the interaction with multiplies.

> > Otherwise, I believe that there are two solutions to the problem with
> > the calling convention (but again, I have no idea which is better):
> > 1. Keep the target as 64-bit. Support for MMI will be either
> > compromised (i.e. made to assemble and split the 128-bit vectors upon
> > entry/exit) or totally omitted. Perhaps omission would be best so that
> > there will never be a compromise in performance.

As above, I suggest this approach, but allow vectors to be passed using the pre-existing de facto convention and look at optimizing it later.

Matthew
RE: [Patch] MIPS FDE deletion
On Mon, 11 Jan 2016, Moore, Catherine wrote:

> > Does it mean PR target/53276 has been fixed now? What was the commit to
> > add .cfi support for the stubs?
>
> I don't know about the status of PR target/53276. The commit to add
> .cfi support for call stubs was this one:
>
> r184379 | rsandifo | 2012-02-19 08:44:54 -0800 (Sun, 19 Feb 2012) | 7 lines
>
> gcc/
>     * config/mips/mips.c (mips16_build_call_stub): Add CFI information
>     to stubs with non-sibling calls.
>
> libgcc/
>     * config/mips/mips16.S (CALL_STUB_RET): Add CFI information.

Thanks. I thought it was something recent, but this is fairly old. I saw your patch handles the `fn_stub' case among others and your test case included an `__fn_stub_foo' stub too, which is what PR target/53276 is all about, which is why I thought it may have been resolved and the existence of the PR accidentally missed.

BTW, your test case has a stub of the `fn_stub' kind (`__fn_stub_foo') and one of the `call_fp_stub' kind (`__call_stub_fp_foo'), but none of the `call_stub' kind (for `foo' it would be called `__call_stub_foo'). The latter has AFAICT been addressed by r184379. Was the omission from the test case then deliberate for some reason (why?) or just accidental?

Maciej
Re: Instruction scheduling for the R5900's 2 integer pipelines
On 01/19/2016 09:22 AM, Woon yung Liu wrote:

> Right now, I do have an old homebrew GCC v3.2.2 port to study as well, but I didn't follow everything from it because I didn't want to risk including obsolete constructs. Thanks for the information on the old Cygnus port. I'll try to scrape together a working system with it.

Look for a change from me in local-alloc.c, circa 1998. At least I think that's where I had to twiddle things.

jeff
SH runtime switchable atomics - proposed design
I've been working on the new version of runtime-selected SH atomics for musl, and I think what I've got might be appropriate for GCC's generated atomics too. I know Oleg was not very excited about doing this on the gcc side from a cost/benefit perspective, but I think my approach is actually preferable over inline atomics from a code size perspective. It uses a single "cas" function with an "SFUNC" type ABI (not standard calling convention) with the following constraints:

Inputs:
- R0: Memory address to operate on
- R1: Address of implementation function, loaded from a global
- R2: Comparison value
- R3: Value to set on success

Outputs:
- R3: Old value read, ==R2 iff cas succeeded.

Preserved: R0, R2.

Clobbered: R1, PR, T.

This call (performed from __asm__ for musl, but gcc would do it as SH "SFUNC") is highly compact/convenient for inlining because it avoids clobbering any of the argument registers that are likely to already be in use by the caller, and it preserves the important values that are likely to be reused after the cas operation.

For J2 and future J4, the function pointer just points to:

    rts
     cas.l r2,r3,@r0

and the only costs vs an inline cas.l are loading the address of the function (done in the caller; involves GOT access) and clobbering R1 and PR.

This is still a draft design and the version in musl is subject to change at any time since it's not a public API/ABI, but I think it could turn into something useful to have on the gcc side with a -matomic-model=libfunc option or similar. Other ABI considerations for gcc use would be where to store the function pointer and how to initialize it. To be reasonably efficient with FDPIC the caller needs to be responsible for loading the function pointer (and it needs to always point to code, not a function descriptor) so that the callee does not need a GOT pointer passed in.

Rich
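For readers who think in C rather than SH register conventions, the contract above can be modeled roughly as follows. This is only a sketch under assumptions: it ignores the special register ABI entirely, uses C11 atomics as a stand-in for the cas.l variant, and the names `cas_fn`, `cas_native`, `sh_cas_impl` and `fetch_add` are invented for illustration and are not musl or GCC symbols.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* C-level model of the contract: return the old value; the store
   happened iff the returned value equals `expected`. */
typedef int32_t (*cas_fn)(_Atomic int32_t *addr, int32_t expected,
                          int32_t desired);

/* Stand-in for the cas.l-based variant, written with C11 atomics. */
int32_t cas_native(_Atomic int32_t *addr, int32_t expected, int32_t desired)
{
    int32_t old = expected;
    /* On failure, `old` is overwritten with the value actually seen. */
    atomic_compare_exchange_strong(addr, &old, desired);
    return old;
}

/* Hypothetical global holding the implementation selected at startup.
   The caller loads it, mirroring "R1: loaded from a global". */
cas_fn sh_cas_impl = cas_native;

/* Typical caller: a retry loop built on the single primitive. */
int32_t fetch_add(_Atomic int32_t *addr, int32_t delta)
{
    int32_t seen, old = atomic_load(addr);
    do {
        seen = old;
        old = sh_cas_impl(addr, seen, seen + delta);
    } while (old != seen);
    return seen;
}
```

Because the primitive hands back the old value and leaves the address and comparison value untouched, the retry loop needs no reloads between attempts, which is the property the register contract is designed to keep.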
Re: [musl] SH runtime switchable atomics - proposed design
On Tue, Jan 19, 2016 at 03:28:52PM -0500, Rich Felker wrote:
> I've been working on the new version of runtime-selected SH atomics
> for musl, and I think what I've got might be appropriate for GCC's
> generated atomics too. I know Oleg was not very excited about doing
> this on the gcc side from a cost/benefit perspective, but I think my
> approach is actually preferable over inline atomics from a code size
> perspective. It uses a single "cas" function with an "SFUNC" type ABI
> (not standard calling convention) with the following constraints:
>
> Inputs:
> - R0: Memory address to operate on
> - R1: Address of implementation function, loaded from a global
> - R2: Comparison value
> - R3: Value to set on success
>
> Outputs:
> - R3: Old value read, ==R2 iff cas succeeded.
>
> Preserved: R0, R2.
>
> Clobbered: R1, PR, T.
>
> This call (performed from __asm__ for musl, but gcc would do it as SH
> "SFUNC") is highly compact/convenient for inlining because it avoids
> clobbering any of the argument registers that are likely to already be
> in use by the caller, and it preserves the important values that are
> likely to be reused after the cas operation.
>
> For J2 and future J4, the function pointer just points to:
>
>     rts
>      cas.l r2,r3,@r0
>
> and the only costs vs an inline cas.l are loading the address of the
> function (done in the caller; involves GOT access) and clobbering R1
> and PR.
>
> This is still a draft design and the version in musl is subject to
> change at any time since it's not a public API/ABI, but I think it
> could turn into something useful to have on the gcc side with a
> -matomic-model=libfunc option or similar. Other ABI considerations for
> gcc use would be where to store the function pointer and how to
> initialize it. To be reasonably efficient with FDPIC the caller needs
> to be responsible for loading the function pointer (and it needs to
> always point to code, not a function descriptor) so that the callee
> does not need a GOT pointer passed in.

Attached is my current draft of the implementations of the cas 'sfunc' for musl. Forgot to include it before.

Rich

/* Contract for all versions is same as cas.l r2,r3,@r0
 * pr and r1 are also clobbered (by jsr & r1 as temp).
 * r0,r2,r4-r15 must be preserved.
 * r3 contains result (==r2 iff cas succeeded). */

    .align 2
__sh_cas_gusa:
    mov.l r5,@-r15
    mov.l r4,@-r15
    mov r0,r4
    mova 1f,r0
    mov r15,r1
    mov #(0f-1f),r15
0:  mov.l @r4,r5
    cmp/eq r5,r2
    bf 1f
    mov.l r3,@r4
1:  mov r1,r15
    mov r5,r3
    mov r4,r0
    mov.l @r15+,r4
    rts
     mov.l @r15+,r5

__sh_cas_llsc:
    mov r0,r1
    synco
0:  movli.l @r1,r0
    cmp/eq r0,r2
    bf 1f
    mov r3,r0
    movco.l r0,@r1
    bf 0b
    mov r2,r0
1:  synco
    mov r0,r3
    rts
     mov r1,r0

__sh_cas_imask:
    mov r0,r1
    stc sr,r0
    mov.l r0,@-r15
    or #0xf0,r0
    ldc r0,sr
    mov.l @r1,r0
    cmp/eq r0,r2
    bf 1f
    mov.l r3,@r1
1:  ldc.l @r15+,sr
    mov r0,r3
    rts
     mov r1,r0

__sh_cas_cas_l:
    rts
     cas.l r2,r3,@r0
gcc-5-20160119 is now available
Snapshot gcc-5-20160119 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/5-20160119/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 5 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-5-branch revision 232591

You'll find:

  gcc-5-20160119.tar.bz2    Complete GCC

    MD5=4fd7bfbebbffc85ee8583f60bbcab476
    SHA1=120b77d0c51385058c30894918002395e3e85b73

Diffs from 5-20160112 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-5
link is updated and a message is sent to the gcc list. Please do not use
a snapshot before it has been announced that way.
Re: Source Code for Profile Guided Code Positioning
On Fri, Jan 15, 2016 at 9:51 AM, Yury Gribov wrote:
> On 01/15/2016 08:44 PM, vivek pandya wrote:
>>
>> Thanks Yury for
>> https://gcc.gnu.org/ml/gcc-patches/2011-09/msg01440.html this link.
>> It implements procedure reordering as a linker plugin.
>> I have some questions:
>> 1) Can you point me to some documentation on how to write a plugin
>> for linkers? I have not seen documentation for the structs with the
>> 'ld_' prefix (i.e. those defined in plugin-api.h).
>> 2) There is one more algorithm for Basic Block ordering with
>> execution frequency count in the PH paper. Is there any implementation
>> available for it?
>
>
> Quite frankly - I don't know (I've only learned about Google implementation
> recently).
>
> I've added Sriram to maybe comment.

Sorry for the late response.

The google/gcc_4_9 branch has the source of the function reordering linker plugin. It is available in the function_reordering_plugin directory under the top level gcc directory. The function reordering plugin constructs a callgraph and uses profile information to do a Pettis-Hansen style function reordering. This plugin does not do basic block re-ordering.

There is no documentation as such that I am aware of on writing a linker plugin. Here is a very brief overview. The linker calls the plugin's "onload" function when registering the plugin, and the plugin in turn can register two call-backs with the linker, "claim_file_hook" and "all_symbols_read_hook". "claim_file_hook" is called for each object file that the linker processes, and "all_symbols_read_hook" is called after all the symbols have been read by the linker. These are just two different interesting points in the course of a link. The plugin can also get handles to linker functions like "get_input_section_name", which it can use to process sections given their handle.

You can also check the gold linker tests for simpler plugin examples.

HTH,
Thanks
Sri

>
> -Y
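The overview above can be sketched as a skeleton. To keep it standalone, the types below are simplified stand-ins that only mirror the shape of plugin-api.h (reduced enums, a one-field input-file struct, a two-member union); a real plugin would include that header instead, and the enum values here do not match it. The `fake_register_*` functions and the `files_seen` counter are hypothetical and play the linker's side so the flow can be exercised without a linker.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for declarations that live in plugin-api.h. */
enum ld_plugin_status { LDPS_OK, LDPS_ERR };
enum ld_plugin_tag { LDPT_NULL, LDPT_REGISTER_CLAIM_FILE_HOOK,
                     LDPT_REGISTER_ALL_SYMBOLS_READ_HOOK };
struct ld_plugin_input_file { const char *name; };

typedef enum ld_plugin_status (*claim_file_handler)(
    const struct ld_plugin_input_file *file, int *claimed);
typedef enum ld_plugin_status (*all_symbols_read_handler)(void);

struct ld_plugin_tv {
    enum ld_plugin_tag tv_tag;
    union {
        enum ld_plugin_status (*tv_register_claim_file)(claim_file_handler);
        enum ld_plugin_status (*tv_register_all_symbols_read)(
            all_symbols_read_handler);
    } tv_u;
};

int files_seen;  /* hypothetical counter, for illustration only */

/* Called once per input object. Leaving *claimed at 0 tells the linker
   to keep processing the file itself. */
enum ld_plugin_status claim_file_hook(const struct ld_plugin_input_file *file,
                                      int *claimed)
{
    (void)file;
    files_seen++;
    *claimed = 0;
    return LDPS_OK;
}

/* Called after all symbols are read; a reordering plugin would compute
   and hand over its layout at this point. */
enum ld_plugin_status all_symbols_read_hook(void)
{
    return LDPS_OK;
}

/* Entry point: walk the transfer vector and register both call-backs. */
enum ld_plugin_status onload(struct ld_plugin_tv *tv)
{
    for (; tv->tv_tag != LDPT_NULL; tv++) {
        switch (tv->tv_tag) {
        case LDPT_REGISTER_CLAIM_FILE_HOOK:
            if (tv->tv_u.tv_register_claim_file(claim_file_hook) != LDPS_OK)
                return LDPS_ERR;
            break;
        case LDPT_REGISTER_ALL_SYMBOLS_READ_HOOK:
            if (tv->tv_u.tv_register_all_symbols_read(all_symbols_read_hook)
                != LDPS_OK)
                return LDPS_ERR;
            break;
        default:
            break;
        }
    }
    return LDPS_OK;
}

/* Fake linker side, present only so the sketch can run standalone. */
claim_file_handler registered_claim;
all_symbols_read_handler registered_all_read;

enum ld_plugin_status fake_register_claim(claim_file_handler h)
{
    registered_claim = h;
    return LDPS_OK;
}

enum ld_plugin_status fake_register_all_read(all_symbols_read_handler h)
{
    registered_all_read = h;
    return LDPS_OK;
}
```

In a real build the plugin is compiled as a shared object and the linker supplies the transfer vector; here a test harness would fill in a three-entry vector ending in LDPT_NULL and call onload directly.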