moving v16sf reg with multiple sub-regs
Hi there,

I have implemented a move of a v16sf type like this, because it is held by 4 v4sf registers:

--- snip ---

(define_expand "movv16sf"
  [(set (match_operand:V16SF 0 "nonimmediate_operand" "")
        (match_operand:V16SF 1 "general_operand" ""))]
  ""
  "
  if ((reload_in_progress | reload_completed) == 0
      && !register_operand (operands[0], V16SFmode)
      && !nonmemory_operand (operands[1], V16SFmode))
    operands[1] = force_reg (V16SFmode, operands[1]);

  move_v16sf( operands );
  DONE;
  ")

--- end snip ---

and in the config's .c file:

--- snip ---

void
move_v16sf (operands)
     rtx operands[];
{
  rtx op0 = operands[0];
  rtx op1 = operands[1];
  enum rtx_code code0 = GET_CODE (operands[0]);
  enum rtx_code code1 = GET_CODE (operands[1]);
  int subreg_offset0 = 0;
  int subreg_offset1 = 0;
  enum delay_type delay = DELAY_NONE;

  if (code0 == REG)
    {
      int regno0 = REGNO (op0) + subreg_offset0;

      if (code1 == REG)
        {
          int regno1 = REGNO (op1) + subreg_offset1;

          /* Just in case, don't do anything for assigning a register
             to itself, unless we are filling a delay slot.  */
          if (regno0 == regno1 && set_nomacro == 0)
            return;

          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 0 ), gen_rtx_SUBREG( V4SFmode, op1, 0 ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 16 ), gen_rtx_SUBREG( V4SFmode, op1, 16 ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 32 ), gen_rtx_SUBREG( V4SFmode, op1, 32 ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 48 ), gen_rtx_SUBREG( V4SFmode, op1, 48 ) );
        }
      else if (code1 == MEM)
        {
          rtx src_reg;

          src_reg = copy_addr_to_reg ( XEXP (op1,0) );

          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 0 ), gen_rtx_MEM( V4SFmode, src_reg ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 16 ), gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 16 ) ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 32 ), gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 32 ) ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 48 ), gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 48 ) ) );
        }
    }
  else if (code0 == MEM)
    {
      if (code1 == REG)
        {
          rtx dest_reg;

          dest_reg = copy_addr_to_reg ( XEXP (op0,0) );

          emit_move_insn( gen_rtx_MEM( V4SFmode, dest_reg ), gen_rtx_SUBREG (V4SFmode, op1, 0 ) );
          emit_move_insn( gen_rtx_MEM( V4SFmode, plus_constant( dest_reg, 16) ), gen_rtx_SUBREG (V4SFmode, op1, 16 ) );
          emit_move_insn( gen_rtx_MEM( V4SFmode, plus_constant( dest_reg, 32) ), gen_rtx_SUBREG (V4SFmode, op1, 32 ) );
          emit_move_insn( gen_rtx_MEM( V4SFmode, plus_constant( dest_reg, 48) ), gen_rtx_SUBREG (V4SFmode, op1, 48 ) );
        }
    }
}

--- end snip ---

This works ok, but it produces inefficient code. Here is some sample source code:

--- snip ---

typedef int v4 __attribute__((mode(V4SF)));
typedef int m4 __attribute__((mode(V16SF)));

v4 vec1, vec2;
m4 frog;

int main( int argc, char* argv[] )
{
  m4 blob;

  asm( "some_instruction %0,%1,%2,%3" : "=&j" (blob): "j" (vec1), "j" (vec2), "j" (frog) );
  asm( "some_instruction2 %0,%1" : "=&j" (frog) : "j" (blob) );

  return 0;
}

--- end snip ---

where j is the register class for v4sf and v16sf types.

This produces a move of the v16sf type between the two asm instructions when it doesn't need to. Does anyone have any ideas why this move isn't eliminated?

#APP
some_instruction r10,r22,r20,r00
#NO_APP
move r00,r10
move r01,r11
move r02,r12
move r03,r13
#APP
some_instruction2 r10, r00

r10 doesn't need to be preserved (it isn't written out), but it seems to be making a copy anyway.

Worse, if "blob" is defined in global space like "frog", then it also writes out r10 to memory when it shouldn't.

Any ideas appreciated.

Regards
Re: moving v16sf reg with multiple sub-regs
Further investigation.

If I remove the define_expand for movv16sf and throw in a dummy define_insn that supports reg<->reg, mem<->reg and reg<->mem, then the redundant move is optimized away. But of course, the store, load and move all use 4 instructions each, so this produces inefficient code.

Any idea how I can get the same removal of redundant temporaries and still get the multiple instructions for each operation interspersed nicely?

Dylan
Re: moving v16sf reg with multiple sub-regs
Hi there,

Unfortunately the assembler instructions themselves don't allow the target to be the same as the source, so removing the '&' is difficult. (If I enforce the same thing without the inline asm '&', by using builtins and building the expression manually so that a new reg rtx is generated whenever the dest and source are the same, do you think it will optimize better?)

However, I don't see why it isn't eliminating the move that is generated when it realises that the temporary source is discarded. It seems to do this ok if it is just a define_insn with raw multi-line assembly, but I can't use multi-line assembly or it destroys the optimizations that occur when sub-register access is performed, i.e. if I overwrite the second v4sf in a v16sf type, gcc nicely gets rid of the move of that particular sub-register when it copies the entire v16sf around - something I was quite impressed by.

Regards

Dylan

"James E Wilson" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> Dylan Cuthbert wrote:
>> asm( "some_instruction %0,%1,%2,%3" : "=&j" (blob): "j" (vec1), "j" (vec2), "j" (frog) );
>> asm( "some_instruction2 %0,%1" : "=&j" (frog) : "j" (blob) );
>
> It is the goal of the register allocator to use as few registers as possible, which means that we will try to use the same register for input and output here. Until we get to reload, where we see the early clobber (&), and then are forced to add a copy so that the instruction has separate input and output registers.
>
> Early clobbers are bad. Don't ever use them unless you have to. Just because the instruction operates on pieces of the input does not mean & is necessary. You only add the & if the input and output operands must be in different non-overlapping registers.
>
> This is just a guess. Try compiling with -da and looking at the register assignments in the .lreg and .greg files, and also at what reload did. It is possible that there could be something else going on.
> --
> Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com
Re: moving v16sf reg with multiple sub-regs
One reason that occurred to me is that I am issuing the v16sf move as four subreg v4sf moves.

One thing I get are "variable may not be initialised" warnings:

  v16sf test;
  test = _builtin_matrix_mul( left, right );
  return test;

Is there some way I can flag the moves to say that they are moving the v16sf "whole", so it doesn't need to be initialised, and hence avoid the warning?

Dylan
Re: moving v16sf reg with multiple sub-regs
Brilliant! This got rid of the warnings, *and* got rid of the spurious move I was getting. Thanks for the advice!

The spurious move came from the move from memory to the v16sf register: because that move was done with subregs, gcc thought it still had to preserve the previous (uninitialised) value, and so it couldn't optimize the move out completely because of the parameter aliasing on those asm instructions.

For the last parameter (equiv) of emit_no_conflict_block I am putting in "gen_rtx_SET ( V16SFmode, op0, op1 )". Does this seem correct to you?

Many thanks

Dylan

"James E Wilson" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> Dylan Cuthbert wrote:
>> Is there someway I can flag the moves to say that is moving the v16sf "whole" so it doesn't need to be initialised and hence avoid the warning?
>
> See emit_no_conflict_block in optabs.c.
> --
> Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com
Re: moving v16sf reg with multiple sub-regs
Ah, ok, sorry about that, I read it as being the equivalent of the whole operation.

I'll throw op1 in there, thanks again.

Dylan

"James E Wilson" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> Dylan Cuthbert wrote:
>> For the last parameter (equiv) of emit_no_conflict_block I am putting in "gen_rtx_SET ( V16SFmode, op0, op1 )", does this seem correct to you?
>
> This is supposed to be the value of op0 after the no conflict block. So it should just be op1.
> --
> Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com
Re: moving v16sf reg with multiple sub-regs
Thanks for the info... I was worried about the aliasing problems - adjust_address fits the bill perfectly.

The information for simplify_gen_subreg is a little sparse. What does it do differently from gen_rtx_SUBREG?

Regards

Dylan

"Richard Sandiford" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> "Dylan Cuthbert" <[EMAIL PROTECTED]> writes:
>> emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 0 ), gen_rtx_MEM( V4SFmode, src_reg ) );
>> emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 16 ), gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 16 ) ) );
>> emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 32 ), gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 32 ) ) );
>> emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 48 ), gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 48 ) ) );
>
> Note that generating MEMs like this is a bad idea because it discards alias information. It's better to use functions like adjust_address instead.
>
> It's probably also better to use simplify_gen_subreg instead of gen_rtx_SUBREG.
>
> Richard
Re: moving v16sf reg with multiple sub-regs
Ok, I think I found out why simplify_gen_subreg crashes here (with gcc 3.3.3):

  if (byte % GET_MODE_SIZE (outermode)
      || byte >= GET_MODE_SIZE (innermode))
    abort ();

This check doesn't seem right to me ;-)

I'll see what's in the latest cvs for this function.

Regards

Dylan
Re: moving v16sf reg with multiple sub-regs
I tried simplify_gen_subreg but it crashes with a compiler error. Maybe because V4SF isn't really thought of as a subreg of a V16SF at the moment?

I am using gcc 3.3.3 right now, so it might just be that it works in a later version of the compiler?
Re: moving v16sf reg with multiple sub-regs
Unless in gcc-world outermode has the meaning of innermode (and vice versa)? Which, from looking at some other source, perhaps it does... :-/
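
For anyone following the outermode/innermode question, the check quoted from gcc 3.3.3 can be tried in isolation. What follows is only a standalone C model, not GCC code: the names subreg_offset_ok and report_v16sf_offsets and the two size constants are made up for illustration. It assumes the usual GCC convention that the outer mode is the mode of the SUBREG expression itself (here V4SF, 16 bytes) and the inner mode is the mode of the register being accessed (here V16SF, 64 bytes). It says nothing definite about why the real call aborted.

```c
#include <stdio.h>

/* Sizes in bytes, assumed for illustration: V4SF is 4 floats, V16SF is 16.  */
#define V4SF_SIZE  16u
#define V16SF_SIZE 64u

/* Hypothetical helper mirroring the condition quoted from gcc 3.3.3:

     if (byte % GET_MODE_SIZE (outermode)
         || byte >= GET_MODE_SIZE (innermode))
       abort ();

   Returns 1 when the offset passes the check, 0 when it would abort.  */
int
subreg_offset_ok (unsigned outer_size, unsigned inner_size, unsigned byte)
{
  return !(byte % outer_size != 0 || byte >= inner_size);
}

/* Print the result for each V4SF-sized piece of a V16SF, first with the
   sizes in the assumed order and then with outer and inner swapped.  */
void
report_v16sf_offsets (void)
{
  unsigned byte;

  for (byte = 0; byte < V16SF_SIZE; byte += V4SF_SIZE)
    printf ("byte %2u: outer=V4SF %s, outer=V16SF %s\n", byte,
            subreg_offset_ok (V4SF_SIZE, V16SF_SIZE, byte) ? "ok" : "abort",
            subreg_offset_ok (V16SF_SIZE, V4SF_SIZE, byte) ? "ok" : "abort");
}
```

Calling report_v16sf_offsets () from a small main shows that with V4SF as the outer mode all four offsets (0, 16, 32, 48) pass, while with the two modes swapped only offset 0 passes. Under that assumption the check is consistent with taking V4SF pieces out of a V16SF, so the order in which the two modes are passed is worth double-checking.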