date:20200911

Re: A problem with one instruction multiple latencies and pipelines

2020-09-11 Thread Richard Sandiford

Segher Boessenkool  writes:
> On Thu, Sep 10, 2020 at 11:04:26AM +0100, Richard Sandiford wrote:
>> Segher Boessenkool  writes:
>> > You can also use some other attributes to classify instructions, you
>> > don't have to put it all in one "type" attribute.  This can of course be
>> > done later, at a time when it is clearer what a good design will be.
>> > Sometimes it is obvious from the start though :-)
>> >
>> > (This primarily makes the pipeline descriptions much simpler, but also
>> > custom scheduling code and the like.  If one core has two types of "A"
>> > insn, say "Aa" and "Ab", it isn't nice if all other cores now have to
>> > handle both "Aa" and "Ab" instead of just "A").
>> 
>> I guess it makes the description of other cores more verbose, but it
>> doesn't IMO make them more complicated.  It's still just one test
>> against one attribute.  And updating existing descriptions can be
>> done automatically by sed.
>
> Consider cores that do not care about the "subtype" at all: when using
> just "type", all cores have to test for "foo,foo_subtype", while with
> a separate attribute they can just test for "foo".

Right.  But that was exactly the case I was considering with the sed
comment above :-)

I don't see adding a new attribute value to a set_attr as extra
complication.  It's just adding a value to a set.  To me extra
complication is adding (ior …)s, (and …)s, etc.

> A typical example in rs6000 is the "sign_extend" attribute for load
> instructions.  All cores before power4 do not care at all about it.
> (Load and store insns get into combinatorial explosion land as well,
> as you discuss below, with "update" and "indexed" forms).
>
>> The difficulty with splitting into subattributes is that the tests in
>> the cores that *do* care then become more complicated.  E.g. you have
>> to do:
>> 
>>(ior (and (eq_attr "type" "foo")
>>  (eq_attr "foo_subtype" "foo1"))
>> (eq_attr "type" "...others.."))
>> 
>> and:
>> 
>>(ior (and (eq_attr "type" "foo")
>>  (eq_attr "foo_subtype" "!foo1"))
>> (eq_attr "type" "...others.."))
>> 
>> instead of just:
>> 
>>(eq_attr "type" "...")
>> 
>> in both cases.
>
> Yes.  It is a trade-off.
>
>> It's not too bad when it's just one subtype (like above), but it could
>> easily get out of hand.
>> 
>> Perhaps in this case there's an argument in favour of having a separate
>> attribute due to combinatorial explosion.  E.g. we have:
>> 
>> - alu_shift_imm
>> - alus_shift_imm
>> - logic_shift_imm
>> - logics_shift_imm
>> 
>> so having one attribute that describes the shift in all four cases
>> would perhaps be better than splitting each of them into two.
>
> Yeas, exactly.  And for rs6000 we *did* have many more types before, and
> very frequently one was missed (in a scheduling description usually).
> Especially common was when some new type attribute was added, not all
> existing cpu descriptions were updated correctly.  Maybe this is the
> strongest argument "for" actually :-)

The update shouldn't be done by hand.  The regularity of the syntax
means that an appropriate sed really is good enough (perhaps with
formatting fixes on top).  For extra robustness, you can replace the
old attribute value with two new ones, so that any missed case triggers
a build error.

What I'm really wary of with taking the line above is that it feeds
into the attitude that existing scheduling descriptions should become
fossilised at the point they're added: noone working on a different
core should change the declarations in the file later in case they
miss something.  I don't think that's what you're saying, but it
could feed into that general attitude.

>> Arguments in the other direction:
>> 
>> - Once we have a separate attribute, it isn't clear where the line
>>   should be drawn.
>
> Good taste ;-)
>
>>  Alternatives include:
>> 
>>   (1) keeping the new attribute specific to shift immediates only
>> 
>>   (2) also having a “register” value, and folding:
>>   alu_shift_imm, alu_shift_reg -> alu_shift
>> 
>>   (3) also having a “none” value, and doing away with the *_shift
>>   type variants altogether:
>>   alu_sreg, alu_shift_imm, alu_shift_reg -> alu_reg
>> 
>>   I think we should have a clear reason for whichever we pick,
>>   otherwise we could be creating technical debt for later.
>
> Yes.
>
> One example of what we do:
>
> ;; Is this instruction using operands[2] as shift amount, and can that be a
> ;; register?
> ;; This is used for shift insns.
> (define_attr "maybe_var_shift" "no,yes" (const_string "no"))
>
> ;; Is this instruction using a shift amount from a register?
> ;; This is used for shift insns.
> (define_attr "var_shift" "no,yes"
>   (if_then_else (and (eq_attr "type" "shift")
>  (eq_attr "maybe_var_shift" "yes"))
> (if_then_else (match_operand 2 "gpc_reg_operand")
>   (const_string "yes")
>   (const_string "no"))

Re: subreg vs vec_select

2020-09-11 Thread Richard Sandiford

Ilya Leoshkevich via Gcc  writes:
> On Wed, 2020-09-09 at 16:09 -0500, Segher Boessenkool wrote:
>> Hi Ilya,
>> 
>> On Wed, Sep 09, 2020 at 11:50:56AM +0200, Ilya Leoshkevich via Gcc
>> wrote:
>> > I have a vector pseudo containing a single 128-bit value (V1TFmode)
>> > and
>> > I need to access its last 64 bits (DFmode). Which of the two
>> > options
>> > is better?
>> > 
>> > (subreg:DF (reg:V1TF) 8)
>> > 
>> > or
>> > 
>> > (vec_select:DF (subreg:V2DF (reg:V1TF) 0) (parallel [(const_int
>> > 1)]))
>> > 
>> > If I use the first one, I run into a problem with set_noop_p (): it
>> > thinks that
>> > 
>> > (set (subreg:DF (reg:TF %f0) 8) (subreg:DF (reg:V1TF %f0) 8))
>> > 
>> > is a no-op, because it doesn't check the mode after stripping the
>> > subreg:
>> > 
>> > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/rtlanal.c;h=5ae38b79#l1616
>> > 
>> > However this is not correct, because SET_DEST is the second
>> > register in
>> > a register pair, and SET_SRC is half of a vector register that
>> > overlaps
>> > the first register in the corresponding pair. So it looks as if
>> > mode
>> > needs to be considered there.
>> 
>> Yes.
>> 
>> > This helps:
>> > 
>> > --- a/gcc/rtlanal.c
>> > +++ b/gcc/rtlanal.c
>> > @@ -1619,6 +1619,8 @@ set_noop_p (const_rtx set)
>> > return 0;
>> >src = SUBREG_REG (src);
>> >dst = SUBREG_REG (dst);
>> > +  if (GET_MODE (src) != GET_MODE (dst))
>> > +   return 0;
>> >  }
>> > 
>> > but I'm not sure whether I'm not missing something about subreg
>> > semantics in the first place.
>> 
>> You probably should just see if both modes are the same number of
>> hard
>> registers?  HARD_REGNO_NREGS.
>
> I've refined my patch as follows:
>
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -1619,6 +1619,11 @@ set_noop_p (const_rtx set)
> return 0;
>src = SUBREG_REG (src);
>dst = SUBREG_REG (dst);
> +  if (REG_P (src) && HARD_REGISTER_P (src) && REG_P (dst)
> + && HARD_REGISTER_P (dst)
> + && hard_regno_nregs (REGNO (src), GET_MODE (src))
> +!= hard_regno_nregs (REGNO (dst), GET_MODE (dst)))
> +   return 0;
>  }

I think checking the mode would be safer.  Having the same number
of registers doesn't mean that the bits are distributed across the
registers in the same way.

Out of interest, why can't the subregs in the example above get
folded down to hard registers?

Thanks,
Richard

Re: subreg vs vec_select

2020-09-11 Thread Ilya Leoshkevich via Gcc

On Fri, 2020-09-11 at 10:46 +0100, Richard Sandiford wrote:
> Ilya Leoshkevich via Gcc  writes:
> > On Wed, 2020-09-09 at 16:09 -0500, Segher Boessenkool wrote:
> > > Hi Ilya,
> > > 
> > > On Wed, Sep 09, 2020 at 11:50:56AM +0200, Ilya Leoshkevich via
> > > Gcc
> > > wrote:
> > > > I have a vector pseudo containing a single 128-bit value
> > > > (V1TFmode)
> > > > and
> > > > I need to access its last 64 bits (DFmode). Which of the two
> > > > options
> > > > is better?
> > > > 
> > > > (subreg:DF (reg:V1TF) 8)
> > > > 
> > > > or
> > > > 
> > > > (vec_select:DF (subreg:V2DF (reg:V1TF) 0) (parallel [(const_int
> > > > 1)]))
> > > > 
> > > > If I use the first one, I run into a problem with set_noop_p
> > > > (): it
> > > > thinks that
> > > > 
> > > > (set (subreg:DF (reg:TF %f0) 8) (subreg:DF (reg:V1TF %f0) 8))
> > > > 
> > > > is a no-op, because it doesn't check the mode after stripping
> > > > the
> > > > subreg:
> > > > 
> > > > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/rtlanal.c;h=5ae38b79#l1616
> > > > 
> > > > However this is not correct, because SET_DEST is the second
> > > > register in
> > > > a register pair, and SET_SRC is half of a vector register that
> > > > overlaps
> > > > the first register in the corresponding pair. So it looks as if
> > > > mode
> > > > needs to be considered there.
> > > 
> > > Yes.
> > > 
> > > > This helps:
> > > > 
> > > > --- a/gcc/rtlanal.c
> > > > +++ b/gcc/rtlanal.c
> > > > @@ -1619,6 +1619,8 @@ set_noop_p (const_rtx set)
> > > > return 0;
> > > >src = SUBREG_REG (src);
> > > >dst = SUBREG_REG (dst);
> > > > +  if (GET_MODE (src) != GET_MODE (dst))
> > > > +   return 0;
> > > >  }
> > > > 
> > > > but I'm not sure whether I'm not missing something about subreg
> > > > semantics in the first place.
> > > 
> > > You probably should just see if both modes are the same number of
> > > hard
> > > registers?  HARD_REGNO_NREGS.
> > 
> > I've refined my patch as follows:
> > 
> > --- a/gcc/rtlanal.c
> > +++ b/gcc/rtlanal.c
> > @@ -1619,6 +1619,11 @@ set_noop_p (const_rtx set)
> > return 0;
> >src = SUBREG_REG (src);
> >dst = SUBREG_REG (dst);
> > +  if (REG_P (src) && HARD_REGISTER_P (src) && REG_P (dst)
> > + && HARD_REGISTER_P (dst)
> > + && hard_regno_nregs (REGNO (src), GET_MODE (src))
> > +!= hard_regno_nregs (REGNO (dst), GET_MODE (dst)))
> > +   return 0;
> >  }
> 
> I think checking the mode would be safer.  Having the same number
> of registers doesn't mean that the bits are distributed across the
> registers in the same way.

Yeah, that's what I was trying to express with this hypothetical
machine example.  On the other hand, checking mode is too pessimistic.
E.g. if we talk about s390 GPRs, then considering

(set (subreg:SI (reg:DI %r0) 4) (subreg:SI (reg:DI %r0) 4))

a no-op is fine from my perspective.  So having a more restrictive
check might be desirable.  Is there a way to ask the backend how the
subreg bits are distributed?

> Out of interest, why can't the subregs in the example above get
> folded down to hard registers?

I think this is because the offsets are not 0.  I could imagine folding
(subreg:DF (reg:TF %f0) 8) to (reg:DF %f2) - but there must be a
backend hook for this.  Does anything like this exist?  Also, can
(subreg:DF (reg:V1TF %f0) 8) be folded at all? This is simply
the second doubleword of 128-bit %v0 vector register.

Re: subreg vs vec_select

2020-09-11 Thread Ilya Leoshkevich via Gcc

On Fri, 2020-09-11 at 12:17 +0200, Ilya Leoshkevich wrote:
> On Fri, 2020-09-11 at 10:46 +0100, Richard Sandiford wrote:
> > Ilya Leoshkevich via Gcc  writes:
> > > On Wed, 2020-09-09 at 16:09 -0500, Segher Boessenkool wrote:
> > > > Hi Ilya,
> > > > 
> > > > On Wed, Sep 09, 2020 at 11:50:56AM +0200, Ilya Leoshkevich via
> > > > Gcc
> > > > wrote:
> > > > > I have a vector pseudo containing a single 128-bit value
> > > > > (V1TFmode)
> > > > > and
> > > > > I need to access its last 64 bits (DFmode). Which of the two
> > > > > options
> > > > > is better?
> > > > > 
> > > > > (subreg:DF (reg:V1TF) 8)
> > > > > 
> > > > > or
> > > > > 
> > > > > (vec_select:DF (subreg:V2DF (reg:V1TF) 0) (parallel
> > > > > [(const_int
> > > > > 1)]))
> > > > > 
> > > > > If I use the first one, I run into a problem with set_noop_p
> > > > > (): it
> > > > > thinks that
> > > > > 
> > > > > (set (subreg:DF (reg:TF %f0) 8) (subreg:DF (reg:V1TF %f0) 8))
> > > > > 
> > > > > is a no-op, because it doesn't check the mode after stripping
> > > > > the
> > > > > subreg:
> > > > > 
> > > > > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/rtlanal.c;h=5ae38b79#l1616
> > > > > 
> > > > > However this is not correct, because SET_DEST is the second
> > > > > register in
> > > > > a register pair, and SET_SRC is half of a vector register
> > > > > that
> > > > > overlaps
> > > > > the first register in the corresponding pair. So it looks as
> > > > > if
> > > > > mode
> > > > > needs to be considered there.
> > > > 
> > > > Yes.
> > > > 
> > > > > This helps:
> > > > > 
> > > > > --- a/gcc/rtlanal.c
> > > > > +++ b/gcc/rtlanal.c
> > > > > @@ -1619,6 +1619,8 @@ set_noop_p (const_rtx set)
> > > > > return 0;
> > > > >src = SUBREG_REG (src);
> > > > >dst = SUBREG_REG (dst);
> > > > > +  if (GET_MODE (src) != GET_MODE (dst))
> > > > > +   return 0;
> > > > >  }
> > > > > 
> > > > > but I'm not sure whether I'm not missing something about
> > > > > subreg
> > > > > semantics in the first place.
> > > > 
> > > > You probably should just see if both modes are the same number
> > > > of
> > > > hard
> > > > registers?  HARD_REGNO_NREGS.
> > > 
> > > I've refined my patch as follows:
> > > 
> > > --- a/gcc/rtlanal.c
> > > +++ b/gcc/rtlanal.c
> > > @@ -1619,6 +1619,11 @@ set_noop_p (const_rtx set)
> > > return 0;
> > >src = SUBREG_REG (src);
> > >dst = SUBREG_REG (dst);
> > > +  if (REG_P (src) && HARD_REGISTER_P (src) && REG_P (dst)
> > > + && HARD_REGISTER_P (dst)
> > > + && hard_regno_nregs (REGNO (src), GET_MODE (src))
> > > +!= hard_regno_nregs (REGNO (dst), GET_MODE
> > > (dst)))
> > > +   return 0;
> > >  }
> > 
> > I think checking the mode would be safer.  Having the same number
> > of registers doesn't mean that the bits are distributed across the
> > registers in the same way.
> 
> Yeah, that's what I was trying to express with this hypothetical
> machine example.  On the other hand, checking mode is too
> pessimistic.
> E.g. if we talk about s390 GPRs, then considering
> 
> (set (subreg:SI (reg:DI %r0) 4) (subreg:SI (reg:DI %r0) 4))

Sorry, bad example: here the hard register modes actually match.
But it's probably possible to come up with something similar, where
the hard reg is accessed with different modes, but when we add subregs
on top, then we end up selecting the same bits.

> a no-op is fine from my perspective.  So having a more restrictive
> check might be desirable.  Is there a way to ask the backend how the
> subreg bits are distributed?
> 
> > Out of interest, why can't the subregs in the example above get
> > folded down to hard registers?
> 
> I think this is because the offsets are not 0.  I could imagine
> folding
> (subreg:DF (reg:TF %f0) 8) to (reg:DF %f2) - but there must be a
> backend hook for this.  Does anything like this exist?  Also, can
> (subreg:DF (reg:V1TF %f0) 8) be folded at all? This is simply
> the second doubleword of 128-bit %v0 vector register.

Re: subreg vs vec_select

2020-09-11 Thread Richard Sandiford

Ilya Leoshkevich  writes:
> On Fri, 2020-09-11 at 12:17 +0200, Ilya Leoshkevich wrote:
>> On Fri, 2020-09-11 at 10:46 +0100, Richard Sandiford wrote:
>> > Ilya Leoshkevich via Gcc  writes:
>> > > On Wed, 2020-09-09 at 16:09 -0500, Segher Boessenkool wrote:
>> > > > Hi Ilya,
>> > > > 
>> > > > On Wed, Sep 09, 2020 at 11:50:56AM +0200, Ilya Leoshkevich via
>> > > > Gcc
>> > > > wrote:
>> > > > > I have a vector pseudo containing a single 128-bit value
>> > > > > (V1TFmode)
>> > > > > and
>> > > > > I need to access its last 64 bits (DFmode). Which of the two
>> > > > > options
>> > > > > is better?
>> > > > > 
>> > > > > (subreg:DF (reg:V1TF) 8)
>> > > > > 
>> > > > > or
>> > > > > 
>> > > > > (vec_select:DF (subreg:V2DF (reg:V1TF) 0) (parallel
>> > > > > [(const_int
>> > > > > 1)]))
>> > > > > 
>> > > > > If I use the first one, I run into a problem with set_noop_p
>> > > > > (): it
>> > > > > thinks that
>> > > > > 
>> > > > > (set (subreg:DF (reg:TF %f0) 8) (subreg:DF (reg:V1TF %f0) 8))
>> > > > > 
>> > > > > is a no-op, because it doesn't check the mode after stripping
>> > > > > the
>> > > > > subreg:
>> > > > > 
>> > > > > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/rtlanal.c;h=5ae38b79#l1616
>> > > > > 
>> > > > > However this is not correct, because SET_DEST is the second
>> > > > > register in
>> > > > > a register pair, and SET_SRC is half of a vector register
>> > > > > that
>> > > > > overlaps
>> > > > > the first register in the corresponding pair. So it looks as
>> > > > > if
>> > > > > mode
>> > > > > needs to be considered there.
>> > > > 
>> > > > Yes.
>> > > > 
>> > > > > This helps:
>> > > > > 
>> > > > > --- a/gcc/rtlanal.c
>> > > > > +++ b/gcc/rtlanal.c
>> > > > > @@ -1619,6 +1619,8 @@ set_noop_p (const_rtx set)
>> > > > > return 0;
>> > > > >src = SUBREG_REG (src);
>> > > > >dst = SUBREG_REG (dst);
>> > > > > +  if (GET_MODE (src) != GET_MODE (dst))
>> > > > > +   return 0;
>> > > > >  }
>> > > > > 
>> > > > > but I'm not sure whether I'm not missing something about
>> > > > > subreg
>> > > > > semantics in the first place.
>> > > > 
>> > > > You probably should just see if both modes are the same number
>> > > > of
>> > > > hard
>> > > > registers?  HARD_REGNO_NREGS.
>> > > 
>> > > I've refined my patch as follows:
>> > > 
>> > > --- a/gcc/rtlanal.c
>> > > +++ b/gcc/rtlanal.c
>> > > @@ -1619,6 +1619,11 @@ set_noop_p (const_rtx set)
>> > > return 0;
>> > >src = SUBREG_REG (src);
>> > >dst = SUBREG_REG (dst);
>> > > +  if (REG_P (src) && HARD_REGISTER_P (src) && REG_P (dst)
>> > > + && HARD_REGISTER_P (dst)
>> > > + && hard_regno_nregs (REGNO (src), GET_MODE (src))
>> > > +!= hard_regno_nregs (REGNO (dst), GET_MODE
>> > > (dst)))
>> > > +   return 0;
>> > >  }
>> > 
>> > I think checking the mode would be safer.  Having the same number
>> > of registers doesn't mean that the bits are distributed across the
>> > registers in the same way.
>> 
>> Yeah, that's what I was trying to express with this hypothetical
>> machine example.  On the other hand, checking mode is too
>> pessimistic.
>> E.g. if we talk about s390 GPRs, then considering
>> 
>> (set (subreg:SI (reg:DI %r0) 4) (subreg:SI (reg:DI %r0) 4))
>
> Sorry, bad example: here the hard register modes actually match.
> But it's probably possible to come up with something similar, where
> the hard reg is accessed with different modes, but when we add subregs
> on top, then we end up selecting the same bits.

Wel, in general, (subreg (reg …) …) shouldn't exist for hard registers.
They should just be folded to the underlying (reg …) instead.
So in the above I'd expect to see:

  (set (reg:SI %r0) (reg:SI %r0))

instead.  I remember there were cases on e.g. e500 where the target
prevented the fold from taking place due to layout issues, but wasn't
sure whether the above FP example above was something similar.

So…

>> a no-op is fine from my perspective.  So having a more restrictive
>> check might be desirable.  Is there a way to ask the backend how the
>> subreg bits are distributed?

…if the bits are distributed evenly in an obvious way, the hard register
subreg should already have been folded to a reg.  I think subregs of
hard registers are rare enough that we should just keep it simple
and safe.

>> > Out of interest, why can't the subregs in the example above get
>> > folded down to hard registers?
>> 
>> I think this is because the offsets are not 0.  I could imagine
>> folding
>> (subreg:DF (reg:TF %f0) 8) to (reg:DF %f2) - but there must be a
>> backend hook for this.  Does anything like this exist?  Also, can
>> (subreg:DF (reg:V1TF %f0) 8) be folded at all? This is simply
>> the second doubleword of 128-bit %v0 vector register.

For multi-register values with a regular layout, simplify_subreg_regno
(via subreg_get_info) uses various target queries to work out which
hard registers are actually being accessed.  Is the subreg

Re: A problem with one instruction multiple latencies and pipelines

2020-09-11 Thread Richard Earnshaw

On 07/09/2020 07:08, Qian, Jianhua wrote:
> Hi
> 
> I'm adding a new machine model. I have a problem when writing the 
> "define_insn_reservation" for instruction scheduling.
> How to write the "define_insn_reservation" for one instruction that there are 
> different latencies and pipelines according to parameter.
> 
> For example, the ADD (shifted register) instruction in a64fx
> 
> InstructionOption LatencyPipeline
> ADD (shifted register)   = 0 1  EX* | EAG*
>= [1-4] && =LSL  1+1   (EXA + EXA) 
> | (EXB + EXB)
>  2+1   (EXA + 
> EXA) | (EXB + EXB)
> 

A shift by immediate zero isn't a shift, so should never use this RTL
pattern.  We can ignore that case.

> In aarch64.md ADD (shifted register) instruction is defined as following．
>  (define_insn "*add__"
>   [(set (match_operand:GPI 0 "register_operand" "=r")
> (plus:GPI (ASHIFT:GPI (match_operand:GPI 1 "register_operand" "r")
>   (match_operand:QI 2 "aarch64_shift_imm_" 
> "n"))
>   (match_operand:GPI 3 "register_operand" "r")))]
>   ""
>   "add\\t%0, %3, %1,  %2"
>   [(set_attr "type" "alu_shift_imm")]
> )

You might consider using a define_bypass to adjust the cost - the
matcher rule takes a producer and consumer RTL - you don't care about
the consumer, but you can use the bypass to reduce the cost if the
producer uses an immediate in the 'low latency' range.  This would avoid
having to make a load of whole-sale changes to the main parts of the
machine description.


> 
> It could not be distinguished by the type "alu_shift_imm" when writing 
> "define_insn_reservation" for ADD (shifted register). 
> What should I do?
> 
> Regards
> Qian
> 
> 
> 

R.

Re: A problem with one instruction multiple latencies and pipelines

2020-09-11 Thread Segher Boessenkool

Hi!

On Fri, Sep 11, 2020 at 08:44:54AM +0100, Richard Sandiford wrote:
> Segher Boessenkool  writes:
> > Consider cores that do not care about the "subtype" at all: when using
> > just "type", all cores have to test for "foo,foo_subtype", while with
> > a separate attribute they can just test for "foo".
> 
> Right.  But that was exactly the case I was considering with the sed
> comment above :-)
> 
> I don't see adding a new attribute value to a set_attr as extra
> complication.  It's just adding a value to a set.  To me extra
> complication is adding (ior …)s, (and …)s, etc.

Combinatorial explosion is real.

(ior)s etc. are easier to read, and importantly it is much more obvious
if any case is missed with them.

Here is the very first diff in d839f53b7dfd, where I started this for
rs6000:

 (define_insn_reservation "ppc403-load" 2
-  (and (eq_attr "type" "load,load_ext,load_ext_u,load_ext_ux,load_ux,load_u,\
-   load_l,store_c,sync")
+  (and (eq_attr "type" "load,load_l,store_c,sync")
(eq_attr "cpu" "ppc403,ppc405"))
   "iu_40x")

We started off with 77 insn types.  Now we have 63 (it was 57 before, we
added some more).

> > Yeas, exactly.  And for rs6000 we *did* have many more types before, and
> > very frequently one was missed (in a scheduling description usually).
> > Especially common was when some new type attribute was added, not all
> > existing cpu descriptions were updated correctly.  Maybe this is the
> > strongest argument "for" actually :-)
> 
> The update shouldn't be done by hand.

No, it very much had to, because the existing code missed many cases and
was inconsistent in others.

> What I'm really wary of with taking the line above is that it feeds
> into the attitude that existing scheduling descriptions should become
> fossilised at the point they're added: noone working on a different
> core should change the declarations in the file later in case they
> miss something.  I don't think that's what you're saying, but it
> could feed into that general attitude.

It is very much not what I am saying.  I *am* saying that if people
trying to improve one CPU's modelling have to edit over 20 models for
CPUs that they do not care about, mistakes happen.  Such churn is
conceptually wrong as well, of course (the new types are just the same
as the old types on most older CPUs!)

> > Yes.
> >
> > One example of what we do:
> >
> > ;; Is this instruction using operands[2] as shift amount, and can that be a
> > ;; register?
> > ;; This is used for shift insns.
> > (define_attr "maybe_var_shift" "no,yes" (const_string "no"))
> >
> > ;; Is this instruction using a shift amount from a register?
> > ;; This is used for shift insns.
> > (define_attr "var_shift" "no,yes"
> >   (if_then_else (and (eq_attr "type" "shift")
> >  (eq_attr "maybe_var_shift" "yes"))
> > (if_then_else (match_operand 2 "gpc_reg_operand")
> >   (const_string "yes")
> >   (const_string "no"))
> > (const_string "no")))
> >
> > define_insns only use maybe_var_shift.  Only the power6 and e*500* cpu
> > descriptions ever look at var_shift.  (Directly.  There is some other
> > scheduling code that looks at it too -- and all but the power6 ones seem
> > to be incorrect!  That is all a direct translation of "type-only" code
> > there was before...)
> 
> Right.  This is similar to the:
> 
>   (define_attr "shift_imm_type" "none,alu_shift_op2")
> 
> thing I suggested upthread.  I think it's better for the define_insn
> attribute to specify the operand number directly, since there might
> well be cases where the shift isn't operand 2.

Not in this case; all such insns always have it as op 2 in the machine
instruction, and RTL almost always uses the same order.

> Anyway, I think we're in violent agreement on how this works, we're just
> taking different conclusions from it.

Yeah :-)  And not even so different: I think you agree it is a
trade-off, you just see it tilt more to one side, while I see it go to
the other side often.

We still have over fifty instruction types.  We (hopefully :-) ) use the
"extra attribute" method only where that helps more than it hurts.  It
is just another tool in the toolbox.


Segher

Re: subreg vs vec_select

2020-09-11 Thread Ilya Leoshkevich via Gcc

On Fri, 2020-09-11 at 12:14 +0100, Richard Sandiford wrote:
> Ilya Leoshkevich  writes:
> > On Fri, 2020-09-11 at 12:17 +0200, Ilya Leoshkevich wrote:
> > > On Fri, 2020-09-11 at 10:46 +0100, Richard Sandiford wrote:
> > > > Ilya Leoshkevich via Gcc  writes:
> > > > > On Wed, 2020-09-09 at 16:09 -0500, Segher Boessenkool wrote:
> > > > > > Hi Ilya,
> > > > > > 
> > > > > > On Wed, Sep 09, 2020 at 11:50:56AM +0200, Ilya Leoshkevich
> > > > > > via
> > > > > > Gcc
> > > > > > wrote:
> > > > > > > I have a vector pseudo containing a single 128-bit value
> > > > > > > (V1TFmode)
> > > > > > > and
> > > > > > > I need to access its last 64 bits (DFmode). Which of the
> > > > > > > two
> > > > > > > options
> > > > > > > is better?
> > > > > > > 
> > > > > > > (subreg:DF (reg:V1TF) 8)
> > > > > > > 
> > > > > > > or
> > > > > > > 
> > > > > > > (vec_select:DF (subreg:V2DF (reg:V1TF) 0) (parallel
> > > > > > > [(const_int
> > > > > > > 1)]))
> > > > > > > 
> > > > > > > If I use the first one, I run into a problem with
> > > > > > > set_noop_p
> > > > > > > (): it
> > > > > > > thinks that
> > > > > > > 
> > > > > > > (set (subreg:DF (reg:TF %f0) 8) (subreg:DF (reg:V1TF %f0)
> > > > > > > 8))
> > > > > > > 
> > > > > > > is a no-op, because it doesn't check the mode after
> > > > > > > stripping
> > > > > > > the
> > > > > > > subreg:
> > > > > > > 
> > > > > > > https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/rtlanal.c;h=5ae38b79#l1616
> > > > > > > 
> > > > > > > However this is not correct, because SET_DEST is the
> > > > > > > second
> > > > > > > register in
> > > > > > > a register pair, and SET_SRC is half of a vector register
> > > > > > > that
> > > > > > > overlaps
> > > > > > > the first register in the corresponding pair. So it looks
> > > > > > > as
> > > > > > > if
> > > > > > > mode
> > > > > > > needs to be considered there.
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > > > This helps:
> > > > > > > 
> > > > > > > --- a/gcc/rtlanal.c
> > > > > > > +++ b/gcc/rtlanal.c
> > > > > > > @@ -1619,6 +1619,8 @@ set_noop_p (const_rtx set)
> > > > > > > return 0;
> > > > > > >src = SUBREG_REG (src);
> > > > > > >dst = SUBREG_REG (dst);
> > > > > > > +  if (GET_MODE (src) != GET_MODE (dst))
> > > > > > > +   return 0;
> > > > > > >  }
> > > > > > > 
> > > > > > > but I'm not sure whether I'm not missing something about
> > > > > > > subreg
> > > > > > > semantics in the first place.
> > > > > > 
> > > > > > You probably should just see if both modes are the same
> > > > > > number
> > > > > > of
> > > > > > hard
> > > > > > registers?  HARD_REGNO_NREGS.
> > > > > 
> > > > > I've refined my patch as follows:
> > > > > 
> > > > > --- a/gcc/rtlanal.c
> > > > > +++ b/gcc/rtlanal.c
> > > > > @@ -1619,6 +1619,11 @@ set_noop_p (const_rtx set)
> > > > > return 0;
> > > > >src = SUBREG_REG (src);
> > > > >dst = SUBREG_REG (dst);
> > > > > +  if (REG_P (src) && HARD_REGISTER_P (src) && REG_P
> > > > > (dst)
> > > > > + && HARD_REGISTER_P (dst)
> > > > > + && hard_regno_nregs (REGNO (src), GET_MODE (src))
> > > > > +!= hard_regno_nregs (REGNO (dst), GET_MODE
> > > > > (dst)))
> > > > > +   return 0;
> > > > >  }
> > > > 
> > > > I think checking the mode would be safer.  Having the same
> > > > number
> > > > of registers doesn't mean that the bits are distributed across
> > > > the
> > > > registers in the same way.
> > > 
> > > Yeah, that's what I was trying to express with this hypothetical
> > > machine example.  On the other hand, checking mode is too
> > > pessimistic.
> > > E.g. if we talk about s390 GPRs, then considering
> > > 
> > > (set (subreg:SI (reg:DI %r0) 4) (subreg:SI (reg:DI %r0) 4))
> > 
> > Sorry, bad example: here the hard register modes actually match.
> > But it's probably possible to come up with something similar, where
> > the hard reg is accessed with different modes, but when we add
> > subregs
> > on top, then we end up selecting the same bits.
> 
> Wel, in general, (subreg (reg …) …) shouldn't exist for hard
> registers.
> They should just be folded to the underlying (reg …) instead.
> So in the above I'd expect to see:
> 
>   (set (reg:SI %r0) (reg:SI %r0))
> 
> instead.  I remember there were cases on e.g. e500 where the target
> prevented the fold from taking place due to layout issues, but wasn't
> sure whether the above FP example above was something similar.
> 
> So…
> 
> > > a no-op is fine from my perspective.  So having a more
> > > restrictive
> > > check might be desirable.  Is there a way to ask the backend how
> > > the
> > > subreg bits are distributed?
> 
> …if the bits are distributed evenly in an obvious way, the hard
> register
> subreg should already have been folded to a reg.  I think subregs of
> hard registers are rare enough that we should just keep it simple
> and safe.

Ok! I'll make a proper patch out of this and post to gcc-patches then.

> 
> > > > Out o

Lowest i386 CPU Model with proper C++ atomics

2020-09-11 Thread Joel Sherrill

Hi

Over at RTEMS, we ran into a case where the C++ atomics may not be right
for one of the lower level x86 models. We will investigate whether it can
be made right but this has led to the discussion of dropping older models
and setting a new minimum model. Right now, our base is a i386 w/FPU. The
best rationale we have for selecting a new floor is GCC atomics support.

With that in mind, what's the lowest/oldest i386 CPU model we
should consider as the new base model?

Thanks.

--joel

Re: Lowest i386 CPU Model with proper C++ atomics

2020-09-11 Thread Florian Weimer

* Joel Sherrill:

> With that in mind, what's the lowest/oldest i386 CPU model we
> should consider as the new base model?

The 80486 has a CMPXCHG instruction (4-byte CAS).  Starting from CAS,
you can build the rest.  There might be some caveats about the memory
model implementation (it may not be as strongly ordered as the current
i386 implementation, I haven't checked).

The i386 does not have CAS, which is probably the problem you are
referring to.  But on non-device memory, you can certainyl fake it if
you are able to disable interrupts.

Re: Lowest i386 CPU Model with proper C++ atomics

2020-09-11 Thread Joel Sherrill

On Fri, Sep 11, 2020 at 1:07 PM Florian Weimer  wrote:

> * Joel Sherrill:
>
> > With that in mind, what's the lowest/oldest i386 CPU model we
> > should consider as the new base model?
>
> The 80486 has a CMPXCHG instruction (4-byte CAS).  Starting from CAS,
> you can build the rest.  There might be some caveats about the memory
> model implementation (it may not be as strongly ordered as the current
> i386 implementation, I haven't checked).
>

I guess that sets a baseline there if that's what would be used in gcc's
exception processing.

>
> The i386 does not have CAS, which is probably the problem you are
> referring to.  But on non-device memory, you can certainyl fake it if
> you are able to disable interrupts.
>

I don't know that we have a huge issue in making the i486 a minimum.
I was proposing a Pentium II or P6 as a baseline since that moves you
up to having a TBR and initial SMP support. But I think there are still
embedded x86 clones that I am not sure meet the P6 minimum.

--joel

Re: Lowest i386 CPU Model with proper C++ atomics

2020-09-11 Thread Florian Weimer

* Joel Sherrill:

> I don't know that we have a huge issue in making the i486 a minimum.
> I was proposing a Pentium II or P6 as a baseline since that moves you
> up to having a TBR and initial SMP support.

Sorry, what's a TBR?

> But I think there are still embedded x86 clones that I am not sure
> meet the P6 minimum.

Some AMD Geode variants do not have CMOV and have been produced until
fairly recently (if they aren't in production still).  That means that
they do not meet the Pentium Pro baseline.

Re: Lowest i386 CPU Model with proper C++ atomics

2020-09-11 Thread Joel Sherrill

On Fri, Sep 11, 2020 at 1:40 PM Florian Weimer  wrote:

> * Joel Sherrill:
>
> > I don't know that we have a huge issue in making the i486 a minimum.
> > I was proposing a Pentium II or P6 as a baseline since that moves you
> > up to having a TBR and initial SMP support.
>
> Sorry, what's a TBR?
>

Time Base Register. GCC wouldn't use. It is a cycle counter register and
lets
one avoid use of the old i8254 derived counter/timer for high resolution
times.

>
> > But I think there are still embedded x86 clones that I am not sure
> > meet the P6 minimum.
>
> Some AMD Geode variants do not have CMOV and have been produced until
> fairly recently (if they aren't in production still).  That means that
> they do not meet the Pentium Pro baseline.
>

I guess we need to experiment with i486 for uniprocessor and maybe PII for
SMP ones.

--joel

Re: Lowest i386 CPU Model with proper C++ atomics

2020-09-11 Thread Janne Blomqvist via Gcc

On Fri, Sep 11, 2020 at 6:52 PM Joel Sherrill  wrote:
>
> Hi
>
> Over at RTEMS, we ran into a case where the C++ atomics may not be right
> for one of the lower level x86 models. We will investigate whether it can
> be made right but this has led to the discussion of dropping older models
> and setting a new minimum model. Right now, our base is a i386 w/FPU. The
> best rationale we have for selecting a new floor is GCC atomics support.
>
> With that in mind, what's the lowest/oldest i386 CPU model we
> should consider as the new base model?

We sort-of discussed this issue back when the Linux kernel dropped
support for i386. See thread starting at
https://gcc.gnu.org/legacy-ml/gcc/2012-12/msg00079.html (a thread
where you participated as well)



-- 
Janne Blomqvist

Re: Lowest i386 CPU Model with proper C++ atomics

2020-09-11 Thread Joel Sherrill

On Fri, Sep 11, 2020 at 4:36 PM Janne Blomqvist 
wrote:

> On Fri, Sep 11, 2020 at 6:52 PM Joel Sherrill  wrote:
> >
> > Hi
> >
> > Over at RTEMS, we ran into a case where the C++ atomics may not be right
> > for one of the lower level x86 models. We will investigate whether it can
> > be made right but this has led to the discussion of dropping older models
> > and setting a new minimum model. Right now, our base is a i386 w/FPU. The
> > best rationale we have for selecting a new floor is GCC atomics support.
> >
> > With that in mind, what's the lowest/oldest i386 CPU model we
> > should consider as the new base model?
>
> We sort-of discussed this issue back when the Linux kernel dropped
> support for i386. See thread starting at
> https://gcc.gnu.org/legacy-ml/gcc/2012-12/msg00079.html (a thread
> where you participated as well)
>

Wow time flies! That was eight years ago and Dr Dewar was still alive. :(

I guess the embedded x86 side has started to hit the wall also. Looks
like we should move to at least the i486 has a baseline.

Thanks.

--joel

>
>
>
> --
> Janne Blomqvist
>

gcc-9-20200911 is now available

2020-09-11 Thread GCC Administrator via Gcc

Snapshot gcc-9-20200911 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/9-20200911/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 9 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch releases/gcc-9 
revision 7cfb86a2a18c9d54d1fd43f17affcd184477ecb3

You'll find:

 gcc-9-20200911.tar.xzComplete GCC

  SHA256=5a0c18494a1c2f99a7af31e11f14e98c15828b8f5a3720c13753fdbacafc9592
  SHA1=ea1d9ceb3d0f855f8574a006db1bfb6821317022

Diffs from 9-20200904 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-9
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.

Re: A problem with one instruction multiple latencies and pipelines

Re: subreg vs vec_select

Re: subreg vs vec_select

Re: subreg vs vec_select

Re: subreg vs vec_select

Re: A problem with one instruction multiple latencies and pipelines

Re: A problem with one instruction multiple latencies and pipelines

Re: subreg vs vec_select

Lowest i386 CPU Model with proper C++ atomics

Re: Lowest i386 CPU Model with proper C++ atomics

Re: Lowest i386 CPU Model with proper C++ atomics

Re: Lowest i386 CPU Model with proper C++ atomics

Re: Lowest i386 CPU Model with proper C++ atomics

Re: Lowest i386 CPU Model with proper C++ atomics

Re: Lowest i386 CPU Model with proper C++ atomics

gcc-9-20200911 is now available

16 matches

Site Navigation

Mail list logo

Footer information