Re: Truncate optimisation question

2013-12-04 Thread Eric Botcazou
> Combine is asking simplify-rtx.c to truncate an addition to QImode
> and simplify-rtx.c is providing a reasonable representation of that.
> It's the representation we should use when matching against .md patterns,
> for example.  The problem is that combine doesn't want to keep the
> truncation in this case, but doesn't know that yet.

I disagree, I don't find it reasonable to turn an addition in SImode into an 
addition in QImode when the machine supports the former but not the latter.
I agree that it may help in some contexts, but then the transformation should 
be restricted to these contexts.

> Right, but the only complaint I know of is about its effect on combine.
> And my point is that that complaint isn't about combine failing to combine
> instructions per se.  It's that combine is failing to remove a redundant
> operation.  With the right input, the same rtl sequence could conceivably
> be generated on a CISC target like x86_64, since it defines all the required
> patterns (SImode addition, QI->SI zero extension, SImode comparison). It
> could also occur with a sequence that starts out as a QImode addition. So
> trying to make the simplification depend on CISCness seems like papering
> over the problem.

The problem is that this changes the combinations tried by the combiner in a 
way that is adverse to most RISC targets.  Sure, we could change all the 
affected back-ends but what's the point exactly?  What do we gain here?

Look for example at comment #4 in PR rtl-optimization/58295.

> If you think the patch was wrong or if you feel the fallout is too great
> then please feel free to revert it.

I think that the fallout is too great for RISC targets, yes, so I'm trying to 
find a reasonable compromise.

-- 
Eric Botcazou


Controlling reloads of movsicc pattern

2013-12-04 Thread BELBACHIR Selim
Hi,

My target has:
- 2 register classes to store SImode (like m68k, data $D & address $A).
- moves from wide offset MEM to $D or $A (ex: mov d($A1+50),$A2 or mov 
d($A1+50),$D1)
- conditional moves from offset MEM to $D or $A, but with a restriction: 
 offset MEM conditionally moved to $A has a limited offset of 0 or 1 
(ex: mov.ifEQ d($A1,1),$A1 whereas we can still do mov.ifEQ d($A1,50),$D1)

The predicate of the movsicc pattern tells GCC that wide offset MEM is allowed, 
and the constraints describe 2 alternatives, 'wide offset MEM -> $D' and 
'restricted offset MEM -> $A':

(define_insn_and_split "movsicc_internal"
  [(set (match_operand:SI 0 "register_operand" "=a,d,m,a,d,m,a,d,m")
        (if_then_else:SI
          (match_operator 1 "prism_comparison_operator"
            [(match_operand 4 "cc_register" "") (const_int 0)])
          (match_operand:SI 2 "nonimmediate_operand" " v,m,r,0,0,0,v,m,r")
          ;; "v" constraint is for restricted offset MEM
          (match_operand:SI 3 "nonimmediate_operand" " 0,0,0,v,m,r,v,m,r")))]
;; the last 3 alternatives are split to match the other alternatives
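For context, a constraint like "v" would normally be declared with GCC's define_memory_constraint construct. The following is a hypothetical sketch only (the predicate name prism_restricted_offset_mem_p and the docstring are assumptions, not from the original mail):

```
;; Hypothetical sketch of a "v" memory constraint: memory whose offset
;; is restricted to 0 or 1, as required for conditional moves into $A.
(define_memory_constraint "v"
  "Memory operand whose offset is 0 or 1."
  (and (match_code "mem")
       (match_test "prism_restricted_offset_mem_p (op)")))
```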



I encountered (on gcc 4.7.3):
 
core_main.c:354:1: error: insn does not satisfy its constraints:
(insn 1176 1175 337 26 (set (reg:SI 5 $A5)
(if_then_else:SI (ne (reg:CC 56 $CCI)
(const_int 0 [0]))
(mem/c:SI (plus:SI (reg/f:SI 0 $A0)
(const_int 2104 [0x838])) [9 %sfp+2104 S4 A32])
(const_int 1 [0x1]))) core_main.c:211:32 158 {movsicc_internal}

Due to reload pass (core_main.c.199r.reload).


How can I tune reload or write my movsicc pattern to prevent the reload pass 
from generating a conditional move from wide offset MEM to $A registers?

   Regards,


Selim


[RFC, LRA] Repeated looping over subreg reloads.

2013-12-04 Thread Tejas Belagod


Hi,

I'm trying to relax CANNOT_CHANGE_MODE_CLASS for aarch64 to allow all mode 
changes on FP_REGS, as aarch64 does not have register packing, but I'm running 
into an LRA ICE. A test case generates an RTL subreg of the following form:


   (set (reg:DF 97) (subreg:DF (reg:V2DF 95) 8))
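For reference, a subreg like this typically originates from extracting one lane of a 128-bit vector; a minimal GNU C sketch (assuming little-endian lane numbering, so byte offset 8 corresponds to lane 1):

```c
/* Sketch only: extracting one DF lane of a V2DF value, the operation
   the subreg above represents (the lane index is an assumption).  */
typedef double v2df __attribute__ ((vector_size (16)));

double
high_lane (v2df v)
{
  return v[1];  /* roughly (subreg:DF (reg:V2DF ...) 8) on little-endian */
}
```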

LRA has to reload the subreg because the subreg is not representable as a full 
register. When LRA reloads this in lra-constraints.c:simplify_operand_subreg (), 
it seems to reload SUBREG_REG () and leave the byte offset alone.


i.e.

 (set (reg:V2DF 100) (reg:V2DF 95))
 (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

The code in lra-constraints.c is this conditional:

  /* Force a reload of the SUBREG_REG if this is a constant or PLUS or
     if there may be a problem accessing OPERAND in the outer
     mode.  */
  if ((REG_P (reg)
       && ...
      insert_move_for_subreg (insert_before ? &before : NULL,
                              insert_after ? &after : NULL,
                              reg, new_reg);
    }

What happens subsequently is that LRA keeps looping over this RTL and keeps 
reloading the SUBREG_REG() till the limit of constraint passes is reached.


 (set (reg:V2DF 100) (reg:V2DF 95))
 (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

I can't see any place where this subreg is resolved (e.g. into an equivalent 
memref) before the next iteration comes around for reloading the inputs and 
outputs of curr_insn. Or am I missing some part of the code that tries 
reloading the subreg with different alternatives or reg classes?


Thanks,
Tejas.



Re: Truncate optimisation question

2013-12-04 Thread Richard Sandiford
Eric Botcazou  writes:
>> Combine is asking simplify-rtx.c to truncate an addition to QImode
>> and simplify-rtx.c is providing a reasonable representation of that.
>> It's the representation we should use when matching against .md patterns,
>> for example.  The problem is that combine doesn't want to keep the
>> truncation in this case, but doesn't know that yet.
>
> I disagree, I don't find it reasonable to turn an addition in SImode into an 
> addition in QImode when the machine supports the former but not the latter.
> I agree that it may help in some contexts, but then the transformation should 
> be restricted to these contexts.
>
>> Right, but the only complaint I know of is about its effect on combine.
>> And my point is that that complaint isn't about combine failing to combine
>> instructions per se.  It's that combine is failing to remove a redundant
>> operation.  With the right input, the same rtl sequence could conceivably
>> be generated on a CISC target like x86_64, since it defines all the required
>> patterns (SImode addition, QI->SI zero extension, SImode comparison). It
>> could also occur with a sequence that starts out as a QImode addition. So
>> trying to make the simplification depend on CISCness seems like papering
>> over the problem.
>
> The problem is that this changes the combinations tried by the combiner in a 
> way that is adverse to most RISC targets.  Sure, we could change all the 
> affected back-ends but what's the point exactly?  What do we gain here?
>
> Look for example at comment #4 in PR rtl-optimization/58295.

The comment says that we're trying to match:

1. (set (reg:SI) (zero_extend:SI (plus:QI (mem:QI) (const_int))))
2. (set (reg:QI) (plus:QI (mem:QI) (const_int)))
3. (set (reg:QI) (plus:QI (subreg:QI) (const_int)))
4. (set (reg:CC) (compare:CC (subreg:QI) (const_int)))
5. (set (reg:CC) (compare:CC (plus:QI (mem:QI) (const_int))))
6. (set (reg:SI) (leu:SI (subreg:QI) (const_int)))
7. (set (reg:SI) (leu:SI (subreg:QI) (const_int)))
8. (set (reg:SI) (leu:SI (plus:QI ...)))

And I think that's what we should be matching in cases where the
extension isn't redundant, even on RISC targets.

The problem here isn't really about which mode is on the plus,
but whether we recognise that the extension instruction is redundant.
I.e. we start with:

(insn 9 8 10 2 (set (reg:SI 120)
(plus:SI (subreg:SI (reg:QI 118) 0)
(const_int -48 [0xffd0]))) test.c:6 -1
 (nil))
(insn 10 9 11 2 (set (reg:SI 121)
(and:SI (reg:SI 120)
(const_int 255 [0xff]))) test.c:6 -1
 (nil))
(insn 11 10 12 2 (set (reg:CC 100 cc)
(compare:CC (reg:SI 121)
(const_int 9 [0x9]))) test.c:6 -1
 (nil))

and what we want combine to do is to recognise that insn 10 is redundant
and reduce the sequence to:

(insn 9 8 10 2 (set (reg:SI 120)
(plus:SI (subreg:SI (reg:QI 118) 0)
(const_int -48 [0xffd0]))) test.c:6 -1
 (nil))
(insn 11 10 12 2 (set (reg:CC 100 cc)
(compare:CC (reg:SI 120)
(const_int 9 [0x9]))) test.c:6 -1
 (nil))

But insn 10 is redundant on all targets, not just RISC ones.
It isn't about whether the target has a QImode addition or not.
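For concreteness, source of the following shape (a sketch in the spirit of the PR, not the exact testcase) yields exactly that insn sequence: the SImode subtraction of 48, the AND with 255 from the 8-bit truncation, and the unsigned comparison against 9:

```c
/* Sketch: character-classification code like this produces an SImode
   plus, an AND with 255 (the QI->SI zero extension of insn 10), and an
   unsigned compare against 9 (insn 11).  */
int
is_digit (unsigned char c)
{
  return (unsigned char) (c - '0') <= 9;
}
```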

>> If you think the patch was wrong or if you feel the fallout is too great
>> then please feel free to revert it.
>
> I think that the fallout is too great for RISC targets, yes, so I'm trying to 
> find a reasonable compromise.

Well, I think making the simplify-rtx code conditional on the target
would be the wrong way to go.  If we really can't live with it being
unconditional then I think we should revert it.  But like I say I think
it would be better to make combine recognise the redundancy even with
the new form.  (Or as I say, longer term, not to rely on combine to
eliminate redundant extensions.)  But I don't have time to do that myself...

Thanks,
Richard


Re: [RFC] Vectorization of indexed elements

2013-12-04 Thread Vidya Praveen
Hi Richi,

Apologies for the late response. I was on vacation.

On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote:
> On Fri, 11 Oct 2013, Vidya Praveen wrote:
> 
> > On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > > > 
> > > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > > > [...]
> > > > > > > > > > I can't really insist on the single lane load.. something 
> > > > > > > > > > like:
> > > > > > > > > > 
> > > > > > > > > > vc:V4SI[0] = c
> > > > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > > > va:V4SI = vb:V4SI  vt:V4SI
> > > > > > > > > > 
> > > > > > > > > > Or is there any other way to do this?
> > > > > > > > > 
> > > > > > > > > Can you elaborate on "I can't really insist on the single 
> > > > > > > > > lane load"?
> > > > > > > > > What's the single lane load in your example? 
> > > > > > > > 
> > > > > > > > Loading just one lane of the vector like this:
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > > > > 
> > > > > > > > or 
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c[2] 
> > > > > > > > 
> > > > > > > > is what I meant by single lane load. In this example:
> > > > > > > > 
> > > > > > > > t = c[2] 
> > > > > > > > ...
> > > > > > > > vb:v4si = b[0:3] 
> > > > > > > > vc:v4si = { t, t, t, t }
> > > > > > > > va:v4si = vb:v4si  vc:v4si 
> > > > > > > > 
> > > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at 
> > > > > > > > vec_init, I cannot
> > > > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] 
> > > > > > > > (which could be 
> > > > > > > > seen as vec_select:SI (vect_t 0) ). 
> > > > > > > > 
> > > > > > > > > I'd expect the instruction
> > > > > > > > > pattern as quoted to just work (and I hope we expand an 
> > > > > > > > > uniform
> > > > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > > > > 
> > > > > > > > As much as I went through the code, this is only done using 
> > > > > > > > vect_init. It is
> > > > > > > > not expanded as vec_duplicate from, for example, 
> > > > > > > > store_constructor() of expr.c
> > > > > > > 
> > > > > > > Do you see any issues if we expand such constructor as 
> > > > > > > vec_duplicate directly 
> > > > > > > instead of going through vect_init way? 
> > > > > > 
> > > > > > Sorry, that was a bad question.
> > > > > > 
> > > > > > But here's what I would like to propose as a first step. Please 
> > > > > > tell me if this
> > > > > > is acceptable or if it makes sense:
> > > > > > 
> > > > > > - Introduce standard pattern names 
> > > > > > 
> > > > > > "vmulim4" - vector multiply with the second operand as an indexed operand
> > > > > > 
> > > > > > Example:
> > > > > > 
> > > > > > (define_insn "vmuliv4si4"
> > > > > >[set (match_operand:V4SI 0 "register_operand")
> > > > > > (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > > > >   (vec_duplicate:V4SI
> > > > > > (vec_select:SI
> > > > > >   (match_operand:V4SI 2 "register_operand")
> > > > > >   (match_operand:V4SI 3 "immediate_operand)]
> > > > > >  ...
> > > > > > )
> > > > > 
> > > > > We could factor this with providing a standard pattern name for
> > > > > 
> > > > > (define_insn "vdupi"
> > > > >   [set (match_operand: 0 "register_operand")
> > > > >(vec_duplicate:
> > > > >   (vec_select:
> > > > >  (match_operand: 1 "register_operand")
> > > > >  (match_operand:SI 2 "immediate_operand]
> > > > 
> > > > This is good. I did think about this but then I thought of avoiding the 
> > > > need
> > > > for combiner patterns :-) 
> > > > 
> > > > But do you find the lane specific mov pattern I proposed, acceptable? 
> > > 
> > > The specific mul pattern?  As said, consider factoring to vdupi to
> > > avoid an explosion in required special optabs.
> > > 
> > > > > (you use V4SI for the immediate?  
> > > > 
> > > > Sorry typo again!! It should've been SI.
> > > > 
> > > > > Ideally vdupi has another custom
> > > > > mode for the vector index).
> > > > > 
> > > > > Note that this factored pattern is already available as 
> > > > > vec_perm_const!
> > > > > It is simply (vec_perm_const:V4SI   
> > > > > ).
> > > > > 
> > > > > Which means that on the GIMPLE level we should try to combine
> > > > > 
> > > > > el_4 = BIT_FIELD_REF ;
> > > > > v_5 = { el_4, el_4, ... };
> > > > 
> > > > I don't think we reach this state at all for the scenarios in 
> > > > discussion.
> > > > what we generally have is:
> > > > 
> > > >  el_4 = MEM_REF < array + index*size >
> > > >  v_5 = { el_4, ... }
> > > > 
> > > > 

Re: Controlling reloads of movsicc pattern

2013-12-04 Thread Jeff Law

On 12/04/13 03:22, BELBACHIR Selim wrote:

Hi,

My target has :
- 2 registers class to store SImode (like m68k, data $D & address $A).
- moves from wide offset MEM to $D or $A (ex: mov d($A1+50),$A2 or mov 
d($A1+50),$D1)
- conditional moves from offset MEM to $D or $A but with a restriction :
  offset MEM conditionally moved to $A has a limited offset of 0 or 1 
(ex: mov.ifEQ d($A1,1),$A1 whereas we can still do mov.ifEQ d($A1,50),$D1)

The predicate of the movsicc pattern tells GCC that wide offset MEM is allowed, 
and the constraints describe 2 alternatives, 'wide offset MEM -> $D' and 
'restricted offset MEM -> $A':

(define_insn_and_split "movsicc_internal"
  [(set (match_operand:SI 0 "register_operand" "=a,d,m,a,d,m,a,d,m")
        (if_then_else:SI
          (match_operator 1 "prism_comparison_operator"
            [(match_operand 4 "cc_register" "") (const_int 0)])
          (match_operand:SI 2 "nonimmediate_operand" " v,m,r,0,0,0,v,m,r")
          ;; "v" constraint is for restricted offset MEM
          (match_operand:SI 3 "nonimmediate_operand" " 0,0,0,v,m,r,v,m,r")))]
;; the last 3 alternatives are split to match the other alternatives



I encountered (on gcc 4.7.3):

core_main.c:354:1: error: insn does not satisfy its constraints:
(insn 1176 1175 337 26 (set (reg:SI 5 $A5)
 (if_then_else:SI (ne (reg:CC 56 $CCI)
 (const_int 0 [0]))
 (mem/c:SI (plus:SI (reg/f:SI 0 $A0)
 (const_int 2104 [0x838])) [9 %sfp+2104 S4 A32])
 (const_int 1 [0x1]))) core_main.c:211:32 158 {movsicc_internal}

Due to reload pass (core_main.c.199r.reload).


How can I tune reload or write my movsicc pattern to prevent the reload pass 
from generating a conditional move from wide offset MEM to $A registers?
If at all possible, I would recommend switching to LRA.  There's an 
up-front cost, but it's definitely the direction all ports should be 
heading.  Avoiding reload is, umm, good.


jeff



Re: [RFC] Vectorization of indexed elements

2013-12-04 Thread Vidya Praveen
Hi Jakub,

Apologies for the late response.

On Fri, Oct 11, 2013 at 04:05:24PM +0100, Jakub Jelinek wrote:
> On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote:
> > Here's a compilable example:
> > 
> > void 
> > foo (int *__restrict__ a,
> >  int *__restrict__ b,
> >  int *__restrict__ c)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 8; i++)
> > a[i] = b[i] * c[2];
> > }
> > 
> > This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> > of target instructions that can take a vector register as second argument 
> > but
> > use only one element (by using the same value for all the lanes) of the 
> > vector register.
> > 
> > Eg. mul , , [index]
> > mla , , [index] // multiply and add
> > 
> > But for a loop like the one in the C example given, I will have to load
> > c[2] into one element of the vector register (leaving the remaining lanes
> > unused) rather than duplicating it. This is why I was proposing to load
> > just one element into a vector register (what I meant by "lane-specific
> > load"). The benefit of doing this is that we avoid explicit duplication;
> > however, such a simplification can only be done where such support is
> > available - the reason why I was thinking in terms of an optional standard
> > pattern name. Another benefit is that we will also be able to support
> > scalars in the expression, as in the following example:
> > 
> > void
> > foo (int *__restrict__ a,
> >  int *__restrict__ b,
> >  int c)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 8; i++)
> > a[i] = b[i] * c;
> > }
> 
> So just during combine let the broadcast operation be combined with the
> arithmetics?  

Yes, I can do that. But I would always want it to be possible to recognize this
and load directly into the indexed vector register from memory.


> Intel AVX512 ISA has similar feature, not sure what exactly
> they are doing for this. 

Thanks. I'll try to go through the code to understand.

> That said, the broadcast is likely going to be
> hoisted before the loop, and in that case is it really cheaper to have
> it unbroadcasted in a vector register rather than to broadcast it before the
> loop and just use there?

Could you explain what you mean by unbroadcasted? The constructor needs to be
expanded in one way or another, doesn't it? I thought expanding to vec_duplicate
when the values are uniform is the most efficient approach when vec_duplicate is
supported by the target. If you meant that each element of the vector is loaded
separately, I am wondering how I can combine such an operation with the
arithmetic operation.
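For reference, the hoisted-broadcast shape being discussed can be sketched in GNU C vector extensions (illustrative only; foo_vec is a stand-in name, and the real decision belongs to the vectorizer):

```c
/* Sketch: the broadcast of the scalar c is loop-invariant, so it is
   materialized once before the loop rather than once per iteration.  */
typedef int v4si __attribute__ ((vector_size (16)));

void
foo_vec (int *__restrict__ a, int *__restrict__ b, int c)
{
  v4si vc = { c, c, c, c };                    /* broadcast, hoisted */
  for (int i = 0; i < 8; i += 4)
    {
      v4si vb, vr;
      __builtin_memcpy (&vb, b + i, sizeof vb);  /* load 4 lanes of b */
      vr = vb * vc;                              /* vector multiply */
      __builtin_memcpy (a + i, &vr, sizeof vr);  /* store 4 lanes of a */
    }
}
```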

Thanks
VP.





Re: [RFC, LRA] Repeated looping over subreg reloads.

2013-12-04 Thread Vladimir Makarov

On 12/4/2013, 6:15 AM, Tejas Belagod wrote:


Hi,

I'm trying to relax CANNOT_CHANGE_MODE_CLASS for aarch64 to allow all
mode changes on FP_REGS as aarch64 does not have register-packing, but
I'm running into an LRA ICE. A test case generates an RTL subreg of the
following form

(set (reg:DF 97) (subreg:DF (reg:V2DF 95) 8))

LRA has to reload the subreg because the subreg is not representable as
a full register. When LRA reloads this in
lra-constraints.c:simplify_operand_subreg (), it seems to reload
SUBREG_REG() and leave the byte offset alone.

i.e.

  (set (reg:V2DF 100) (reg:V2DF 95))
  (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

The code in lra-constraints.c is this conditional:

   /* Force a reload of the SUBREG_REG if this is a constant or PLUS or
      if there may be a problem accessing OPERAND in the outer
      mode.  */
   if ((REG_P (reg)
        && ...
       insert_move_for_subreg (insert_before ? &before : NULL,
                               insert_after ? &after : NULL,
                               reg, new_reg);
     }

What happens subsequently is that LRA keeps looping over this RTL and
keeps reloading the SUBREG_REG() till the limit of constraint passes is
reached.

  (set (reg:V2DF 100) (reg:V2DF 95))
  (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

I can't see any place where this subreg is resolved (e.g. into an equivalent
memref) before the next iteration comes around for reloading the inputs
and outputs of curr_insn. Or am I missing some part of the code that tries
reloading the subreg with different alternatives or reg classes?



I guess this behaviour is wrong.  We could spill the V2DF pseudo or put 
it into a register of another class, but that is not implemented.  This code is 
actually a modified version of the one in the reload pass.  We could implement 
alternative strategies and a check for potential loops (such code exists 
in process_alt_operands).


Could you send me the macro change and the test?  I'll look at it and 
figure out what we can do.




Re: [rust-dev] Rust front-end to GCC

2013-12-04 Thread Brian Anderson

On 12/03/2013 09:22 AM, Philip Herron wrote:

Hey all

Some of you may have noticed the gccrs branch on the git mirror. At 
PyCon IE 2013 I gave a talk on my Python front-end pet project, and I 
heard about Rust from a few people; I had never really looked at it 
before then, but I've kind of been hooked since.


So to learn the language I've been writing this front-end to GCC. It's 
only really a month or so of on-and-off work in between my day job. 
Currently it compiles a lot of Rust already with fairly little effort 
on my side; GCC is doing loads of the heavy lifting.


Currently it compiles most of the basic stuff, such as a struct, an impl 
block, a while loop, functions, expressions, calling methods, passing 
arguments, etc. I'm currently focusing on getting the typing working 
correctly to support & and ~, and looking at how templates might work, 
as well as needing to implement break and return.


There is still a lot of work, but I would really like to share it and 
see what people think. Personally I think Rust will target GCC very 
well and be a good addition (if / when it works). I really want to try 
and give back to this community, which has been very good to me in 
learning over the last few years with GSoC.




This is very exciting work. I'm looking forward to following your progress.


C++ std headers and malloc, realloc poisoning

2013-12-04 Thread Oleg Endo
Hello,

Earlier this year the following was committed:

2013-06-20  Oleg Endo  
Jason Merrill  

* system.h: Include  as well as .

... so that things like  could be included after including
system.h.
A few days ago I tried building an SH cross-GCC on OS X 10.9 with the
latest Xcode (clang) tools and its libc++ std lib.  Some of the libc++
headers use malloc, realloc etc., which are poisoned in system.h.  In
this particular case the problem is triggered by the inclusion of
 in sh.c, but there are more headers which show the same
problem (e.g. ).

Is the malloc, realloc poisoning actually still useful/helpful?  After
all, it can easily be circumvented by doing
  "new char[my_size]" ...

A simple fix is to include C++ std headers before including system.h,
which works for .c/.cc files, but might become problematic if things
like  are included in headers in the future.

Anyway, just wanted to report my findings regarding this issue.

Cheers,
Oleg





Re: C++ std headers and malloc, realloc poisoning

2013-12-04 Thread Jason Merrill

On 12/04/2013 03:21 PM, Oleg Endo wrote:

Some days ago I've tried building an SH cross-GCC on OSX 10.9 with the
latest XCode (clang) tools and its libc++ std lib.  Some of the libc++
headers use malloc, realloc etc, which are poisoned in system.h.


Well, presumably we are poisoning them because we don't want people to 
use them.  I don't remember why that was, and system.h doesn't give a 
rationale, so I'm not sure if actual uses in the C++ library are 
problematic.


Jason.



Re: C++ std headers and malloc, realloc poisoning

2013-12-04 Thread Jakub Jelinek
On Wed, Dec 04, 2013 at 03:57:43PM -0500, Jason Merrill wrote:
> On 12/04/2013 03:21 PM, Oleg Endo wrote:
> >Some days ago I've tried building an SH cross-GCC on OSX 10.9 with the
> >latest XCode (clang) tools and its libc++ std lib.  Some of the libc++
> >headers use malloc, realloc etc, which are poisoned in system.h.
> 
> Well, presumably we are poisoning them because we don't want people
> to use them.  I don't remember why that was, and system.h doesn't
> give a rationale, so I'm not sure if actual uses in the C++ library
> are problematic.

I think the most important reason is that we want to handle out-of-memory
cases consistently, so instead of malloc etc. we want users to use xmalloc
etc., which guarantee a non-NULL returned value, or a fatal error and never
returning.  For operator new that is solvable through std::set_new_handler,
I guess, but for malloc we really don't want people to have to deal with
checking NULL return values from those everywhere.
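The contract Jakub describes can be sketched as follows (xmalloc_sketch is a stand-in name; GCC's actual xmalloc lives in libiberty and this is not its implementation):

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch of the xmalloc contract: either return a usable pointer or
   terminate with a fatal error -- callers never check for NULL.  */
static void *
xmalloc_sketch (size_t size)
{
  void *p = malloc (size ? size : 1);  /* malloc (0) may return NULL */
  if (p == NULL)
    {
      fprintf (stderr, "fatal error: out of memory\n");
      exit (1);
    }
  return p;
}
```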

Jakub


Re: Dependency confusion in sched-deps

2013-12-04 Thread Maxim Kuvyrkov
On 8/11/2013, at 1:48 am, Paulo Matos  wrote:

> Hello,
> 
> I am slightly unsure if the confusion is in the dependencies or it's my 
> confusion.
> 
> I have tracked this strange behaviour which only occurs when we need to flush 
> pending instructions due to the pending list becoming too large (gcc 4.8, 
> haven't tried with trunk).
> 
> I have two stores: 
> 85: st zr, [r12] # zr is the zero register
> 90: st zr, [r18]
> 
> While analysing dependencies for `st zr, [r12]`, we notice that pending list 
> is too large in sched_analyze_1 and call flush_pending_lists (deps, insn, 
> false, true).
> 
> This in turn causes the last_pending_memory_flush to be set to:
> (insn_list:REG_DEP_TRUE 85 (nil))
> 
> When insn 90 is analyzed next, it skips the flushing bit since the pending 
> lists had just been flushed and enters the else bit where it does:
> add_dependence_list (insn, deps->last_pending_memory_flush, 1,
>  REG_DEP_ANTI, true);
> 
> This adds the dependency: 90 has an anti-dependency to 85.
> I think this should be a true dependency (write after write). It even says so 
> in the list of last_pending_memory_flush, however add_dependence_list 
> function ignored this and uses the dep_type passed: REG_DEP_ANTI.
> 
> Is anti the correct dependence? Why?

Output dependency is the right type (write after write).  Anti dependency is 
write after read, and true dependency is read after write.

Dependency type plays a role in estimating costs and latencies between 
instructions (which affects performance), but using a wrong or imprecise 
dependency type does not affect correctness.  A dependency flush is a 
force-majeure occurrence during compilation, and developers tend not to spend 
too much time coding the best possible handling for these [hopefully] rare 
occurrences.

Anti dependency is a good guess for the dependency type between two memory 
instructions.  In the above particular case it is wrong and, I imagine, 
causes a performance problem for you.  You can add better handling of this 
situation by remembering whether last_pending_memory_flush is a memory read 
or a memory write, and then use that to select the correct dependency type 
for insn 90: output, anti or true.
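The three dependency kinds can be illustrated on a tiny example (illustrative only, register rather than memory references for brevity):

```c
/* Sketch of the three kinds of dependencies between accesses:
   true (read-after-write), anti (write-after-read),
   output (write-after-write).  */
static int a;
static int last_read;

void
deps_example (void)
{
  a = 1;          /* write A1 */
  last_read = a;  /* true dependency on A1 (read after write)       */
  a = 2;          /* anti dependency on the read (write after read),
                     output dependency on A1 (write after write)    */
}
```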

Let me know whether you want to pursue this and I can help with advice and 
patch review.

Thanks, 

--
Maxim Kuvyrkov
www.kugelworks.com




Re: m68k optimisations?

2013-12-04 Thread Maxim Kuvyrkov
On 9/11/2013, at 12:08 am, Fredrik Olsson  wrote:

> I have this simple functions:
> int sum_vec(int c, ...) {
>va_list argptr;
>va_start(argptr, c);
>int sum = 0;
>while (c--) {
>int x = va_arg(argptr, int);
>sum += x;
>}
>va_end(argptr);
>return sum;
> }
> 
> 
> When compiling with "-fomit-frame-pointer -Os -march=68000 -c -S
> -mshort" I get this assembly (I have manually added comments with
> clock cycles per instruction and a total for a count of 0, 8 and n>0):
>.even
>.globl _sum_vec
> _sum_vec:
>lea (6,%sp),%a0 | 8
>move.w 4(%sp),%d1   | 12
>clr.w %d0   | 4
>jra .L1 | 12
> .L2:
>add.w (%a0)+,%d0| 8
> .L1:
>dbra %d1,.L2| 16,12
>rts | 16
> | c==0: 8+12+4+12+12+16=64
> | c==8: 8+12+4+12+(16+8)*8+12+16=256
> | c==n: =64+24n
> 
> When instead compiling with "-fomit-frame-pointer -O3 -march=68000 -c
> -S -mshort" I expect to get more aggressive optimisation than -Os, or
> at least just as performant, but instead I get this:
>.even
>.globl _sum_vec
> _sum_vec:
>move.w 4(%sp),%d0   | 12
>jeq .L2 | 12,8
>lea (6,%sp),%a0 | 8
>subq.w #1,%d0   | 4
>and.l #65535,%d0| 16
>add.l %d0,%d0   | 8
>lea 8(%sp,%d0.l),%a1| 16
>clr.w %d0   | 4
> .L1:
>add.w (%a0)+,%d0| 8
>cmp.l %a0,%a1   | 8
>jne .L1 | 12|8
>rts | 16
> .L2:
>clr.w %d0   | 4
>rts | 16
> | c==0: 12+12+4+16=44
> | c==8: 12+8+8+4+16+8+16+4+(8+8+12)*4-4+16=316
> | c==n: =88+28n
> 
> The count==0 case is better. I can see what optimisation has been
> tried for the loop, but it just not working since both the ini for the
> loop and the loop itself becomes more costly.
> 
> Being a GCC beginner I would like a few pointers as to how I should go
> about to fix this?

You investigate such problems by comparing intermediate debug dumps of two 
compilation scenarios; by the assembly time it is almost impossible to guess 
where the problem is coming from.  Add -fdump-tree-all and -fdump-rtl-all to 
the compilation flags and find which optimization pass makes the wrong 
decision.  Then you trace that optimization pass or file a bug report in hopes 
that someone (optimization maintainer) will look at it.

Read through GCC wiki for information on debugging and troubleshooting GCC:
- http://gcc.gnu.org/wiki/GettingStarted
- http://gcc.gnu.org/wiki/FAQ
- http://gcc.gnu.org/wiki/

Thanks,

--
Maxim Kuvyrkov
www.kugelworks.com