Confused about code/comment in tree.c:build2

2020-01-30 Thread Bin.Cheng
Hi,
In tree.c:build2 there is the following code/comment:

  if ((code == MINUS_EXPR || code == PLUS_EXPR || code == MULT_EXPR)
      && arg0 && arg1 && tt && POINTER_TYPE_P (tt)
      /* When sizetype precision doesn't match that of pointers
         we need to be able to build explicit extensions or truncations
         of the offset argument.  */
      && TYPE_PRECISION (sizetype) == TYPE_PRECISION (tt))
    gcc_assert (TREE_CODE (arg0) == INTEGER_CST
                && TREE_CODE (arg1) == INTEGER_CST);

The comment says a specific condition needs to be satisfied when sizetype
precision doesn't match that of pointers, yet the code checks that the
precisions are equal ("==")?  Any explanation?

Thanks,
bin


How should I check if a type can be contextually converted to another one in c++FE

2020-02-09 Thread Bin.Cheng
Hi,
As per the subject: how should I check whether a type can be contextually
converted to bool in the C++ FE?
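
For reference, here is the kind of conversion I mean (my own illustration
of the C++ language rule, not GCC source): a type with an explicit
operator bool is contextually, but not implicitly, convertible to bool:

// Illustration only: "contextual conversion to bool".
struct S {
  explicit operator bool () const { return true; }
};

int f (S s)
{
  if (s)                         // OK: the condition contextually converts s
    return 1;
  // bool b = s;                 // ill-formed: the operator is explicit
  return static_cast<bool> (s);  // OK: explicit conversion
}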

Thanks,
bin


Re: Live on Exit renaming.

2015-07-05 Thread Bin.Cheng
On Mon, Jul 6, 2015 at 6:02 AM, Steven Bosscher  wrote:
> On Sat, Jul 4, 2015 at 3:45 PM, Ajit Kumar Agarwal wrote:
>> I am not sure why the above optimization is not implemented in GCC.
>
> -fsplit-ivs-in-unroller

And things might have changed.  Given that GCC does IVO on gimple but
unrolling on RTL, there is an inconsistency between the two optimizers,
since IVO takes register pressure of IVs into consideration and assumes
each IV will take a single register.  At least for some cases, splitting
the live range of IVs results in bad code.  See PR29256 for more
information.  As described in the comment, I am actually going to do some
experiments disabling such transformation to see what happens.
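
For context, a rough source-level sketch (my own illustration, not GCC
output) of what -fsplit-ivs-in-unroller does when unrolling by 2: instead
of one serial chain i -> i+1 -> i+2 -> ..., each copy of the body gets its
own IV value computed from the iteration-start value, breaking the long
dependence chain at the cost of an extra register:

// Conceptual view of a loop unrolled by 2 with split IVs: i0 and i1 are
// both computed from i at the start of the iteration, so the two bodies
// do not depend on each other through the IV.
void scale (int *a, int n)
{
  for (int i = 0; i + 1 < n; i += 2)
    {
      int i0 = i;      // IV value for the first unrolled body
      int i1 = i + 1;  // independent IV value for the second body
      a[i0] *= 2;
      a[i1] *= 2;
    }
  // (remainder iteration omitted for brevity)
}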

Thanks,
bin
>
> Ciao!
> Steven


Re: Live on Exit renaming.

2015-07-05 Thread Bin.Cheng
On Mon, Jul 6, 2015 at 12:02 PM, Ajit Kumar Agarwal
 wrote:
>
>
> -Original Message-
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: Monday, July 06, 2015 7:04 AM
> To: Steven Bosscher
> Cc: Ajit Kumar Agarwal; l...@redhat.com; Richard Biener; gcc@gcc.gnu.org; 
> Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: Live on Exit renaming.
>
> On Mon, Jul 6, 2015 at 6:02 AM, Steven Bosscher  wrote:
>> On Sat, Jul 4, 2015 at 3:45 PM, Ajit Kumar Agarwal wrote:
>>> I am not sure why the above optimization is not implemented in GCC.
>>
>> -fsplit-ivs-in-unroller
>
>>> And things might have changed.  Given that GCC does IVO on gimple but
>>> unrolling on RTL, there is an inconsistency between the two optimizers,
>>> since IVO takes register pressure of IVs into consideration and assumes
>>> each IV will take a single register.  At least for some cases, splitting
>>> the live range of IVs results in bad code.  See PR29256 for more
>>> information.  As described in the comment, I am actually going to do some
>>> experiments disabling such transformation to see what happens.
>
> The above optimization is implemented as a part of the unroller in gimple.
> There is an unroller pass in RTL which does not have support for this
As far as I understand, -fsplit-ivs-in-unroller is a transformation in
the RTL unroller.

Thanks,
bin
> optimization.  Shouldn't the fsplit-ivs-in-unroller optimization be
> implemented in the unroller pass of RTL?  I am looking at it from an
> implementation perspective for implementing the fsplit-ivs-in-unroller
> optimization in the RTL unroller pass.
>
> Thanks & Regards
> Ajit
>
> Thanks,
> bin
>>
>> Ciao!
>> Steven


Re: Live on Exit renaming.

2015-07-05 Thread Bin.Cheng
On Mon, Jul 6, 2015 at 1:16 PM, Ajit Kumar Agarwal
 wrote:
>
>
> -Original Message-
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: Monday, July 06, 2015 10:26 AM
> To: Ajit Kumar Agarwal
> Cc: Steven Bosscher; l...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod 
> Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: Live on Exit renaming.
>
> On Mon, Jul 6, 2015 at 12:02 PM, Ajit Kumar Agarwal 
>  wrote:
>>
>>
>> -Original Message-
>> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
>> Sent: Monday, July 06, 2015 7:04 AM
>> To: Steven Bosscher
>> Cc: Ajit Kumar Agarwal; l...@redhat.com; Richard Biener;
>> gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli
>> Hunsigida; Nagaraju Mekala
>> Subject: Re: Live on Exit renaming.
>>
>> On Mon, Jul 6, 2015 at 6:02 AM, Steven Bosscher  
>> wrote:
>>> On Sat, Jul 4, 2015 at 3:45 PM, Ajit Kumar Agarwal wrote:
>>>> I am not sure why the above optimization is not implemented in GCC.
>>>
>>> -fsplit-ivs-in-unroller
>>
>>>> And things might have changed.  Given that GCC does IVO on gimple but
>>>> unrolling on RTL, there is an inconsistency between the two optimizers,
>>>> since IVO takes register pressure of IVs into consideration and assumes
>>>> each IV will take a single register.  At least for some cases, splitting
>>>> the live range of IVs results in bad code.  See PR29256 for more
>>>> information.  As described in the comment, I am actually going to do some
>>>> experiments disabling such transformation to see what happens.
>>
>> The above optimization is implemented as a part of the unroller in gimple.
>> There is an unroller pass in RTL which does not have support for this
>>> As far as I understand, -fsplit-ivs-in-unroller is a transformation in
>>> the RTL unroller.
>
> My mistake.  Yes, you are right.  The fsplit-ivs-in-unroller is a
> transformation in the RTL unroller.
> IVO on gimple doesn't take unrolling into consideration and assumes a
> single register is assigned per IV candidate.  My thinking is that
> splitting IVs at RTL with the unroller removes the long dependence
> chains, which enables overlapping iterations and better register
> allocation, and there is a chance of moving independent code that gets
> exposed by split-ivs-in-unroller.
>
> You have mentioned that splitting of IV candidates results in bad code.
> I could see only the positive end of this optimization.
> Could you please elaborate on the negative end of the
> fsplit-ivs-in-unroller optimization, as you have mentioned that it
> results in bad code in some cases.
I pointed to PR29256 in my previous message.  I also saw such examples
in different benchmarks, and the situation is even worse on targets
supporting auto-increment addressing modes.

Thanks,
bin
>
> Thanks & Regards
> Ajit
>
> Thanks,
> bin
>> optimization.  Shouldn't the fsplit-ivs-in-unroller optimization be
>> implemented in the unroller pass of RTL?  I am looking at it from an
>> implementation perspective for implementing the fsplit-ivs-in-unroller
>> optimization in the RTL unroller pass.
>>
>> Thanks & Regards
>> Ajit
>>
>> Thanks,
>> bin
>>>
>>> Ciao!
>>> Steven


Re: Does GCC generate LDRD/STRD (Register) forms?

2015-07-06 Thread Bin.Cheng
On Tue, Jul 7, 2015 at 10:05 AM, Anmol Paralkar (anmparal)
 wrote:
> Hello,
>
> Does GCC generate LDRD/STRD (Register) forms [A8.8.74/A8.8.211 per ARMv7-A
> & ARMv7-R ARM]?
>
> Based on various attempts to write code to get GCC to generate a sample
> form, and subsequently inspecting the code I see in
> config/arm/arm.c/output_move_double () & arm.md [GCC 4.9.2], I think that
> these register based forms of LDRD/STRD are
> not generated, but I thought it might be a good idea to ask on the list,
> just in case.
Register-based LDRD is harder than the immediate version.  ARM doesn't
support a [base + reg + offset] addressing mode, so the address
computation of the second memory reference is scattered both inside and
outside the memory reference.  To identify such opportunities, one needs
to trace the registers in the address expression of the memory access
instruction and do some kind of value computation and re-association.
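
A hypothetical example (mine, not from the original mail) of the access
pair in question: two adjacent word loads through a register index.  The
second address is (base + i*4) + 4; with no [base + reg + offset] mode,
the "+ 4" ends up in a separate add or a different base register, so the
LDRD opportunity is not visible in a single address expression.

// Illustration only: a candidate for an LDRD (register) pair.
long long load_pair (const int *a, int i)
{
  int lo = a[i];
  int hi = a[i + 1];
  return ((long long) hi << 32) | (unsigned int) lo;
}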

Thanks,
bin
>
> Thanks,
> Anmol.
>


Question about always executed info computed in tree-ssa-loop-im.c

2015-07-07 Thread Bin.Cheng
Hi,
Function fill_always_executed_in_1 computes basic blocks' always-executed
information, and it has the code and comment below:

  /* In a loop that is always entered we may proceed anyway.
 But record that we entered it and stop once we leave it.  */
  inn_loop = bb->loop_father;

Then in the following iterations, it breaks out of the walk if a basic
block not belonging to the inner loop is encountered.  This means basic
blocks after the inner loop won't have always-executed information
computed, even if they dominate the original loop's latch.
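
A sketch of the situation I mean (my own example, not from the pass): the
"(*x)++" block comes after the inner loop yet dominates the outer loop's
latch, so it runs on every outer iteration; but the walk described above
stops once it leaves the inner loop, so that block is never marked.

// Illustration only.
void g (int *a, int n, int m, int *x)
{
  for (int i = 0; i < n; i++)
    {
      for (int j = 0; j < m; j++)
        a[j] += i;
      (*x)++;  // after the inner loop, still dominates the outer latch
    }
}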

Am I missing something?  Why is that?

Thanks,
bin


Re: Question about always executed info computed in tree-ssa-loop-im.c

2015-07-08 Thread Bin.Cheng
On Wed, Jul 8, 2015 at 5:51 PM, Richard Biener
 wrote:
> On Wed, Jul 8, 2015 at 8:52 AM, Bin.Cheng  wrote:
>> Hi,
>> Function fill_always_executed_in_1 computes basic blocks' always
>> executed information, and it has below code and comment:
>>
>>   /* In a loop that is always entered we may proceed anyway.
>>  But record that we entered it and stop once we leave it.  */
>>   inn_loop = bb->loop_father;
>>
>> Then in following iterations, it breaks the loop if basic block not
>> belonging to the inner loop is encountered.  This means basic blocks
>> after inner loop won't have always executed information computed, even
>> they dominates the original loop's latch.
>>
>> Am I missing something?  Why is that?
>
> To improve here it would need to verify that all exits of the inner loop
> exit to the same BB of the outer loop and that no exit skips any blocks
> in the outer loop.  Like for
But we are walking the dominator tree anyway.  Won't dominating the latch
be enough?

Thanks,
bin
>
>   for (;;)
> {
>   for (;;) { if (x) goto skip; if (y) break }
>   foo();
> skip:
> }
>
> so it is just a simple and conservative algorithm it seems.  It's also
> quadratic in the number of BBs of the outermost loop.
>
>> Thanks,
>> bin


Re: Question about always executed info computed in tree-ssa-loop-im.c

2015-07-08 Thread Bin.Cheng
On Wed, Jul 8, 2015 at 5:58 PM, Bin.Cheng  wrote:
> On Wed, Jul 8, 2015 at 5:51 PM, Richard Biener
>  wrote:
>> On Wed, Jul 8, 2015 at 8:52 AM, Bin.Cheng  wrote:
>>> Hi,
>>> Function fill_always_executed_in_1 computes basic blocks' always
>>> executed information, and it has below code and comment:
>>>
>>>   /* In a loop that is always entered we may proceed anyway.
>>>  But record that we entered it and stop once we leave it.  */
>>>   inn_loop = bb->loop_father;
>>>
>>> Then in following iterations, it breaks the loop if basic block not
>>> belonging to the inner loop is encountered.  This means basic blocks
>>> after inner loop won't have always executed information computed, even
>>> they dominates the original loop's latch.
>>>
>>> Am I missing something?  Why is that?
>>
>> To improve here it would need to verify that all exits of the inner loop
>> exit to the same BB of the outer loop and that no exit skips any blocks
>> in the outer loop.  Like for
> But we are walking the dominator tree anyway.  Won't dominating the latch
> be enough?
>
> Thanks,
> bin
>>
>>   for (;;)
>> {
>>   for (;;) { if (x) goto skip; if (y) break }
>>   foo();
>> skip:
>> }
>>
>> so it is just a simple and conservative algorithm it seems.  It's also
>> quadratic in the number of BBs of the outermost loop.
So what we need to do is check whether the inner loop's exits go to a
basic block not belonging to the outer loop?

Thanks,
bin


Re: %fs and %gs segments on x86/x86-64

2015-07-13 Thread Bin.Cheng
On Thu, Jul 9, 2015 at 8:02 PM, Armin Rigo  wrote:
> Hi all,
>
> Here is an updated patch (attached) for __seg_fs and __seg_gs:
>
> * added a target hook "default_pointer_address_modes" to avoid
> disabling a few gcc optimizations which, according to my reading of
> the documentation, should continue to work even in the presence of
> multiple address spaces as long as they all use the same mode for
> pointers.
>
> * account for the extra byte in "%gs:(...)" addresses.
>
> * added one test case (better than none!) using "scan-assembler".  If
> people agree that this is the style of test that we need here, then I
> could add more of them.
>
> The diff is against trunk.  The tests don't all pass; the failures
> really seem unrelated, but I guess I should grab the same revision
> without the patch, compile it, try to run all the tests on the same
> machine, and compare the list of failures... it just takes a serious
> amount of time to do so...
>
> I also reported the bug I got previously
> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66768) and it seems to
> occur already in other targets with address spaces.
I will work out a patch for this issue.

Thanks,
bin
>
>
> A bientôt,
>
> Armin.


Re: Bin Cheng as Loop Induction Variable Optimizations maintainer

2015-08-02 Thread Bin.Cheng
On Sat, Aug 1, 2015 at 5:03 AM, Jeff Law  wrote:
> I am pleased to announce that the GCC Steering Committee has appointed Bin
> Cheng as the IVopts maintainer.
>
> Please join me in congratulating Bin on his new role.
>
> Bin, please update your entry in the MAINTAINERS file.  I also believe you
Done.

Thanks,
bin
> have some patches to self-approve :-)
>
> Thanks,
>
> Jeff


Re: Awareness of register pressure on strength reduction of induction variables.

2015-09-05 Thread Bin.Cheng
On Wed, Sep 2, 2015 at 8:52 PM, Richard Biener
 wrote:
> On Tue, Sep 1, 2015 at 7:57 PM, Ajit Kumar Agarwal
>  wrote:
>> All;
>>
>> Global code motion is an important optimization that has an impact on
>> register spills and fetch, and thus it takes into account the increase
>> or decrease of register pressure.
>>
>> Strength reduction is an important optimization that has an impact on
>> register pressure.  The strength reduction optimizations don't take
>> register pressure into account, especially strength reduction of the
>> induction variables inside loops.  Loops are the bottleneck for
>> register pressure, and any spill inside a loop caused by strength
>> reduction of induction variables degrades the performance of the loop.
>>
>> Thus strength reduction in general, and especially strength reduction
>> of induction variables, should take register pressure into account
>> based on live range info.
>>
>> I don't think we take register pressure into account for the strength
>> reduction optimization in general, and especially not for strength
>> reduction based on induction variables.
>
> We do loop strength reduction as part of induction variable
> optimization and I am not aware that IVO takes
> existing register pressure into account (it does limit the number of
> IVs it creates IIRC).
IIUC, it takes part of the existing pressure into account, such as
variables defined outside of the loop and used inside it.  It doesn't
take other existing pressure into account, such as variables live out via
the loop's exit edge, or variables live through the loop.  It collects
variable use information while scanning for IVs, rather than from a
proper liveness computation.
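
A hypothetical example of the kind of pressure that is missed: 'g' below
is live through the whole loop (defined before it, used after it, never
used inside it), so it occupies a register across the loop without ever
showing up in an IV use.

// Illustration only.
int f (int *a, int n, int g)
{
  for (int i = 0; i < n; i++)
    a[i] = i;
  return g;  // g is live through the loop
}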

Thanks,
bin
>
> Richard.
>
>> Thanks & Regards
>> Ajit


How to generate jump_table_data in rtl optimizers

2015-11-08 Thread Bin.Cheng
Hi,
I used the code snippet below to generate jump_table_data:

// set up label refs
start_sequence ();
emit_jump_table_data (gen_rtx_ADDR_DIFF_VEC (CASE_VECTOR_MODE,
                                             base_label_ref,
                                             label_refs...));
insns = get_insns ();
end_sequence ();
split_edge_and_insert (edge, insns);

But later on GCC exited with the error message below:

xxx.c:15:1: error: unrecognizable insn:
 }
 ^
(insn 187 109 113 3 (jump_table_data 185 0 0 (addr_diff_vec:DI
(label_ref:DI 184)
 [
(label_ref:DI 178)
(label_ref:DI 179)
(label_ref:DI 180)
(label_ref:DI 181)
(label_ref:DI 182)
(label_ref:DI 183)
(label_ref:DI 184)
]
(const_int 0 [0])
(const_int 0 [0]))) -1
 (nil))
xxx.c:15:1: internal compiler error: in extract_insn, at recog.c:2286

It seems what GCC expects is as below:

(jump_table_data 185 0 0 (addr_diff_vec:DI (label_ref:DI 184)
 [
(label_ref:DI 178)
(label_ref:DI 179)
(label_ref:DI 180)
(label_ref:DI 181)
(label_ref:DI 182)
(label_ref:DI 183)
(label_ref:DI 184)
]
(const_int 0 [0])
(const_int 0 [0])))

The outer insn structure is created in emit_insn_after and make_insn_raw.

I looked into stmt.c:emit_case_dispatch_table; it seems GCC uses
record_insns there, which only inserts the insn into the list and does
not create the outer insn structure.

Since record_insns/record_insns_nobb are deprecated according to the
comments, my question is: how should jump_table_data be generated then?
Is jump_table_data not supported by emit_insn_*?

Thanks,
bin


Re: How to generate jump_table_data in rtl optimizers

2015-11-08 Thread Bin.Cheng
On Mon, Nov 9, 2015 at 2:20 PM, Bin.Cheng  wrote:
> Hi,
> I used the code snippet below to generate jump_table_data:
>
> // set up label refs
> start_sequence ();
> emit_jump_table_data (gen_rtx_ADDR_DIFF_VEC (CASE_VECTOR_MODE,
>                                              base_label_ref,
>                                              label_refs...));
> insns = get_insns ();
> end_sequence ();
> split_edge_and_insert (edge, insns);
>
> But later on GCC exited with the error message below:
>
> xxx.c:15:1: error: unrecognizable insn:
>  }
>  ^
> (insn 187 109 113 3 (jump_table_data 185 0 0 (addr_diff_vec:DI
> (label_ref:DI 184)
>  [
> (label_ref:DI 178)
> (label_ref:DI 179)
> (label_ref:DI 180)
> (label_ref:DI 181)
> (label_ref:DI 182)
> (label_ref:DI 183)
> (label_ref:DI 184)
> ]
> (const_int 0 [0])
> (const_int 0 [0]))) -1
>  (nil))
> xxx.c:15:1: internal compiler error: in extract_insn, at recog.c:2286
>
> It seems what GCC expects is as below:
>
> (jump_table_data 185 0 0 (addr_diff_vec:DI (label_ref:DI 184)
>  [
> (label_ref:DI 178)
> (label_ref:DI 179)
> (label_ref:DI 180)
> (label_ref:DI 181)
> (label_ref:DI 182)
> (label_ref:DI 183)
> (label_ref:DI 184)
> ]
> (const_int 0 [0])
> (const_int 0 [0])))
>
> The outer insn structure is created in emit_insn_after and make_insn_raw.
>
> I looked into stmt.c:emit_case_dispatch_table; it seems GCC uses
> record_insns there, which only inserts the insn into the list and does
> not create the outer insn structure.
>
> Since record_insns/record_insns_nobb are deprecated according to the
> comments, my question is: how should jump_table_data be generated then?
> Is jump_table_data not supported by emit_insn_*?

It seems we need to handle JUMP_TABLE_DATA in emit_pattern_after_noloc
explicitly.  Anyway, jump_table_data is used as an "insn".

Thanks,
bin


Re: Question about PR 48814 and ivopts and post-increment

2015-12-03 Thread Bin.Cheng
On Wed, Dec 2, 2015 at 5:11 AM, Steve Ellcey  wrote:
>
> I have a question involving ivopts and PR 48814, which was a fix for
> the post increment operation.  Prior to the fix for PR 48814, MIPS
> would generate this loop for strcmp (C code from glibc):
>
> $L4:
> lbu $3,0($4)
> lbu $2,0($5)
> addiu   $4,$4,1
> beq $3,$0,$L7
> addiu   $5,$5,1# This is a branch delay slot
> beq $3,$2,$L4
> subu$2,$3,$2   # This is a branch delay slot (only used after 
> loop)
>
>
> With the current top-of-tree we now generate:
>
> addiu   $4,$4,1
> $L8:
> lbu $3,-1($4)
> addiu   $5,$5,1
> beq $3,$0,$L7
> lbu $2,-1($5)  # This is a branch delay slot
> beq $3,$2,$L8
> addiu   $4,$4,1# This is a branch delay slot
>
> subu$2,$3,$2   # Done only once now after exiting loop.
>
> The main problem with the new loop is that the beq comparing $2 and $3
> is right before the load of $2 so there can be a delay due to the time
> that the load takes.  The ideal code would probably be:
>
> addiu   $4,$4,1
> $L8:
> lbu $3,-1($4)
> lbu $2,0($5)  # This is a branch delay slot
> beq $3,$0,$L7
> addiu   $5,$5,1
> beq $3,$2,$L8
> addiu   $4,$4,1# This is a branch delay slot
>
> subu$2,$3,$2   # Done only once now after exiting loop.
>
> Where we load $2 earlier (using a 0 offset instead of a -1 offset) and
> then do the increment of $5 after using it in the load.  The problem
> is that this isn't something that can just be done in the instruction
> scheduler because we are changing one of the instructions (to modify the
> offset) in addition to rearranging them and I don't think the instruction
> scheduler supports that.
Hmm, I think Bernd introduced the sched_flag !DONT_BREAK_DEPENDENCIES to
resolve dependences by modifying address expressions.  I think this is
the same problem; what's needed is to model the dependence using that
framework.  Maybe the delay slot is special here?

>
> It looks like it is the ivopts code that decided to increment the registers
> first and use the -1 offsets in the loads after instead of using 0 offsets
> and then incrementing the offsets after the loads but I can't figure out
> how or why ivopts made that decision.
>
> Does anyone have any ideas on how I could 'fix' GCC to make it generate
> the ideal code?  Is there some way to do it in the instruction scheduler?
> Is there some way to modify ivopts to fix this by modifying the cost
It's likely IVO just picks the first candidate when it runs into a
tie.  Could you please post preprocessed source code so that I can
have a look?  I am not familiar with glibc.  Thanks.

> analysis somehow?  Could I (partially) undo the fix for PR 48814?
> According to the final comment in that bugzilla report the change is
> really only needed for C11 and that the change does degrade the optimizer
> so could we go back to the old behaviour for C89/C99?  The code in ivopts
I saw this change cause code size regressions on arm embedded processors.

Thanks,
bin

> has changed enough since the patch was applied I couldn't immediately see
> how to do that in the ToT sources.
>
> Steve Ellcey
> sell...@imgtec.com


Re: Question about PR 48814 and ivopts and post-increment

2015-12-03 Thread Bin.Cheng
On Fri, Dec 4, 2015 at 10:48 AM, Bin.Cheng  wrote:
> On Wed, Dec 2, 2015 at 5:11 AM, Steve Ellcey  wrote:
>>
>> I have a question involving ivopts and PR 48814, which was a fix for
>> the post increment operation.  Prior to the fix for PR 48814, MIPS
>> would generate this loop for strcmp (C code from glibc):
>>
>> $L4:
>> lbu $3,0($4)
>> lbu $2,0($5)
>> addiu   $4,$4,1
>> beq $3,$0,$L7
>> addiu   $5,$5,1# This is a branch delay slot
>> beq $3,$2,$L4
>> subu$2,$3,$2   # This is a branch delay slot (only used after 
>> loop)
>>
>>
>> With the current top-of-tree we now generate:
>>
>> addiu   $4,$4,1
>> $L8:
>> lbu $3,-1($4)
>> addiu   $5,$5,1
>> beq $3,$0,$L7
>> lbu $2,-1($5)  # This is a branch delay slot
>> beq $3,$2,$L8
>> addiu   $4,$4,1# This is a branch delay slot
>>
>> subu$2,$3,$2   # Done only once now after exiting loop.
>>
>> The main problem with the new loop is that the beq comparing $2 and $3
>> is right before the load of $2 so there can be a delay due to the time
>> that the load takes.  The ideal code would probably be:
>>
>> addiu   $4,$4,1
>> $L8:
>> lbu $3,-1($4)
>> lbu $2,0($5)  # This is a branch delay slot
>> beq $3,$0,$L7
>> addiu   $5,$5,1
>> beq $3,$2,$L8
>> addiu   $4,$4,1# This is a branch delay slot
>>
>> subu$2,$3,$2   # Done only once now after exiting loop.
>>
>> Where we load $2 earlier (using a 0 offset instead of a -1 offset) and
>> then do the increment of $5 after using it in the load.  The problem
>> is that this isn't something that can just be done in the instruction
>> scheduler because we are changing one of the instructions (to modify the
>> offset) in addition to rearranging them and I don't think the instruction
>> scheduler supports that.
> Hmm, I think Bernd introduced the sched_flag !DONT_BREAK_DEPENDENCIES to
> resolve dependences by modifying address expressions.  I think this is
> the same problem; what's needed is to model the dependence using that
> framework.  Maybe the delay slot is special here?
>
>>
>> It looks like it is the ivopts code that decided to increment the registers
>> first and use the -1 offsets in the loads after instead of using 0 offsets
>> and then incrementing the offsets after the loads but I can't figure out
>> how or why ivopts made that decision.
>>
>> Does anyone have any ideas on how I could 'fix' GCC to make it generate
>> the ideal code?  Is there some way to do it in the instruction scheduler?
>> Is there some way to modify ivopts to fix this by modifying the cost
> It's likely IVO just picks the first candidate when it runs into a
> tie.  Could you please post preprocessed source code so that I can
> have a look?  I am not familiar with glibc.  Thanks.
Oh, I saw the example in another thread of yours.

Thanks,
bin


Re: Question about PR 48814 and ivopts and post-increment

2015-12-04 Thread Bin.Cheng
On Fri, Dec 4, 2015 at 11:00 AM, Bin.Cheng  wrote:
> On Fri, Dec 4, 2015 at 10:48 AM, Bin.Cheng  wrote:
>> On Wed, Dec 2, 2015 at 5:11 AM, Steve Ellcey  wrote:
>>>
>>> I have a question involving ivopts and PR 48814, which was a fix for
>>> the post increment operation.  Prior to the fix for PR 48814, MIPS
>>> would generate this loop for strcmp (C code from glibc):
>>>
>>> $L4:
>>> lbu $3,0($4)
>>> lbu $2,0($5)
>>> addiu   $4,$4,1
>>> beq $3,$0,$L7
>>> addiu   $5,$5,1# This is a branch delay slot
>>> beq $3,$2,$L4
>>> subu$2,$3,$2   # This is a branch delay slot (only used after 
>>> loop)
>>>
>>>
>>> With the current top-of-tree we now generate:
>>>
>>> addiu   $4,$4,1
>>> $L8:
>>> lbu $3,-1($4)
>>> addiu   $5,$5,1
>>> beq $3,$0,$L7
>>> lbu $2,-1($5)  # This is a branch delay slot
>>> beq $3,$2,$L8
>>> addiu   $4,$4,1# This is a branch delay slot
>>>
>>> subu$2,$3,$2   # Done only once now after exiting loop.
>>>
>>> The main problem with the new loop is that the beq comparing $2 and $3
>>> is right before the load of $2 so there can be a delay due to the time
>>> that the load takes.  The ideal code would probably be:
>>>
>>> addiu   $4,$4,1
>>> $L8:
>>> lbu $3,-1($4)
>>> lbu $2,0($5)  # This is a branch delay slot
>>> beq $3,$0,$L7
>>> addiu   $5,$5,1
>>> beq $3,$2,$L8
>>> addiu   $4,$4,1# This is a branch delay slot
>>>
>>> subu$2,$3,$2   # Done only once now after exiting loop.
>>>
>>> Where we load $2 earlier (using a 0 offset instead of a -1 offset) and
>>> then do the increment of $5 after using it in the load.  The problem
>>> is that this isn't something that can just be done in the instruction
>>> scheduler because we are changing one of the instructions (to modify the
>>> offset) in addition to rearranging them and I don't think the instruction
>>> scheduler supports that.
>> Hmm, I think Bernd introduced the sched_flag !DONT_BREAK_DEPENDENCIES to
>> resolve dependences by modifying address expressions.  I think this is
>> the same problem; what's needed is to model the dependence using that
>> framework.  Maybe the delay slot is special here?
>>
>>>
>>> It looks like it is the ivopts code that decided to increment the registers
>>> first and use the -1 offsets in the loads after instead of using 0 offsets
>>> and then incrementing the offsets after the loads but I can't figure out
>>> how or why ivopts made that decision.
>>>
>>> Does anyone have any ideas on how I could 'fix' GCC to make it generate
>>> the ideal code?  Is there some way to do it in the instruction scheduler?
>>> Is there some way to modify ivopts to fix this by modifying the cost
>> It's likely IVO just picks the first candidate when it runs into a
>> tie.  Could you please post preprocessed source code so that I can
>> have a look?  I am not familiar with glibc.  Thanks.
> Oh, I saw the example in another thread of yours.
>
Dump before IVO is as below:

  :
  # s1_1 = PHI 
  # s2_2 = PHI 
  s1_6 = s1_1 + 1;
  c1_8 = *s1_1;
  s2_9 = s2_2 + 1;
  c2_10 = *s2_2;
  if (c1_8 == 0)
goto ;
  else
goto ;

And the iv candidates are as:
candidate 1 (important)
  var_before ivtmp.6
  var_after ivtmp.6
  incremented before exit test
  type unsigned int
  base (unsigned int) p1_4(D)
  step 1
  base object (void *) p1_4(D)
candidate 2 (important)
  original biv
  type const unsigned char *
  base (const unsigned char *) p1_4(D)
  step 1
  base object (void *) p1_4(D)
candidate 3 (important)
  var_before ivtmp.7
  var_after ivtmp.7
  incremented before exit test
  type unsigned int
  base (unsigned int) p2_5(D)
  step 1
  base object (void *) p2_5(D)
candidate 4 (important)
  original biv
  type const unsigned char *
  base (const unsigned char *) p2_5(D)
  step 1
  base object (void *) p2_5(D)

Generally GCC would choose the normal candidates {1, 3} and insert the
increment before the exit condition, which is what's expected in this
case.  But when there are applicable original candidates {2, 4}, GCC
prefers those in order to achieve better debuggability.  Also, as I
suspected, [reg] and [reg-1] have the same address cost on mips, which is
why GCC makes the current decision.

Thanks,
bin


Re: Test case mis-categorized as UNSUPPORTED?

2016-01-07 Thread Bin.Cheng
On Thu, Jan 7, 2016 at 2:21 PM, Kyrill Tkachov
 wrote:
> Hi Bin,
>
>
> On 07/01/16 14:15, Bin.Cheng wrote:
>>
>> Hi,
>> Below test is supposed to be compiled and run, but we failed to link
>> the binary with tiny memory model.
>>
>> spawn
>> /data/work/build-aarch64_be-none-elf/obj/gcc2/gcc/testsuite/g++14/../../xg++
>> -B/data/work/build-aarch64_be-none-elf/obj/gcc2/gcc/testsuite/g++14/../../
>> /data/work/src/gcc/gcc/testsuite/g++.dg/torture/pr67600.C
>> -fno-diagnostics-show-caret -fdiagnostics-color=never -nostdinc++
>>
>> -I/data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/libstdc++-v3/include/aarch64_be-none-elf
>>
>> -I/data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/libstdc++-v3/include
>> -I/data/work/src/gcc/libstdc++-v3/libsupc++
>> -I/data/work/src/gcc/libstdc++-v3/include/backward
>> -I/data/work/src/gcc/libstdc++-v3/testsuite/util -fmessage-length=0
>> -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects
>> -specs=aem-validation.specs
>>
>> -L/data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/./libstdc++-v3/src/.libs
>>
>> -B/data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/./libstdc++-v3/src/.libs
>> -lm -mcmodel=tiny -o ./pr67600.exe
>> /tmp/ccd32hub.ltrans0.ltrans.o: In function `main':
>> :(.text.startup+0x68): relocation truncated to fit:
>> R_AARCH64_ADR_PREL_LO21 against symbol `std::cout' defined in
>> .bss._ZSt4cout section in
>>
>> /data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/./libstdc++-v3/src/.libs/libstdc++.a(globals_io.o)
>> collect2: error: ld returned 1 exit status
>> compiler exited with status 1
>> output is:
>> /tmp/ccd32hub.ltrans0.ltrans.o: In function `main':
>> :(.text.startup+0x68): relocation truncated to fit:
>> R_AARCH64_ADR_PREL_LO21 against symbol `std::cout' defined in
>> .bss._ZSt4cout section in
>>
>> /data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/./libstdc++-v3/src/.libs/libstdc++.a(globals_io.o)
>> collect2: error: ld returned 1 exit status
>>
>> UNSUPPORTED: g++.dg/torture/pr67600.C   -O2 -flto -fuse-linker-plugin
>> -fno-fat-lto-objects : memory full
>>
>> In my understanding, a dg-do run test case should be marked as
>> FAIL&UNRESOLVED if the binary can't be generated.  But here it's
>> categorized as an UNSUPPORTED test.  This could be misleading
>> sometimes, since unsupported tests can be ignored.
>
>
> The problem is that many of these libstdc++ tests got too big for the tiny
> memory model
> and the whole testsuite got very noisy due to these relocation truncation
> errors.
> That's why we try to mark them as unsupported. I tried doing it in the past
> and
> Szabolcs fixed it properly with
> https://gcc.gnu.org/ml/libstdc++/2015-10/msg00037.html

Good to know it's intended behavior.

Thanks,
bin


Question about how to fix PR69052

2016-01-25 Thread Bin.Cheng
Hi,
In revision 229402, I added logic in loop-invariant.c to check whether a
loop invariant can be forward propagated to memory references in the same
loop, and to increase the loop invariant's cost if the answer is no.  I
think this is a reasonable change.  Some targets like AArch64 can benefit
from it because many sp/fp related expressions can be hoisted out of
loops.
Unfortunately, it runs into a problem on x86+PIC.  Given the GOT-related loads below:

(insn 67 66 69 9 (set (reg/f:SI 147)
(mem/u/c:SI (plus:SI (reg:SI 87)
(const:SI (unspec:SI [
(symbol_ref:SI ("out")  )
] UNSPEC_GOT))) [3  S4 A8])) pr69052.c:39 86
{*movsi_internal}
 (nil))
(insn 69 67 70 9 (set (reg/f:SI 149)
(plus:SI (reg:SI 87)
(const:SI (unspec:SI [
(symbol_ref:SI ("ind") [flags 0x2]  )
] UNSPEC_GOTOFF pr69052.c:39 212 {*leasi}
 (expr_list:REG_DEAD (reg:SI 87)
(nil)))
(insn 70 69 71 9 (set (reg:SI 150)
(mem/u:SI (plus:SI (mult:SI (reg/v:SI 90 [ k ])
(const_int 4 [0x4]))
(reg/f:SI 149)) [1 ind S4 A32])) pr69052.c:39 86
{*movsi_internal}
 (expr_list:REG_DEAD (reg/f:SI 149)
(nil)))
(insn 71 70 72 9 (set (mem:SI (plus:SI (mult:SI (reg:SI 150)
(const_int 4 [0x4]))
(reg/f:SI 147)) [1 out S4 A32])
(reg/v:SI 89 [ result ])) pr69052.c:39 86 {*movsi_internal}
 (expr_list:REG_DEAD (reg:SI 150)
(expr_list:REG_DEAD (reg/f:SI 147)
(expr_list:REG_DEAD (reg/v:SI 89 [ result ])
(nil)

insn 67/69 can be merged into insn 71/70 respectively in the combine
pass.  The problem is that the transformation done by combine is complex
and hard to replicate in other passes like loop-invariant.

Yuri Rumyantsev suggested we may add a hook to handle GOT-related
instruction propagation specially so it won't be hoisted out.  Does a
hook at this stage sound feasible?

Also, it would be great if we could abstract an interface out of combine
that other passes could use to check whether instructions can be merged
or not.  I know this would be even harder than emulating combine's
behavior.

Another point: can we do combine across different basic blocks?

Any idea on this?

Thanks,
bin


Re: Question about how to fix PR69052

2016-01-26 Thread Bin.Cheng
On Mon, Jan 25, 2016 at 7:51 PM, Jeff Law  wrote:
> On 01/25/2016 11:42 AM, Bin.Cheng wrote:
>>
>>
>> Yuri Rumyantsev suggested we may add a hook to handle GOT-related
>> instruction propagation specially so it won't be hoisted out.  Does a
>> hook at this stage sound feasible?
>
> I think that would be a mistake.  ISTM this really needs to be cost driven.
>
>>
>> Also, it would be great if we could abstract an interface out of
>> combine that other passes could use to check whether instructions can
>> be merged or not.  I know this would be even harder than emulating
>> combine's behavior.
>>
>> Another point: can we do combine across different basic blocks?
>
> No, the combiner works within a basic block only.  There was a group, I
> believe in Moscow, that worked on a cross-block combiner.  It was discussed
> at the Cauldron in California a few years back.  I don't know if they did
> any further work on those ideas.
>
> Is the issue here not knowing when the loop invariant can be forward
> propagated back into the memory reference -- and for complex RTL like you
> cited, you ultimately determine that no it can't be propagated and thus
> increase its invariant cost?
Yes, loop-invariant now increases the invariant cost if the invariant
can't be propagated into the address expression.  The problem is that we
check propagation by simply replacing the use with the def in the memory
reference, then verifying the resulting insn with verify_changes.
Apparently more advanced transforms are necessary, just like what combine
does.  I am surprised that rtl_fwprop_addr doesn't handle this either;
it's an exact example of address expression propagation.
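
The check is conceptually something like this (a simplified sketch from
memory, not the exact loop-invariant.c code; use_insn, use_loc, def_set
and inv->cost are my placeholders):

/* Substitute the invariant's definition into the use and let recog
   decide whether the resulting insn would still be recognizable.  */
validate_change (use_insn, use_loc, SET_SRC (def_set), true);
bool ok = verify_changes (0);
cancel_changes (0);  /* only probing, do not keep the substitution */
if (!ok)
  inv->cost++;       /* propagation fails: make hoisting less attractive */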

Thanks,
bin

>
> jeff


Re: Question about how to fix PR69052

2016-01-26 Thread Bin.Cheng
On Mon, Jan 25, 2016 at 8:05 PM, Bernd Schmidt  wrote:
> On 01/25/2016 08:51 PM, Jeff Law wrote:
>>
>> No, the combiner works within a basic block only.  There was a group, I
>> believe in Moscow, that worked on a cross-block combiner.  It was
>> discussed at the Cauldron in California a few years back.  I don't know
>> if they did any further work on those ideas.
>>
>> Is the issue here not knowing when the loop invariant can be forward
>> propagated back into the memory reference -- and for complex RTL like
>> you cited, you ultimately determine that no it can't be propagated and
>> thus increase its invariant cost?
>
>
> Is this just a pass ordering problem? What happens if you do loop-invariant
> after combine?
Yes, I moved the whole loop pass group (along with pass_web) after
combine and it worked.  A combine pass before loop-invariant can fix this
problem.  The passes below currently sit between the loop transforms and
combine:

  NEXT_PASS (pass_web);
  NEXT_PASS (pass_rtl_cprop);
  NEXT_PASS (pass_cse2);
  NEXT_PASS (pass_rtl_dse1);
  NEXT_PASS (pass_rtl_fwprop_addr);
  NEXT_PASS (pass_inc_dec);
  NEXT_PASS (pass_initialize_regs);
  NEXT_PASS (pass_ud_rtl_dce);

I think pass_web needs to stay after the loop transforms because it's
used to handle the live ranges of unrolled registers.
pass_fwprop_addr and pass_inc_dec should stay where they are now, and
putting pass_inc_dec before loop unrolling may be helpful to keep more of
the auto-increment addressing modes chosen by IVO.
We should not need to duplicate pass_initialize_regs.
So what about pass_rtl_cprop, cse2 and dse1?  Should these passes be
duplicated after the loop transforms so that loop-transformed code can be
cleaned up?
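
For concreteness, a sketch of the reordering I experimented with
(passes.def style; my approximation, not a tested patch):

  NEXT_PASS (pass_rtl_cprop);
  NEXT_PASS (pass_cse2);
  NEXT_PASS (pass_rtl_dse1);
  NEXT_PASS (pass_rtl_fwprop_addr);
  NEXT_PASS (pass_inc_dec);
  NEXT_PASS (pass_initialize_regs);
  NEXT_PASS (pass_ud_rtl_dce);
  NEXT_PASS (pass_combine);
  NEXT_PASS (pass_loop2);  /* loop transforms, incl. loop-invariant */
  NEXT_PASS (pass_web);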

Thanks,
bin
>
>
> Bernd
>


Re: Question about how to fix PR69052

2016-01-26 Thread Bin.Cheng
On Tue, Jan 26, 2016 at 12:56 PM, Bernd Schmidt  wrote:
> On 01/26/2016 10:48 AM, Bin.Cheng wrote:
>>
>> Yes, I moved the whole loop pass group (along with pass_web) after
>> combine and it worked.  A combine pass before loop-invariant can fix
>> this problem.  The passes below currently sit between the loop
>> transforms and combine:
>>
>>NEXT_PASS (pass_web);
>>NEXT_PASS (pass_rtl_cprop);
>>NEXT_PASS (pass_cse2);
>>NEXT_PASS (pass_rtl_dse1);
>>NEXT_PASS (pass_rtl_fwprop_addr);
>>NEXT_PASS (pass_inc_dec);
>>NEXT_PASS (pass_initialize_regs);
>>NEXT_PASS (pass_ud_rtl_dce);
>>
>> I think pass_web needs to stay after the loop transforms because it's
>> used to handle the live ranges of unrolled registers.
>> pass_fwprop_addr and pass_inc_dec should stay where they are now, and
>> putting pass_inc_dec before loop unrolling may be helpful to keep more
>> of the auto-increment addressing modes chosen by IVO.
>> We should not need to duplicate pass_initialize_regs.
>> So what about pass_rtl_cprop, cse2 and dse1?  Should these passes be
>> duplicated after the loop transforms so that loop-transformed code can
>> be cleaned up?
>
>
> Hard to tell in advance - you might want to experiment a bit; set up a large
> collection of input files and see what various approaches do to code
> generation. In any case, I suspect this is gcc-7 material (unfortunately).

Quite surprisingly, there is only one new failure after moving the loop
transforms (along with pass_web only).  Yes, it is a GCC 7 change; I will
revisit this to see whether it makes a substantial difference to code
generation.  Anyway, for the bug itself, there might be a simple fix in
the x86 backend according to the newest update.

Thanks,
bin


Re: Question about how to fix PR69052

2016-01-26 Thread Bin.Cheng
On Tue, Jan 26, 2016 at 1:51 PM, Bin.Cheng  wrote:
> On Tue, Jan 26, 2016 at 12:56 PM, Bernd Schmidt  
> wrote:
>> On 01/26/2016 10:48 AM, Bin.Cheng wrote:
>>>
>>> Yes, I moved the whole loop pass group (along with pass_web) after
>>> combine and it worked.  A combine pass before loop-invariant can fix
>>> this problem.  The passes below currently sit between the loop
>>> transforms and combine:
>>>
>>>NEXT_PASS (pass_web);
>>>NEXT_PASS (pass_rtl_cprop);
>>>NEXT_PASS (pass_cse2);
>>>NEXT_PASS (pass_rtl_dse1);
>>>NEXT_PASS (pass_rtl_fwprop_addr);
>>>NEXT_PASS (pass_inc_dec);
>>>NEXT_PASS (pass_initialize_regs);
>>>NEXT_PASS (pass_ud_rtl_dce);
>>>
>>> I think pass_web needs to stay after the loop transforms because it's
>>> used to handle the live ranges of unrolled registers.
>>> pass_fwprop_addr and pass_inc_dec should stay where they are now, and
>>> putting pass_inc_dec before loop unrolling may be helpful to keep more
>>> of the auto-increment addressing modes chosen by IVO.
>>> We should not need to duplicate pass_initialize_regs.
>>> So what about pass_rtl_cprop, cse2 and dse1?  Should these passes be
>>> duplicated after the loop transforms so that loop-transformed code can
>>> be cleaned up?
>>
>>
>> Hard to tell in advance - you might want to experiment a bit; set up a large
>> collection of input files and see what various approaches do to code
>> generation. In any case, I suspect this is gcc-7 material (unfortunately).
>
> Quite surprisingly, there is only one new failure after moving the loop
> transforms (along with pass_web only).  Yes, it is a GCC 7 change; I will
For the record: the only new failure is loop-9.c.  Given the insn
patterns below:

Loop:
   23: r106:DF=[`*.LC0']
  REG_EQUAL 1.842419990222931955941021442413330078125e+1
   24: [r101:DI]=r106:DF

insn 23 is merged into insn 24 by combine if it runs before
loop-invariant, resulting in the insns below:

   23: NOTE_INSN_DELETED
   24: [r101:DI]=1.842419990222931955941021442413330078125e+1

But the floating-point constant isn't supported as an immediate, so insn
24 will be split by reload anyway:

   35: xmm0:DF=[`*.LC0']
   24: [di:DI]=xmm0:DF

This time both instructions are inside the loop, causing a regression.

Combine may need some changes if we run it before loop-invariant.

Thanks,
bin

> revisit this to see whether it makes a substantial difference to code
> generation.  Anyway, for the bug itself, there might be a simple fix in
> the x86 backend according to the newest update.
>
> Thanks,
> bin


Re: Question about how to fix PR69052

2016-01-26 Thread Bin.Cheng
On Tue, Jan 26, 2016 at 4:26 PM, Jeff Law  wrote:
> On 01/26/2016 02:28 AM, Bin.Cheng wrote:
>>
>> Yes, loop invariant now increased invariant cost if the invariant
>> can't be propagated into address expression.  Problem is we check
>> propagation by simply replacing use with def in memory reference then
>> verifying result insn with verify_changes.  Apparently more advanced
>> transforms are necessary, just like what combine does.
>> I am surprised that rtl_fwprop_addr doesn't handle this either, it's
>> an exact address expression propagation example.
>
> So the question now becomes whether or not the work done by combine.c is
> dependent on knowledge that is specific to combine or is it just a series of
> transformations that we could make available (preferably moving it to
> simplify-rtx).
>
> If we can factor the transformation out and shove it into simplify-rtx, then
> it could be used elsewhere.
According to the comment at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052#c6, it should be just
a re-arrangement of operands, if the comment itself holds.
In that case, the transformation is backend-related; maybe it's better to
keep it in the backend, right?

Thanks,
bin
>
> jeff


libstdc++ and c library compatible issue when bootstrap GCC

2016-01-28 Thread Bin.Cheng
Hi,
I ran into the error message below at stage 2 of bootstrapping GCC:

/work/obj/gcc-bootstrap/./prev-gcc/xg++
-B/work/obj/gcc-bootstrap/./prev-gcc/ -B//aarch64-none-linux-gnu/bin/
-nostdinc++ 
-B/work/obj/gcc-bootstrap/prev-aarch64-none-linux-gnu/libstdc++-v3/src/.libs
-B/work/obj/gcc-bootstrap/prev-aarch64-none-linux-gnu/libstdc++-v3/libsupc++/.libs
 
-I/work/obj/gcc-bootstrap/prev-aarch64-none-linux-gnu/libstdc++-v3/include/aarch64-none-linux-gnu
 -I/work/obj/gcc-bootstrap/prev-aarch64-none-linux-gnu/libstdc++-v3/include
 -I/src/gcc/libstdc++-v3/libsupc++
-L/work/obj/gcc-bootstrap/prev-aarch64-none-linux-gnu/libstdc++-v3/src/.libs
-L/work/obj/gcc-bootstrap/prev-aarch64-none-linux-gnu/libstdc++-v3/libsupc++/.libs
-c   -g -O2 -gtoggle -DIN_GCC -fno-exceptions -fno-rtti
-fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings
-Wcast-qual -Wmissing-format-attribute -Woverloaded-virtual -pedantic
-Wno-long-long -Wno-variadic-macros -Wno-overlength-strings
-fno-common  -DHAVE_CONFIG_H -DGENERATOR_FILE -fno-PIE -I. -Ibuild
-I/src/gcc/gcc -I/src/gcc/gcc/build -I/src/gcc/gcc/../include
-I/src/gcc/gcc/../libcpp/include  \
-o build/genautomata.o /src/gcc/gcc/genautomata.c
In file included from /src/gcc/gcc/genautomata.c:116:0:
/work/obj/gcc-bootstrap/prev-aarch64-none-linux-gnu/libstdc++-v3/include/math.h:65:12:
error: ‘constexpr bool std::isinf(double)’ conflicts with a previous
declaration
 using std::isinf;
^

In file included from /usr/include/features.h:374:0,
 from /usr/include/stdio.h:27,
 from /src/gcc/gcc/system.h:46,
 from /src/gcc/gcc/genautomata.c:108:
/usr/include/aarch64-linux-gnu/bits/mathcalls.h:201:1: note: previous
declaration ‘int isinf(double)’
 __MATHDECL_1 (int,isinf,, (_Mdouble_ __value)) __attribute__ ((__const__));
 ^

I kind of understand what the problem is, based on the facts below:
1) I bootstrap gcc against a new glibc;
2) I configure gcc with "--with-build-sysroot".  According to the GCC
documentation, this only affects how target libraries are built.
3) binary executables are built in GCC for internal use.  These binaries
are used to generate internal files while building gcc, and I guess they
are not considered target libraries, so the system C libraries/headers
are used.
4) at stage 2 of the gcc bootstrap, these binaries are compiled by the
stage 1 xg++, which by default uses the new libstdc++ just built in
stage 1.
5) The problem: the new libstdc++ is configured/built against the new
glibc because of "--with-build-sysroot", so it's incompatible with the
system C headers.

It's a little tricky.  The first question is: is it reasonable to use
"--with-build-sysroot" alone when building/bootstrapping gcc?  If yes,
how should we build these binaries in GCC?  It seems to me it's not
reasonable to use the new libstdc++ along with the system C library,
since the former is built against the new glibc.

Any suggestion?  Thanks in advance.

Thanks,
bin


Help about how to bootstrap gcc with local version glibc other than system one

2016-02-01 Thread Bin.Cheng
Hi,
Recently I tried to bootstrap gcc against a new glibc but failed.  What
I want to do is bootstrap gcc against a local version of glibc rather
than the system one, because I can't update glibc on that system.
I tried this by configuring GCC using "--with-build-sysroot" or
"--with-sysroot" or both, but all attempts failed.

When "--with-build-sysroot" is used, stage2 gcc failed when building
internal binaries like genautomata, because it uses system version
glibc and stage1 libstdc++.  Apparently these two libraries are
incompatible because stage1 libstdc++ is built against new glibc.

When "--with-sysroot" is used, stage1 gcc failed when building libgcc,
because it tried to find sysroot in build directory according to how
sysroot is relocated in GCC.  Apparently this doesn't exist because
xgcc doesn't come from toolchain distribution.

When "--with-sysroot" and "--with-build-sysroot" are used, stage2 GCC
failed when configuring itself because some configuration check cannot
find correct sysroot (standard headers like stdio.h).

It seems to me Andrew was right in his comment on PR69559: we simply
can't bootstrap GCC with a sysroot.

My question here is: if this is the case, how should I bootstrap a gcc
against a local version of glibc rather than the system one?  Is a
chroot the only way to do that?

Thanks,
bin


Re: Help about how to bootstrap gcc with local version glibc other than system one

2016-02-01 Thread Bin.Cheng
On Mon, Feb 1, 2016 at 6:08 PM, Andreas Schwab  wrote:
> "Bin.Cheng"  writes:
>
>> It seems to me Andrew was right in his comment on PR69559: we simply
>> can't bootstrap GCC with a sysroot.
>
> The main use of sysroot is to build a cross compiler, which you cannot
> bootstrap anyway.
>
>> My question here is: if this is the case, how should I bootstrap a gcc
>> against a local version of glibc rather than the system one?  Is a
>> chroot the only way to do that?
>
> Yes, building in a chroot or a VM is the best way to do it.  For
> example, that's how the openSUSE Build Service works.
Hi Andreas,
Thanks very much for helping.  I will try to do it in chroot.

Thanks,
bin
>
> Andreas.
>
> --
> Andreas Schwab, sch...@linux-m68k.org
> GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
> "And now for something completely different."


Re: Help about how to bootstrap gcc with local version glibc other than system one

2016-02-02 Thread Bin.Cheng
On Mon, Feb 1, 2016 at 7:45 PM, Jeff Law  wrote:
> On 02/01/2016 12:07 PM, Bin.Cheng wrote:
>>
>> On Mon, Feb 1, 2016 at 6:08 PM, Andreas Schwab 
>> wrote:
>>>
>>> "Bin.Cheng"  writes:
>>>
>>>> It seems to me Andrew was right in his comment on PR69559: we simply
>>>> can't bootstrap GCC with a sysroot.
>>>
>>>
>>> The main use of sysroot is to build a cross compiler, which you cannot
>>> bootstrap anyway.
>>>
>>>> My question here is: if this is the case, how should I bootstrap a gcc
>>>> against a local version of glibc rather than the system one?  Is a
>>>> chroot the only way to do that?
>>>
>>>
>>> Yes, building in a chroot or a VM is the best way to do it.  For
>>> example, that's how the openSUSE Build Service works.
>>
>> Hi Andreas,
>> Thanks very much for helping.  I will try to do it in chroot.
>
> Definitely what I'd recommend as well.
>
> We do this regularly with something called "mock" on Fedora.  I'm sure SuSE,
> Debian, Ubuntu, etc have an equivalent.
>
> Essentially they create a chroot, populate it with build dependencies
> extracted from the source package, then build within the chroot.  You can
> arrange to get a different glibc during instantiation of the chroot, or
> upgrade it after the chroot is fully instantiated.
>
> I'm sure there's a way to do this with containers too.
Thanks very much.  I realized that bootstrapping GCC against a new local
glibc is mainly for testing purposes, because that gcc cannot be
distributed alone to a system with an old glibc; the old glibc might not
have new symbols supported in the new glibc.  The other use case is
populating a whole system from scratch.

Thanks,
bin


Inconsistent initialization for pic_offset_table_rtx?

2016-02-04 Thread Bin.Cheng
Hi,
I noticed that pic_offset_table_rtx is initialized twice in GCC.  Take
x86_32 as an example.
The first initialization is done in emit_init_regs, with the code below:

  pic_offset_table_rtx = NULL_RTX;
  if ((unsigned) PIC_OFFSET_TABLE_REGNUM != INVALID_REGNUM)
pic_offset_table_rtx = gen_raw_REG (Pmode, PIC_OFFSET_TABLE_REGNUM);

On x86_32 with pic, we have:

(gdb) call debug_rtx(this_target_rtl->x_pic_offset_table_rtx)
(reg:SI 3 bx)

The second initialization is in expand_used_vars, with the code below:

  if (targetm.use_pseudo_pic_reg ())
pic_offset_table_rtx = gen_reg_rtx (Pmode);

On x86_32 with pic, we have:

(gdb) call debug_rtx(this_target_rtl->x_pic_offset_table_rtx)
(reg:SI 87)

So basically after expanding the first function, pic_offset_table_rtx
is set to a pseudo register, rather than the one initialized in
emit_init_regs.

This also causes inconsistent compilation between the first and the
remaining functions in one compilation unit.

A bug?

Thanks,
bin


Re: Inconsistent initialization for pic_offset_table_rtx?

2016-02-04 Thread Bin.Cheng
On Thu, Feb 4, 2016 at 3:18 PM, Ilya Enkovich  wrote:
> 2016-02-04 17:12 GMT+03:00 Bin.Cheng :
>> Hi,
>> I noticed that pic_offset_table_rtx is initialized twice in GCC.  Take
>> x86_32 as an example.
>> The first initialization is done in emit_init_regs, with below code:
>>
>>   pic_offset_table_rtx = NULL_RTX;
>>   if ((unsigned) PIC_OFFSET_TABLE_REGNUM != INVALID_REGNUM)
>> pic_offset_table_rtx = gen_raw_REG (Pmode, PIC_OFFSET_TABLE_REGNUM);
>>
>> On x86_32 with pic, we have:
>>
>> (gdb) call debug_rtx(this_target_rtl->x_pic_offset_table_rtx)
>> (reg:SI 3 bx)
>>
>> The second initialization is in expand_used_vars, with below code:
>>
>>   if (targetm.use_pseudo_pic_reg ())
>> pic_offset_table_rtx = gen_reg_rtx (Pmode);
>>
>> On x86_32 with pic, we have:
>>
>> (gdb) call debug_rtx(this_target_rtl->x_pic_offset_table_rtx)
>> (reg:SI 87)
>>
>> So basically after expanding the first function, pic_offset_table_rtx
>> is set to a pseudo register, rather than the one initialized in
>> emit_init_regs.
>>
>> Also this causes inconsistent compilation for the first/rest functions
>> in one compilation unit.
>>
>> A bug?
>
> For i386 target PIC_OFFSET_TABLE_REGNUM actually checks
> ix86_use_pseudo_pic_reg and is supposed to return INVALID_REGNUM
> in case we use pseudo register for PIC. BUT we hit a case when PIC
> code is generated for cost estimation via target hooks while performing
> some GIMPLE pass. In this case we need to return some register to
Thanks Ilya.  This is exactly the case I ran into.  See PR69042.

> generate PIC usage but we don't have any allocated. In this case we
> return a hard register. We detect such situation by checking
> pic_offset_table_rtx.
>
> Thus if we use pseudo PIC register but pic_offset_table_rtx is not
> initialized yet,
> then PIC_OFFSET_TABLE_REGNUM returns a hard register.
>
> So I suppose we may consider the first assignment as a bug.

But I don't quite follow.  So a hard register is returned so that gimple
passes can construct PIC-related addresses?  If that's the case, the
first initialization is necessary.
Another question is about address cost:

  if (parts.index
      && (!REG_P (parts.index) || REGNO (parts.index) >= FIRST_PSEUDO_REGISTER)
      && (current_pass->type == GIMPLE_PASS
          || !pic_offset_table_rtx
          || !REG_P (parts.index)
          || REGNO (pic_offset_table_rtx) != REGNO (parts.index)))
    cost++;
Is it a bug in the second sub-condition?  Considering
"current_pass->type == GIMPLE_PASS" in the third sub-condition, can I
assume the second is for non-GIMPLE passes only?

Thanks,
bin
>
> Thanks,
> Ilya
>
>>
>> Thanks,
>> bin


Re: Inconsistent initialization for pic_offset_table_rtx?

2016-02-09 Thread Bin.Cheng
On Fri, Feb 5, 2016 at 10:32 AM, Ilya Enkovich  wrote:
> 2016-02-04 19:16 GMT+03:00 Bin.Cheng :
>> On Thu, Feb 4, 2016 at 3:18 PM, Ilya Enkovich  wrote:
>>> 2016-02-04 17:12 GMT+03:00 Bin.Cheng :
>>>> Hi,
>>>> I noticed that pic_offset_table_rtx is initialized twice in GCC.  Take
>>>> x86_32 as an example.
>>>> The first initialization is done in emit_init_regs, with below code:
>>>>
>>>>   pic_offset_table_rtx = NULL_RTX;
>>>>   if ((unsigned) PIC_OFFSET_TABLE_REGNUM != INVALID_REGNUM)
>>>> pic_offset_table_rtx = gen_raw_REG (Pmode, PIC_OFFSET_TABLE_REGNUM);
>>>>
>>>> On x86_32 with pic, we have:
>>>>
>>>> (gdb) call debug_rtx(this_target_rtl->x_pic_offset_table_rtx)
>>>> (reg:SI 3 bx)
>>>>
>>>> The second initialization is in expand_used_vars, with below code:
>>>>
>>>>   if (targetm.use_pseudo_pic_reg ())
>>>> pic_offset_table_rtx = gen_reg_rtx (Pmode);
>>>>
>>>> On x86_32 with pic, we have:
>>>>
>>>> (gdb) call debug_rtx(this_target_rtl->x_pic_offset_table_rtx)
>>>> (reg:SI 87)
>>>>
>>>> So basically after expanding the first function, pic_offset_table_rtx
>>>> is set to a pseudo register, rather than the one initialized in
>>>> emit_init_regs.
>>>>
>>>> Also this causes inconsistent compilation for the first/rest functions
>>>> in one compilation unit.
>>>>
>>>> A bug?
>>>
>>> For i386 target PIC_OFFSET_TABLE_REGNUM actually checks
>>> ix86_use_pseudo_pic_reg and is supposed to return INVALID_REGNUM
>>> in case we use pseudo register for PIC. BUT we hit a case when PIC
>>> code is generated for cost estimation via target hooks while performing
>>> some GIMPLE pass. In this case we need to return some register to
>> Thanks IIya.  This is exact the case I ran into.  See PR69042.
>>
>>> generate PIC usage but we don't have any allocated. In this case we
>>> return a hard register. We detect such situation by checking
>>> pic_offset_table_rtx.
>>>
>>> Thus if we use pseudo PIC register but pic_offset_table_rtx is not
>>> initialized yet,
>>> then PIC_OFFSET_TABLE_REGNUM returns a hard register.
>>>
>>> So I suppose we may consider the first assignment as a bug.
>>
>> But I don't quite follow.  So hard register is returned so that gimple
>> passes can construct PIC related addresses?  If this is the case, the
>> first initialization is necessary.
>
> Right, we need some initialization but current way leads to inconsistent
> value and we may 'randomly' get hard or pseudo register for different 
> functions
> which possibly affects some optimizations. We probably need a better
> way to check if we should return a hard reg for PIC register. Or maybe
> reset ix86_use_pseudo_pic_reg to NULL when function is finalized or
> get rid of hard reg at all and always use a pseudo register.
>
>> Another question is about address cost:
>>
>>   if (parts.index
>>       && (!REG_P (parts.index) || REGNO (parts.index) >= FIRST_PSEUDO_REGISTER)
>>       && (current_pass->type == GIMPLE_PASS
>>           || !pic_offset_table_rtx
>>           || !REG_P (parts.index)
>>           || REGNO (pic_offset_table_rtx) != REGNO (parts.index)))
>>     cost++;
>> Is it a bug in the second sub condition?  Considering
>> "current_pass->type == GIMPLE_PASS" in the third sub condition, can I
>> assume the second is for non-GIMPLE passes only?
>
> There is just code duplicated for the parts.base and parts.index
> registers.  We use the same conditions for both of them because you can
> easily swap them if the scale is 1.
>
> I don't know why we don't try to recognize PIC registers when in GIMPLE pass.

Could we define PIC_OFFSET_TABLE_REGNUM to a pseudo register when
ix86_use_pseudo_pic_reg returns true, as in this case?  The current
logic doesn't look consistent to me.  Of course, I know little about x86.

Thanks,
bin
>
> Thanks,
> Ilya
>
>>
>> Thanks,
>> bin
>>>
>>> Thanks,
>>> Ilya
>>>
>>>>
>>>> Thanks,
>>>> bin


Re: Inconsistent initialization for pic_offset_table_rtx?

2016-02-09 Thread Bin.Cheng
On Tue, Feb 9, 2016 at 2:56 PM, Ilya Enkovich  wrote:
> 2016-02-09 17:27 GMT+03:00 Bin.Cheng :
>> On Fri, Feb 5, 2016 at 10:32 AM, Ilya Enkovich  
>> wrote:
>>> 2016-02-04 19:16 GMT+03:00 Bin.Cheng :
>>>> On Thu, Feb 4, 2016 at 3:18 PM, Ilya Enkovich  
>>>> wrote:
>>>>> 2016-02-04 17:12 GMT+03:00 Bin.Cheng :
>>>>>> Hi,
>>>>>> I noticed that pic_offset_table_rtx is initialized twice in GCC.  Take
>>>>>> x86_32 as an example.
>>>>>> The first initialization is done in emit_init_regs, with below code:
>>>>>>
>>>>>>   pic_offset_table_rtx = NULL_RTX;
>>>>>>   if ((unsigned) PIC_OFFSET_TABLE_REGNUM != INVALID_REGNUM)
>>>>>> pic_offset_table_rtx = gen_raw_REG (Pmode, PIC_OFFSET_TABLE_REGNUM);
>>>>>>
>>>>>> On x86_32 with pic, we have:
>>>>>>
>>>>>> (gdb) call debug_rtx(this_target_rtl->x_pic_offset_table_rtx)
>>>>>> (reg:SI 3 bx)
>>>>>>
>>>>>> The second initialization is in expand_used_vars, with below code:
>>>>>>
>>>>>>   if (targetm.use_pseudo_pic_reg ())
>>>>>> pic_offset_table_rtx = gen_reg_rtx (Pmode);
>>>>>>
>>>>>> On x86_32 with pic, we have:
>>>>>>
>>>>>> (gdb) call debug_rtx(this_target_rtl->x_pic_offset_table_rtx)
>>>>>> (reg:SI 87)
>>>>>>
>>>>>> So basically after expanding the first function, pic_offset_table_rtx
>>>>>> is set to a pseudo register, rather than the one initialized in
>>>>>> emit_init_regs.
>>>>>>
>>>>>> Also this causes inconsistent compilation for the first/rest functions
>>>>>> in one compilation unit.
>>>>>>
>>>>>> A bug?
>>>>>
>>>>> For i386 target PIC_OFFSET_TABLE_REGNUM actually checks
>>>>> ix86_use_pseudo_pic_reg and is supposed to return INVALID_REGNUM
>>>>> in case we use pseudo register for PIC. BUT we hit a case when PIC
>>>>> code is generated for cost estimation via target hooks while performing
>>>>> some GIMPLE pass. In this case we need to return some register to
>>>> Thanks Ilya.  This is exactly the case I ran into.  See PR69042.
>>>>
>>>>> generate PIC usage but we don't have any allocated. In this case we
>>>>> return a hard register. We detect such situation by checking
>>>>> pic_offset_table_rtx.
>>>>>
>>>>> Thus if we use pseudo PIC register but pic_offset_table_rtx is not
>>>>> initialized yet,
>>>>> then PIC_OFFSET_TABLE_REGNUM returns a hard register.
>>>>>
>>>>> So I suppose we may consider the first assignment as a bug.
>>>>
>>>> But I don't quite follow.  So hard register is returned so that gimple
>>>> passes can construct PIC related addresses?  If this is the case, the
>>>> first initialization is necessary.
>>>
>>> Right, we need some initialization but current way leads to inconsistent
>>> value and we may 'randomly' get hard or pseudo register for different 
>>> functions
>>> which possibly affects some optimizations. We probably need a better
>>> way to check if we should return a hard reg for PIC register. Or maybe
>>> reset ix86_use_pseudo_pic_reg to NULL when function is finalized or
>>> get rid of hard reg at all and always use a pseudo register.
>>>
>>>> Another question is about address cost:
>>>>
>>>>   if (parts.index
>>>>   && (!REG_P (parts.index) || REGNO (parts.index) >= 
>>>> FIRST_PSEUDO_REGISTER)
>>>>   && (current_pass->type == GIMPLE_PASS
>>>>   || !pic_offset_table_rtx
>>>>   || !REG_P (parts.index)
>>>>   || REGNO (pic_offset_table_rtx) != REGNO (parts.index)))
>>>> cost++;
>>>> Is it a bug in the second sub condition?  Considering
>>>> "current_pass->type == GIMPLE_PASS" in the third sub condition, can I
>>>> assume the second is for non-GIMPLE passes only?
>>>
>>> There is just a code duplicated for parts.base and parts.index
>>> registers. We use same
>>> conditions for both of them because you can easily swap them if scale is 1.
>>>
>>> I don't know why we don't try to recognize PIC registers when in GIMPLE 
>>> pass.
>>
>> Could we define PIC_OFFSET_TABLE_REGNUM to a pseudo register when
>> ix86_use_pseudo_pic_reg returns true, as in this case?  Current logic
>> doesn't look consistent to me.  Of course, I know little about x86.
>
> I agree it looks inconsistent.  But I don't think PIC_OFFSET_TABLE_REGNUM is
> supposed to return pseudo regno.  Using EBX_REG value for this macro was a
> workaround for problem of NULL pic_offset_table_rtx usage in cost
> functions. I think
> we should try to initialize pic_offset_table_rtx with some pseudo
> register in i386
> target for cost estimation purposes and always return INVALID_REG for
> PIC_OFFSET_TABLE_REGNUM.
Hi Ilya,
Could you please try a patch along those lines?  You are more
experienced here than I am.
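
For my own understanding, the direction would be something like the
following (untested sketch; the helper name is made up, and the lazy
allocation just mirrors what expand_used_vars already does):

  /* Create the pseudo lazily, so cost hooks called from GIMPLE
     passes see a consistent PIC register instead of %ebx.  */
  static rtx
  ix86_pic_reg_for_cost_estimation (void)
  {
    if (!pic_offset_table_rtx && ix86_use_pseudo_pic_reg ())
      pic_offset_table_rtx = gen_reg_rtx (Pmode);
    return pic_offset_table_rtx;
  }

with PIC_OFFSET_TABLE_REGNUM then always returning INVALID_REGNUM
whenever ix86_use_pseudo_pic_reg is true.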

Thanks,
bin
>
> Thanks,
> Ilya
>
>>
>> Thanks,
>> bin
>>>
>>> Thanks,
>>> Ilya
>>>
>>>>
>>>> Thanks,
>>>> bin
>>>>>
>>>>> Thanks,
>>>>> Ilya
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> bin


Re: Inconsistent initialization for pic_offset_table_rtx?

2016-02-09 Thread Bin.Cheng
On Tue, Feb 9, 2016 at 3:53 PM, Ilya Enkovich  wrote:
> 2016-02-09 18:45 GMT+03:00 Richard Earnshaw (lists) 
> :
>> On 09/02/16 14:56, Ilya Enkovich wrote:
>>>
>>> I agree it looks inconsistent.  But I don't think PIC_OFFSET_TABLE_REGNUM is
>>> supposed to return pseudo regno.  Using EBX_REG value for this macro was a
>>> workaround for problem of NULL pic_offset_table_rtx usage in cost
>>> functions. I think
>>> we should try to initialize pic_offset_table_rtx with some pseudo
>>> register in i386
>>> target for cost estimation purposes and always return INVALID_REG for
>>> PIC_OFFSET_TABLE_REGNUM.
>>>
>>
>> The ARM port has been using a pseudo for P_O_T_R for many years now.  It
>> needs some care to establish it correctly in the prologue when needed,
>> but otherwise works very well.  It means that leaf functions in
>> particular can be made much more efficient as they can pick a
>> call-clobbered register when that's desirable.
>
> We may need it from middle-end when estimate statements cost.  So
> initialization in
> prologue is not enough (at least for i386 target).  Looks like those
> cost estimation hooks
> should take care of PIC register by themselves.
By cost estimation hooks, do you mean xxx_address_cost here?  The
middle end itself doesn't know how the PIC register is supported by
targets.  I admit there is a missing piece for the PIC register in
IVOPT's register pressure estimate.
IIUC, the hard reg is only used for cost estimation in the middle end,
and only for the first function in each compilation unit, right?  It's
overridden with a pseudo register when the first function is expanded
anyway?

Thanks,
bin
>
> Thanks,
> Ilya
>
>>
>> R.
>>
>>> Thanks,
>>> Ilya
>>>



Re: ipa vrp implementation in gcc

2016-02-10 Thread Bin.Cheng
On Mon, Jan 18, 2016 at 5:10 PM, Jan Hubicka  wrote:
>> On Mon, Jan 18, 2016 at 12:00 AM, Kugan
>>  wrote:
>> > Hi,
>> >
>> >> Another potential use of value ranges is the profile estimation.
>> >> http://www.lighterra.com/papers/valuerangeprop/Patterson1995-ValueRangeProp.pdf
>> >> It seems to me that we may want to have something that can feed sane loop
>> >> bounds for profile estimation as well and we can easily store the known
>> >> value ranges to SSA name annotations.
>> >> So I think separate local pass to compute value ranges (perhaps with less
>> >> accuracy than full blown VRP) is desirable.
>> >
>> > Thanks for the reference. I am looking at implementing a local pass for
>> > VRP. The value range computation in tree-vrp is based on the above
>> > reference and uses ASSERT_EXPR insertion (I understand that you posted
>> > the reference above for profile estimation). As Richard mentioned in his
>> > reply, the local pass should not rely on ASSERT_EXPR insertion.
>> > Therefore, do you have any specific algorithm in mind (i.e. Any
>> > published paper or reference from book)?. Of course we can tweak the
>> > algorithm from the reference above but would like to understand what
>> > your intension are.
>>
>> I have (very incomplete) prototype patches to do a dominator-based
>> approach instead (what is refered to downthread as non-iterating approach).
>> That's cheaper and is what I'd like to provide as an "utility style" 
>> interface
>> to things liker niter analysis which need range-info based on a specific
>> dominator (the loop header for example).
>
> In general, given that we have existing VRP implementation I would suggest
> first implementing the IPA propagation and profile estimation bits using
> existing VRP pass and then try to compare the simple dominator based approach
> with the VRP we have and see what are the compile time/code quality effects
> of both. Based on that we can decide how complex VRP we really want.
Hi Honza,
These two don't conflict with each other, right?  The control-flow
sensitive VRP in each function could well inherit the IPA analysis
results.
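
For instance, something like this rough sketch (ipa_min/ipa_max are
placeholders for whatever accessors the IPA propagation provides, and
the set_range_info signature is from memory):

  /* Seed parameter default defs with the IPA-computed ranges before
     running the local, control-flow sensitive VRP.  */
  for (tree parm = DECL_ARGUMENTS (current_function_decl);
       parm; parm = DECL_CHAIN (parm))
    {
      tree name = ssa_default_def (cfun, parm);
      if (name && INTEGRAL_TYPE_P (TREE_TYPE (name)))
        set_range_info (name, VR_RANGE, ipa_min (parm), ipa_max (parm));
    }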

Thanks,
bin
>
> It will be probably also more fun to implement it this way :)
> I plan to collect some data on early VRP and firefox today or tomorrow.
>
> Honza
>>
>> Richard.
>>
>> >> I think the ipa-prop.c probably won't need any significant changes.  The
>> >> code basically analyzes what values are passed through the function and
>> >> this works for constants as well as for intervals. In fact ipa-cp already
>> >> uses the same ipa-prop analysis for
>> >>  1) constant propagation
>> >>  2) alignment propagation
>> >>  3) propagation of known polymorphic call contextes.
>> >>
>> >> So replacing 1) by value range propagation should be easily doable.
>> >> I would also like to replace alignment propagation by bitwise constant
>> >> propagation (i.e. propagating what bits are known to be zero and what
>> >> bits are known to be one). We already do have bitwise CCP, so we could
>> >> deal with this basically in the same way as we deal with value ranges.
>> >>
>> >> ipa-prop could use bit of clenaing up and modularizing that I hope will
>> >> be done next stage1 :)
>> >
>> > We (Myself and Prathamesh) are interested in working on LTO
>> > improvements. Let us have a look at this.
>> >
>> >>>
>>  - Once we have the value ranges for parameter/return values, we could
>>  rely on tree-vrp to use this and do the optimizations
>> >>>
>> >>> Yep.  IPA transform phase should annotate parameter default defs with
>> >>> computed ranges.
>> >>
>> >> Yep, in addition we will end up with known value ranges stored in 
>> >> aggregates
>> >> for that we need better separate representaiton
>> >>
>> >> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68930
>> >>>
>> >
>> > Thanks,
>> > Kugan


Re: Help about how to bootstrap gcc with local version glibc other than system one

2016-02-22 Thread Bin.Cheng
On Mon, Feb 1, 2016 at 7:45 PM, Jeff Law  wrote:
> On 02/01/2016 12:07 PM, Bin.Cheng wrote:
>>
>> On Mon, Feb 1, 2016 at 6:08 PM, Andreas Schwab 
>> wrote:
>>>
>>> "Bin.Cheng"  writes:
>>>
>>>> Seems to me Andrew was right in comment of PR69559, that we simply
>>>> couldn't bootstrap GCC with sysroot.
>>>
>>>
>>> The main use of sysroot is to build a cross compiler, which you cannot
>>> bootstrap anyway.
>>>
>>>> My question here is: If this is the case, how should I bootstrap a gcc
>>>> against local version glibc, rather than the system one?  Is chroot
>>>> the only way to do that?
>>>
>>>
>>> Yes, building in a chroot or a VM is the best way to do it.  For
>>> example, that's how the openSUSE Build Service works.
>>
>> Hi Andreas,
>> Thanks very much for helping.  I will try to do it in chroot.
>
> Definitely what I'd recommend as well.
>
> We do this regularly with something called "mock" on Fedora.  I'm sure SuSE,
> Debian, Ubuntu, etc have an equivalent.
>
> Essentially they create a chroot, populate it with build dependencies
> extracted from the source package, then build within the chroot.  You can
> arrange to get a different glibc during instantiation of the chroot, or
> upgrade it after the chroot is fully instantiated.
Hi,
I still don't quite follow this method.  If I set up a chroot
environment with the new glibc, it's still possible that the new glibc
isn't compatible with the default gcc in the chroot.  Isn't this a
chicken-and-egg problem, since we want to build our gcc against the
new glibc in the first place?

Thanks,
bin
>
> I'm sure there's a way to do this with containers too.
>
> jeff


Unnecessary check on phi node in tree if-conversion?

2016-04-06 Thread Bin.Cheng
Hi,
Function if_convertible_phi_p has below check on virtual PHI nodes:


  if (any_mask_load_store)
return true;

  /* When there were no if-convertible stores, check
 that there are no memory writes in the branches of the loop to be
 if-converted.  */
  if (virtual_operand_p (gimple_phi_result (phi)))
{
  imm_use_iterator imm_iter;
  use_operand_p use_p;

  if (bb != loop->header)
{
  if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file, "Virtual phi not on loop->header.\n");
  return false;
}

  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, gimple_phi_result (phi))
{
  if (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI
  && USE_STMT (use_p) != phi)
{
  if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file, "Difficult to handle this virtual phi.\n");
  return false;
}
}
}

Since the check is short-cut by any_mask_load_store, when the check is
reached it means there is a virtual phi node but no conditional
memory store?  Is this possible?  Plus, any_mask_load_store is set by
the code below in if_convertible_gimple_assign_stmt_p:
  if (ifcvt_can_use_mask_load_store (stmt))
{
  gimple_set_plf (stmt, GF_PLF_2, true);
  *any_mask_load_store = true;
  return true;
}

So in theory it's possible to skip the aforementioned check when only
a masked load is encountered.

Any ideas?

Thanks,
bin


Re: Unnecessary check on phi node in tree if-conversion?

2016-04-06 Thread Bin.Cheng
On Wed, Apr 6, 2016 at 5:07 PM, Bin.Cheng  wrote:
> Hi,
> Function if_convertible_phi_p has below check on virtual PHI nodes:
>
>
>   if (any_mask_load_store)
> return true;
>
>   /* When there were no if-convertible stores, check
>  that there are no memory writes in the branches of the loop to be
>  if-converted.  */
>   if (virtual_operand_p (gimple_phi_result (phi)))
> {
>   imm_use_iterator imm_iter;
>   use_operand_p use_p;
>
>   if (bb != loop->header)
> {
>   if (dump_file && (dump_flags & TDF_DETAILS))
> fprintf (dump_file, "Virtual phi not on loop->header.\n");
>   return false;
> }
>
>   FOR_EACH_IMM_USE_FAST (use_p, imm_iter, gimple_phi_result (phi))
> {
>   if (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI
>   && USE_STMT (use_p) != phi)
> {
>   if (dump_file && (dump_flags & TDF_DETAILS))
> fprintf (dump_file, "Difficult to handle this virtual phi.\n");
>   return false;
> }
> }
> }
>
> Since the check is short-cut by any_mask_load_store, when the check is
> triggered, it means there is virtual phi node but no conditional
> memory stores?  Is this possible?  Plus, any_mask_load_store is set by
> below code in if_convertible_gimple_assign_stmt_p:
>   if (ifcvt_can_use_mask_load_store (stmt))
> {
>   gimple_set_plf (stmt, GF_PLF_2, true);
>   *any_mask_load_store = true;
>   return true;
> }
>
> So in theory it's possible to skip aforementioned check when only mask
> load is encountered.
>
> Any ideas?
It's possible to have a loop like:


  .MEM_2232 = PHI <.MEM_574(179), .MEM_1247(183)>
  ...
  if (cond)
goto 
  else
goto 

:  //empty
:
  .MEM_1247 = PHI <.MEM_2232(180), .MEM_2232(181)>
  if (cond2)
goto 
  else
goto 

:
  goto 

So can we handle PHIs like this, which can be degenerated, in if-cvt?

Thanks,
bin


Re: Unnecessary check on phi node in tree if-conversion?

2016-04-08 Thread Bin.Cheng
On Thu, Apr 7, 2016 at 10:30 AM, Richard Biener
 wrote:
> On April 6, 2016 8:21:35 PM GMT+02:00, "Bin.Cheng"  
> wrote:
>>On Wed, Apr 6, 2016 at 5:07 PM, Bin.Cheng 
>>wrote:
>>> Hi,
>>> Function if_convertible_phi_p has below check on virtual PHI nodes:
>>>
>>>
>>>   if (any_mask_load_store)
>>> return true;
>>>
>>>   /* When there were no if-convertible stores, check
>>>  that there are no memory writes in the branches of the loop to
>>be
>>>  if-converted.  */
>>>   if (virtual_operand_p (gimple_phi_result (phi)))
>>> {
>>>   imm_use_iterator imm_iter;
>>>   use_operand_p use_p;
>>>
>>>   if (bb != loop->header)
>>> {
>>>   if (dump_file && (dump_flags & TDF_DETAILS))
>>> fprintf (dump_file, "Virtual phi not on loop->header.\n");
>>>   return false;
>>> }
>>>
>>>   FOR_EACH_IMM_USE_FAST (use_p, imm_iter, gimple_phi_result
>>(phi))
>>> {
>>>   if (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI
>>>   && USE_STMT (use_p) != phi)
>>> {
>>>   if (dump_file && (dump_flags & TDF_DETAILS))
>>> fprintf (dump_file, "Difficult to handle this virtual
>>phi.\n");
>>>   return false;
>>> }
>>> }
>>> }
>>>
>>> Since the check is short-cut by any_mask_load_store, when the check
>>is
>>> triggered, it means there is virtual phi node but no conditional
>>> memory stores?  Is this possible?  Plus, any_mask_load_store is set
>>by
>>> below code in if_convertible_gimple_assign_stmt_p:
>>>   if (ifcvt_can_use_mask_load_store (stmt))
>>> {
>>>   gimple_set_plf (stmt, GF_PLF_2, true);
>>>   *any_mask_load_store = true;
>>>   return true;
>>> }
>>>
>>> So in theory it's possible to skip aforementioned check when only
>>mask
>>> load is encountered.
>>>
>>> Any ideas?
>>It's possible to have a loop like:
>>
>>
>>  .MEM_2232 = PHI <.MEM_574(179), .MEM_1247(183)>
>>  ...
>>  if (cond)
>>goto 
>>  else
>>goto 
>>
>>:  //empty
>>:
>>  .MEM_1247 = PHI <.MEM_2232(180), .MEM_2232(181)>
>>  if (cond2)
>>goto 
>>  else
>>goto 
>>
>>:
>>  goto 
>>
>>So can we handle the PHI which can be degenerated in if-cvt?
>
> I believe the check is bogus and we should drop it.
We do need to handle the aforementioned case though.  I suppose an
additional DCE pass before if-conv would fix the problem, or some
lightweight helper could?
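
Something like this is what I have in mind (untested sketch):

  /* A virtual PHI whose arguments are all the same vuse, like
     .MEM_1247 above, can simply be degenerated instead of making us
     reject the loop.  */
  static bool
  virtual_phi_degenerate_p (gphi *phi)
  {
    tree first = gimple_phi_arg_def (phi, 0);
    for (unsigned i = 1; i < gimple_phi_num_args (phi); i++)
      if (gimple_phi_arg_def (phi, i) != first)
        return false;
    return true;
  }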

Thanks,
bin


Question about fold C_MAYBE_CONST_EXPR expressions

2016-07-24 Thread Bin.Cheng
Hi,
I ran into a problem where the C frontend (in function
build_conditional_expr) creates an expression like (C_MAYBE_CONST_EXPR
(NULL, x + const)).  The inner expression (and its operands) have
unsigned int type.  After that, the expression needs to be cast to the
result type, which is unsigned short.  In this case,
convert_to_integer/convert_to_integer_1 doesn't fold into the
C_MAYBE_CONST_EXPR and just returns (unsigned
short)(C_MAYBE_CONST_EXPR (NULL, x + const)); as a result, (unsigned
short)(x + const) won't be simplified to (unsigned short)x + const.
My questions are: 1) Is it possible to fold the type conversion into
C_MAYBE_CONST_EXPR's inner expression in convert_to_integer_1?  2) If
not, it looks like there are a couple of places the mentioned fold can
be done, for example after stripping C_MAYBE_CONST_EXPR and before (or
during) gimplification.  So which place is preferred?
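
A reduced form of what I'm looking at is roughly the following
(hypothetical testcase, not the original one):

  unsigned int x;
  unsigned short
  f (int c)
  {
    /* build_conditional_expr wraps the arms in C_MAYBE_CONST_EXPR;
       the narrowing conversion of the result is then not folded
       into the additions.  */
    return c ? x + 5 : x + 7;
  }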

Thanks,
bin


Re: Question about fold C_MAYBE_CONST_EXPR expressions

2016-07-25 Thread Bin.Cheng
On Sun, Jul 24, 2016 at 11:04 PM, Prathamesh Kulkarni
 wrote:
> On 24 July 2016 at 21:26, Bin.Cheng  wrote:
>> Hi,
>> I ran into a problem that C frontend (in function
>> build_conditional_expr) creates expression like (C_MAYBE_CONST_EXPR
>> (NULL, x + const)).  The inner expression (and its operands) have
>> unsigned int type.  After that, the expression needs to be casted to
>> result type which is unsigned short.  In this case,
>> convert_to_integer/convert_to_integer_1 doesn't fold into
>> C_MAYBE_CONST_EXPR and just returns (unsigned
>> short)(C_MAYBE_CONST_EXPR (NULL, x + const)), as a result, (unsigned
>> short)(x + const) won't be simplified as (unsigned short)x + const.
>> My questions are, 1) Is it possible to fold type conversion into
>> C_MAYBE_CONST_EXPR's inner expression in convert_to_integer_1?  2) If
>> not, looks like there are couple of places the mentioned fold can be
>> done, for example, after stripping C_MAYBE_CONST_EXPR and before (or
>> during) gimplify.  So which place is preferred?
> Sorry, I have a very silly question:
> I don't understand how the following transform
> (unsigned short) (x + const) to (unsigned short)x + const would be legal
> since the operands to PLUS_EXPR in latter case would have different types
> (unsigned short, unsigned) ?
> I suppose in GIMPLE (and GENERIC too?)
> we require rhs1 and rhs2 to have same types  ?
I skipped the type information for the const operand; of course the
const will be converted to unsigned short, which is the type used for
the computation.

Thanks,
bin


Questions about pedantic_non_lvalue_loc and related code

2016-10-28 Thread Bin.Cheng
Hi,
I am trying to move simplifications in fold_cond_expr_with_comparison
to match.pd and had below questions about pedantic_non_lvalue_loc.

Last change for the function is:

commit 565353e8cf24046c034ecaf04dcb69ea8319a57c
Author: rguenth 
Date:   Tue Nov 11 15:21:12 2014 +

2014-11-11  Richard Biener  

* tree-core.h (pedantic_lvalues): Remove.
* fold-const.c (pedantic_lvalues): Likewise.
(pedantic_non_lvalue_loc): Remove conditional non_lvalue_loc call.

c/
* c-decl.c (c_init_decl_processing): Do not set pedantic_lvalues
to true.

It removes the call to non_lvalue_loc in pedantic_non_lvalue_loc.
IIUC, pedantic_lvalues was originally introduced for the generalized
lvalue extension, and it has been set to true ever since that feature
was removed long ago.  So before the aforementioned commit,
pedantic_non_lvalue_loc looked like:
 static tree
 pedantic_non_lvalue_loc (location_t loc, tree x)
 {
   if (pedantic_lvalues)
 return non_lvalue_loc (loc, x);

   return protected_set_expr_location_unshare (x, loc);
 }

The net effect of this function is thus always to return what
non_lvalue_loc returns, so shouldn't it be changed to the following?
 static tree
 pedantic_non_lvalue_loc (location_t loc, tree x)
 {
   return non_lvalue_loc (loc, x);
 }
Given that the change was made in 2014 and hasn't caused any issue
since, I can only assume two possible explanations:
A) expressions passed to pedantic_non_lvalue_loc are never lvalues,
so there is no need to wrap them in NON_LVALUE_EXPR by calling
non_lvalue_loc.
B) expressions passed to pedantic_non_lvalue_loc could be lvalues, and
the lvalue-ness is kept by not calling non_lvalue_loc, as it is now.

Question 1) is: which one of the above is correct?  After going
through all calls to pedantic_non_lvalue_loc, I tend to think A) is
correct.  Most calls in fold_cond_expr_with_comparison are
lvalue-safe, and the only exception is now also guarded by the fix
from PR19199.

Well, there are two calls outside of fold_cond_expr_with_comparison
which I am not sure about: one in fold_binary_loc for COMPOUND_EXPR,
the other in fold_ternary_loc for COND_EXPR; both calls are (somehow)
guarded by !TREE_SIDE_EFFECTS, for example:
case COMPOUND_EXPR:
  /* When pedantic, a compound expression can be neither an lvalue
 nor an integer constant expression.  */
  if (TREE_SIDE_EFFECTS (arg0) || TREE_CONSTANT (arg1))
return NULL_TREE;
  /* Don't let (0, 0) be null pointer constant.  */
  tem = integer_zerop (arg1) ? build1 (NOP_EXPR, type, arg1)
 : fold_convert_loc (loc, type, arg1);
  return pedantic_non_lvalue_loc (loc, tem);

The comment leads me to
Question 2): Does this mean an lvalue expression must have
side_effects_flag set?
If the answer is yes, then 1.A) holds.

Getting back to the current implementation of pedantic_non_lvalue_loc:
static tree
pedantic_non_lvalue_loc (location_t loc, tree x)
{
  return protected_set_expr_location_unshare (x, loc);
}

Function protected_set_expr_location_unshare was introduced for
PR45700.  If I read it correctly, the change is to set different
location information, when necessary, only after copying the original
expression.  Again, if the answer to Question 2) is yes,
pedantic_non_lvalue_loc can be cleaned up, based on the code before
the 2014 change.  And even if Question 2) is not always yes, the calls
to pedantic_non_lvalue_loc in fold_cond_expr_with_comparison are all
dead because it's certain the expression can't be an lvalue.
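
For reference, the kind of lvalue COND_EXPR I have in mind for 1.A) is
the C++ one (my own example, not taken from the original discussion):

  /* In C++ a conditional expression whose arms are lvalues of the
     same type is itself an lvalue, so fold must not lose that: */
  int a, b;
  void g (bool c) { (c ? a : b) = 1; }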

So any explanation?  Thanks very much in advance.

Thanks,
bin


Re: Why is this not optimized?

2014-05-14 Thread Bin.Cheng
On Wed, May 14, 2014 at 9:14 PM, Bingfeng Mei  wrote:
> Hi,
> I am looking at some code of our target, which is not optimized as expected. 
> For the following RTX, I expect source of insn 17 should be propagated into 
> insn 20, and insn 17 is eliminated as a result. On our target, it will become 
> a predicated xor instruction instead of two. Initially, I thought fwprop pass 
> should do this.
>
> (insn 17 16 18 3 (set (reg/v:HI 102 [ crc ])
> (xor:HI (reg/v:HI 108 [ crc ])
> (const_int 16386 [0x4002]))) coremark.c:1632 725 {xorhi3}
>  (nil))
> (insn 18 17 19 3 (set (reg:BI 113)
> (ne:BI (reg:QI 101 [ D.4446 ])
> (const_int 1 [0x1]))) 1397 {cmp_qimode}
>  (nil))
> (jump_insn 19 18 55 3 (set (pc)
> (if_then_else (ne (reg:BI 113)
> (const_int 0 [0]))
> (label_ref 23)
> (pc))) 1477 {cbranchbi4}
>  (expr_list:REG_DEAD (reg:BI 113)
> (expr_list:REG_BR_PROB (const_int 7100 [0x1bbc])
> (expr_list:REG_PRED_WIDTH (const_int 1 [0x1])
> (nil
>  -> 23)
> (note 55 19 20 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 20 55 23 4 (set (reg:HI 112 [ crc ])
> (reg/v:HI 102 [ crc ])) 502 {fp_movhi}
>  (expr_list:REG_DEAD (reg/v:HI 102 [ crc ])
> (nil)))
> (code_label 23 20 56 5 2 "" [1 uses])
>
>
> But it can't. First propagate_rtx_1 will return false because PR_CAN_APPEAR 
> is false and
> following code is executed.
>
>   if (x == old_rtx)
> {
>   *px = new_rtx;
>   return can_appear;
> }
>
> Even I forces PR_CAN_APPEAR to be set in flags, fwprop still won't go ahead in
> try_fwprpp_subst because old_cost is 0 (REG only rtx), and set_src_cost 
> (SET_SRC (set),
> speed) is bigger than 0. So the change is deemed as not profitable, which is 
> not correct
> IMO.
The fwprop pass is too conservative with respect to propagation
opportunities outside of memory references; it just gives up in many
places.  Also, as in your case, it seems not to take into account
that the original insn can be removed after the propagation.
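
Concretely, the form Bingfeng is after would be insn 20 rewritten as
(set (reg:HI 112) (xor:HI (reg:HI 108) (const_int 16386))), with insn
17 deleted once reg 102 has no remaining uses (my rendering of the
desired result, not actual fwprop output).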

Wei Mi once sent a patch re-implementing the fwprop pass at
https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00617.html .
I also did some experiments and worked out a local patch doing similar
work to handle cases exactly like yours.
The problem is that even though one instruction can be saved (as in
your case), it's not always a win, because the transformation tends to
generate more complex instructions, and such insns are more vulnerable
to pipeline hazards.  Unfortunately, it's kind of impossible for
fwprop to understand the pipeline risk.

Thanks,
bin
>
> If fwprop is not the place to do this optimization, where should it be done? 
> I am working on up-to-date GCC 4.8.
>
> Thanks,
> Bingfeng Mei



-- 
Best Regards.


[RFC]Better support for big-endian targets in GCC vectorizer

2014-05-27 Thread Bin.Cheng
Hi,

To attract more eyes, I should have used a scarier subject like "GCC's
vectorizer is heading in the wrong direction on big-endian targets".

The idea came from a simple vectorization case I ran into and a
discussion with Richard.  Given simple case like:

typedef signed short *__restrict__ pRSINT16;
void check_vector_S16 (pRSINT16 a, pRSINT16 b);
int foo (void)
{
  int i;
  signed char input1_S8[16] =
  { 127, 126, 125, 124, 123, 122, 121, 120,
119, 118, 117, 116, 115, 114, 113, 112};
  signed char input2_S8[16] =
  { 127, 125, 123, 121, 119, 117, 115, 113,
111, 109, 107, 105, 103, 101, 99,  97};
  signed short output_S16[16];
  signed short expected_S16[] =
  { 16129, 15750, 15375, 15004, 14637, 14274, 13915, 13560,
13209, 12862, 12519, 12180, 11845, 11514, 11187, 10864 };

  for (i=0; i<16; i++)
output_S16[i] = (signed short)input1_S8[i] * input2_S8[i];

  check_vector_S16 (expected_S16, output_S16);
  return 0;
}


It is vectorized at optimization level "O3", and the GIMPLE optimized
dump for aarch64 big-endian is like:

;; Function foo (foo, funcdef_no=0, decl_uid=2564, symbol_order=0)

foo ()
{
  vector(8) short int vect_patt_77.9;
  vector(16) signed char vect__54.8;
  vector(16) signed char vect__52.5;
  short int expected_S16[16];
  short int output_S16[16];
  signed char input2_S8[16];
  signed char input1_S8[16];

  :
  MEM[(signed char[16] *)&input1_S8] = { 127, 126, 125, 124, 123, 122,
121, 120, 119, 118, 117, 116, 115, 114, 113, 112 };
  MEM[(signed char[16] *)&input2_S8] = { 127, 125, 123, 121, 119, 117,
115, 113, 111, 109, 107, 105, 103, 101, 99, 97 };
  MEM[(short int *)&expected_S16] = { 16129, 15750, 15375, 15004,
14637, 14274, 13915, 13560 };
  MEM[(short int *)&expected_S16 + 16B] = { 13209, 12862, 12519,
12180, 11845, 11514, 11187, 10864 };
  vect__52.5_2 = MEM[(signed char[16] *)&input1_S8];
  vect__54.8_73 = MEM[(signed char[16] *)&input2_S8];
  vect_patt_77.9_72 = WIDEN_MULT_HI_EXPR <vect__52.5_2, vect__54.8_73>;
  vect_patt_77.9_71 = WIDEN_MULT_LO_EXPR <vect__52.5_2, vect__54.8_73>;
  MEM[(short int *)&output_S16] = vect_patt_77.9_72;
  MEM[(short int *)&output_S16 + 16B] = vect_patt_77.9_71;
  check_vector_S16 (&expected_S16, &output_S16);
  input1_S8 ={v} {CLOBBER};
  input2_S8 ={v} {CLOBBER};
  output_S16 ={v} {CLOBBER};
  expected_S16 ={v} {CLOBBER};
  return 0;

}


while dump for aarch64 little-endian is like:

;; Function foo (foo, funcdef_no=0, decl_uid=2564, symbol_order=0)

foo ()
{
  vector(8) short int vect_patt_77.9;
  vector(16) signed char vect__54.8;
  vector(16) signed char vect__52.5;
  short int expected_S16[16];
  short int output_S16[16];
  signed char input2_S8[16];
  signed char input1_S8[16];

  :
  MEM[(signed char[16] *)&input1_S8] = { 127, 126, 125, 124, 123, 122,
121, 120, 119, 118, 117, 116, 115, 114, 113, 112 };
  MEM[(signed char[16] *)&input2_S8] = { 127, 125, 123, 121, 119, 117,
115, 113, 111, 109, 107, 105, 103, 101, 99, 97 };
  MEM[(short int *)&expected_S16] = { 16129, 15750, 15375, 15004,
14637, 14274, 13915, 13560 };
  MEM[(short int *)&expected_S16 + 16B] = { 13209, 12862, 12519,
12180, 11845, 11514, 11187, 10864 };
  vect__52.5_2 = MEM[(signed char[16] *)&input1_S8];
  vect__54.8_73 = MEM[(signed char[16] *)&input2_S8];
  vect_patt_77.9_72 = WIDEN_MULT_LO_EXPR <vect__52.5_2, vect__54.8_73>;
  vect_patt_77.9_71 = WIDEN_MULT_HI_EXPR <vect__52.5_2, vect__54.8_73>;
  MEM[(short int *)&output_S16] = vect_patt_77.9_72;
  MEM[(short int *)&output_S16 + 16B] = vect_patt_77.9_71;
  check_vector_S16 (&expected_S16, &output_S16);
  input1_S8 ={v} {CLOBBER};
  input2_S8 ={v} {CLOBBER};
  output_S16 ={v} {CLOBBER};
  expected_S16 ={v} {CLOBBER};
  return 0;

}


It's clear that WIDEN_MULT_LO_EXPR/WIDEN_MULT_HI_EXPR are swapped
depending on big-endian/little-endian.  The corresponding code doing
the swap is shown below, from
tree-vect-stmts.c:supportable_widening_operation:

  if (BYTES_BIG_ENDIAN && c1 != VEC_WIDEN_MULT_EVEN_EXPR)
{
  enum tree_code ctmp = c1;
  c1 = c2;
  c2 = ctmp;
}


There are some other similar cases in the vectorizer, and all of them
look suspicious since, intuitively, the vectorizer should neither care
about target endianness nor do such shuffling.  Anyway, this is how we
do vectorization currently.
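
To make the HI/LO naming concrete (this is just my reading of the
dumps above): with the sixteen input elements numbered 0..15 in memory
order, WIDEN_MULT_LO_EXPR produces the widened products of elements
0..7 on little-endian, while on big-endian it is WIDEN_MULT_HI_EXPR
that covers elements 0..7; hence the c1/c2 swap in
supportable_widening_operation.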

Sticking to the test case, the insns below are expanded for the
WIDEN_MULT pair on big-endian:

;;WIDEN_MULT_HI_EXPR part
(insn:TI 39 50 16 (set (reg:V8HI 36 v4 [orig:106 vect_patt_77.9 ] [106])
(mult:V8HI (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 33 v1 [82])
(parallel:V16QI [
(const_int 8 [0x8])
(const_int 9 [0x9])
(const_int 10 [0xa])
(const_int 11 [0xb])
(const_int 12 [0xc])
(const_int 13 [0xd])
(const_int 14 [0xe])
(const_int 15 [0xf])
])))
(sign_extend:V8HI (vec_select:V8QI (reg:V16QI 32 v0 [87])
  

Help understand the may_be_zero field in loop niter information

2014-06-12 Thread Bin.Cheng
Hi,
I noticed the following code/comment about the may_be_zero field in
the loop niter desc:

  tree may_be_zero;     /* The boolean expression.  If it evaluates to true,
                           the loop will exit in the first iteration (i.e.
                           its latch will not be executed), even if the niter
                           field says otherwise.  */

I had difficulty understanding this because I ran into some cases
where it didn't seem to behave as described.

Example1, the dump of loop before sccp is like:

  :
  bnd_4 = len_3(D) + 1;

  :
  # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
  _6 = ivtmp_1 + len_3(D);
  _7 = a[ivtmp_1];
  _8 = b[ivtmp_1];
  _9 = _7 + _8;
  a[_6] = _9;
  ivtmp_11 = ivtmp_1 + 1;
  if (bnd_4 > ivtmp_11)
goto ;
  else
goto ;

  :
  goto ;

The loop niter information analyzed in sccp is like:

Analyzing # of iterations of loop 1
  exit condition [1, + , 1] < len_3(D) + 1
  bounds on difference of bases: -1 ... 4294967294
  result:
zero if len_3(D) == 4294967295
# of iterations len_3(D), bounded by 4294967294

Question 1: shouldn't it be like "len_3 + 1 <= 1", because the latch
won't be executed when "len_3 == 0"?

It seems that GCC deliberately skips the boundary case.  That's OK
here, since we can still get zero from the niter info
"bnd_4 - ivtmp_11" when "bnd_4 == 1" (i.e. len_3 == 0).

But when the boundary condition is the only case in which the latch is
executed zero times, the may_be_zero info is not computed.  See
Example 2, with the dump of the loop before sccp like:

foo (int M)

  :
  if (M_4(D) > 0)
goto ;
  else
goto ;

  :
  return;

  :

  :
  # i_13 = PHI <0(4), i_10(6)>
  _5 = i_13 + M_4(D);
  _6 = a[i_13];
  _7 = b[i_13];
  _8 = _6 + _7;
  a[_5] = _8;
  i_10 = i_13 + 1;
  if (M_4(D) > i_10)
goto ;
  else
goto ;

  :
  goto ;

The niter information analyzed in sccp is like:

Analyzing # of iterations of loop 1
  exit condition [1, + , 1](no_overflow) < M_4(D)
  bounds on difference of bases: 0 ... 2147483646
  result:
# of iterations (unsigned int) M_4(D) + 4294967295, bounded by 2147483646

So may_be_zero is always false here, but the latch may be executed
zero times when "M_4 == 1".

Starting from Example 1, we can create Example 3, which makes no
sense to me.  Again, the dump of the loop is like:

  :
  bnd_4 = len_3(D) + 1;

  :
  # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
  _6 = ivtmp_1 + len_3(D);
  _7 = a[ivtmp_1];
  _8 = b[ivtmp_1];
  _9 = _7 + _8;
  a[_6] = _9;
  ivtmp_11 = ivtmp_1 + 4;
  if (bnd_4 > ivtmp_11)
goto ;
  else
goto ;

  :
  goto ;

  :
  return 0;

The niter info is like:

Analyzing # of iterations of loop 1
  exit condition [4, + , 4] < len_3(D) + 1
  bounds on difference of bases: -4 ... 4294967291
  result:
under assumptions len_3(D) + 1 <= 4294967292
zero if len_3(D) == 4294967295
# of iterations len_3(D) / 4, bounded by 1073741823

The problem is: won't the latch be executed zero times when "len_3 == 0/1/2/3"?

With these examples, my first feeling is that may_be_zero is computed
wrongly, but searching the code shows that may_be_zero is mostly used
by IVOPT.  Considering IVOPT works well with it now, I am wondering
whether this is the designed behavior?

Can anyone help?  Thanks very much in advance.

Thanks,
bin
-- 
Best Regards.


Re: Help understand the may_be_zero field in loop niter information

2014-06-13 Thread Bin.Cheng
On Thu, Jun 12, 2014 at 7:59 PM, Zdenek Dvorak  wrote:
> Hi,
>
>> > I noticed there is below code/comments about may_be_zero field in loop
>> > niter desc:
>> >
>> >   tree may_be_zero;/* The boolean expression.  If it evaluates to true,
>> >the loop will exit in the first iteration (i.e.
>> >its latch will not be executed), even if the niter
>> >field says otherwise.  */
>> >
>> > I had difficulty in understanding this because I ran into some cases
>> > in which it didn't behave as said.
>
> actually, in all the examples below, the field behaves as described,
> i.e.,
>
> the number of iterations = may_be_zero ? 0 : niter;
>
> In particular, the fact that may_be_zero is false *does not imply*
> that the number of iterations as described by niter is non-zero.
>
>> > Example1, the dump of loop before sccp is like:
>> >
>> >   :
>> >   bnd_4 = len_3(D) + 1;
>> >
>> >   :
>> >   # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
>> >   _6 = ivtmp_1 + len_3(D);
>> >   _7 = a[ivtmp_1];
>> >   _8 = b[ivtmp_1];
>> >   _9 = _7 + _8;
>> >   a[_6] = _9;
>> >   ivtmp_11 = ivtmp_1 + 1;
>> >   if (bnd_4 > ivtmp_11)
>> > goto ;
>> >   else
>> > goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> > The loop niter information analyzed in sccp is like:
>> >
>> > Analyzing # of iterations of loop 1
>> >   exit condition [1, + , 1] < len_3(D) + 1
>> >   bounds on difference of bases: -1 ... 4294967294
>> >   result:
>> > zero if len_3(D) == 4294967295
>> > # of iterations len_3(D), bounded by 4294967294
>> >
>> > Question 1, shouldn't it be like "len_3 +1 <= 1" because the latch
>> > won't be executed when "len_3 == 0", right?
>
> the analysis determines the number of iterations as len_3, that is
> 0 if len_3 == 0.  So, the information is computed correctly here.
>
>> > But when boundary condition is the only case that latch get ZERO
>> > executed, the may_be_zero info will not be computed.  See example2,
>> > with dump of loop before sccp like:
>> >
>> > foo (int M)
>> >
>> >   :
>> >   if (M_4(D) > 0)
>> > goto ;
>> >   else
>> > goto ;
>> >
>> >   :
>> >   return;
>> >
>> >   :
>> >
>> >   :
>> >   # i_13 = PHI <0(4), i_10(6)>
>> >   _5 = i_13 + M_4(D);
>> >   _6 = a[i_13];
>> >   _7 = b[i_13];
>> >   _8 = _6 + _7;
>> >   a[_5] = _8;
>> >   i_10 = i_13 + 1;
>> >   if (M_4(D) > i_10)
>> > goto ;
>> >   else
>> > goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> > The niter information analyzed in sccp is like:
>> >
>> > Analyzing # of iterations of loop 1
>> >   exit condition [1, + , 1](no_overflow) < M_4(D)
>> >   bounds on difference of bases: 0 ... 2147483646
>> >   result:
>> > # of iterations (unsigned int) M_4(D) + 4294967295, bounded by 
>> > 2147483646
>> >
>> > So may_be_zero is always false here, but the latch may be ZERO
>> > executed when "M_4 == 1".
>
> Again, this is correct, since then ((unsigned int) M_4) + 4294967295 == 0.
>
>> > Start from Example1, we can create Example3 which makes no sense to
>> > me.  Again, the dump of loop is like:
>> >
>> >   :
>> >   bnd_4 = len_3(D) + 1;
>> >
>> >   :
>> >   # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
>> >   _6 = ivtmp_1 + len_3(D);
>> >   _7 = a[ivtmp_1];
>> >   _8 = b[ivtmp_1];
>> >   _9 = _7 + _8;
>> >   a[_6] = _9;
>> >   ivtmp_11 = ivtmp_1 + 4;
>> >   if (bnd_4 > ivtmp_11)
>> > goto ;
>> >   else
>> > goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> >   :
>> >   return 0;
>> >
>> > The niter info is like:
>> >
>> > Analyzing # of iterations of loop 1
>> >   exit condition [4, + , 4] < len_3(D) + 1
>> >   bounds on difference of bases: -4 ... 4294967291
>> >   result:
>> > under assumptions len_3(D) + 1 <= 4294967292
>> > zero if len_3(D) == 4294967295
>> > # of iterations len_3(D) / 4, bounded by 1073741823
>> >
>> > The problem is: won't latch be ZERO executed when "len_3 == 0/1/2/3"?
>
> Again, in all these cases the number of iterations is len_3 / 4 == 0.

Thanks, I realized that the boundary condition is described by the
niter part of the information, and it's more convenient to handle it
this way for consumers of may_be_zero like IVOPT.  One problem shows
up with the example below:

  :
  vectp_a.8_57 = &a;
  vectp_b.11_61 = &b;
  _67 = _1 * 4;
  vectp_a.15_66 = &a + _67;

  :
  # vectp_a.7_58 = PHI <vectp_a.8_57(6), vectp_a.7_59(14)>
  # vectp_b.10_62 = PHI <vectp_b.11_61(6), vectp_b.10_63(14)>
  # vectp_a.14_68 = PHI <vectp_a.15_66(6), vectp_a.14_69(14)>
  # ivtmp_9 = PHI <0(6), ivtmp_71(14)>
  vect__6.9_60 = MEM[(float *)vectp_a.7_58];
  vect__7.12_64 = MEM[(float *)vectp_b.10_62];
  vect__8.13_65 = vect__6.9_60 + vect__7.12_64;
  MEM[(float *)vectp_a.14_68] = vect__8.13_65;
  vectp_a.7_59 = vectp_a.7_58 + 16;
  vectp_b.10_63 = vectp_b.10_62 + 16;
  vectp_a.14_69 = vectp_a.14_68 + 16;
  ivtmp_71 = ivtmp_9 + 1;
  if (ivtmp_71 < bnd.4_36)
goto ;
  else
goto ;

  :
  goto ;

After ivopt:

  :
  vectp_a.8_57 = &a;
  vectp_b.11_61 = &b;
  _67 = _1 * 4;
  vectp_a.15_66 = &a + _67;

  :
  # ivtmp.24_26 = PHI <0(6), ivtmp.24_33(14)>
  # ivtmp.27_88 = PHI <0(6), ivtmp.27_89(14)>
  vect__6.9_60 = MEM[symbol: a, index: ivt

Re: [PATCH] gcc/dwarf2asm.c: Add static_output_delta() with var_list for dw2_asm_output_delta()

2014-06-15 Thread Bin.Cheng
On Sat, Jun 14, 2014 at 5:50 PM, Chen Gang  wrote:
> On 06/14/2014 05:45 PM, Chen Gang wrote:
>> dw2_asm_output_vms_delta() calls dw2_asm_output_delta() in an abnormal
>> way, so need add a new function just like vprintf() for printf(), and
>> then the related call will be in normal way.
>>
>> The related warning:
>>
>>   ../../gcc/gcc/dwarf2asm.c: In function 'void dw2_asm_output_vms_delta(int, 
>> const char*, const char*, const char*, ...)':
>>   ../../gcc/gcc/dwarf2asm.c:167:50: warning: format not a string literal and 
>> no format arguments [-Wformat-security]
>>
>>
>> Signed-off-by: Chen Gang 
>> ---
>>  gcc/ChangeLog   |  6 ++
>>  gcc/dwarf2asm.c | 23 +++
>>  2 files changed, 21 insertions(+), 8 deletions(-)
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index edb3fc0..0e1df0e 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,9 @@
>> +2014-06-14  Chen Gang 
>> +
>> + * dwarf2asm.c (dw2_asm_output_delta): Add static_output_delta() for
>> + dw2_asm_output_delta(), just like vprintf() for printf(), so that
>> + dw2_asm_output_vms_delta() can use it in normal way.
>> +
>>  2014-06-13  Vladimir Makarov  
>>
>>   * lra-assign.c (assign_by_spills): Add code to assign vector regs
>> diff --git a/gcc/dwarf2asm.c b/gcc/dwarf2asm.c
>> index 1372b23..69733b3 100644
>> --- a/gcc/dwarf2asm.c
>> +++ b/gcc/dwarf2asm.c
>> @@ -123,14 +123,10 @@ dw2_asm_output_data (int size, unsigned HOST_WIDE_INT 
>> value,
>> impossible to do here, since the ASM_SET_OP for the difference
>> symbol must appear after both symbols are defined.  */
>>
>> -void
>> -dw2_asm_output_delta (int size, const char *lab1, const char *lab2,
>> -   const char *comment, ...)
>> +static void
>> +static_output_delta (int size, const char *lab1, const char *lab2,
>> +   const char *comment, va_list ap)
>>  {
>> -  va_list ap;
>> -
>> -  va_start (ap, comment);
>> -
>>  #ifdef ASM_OUTPUT_DWARF_DELTA
>>ASM_OUTPUT_DWARF_DELTA (asm_out_file, size, lab1, lab2);
>>  #else
>> @@ -145,7 +141,18 @@ dw2_asm_output_delta (int size, const char *lab1, const 
>> char *lab2,
>>vfprintf (asm_out_file, comment, ap);
>>  }
>>fputc ('\n', asm_out_file);
>> +}
>> +
>>
>> +
>
> Oh, sorry, it contents 2 new redundant '\n', if necessary to send patch
> v2 for it, please let me know.
>
>
> Thanks.
> --
> Chen Gang
>
> Open, share, and attitude like air, water, and life which God blessed

Hi,

Please send patches to the gcc-patches mailing list.

Thanks,
bin


-- 
Best Regards.


Re: Help understand the may_be_zero field in loop niter information

2014-06-16 Thread Bin.Cheng
On Thu, Jun 12, 2014 at 7:59 PM, Zdenek Dvorak  wrote:
> Hi,
>
>> > I noticed there is below code/comments about may_be_zero field in loop
>> > niter desc:
>> >
>> >   tree may_be_zero;/* The boolean expression.  If it evaluates to true,
>> >the loop will exit in the first iteration (i.e.
>> >its latch will not be executed), even if the niter
>> >field says otherwise.  */
>> >
>> > I had difficulty in understanding this because I ran into some cases
>> > in which it didn't behave as said.
>
> actually, in all the examples below, the field behaves as described,
> i.e.,
>
> the number of iterations = may_be_zero ? 0 : niter;
>
> In particular, the fact that may_be_zero is false *does not imply*
> that the number of iterations as described by niter is non-zero.
>
>> > Example1, the dump of loop before sccp is like:
>> >
>> >   :
>> >   bnd_4 = len_3(D) + 1;
>> >
>> >   :
>> >   # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
>> >   _6 = ivtmp_1 + len_3(D);
>> >   _7 = a[ivtmp_1];
>> >   _8 = b[ivtmp_1];
>> >   _9 = _7 + _8;
>> >   a[_6] = _9;
>> >   ivtmp_11 = ivtmp_1 + 1;
>> >   if (bnd_4 > ivtmp_11)
>> > goto ;
>> >   else
>> > goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> > The loop niter information analyzed in sccp is like:
>> >
>> > Analyzing # of iterations of loop 1
>> >   exit condition [1, + , 1] < len_3(D) + 1
>> >   bounds on difference of bases: -1 ... 4294967294
>> >   result:
>> > zero if len_3(D) == 4294967295
>> > # of iterations len_3(D), bounded by 4294967294
>> >
>> > Question 1, shouldn't it be like "len_3 +1 <= 1" because the latch
>> > won't be executed when "len_3 == 0", right?
>
> the analysis determines the number of iterations as len_3, that is
> 0 if len_3 == 0.  So, the information is computed correctly here.
>
>> > But when boundary condition is the only case that latch get ZERO
>> > executed, the may_be_zero info will not be computed.  See example2,
>> > with dump of loop before sccp like:
>> >
>> > foo (int M)
>> >
>> >   :
>> >   if (M_4(D) > 0)
>> > goto ;
>> >   else
>> > goto ;
>> >
>> >   :
>> >   return;
>> >
>> >   :
>> >
>> >   :
>> >   # i_13 = PHI <0(4), i_10(6)>
>> >   _5 = i_13 + M_4(D);
>> >   _6 = a[i_13];
>> >   _7 = b[i_13];
>> >   _8 = _6 + _7;
>> >   a[_5] = _8;
>> >   i_10 = i_13 + 1;
>> >   if (M_4(D) > i_10)
>> > goto ;
>> >   else
>> > goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> > The niter information analyzed in sccp is like:
>> >
>> > Analyzing # of iterations of loop 1
>> >   exit condition [1, + , 1](no_overflow) < M_4(D)
>> >   bounds on difference of bases: 0 ... 2147483646
>> >   result:
>> > # of iterations (unsigned int) M_4(D) + 4294967295, bounded by 
>> > 2147483646
>> >
>> > So may_be_zero is always false here, but the latch may be ZERO
>> > executed when "M_4 == 1".
>
> Again, this is correct, since then ((unsigned int) M_4) + 4294967295 == 0.
>
>> > Start from Example1, we can create Example3 which makes no sense to
>> > me.  Again, the dump of loop is like:
>> >
>> >   :
>> >   bnd_4 = len_3(D) + 1;
>> >
>> >   :
>> >   # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
>> >   _6 = ivtmp_1 + len_3(D);
>> >   _7 = a[ivtmp_1];
>> >   _8 = b[ivtmp_1];
>> >   _9 = _7 + _8;
>> >   a[_6] = _9;
>> >   ivtmp_11 = ivtmp_1 + 4;
>> >   if (bnd_4 > ivtmp_11)
>> > goto ;
>> >   else
>> > goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> >   :
>> >   return 0;
>> >
>> > The niter info is like:
>> >
>> > Analyzing # of iterations of loop 1
>> >   exit condition [4, + , 4] < len_3(D) + 1
>> >   bounds on difference of bases: -4 ... 4294967291
>> >   result:
>> > under assumptions len_3(D) + 1 <= 4294967292
>> > zero if len_3(D) == 4294967295
>> > # of iterations len_3(D) / 4, bounded by 1073741823
>> >
>> > The problem is: won't latch be ZERO executed when "len_3 == 0/1/2/3"?
>
> Again, in all these cases the number of iterations is len_3 / 4 == 0.
>
> Zdenek

Hi Zdenek, I spent some more time pondering over this, and I think I
understand (at least one) motivation for why may_be_zero acts as it
does now.  At least for IVOPT, the boundary condition under which the
loop latch is not executed doesn't need to be handled specially when
trying to eliminate condition iv uses.

So I am wondering whether it's OK for me to send a documentation patch
describing how it works, since it was a little confusing for me at
first glance.
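
Something along these lines is what I'd propose (draft wording only):

  tree may_be_zero;     /* A boolean expression.  If it evaluates to
                           true, the latch executes zero times even if
                           the NITER field says otherwise.  If it
                           evaluates to false, the latch executes NITER
                           times, and NITER itself may still evaluate
                           to zero.  In other words, the number of
                           latch executions is may_be_zero ? 0 : niter;
                           may_be_zero being false does not imply that
                           niter is non-zero.  */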

Thanks,
bin



-- 
Best Regards.


Re: regs_used estimation in IVOPTS seriously flawed

2014-06-19 Thread Bin.Cheng
On Tue, Jun 17, 2014 at 10:59 PM, Bingfeng Mei  wrote:
> Hi,
> I am looking at a performance regression in our code. A big loop produces
> and uses a lot of temporary variables inside the loop body. The problem
> appears that IVOPTS pass creates even more induction variables (from original
> 2 to 27). It causes a lot of register spilling later and performance
Do you have a simplified case which can be posted here?  I guess it
affects some other targets too.

> take a severe hit. I looked into tree-ssa-loop-ivopts.c, it does call
> estimate_reg_pressure_cost function to take # of registers into
> consideration. The second parameter passed as data->regs_used is supposed
> to represent old register usage before IVOPTS.
>
>   return size + estimate_reg_pressure_cost (size, data->regs_used, 
> data->speed,
> data->body_includes_call);
>
> In this case, it is mere 2 by following calculation. Essentially, it only 
> counts
> all loop invariant registers, ignoring all registers produced/used inside the 
> loop.
There are two kinds of registers produced/used inside the loop.  One
kind is irrelevant to induction variables; it includes the non-linear
uses mentioned by Richard.  The other kind relates to induction
variable rewriting, and one issue with this kind is that the
expressions generated when rewriting iv uses don't reflect what IVOPT
estimated very well.

Thanks,
bin
>
>   n = 0;
>   for (psi = gsi_start_phis (loop->header); !gsi_end_p (psi); gsi_next (&psi))
> {
>   phi = gsi_stmt (psi);
>   op = PHI_RESULT (phi);
>
>   if (virtual_operand_p (op))
> continue;
>
>   if (get_iv (data, op))
> continue;
>
>   n++;
> }
>
>   EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, j, bi)
> {
>   struct version_info *info = ver_info (data, j);
>
>   if (info->inv_id && info->has_nonlin_use)
> n++;
> }
>
>   data->regs_used = n;
>
> I believe how regs_used is calculated is seriously flawed,
> or estimate_reg_pressure_cost is problematic if n_old is
> only supposed to be loop invariant registers. Either way,
> it affects how IVOPTS makes decision and could result in
> worse code. What do you think? Any idea on how to improve
> this?
>
>
> Thanks,
> Bingfeng
>



-- 
Best Regards.


Re: regs_used estimation in IVOPTS seriously flawed

2014-06-20 Thread Bin.Cheng
On Fri, Jun 20, 2014 at 5:01 PM, Bingfeng Mei  wrote:
>
>
>> -Original Message-----
>> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
>> Sent: 20 June 2014 06:25
>> To: Bingfeng Mei
>> Cc: gcc@gcc.gnu.org
>> Subject: Re: regs_used estimation in IVOPTS seriously flawed
>>
>> On Tue, Jun 17, 2014 at 10:59 PM, Bingfeng Mei  wrote:
>> > Hi,
>> > I am looking at a performance regression in our code. A big loop
>> produces
>> > and uses a lot of temporary variables inside the loop body. The
>> problem
>> > appears that IVOPTS pass creates even more induction variables (from
>> original
>> > 2 to 27). It causes a lot of register spilling later and performance
>> Do you have a simplified case which can be posted here?  I guess it
>> affects some other targets too.
>>
>> > take a severe hit. I looked into tree-ssa-loop-ivopts.c, it does call
>> > estimate_reg_pressure_cost function to take # of registers into
>> > consideration. The second parameter passed as data->regs_used is
>> supposed
>> > to represent old register usage before IVOPTS.
>> >
>> >   return size + estimate_reg_pressure_cost (size, data->regs_used,
>> data->speed,
>> > data->body_includes_call);
>> >
>> > In this case, it is mere 2 by following calculation. Essentially, it
>> only counts
>> > all loop invariant registers, ignoring all registers produced/used
>> inside the loop.
>> There are two kinds of registers produced/used inside the loop.  One
>> is induction variable irrelevant, it includes non-linear uses as
>> mentioned by Richard.  The other kind relates to induction variable
>> rewrite, and one issue with this kind is expression generated during
>> iv use rewriting is not reflecting the estimated one in ivopt very
>> well.
>>
>
> As a short term solution, I tried some simple non-linear functions as
Richard suggested

Oh, I misread "non-linear" as referring to non-linear iv uses.

> to penalize using too many IVs. For example, the following cost in
> ivopts_global_cost_for_size fixed my regression and actually improves 
> performance
> slightly over a set of benchmarks we usually use.

Great, I will try to tweak it on ARM.

>
>   return size * (1 + size * 0.2)
>   + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
>
> data->body_includes_call);
>
> The trouble is choice of this non-linear function could be highly target 
> dependent
> (# of registers?). I don't have setup to prove performance gain for other 
> targets.
>
> I also tried counting all SSA names and divide it by a factor. It does seem 
> to work

So the number currently computed is a lower bound that is too small.
Maybe it's possible to do some relatively cheap analysis to increase
the number somehow, while on the other hand not restricting IVOPT for
loops with low register pressure.
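
For instance, something cheap like the following might give a better
lower bound (untested sketch; the divisor is only a guess to discount
short-lived names):

  /* Count SSA names defined inside the loop body as a rough proxy
     for register pressure.  */
  static unsigned
  estimate_body_regs (struct loop *loop)
  {
    basic_block *body = get_loop_body (loop);
    unsigned i, n = 0;
    for (i = 0; i < loop->num_nodes; i++)
      for (gimple_stmt_iterator gsi = gsi_start_bb (body[i]);
           !gsi_end_p (gsi); gsi_next (&gsi))
        {
          tree lhs = gimple_get_lhs (gsi_stmt (gsi));
          if (lhs && TREE_CODE (lhs) == SSA_NAME)
            n++;
        }
    free (body);
    return n / 3;
  }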

Thanks,
bin

> so well.
>
> Long term, if we have infrastructure to analyze maximal live variable in a 
> loop
> at tree-level, that would be great for many loop optimizations.
>
> Thanks,
> Bingfeng



-- 
Best Regards.


Re: Comparison of GCC-4.9 and LLVM-3.4 performance on SPECInt2000 for x86-64 and ARM

2014-06-25 Thread Bin.Cheng
On Wed, Jun 25, 2014 at 5:26 PM, Bingfeng Mei  wrote:
> Thanks for nice benchmarks. Vladimir.
>
> Why is GCC code size so much bigger than LLVM? Does -Ofast have more unrolling
On the contrary, I don't think RTL unrolling is enabled by default in
GCC at O3/Ofast.  There is no unroll dump file at all unless
-funroll-loops/-funroll-all-loops is explicitly specified.

Thanks,
bin

> on GCC? It doesn't seem increasing code size help performance (164.gzip & 
> 197.parser)
> Is there comparisons for O2? I guess that is more useful for typical
> mobile/embedded programmers.
>
> Bingfeng
>
>> -Original Message-
>> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
>> Vladimir Makarov
>> Sent: 24 June 2014 16:07
>> To: Ramana Radhakrishnan; gcc.gcc.gnu.org
>> Subject: Re: Comparison of GCC-4.9 and LLVM-3.4 performance on
>> SPECInt2000 for x86-64 and ARM
>>
>> On 06/24/2014 10:57 AM, Ramana Radhakrishnan wrote:
>> >
>> > The ball-park number you have probably won't change much.
>> >
>> >>>
>> >> Unfortunately, that is the configuration I can use on my system
>> because
>> >> of lack of libraries for other configurations.
>> >
>> > Using --with-fpu={neon / neon-vfpv4} shouldn't cause you ABI issues
>> > with libraries for any other configurations. neon / neon-vfpv4 enable
>> > use of the neon unit in a manner that is ABI compatible with the rest
>> > of the system.
>> >
>> > For more on command line options for AArch32 and how they map to
>> > various CPU's you might find this blog interesting.
>> >
>> > http://community.arm.com/groups/tools/blog/2013/04/15/arm-cortex-a-
>> processors-and-gcc-command-lines
>> >
>> >
>> >>
>> >> I don't think Neon can improve score for SPECInt2000 significantly
>> but
>> >> may be I am wrong.
>> >
>> > It won't probably improve the overall score by a large amount but some
>> > individual benchmarks will get some help.
>> >
>> There are some few benchmarks which benefit from autovectorization (eon
>> particularly).
>> >>> Did you add any other architecture specific options to your SPEC2k
>> >>> runs ?
>> >>>
>> >>>
>> >> No.  The only options I used are -Ofast.
>> >>
>> >> Could you recommend me what best options you think I should use for
>> this
>> >> processor.
>> >>
>> >
>> > I would personally use --with-cpu=cortex-a15 --with-fpu=neon-vfpv4
>> > --with-float=hard on this processor as that maps with the processor
>> > available on that particular piece of Silicon.
>> Thanks, Ramana.  Next time, I'll try these options.
>> >
>> > Also given it's a big LITTLE system with probably kernel switching -
>> > it may be better to also make sure that you are always running on the
>> > big core.
>> >
>> The results are pretty stable.  Also this version of Fedora does not
>> implement switching from Big to Little processors.
>


Re: Comparison of GCC-4.9 and LLVM-3.4 performance on SPECInt2000 for x86-64 and ARM

2014-06-25 Thread Bin.Cheng
On Wed, Jun 25, 2014 at 5:47 PM, Bin.Cheng  wrote:
> On Wed, Jun 25, 2014 at 5:26 PM, Bingfeng Mei  wrote:
>> Thanks for nice benchmarks. Vladimir.
>>
>> Why is GCC code size so much bigger than LLVM? Does -Ofast have more 
>> unrolling
> On the contrary, I don't think rtl unrolling is enabled by default on
> GCC with level O3/Ofast. There is no unroll dump file at all unless
> -funroll-loops/-funroll-all-loops is explicitly specified.
To clarify, I did see cases in which GCC's RTL unroller is more
aggressive than LLVM's once the option is explicitly specified.
>
> Thanks,
> bin
>
>> on GCC? It doesn't seem increasing code size help performance (164.gzip & 
>> 197.parser)
>> Is there comparisons for O2? I guess that is more useful for typical
>> mobile/embedded programmers.
>>
>> Bingfeng
>>
>>> -Original Message-
>>> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
>>> Vladimir Makarov
>>> Sent: 24 June 2014 16:07
>>> To: Ramana Radhakrishnan; gcc.gcc.gnu.org
>>> Subject: Re: Comparison of GCC-4.9 and LLVM-3.4 performance on
>>> SPECInt2000 for x86-64 and ARM
>>>
>>> On 06/24/2014 10:57 AM, Ramana Radhakrishnan wrote:
>>> >
>>> > The ball-park number you have probably won't change much.
>>> >
>>> >>>
>>> >> Unfortunately, that is the configuration I can use on my system
>>> because
>>> >> of lack of libraries for other configurations.
>>> >
>>> > Using --with-fpu={neon / neon-vfpv4} shouldn't cause you ABI issues
>>> > with libraries for any other configurations. neon / neon-vfpv4 enable
>>> > use of the neon unit in a manner that is ABI compatible with the rest
>>> > of the system.
>>> >
>>> > For more on command line options for AArch32 and how they map to
>>> > various CPU's you might find this blog interesting.
>>> >
>>> > http://community.arm.com/groups/tools/blog/2013/04/15/arm-cortex-a-
>>> processors-and-gcc-command-lines
>>> >
>>> >
>>> >>
>>> >> I don't think Neon can improve score for SPECInt2000 significantly
>>> but
>>> >> may be I am wrong.
>>> >
>>> > It won't probably improve the overall score by a large amount but some
>>> > individual benchmarks will get some help.
>>> >
>>> There are a few benchmarks which benefit from autovectorization (eon
>>> particularly).
>>> >>> Did you add any other architecture specific options to your SPEC2k
>>> >>> runs ?
>>> >>>
>>> >>>
>>> >> No.  The only options I used are -Ofast.
>>> >>
>>> >> Could you recommend me what best options you think I should use for
>>> this
>>> >> processor.
>>> >>
>>> >
>>> > I would personally use --with-cpu=cortex-a15 --with-fpu=neon-vfpv4
>>> > --with-float=hard on this processor as that maps with the processor
>>> > available on that particular piece of Silicon.
>>> Thanks, Ramana.  Next time, I'll try these options.
>>> >
>>> > Also given it's a big LITTLE system with probably kernel switching -
>>> > it may be better to also make sure that you are always running on the
>>> > big core.
>>> >
>>> The results are pretty stable.  Also this version of Fedora does not
>>> implement switching from Big to Little processors.
>>


Question about GCC's standard dependent optimization

2014-06-26 Thread Bin.Cheng
Hi,
I ran into PR60947, in which GCC understands that the return value of
memset is the first argument passed in, according to the standard, and
then does an optimization like the one below:
mov     ip, sp
stmfd   sp!, {r4, r5, r6, r7, r8, r9, r10, fp, ip, lr, pc}
sub     fp, ip, #4
sub     sp, sp, #20
ldr     r8, [r0, #112]
add     r3, r8, #232
add     r4, r8, #328
.L1064:
mov     r0, r3
mov     r1, #255
mov     r2, #8
bl      memset
add     r3, r0, #32   <---- X

Re: Question about GCC's standard dependent optimization

2014-06-26 Thread Bin.Cheng
Thanks for elaborating.

On Thu, Jun 26, 2014 at 5:18 PM, Richard Biener
 wrote:
> On Thu, Jun 26, 2014 at 10:44 AM, Bin.Cheng  wrote:
>> Hi,
>> I ran into PR60947, in which GCC understands the return value of
>> memset is the first argument passed in, according to standard, then
>> does optimization like below:
>> mov     ip, sp
>> stmfd   sp!, {r4, r5, r6, r7, r8, r9, r10, fp, ip, lr, pc}
>> sub     fp, ip, #4
>> sub     sp, sp, #20
>> ldr     r8, [r0, #112]
>> add     r3, r8, #232
>> add     r4, r8, #328
>> .L1064:
>> mov     r0, r3
>> mov     r1, #255
>> mov     r2, #8
>> bl      memset
>> add     r3, r0, #32   <---- X
>> cmp     r3, r4
>> bne     .L1064
>>
>> For X insn, GCC takes advantage of standard by using the returned r0 
>> directly.
>>
>> My question is, is it always safe for GCC to do such optimization?  Do
>
> Yes, I think so.
>
>> we have an option to disable such standard dependent optimization?
>
> -fno-builtin / -ffreestanding, but for memset/memcpy/memove that are also
> generated by GCC that won't help.
>
>> BTW, the sample should be further optimized into below (with Y redundant 
>> now):
>>
>> mov     ip, sp
>> stmfd   sp!, {r4, r5, r6, r7, r8, r9, r10, fp, ip, lr, pc}
>> sub     fp, ip, #4
>> sub     sp, sp, #20
>> ldr     r8, [r0, #112]
>> add     r0, r8, #232
>> add     r4, r8, #328
>> .L1064:
>> mov     r0, r0   <-- Y
>> mov     r1, #255
>> mov     r2, #8
>> bl      memset
>> add     r0, r0, #32
>> cmp     r0, r4
>> bne     .L1064
>
> Probably because the feature was not integrated into the register allocator
> (telling it that r3 can be materialized from r0) but in some other way.
I agree; it's IRA that still doesn't understand this and thus allocates
the pseudo register to r3 rather than r0.

Thanks,
bin
>
> Richard.
>
>>
>> Thanks,
>> bin


Re: Question about GCC's standard dependent optimization

2014-06-26 Thread Bin.Cheng
On Fri, Jun 27, 2014 at 4:13 AM, Jeff Law  wrote:
> On 06/26/14 02:44, Bin.Cheng wrote:
>>
>> Hi,
>> I ran into PR60947, in which GCC understands the return value of
>> memset is the first argument passed in, according to standard, then
>> does optimization like below:
>>  mov     ip, sp
>>  stmfd   sp!, {r4, r5, r6, r7, r8, r9, r10, fp, ip, lr, pc}
>>  sub     fp, ip, #4
>>  sub     sp, sp, #20
>>  ldr     r8, [r0, #112]
>>  add     r3, r8, #232
>>  add     r4, r8, #328
>> .L1064:
>>  mov     r0, r3
>>  mov     r1, #255
>>  mov     r2, #8
>>  bl      memset
>>  add     r3, r0, #32   <---- X
>>  cmp     r3, r4
>>  bne     .L1064
>>
>> For X insn, GCC takes advantage of standard by using the returned r0
>> directly.
>>
>> My question is, is it always safe for GCC to do such optimization?  Do
>> we have an option to disable such standard dependent optimization?
>
> Others have already answered this question.
>
> FWIW, I just locally added the capability to track equivalences between the
> destination argument to the various mem* str* functions and their return
> value in DOM.  It triggers, but not terribly often.  I'll be looking to see
> if the additional equivalences actually enable any optimizations before
> going through the full bootstrap and test.
>
> It doesn't help your testcase though since optimizer weakness you're showing
> is much later in the optimization pipeline.  I wonder if you could model
> this as an implicit copy in IRA and try to encourage IRA to tie the hard reg
> output from memset to the pseudo.
Thanks for pointing me in that direction.  I shall have a look at it.

Thanks,
bin
>
>
> jeff
>
>


Re: Question about GCC's standard dependent optimization

2014-07-01 Thread Bin.Cheng
On Mon, Jun 30, 2014 at 5:42 PM, Jeff Law  wrote:
> On 06/26/14 14:13, Jeff Law wrote:
>>
>> On 06/26/14 02:44, Bin.Cheng wrote:
>>>
>>> Hi,
>>> I ran into PR60947, in which GCC understands the return value of
>>> memset is the first argument passed in, according to standard, then
>>> does optimization like below:
>>>  mov     ip, sp
>>>  stmfd   sp!, {r4, r5, r6, r7, r8, r9, r10, fp, ip, lr, pc}
>>>  sub     fp, ip, #4
>>>  sub     sp, sp, #20
>>>  ldr     r8, [r0, #112]
>>>  add     r3, r8, #232
>>>  add     r4, r8, #328
>>> .L1064:
>>>  mov     r0, r3
>>>  mov     r1, #255
>>>  mov     r2, #8
>>>  bl      memset
>>>  add     r3, r0, #32   <---- X
>>>  cmp     r3, r4
>>>  bne     .L1064
>>>
>>> For X insn, GCC takes advantage of standard by using the returned r0
>>> directly.
>>>
>>> My question is, is it always safe for GCC to do such optimization?  Do
>>> we have an option to disable such standard dependent optimization?
>>
>> Others have already answered this question.
>>
>> FWIW, I just locally added the capability to track equivalences between
>> the destination argument to the various mem* str* functions and their
>> return value in DOM.  It triggers, but not terribly often.  I'll be
>> looking to see if the additional equivalences actually enable any
>> optimizations before going through the full bootstrap and test.
>
> Just as a follow-up.  This turns out to be a relatively bad idea as it gets
> in the way of tail call optimizations.
>
> Probably the only place where this is going to really be useful is in the
> allocators to allow us to cheaply rematerialize values and/or tie together
> two values that normally wouldn't be seen as related to each other.

Also, it restricts the inlining of string operation functions at expand
time.  Once we reuse the return value, the inlined sequence needs to
compute that return value as well.  I don't know if it will break any
target's expand/inline now, but it surely increases the cost of the
inlined code.

Thanks,
bin
>
> Jeff
>


Why is unsigned type introduced in a simple case?

2014-07-14 Thread Bin.Cheng
Hi,
For a simple example like below.

int
f1 (int p, short i, short n)
{
  int sum = 0;

  do
{
  sum += i;
  i++;
}
  while (i < n);

  return sum;
}

When compiling with the -O2 -fdump-tree-all options, GCC introduces an
unsigned type at the very beginning of the gimple passes; for example,
the dump for the gimple pass looks like below.
f1 (int p, short int i, short int n)
{
  int D.4116;
  short int i.0;
  unsigned short i.1;
  unsigned short D.4119;
  int D.4120;
  int sum;

  sum = 0;
  :
  D.4116 = (int) i;
  sum = D.4116 + sum;
  i.0 = i;
  i.1 = (unsigned short) i.0;
  D.4119 = i.1 + 1;
  i = (short int) D.4119;
  if (i < n) goto ; else goto ;
  :
  D.4120 = sum;
  return D.4120;
}

It uses i.1 to increment the induction variable and converts it back to
the signed type for comparison.  Is this designed behavior, and why?
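
For reference, my reading of it (an educated guess based on the C front
end's shortening of arithmetic, not something stated in the dump): the
addition on a short is narrowed from int back to 16 bits, and the
narrowed operation must be done in the unsigned type because overflow of
the narrowed signed type would be undefined.  A source-level sketch of
the equivalent rewrite:

/* What the front end effectively emits for "i++" on a short: the
   increment is performed in unsigned short so the wrap-around is well
   defined, then the result is converted back (a sketch).  */
short inc (short i)
{
  return (short) ((unsigned short) i + 1u);
}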

Thanks,
bin


Re: [AArch64] Using QEMU to run Ubuntu ARM 64-bit

2014-08-26 Thread Bin.Cheng
On Tue, Aug 26, 2014 at 9:56 PM, Sebastian Pop  wrote:
> Hi,
>
> my colleague Brian has posted the instructions to set up
> an AArch64 virtual machine running ubuntu on top of qemu:
>
> http://rzycki.blogspot.com/2014/08/using-qemu-to-run-ubuntu-arm-64-bit.html
>
> We are using this setup to bootstrap gcc on aarch64.
Great, I will have a try!

Thanks,
bin
>
> Sebastian


Re: Some questions about pass web

2014-09-03 Thread Bin.Cheng
On Wed, Sep 3, 2014 at 7:35 AM, Carrot Wei  wrote:
> Hi
>
> I have following questions about web (pseudo register renaming) pass:
>
> 1. It is well known that register renaming is a big help to register
> allocation, but in gcc's backend, the web pass is far before RA, there
> are about 20 passes between them. Does it mean register renaming can
> also heavily benefit other optimizations? And the passes between them
> usually don't generate more register renaming chances?
I think one purpose is to break long dependency chains into short ones.
For example, with the code below

   use(i)
   i = i + 1;
   ...
   use(i)
   i = i + 1;
   ...
   use(i)
   i = i + 1;
   ...

Pass fweb could change it into the form below

   use(i)
   i0 = i + 1
   ...
   use(i0)
   i1 = i0 + 1
   ...
   use(i1)
   i = i0 + 2
   ...

Apparently, the latter form has shorter chains, which makes the df stuff more efficient.

>
> 2. It looks current web pass can't handle AUTOINC expressions, a reg
> operand is used as both use and def reference in an AUTOINC
> expression, so this def side should not be renamed. Pass web doesn't
> explicitly check this case, may rename the reg operand of AUTOINC
> expression. Is this expected because it is before auto_inc_dec pass?

Last time I tried, there were several passes after loop_done and before
auto-inc-dec that can't handle auto-increment addressing modes,
including fweb.
>
> 3. Are AUTOINC expressions only generated by auto_inc_dec pass? All
> passes before auto_inc_dec including expand should not generate
> AUTOINC expressions, otherwise it will break web.
Yes.  Yet other passes may generate auto-inc friendly instruction
patterns so that auto-inc-dec can capture more opportunities.  IVOPTs is
a typical example.
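
A small sketch of what "auto-inc friendly" means here (my own
illustration, not actual IVOPTs output):

/* After IVOPTs rewrites an indexed access into a pointer walk, the load
   and the increment sit next to each other, so auto-inc-dec can fuse
   them into a post-increment access such as "ldr r3, [p], #4".  */
int sum (int *a, int n)
{
  int s = 0;
  for (int *p = a; p < a + n; p++)
    s += *p;
  return s;
}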

Thanks,
bin

>
> Could anybody help to answer these questions?
>
> thanks a lot
> Guozhi Wei


Re: GCC ARM: aligned access

2014-09-03 Thread Bin.Cheng
On Mon, Sep 1, 2014 at 9:14 AM, Peng Fan  wrote:
>
>
> On 09/01/2014 08:09 AM, Matt Thomas wrote:
>>
>> On Aug 31, 2014, at 11:32 AM, Joel Sherrill  
>> wrote:
>>
 Hi,

 I am writing some code and found that system crashed. I found it was
 unaligned access which causes `data abort` exception. I write a piece
 of code and objdump
 it. I am not sure this is right or not.

 command:
 arm-poky-linux-gnueabi-gcc -marm -mno-thumb-interwork -mabi=aapcs-linux
 -mword-relocations -march=armv7-a -mno-unaligned-access
 -ffunction-sections -fdata-sections -fno-common -ffixed-r9 -msoft-float
 -pipe  -O2 -c 2.c -o 2.o

 arch is armv7-a and used '-mno-unaligned access'
>>>
>>> I think this is totally expected. You were passed a u8 pointer which is 
>>> aligned for that type (no restrictions likely). You cast it to a type with 
>>> stricter alignment requirements. The code is just flawed. Some CPUs handle 
>>> unaligned accesses but not your ARM.
>>
> The armv7 and armv6 architectures, except armv6-m, support unaligned access.
> A u8 pointer is cast to a u32 pointer; should gcc take the alignment problem
> into consideration to avoid possible errors, because of -mno-unaligned-access?
>> While armv7 and armv6 supports unaligned access, that support has to be
>> enabled by the underlying O/S.  Not knowing the underlying environment,
>> I can't say whether that support is enabled.  One issue we had in NetBSD
>> in moving to gcc4.8 was that the NetBSD/arm kernel didn't enable unaligned
>> access for armv[67] CPUs.  We quickly changed things so unaligned access
>> is supported.
>
> Yeah, by setting a hardware bit in an ARM coprocessor, unaligned access will
> not cause a data abort exception.
> I just wonder: is the following correct? Should gcc take the responsibility
> for handling the possibly unaligned pointer `u8 *data`, given that
> -mno-unaligned-access is passed to gcc?
I suppose not.  It's an explicit type conversion; the compiler assumes
you take the responsibility, I think.
Actually, you can dump the final RTL using -fdump-rtl-shorten and look at
the memory alignment information.  In my experiment it's A32 with
-mno-unaligned-access, which means the access is assumed to be 32-bit aligned.
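
If the pointer really may be misaligned, the portable fix is to let the
compiler generate the access from a memcpy; a minimal sketch, with the
u8/u32 typedefs assumed as in the kernel sources:

#include <string.h>

typedef unsigned char u8;
typedef unsigned int u32;

/* Safe for any alignment: GCC expands the memcpy into whatever load
   sequence the -m[no-]unaligned-access setting permits.  */
u32 load_u32 (const u8 *data)
{
  u32 v;
  memcpy (&v, data, sizeof v);
  return v;
}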

Thanks,
bin
>
> int func(u8 *data)
> {
> return *(unsigned int *)data;
> }
>
>  :
>0: e590  ldr r0, [r0]
>4: e12fff1e  bx  lr
>
> Regards,
> Peng.
>>


Question on param MAX_PENDING_LIST_LENGTH in sched-deps

2014-11-04 Thread Bin.Cheng
Hi,
The parameter MAX_PENDING_LIST_LENGTH is set to 32 by default.  It
seems to me the length of the pending list can't be larger than 32.  But
in sched-deps.c, the code below is used:
  /* Pending lists can't get larger with a readonly context.  */
  if (!deps->readonly
  && ((deps->pending_read_list_length + deps->pending_write_list_length)
  > MAX_PENDING_LIST_LENGTH))

Since we compare it using ">", the list can have 33 instructions at
most, which is inconsistent with the parameter name.

Well, it's some kind of nit picking.
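
A minimal sketch of the fix, assuming the flush call looks roughly as it
does in sched-deps.c:

  /* ">=" would cap the lists at exactly MAX_PENDING_LIST_LENGTH; with
     ">" they may reach MAX_PENDING_LIST_LENGTH + 1 entries before
     being flushed.  */
  if (!deps->readonly
      && ((deps->pending_read_list_length + deps->pending_write_list_length)
          >= MAX_PENDING_LIST_LENGTH))
    flush_pending_lists (deps, insn, true, true);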

Another question: this parameter was introduced to prevent GCC from
running for too long.  I am not clear whether the running time is a
quadratic function of the pending list, or of the number of
dependency nodes.  If it's the latter, could we use the number of
dependency nodes as the parameter directly?  Because in some cases
like memory initialization, there are more than 32 store instructions
in flow, but the dependencies are actually very simple.

Thanks,
bin


Re: Question on param MAX_PENDING_LIST_LENGTH in sched-deps

2014-11-06 Thread Bin.Cheng
On Fri, Nov 7, 2014 at 7:10 AM, Jeff Law  wrote:
> On 11/04/14 20:29, Bin.Cheng wrote:
>>
>> Hi,
>> The parameter MAX_PENDING_LIST_LENGTH is set to 32 by default.  It
>> seems to me the length of pending list can't be larger than 32.  But
>> in sched-deps.c, below code is used:
>>/* Pending lists can't get larger with a readonly context.  */
>>if (!deps->readonly
>>&& ((deps->pending_read_list_length +
>> deps->pending_write_list_length)
>>> MAX_PENDING_LIST_LENGTH))
>>
>> Since we compare it using ">", the list can have 33 instructions at
>> most, which is inconsistent with the parameter name.
>>
>> Well, it's some kind of nit picking.
>
> Yea, seems like a bit of a nit.  Might be worth fixing as someone might size
> an array based on the param's value only to find out it can have an extra
> element.
Yes, I found this only because in some extreme cases with lots of
consecutive stores, a length of 33 breaks store pairing once in
the middle of the instruction flow.


>
>>
>> Another question: this parameter was introduced to prevent GCC from
>> running for too long.  I am not clear whether the running time is a
>> quadratic function of the pending list, or of the number of
>> dependency nodes.  If it's the latter, could we use the number of
>> dependency nodes as the parameter directly?  Because in some cases
>> like memory initialization, there are more than 32 store instructions
>> in flow, but the dependencies are actually very simple.
>
> See:
>
> https://gcc.gnu.org/ml/gcc-patches/2001-07/msg01668.html
>
> I don't recall what part of the algorithm was quadratic, but it should be
> too hard to find out given the thread references the target & the testcase.
At least the dependency analysis part is of quadratic complexity,
because for each new memory access it iterates over all insns in the
list to see whether it depends on them.  But this seems not to be the
real problem; comments in both the source code and the patch you
mentioned say the performance issue is caused by too many dependency
nodes created in this scenario.

Also, this was introduced at a time when we only had low-frequency,
low-memory machines; I bet it can be relaxed to a larger number without
hurting compilation time.  The reason I didn't is that I couldn't get
anything good from it; it seems to barely have any impact on performance.

Thanks,
bin


Do we create new insns in combine? Or can we rely on INSN_LUID to check the order of instructions?

2014-12-10 Thread Bin.Cheng
Hi,
I am looking into distribute_notes; one reason it's complicated is that
the live range of the register noted by REG_DEAD can be either shrunk or
extended.  When the live range shrinks, we need to search backwards to
find the previous reference and mark it as REG_DEAD (or delete the
definition if there is no reference anymore); when the live range
extends, we need to search forward to see if we can mark a later
reference as REG_DEAD.  Maybe the reason distribute_notes is so fragile
is that it guesses how to distribute the DEAD note based on other
information (elim_ix, i2, i3, etc.), rather than on how the register's
live range changes.
For example, PR62151 shows a case in which the REG_DEAD note should be
discarded, but distribute_notes falsely tries to shrink the live range
(even worse, from a wrong point), resulting in a wrong instruction being
deleted.

So I am wondering if I can rely on INSN_LUID to check the order of
different instructions.  If that can be done, I can easily differentiate
live range shrinking from extension.
A further question: if we don't insert new insns, can I use INSN_LUID
safely for this purpose?
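
A sketch of what I have in mind, assuming the df-based LUIDs are still
valid (i.e. no insns have been emitted since the last df analysis):

  /* LUIDs increase in insn-stream order, so this distinguishes the two
     directions distribute_notes has to handle (a sketch).  */
  if (DF_INSN_LUID (from_insn) < DF_INSN_LUID (i3))
    {
      /* Live range extends: search forward for a later reference.  */
    }
  else
    {
      /* Live range shrinks: search backwards for the previous use.  */
    }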

Thanks,
bin


Re: Fixing inconsistent uses of address costs

2015-03-26 Thread Bin.Cheng
On Thu, Mar 26, 2015 at 11:40 PM, Kyrill Tkachov  wrote:
> Hi all,
>
> I'd like to attempt to make GCC's usage of costs in the backends consistent.
> We have a lot of different types: rtx costs, address costs, regmove costs,
> vector costs etc. Some of them are use in different units, compared against
> different types of costs and in general are a bit of a mess. For now I'd
> like
> to clean up address costs since they are actually not used as much as some
> other costs and seem to me to be a slightly simpler task.
>
> From what I can see, address costs i.e. the TARGET_ADDRESS_COST hook are
> used
> in 5 places overall:
> * fwprop.c: forward propagation tries to propagate an rtx into an address
> and compare the address cost of the old address and the one it wants to
> propagate into it and picks the most profitable.
>
> * postreload.c: Again, tries to replace an address with a new one and
> compares
> the two address costs and picks the most profitable.
>
> * tree-ssa-loop-ivopts.c: A bit more involved here. From what I can tell it
> is
> used to assign a cost to various addressing modes, but in the
> computation_cost function it ends up adding/mixing the address cost with rtx costs.
>
> * Some targets use address costs as part of their calculation of rtx costs
> for, say, a memory access instruction.
>
> * The final and worst part is in loop-invariant.c in the
> create_new_invariant
> function that needs to get a feel for how 'cheap' an address is and for that
> it compares the address cost to a magic number 3. See the thread that
> introduced it at http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html.
>
> There are a few issues with the above usages:
> - The magic number in loop-invariant.c was picked to help bwaves on x86 and
> makes no analytical sense for any other target.
>
> - Some targets use COSTS_N_INSNS units for their address costs, others use
> normal scalar units (every target that doesn't define the address costs hook
> ends up going through rtx costs which use COSTS_N_INSNS).
> I think this rather undermines the usage in tree-ssa-loop-ivopts.c which
> uses
> rtx costs in COSTS_N_INSNS and address costs interchangeably.
>
> An insight here is that address costs have two types of usage:
> 1. To compare two addresses and choose whether to replace one by the other
> (as in fwprop and postreload)
> 2. To give a metric of cheapness for use in loop optimisations.
>
> So I propose to split the two uses up. For the fwprop and postreload case
> introduce a hook that will take two addresses 'x' and 'y' and determine
> whether to substitute 'x' by 'y'. This can use address costs and allow the
> backend to do as deep comparison between two addresses as it could possibly
> want, unconstrained by the requirement to assign just a single number to
> each one individually and having the midend just pick the smallest number.
> We can call it something like TARGET_ADDRESS_PREFERABLE_P or something like
> that. Then get fwprop and postreload to use that to determine whether to
> substitute one address for another.
>
> What to do with the remaining address costs hook and its uses in
> tree-ssa-loop-ivopts and loop-invariant? Since tree-ssa-loop-ivopts uses
> the address costs interchangeably with rtx costs I think we should at least
> convert all address costs to use the same units: COSTS_N_INSNS. This raises
> a question: could we not just use rtx costs for the addresses? Someone with
> more familiarity with the code can correct me, but is the address cost usage
> in that file supposed to reflect the cost of computing the address outside
> of
> an addressing mode? If so, then shouldn't rtx costs be used? Conversely, if
> address costs are a measure of something different, surely we shouldn't be
> adding or comparing them with rtx costs, but rather create some
> two-dimensional
> structure to compare two rtx/address entities? (similarly to the
> rtx cost/latency calculations in expmed for the mult synthesis code).
>
> Then there's the usage in loop-invariant.c where the address cost number is
> compared with '3'. This is really not portable across targets. The code
> wants
> some measure of whether an address is 'cheap'. There could be various
> approaches on how to provide that.
> Could we enumerate across an rtx array of representatives of
> all possible addresses for a target sorted by the
> TARGET_ADDRESS_PREFERABLE_P
> hook above? Should we be asking the target to tell us which address type is
> cheap? In any case, it seems to me that the cheapest address in a target is
> something that only the target can tell, unless we agree on a universal
> meaning and unit of measurement for address costs.
>
> What do people think? Would this approach be worth pursuing if the above
> questions can be answered? From having a quick look at config/ I think that
> converting the targets to use COSTS_N_INSNS units would not be a
> controversial
> task as long as the midend usage of address costs is consistent.

Re: Fixing inconsistent uses of address costs

2015-03-30 Thread Bin.Cheng
On Fri, Mar 27, 2015 at 5:43 PM, Kyrill Tkachov  wrote:
>
> On 27/03/15 03:29, Bin.Cheng wrote:
>>
>> On Thu, Mar 26, 2015 at 11:40 PM, Kyrill Tkachov 
>> wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to attempt to make GCC's usage of costs in the backends
>>> consistent.
>>> We have a lot of different types: rtx costs, address costs, regmove
>>> costs,
>>> vector costs etc. Some of them are use in different units, compared
>>> against
>>> different types of costs and in general are a bit of a mess. For now I'd
>>> like
>>> to clean up address costs since they are actually not used as much as
>>> some
>>> other costs and seem to me to be a slightly simpler task.
>>>
>>>  From what I can see, address costs i.e. the TARGET_ADDRESS_COST hook are
>>> used
>>> in 5 places overall:
>>> * fwprop.c: forward propagation tries to propagate an rtx into an address
>>> and compare the address cost of the old address and the one it wants to
>>> propagate into it and picks the most profitable.
>>>
>>> * postreload.c: Again, tries to replace an address with a new one and
>>> compares
>>> the two address costs and picks the most profitable.
>>>
>>> * tree-ssa-loop-ivopts.c: A bit more involved here. From what I can tell
>>> it
>>> is
>>> used to assign a cost to various addressing modes, but in the
>>> computation_cost function it ends up adding/mixing the address cost
>>> with rtx costs.
>>>
>>> * Some targets use address costs as part of their calculation of rtx
>>> costs
>>> for, say, a memory access instruction.
>>>
>>> * The final and worst part is in loop-invariant.c in the
>>> create_new_invariant
>>> function that needs to get a feel for how 'cheap' an address is and for
>>> that
>>> it compares the address cost to a magic number 3. See the thread that
>>> introduced it at http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html.
>>>
>>> There are a few issues with the above usages:
>>> - The magic number in loop-invariant.c was picked to help bwaves on x86
>>> and
>>> makes no analytical sense for any other target.
>>>
>>> - Some targets use COSTS_N_INSNS units for their address costs, others
>>> use
>>> normal scalar units (every target that doesn't define the address costs
>>> hook
>>> ends up going through rtx costs which use COSTS_N_INSNS).
>>> I think this rather undermines the usage in tree-ssa-loop-ivopts.c which
>>> uses
>>> rtx costs in COSTS_N_INSNS and address costs interchangeably.
>>>
>>> An insight here is that address costs have two types of usage:
>>> 1. To compare two addresses and choose whether to replace one by the
>>> other
>>> (as in fwprop and postreload)
>>> 2. To give a metric of cheapness for use in loop optimisations.
>>>
>>> So I propose to split the two uses up. For the fwprop and postreload case
>>> introduce a hook that will take two addresses 'x' and 'y' and determine
>>> whether to substitute 'x' by 'y'. This can use address costs and allow
>>> the
>>> backend to do as deep comparison between two addresses as it could
>>> possibly
>>> want, unconstrained by the requirement to assign just a single number to
>>> each one individually and having the midend just pick the smallest
>>> number.
>>> We can call it something like TARGET_ADDRESS_PREFERABLE_P or something
>>> like
>>> that. Then get fwprop and postreload to use that to determine whether to
>>> substitute one address for another.
>>>
>>> What to do with the remaining address costs hook and its uses in
>>> tree-ssa-loop-ivopts and loop-invariant? Since tree-ssa-loop-ivopts uses
>>> the address costs interchangeably with rtx costs I think we should at
>>> least
>>> convert all address costs to use the same units: COSTS_N_INSNS. This
>>> raises
>>> a question: could we not just use rtx costs for the addresses? Someone
>>> with
>>> more familiarity with the code can correct me, but is the address cost
>>> usage
>>> in that file supposed to reflect the cost of computing the address
>>> outside
>>> of
>>> an addressing mode? If so, then shouldn't rtx costs be used? Conversely,
>>> if
>>> 

Re: Fixing inconsistent uses of address costs

2015-03-30 Thread Bin.Cheng
On Sat, Mar 28, 2015 at 1:31 AM, Sandra Loosemore
 wrote:
> On 03/27/2015 03:43 AM, Kyrill Tkachov wrote:
>>
>>
>> On 27/03/15 03:29, Bin.Cheng wrote:
>>>
>>> [much snippage]
>>>
>>>
>>> As for tree ivopts, address cost is used in both ways.  For any
>>> address computation that's invalid, it tries to legitimize it into two
>>> parts, the first part results in alu instructions, the second part
>>> results in address expression of different addressing modes.  Right
>>> now the rtx cost (for the first part) and the address cost (for the
>>> second part) are accumulated and compared altogether.  I do like the
>>> idea split costs into different types because all elements are
>>> entangled in single cost model, and it's hard to tune for specific
>>> case.
>>
>>
>> Thanks for explaining.
>> I think it would be possible to make the comparisons there a bit more
>> sane.
>> If an address computation is split into alu instructions and a
>> legitimate address
>> then carry around the rtx cost of the alu part and the address. When the
>> time
>> comes to compare two computations, we create a more involved way of
>> comparing.
>> For example (would need benchmarking, of course):
>> * Compare the rtx costs of the alu components and check the address
>> preference for
>> the legitimate address components using the new hook.
>> * If the alu part costs are of equal rtx cost, pick the one which has
>> the preferable legitimate address.
>> * If expression 'a' has a more expensive alu component than 'b' but a more
>> preferable address component, then use some tie breaker. Perhaps apply
>> rtx costs
>> on the address expression and compare those...
>
>
> Just as an aside here, tree-ssa-loop-ivopts.c has a lot of other problems
> with how it's computing address costs, or it least it did when I last looked
> at it a few years ago:
>
> https://gcc.gnu.org/ml/gcc-patches/2012-06/msg00319.html
>
> Shortly after I posted that, Qualcomm lost interest in pushing the Hexagon
> port upstream or improving performance, so I lost interest in pursuing that
> patch when it was evident that it was going to take a lot of work to resolve
> the objections.  I think fixes for problems (2) and (3) there have since
> been pushed by other people, but I'm not sure about the others.
>
> FWIW, I think a big part of the problem here is that the GCC internals
> documentation isn't very clear on where the cost of legitimizing an address
> (the "alu components" above) should be computed.  IIRC when I was last
I agree that the legitimize_address interface matters more than the
address cost itself.  IVOPTs uses it to compute the cost but doesn't use
it to generate the final code when rewriting address type uses (though
it tries to mimic the legitimization behavior and generate lower-cost
instructions and addressing expressions).
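
To make the alu/address split concrete, a small sketch (the names and
offsets are my own illustration, not ivopts output):

/* Suppose the candidate address is  base + i*8 + 4000  and the target
   cannot encode the large offset in an addressing mode.  IVOPTs costs
   it as an ALU part plus a residual legitimate address.  */
long use (long *base, long i)
{
  long *tmp = base + 500;   /* ALU part: costed via rtx costs.  */
  return tmp[i];            /* [tmp + i*8]: costed via address cost.  */
}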

Thanks,
bin
> looking at current practice, most targets' implementation of
> TARGET_RTX_COSTS didn't make any attempt to account for the address cost in
> a MEM -- either adding the cost of legitimizing the address or the cost of
> the addressing mode itself (TARGET_ADDRESS_COST).  If TARGET_RTX_COSTS is
> supposed to do that, its documentation should say so.  Or maybe we need a
> separate hook like TARGET_LEGITIMIZE_ADDRESS_COST to capture the "alu
> components" cost.
>
> -Sandra
>


Combine changes ASHIFT into mult for non-MEM rtx

2015-04-02 Thread Bin.Cheng
Hi,
In function make_compound_operation, the code/comment says:

case ASHIFT:
  /* Convert shifts by constants into multiplications if inside
 an address.  */
  if (in_code == MEM && CONST_INT_P (XEXP (x, 1))
  && INTVAL (XEXP (x, 1)) < HOST_BITS_PER_WIDE_INT
  && INTVAL (XEXP (x, 1)) >= 0
  && SCALAR_INT_MODE_P (mode))
{


Right now, it changes ASHIFT in any SET into mult because of the code below:

  /* Select the code to be used in recursive calls.  Once we are inside an
 address, we stay there.  If we have a comparison, set to COMPARE,
 but once inside, go back to our default of SET.  */

  next_code = (code == MEM ? MEM
   : ((code == PLUS || code == MINUS)
  && SCALAR_INT_MODE_P (mode)) ? MEM // <- bogus?
   : ((code == COMPARE || COMPARISON_P (x))
  && XEXP (x, 1) == const0_rtx) ? COMPARE
   : in_code == COMPARE ? SET : in_code);

Re: Combine changes ASHIFT into mult for non-MEM rtx

2015-04-02 Thread Bin.Cheng
On Thu, Apr 2, 2015 at 5:49 PM, Kugan  wrote:
> On 02/04/15 20:39, Bin.Cheng wrote:
>> Hi,
>> In function make_compound_operation, the code/comment says:
>>
>> case ASHIFT:
>>   /* Convert shifts by constants into multiplications if inside
>>  an address.  */
>>   if (in_code == MEM && CONST_INT_P (XEXP (x, 1))
>>   && INTVAL (XEXP (x, 1)) < HOST_BITS_PER_WIDE_INT
>>   && INTVAL (XEXP (x, 1)) >= 0
>>   && SCALAR_INT_MODE_P (mode))
>> {
>>
>>
>> Right now, it changes ASHIFT in any SET into mult because of below code:
>>
>>   /* Select the code to be used in recursive calls.  Once we are inside an
>>  address, we stay there.  If we have a comparison, set to COMPARE,
>>  but once inside, go back to our default of SET.  */
>>
>>   next_code = (code == MEM ? MEM
>>: ((code == PLUS || code == MINUS)
>>   && SCALAR_INT_MODE_P (mode)) ? MEM // <- bogus?
>>: ((code == COMPARE || COMPARISON_P (x))
>>   && XEXP (x, 1) == const0_rtx) ? COMPARE
>>: in_code == COMPARE ? SET : in_code);
>>
>> This seems an oversight to me.  The effect is that all targets have to
>> support the generated expression in the corresponding pattern.  Sometimes
>> the generated expression is just too stupid and gets missed.  For
>> example, the insn below is tried by combine:
>> (set (reg:SI 79 [ D.2709 ])
>> (plus:SI (subreg:SI (sign_extract:DI (mult:DI (reg:DI 1 x1 [ i ])
>> (const_int 2 [0x2]))
>> (const_int 17 [0x11])
>> (const_int 0 [0])) 0)
>> (reg:SI 0 x0 [ a ])))
>>
>>
>> It actually equals to
>> (set (reg/i:SI 0 x0)
>> (plus:SI (ashift:SI (sign_extend:SI (reg:HI 1 x1 [ i ]))
>> (const_int 1 [0x1]))
>> (reg:SI 0 x0 [ a ])))
>>
>> equals to below instruction on AARCH64:
>> add     w0, w0, w1, sxth #1
>>
>>
>> Because of the existing comment, also because it will make backend
>> easier (I suppose), is it reasonable to fix this behavior in
>> combine.c?  Another question is, if we are going to make the change,
>> how many targets might be affected?
>>
>
> I think https://gcc.gnu.org/ml/gcc-patches/2015-01/msg01020.html is
> related to this.

Hi Kugan,
Thanks for pointing me to this.  Actually it's exactly the same case I
am looking at.
Glad to know that somebody else is looking after it; it's just that the
thread hasn't been followed up recently :(

Thanks,
bin

>
>
> Thanks,
> kugan


Re: Combine changes ASHIFT into mult for non-MEM rtx

2015-04-02 Thread Bin.Cheng
On Thu, Apr 2, 2015 at 8:32 PM, Jeff Law  wrote:
> On 04/02/2015 03:39 AM, Bin.Cheng wrote:
>>
>> Hi,
>> In function make_compound_operation, the code/comment says:
>>
>>  case ASHIFT:
>>/* Convert shifts by constants into multiplications if inside
>>   an address.  */
>>if (in_code == MEM && CONST_INT_P (XEXP (x, 1))
>>&& INTVAL (XEXP (x, 1)) < HOST_BITS_PER_WIDE_INT
>>&& INTVAL (XEXP (x, 1)) >= 0
>>&& SCALAR_INT_MODE_P (mode))
>>  {
>>
>>
>> Right now, it changes ASHIFT in any SET into mult because of below code:
>>
>>/* Select the code to be used in recursive calls.  Once we are inside
>> an
>>   address, we stay there.  If we have a comparison, set to COMPARE,
>>   but once inside, go back to our default of SET.  */
>>
>>next_code = (code == MEM ? MEM
>> : ((code == PLUS || code == MINUS)
>> && SCALAR_INT_MODE_P (mode)) ? MEM // <- bogus?
>> : ((code == COMPARE || COMPARISON_P (x))
>>&& XEXP (x, 1) == const0_rtx) ? COMPARE
>> : in_code == COMPARE ? SET : in_code);
>>
>> This seems an oversight to me.  The effect is that all targets have to
>> support the generated expression in the corresponding pattern.  Sometimes
>> the generated expression is just too stupid and gets missed.  For
>> example, the insn below is tried by combine:
>> (set (reg:SI 79 [ D.2709 ])
>>  (plus:SI (subreg:SI (sign_extract:DI (mult:DI (reg:DI 1 x1 [ i ])
>>  (const_int 2 [0x2]))
>>  (const_int 17 [0x11])
>>  (const_int 0 [0])) 0)
>>  (reg:SI 0 x0 [ a ])))
>>
>>
>> It actually equals to
>> (set (reg/i:SI 0 x0)
>>  (plus:SI (ashift:SI (sign_extend:SI (reg:HI 1 x1 [ i ]))
>>  (const_int 1 [0x1]))
>>  (reg:SI 0 x0 [ a ])))
>>
>> equals to below instruction on AARCH64:
>>  add     w0, w0, w1, sxth #1
>>
>>
>> Because of the existing comment, also because it will make backend
>> easier (I suppose), is it reasonable to fix this behavior in
>> combine.c?  Another question is, if we are going to make the change,
>> how many targets might be affected?
>
> There's already a thread on this issue -- it's not a regression so I haven't
> pushed it forward and probably won't until stage1 reopens.
Indeed, Kugan pointed me to that thread.  Glad somebody else is
looking after the problem.

Thanks,
bin
>
> jeff


Question about macro _GLIBCXX_RES_LIMITS in libstdc++ testsuite

2015-04-22 Thread Bin.Cheng
Hi,
In libstdc++ testsuite, I noticed that macro _GLIBCXX_RES_LIMITS is
checked/set by GLIBCXX_CHECK_SETRLIMIT, which is further guarded by
GLIBCXX_IS_NATIVE as below:

AC_DEFUN([GLIBCXX_CONFIGURE_TESTSUITE], [
  if $GLIBCXX_IS_NATIVE ; then
# Do checks for resource limit functions.
GLIBCXX_CHECK_SETRLIMIT

# Look for setenv, so that extended locale tests can be performed.
GLIBCXX_CHECK_STDLIB_DECL_AND_LINKAGE_3(setenv)
  fi

For a cross toolchain like arm-linux, _GLIBCXX_RES_LIMITS isn't set.  As
a result, the function __gnu_test::set_file_limit is effectively a no-op,
causing the test 27_io/fpos/14775.cc to fail.

My question is: why do we want to guard the check with GLIBCXX_IS_NATIVE?
Could we run the check directly?  If setrlimit isn't supported, the check
will simply fail and leave the macro undefined anyway.
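
Roughly what GLIBCXX_CHECK_SETRLIMIT's test boils down to, as I
understand it (a sketch, not the actual macro body): if this fails to
compile and link for the target, _GLIBCXX_RES_LIMITS simply stays
undefined.

#include <sys/resource.h>

int main ()
{
  struct rlimit r;
  getrlimit (RLIMIT_FSIZE, &r);
  setrlimit (RLIMIT_FSIZE, &r);
  return 0;
}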


Thanks,
bin


Re: Question about macro _GLIBCXX_RES_LIMITS in libstdc++ testsuite

2015-05-17 Thread Bin.Cheng
On Sat, May 16, 2015 at 5:35 PM, Hans-Peter Nilsson  wrote:
> On Thu, 23 Apr 2015, Bin.Cheng wrote:
>> Hi,
>> In libstdc++ testsuite, I noticed that macro _GLIBCXX_RES_LIMITS is
>> checked/set by GLIBCXX_CHECK_SETRLIMIT, which is further guarded by
>> GLIBCXX_IS_NATIVE as below:
>>
>> AC_DEFUN([GLIBCXX_CONFIGURE_TESTSUITE], [
>>   if $GLIBCXX_IS_NATIVE ; then
>> # Do checks for resource limit functions.
>> GLIBCXX_CHECK_SETRLIMIT
>>
>> # Look for setenv, so that extended locale tests can be performed.
>> GLIBCXX_CHECK_STDLIB_DECL_AND_LINKAGE_3(setenv)
>>   fi
>>
>> For a cross toolchain like arm-linux, _GLIBCXX_RES_LIMITS isn't set.  As
>> a result, the function __gnu_test::set_file_limit is effectively a no-op,
>> causing the test 27_io/fpos/14775.cc to fail.
>>
>> My question is: why do we want to guard the check with GLIBCXX_IS_NATIVE?
>> Could we run the check directly?  If setrlimit isn't supported, the check
>> will simply fail and leave the macro undefined anyway.
>>
>>
>> Thanks,
>> bin
>>
>
> Good question.  I'm CC:ing the libstdc++ list, maybe that'll
> bring an answer before another 1.5 months.

Ah, thank you for helping, and I can wait another 1.5 months :)

Thanks,
bin
>
> brgds, H-P


Re: Question about macro _GLIBCXX_RES_LIMITS in libstdc++ testsuite

2015-05-17 Thread Bin.Cheng
On Mon, May 18, 2015 at 9:40 AM, Jim Wilson  wrote:
> On 05/17/2015 01:16 AM, Bin.Cheng wrote:
>> On Sat, May 16, 2015 at 5:35 PM, Hans-Peter Nilsson  
>> wrote:
>>> On Thu, 23 Apr 2015, Bin.Cheng wrote:
>>>> Hi,
>>>> In libstdc++ testsuite, I noticed that macro _GLIBCXX_RES_LIMITS is
>>>> checked/set by GLIBCXX_CHECK_SETRLIMIT, which is further guarded by
>>>> GLIBCXX_IS_NATIVE as below:
>
> The setrlimit checks were made dependent on GLIBCXX_IS_NATIVE on Aug 9,
> 2001.
> https://gcc.gnu.org/ml/gcc-patches/2001-08/msg00536.html
> This is 3 days after the feature was added.  This was 14 years ago, so
> people might not remember exactly why the change was made.  There was
> probably no specific reason for this, other than a concern that it might
> not work cross, and/or wasn't considered worth the effort to make it
> work cross at the time.
>
> It does look like this can work cross, at least for a cross to a linux
> target.  For a cross to a bare metal target, it should be OK to run the
> tests, they will just fail and disable the macros.  Someone just needs
> write the patches to make it work and test it.  You could try submitting
> a bug report if you haven't already done so.
Thanks Jim.
It's very probably the case, as you said.  We already have a patch and
will send it for review.

Thanks,
bin
>
> Jim
>


Re: Identifying Chain of Recurrence

2015-05-28 Thread Bin.Cheng
On Fri, May 29, 2015 at 12:41 PM, Pritam Gharat
 wrote:
> GCC builds a chain of recurrence to capture a pattern in which an
> array is accessed in a loop. Is there any function which identifies
> that gcc has built a chain of recurrence? Is this information
> associated to the gimple assignment which accesses the array elements?
>
GCC analyzes the evolution of scalar variables on the SSA representation.
Each ssa var is treated as unique and can be analyzed.  If the address
expression itself is an ssa var, it can be analyzed by scev; otherwise,
the users of scev have to compute it themselves on the basis of other
scev vars.  For example, the address in MEM[scev_var] can be analyzed by
scev, while the address in MEM[invariant_base+scev_iv] is computed by
the users.  Well, at least this is how IVOPTs works.  You can refer to
the top-of-file comment in tree-scalar-evolution.c for detailed
information.  General routines for chains of recurrence are in tree-chrec.c.

Thanks,
bin
>
> Thanks,
> Pritam Gharat


Re: Identifying Chain of Recurrence

2015-05-29 Thread Bin.Cheng
On Fri, May 29, 2015 at 11:14 PM, Pritam Gharat
 wrote:
> I am writing a Simple IPA Pass which is inserted after ipa-pta. This
> pass identifies whether a chain of recurrence is built by gcc or not.
> I have used tree_is_chrec(t) to check if t represents chain of
> recurrence and function find_var_scev_info(bb, var)  to obtain the
> scev information generated (which comprises of chrec information).
> However, find_var_scev_info() is a static function and cannot be
> called from my plugin.
find_var_scev_info is for internal use only.  I think the proper
interface is simple_iv.  You may refer to other users of that
interface in GCC.
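
A minimal sketch of using it, assuming the internal API of roughly this
era:

  /* If OP evolves as {base, +, step} in LOOP, simple_iv fills in IV.  */
  affine_iv iv;
  gimple *stmt = SSA_NAME_DEF_STMT (op);
  struct loop *loop = loop_containing_stmt (stmt);

  if (simple_iv (loop, loop, op, &iv, /*allow_nonconstant_step=*/false))
    {
      /* iv.base and iv.step describe the chain of recurrence.  */
    }
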
Also, please don't top-post in replies.

Thanks,
bin
>
> Is there any alternative way to get the chain of recurrence
> information? Or do we need to change the gcc source code by making the
> function find_var_scev_info() non-static and recompiling the source
> code?


>
> Thanks,
> Pritam Gharat
>
> On Fri, May 29, 2015 at 10:34 AM, Bin.Cheng  wrote:
>> On Fri, May 29, 2015 at 12:41 PM, Pritam Gharat
>>  wrote:
>>> GCC builds a chain of recurrence to capture a pattern in which an
>>> array is accessed in a loop. Is there any function which identifies
>>> that gcc has built a chain of recurrence? Is this information
>>> associated to the gimple assignment which accesses the array elements?
>>>
>> GCC analyzes the evolution of scalar variables on the SSA representation.
>> Each ssa var is treated as unique and can be analyzed.  If the address
>> expression itself is an ssa var, it can be analyzed by scev; otherwise,
>> the users of scev have to compute it themselves on the basis of other
>> scev vars.  For example, the address in MEM[scev_var] can be analyzed by
>> scev, while the address in MEM[invariant_base+scev_iv] is computed by
>> the users.  Well, at least this is how IVOPTs works.  You can refer to
>> the top-of-file comment in tree-scalar-evolution.c for detailed
>> information.  General routines for chains of recurrence are in tree-chrec.c.
>>
>> Thanks,
>> bin
>>>
>>> Thanks,
>>> Pritam Gharat


progress on PR85804?

2018-10-14 Thread Bin.Cheng
Hi,
Is there any progress on PR85804?  There was some discussion about
the old patch at the address below, but it looks like it received no
further attention.
https://gcc.gnu.org/ml/gcc-patches/2018-05/msg01026.html

Thanks,
bin


A GCC bug related to inline?

2018-11-10 Thread Bin.Cheng
Hi,
Given the simple code below:

inline int foo (int a) {
  return a + 1;
}
int g = 5;
int main(void) {
  return foo(g);
}

When compiled with -O0, GCC generates below assembly:
.file "test.c"
.text
.globl g
.data
.align 4
.type g, @object
.size g, 4
g:
.long 5
.text
.globl main
.type main, @function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl g(%rip), %eax
movl %eax, %edi
call foo
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (GNU) 9.0.0 20181031 (experimental)"
.section .note.GNU-stack,"",@progbits

Note that the body of function foo is removed while the call to foo
isn't inlined; this results in an undefined reference to "foo" at link time.

Guess this is a bug?
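
For the record, my understanding after checking the C99/gnu11 inline
rules (so likely not a compiler bug): a plain "inline" definition is
only an inline definition, and no out-of-line copy is emitted for it; at
-O0 the call is not inlined, hence the undefined reference.  Two
sketches that make it link:

/* Option 1: internal linkage; each TU gets its own copy.  */
static inline int foo (int a) { return a + 1; }

/* Option 2: keep "inline" and force an external definition in exactly
   one translation unit via an extern declaration.  */
inline int foo2 (int a) { return a + 1; }
extern int foo2 (int a);  /* this TU now emits the definition */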

Thanks,
bin


Re: Parallelize the compilation using Threads

2018-12-13 Thread Bin.Cheng
On Wed, Dec 12, 2018 at 11:46 PM Giuliano Augusto Faulin Belinassi
 wrote:
>
> Hi, I have some news. :-)
>
> I replicated the Martin Liška experiment [1] on a 64-cores machine for
> gcc [2] and Linux kernel [3] (Linux kernel was fully parallelized),
> and I am excited to dive into this problem. As a result, I want to
> propose GSoC project on this issue, starting with something like:
> 1- Systematically create a benchmark for easily information
> gathering. Martin Liška already made the first version of it, but I
> need to improve it.
> 2- Find and document the global states (Try to reduce the gcc's
> global states as well).
> 3- Define the parallelization strategy.
> 4- First parallelization attempt.
Hi Giuliano,

Thanks very much for working on this.  It could be very useful; for
example, one bottleneck we have is the slow compilation of a big single
source file after making intensive use of distributed compilation.  Of
course, a good parallelization strategy is needed.

Thanks,
bin
>
> I also proposed this issue as a research project to my advisor and he
> supported me on this idea. So I can work for at least one year on
> this, and other things related to it.
>
> Would anyone be willing to mentor me on this?
>
> [1] https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> [2] https://www.ime.usp.br/~belinass/64cores-experiment.svg
> [3] https://www.ime.usp.br/~belinass/64cores-kernel-experiment.svg
> On Mon, Nov 19, 2018 at 8:53 AM Richard Biener
>  wrote:
> >
> > On Fri, Nov 16, 2018 at 8:00 PM Giuliano Augusto Faulin Belinassi
> >  wrote:
> > >
> > > Hi! Sorry for the late reply again :P
> > >
> > > On Thu, Nov 15, 2018 at 8:29 AM Richard Biener
> > >  wrote:
> > > >
> > > > On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
> > > >  wrote:
> > > > >
> > > > > As a brief introduction, I am a graduate student that got interested
> > > > >
> > > > > in the "Parallelize the compilation using threads"(GSoC 2018 [1]). I
> > > > > am a newcommer in GCC, but already have sent some patches, some of
> > > > > them have already been accepted [2].
> > > > >
> > > > > I brought this subject up in IRC, but maybe here is a proper place to
> > > > > discuss this topic.
> > > > >
> > > > > From my point of view, parallelizing GCC itself will only speed up the
> > > > > compilation of projects which have a big file that creates a
> > > > > bottleneck in the whole project compilation (note: by big, I mean the
> > > > > amount of code to generate).
> > > >
> > > > That's true.  During GCC bootstrap there are some of those (see 
> > > > PR84402).
> > > >
> > >
> > > > One way to improve parallelism is to use link-time optimization where
> > > > even single source files can be split up into multiple link-time units. 
> > > >  But
> > > > then there's the serial whole-program analysis part.
> > >
> > > Did you mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 ?
> > > That is a lot of data :-)
> > >
> > > It seems that 'phase opt and generate' is the most time-consuming
> > > part. Is that the 'GIMPLE optimization pipeline' you were talking
> > > about in this thread:
> > > https://gcc.gnu.org/ml/gcc/2018-03/msg00202.html
> >
> > It's everything that comes after the frontend parsing bits, thus this
> > includes in particular RTL optimization and early GIMPLE optimizations.
> >
> > > > > Additionally, I know that GCC must not
> > > > > change the project layout, but from the software engineering 
> > > > > perspective,
> > > > > this may be a bad smell that indicates that the file should be broken
> > > > > into smaller files. Finally, the Makefiles will take care of the
> > > > > parallelization task.
> > > >
> > > > What do you mean by GCC must not change the project layout?  GCC
> > > > happily re-orders functions and link-time optimization will reorder
> > > > TUs (well, linking may as well).
> > > >
> > >
> > > That was a response to a comment made on IRC:
> > >
> > > On Thu, Nov 15, 2018 at 9:44 AM Jonathan Wakely  
> > > wrote:
> > > >I think this is in response to a comment I made on IRC. Giuliano said
> > > >that if a project has a very large file that dominates the total build
> > > >time, the file should be split up into smaller pieces. I said  "GCC
> > > >can't restructure people's code. it can only try to compile it
> > > >faster". We weren't referring to code transformations in the compiler
> > > >like re-ordering functions, but physically refactoring the source
> > > >code.
> > >
> > > Yes. But from one of the attachments from PR84402, it seems that such
> > > files exist on GCC,
> > > https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > >
> > > > > My questions are:
> > > > >
> > > > >  1. Is there any project compilation that will significantly be 
> > > > > improved
> > > > > if GCC runs in parallel? Do someone has data about something related
> > > > > to that? How about the Linux Kernel? If not, I can try to bring some.
> > > >
> > > > We do not have any data about this

Is it a bug allowing to copy GIMPLE_ASM with labels?

2018-12-28 Thread Bin.Cheng
Hi,
Given the test below:

volatile int a, b, c;
__attribute__((noinline)) int foo ()
{
  int i;
  for (i = 0; i < 1000; i++)
{
  if (i % 17)
a++;
  else
b++;
  asm volatile(
  "foo_label:\n"
  "pushq %%rbx\n"
  "popq %%rbx\n" :"=m"(c) :"r"(i)
  );
  c++;
}
  return 0;
}
compiled with:
gcc -O2 -fno-crossjumping -ftracer tracer-1.c -o tracer-1.o -c
will end up with following assembler error message:
tracer-1.c: Assembler messages:
tracer-1.c:16: Error: symbol `foo_label' is already defined

The root cause is in tracer.c, which duplicates a basic block without
checking whether any GIMPLE_ASM defines labels.
Is this a bug, or is the code invalid?

Thanks,
bin


Re: Is it a bug allowing to copy GIMPLE_ASM with labels?

2018-12-28 Thread Bin.Cheng
On Sat, Dec 29, 2018 at 3:42 PM Alexander Monakov  wrote:
>
> On Sat, 29 Dec 2018, Bin.Cheng wrote:
> > tracer-1.c: Assembler messages:
> > tracer-1.c:16: Error: symbol `foo_label' is already defined
> >
> > The root cause is in tracer.c, which duplicates a basic block without
> > checking whether any GIMPLE_ASM defines labels.
> > Is this a bug, or is the code invalid?
>
> This is invalid code, GCC documentation is clear that the compiler
> may duplicate inline asm statements (passes other than tracer can do
> that too, loop unswitching just to give one example).
>
> We don't provide a way to write an asm that wouldn't be duplicated.
I see, thanks for the elaboration.
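
For anyone hitting this: one way to write an asm label that survives
duplication is the "%=" template operator, which expands to a number
unique to each instance of the asm (a sketch):

  /* Each copy of the asm gets its own label, so passes that duplicate
     blocks (tracer, loop unswitching, ...) remain safe.  */
  asm volatile (
      "foo_label_%=:\n\t"
      "pushq %%rbx\n\t"
      "popq  %%rbx\n\t"
      : "=m" (c)
      : "r" (i));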

Thanks,
bin
>
> Alexander


question about inlining long call sequence

2019-02-12 Thread Bin.Cheng
Hi,
When reading the inlining code in GCC, I wonder whether we have size
heuristics to limit inlining of long call sequences.  For example, for
the call sequence
A -> B -> C -> D -> ... -> X -> Y -> Z
if each function call grows the size by a very small amount, inlining Z
all the way up into the outermost function could result in a big
function which may hurt the icache.  Is this case handled in the
inliner?  If yes, which code handles this?  Thanks in advance.

BTW, I am using GCC 6, not sure if trunk has different behavior.

Thanks,
bin


Re: question about inlining long call sequence

2019-02-13 Thread Bin.Cheng
On Tue, Feb 12, 2019 at 6:16 PM Martin Jambor  wrote:
>
> On Tue, Feb 12 2019, Bin.Cheng wrote:
> > Hi,
> > When reading inlining code in GCC, I wonder if we have size heuristics
> > to limit inlining long call sequence?  For example, for call sequence
> > A -> B -> C -> D -> ... -> X -> Y -> Z
> > if each function call grows size by a very small amount, inlining Z
> > all the way up to the outermost function could result in a big
> > function which may hurt icache.  Is this case handled in inliner? if
> > yes, which code handles this?  Thanks in advance.
> >
> > BTW, I am using GCC 6, not sure if trunk has different behavior.
>
> I believe it is done in caller_growth_limits() in ipa-inline.c in both
> trunk and GCC 6.  The following comment in the function might shed a bit
> more light on the behavior regarding big functions:
>
>   /* Look for function e->caller is inlined to.  While doing
>  so work out the largest function body on the way.  As
>  described above, we want to base our function growth
>  limits based on that.  Not on the self size of the
>  outer function, not on the self size of inline code
>  we immediately inline to.  This is the most relaxed
>  interpretation of the rule "do not grow large functions
>  too much in order to prevent compiler from exploding".  */
Thanks.  I assume it's the code below that collects the maximum code
size limit along the call stack:

  while (true)
{
  info = inline_summaries->get (to);
  if (limit < info->self_size)
limit = info->self_size;
  if (stack_size_limit < info->estimated_self_stack_size)
stack_size_limit = info->estimated_self_stack_size;
  if (to->global.inlined_to)
to = to->callers->caller;

The question is that it only collects the maximum self_size of the outer
callers; since we are inlining from the outer function inward, checking
only the self_size of the callers means we can still get bloated size in
some cases?  Or what piece of code have I missed?

Another question is in want_inline_small_function_p:

  /* Apply MAX_INLINE_INSNS_AUTO limit for functions not declared inline
 Upgrade it to MAX_INLINE_INSNS_SINGLE when hints suggests that
 inlining given function is very profitable.  */
  else if (!DECL_DECLARED_INLINE_P (callee->decl)
   && !big_speedup
   && !(hints & INLINE_HINT_known_hot)
   && growth >= ((hints & (INLINE_HINT_indirect_call
   | INLINE_HINT_loop_iterations
   | INLINE_HINT_array_index
   | INLINE_HINT_loop_stride))
 ? MAX (MAX_INLINE_INSNS_AUTO,
MAX_INLINE_INSNS_SINGLE)
 : MAX_INLINE_INSNS_AUTO))
So for a function not declared inline, if it's considered known_hot,
there will be no size restriction at all.  This looks like
an issue to me, given that we do restrict size even for a known_hot
function that is declared inline:

  /* Apply MAX_INLINE_INSNS_SINGLE limit.  Do not do so when
 hints suggests that inlining given function is very profitable.  */
  else if (DECL_DECLARED_INLINE_P (callee->decl)
   && growth >= MAX_INLINE_INSNS_SINGLE
   && ((!big_speedup
&& !(hints & (INLINE_HINT_indirect_call
  | INLINE_HINT_known_hot
  | INLINE_HINT_loop_iterations
  | INLINE_HINT_array_index
  | INLINE_HINT_loop_stride)))
   || growth >= MAX_INLINE_INSNS_SINGLE * 16))

Again, this is on GCC6.

Thanks,
bin
>
> HTH
>
> Martin


Re: GCC missing -flto optimizations? SPEC lbm benchmark

2019-02-15 Thread Bin.Cheng
On Fri, Feb 15, 2019 at 3:30 AM Steve Ellcey  wrote:
>
> I have a question about SPEC CPU 2017 and what GCC can and cannot do
> with -flto.  As part of some SPEC analysis I am doing I found that with
> -Ofast, ICC and GCC were not that far apart (especially spec int rate,
> spec fp rate was a slightly larger difference).
>
> But when I added -ipo to the ICC command and -flto to the GCC command,
> the difference got larger.  In particular the 519.lbm_r was more than
> twice as fast with ICC and -ipo, but -flto did not help GCC at all.
>
> There are other tests that also show this type of improvement with -ipo
> like 538.imagick_r, 544.nab_r, 525.x264_r, 531.deepsjeng_r, and
> 548.exchange2_r, but none are as dramatic as 519.lbm_r.  Anyone have
> any idea on what ICC is doing that GCC is missing?  Is GCC just not
> agressive enough with its inlining?

IIRC Jun did some investigation before? CCing.

Thanks,
bin
>
> Steve Ellcey
> sell...@marvell.com


Re: Rust front-end

2019-08-27 Thread Bin.Cheng
On Fri, Aug 23, 2019 at 10:11 PM Mateus Carmo Martins de Freitas
Barbosa  wrote:
>
> I'm interested in working on the Rust front-end for GCC.
>
> So far I've cloned the repository 
> and tried to compile it as described in 
> .
> I've compiled it outside of the gcc directory tree with
>
> $ ../gccrs/configure --prefix=/opt/gccrs --enable-languages=rust
> --disable-multilib --disable-bootstrap
> $ make
>
>
> But this produces some linking errors for functions that were called
> but never defined:
>
>
> /usr/bin/ld: rust/rust-lang.o: in function
> `rust_langhook_handle_option(unsigned long, char const*, long, int,
> unsigned int, cl_option_handlers const*)':
> /home/baal/gccrs-build/gcc/../../gccrs/gcc/rust/rust-lang.c:185:
> undefined reference to `rust_add_search_path(char const*)'
> /usr/bin/ld: /home/baal/gccrs-build/gcc/../../gccrs/gcc/rust/rust-lang.c:213:
> undefined reference to `rust_add_search_path(char const*)'
> /usr/bin/ld: /home/baal/gccrs-build/gcc/../../gccrs/gcc/rust/rust-lang.c:217:
> undefined reference to `rust_add_search_path(char const*)'
> /usr/bin/ld: rust/rust-lang.o: in function `rust_langhook_post_options':
> /home/baal/gccrs-build/gcc/../../gccrs/gcc/rust/rust-lang.c:245:
> undefined reference to `rust_add_search_path(char const*)'
> /usr/bin/ld: rust/rust-lang.o: in function `rust_langhook_parse_file()':
> /home/baal/gccrs-build/gcc/../../gccrs/gcc/rust/rust-lang.c:282:
> undefined reference to `rust_parse_input_files(char const**, unsigned
> int, bool)'
> /usr/bin/ld: rust/rust-lang.o:/home/baal/gccrs-build/gcc/./gtype-rust.h:24:
> undefined reference to `rust_non_zero_struct'
> collect2: error: ld returned 1 exit status
> make[2]: *** [../../gccrs/gcc/rust/Make-lang.in:61: rust1] Error 1
> make[2]: Leaving directory '/home/baal/gccrs-build/gcc'
> make[1]: *** [Makefile:4319: all-gcc] Error 2
>
>
> It's doesn't really help that the latest commit message
> (3b1e76d808b9725e6ef439ae534011370e65fb85) says simply "x" and the
> previous one, only "more". Anyhow, I'm left with those questions:
>
Sorry, I don't have the answers to your questions; I just want to
confirm that I ran into the same issue building it.
Given that both the wiki and the readme explicitly mention that the
status is "Very early and out of date", I am not surprised by the
breakage.  The author who last edited the wiki page seems to have been
inactive for some time?

> - Considering I'm new to gcc development, what should I read before
> getting into this?
> - Is there any developer in particular I should talk to?
> - Is there anything else I need to know before getting started?
I know nothing about front ends.  The functions look like common ones
prefixed with "rust_", so maybe looking into other front ends can give
some clues.  For example, there are go_add_search_path and
brig_add_search_path.

And a couple of words to the community: we may need to be more active in
order to attract new developers.  IMHO, messages asking for information
shouldn't need a week to get answered?

Thanks,
bin


Any future plan supporting livepatching in kernel for AArch64?

2019-09-18 Thread Bin.Cheng
Hi,
I read through previous messages which tried to support livepatching
in kernel for AArch64.
Following is the patch supporting mfentry:
https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00756.html
And this is the last message about above thread:
https://gcc.gnu.org/ml/gcc-patches/2016-04/msg01093.html

Here is another patch trying to support a more general option; however,
it was not merged either.
https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01791.html

If I read the messages correctly, the kernel part is not merged (partly)
because it is waiting for GCC.  As a result, we have to apply local
patches.  I am wondering if there is any future plan to support it on
AArch64, or does no one really use it?

I CCed patch authors as well as who joined previous discussions,
please let me know if you don't want to be cced in this thread.

Thanks,
bin


Re: Any future plan supporting livepatching in kernel for AArch64?

2019-09-18 Thread Bin.Cheng
On Thu, Sep 19, 2019 at 1:02 PM Andrew Pinski  wrote:
>
> On Wed, Sep 18, 2019 at 9:56 PM Andrew Pinski  wrote:
> >
> > On Wed, Sep 18, 2019 at 9:24 PM Bin.Cheng  wrote:
> > >
> > > Hi,
> > > I read through previous messages which tried to support livepatching
> > > in kernel for AArch64.
> > > Following is the patch supporting mfentry:
> > > https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00756.html
> > > And this is the last message about above thread:
> > > https://gcc.gnu.org/ml/gcc-patches/2016-04/msg01093.html
> > >
> > > Here is another patch trying to support a general option, however. it
> > > was not merged either.
> > > https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01791.html
> > >
> > > If I read messages correctly, kernel part is not merged (partly)
> > > because of waiting for GCC.  As a result, we have to apply local
> > > patches, I am wondering if there is any future plan to support it on
> > > AArch64, or no one really uses it?
> >
> > A GCC patch was merged for this already and is included in GCC 9.1.0 and 
> > above.
> > See 
> > https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/Common-Function-Attributes.html#index-patchable_005ffunction_005fentry-function-attribute
> > And
> > https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/Instrumentation-Options.html#index-fpatchable-function-entry
> >
> > Someone did not add it to the GCC 9 release notes though to highlight it.
> Note I Mean GCC 8 release notes as it was included in GCC 8.1.0 and above :).
>
> Also see https://lkml.org/lkml/2017/10/13/340 for the Linux side of
> discussion about this option :).
Thank you very much, the option name changed several times and I
missed this one!
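For anyone else landing on this thread, a minimal usage sketch of the
per-function form (the function name and nop counts are made up for
illustration; see the linked docs for the exact semantics):

/* Reserve 2 nops at the function entry, none of them before the entry
   symbol; equivalent to compiling this function with
   -fpatchable-function-entry=2.  */
__attribute__((patchable_function_entry (2, 0)))
void traced_function (void)
{
  /* ...  */
}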

Thanks,
bin
>
>
> >
> > Thanks,
> > Andrew Pinski
> >
> > >
> > > I CCed patch authors as well as who joined previous discussions,
> > > please let me know if you don't want to be cced in this thread.
> > >
> > > Thanks,
> > > bin


Re: Do we need to do a loop invariant motion after loop interchange ?

2019-11-21 Thread Bin.Cheng
On Fri, Nov 22, 2019 at 3:19 PM Richard Biener
 wrote:
>
> On November 22, 2019 6:51:38 AM GMT+01:00, Li Jia He  
> wrote:
> >
> >
> >On 2019/11/21 8:10 PM, Richard Biener wrote:
> >> On Thu, Nov 21, 2019 at 10:22 AM Li Jia He 
> >wrote:
> >>>
> >>> Hi,
> >>>
> >>> I found for the follow code:
> >>>
> >>> #define N 256
> >>> int a[N][N][N], b[N][N][N];
> >>> int d[N][N], c[N][N];
> >>> void __attribute__((noinline))
> >>> double_reduc (int n)
> >>> {
> >>> for (int k = 0; k < n; k++)
> >>> {
> >>>   for (int l = 0; l < n; l++)
> >>>{
> >>>  c[k][l] = 0;
> >>>   for (int m = 0; m < n; m++)
> >>> c[k][l] += a[k][m][l] * d[k][m] + b[k][m][l] * d[k][m];
> >>>}
> >>> }
> >>> }
> >>>
> >>> I dumped the file after loop interchange and got the following
> >information:
> >>>
> >>>  [local count: 118111600]:
> >>> # m_46 = PHI <0(7), m_45(11)>
> >>> # ivtmp_44 = PHI <_42(7), ivtmp_43(11)>
> >>> _39 = _49 + 1;
> >>>
> >>>  [local count: 955630224]:
> >>> # l_48 = PHI <0(3), l_47(12)>
> >>> # ivtmp_41 = PHI <_39(3), ivtmp_40(12)>
> >>> c_I_I_lsm.5_18 = c[k_28][l_48];
> >>> c_I_I_lsm.5_53 = m_46 != 0 ? c_I_I_lsm.5_18 : 0;
> >>> _2 = a[k_28][m_46][l_48];
> >>> _3 = d[k_28][m_46];
> >>> _4 = _2 * _3;
> >>> _5 = b[k_28][m_46][l_48];
> >>> _6 = _3 * _5;
> >>> _7 = _4 + _6;
> >>> _8 = _7 + c_I_I_lsm.5_53;
> >>> c[k_28][l_48] = _8;
> >>> l_47 = l_48 + 1;
> >>> ivtmp_40 = ivtmp_41 - 1;
> >>> if (ivtmp_40 != 0)
> >>>   goto ; [89.00%]
> >>> else
> >>>   goto ; [11.00%]
> >>>
> >>> we can see '_3 = d[k_28][m_46];'  is a loop invariant.
> >>> Do we need to add a loop invariant motion pass after the loop
> >interchange?
> >>
> >> There is one at the end of the loop pipeline.
> >
> >Hi,
> >
> >The one at the end of the loop pipeline may miss some optimization
> >opportunities.  If we vectorize the above code (a.c.158t.vect), we
> >can get information similar to the following:
> >
> >bb 3:
> >  # m_46 = PHI <0(7), m_45(11)>  // loop m, outer loop
> >   if (_59 <= 2)
> > goto bb 20;
> >   else
> > goto bb 15;
> >
> >bb 15:
> >   _89 = d[k_28][m_46];
> >   vect_cst__90 = {_89, _89, _89, _89};
> >
> >bb 4:
> ># l_48 = PHI  // loop l, inner loop
> >   vect__6.23_100 = vect_cst__99 * vect__5.22_98;
> >if (ivtmp_110 < bnd.8_1)
> > goto bb 12;
> >   else
> > goto bb 17;
> >
> >bb 20:
> >bb 18:
> >_27 = d[k_28][m_46];
> >if (ivtmp_12 != 0)
> > goto bb 19;
> >   else
> > goto bb 21;
> >
> >Vectorization will do some conversions in this case.  We can see
> >‘ _89 = d[k_28][m_46];’ and ‘_27 = d[k_28][m_46];’ are loop invariant
> >relative to loop l.  We can move ‘d[k_28][m_46]’ to the front of
> >‘if (_59 <= 2)’ to get rid of loading data from memory in both
> >branches.
> >
> >The one at at the end of the loop pipeline can't handle this situation.
> >If we move d[k_28][m_46] from loop l to loop m before doing
> >vectorization, we can get rid of this situation.
>
> But we can't run every pass after every other. With multiple passes having 
> ordering issues is inevitable.
>
> Now - interchange could trigger a region based invariant motion just for the 
> nest it interchanged. But that doesn't exist right now.
With data reference/dependence information available in the pass, I think
it could be quite straightforward.  I didn't realize that we needed it
before.
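For reference, this is what such a nest-local invariant motion would
produce on the example, written out by hand at the source level using
the arrays from the example above (a sketch of the intended result, not
actual compiler output):

for (int k = 0; k < n; k++)
  for (int m = 0; m < n; m++)
    {
      int dkm = d[k][m];    /* invariant w.r.t. the innermost (l) loop */
      for (int l = 0; l < n; l++)
        {
          /* The m != 0 test mirrors the c_I_I_lsm.5_53 select above.  */
          int s = m != 0 ? c[k][l] : 0;
          c[k][l] = s + a[k][m][l] * dkm + b[k][m][l] * dkm;
        }
    }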

Thanks,
bin
>
> Richard.


Re: Do we need to do a loop invariant motion after loop interchange ?

2019-11-23 Thread Bin.Cheng
On Fri, Nov 22, 2019 at 3:23 PM Bin.Cheng  wrote:
>
> On Fri, Nov 22, 2019 at 3:19 PM Richard Biener
>  wrote:
> >
> > On November 22, 2019 6:51:38 AM GMT+01:00, Li Jia He 
> >  wrote:
> > >
> > >
> > >On 2019/11/21 8:10 PM, Richard Biener wrote:
> > >> On Thu, Nov 21, 2019 at 10:22 AM Li Jia He 
> > >wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> I found for the follow code:
> > >>>
> > >>> #define N 256
> > >>> int a[N][N][N], b[N][N][N];
> > >>> int d[N][N], c[N][N];
> > >>> void __attribute__((noinline))
> > >>> double_reduc (int n)
> > >>> {
> > >>> for (int k = 0; k < n; k++)
> > >>> {
> > >>>   for (int l = 0; l < n; l++)
> > >>>{
> > >>>  c[k][l] = 0;
> > >>>   for (int m = 0; m < n; m++)
> > >>> c[k][l] += a[k][m][l] * d[k][m] + b[k][m][l] * d[k][m];
> > >>>}
> > >>> }
> > >>> }
> > >>>
> > >>> I dumped the file after loop interchange and got the following
> > >information:
> > >>>
> > >>>  [local count: 118111600]:
> > >>> # m_46 = PHI <0(7), m_45(11)>
> > >>> # ivtmp_44 = PHI <_42(7), ivtmp_43(11)>
> > >>> _39 = _49 + 1;
> > >>>
> > >>>  [local count: 955630224]:
> > >>> # l_48 = PHI <0(3), l_47(12)>
> > >>> # ivtmp_41 = PHI <_39(3), ivtmp_40(12)>
> > >>> c_I_I_lsm.5_18 = c[k_28][l_48];
> > >>> c_I_I_lsm.5_53 = m_46 != 0 ? c_I_I_lsm.5_18 : 0;
> > >>> _2 = a[k_28][m_46][l_48];
> > >>> _3 = d[k_28][m_46];
> > >>> _4 = _2 * _3;
> > >>> _5 = b[k_28][m_46][l_48];
> > >>> _6 = _3 * _5;
> > >>> _7 = _4 + _6;
> > >>> _8 = _7 + c_I_I_lsm.5_53;
> > >>> c[k_28][l_48] = _8;
> > >>> l_47 = l_48 + 1;
> > >>> ivtmp_40 = ivtmp_41 - 1;
> > >>> if (ivtmp_40 != 0)
> > >>>   goto ; [89.00%]
> > >>> else
> > >>>   goto ; [11.00%]
> > >>>
> > >>> we can see '_3 = d[k_28][m_46];'  is a loop invariant.
> > >>> Do we need to add a loop invariant motion pass after the loop
> > >interchange?
> > >>
> > >> There is one at the end of the loop pipeline.
> > >
> > >Hi,
> > >
> > >The one at the end of the loop pipeline may miss some optimization
> > >opportunities.  If we vectorize the above code (a.c.158t.vect), we
> > >can get information similar to the following:
> > >
> > >bb 3:
> > >  # m_46 = PHI <0(7), m_45(11)>  // loop m, outer loop
> > >   if (_59 <= 2)
> > > goto bb 20;
> > >   else
> > > goto bb 15;
> > >
> > >bb 15:
> > >   _89 = d[k_28][m_46];
> > >   vect_cst__90 = {_89, _89, _89, _89};
> > >
> > >bb 4:
> > ># l_48 = PHI  // loop l, inner loop
> > >   vect__6.23_100 = vect_cst__99 * vect__5.22_98;
> > >if (ivtmp_110 < bnd.8_1)
> > > goto bb 12;
> > >   else
> > > goto bb 17;
> > >
> > >bb 20:
> > >bb 18:
> > >_27 = d[k_28][m_46];
> > >if (ivtmp_12 != 0)
> > > goto bb 19;
> > >   else
> > > goto bb 21;
> > >
> > >Vectorization will do some conversions in this case.  We can see
> > >‘ _89 = d[k_28][m_46];’ and ‘_27 = d[k_28][m_46];’ are loop invariant
> > >relative to loop l.  We can move ‘d[k_28][m_46]’ to the front of
> > >‘if (_59 <= 2)’ to get rid of loading data from memory in both
> > >branches.
> > >
> > >The one at at the end of the loop pipeline can't handle this situation.
> > >If we move d[k_28][m_46] from loop l to loop m before doing
> > >vectorization, we can get rid of this situation.
> >
> > But we can't run every pass after every other. With multiple passes having 
> > ordering issues is inevitable.
> >
> > Now - interchange could trigger a region based invariant motion just for 
> > the nest it interchanged. But that doesn't exist right now.
> With data reference/dependence information in the pass, I think it
> could be quite straightforward.  Didn't realize that we need it
> before.
FYI, the attachment is a simple fix in loop interchange for the reported
issue.  It's untested, and not against GCC 10 either.

Thanks,
bin
>
> Thanks,
> bin
> >
> > Richard.


linterchange-invariant-dataref-motion.patch
Description: Binary data


Re: Do we need to do a loop invariant motion after loop interchange ?

2019-11-25 Thread Bin.Cheng
On Mon, Nov 25, 2019 at 5:29 PM Li Jia He  wrote:
>
>
>
> On 2019/11/24 2:26 PM, Bin.Cheng wrote:
> > On Fri, Nov 22, 2019 at 3:23 PM Bin.Cheng  wrote:
> >>
> >> On Fri, Nov 22, 2019 at 3:19 PM Richard Biener
> >>  wrote:
> >>>
> >>> On November 22, 2019 6:51:38 AM GMT+01:00, Li Jia He 
> >>>  wrote:
> >>>>
> >>>>
> >>>> On 2019/11/21 8:10 PM, Richard Biener wrote:
> >>>>> On Thu, Nov 21, 2019 at 10:22 AM Li Jia He 
> >>>> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I found for the follow code:
> >>>>>>
> >>>>>> #define N 256
> >>>>>> int a[N][N][N], b[N][N][N];
> >>>>>> int d[N][N], c[N][N];
> >>>>>> void __attribute__((noinline))
> >>>>>> double_reduc (int n)
> >>>>>> {
> >>>>>>  for (int k = 0; k < n; k++)
> >>>>>>  {
> >>>>>>for (int l = 0; l < n; l++)
> >>>>>> {
> >>>>>>   c[k][l] = 0;
> >>>>>>for (int m = 0; m < n; m++)
> >>>>>>  c[k][l] += a[k][m][l] * d[k][m] + b[k][m][l] * d[k][m];
> >>>>>> }
> >>>>>>  }
> >>>>>> }
> >>>>>>
> >>>>>> I dumped the file after loop interchange and got the following
> >>>> information:
> >>>>>>
> >>>>>>  [local count: 118111600]:
> >>>>>>  # m_46 = PHI <0(7), m_45(11)>
> >>>>>>  # ivtmp_44 = PHI <_42(7), ivtmp_43(11)>
> >>>>>>  _39 = _49 + 1;
> >>>>>>
> >>>>>>   [local count: 955630224]:
> >>>>>>  # l_48 = PHI <0(3), l_47(12)>
> >>>>>>  # ivtmp_41 = PHI <_39(3), ivtmp_40(12)>
> >>>>>>  c_I_I_lsm.5_18 = c[k_28][l_48];
> >>>>>>  c_I_I_lsm.5_53 = m_46 != 0 ? c_I_I_lsm.5_18 : 0;
> >>>>>>  _2 = a[k_28][m_46][l_48];
> >>>>>>  _3 = d[k_28][m_46];
> >>>>>>  _4 = _2 * _3;
> >>>>>>  _5 = b[k_28][m_46][l_48];
> >>>>>>  _6 = _3 * _5;
> >>>>>>  _7 = _4 + _6;
> >>>>>>  _8 = _7 + c_I_I_lsm.5_53;
> >>>>>>  c[k_28][l_48] = _8;
> >>>>>>  l_47 = l_48 + 1;
> >>>>>>  ivtmp_40 = ivtmp_41 - 1;
> >>>>>>  if (ivtmp_40 != 0)
> >>>>>>goto ; [89.00%]
> >>>>>>  else
> >>>>>>goto ; [11.00%]
> >>>>>>
> >>>>>> we can see '_3 = d[k_28][m_46];'  is a loop invariant.
> >>>>>> Do we need to add a loop invariant motion pass after the loop
> >>>> interchange?
> >>>>>
> >>>>> There is one at the end of the loop pipeline.
> >>>>
> >>>> Hi,
> >>>>
> >>>> The one at the end of the loop pipeline may miss some optimization
> >>>> opportunities.  If we vectorize the above code (a.c.158t.vect), we
> >>>> can get information similar to the following:
> >>>>
> >>>> bb 3:
> >>>>   # m_46 = PHI <0(7), m_45(11)>  // loop m, outer loop
> >>>>if (_59 <= 2)
> >>>>  goto bb 20;
> >>>>else
> >>>>  goto bb 15;
> >>>>
> >>>> bb 15:
> >>>>_89 = d[k_28][m_46];
> >>>>vect_cst__90 = {_89, _89, _89, _89};
> >>>>
> >>>> bb 4:
> >>>> # l_48 = PHI  // loop l, inner loop
> >>>>vect__6.23_100 = vect_cst__99 * vect__5.22_98;
> >>>> if (ivtmp_110 < bnd.8_1)
> >>>>  goto bb 12;
> >>>>else
> >>>>  goto bb 17;
> >>>>
> >>>> bb 20:
> >>>> bb 18:
> >>>> _27 = d[k_28][m_46];
> >>>> if (ivtmp_12 != 0)
> >>>>  goto bb 19;
> >>>>else
> >>>>  goto bb 21;
> >>>>
> >>>> Vectorizati

Re: Vectorization regression on s390x GCC6 vs GCC5

2017-01-26 Thread Bin.Cheng
On Thu, Jan 26, 2017 at 10:18 AM, Robin Dapp  wrote:
> Hi,
>
> while analyzing a test case with a lot of nested loops (>7) and double
> floating point operations I noticed a performance regression of GCC 6/7
> vs GCC 5 on s390x. It seems due to GCC 6 vectorizing something GCC 5
> couldn't.
>  Basically, each loop iterates over three dimensions, we fully unroll
> some of the inner loops until we have straight-line code of roughly 2000
> insns that are being executed three times in GCC 5. GCC 6 vectorizes two
> iterations and adds a scalar epilogue for the third iteration. The
> epilogue code is so bad that it slows down the execution by at least
> 50%, using only two hard registers and lots of spill slots.
> Although my analysis is not completed, I believe this is because
> register pressure is high in the epilogue and the live ranges span the
> vectorized code as well as the epilogue.
>
> Even reduced, the test case is huge, therefore I didn't include it. Some
> high-level questions instead:
>
> - Has anybody else observed similar problems and got around them?
Yes, I think so.  We also have cases where GCC vectorizes with a larger
vectorization factor, which causes regressions too.

>
> - Is there some way around the register pressure/long live ranges?
I am doing some experiments calculating coarse-grained register
pressure for GIMPLE loops, but the motivation is not the vectorizer
but rather predcom/PRE, as in PR77498.

> Perhaps something we could/should fix in the s390 backend? (Probably
> hard to tell without source)
>
> - Would it make sense to allow a backend to specify the minimal number
> of loop iterations considered for vectorization? Is this
> perhaps already possible somehow? I added a check to disable
> vectorization for loops with <= 3 iterations that shows no regressions
> and improves two SPEC benchmarks noticeably. I'm even considering <=5,
> since a vectorization factor of 4 should exhibit the same problematic
> pattern.
Is the iteration count known at compilation time?  If yes, I am surprised
by GCC's behavior here on such small-trip-count loops.  Cost model?
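On the last point: IIRC there is already a knob in this area,
--param min-vect-loop-bound (I may be misremembering its exact
semantics; please double-check on the GCC 6 branch before relying on
it), e.g.:

gcc -O2 -ftree-vectorize --param min-vect-loop-bound=4 test.c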

Thanks,
bin
>
> Regards
>  Robin
>


Re: Duplicating loops and virtual phis

2017-05-17 Thread Bin.Cheng
On Mon, May 15, 2017 at 7:32 PM, Richard Biener
 wrote:
> On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey  wrote:
>>On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
>>> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey wrote:
>>> >
>>> > (Short version of this email, is there a way to recalculate/rebuild
>>> > virtual
>>> > phi nodes after modifying the CFG.)
>>> >
>>> > I have a question about duplicating loops and virtual phi nodes.
>>> > I am trying to implement the following optimization as a pass:
>>> >
>>> > Transform:
>>> >
>>> >   for (i = 0; i < n; i++) {
>>> >A[i] = A[i] + B[i];
>>> >C[i] = C[i-1] + D[i];
>>> >   }
>>> >
>>> > Into:
>>> >
>>> >   if (noalias between A&B, A&C, A&D)
>>> >for (i = 0; i < 100; i++)
>>> >A[i] = A[i] + B[i];
>>> >for (i = 0; i < 100; i++)
>>> >C[i] = C[i-1] + D[i];
>>> >   else
>>> >for (i = 0; i < 100; i++) {
>>> >A[i] = A[i] + B[i];
>>> >C[i] = C[i-1] + D[i];
>>> >}
>>> >
>>> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot
>>be
>>> > vectorized so it gives up and does not vectorize the loop.  If we
>>split
>>> > up the loop into two loops then the vector add with A[i] could be
>>> > vectorized
>>> > even if the one with C[i] could not.
>>> Loop distribution does this transform but it doesn't know about
>>> versioning for unknown dependences.
>>>
>>
>>Yes, I looked at loop distribution.  But it only works with global
>>arrays and not with pointer arguments where it doesn't know the size of
>>the array being pointed at.  I would like to be able to have it work
>>with pointer arguments.  If I call a function with 2 or
>>more integer pointers, and I have a loop that accesses them with
>>offsets between 0 and N where N is loop invariant then I should have
>>enough information (at runtime) to determine if there are overlapping
>>memory accesses through the pointers and determine whether or not I can
>>distribute the loop.
>
> Not sure where you got that from. Loop distribution works with our data 
> reference / dependence analysis.  The cost model might be more restricted but 
> that can be fixed.
>
>>The loop splitting code seemed like a better template since it already
>>knows how to split a loop based on a runtime determined condition. That
>>part seems to be working for me, it is when I try to
>>distribute/duplicate one of those loops (under the unaliased condition)
>>that I am running into the problem with virtual PHIs.
>
> There's mark_virtual*for_renaming (sp?).
>
> But as said you are performing loop distribution so please enhance the 
> existing pass rather than writing a new one.
I happen to be working on loop distribution now (if I guess correctly,
to get hmmer fixed).  So far my idea is to fuse the finest distributed
loops in two passes: in the first pass we merge all SCCs due to "true"
data dependences; in the second one we identify all SCCs and break
them on dependence edges due to possible aliasing.  Breaking SCCs with
a minimal edge set can be modeled as the feedback arc set problem,
which is NP-hard.  Fortunately the problem is small in our case and
there are approximation algorithms.  OTOH, we should also improve loop
distribution/fusion to maximize parallelism / minimize
synchronization, as well as maximize data locality, but I think this
is not needed to get hmmer vectorized.

Thanks,
bin
>
> Richard.
>
>>Steve Ellcey
>>sell...@cavium.com
>


Question about comment in vect_prune_runtime_alias_test_list

2017-05-18 Thread Bin.Cheng
Hi,
In that function, we have below comments:

  /* Basically, for each pair of dependent data refs store_ptr_0
 and load_ptr_0, we create an expression:

 ((store_ptr_0 + store_segment_length_0) <= load_ptr_0)
 || (load_ptr_0 + load_segment_length_0) <= store_ptr_0))

 for aliasing checks.  However, in some cases we can decrease
 the number of checks by combining two checks into one.  For
 example, suppose we have another pair of data refs store_ptr_0
 and load_ptr_1, and if the following condition is satisfied:

 load_ptr_0 < load_ptr_1  &&
 load_ptr_1 - load_ptr_0 - load_segment_length_0 < store_segment_length_0

 (this condition means, in each iteration of vectorized loop,
 the accessed memory of store_ptr_0 cannot be between the memory
 of load_ptr_0 and load_ptr_1.)

 we then can use only the following expression to finish the
 alising checks between store_ptr_0 & load_ptr_0 and
 store_ptr_0 & load_ptr_1:

 ((store_ptr_0 + store_segment_length_0) <= load_ptr_0)
 || (load_ptr_1 + load_segment_length_1 <= store_ptr_0))

 Note that we only consider that load_ptr_0 and load_ptr_1 have the
 same basic address.  */

I don't quite follow the example.  If (load_ptr_0 < load_ptr_1) is
true, then IIUC the checks for store_ptr_0 & load_ptr_0 and for
store_ptr_0 & load_ptr_1 can be simplified to:

 ((store_ptr_0 + store_segment_length_0) <= load_ptr_0)
 || (load_ptr_1 + load_segment_length_1 <= store_ptr_0))

under conditions:
 load_ptr_0 + load_segment_length_0 < load_ptr_1 + load_segment_length_1

But I can't get this from:

 load_ptr_0 < load_ptr_1  &&
 load_ptr_1 - load_ptr_0 - load_segment_length_0 < store_segment_length_0


Also, it looks like the comment doesn't match the code either.  The code
merges the checks for store_ptr_0 & load_ptr_0 and store_ptr_0 &
load_ptr_1, but the comment reads more like mixed checks between the two
pairs.
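For my own sanity I brute-forced the direction derived above with a
throwaway checker (plain C, not GCC code); it enumerates small integer
configurations and asserts that, under l0 < l1 and
l0 + ll0 <= l1 + ll1, passing the merged test implies passing both
per-pair tests:

#include <assert.h>
#include <stdio.h>

int main (void)
{
  /* s/sl: store pointer and segment length; l0/ll0 and l1/ll1: the
     two load pointers and their segment lengths.  */
  for (int s = 0; s < 8; s++)
    for (int sl = 1; sl <= 3; sl++)
      for (int l0 = 0; l0 < 8; l0++)
        for (int ll0 = 1; ll0 <= 3; ll0++)
          for (int l1 = 0; l1 < 8; l1++)
            for (int ll1 = 1; ll1 <= 3; ll1++)
              {
                if (!(l0 < l1 && l0 + ll0 <= l1 + ll1))
                  continue;
                int merged = s + sl <= l0 || l1 + ll1 <= s;
                int pair0 = s + sl <= l0 || l0 + ll0 <= s;
                int pair1 = s + sl <= l1 || l1 + ll1 <= s;
                if (merged)
                  assert (pair0 && pair1);
              }
  puts ("no counterexample");
  return 0;
}

The assert never triggers, which matches the derivation.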

Thanks,
bin


Re: Duplicating loops and virtual phis

2017-05-22 Thread Bin.Cheng
On Mon, May 22, 2017 at 1:42 AM, Kugan Vivekanandarajah
 wrote:
> Hi Bin and Steve,
>
> On 17 May 2017 at 19:41, Bin.Cheng  wrote:
>> On Mon, May 15, 2017 at 7:32 PM, Richard Biener
>>  wrote:
>>> On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey  
>>> wrote:
>>>>On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
>>>>> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey wrote:
>>>>> >
>>>>> > (Short version of this email, is there a way to recalculate/rebuild
>>>>> > virtual
>>>>> > phi nodes after modifying the CFG.)
>>>>> >
>>>>> > I have a question about duplicating loops and virtual phi nodes.
>>>>> > I am trying to implement the following optimization as a pass:
>>>>> >
>>>>> > Transform:
>>>>> >
>>>>> >   for (i = 0; i < n; i++) {
>>>>> >A[i] = A[i] + B[i];
>>>>> >C[i] = C[i-1] + D[i];
>>>>> >   }
>>>>> >
>>>>> > Into:
>>>>> >
>>>>> >   if (noalias between A&B, A&C, A&D)
>>>>> >for (i = 0; i < 100; i++)
>>>>> >A[i] = A[i] + B[i];
>>>>> >for (i = 0; i < 100; i++)
>>>>> >C[i] = C[i-1] + D[i];
>>>>> >   else
>>>>> >for (i = 0; i < 100; i++) {
>>>>> >A[i] = A[i] + B[i];
>>>>> >C[i] = C[i-1] + D[i];
>>>>> >}
>>>>> >
>>>>> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot
>>>>be
>>>>> > vectorized so it gives up and does not vectorize the loop.  If we
>>>>split
>>>>> > up the loop into two loops then the vector add with A[i] could be
>>>>> > vectorized
>>>>> > even if the one with C[i] could not.
>>>>> Loop distribution does this transform but it doesn't know about
>>>>> versioning for unknown dependences.
>>>>>
>>>>
>>>>Yes, I looked at loop distribution.  But it only works with global
>>>>arrays and not with pointer arguments where it doesn't know the size of
>>>>the array being pointed at.  I would like to be able to have it work
>>>>with pointer arguments.  If I call a function with 2 or
>>>>more integer pointers, and I have a loop that accesses them with
>>>>offsets between 0 and N where N is loop invariant then I should have
>>>>enough information (at runtime) to determine if there are overlapping
>>>>memory accesses through the pointers and determine whether or not I can
>>>>distribute the loop.
>>>
>>> Not sure where you got that from. Loop distribution works with our data 
>>> reference / dependence analysis.  The cost model might be more restricted 
>>> but that can be fixed.
>>>
>>>>The loop splitting code seemed like a better template since it already
>>>>knows how to split a loop based on a runtime determined condition. That
>>>>part seems to be working for me, it is when I try to
>>>>distribute/duplicate one of those loops (under the unaliased condition)
>>>>that I am running into the problem with virtual PHIs.
>>>
>>> There's mark_virtual*for_renaming (sp?).
>>>
>>> But as said you are performing loop distribution so please enhance the 
>>> existing pass rather than writing a new one.
>> I happen to be working on loop distribution now (If guess correctly,
>> to get hmmer fixed).  So far my idea is to fuse the finest distributed
>> loop in two passes, in the first pass, we merge all SCCs due to "true"
>> data dependence; in the second one we identify all SCCs and breaks
>> them on dependent edges due to possible alias.  Breaking SCCs with
>> minimal edge set can be modeled as Feedback arc set problem which is
>> NP-hard. Fortunately the problem is small in our case and there are
>> approximation algorithms.  OTOH, we should also improve loop
>> distribution/fusion to maximize parallelism / minimize
>> synchronization, as well as maximize data locality, but I think this
>> is not needed to get hmmer vectorized.
>
> I am also looking into vectorizing the hmmer loop. Glad to know you are also
> looking at this problem and looking forward to seeing the patches.

Lack of capability to represent arbitrary alias dependent information

2017-06-12 Thread Bin.Cheng
Hi,
GCC adds runtime alias checks for data references in passes like the
vectorizer; it would be very useful to pass the runtime alias
information along to later passes.  Given the example below:

int foo (int *a, int *b, int *c, int *d, int *e, int len, int v)
{
  int k, x;
  for (k = 0; k < len; k++)
{
  c[k] = a[k-1] + b[k-1];
  if ((x = d[k-1] + e[k-1]) > c[k]) c[k] = x;
  if (c[k] < v) c[k] = v;
}

  return 0;
}
After the vectorizer's runtime alias check, we know that c doesn't alias
the a/b/d/e arrays.  This enables dead store elimination for the c array.
The question is how to pass the information to dse (or predcom, etc.).
So far MR_DEPENDENCE_* has been suggested, but I found it's not capable
of that.  The fundamental issue is that MR_DEPENDENCE_* can only represent
alias relations between references in a strongly connected component of
the dependence graph, while in most cases (like this one) the dependence
graph is not an SCC.  In general, suppose we have the dependence graph
below:
Dependence Graph: G
V: {x(write), y(write), a(read), b(read)}
E: <x, a>
 <x, b>
 <y, a>
 <y, b>
Since a and b are reads, we don't need to know the relation between a
and b; also there may be any dependence relation between x
and y.  In this case, we can't assign x, a and b to one clique.  We
can't assign x and y to different cliques either, because they both
are non-aliasing with a/b.  As a matter of fact, we need a way to
represent an arbitrary dependence graph, rather than SCCs only.
So, any suggestions?

Thanks,
bin


Re: Lack of capability to represent arbitrary alias dependent information

2017-06-12 Thread Bin.Cheng
On Mon, Jun 12, 2017 at 10:15 AM, Richard Biener
 wrote:
> On Mon, Jun 12, 2017 at 11:02 AM, Bin.Cheng  wrote:
>> HI,
>> GCC adds runtime alias checks for data references in passes like
>> vectorizer, it would be very useful to pass along the runtime alias
>> dependent information to later passes.  Given below expample:
>>
>> int foo (int *a, int *b, int *c, int *d, int *e, int len, int v)
>> {
>>   int k, x;
>>   for (k = 0; k < len; k++)
>> {
>>   c[k] = a[k-1] + b[k-1];
>>   if ((x = d[k-1] + e[k-1]) > c[k]) c[k] = x;
>>   if (c[k] < v) c[k] = v;
>> }
>>
>>   return 0;
>> }
>> After vectorizer's runtime alias check, we know that c doesn't alias
>> to a/b/d/e arrays.  This enables dead store elimination for c array.
>> The question is how to pass the information to dse (or predcom, etc.).
>> So far MR_DEPENDENCE_* is suggested, but I found it's not capable of
>> that.  The fundamental issue is MR_DEPENDENCE_* can only represent
>> alias relations between references in a strong connected component of
>> dependence graph, while in most cases (like this one) the dependence
>> graph is not SCC.  In general, if we have below dependence graph:
>> Dependence Graph: G
>> V: {x(write), y(write), a(read), b(read)}
>> E: <x, a>
>>  <x, b>
>>  <y, a>
>>  <y, b>
>> Since a and b are reads, we don't need to know the relations between a
>> and b;  also it's possible to have any dependence relation between x
>> and y.  In this case, we can't assign x, a and b into one clique.  We
>> can't assign x and y into different clique either because they both
>> are not-aliasing to a/b.  As a matter of fact, we need a way to
>> represent arbitrary dependence graph, rather than SCC only.
>> So any suggestions?
>
> The "simplest" solution is to assign the same BASE to those.
>
> This is how restrict works to record dependence on not restrict
> marked pointers or decls.
Yes, right.  I missed the base part.  In this case, the dependence
info can model the runtime alias information well.
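Concretely, for the graph above, a minimal sketch of the encoding (the
struct and driver below are made up for illustration and are not GCC's
actual representation; the rule mirrors MR_DEPENDENCE_CLIQUE/BASE: same
clique plus different bases means known independent, anything else
means no information):

#include <stdio.h>

struct ref { unsigned clique, base; };

static int known_independent (struct ref p, struct ref q)
{
  return p.clique != 0 && p.clique == q.clique && p.base != q.base;
}

int main (void)
{
  /* The suggested assignment: both writes share one base, both reads
     share another, everything in the same clique.  */
  struct ref x = {1, 1}, y = {1, 1}, a = {1, 2}, b = {1, 2};

  /* The four checked pairs come out independent ...  */
  printf ("x-a %d x-b %d y-a %d y-b %d\n",
          known_independent (x, a), known_independent (x, b),
          known_independent (y, a), known_independent (y, b));
  /* ... while x/y and a/b remain "don't know".  */
  printf ("x-y %d a-b %d\n",
          known_independent (x, y), known_independent (a, b));
  return 0;
}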

Thanks,
bin
>
> So it's not just SCCs but slightly more powerful (but only very
> slightly).
>
> Richard.
>
>> Thanks,
>> bin


Re: Lack of capability to represent arbitrary alias dependent information

2017-06-12 Thread Bin.Cheng
On Mon, Jun 12, 2017 at 11:46 AM, Bin.Cheng  wrote:
> On Mon, Jun 12, 2017 at 10:15 AM, Richard Biener
>  wrote:
>> On Mon, Jun 12, 2017 at 11:02 AM, Bin.Cheng  wrote:
>>> HI,
>>> GCC adds runtime alias checks for data references in passes like
>>> vectorizer, it would be very useful to pass along the runtime alias
>>> dependent information to later passes.  Given below expample:
>>>
>>> int foo (int *a, int *b, int *c, int *d, int *e, int len, int v)
>>> {
>>>   int k, x;
>>>   for (k = 0; k < len; k++)
>>> {
>>>   c[k] = a[k-1] + b[k-1];
>>>   if ((x = d[k-1] + e[k-1]) > c[k]) c[k] = x;
>>>   if (c[k] < v) c[k] = v;
>>> }
>>>
>>>   return 0;
>>> }
>>> After vectorizer's runtime alias check, we know that c doesn't alias
>>> to a/b/d/e arrays.  This enables dead store elimination for c array.
>>> The question is how to pass the information to dse (or predcom, etc.).
>>> So far MR_DEPENDENCE_* is suggested, but I found it's not capable of
>>> that.  The fundamental issue is MR_DEPENDENCE_* can only represent
>>> alias relations between references in a strong connected component of
>>> dependence graph, while in most cases (like this one) the dependence
>>> graph is not SCC.  In general, if we have below dependence graph:
>>> Dependence Graph: G
>>> V: {x(write), y(write), a(read), b(read)}
>>> E: <x, a>
>>>  <x, b>
>>>  <y, a>
>>>  <y, b>
>>> Since a and b are reads, we don't need to know the relations between a
>>> and b;  also it's possible to have any dependence relation between x
>>> and y.  In this case, we can't assign x, a and b into one clique.  We
>>> can't assign x and y into different clique either because they both
>>> are not-aliasing to a/b.  As a matter of fact, we need a way to
>>> represent arbitrary dependence graph, rather than SCC only.
>>> So any suggestions?
>>
>> The "simplest" solution is to assign the same BASE to those.
>>
>> This is how restrict works to record dependence on not restrict
>> marked pointers or decls.
> Yes, right.  I missed the base part.  In this case, the dependence
> info can well model runtime alias information.
Hmm, my bad for being over-optimistic.  So here is the model: after
fusing all SCCs, the condition for modeling the fused dependence graph
using clique/base is as follows:
for each connected component of the fused graph, the diameter of the
component is no larger than 2.  (For example, a path x - a - y - b of
known-independence edges has diameter 3 and cannot be encoded: the
unknown pairs x/y, a/b and x/b force base(x) = base(y), base(a) = base(b)
and base(x) = base(b), hence base(x) = base(a), contradicting the edge
x - a.)  I believe this holds for the majority of cases, but a full
proof is complicated.  In particular, it might not always hold in the
case of restrict pointers.  I will think about whether we can handle
only the simplest cases.

Thanks,
bin
>
> Thanks,
> bin
>>
>> So it's not just SCCs but slightly more powerful (but only very
>> slightly).
>>
>> Richard.
>>
>>> Thanks,
>>> bin


Re: Optimization for static local variables

2017-06-14 Thread Bin.Cheng
On Wed, Jun 14, 2017 at 12:14 PM, Prachi Godbole
 wrote:
> I'm developing a solution to optimize away intermediate stores (loads) for 
> static local variables which are assigned to before referenced on every path 
> through a function.
>
> Currently GCC eliminates all loads/stores in a straight line code but not in 
> control structures. AFAIU, passes like sccvn, lim (store motion in 
> particular) sink etc. try to propagate the assignment exprs through SSA vars 
> instead of through memory thereby eliminating almost all loads and some 
> stores but still not all.
>
> For example:
>
> void some_function()
> {
> static some_type sum;
> ...
> for( i = 0 ; i < n ; i++ )
>  {
>  sum = matrix [i][n] ;
>
>  for( j = 0 ; j < i ; j++ )
>  {
>  sum -= matrix [i][j] * matrix [j][n] ;
>  }
>
> matrix [i][n] = sum ;
> }
> ...
> }
>
> Resulting SSA:
>
> ...
>[2.25%]:
>   _10 = matrix[0][n_13(D)];
>   sum = _10;
>   goto ; [100.00%]
> ...
>[12.75%]:
>   _1 = matrix[i_16][n_13(D)];
>   if (i_16 > 0)
> goto ; [100.00%]
>   else
> goto ; [0.00%]
> ...
>[85.00%]:
>   # j_24 = PHI 
>   # sum_lsm.4_27 = PHI <_6(6), _1(4)>
>   _3 = matrix[i_16][j_24];
>   _4 = matrix[j_24][n_13(D)];
>   _5 = _3 * _4;
>   _6 = sum_lsm.4_27 - _5;
>   j_18 = j_24 + 1;
>   if (i_16 > j_18)
> goto ; [85.00%]
>   else
> goto ; [15.00%]
> ...
>[12.75%]:
>   # _40 = PHI <_6(6)>
>   goto ; [100.00%]
>
>[12.75%]:
>   # sum_lsm.4_19 = PHI <_40(7), _1(4)>
>
>[15.00%]:
>   # i_20 = PHI 
>   # sum_lsm.4_8 = PHI 
>   # sum_lsm.5_22 = PHI <1(9), 0(14)>
>   matrix[i_20][n_13(D)] = sum_lsm.4_8;
>   i_16 = i_20 + 1;
>   if (n_13(D) > i_16)
> goto ; [85.00%]
>   else
> goto ; [15.00%]
>
>[2.25%]:
>   # sum_lsm.4_39 = PHI 
>   # sum_lsm.5_38 = PHI 
>   if (sum_lsm.5_38 != 0)
> goto ; [0.00%]
>   else
> goto ; [100.00%]
>
>[0.00%]:
>   sum = sum_lsm.4_39;
> ...
>
> Although all intermediate loads are eliminated here, store to sum before and 
> after the 'j' loop (see bb 14 and bb 12) still remain which can be eliminated.
It looks like a job for dead store elimination of the stores to sum.
Maybe static local variables can be handled better there?
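What I would expect as the end result, written by hand (a sketch under
the stated premise that sum is assigned before being referenced on
every path, with made-up declarations for matrix and n since the
original elides them):

extern double matrix[100][101];

void some_function (int n)
{
  /* 'static double sum' fully privatized: under the premise its memory
     is never read before being overwritten, so neither the
     intermediate nor the final stores survive.  */
  double s;

  for (int i = 0; i < n; i++)
    {
      s = matrix[i][n];
      for (int j = 0; j < i; j++)
        s -= matrix[i][j] * matrix[j][n];
      matrix[i][n] = s;
    }
}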

Thanks,
bin
>
> My solution would reuse a lot of data structures and routines from the sink 
> pass. So my question is, can I add this optimization into the sink pass 
> itself or would it require a new pass and if yes, what would be the ideal 
> place for it.
> Any other pointers are more than welcome.
>
> Thanks,
> Prachi


Re: How to migrate POINTER_TYPE_OVERFLOW_UNDEFINED for GCC v8.x?

2017-08-04 Thread Bin.Cheng
On Fri, Aug 4, 2017 at 8:00 AM, Leslie Zhai  wrote:
> Hi GCC developers,
>
> As ChangeLog mentioned:
>
> 2017-08-01  Bin Cheng 
>
> * tree.h (POINTER_TYPE_OVERFLOW_UNDEFINED): Delete.
>
>
> Then how to migrate it for GCC v8.x? for example:
>
> Constant *Result = POINTER_TYPE_OVERFLOW_UNDEFINED ? A : B;
>
> migrated to
>
> Constant *Result = A;
Yes, basically this is what I did in the patch: simply assuming TRUE
and simplifying the expression.

Thanks,
bin
>
>
> because there was a patch 'Make pointer overflow always undefined and remove
> the macro' https://patchwork.ozlabs.org/patch/792677/
>
> please give me some hint, thanks a lot!
>
> --
> Regards,
> Leslie Zhai - a LLVM developer https://reviews.llvm.org/p/xiangzhai/
>
>
>


Re: Overwhelmed by GCC frustration

2017-08-17 Thread Bin.Cheng
On Thu, Aug 17, 2017 at 6:22 PM,   wrote:
>
>> On Aug 17, 2017, at 11:22 AM, Oleg Endo  wrote:
>>
>> On Wed, 2017-08-16 at 19:04 -0500, Segher Boessenkool wrote:
>>>
>>> LRA is easier to work with than old reload, and that makes it better
>>> maintainable.
>>>
>>> Making LRA handle everything reload did is work, and someone needs to
>>> do it.
>>>
>>> LRA probably needs a few more target hooks (a _few_) to guide its
>>> decisions.
>>
>> Like Georg-Johann mentioned before, LRA has been targeted mainly for
>> mainstream ISAs.  And actually it's a pretty reasonable choice.  Again,
>> I don't think that "one RA to rule them all" is a scalable approach.
>>  But that's just my opinion.
>
> I think G-J said "...  LRA focusses just comfortable, orthogonal targets" 
> which is not quite the same thing.
>
> I'm a bit curious about that, since x86 is hardly "comfortable orthogonal".  
> But if LRA is targeted only at some of the ISA styles that are out in the 
> world, which ones are they, and why the limitation?
>
> One of GCC's great strength is its support for many ISAs.  Not all to the 
> same level of excellence, but there are many, and adding more is easy at 
> least for an initial basic level of support.  When this is needed, GCC is the 
> place to go.
>
> I'd like to work on moving one of the remaining CC0 targets to the new way, 
> but if the reality is that GCC is trying to be "mainstream only" then that 
> may not be a good way for me to spend my time.
Hi,
I don't believe GCC ever tries to be "mainstream only".  It's somewhat
like that because the major part of the requirements comes from popular
architectures, and a large part of patches are developed/tested on
popular architectures.  It's an unfortunate but natural result of the
lack of developers for non-mainstream targets.  We can make it less
"mainstream only" only if we have enough developers for the less
popular architectures.  Your work will contribute to this and is highly
appreciated :)

Thanks,
bin
>
> paul
>


Re: RFC: Improving GCC8 default option settings

2017-09-14 Thread Bin.Cheng
On Thu, Sep 14, 2017 at 11:24 AM, Richard Biener
 wrote:
> On Thu, Sep 14, 2017 at 12:18 PM, Martin Liška  wrote:
>> On 09/14/2017 12:07 PM, Markus Trippelsdorf wrote:
>>> On 2017.09.14 at 11:57 +0200, Richard Biener wrote:
 On Wed, Sep 13, 2017 at 6:11 PM, Nikos Chantziaras  
 wrote:
> On 12/09/17 16:57, Wilco Dijkstra wrote:
>>
>> [...] As a result users are
>> required to enable several additional optimizations by hand to get good
>> code.
>> Other compilers enable more optimizations at -O2 (loop unrolling in LLVM
>> was
>> mentioned repeatedly) which GCC could/should do as well.
>> [...]
>>
>> I'd welcome discussion and other proposals for similar improvements.
>
>
> What's the status of graphite? It's been around for years. Isn't it mature
> enough to enable these:
>
> -floop-interchange -ftree-loop-distribution -floop-strip-mine -floop-block
>
> by default for -O2? (And I'm not even sure those are the complete set of
> graphite optimization flags, or just the "useful" ones.)

 It's not on by default at any optimization level.  The main issue is the
 lack of maintainance and a set of known common internal compiler errors
 we hit.  The other issue is that there's no benefit of turning those on for
 SPEC CPU benchmarking as far as I remember but quite a bit of extra
 compile-time cost.
>>>
>>> Not to mention the numerous wrong-code bugs. IMHO graphite should
>>> deprecated as soon as possible.
>>>
>>
>> For wrong-code bugs we've got and I recently went through, I fully agree 
>> with this
>> approach and I would do it for GCC 8. There are PRs where order of simple 2 
>> loops
>> is changed, causing wrong-code as there's a data dependence.
>>
>> Moreover, I know that Bin was thinking about selection whether to use 
>> classical loop
>> optimizations or Graphite (depending on options provided). This would 
>> simplify it ;)
>
> I don't think removing graphite is warranted, I still think it is the
> approach to use when
> handling non-perfect nests.
Hi,
IMHO, we should not be in a hurry to remove graphite, even though we are
introducing some traditional loop transformations.  It's a fairly
standalone part of GCC and supports more transformations.  Also, as it
gets more attention, you never know if somebody will find time to work
on it.

Thanks,
bin
>
> Richard.
>
>> Martin


Re: Handling prefetcher tag collisions while allocating registers

2017-10-24 Thread Bin.Cheng
On Tue, Oct 24, 2017 at 12:44 AM, Kugan Vivekanandarajah
 wrote:
> Hi All,
>
> I am wondering if there is anyway we can prefer certain registers in
> register allocations. That is, I want to have some way of recording
> register allocation decisions (for loads in loop that are accessed in
> steps) and use this to influence register allocation of other loads
> (again that are accessed in steps).
>
> This is for architectures (like falkor AArch64) that use hardware
> prefetchers that use signatures of the loads to lock onto and tune
> prefetching parameters. Ideally, If the loads are from the same
> stream, they should have same signature and if they are from different
> stream, they should have different signature. Destination, base
> register and offset are used in the signature. Therefore, selecting
> different register can influence this.
I wonder why the destination register is used in the signature.  In an
extreme case, a load in a loop can be unrolled and then allocated to
different destination registers.  Forcing the same destination register
could be too restrictive.
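For instance (a made-up reduction; the point is only that one logical
stream turns into several differently-named destinations):

double sum_stream (const double *p, int n)
{
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  /* After 4x unrolling, the allocator will typically give these four
     loads four different destination registers, so a signature that
     includes the destination sees four different streams (assumes n
     is a multiple of 4).  */
  for (int i = 0; i < n; i += 4)
    {
      s0 += p[i + 0];
      s1 += p[i + 1];
      s2 += p[i + 2];
      s3 += p[i + 3];
    }
  return s0 + s1 + s2 + s3;
}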

Thanks,
bin

>
> In LLVM, this is implemented as a machine specific pass that runs
> after register allocation. It then inserts mov instruction with
> scratch registers to manage this. We can do a machine reorg pass in
> gcc but detecting strided loads at that stage is not easy.
>
> I am trying to implement this in gcc and wondering what is the
> preferred and acceptable way to implement this. Any thoughts ?
>
> Thanks,
> Kugan


Question about generated type for common block in fortran

2017-10-26 Thread Bin.Cheng
Hi,
I am looking into DSE transformations of some Fortran code.  Given the
Fortran declarations below:

   real*8  a(len) , b(len) ,  c(len) , d(len)
   common /area/  a, b, c, d
   real*8 src1(len), temp1(len), temp2(len), src2(len)
   equivalence(src1, a), (src2, b), (temp1, c), (temp2, d)

With my limited understanding of Fortran, the layout should be like
(pairing the fields per the equivalence statements):
struct area
{
  union {double src1[len]; double a[len];};
  union {double src2[len]; double b[len];};
  union {double temp1[len]; double c[len];};
  union {double temp2[len]; double d[len];};
};

When debugging in tree-ssa-dse.c, if I have memory reference like
area.src1[idx], the type for area dumped is like:
(gdb) call debug_generic_expr(ref1->exp.operands[0])
area
(gdb) call debug_generic_expr(ref1->exp.operands[0]->typed.type)
union
{
  real(kind=8) src1[100];
  real(kind=8) a[100];
  real(kind=8) src2[100];
  real(kind=8) b[100];
  real(kind=8) temp1[100];
  real(kind=8) c[100];
  real(kind=8) temp2[100];
  real(kind=8) d[100];
}
I did notice the src1/src2 fields have different offsets, although they
are put into a single union type here.

I suppose such types are generated by the Fortran front-end?  Is it just
because it's easy to do so?  Is there any background comment/documentation
about this?
Unfortunately the middle end misses lots of dead stores and redundant
loads because of the difficulty of alias checks on such type information.
I guess fully supporting alias checks for this would be non-trivial, so
any suggestions will be highly appreciated too.

Thanks,
bin

