Why does gcc generate const local array on stack?

2016-04-20 Thread Bingfeng Mei
Hi,
I came across the following issue.

int foo (int N)
{
  const int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

  return a[N];
}

Compile with x86 O2

foo:
.LFB0:
.cfi_startproc
movslq %edi, %rdi
movl $0, -56(%rsp)
movl $1, -52(%rsp)
movl $2, -48(%rsp)
movl $3, -44(%rsp)
movl $4, -40(%rsp)
movl $5, -36(%rsp)
movl $6, -32(%rsp)
movl $7, -28(%rsp)
movl $8, -24(%rsp)
movl $9, -20(%rsp)
movl -56(%rsp,%rdi,4), %eax
ret

The array is placed on stack and GCC has to generate a sequence of
instructions to
initialize the array every time the function is called.

On the contrary, LLVM moves the array to global data and doesn't need
initialization
within the function.

If I add static to the array, GCC behaves the same as LLVM, just as expected.
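For reference, this is the variant I mean; only the storage-class specifier
differs (the function name is just for illustration):

int foo_static (int N)
{
  static const int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

  return a[N];
}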

Is there some subtle C standard issue, or some switch I didn't turn
on? I understand that if this function were recursive and a pointer to
the array were involved, GCC would have to keep the array on the stack
and hence the initialization. But here the code is very simple. I don't
understand the logic of the generated code, or is this a missed
optimization opportunity?

Thanks,
Bingfeng Mei


Re: Re: Why does gcc generate const local array on stack?

2016-04-21 Thread Bingfeng Mei
I agree with you on this example.

But my original code, as Jonathan pointed out, is not recursive and
the address of "a" does not escape the function in any way. I believe
it is a valid transformation.

BTW, LLVM still compiles your example by moving the const array to rodata,
which I think is wrong and will fail the test.

Cheers,
Bingfeng

On Thu, Apr 21, 2016 at 3:41 AM, lh_mouse  wrote:
> See this example: http://coliru.stacked-crooked.com/a/048b4aa5046da11b
>
> In this example the function is called recursively.
> During each call a pointer to that local array is appended to a static array 
> of pointers.
> Should a new instance of that local array of const int be created every time, 
> abort() will never be called.
> Since calling a library function is observable behavior, clang's optimization 
> has effectively changed that program's behavior. Hence I think it is wrong.
>
> [code]
> #include <stdlib.h>
>
> static const int *ptrs[2];
> static unsigned recur;
>
> void foo(){
>   const int a[] = {0,1,2,3,4,5,6,7,8,9};
>   ptrs[recur] = a;
>   if(recur == 0){
> ++recur;
> foo();
>   }
>   if(ptrs[0] == ptrs[1]){
> abort();
>   }
> }
>
> int main(){
>   foo();
> }
> [/code]
>
> --
> Best regards,
> lh_mouse
> 2016-04-21
>
> -----
> From: Jonathan Wakely 
> Date: 2016-04-21 01:51
> To: lh_mouse
> Cc: Bingfeng Mei, gcc
> Subject: Re: Why does gcc generate const local array on stack?
>
> On 20 April 2016 at 18:31, lh_mouse wrote:
>> I tend to say clang is wrong here.
>
> If you can't detect the difference then it is a valid transformation.
>
>> Your identifier 'a' has no linkage. Your object designated by 'a' does not 
>> have a storage-class specifier.
>> So it has automatic storage duration and 6.2.4/7 applies: 'If the scope is 
>> entered recursively, a new instance of the object is created each time.'
>
> How do you tell the difference between a const array that is recreated
> each time and one that isn't?
>
>> Interesting enough, ISO C doesn't say whether distinct objects should have 
>> distinct addresses.
>> It is worth noting that this is explicitly forbidden in ISO C++ because 
>> distinct complete objects shall have distinct addresses:
>
> If the object's address doesn't escape from the function then I can't
> think of a way to tell the difference.
>
>


Re: Re: Why does gcc generate const local array on stack?

2016-04-21 Thread Bingfeng Mei
Richard, thanks for the explanation. I found an option,
-fmerge-all-constants, which I can use to work around this for now.

Bingfeng

On Thu, Apr 21, 2016 at 11:15 AM, Richard Biener
 wrote:
> On Thu, Apr 21, 2016 at 11:39 AM, Jonathan Wakely  
> wrote:
>> On 21 April 2016 at 03:41, lh_mouse wrote:
>>> See this example: http://coliru.stacked-crooked.com/a/048b4aa5046da11b
>>>
>>> In this example the function is called recursively.
>>
>> See the original email you replied to:
>>
>> "I understand if this function is recursive and pointer of the array
>> is involved, GCC would have to maintain the array on stack and hence
>> the initialization."
>>
>> The question is about cases where that doesn't happen.
>
> The decision on whether to localize the array and inline the init is
> done at gimplification time.
> The plan is to delay this until SRA which could then also apply the
> desired optimization
> of removing the local in case it is never written to.
>
> Richard.


Why is this not optimized?

2014-05-14 Thread Bingfeng Mei
Hi, 
I am looking at some code of our target, which is not optimized as expected. 
For the following RTX, I expect source of insn 17 should be propagated into 
insn 20, and insn 17 is eliminated as a result. On our target, it will become a 
predicated xor instruction instead of two. Initially, I thought fwprop pass 
should do this. 

(insn 17 16 18 3 (set (reg/v:HI 102 [ crc ])
(xor:HI (reg/v:HI 108 [ crc ])
(const_int 16386 [0x4002]))) coremark.c:1632 725 {xorhi3}
 (nil))
(insn 18 17 19 3 (set (reg:BI 113)
(ne:BI (reg:QI 101 [ D.4446 ])
(const_int 1 [0x1]))) 1397 {cmp_qimode}
 (nil))
(jump_insn 19 18 55 3 (set (pc)
(if_then_else (ne (reg:BI 113)
(const_int 0 [0]))
(label_ref 23)
(pc))) 1477 {cbranchbi4}
 (expr_list:REG_DEAD (reg:BI 113)
(expr_list:REG_BR_PROB (const_int 7100 [0x1bbc])
(expr_list:REG_PRED_WIDTH (const_int 1 [0x1])
(nil
 -> 23)
(note 55 19 20 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 20 55 23 4 (set (reg:HI 112 [ crc ])
(reg/v:HI 102 [ crc ])) 502 {fp_movhi}
 (expr_list:REG_DEAD (reg/v:HI 102 [ crc ])
(nil)))
(code_label 23 20 56 5 2 "" [1 uses])


But it can't. First, propagate_rtx_1 returns false because PR_CAN_APPEAR is
false and the following code is executed.

  if (x == old_rtx)
    {
      *px = new_rtx;
      return can_appear;
    }

Even if I force PR_CAN_APPEAR to be set in flags, fwprop still won't go ahead in
try_fwprop_subst because old_cost is 0 (a REG-only rtx), and set_src_cost
(SET_SRC (set), speed) is bigger than 0. So the change is deemed not
profitable, which is not correct IMO.

If fwprop is not the place to do this optimization, where should it be done? I 
am working on up-to-date GCC 4.8. 

Thanks,
Bingfeng Mei


RE: Why is this not optimized?

2014-05-15 Thread Bingfeng Mei
Thanks for the reply. I will look at the patch. As far as the cost is
concerned, I think fwprop doesn't really need to understand the pipeline model. As
long as the rtx cost after optimization is less than before, I think
it is good enough. Of course, it won't be better in every case, but it should be
better in general.

Cheers,
Bingfeng

-Original Message-
From: Bin.Cheng [mailto:amker.ch...@gmail.com] 
Sent: 15 May 2014 06:59
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Why is this not optimized?

On Wed, May 14, 2014 at 9:14 PM, Bingfeng Mei  wrote:
> Hi,
> I am looking at some code of our target, which is not optimized as expected. 
> For the following RTX, I expect source of insn 17 should be propagated into 
> insn 20, and insn 17 is eliminated as a result. On our target, it will become 
> a predicated xor instruction instead of two. Initially, I thought fwprop pass 
> should do this.
>
> (insn 17 16 18 3 (set (reg/v:HI 102 [ crc ])
> (xor:HI (reg/v:HI 108 [ crc ])
> (const_int 16386 [0x4002]))) coremark.c:1632 725 {xorhi3}
>  (nil))
> (insn 18 17 19 3 (set (reg:BI 113)
> (ne:BI (reg:QI 101 [ D.4446 ])
> (const_int 1 [0x1]))) 1397 {cmp_qimode}
>  (nil))
> (jump_insn 19 18 55 3 (set (pc)
> (if_then_else (ne (reg:BI 113)
> (const_int 0 [0]))
> (label_ref 23)
> (pc))) 1477 {cbranchbi4}
>  (expr_list:REG_DEAD (reg:BI 113)
> (expr_list:REG_BR_PROB (const_int 7100 [0x1bbc])
> (expr_list:REG_PRED_WIDTH (const_int 1 [0x1])
> (nil
>  -> 23)
> (note 55 19 20 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 20 55 23 4 (set (reg:HI 112 [ crc ])
> (reg/v:HI 102 [ crc ])) 502 {fp_movhi}
>  (expr_list:REG_DEAD (reg/v:HI 102 [ crc ])
> (nil)))
> (code_label 23 20 56 5 2 "" [1 uses])
>
>
> But it can't. First propagate_rtx_1 will return false because PR_CAN_APPEAR 
> is false and
> following code is executed.
>
>   if (x == old_rtx)
> {
>   *px = new_rtx;
>   return can_appear;
> }
>
> Even if I force PR_CAN_APPEAR to be set in flags, fwprop still won't go ahead in
> try_fwprop_subst because old_cost is 0 (a REG-only rtx), and set_src_cost 
> (SET_SRC (set),
> speed) is bigger than 0. So the change is deemed as not profitable, which is 
> not correct
> IMO.
Pass fwprop is too conservative with respect to propagation
opportunities outside of memory references; it just gives up in many
places.  Also, as in your case, it seems it doesn't take into
consideration that the original insn can be removed after propagation.

Wei Mi once sent a patch re-implementing the fwprop pass at
https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00617.html .
I also did some experiments and worked out a local patch doing similar
work to handle cases exactly like yours.
The problem is even though one instruction can be saved (as in your
case), it's not always good, because it tends to generate more complex
instructions, and such insns are somehow more vulnerable to pipeline
hazard.  Unfortunately, it's kind of impossible for fwprop to
understand the pipeline risk.

Thanks,
bin
>
> If fwprop is not the place to do this optimization, where should it be done? 
> I am working on up-to-date GCC 4.8.
>
> Thanks,
> Bingfeng Mei



-- 
Best Regards.


RE: Register Pressure guided Unroll and Jam in GCC !!

2014-06-17 Thread Bingfeng Mei
That is true. Early estimation of register pressure should be improved. Right
now I am looking at an example where IVOPTS produces too many induction variables and
causes a lot of register spilling. Though the ivopts pass calls the
estimate_reg_pressure_cost function, the results are not even close to the real
situation.

Bingfeng

-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of 
Vladimir Makarov
Sent: 16 June 2014 19:37
To: Ajit Kumar Agarwal; gcc@gcc.gnu.org
Cc: Michael Eager; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; 
Nagaraju Mekala
Subject: Re: Register Pressure guided Unroll and Jam in GCC !!

On 2014-06-16, 10:14 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler where the Register Pressure Guided 
> Unroll and Jam gave a good amount of performance improvement for the  C and 
> C++ Spec Benchmark and also Fortran benchmarks.
>
> The Unroll and Jam increases the register pressure in the Unrolled Loop 
> leading to increase in the Spill and Fetch degrading the performance of the 
> Unrolled Loop. The Performance of Cache locality achieved through Unroll and 
> Jam is degraded with the presence of Spilling instruction due to increases in 
> register pressure Its better to do the decision  of Unrolled Factor of the 
> Loop based on the Performance model of the register pressure.
>
> Most of the Loop Optimization Like Unroll and Jam is implemented in the High 
> Level IR. The register pressure based Unroll and Jam requires the calculation 
> of register pressure in the High Level IR  which will be similar to register 
> pressure we calculate on Register Allocation. This makes the implementation 
> complex.
>
> To overcome this, the Open64 compiler does the decision of Unrolling to both 
> High Level IR and also at the Code Generation Level. Some of the decisions 
> way at the end of the Code Generation . The advantage of using this approach 
> like Open64 helps in using the register pressure information calculated by 
> the Register Allocator. This helps the implementation much simpler and less 
> complex.
>
> Can we have this approach in GCC of the Decisions of Unroll and Jam in the 
> High Level IR  and also to defer some of the decision at the Code Generation 
> Level like Open64?
>
>   Please let me know what do you think.
>

Most loop optimizations are a good target for register pressure 
sensitive algorithms as loops are usually program hot spots and any 
pressure increase there would be harmful as any RA can not undo such 
complex transformations.

So I guess your proposal could work.  Right now we have only 
pressure-sensitive modulo scheduling (SMS) and loop-invariant motion (as 
I remember switching from loop-invariant motion based on some very 
inaccurate register-pressure evaluation to one based on RA pressure 
evaluation gave a nice improvement about 1% for SPECFP2000 on some 
targets).



regs_used estimation in IVOPTS seriously flawed

2014-06-17 Thread Bingfeng Mei
Hi,
I am looking at a performance regression in our code. A big loop produces
and uses a lot of temporary variables inside the loop body. The problem
appears to be that the IVOPTS pass creates even more induction variables (from the
original 2 to 27). This causes a lot of register spilling later and performance
takes a severe hit. I looked into tree-ssa-loop-ivopts.c; it does call the
estimate_reg_pressure_cost function to take the # of registers into
consideration. The second parameter, passed as data->regs_used, is supposed
to represent the old register usage before IVOPTS.

  return size + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
                                            data->body_includes_call);

In this case, it is a mere 2 by the following calculation. Essentially, it only counts
the loop-invariant registers, ignoring all registers produced/used inside the
loop.

  n = 0;
  for (psi = gsi_start_phis (loop->header); !gsi_end_p (psi); gsi_next (&psi))
{
  phi = gsi_stmt (psi);
  op = PHI_RESULT (phi);

  if (virtual_operand_p (op))
continue;

  if (get_iv (data, op))
continue;

  n++;
}

  EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, j, bi)
{
  struct version_info *info = ver_info (data, j);

  if (info->inv_id && info->has_nonlin_use)
n++;
}

  data->regs_used = n;

I believe the way regs_used is calculated is seriously flawed,
or estimate_reg_pressure_cost is problematic if n_old is
only supposed to be the loop-invariant registers. Either way,
it affects how IVOPTS makes its decisions and could result in
worse code. What do you think? Any idea on how to improve
this?
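To make the situation concrete, here is a reduced, hypothetical loop of the
kind I am describing (not our actual code): the temporaries t1..t4 only live
inside the body and are ignored, while only the invariants k1 and k2 end up
being counted towards regs_used.

void foo (int *a, int *b, int n, int k1, int k2)
{
  int i;
  for (i = 0; i < n; i++)
    {
      /* Temporaries live only inside the body: not counted.  */
      int t1 = a[i] * k1;
      int t2 = b[i] * k2;
      int t3 = t1 + t2;
      int t4 = t1 - t2;
      /* The loop invariants k1/k2 are the only registers counted above.  */
      a[i] = t3 * t4;
    }
}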


Thanks,
Bingfeng



RE: regs_used estimation in IVOPTS seriously flawed

2014-06-18 Thread Bingfeng Mei


> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 18 June 2014 12:36
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: regs_used estimation in IVOPTS seriously flawed
> 
> On Tue, Jun 17, 2014 at 4:59 PM, Bingfeng Mei  wrote:
> > Hi,
> > I am looking at a performance regression in our code. A big loop
> produces
> > and uses a lot of temporary variables inside the loop body. The
> problem
> > appears that IVOPTS pass creates even more induction variables (from
> original
> > 2 to 27). It causes a lot of register spilling later and performance
> > take a severe hit. I looked into tree-ssa-loop-ivopts.c, it does call
> > estimate_reg_pressure_cost function to take # of registers into
> > consideration. The second parameter passed as data->regs_used is
> supposed
> > to represent old register usage before IVOPTS.
> >
> >   return size + estimate_reg_pressure_cost (size, data->regs_used,
> data->speed,
> > data->body_includes_call);
> >
> > In this case, it is mere 2 by following calculation. Essentially, it
> only counts
> > all loop invariant registers, ignoring all registers produced/used
> inside the loop.
> >
> >   n = 0;
> >   for (psi = gsi_start_phis (loop->header); !gsi_end_p (psi); gsi_next
> (&psi))
> > {
> >   phi = gsi_stmt (psi);
> >   op = PHI_RESULT (phi);
> >
> >   if (virtual_operand_p (op))
> > continue;
> >
> >   if (get_iv (data, op))
> > continue;
> >
> >   n++;
> > }
> >
> >   EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, j, bi)
> > {
> >   struct version_info *info = ver_info (data, j);
> >
> >   if (info->inv_id && info->has_nonlin_use)
> > n++;
> > }
> >
> >   data->regs_used = n;
> >
> > I believe how regs_used is calculated is seriously flawed,
> > or estimate_reg_pressure_cost is problematic if n_old is
> > only supposed to be loop invariant registers. Either way,
> > it affects how IVOPTS makes decision and could result in
> > worse code. What do you think? Any idea on how to improve
> > this?
> 
> Well, it's certainly a lower bound on the number of registers
> live through the whole loop execution (thus over the backedge).
> So they have the same cost as an induction variable as far
> as register pressure is concerned.
> 
> What it doesn't account for is the maximum number of live
> registers anywhere in the loop body - but that is hard to
> estimate at this point in the compilation.  You could compute
> the maximum number of live SSA names which could be
> an upper bound on the register pressure - but that needs
> liveness analysis which is expensive also that upper bound
> is probably way too high.
> 
Yes, I agree it is hard and probably expensive at this stage of
compilation to do an accurate analysis. But it could be quite useful
for many tree-level loop optimizations, even just a half-accurate
estimate of register pressure, as also discussed in another
thread a few days ago.


> So I think the current logic is sensible and simple.  It's just
> not perfect.
> 
> Maybe it's just the cost function of the IV set choosen that
> needs to be adjusted to account for the number of IVs
> in a non-linear way?  That is, adjust ivopts_global_cost_for_size
> which just adds size to sth that pessimizes more IVs even
> more like size * (1 + size / (1 + data->regs_used)) or
> simply size ** (1. + eps) with a suitable eps < 2.
> 
I am going to try a few cost functions as you suggested. Maybe also
just count all SSA names together and divide by a factor.

Thanks,
Bingfeng


RE: regs_used estimation in IVOPTS seriously flawed

2014-06-20 Thread Bingfeng Mei


> -Original Message-
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: 20 June 2014 06:25
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: regs_used estimation in IVOPTS seriously flawed
> 
> On Tue, Jun 17, 2014 at 10:59 PM, Bingfeng Mei  wrote:
> > Hi,
> > I am looking at a performance regression in our code. A big loop
> produces
> > and uses a lot of temporary variables inside the loop body. The
> problem
> > appears that IVOPTS pass creates even more induction variables (from
> original
> > 2 to 27). It causes a lot of register spilling later and performance
> Do you have a simplified case which can be posted here?  I guess it
> affects some other targets too.
> 
> > take a severe hit. I looked into tree-ssa-loop-ivopts.c, it does call
> > estimate_reg_pressure_cost function to take # of registers into
> > consideration. The second parameter passed as data->regs_used is
> supposed
> > to represent old register usage before IVOPTS.
> >
> >   return size + estimate_reg_pressure_cost (size, data->regs_used,
> data->speed,
> > data->body_includes_call);
> >
> > In this case, it is mere 2 by following calculation. Essentially, it
> only counts
> > all loop invariant registers, ignoring all registers produced/used
> inside the loop.
> There are two kinds of registers produced/used inside the loop.  One
> is induction variable irrelevant, it includes non-linear uses as
> mentioned by Richard.  The other kind relates to induction variable
> rewrite, and one issue with this kind is expression generated during
> iv use rewriting is not reflecting the estimated one in ivopt very
> well.
> 

As a short-term solution, I tried some simple non-linear functions as Richard
suggested to penalize using too many IVs. For example, the following cost in
ivopts_global_cost_for_size fixed my regression and actually improves performance
slightly over a set of benchmarks we usually use.

  return size * (1 + size * 0.2)
         + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
                                       data->body_includes_call);

The trouble is that the choice of this non-linear function could be highly target
dependent (# of registers?). I don't have a setup to prove a performance gain for other
targets.

I also tried counting all SSA names and divide it by a factor. It does seem to 
work
so well.

Long term, if we have infrastructure to analyze maximal live variable in a loop
at tree-level, that would be great for many loop optimizations.

Thanks,
Bingfeng


RE: regs_used estimation in IVOPTS seriously flawed

2014-06-20 Thread Bingfeng Mei
Sorry, typo in previous mail. 

"I also tried counting all SSA names and divide it by a factor. It does
NOT seem to work so well"

> -Original Message-
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: 20 June 2014 10:19
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: regs_used estimation in IVOPTS seriously flawed
> 
> On Fri, Jun 20, 2014 at 5:01 PM, Bingfeng Mei  wrote:
> >
> >
> >> -Original Message-
> >> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> >> Sent: 20 June 2014 06:25
> >> To: Bingfeng Mei
> >> Cc: gcc@gcc.gnu.org
> >> Subject: Re: regs_used estimation in IVOPTS seriously flawed
> >>
> >> On Tue, Jun 17, 2014 at 10:59 PM, Bingfeng Mei 
> wrote:
> >> > Hi,
> >> > I am looking at a performance regression in our code. A big loop
> >> produces
> >> > and uses a lot of temporary variables inside the loop body. The
> >> problem
> >> > appears that IVOPTS pass creates even more induction variables
> (from
> >> original
> >> > 2 to 27). It causes a lot of register spilling later and
> performance
> >> Do you have a simplified case which can be posted here?  I guess it
> >> affects some other targets too.
> >>
> >> > take a severe hit. I looked into tree-ssa-loop-ivopts.c, it does
> call
> >> > estimate_reg_pressure_cost function to take # of registers into
> >> > consideration. The second parameter passed as data->regs_used is
> >> supposed
> >> > to represent old register usage before IVOPTS.
> >> >
> >> >   return size + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
> >> > data->body_includes_call);
> >> >
> >> > In this case, it is mere 2 by following calculation. Essentially,
> it
> >> only counts
> >> > all loop invariant registers, ignoring all registers produced/used
> >> inside the loop.
> >> There are two kinds of registers produced/used inside the loop.  One
> >> is induction variable irrelevant, it includes non-linear uses as
> >> mentioned by Richard.  The other kind relates to induction variable
> >> rewrite, and one issue with this kind is expression generated during
> >> iv use rewriting is not reflecting the estimated one in ivopt very
> >> well.
> >>
> >
> > As a short term solution, I tried some simple non-linear functions as
> Richard suggested
> 
> Oh, I misread the non-linear way as non-linear iv uses.
> 
> > to penalize using too many IVs. For example, the following cost in
> > ivopts_global_cost_for_size fixed my regression and actually improves
> performance
> > slightly over a set of benchmarks we usually use.
> 
> Great, I will try to tweak it on ARM.
> 
> >
> >   return size * (1 + size * 0.2)
> >          + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
> >                                        data->body_includes_call);
> >
> > The trouble is choice of this non-linear function could be highly
> target dependent
> > (# of registers?). I don't have setup to prove performance gain for
> other targets.
> >
> > I also tried counting all SSA names and divide it by a factor. It does
> seem to work
> 
> So the number currently computed is the lower bound which is too
> small.  Maybe it's possible to do some analysis with relatively low
> cost increasing the number somehow.  While on the other hand, doesn't
> bring restriction to IVOPT for loops with low register pressure.
> 
> Thanks,
> bin
> 
> > so well.
> >
> > Long term, if we have infrastructure to analyze maximal live variable
> in a loop
> > at tree-level, that would be great for many loop optimizations.
> >
> > Thanks,
> > Bingfeng
> 
> 
> 
> --
> Best Regards.


RE: Comparison of GCC-4.9 and LLVM-3.4 performance on SPECInt2000 for x86-64 and ARM

2014-06-25 Thread Bingfeng Mei
Thanks for nice benchmarks. Vladimir.

Why is GCC code size so much bigger than LLVM's? Does -Ofast do more unrolling
on GCC? Increasing code size doesn't seem to help performance (164.gzip &
197.parser).
Are there comparisons for O2? I guess that is more useful for typical
mobile/embedded programmers.

Bingfeng

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Vladimir Makarov
> Sent: 24 June 2014 16:07
> To: Ramana Radhakrishnan; gcc@gcc.gnu.org
> Subject: Re: Comparison of GCC-4.9 and LLVM-3.4 performance on
> SPECInt2000 for x86-64 and ARM
> 
> On 06/24/2014 10:57 AM, Ramana Radhakrishnan wrote:
> >
> > The ball-park number you have probably won't change much.
> >
> >>>
> >> Unfortunately, that is the configuration I can use on my system
> because
> >> of lack of libraries for other configurations.
> >
> > Using --with-fpu={neon / neon-vfpv4} shouldn't cause you ABI issues
> > with libraries for any other configurations. neon / neon-vfpv4 enable
> > use of the neon unit in a manner that is ABI compatible with the rest
> > of the system.
> >
> > For more on command line options for AArch32 and how they map to
> > various CPU's you might find this blog interesting.
> >
> > http://community.arm.com/groups/tools/blog/2013/04/15/arm-cortex-a-
> processors-and-gcc-command-lines
> >
> >
> >>
> >> I don't think Neon can improve score for SPECInt2000 significantly
> but
> >> may be I am wrong.
> >
> > It won't probably improve the overall score by a large amount but some
> > individual benchmarks will get some help.
> >
> There are some few benchmarks which benefit from autovectorization (eon
> particularly).
> >>> Did you add any other architecture specific options to your SPEC2k
> >>> runs ?
> >>>
> >>>
> >> No.  The only options I used are -Ofast.
> >>
> >> Could you recommend me what best options you think I should use for
> this
> >> processor.
> >>
> >
> > I would personally use --with-cpu=cortex-a15 --with-fpu=neon-vfpv4
> > --with-float=hard on this processor as that maps with the processor
> > available on that particular piece of Silicon.
> Thanks, Ramana.  Next time, I'll try these options.
> >
> > Also given it's a big LITTLE system with probably kernel switching -
> > it may be better to also make sure that you are always running on the
> > big core.
> >
> The results are pretty stable.  Also this version of Fedora does not
> implement switching from Big to Little processors.



ivdep pragma not used in ddg.c?

2014-07-09 Thread Bingfeng Mei
Hi,
I noticed that recent GCC adds ivdep pragma support. We have had our own implementation
of ivdep for a couple of years now. As the GCC implementation is much cleaner,
we want to migrate to it. Ivdep is consumed in two places in our
implementation: one is tree-vect-data-refs.c, used by the vectorizer; the other is
ddg.c, used by the modulo scheduler. In the GCC implementation the former is the
same, but ddg.c doesn't consume the ivdep information at all. I think it is
important not to draw redundant cross-iteration dependences when ivdep is
specified, in order to improve modulo scheduling performance.
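For reference, this is the kind of loop I have in mind (a minimal sketch, not
from our code base), where the pragma asserts there are no loop-carried
dependences and so should also let the modulo scheduler drop the redundant
cross-iteration dependence:

void scale (int *a, int *b, int n)
{
  int i;
#pragma GCC ivdep
  for (i = 0; i < n; i++)
    a[i] = b[i] * 2;
}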

Looking at the code, I wonder whether loop->safelen still keeps the correct
information, or whether the loop structure still remains correct after so many
tree/RTL passes. For example, in sms_schedule of modulo-sched.c:

  loop_optimizer_init (LOOPS_HAVE_PREHEADERS
   | LOOPS_HAVE_RECORDED_EXITS);

Does this mean the loop structure is reinitialized? I know there is a flag
(PROP_loops) which is supposed to preserve the loop structure, but I am not sure what
happens after all the loop transformations (unrolling, peeling, etc.). Is there a
stage where the loop structure is rebuilt and we lose the safelen (ivdep) information,
or is it still safe to use in the modulo scheduling pass?

Thanks,
Bingfeng


RE: Vector modes and the corresponding width integer mode

2014-12-12 Thread Bingfeng Mei
I don't think it is required. For example, the PowerPC port supports
V8SImode, but I don't see OImode. It is just that sometimes it can come in handy to
have the equal-size scalar mode.

Cheers,
Bingfeng

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Matthew Fortune
> Sent: 11 December 2014 13:27
> To: gcc@gcc.gnu.org
> Subject: Vector modes and the corresponding width integer mode
> 
> Hi,
> 
> I'm working on MIPS SIMD support for MSA. Can anyone point me towards
> information about the need for an integer mode of equal size to any
> supported vector mode?
> 
> I.e. if I support V4SImode is there any core GCC requirement that
> TImode is also supported?
> 
> Any guidance is appreciated. The MIPS port already has limited support
> for TImode for 64-bit targets which makes it all the more difficult to
> figure out if there is a relationship between vector modes and integer
> modes.
> 
> Thanks,
> Matthew


LLVM disagrees with GCC on bitfield handling

2017-10-26 Thread Bingfeng Mei
Hi,
Sorry if this question has been raised in the past.  I am running the GCC testsuite
for our LLVM port. There are several failures related to bitfield handling
(pr32244-1.c, bitfld-3.c, bitfld-5.c, etc.) where LLVM disagrees with GCC.
Taking pr32244-1.c as an example:

struct foo
{
  unsigned long long b:40;
} x;

extern void abort (void);

void test1(unsigned long long res)
{
  /* The shift is carried out in 40 bit precision.  */
  if (x.b<<32 != res)
abort ();
}

int main()
{
  x.b = 0x0100;
  test1(0);
  return 0;
}

The target machine has a 32-bit int and a 64-bit long long. GCC expects
the arithmetic shift to be performed in 40-bit precision (see the above
comment), whereas LLVM first casts x.b to 64-bit unsigned long long and
does the shift/comparison afterwards.
I checked the standard. It says the shift operands undergo integer promotion first,
which doesn't apply here because 40 bits > int, so GCC's approach seems to make
sense. On the other hand, you can argue that when the bitfield is
loaded, it is converted to its declared type first (unsigned long long here), and the
arithmetic operation is done afterwards. The C standard doesn't define arithmetic on
arbitrary data widths, so it would need to operate on the original data types. I am
confused about which approach conforms to the standard, or whether this is just a grey
area not well defined by the standard. Any suggestion is greatly appreciated.
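To spell out the second reading, the LLVM-style interpretation corresponds to
writing the conversion explicitly (my own illustration using the struct and x
above, not code from the testsuite):

void test1_llvm_view (unsigned long long res)
{
  /* Load x.b, convert it to its declared type, then shift in 64-bit precision.
     Under this reading x.b << 32 is 0x0100ULL << 32, which is non-zero,
     so test1_llvm_view (0) would abort.  */
  if ((unsigned long long) x.b << 32 != res)
    abort ();
}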


Cheers,
Bingfeng Mei


Re: LLVM disagrees with GCC on bitfield handling

2017-10-27 Thread Bingfeng Mei
HI, Joseph,
Thanks for detailed explanation.

Cheers,
Bingfeng

On Thu, Oct 26, 2017 at 5:11 PM, Joseph Myers 
wrote:

> There is a line of C90 DRs and associated textual history (compare the
> relevant text in C90 and C99, or see my comparison of it in WG14 reflector
> message 11100 (18 Apr 2006)) to the effect of bit-fields acting like they
> have a type with the given number of bits; that line is what's followed by
> GCC for C.  The choice of type for a bit-field (possibly separate from
> declared type) was left explicitly implementation-defined after DR#315;
> that is, if an implementation allows implementation-defined declared types
> as permitted by C99 and later, whether the actual type of the bit-field in
> question is the declared type or has the specified number of bits is also
> implementation-defined.  The point in DR#120 regarding assignment to
> bit-fields still applies in C11: nothing other than the semantics of
> conversion to a type with the given number of bits defines how the value
> to be stored in a bit-field is computed if the stored value is not in
> range.
>
> C++ chose a different route from those C90 DRs, of the width explicitly
> not being part of the type of the bit-field.  I don't know what if
> anything in C++ explicitly resolves the C90 DR#120 issue and defines the
> results of storing not-exactly-representable values in a bit-field.
>
> --
> Joseph S. Myers
> jos...@codesourcery.com
>


Vector permutation only deals with # of vector elements same as mask?

2011-02-10 Thread Bingfeng Mei
Hi,
I noticed that vector permutation gets more use in GCC
4.6, which is great. It is now used to handle a negative step
by reversing the vector elements.

However, after reading the related code, I understood
that it only works when the # of vector elements is
the same as that of the mask vector, per the following code.

perm_mask_for_reverse (tree-vect-stmts.c)
...
  mask_type = get_vectype_for_scalar_type (mask_element_type);
  nunits = TYPE_VECTOR_SUBPARTS (vectype);
  if (!mask_type
  || TYPE_VECTOR_SUBPARTS (vectype) != TYPE_VECTOR_SUBPARTS (mask_type))
return NULL;
...

For PowerPC Altivec, the mask_type is V16QI. That means the
compiler can only permute the V16QI type.  But given the capability of
the Altivec vperm instruction, it can permute any 128-bit type
(V8HI, V4SI, etc.). We just need to convert the given types to/from V16QI,
plus a bit of extra work in producing the mask.
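Something like the following is what I have in mind, written with the generic
vector extensions rather than the vectorizer internals (__builtin_shuffle only
exists in newer GCCs, so treat it purely as an illustration of the casts and
of building the byte mask):

typedef int v4si __attribute__ ((vector_size (16)));
typedef unsigned char v16qi __attribute__ ((vector_size (16)));

/* Reverse the four ints of a V4SI by viewing it as V16QI and permuting
   the bytes in groups of four.  */
v4si reverse_v4si (v4si x)
{
  v16qi mask = { 12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3 };
  return (v4si) __builtin_shuffle ((v16qi) x, mask);
}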

Do I understand correctly or miss something here?

Thanks,
Bingfeng Mei






RE: Vector permutation only deals with # of vector elements same as mask?

2011-02-11 Thread Bingfeng Mei
Thanks. Another question: is there any plan to vectorize
loops like the following one?

for (i=127; i>=0; i--) {
x[i] = y[i] + z[i];
}

I found that GCC trunk still cannot handle a negative step
for stores. Even if it could, it wouldn't be efficient because of the
redundant permutations introduced on the loads and stores.

Cheers,
Bingfeng
> -Original Message-
> From: Ira Rosen [mailto:i...@il.ibm.com]
> Sent: 10 February 2011 17:22
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Vector permutation only deals with # of vector elements
> same as mask?
> 
> 
> Hi,
> 
> "Bingfeng Mei"  wrote on 10/02/2011 05:35:45 PM:
> >
> > Hi,
> > I noticed that vector permutation gets more use in GCC
> > 4.6, which is great. It is used to handle negative step
> > by reversing vector elements now.
> >
> > However, after reading the related code, I understood
> > that it only works when the # of vector elements is
> > the same as that of mask vector in the following code.
> >
> > perm_mask_for_reverse (tree-vect-stmts.c)
> > ...
> >   mask_type = get_vectype_for_scalar_type (mask_element_type);
> >   nunits = TYPE_VECTOR_SUBPARTS (vectype);
> >   if (!mask_type
> >   || TYPE_VECTOR_SUBPARTS (vectype) != TYPE_VECTOR_SUBPARTS
> (mask_type))
> > return NULL;
> > ...
> >
> > For PowerPC altivec, the mask_type is V16QI. It means that
> > compiler can only permute V16QI type.  But given the capability of
> > altivec vperm instruction, it can permute any 128-bit type
> > (V8HI, V4SI, etc). We just need convert in/out V16QI from
> > given types and a bit more extra work in producing mask.
> >
> > Do I understand correctly or miss something here?
> 
> Yes, you are right. The support of reverse access is somewhat limited.
> Please see vect_transform_slp_perm_load() in tree-vect-slp.c for
> example of
> all type permutation support.
> 
> But, anyway, reverse accesses are not supported for altivec's load
> realignment scheme.
> 
> Ira
> 
> >
> > Thanks,
> > Bingfeng Mei
> >
> >
> >
> >
> 




Why does GCC convert short operation to short unsigned?

2011-06-17 Thread Bingfeng Mei
Hi,
I noticed that GCC converts short arithmetic to unsigned short.

short foo2 (short a, short b)
{
  return a - b;
}

In .gimple file:

foo2 (short int a, short int b)
{
  short int D.3347;
  short unsigned int a.0;
  short unsigned int b.1;
  short unsigned int D.3350;

  a.0 = (short unsigned int) a;
  b.1 = (short unsigned int) b;
  D.3350 = a.0 - b.1;
  D.3347 = (short int) D.3350;
  return D.3347;
}

Is this for C standard conformance, or for optimization purposes?
This doesn't happen with the int type.
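For what it's worth, the gimple above seems equivalent to the following
explicitly written version, where the narrowed arithmetic is done in unsigned
short so that wrap-around is well defined (my paraphrase, not generated code):

short foo2_equiv (short a, short b)
{
  unsigned short ua = (unsigned short) a;
  unsigned short ub = (unsigned short) b;
  return (short) (unsigned short) (ua - ub);
}

Presumably the point is that the shortened arithmetic then never relies on
signed overflow.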

Thanks,
Bingfeng Mei



Is this correct behaviour?

2011-09-06 Thread Bingfeng Mei
Hi, 
I compile the following code with ARM GCC 4.6 (x86 is similar with a
4.7 snapshot).
I noticed that "a" is written to memory three times instead of being incremented by 3
and written once at the end. Doesn't restrict guarantee that "a" won't be aliased to
"p", so that the three "a++" can be optimized?

Thanks,
Bingfeng Mei

int a;
int P[100];
void foo (int * restrict p)
{
  P[0] = *p;
  a++;
  P[1] = *p;
  a++;
  P[2] = *p;
  a++;
}

~/work/install-arm/bin/arm-elf-gcc tst.c -O2 -S -std=c99

foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r3, .L2
ldr r1, [r3, #0]
ldr ip, [r0, #0]
ldr r2, .L2+4
str r4, [sp, #-4]!
add r4, r1, #1
str r4, [r3, #0]
str ip, [r2, #0]
ldr ip, [r0, #0]
add r4, r1, #2
str r4, [r3, #0]
str ip, [r2, #4]
ldr r0, [r0, #0]
add r1, r1, #3
str r0, [r2, #8]
str r1, [r3, #0]
ldmfd   sp!, {r4}
bx  lr



RE: Is this correct behaviour?

2011-09-06 Thread Bingfeng Mei


> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: 06 September 2011 16:42
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Is this correct behaviour?
> 
> On Tue, Sep 6, 2011 at 5:30 PM, Bingfeng Mei  wrote:
> > Hi,
> > I compile the following code with arm gcc 4.6 (x86 is the similar
> with one of 4.7 snapshot).
> > I noticed "a" is written to memory three times instead of being added
> by 3 and written at the
> > end. Doesn't restrict guarantee "a" won't be aliased to "p" so 3
> "a++" can be optimized?
> 
> No it does not.

Then how do I tell the compiler that "a" is not aliased if I have to use a global
variable?
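To be concrete, is copying the global into a local and writing it back once
the only option? A sketch of what I mean (assuming, as in my code, that p
never actually points at a):

int a;
int P[100];
void foo (int * restrict p)
{
  int tmp = a;   /* local copy the compiler can keep in a register */
  P[0] = *p;
  tmp++;
  P[1] = *p;
  tmp++;
  P[2] = *p;
  tmp++;
  a = tmp;       /* single store back to the global */
}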

> 
> > Thanks,
> > Bingfeng Mei
> >
> > int a;
> > int P[100];
> > void foo (int * restrict p)
> > {
> >  P[0] = *p;
> >  a++;
> >  P[1] = *p;
> >  a++;
> >  P[2] = *p;
> >  a++;
> > }
> >
> > ~/work/install-arm/bin/arm-elf-gcc tst.c -O2 -S -std=c99
> >
> > foo:
> >        @ args = 0, pretend = 0, frame = 0
> >        @ frame_needed = 0, uses_anonymous_args = 0
> >        @ link register save eliminated.
> >        ldr     r3, .L2
> >        ldr     r1, [r3, #0]
> >        ldr     ip, [r0, #0]
> >        ldr     r2, .L2+4
> >        str     r4, [sp, #-4]!
> >        add     r4, r1, #1
> >        str     r4, [r3, #0]
> >        str     ip, [r2, #0]
> >        ldr     ip, [r0, #0]
> >        add     r4, r1, #2
> >        str     r4, [r3, #0]
> >        str     ip, [r2, #4]
> >        ldr     r0, [r0, #0]
> >        add     r1, r1, #3
> >        str     r0, [r2, #8]
> >        str     r1, [r3, #0]
> >        ldmfd   sp!, {r4}
> >        bx      lr
> >
> >




Derive more alias information from named address space

2011-09-16 Thread Bingfeng Mei
Hi,
I am trying to implement named address spaces for our target. 

In alias.c,  I found the following piece of code several times. 

  /* If we have MEMs refering to different address spaces (which can
 potentially overlap), we cannot easily tell from the addresses
 whether the references overlap.  */
  if (MEM_ADDR_SPACE (mem) != MEM_ADDR_SPACE (x))
return 1;

I think we can do better with the existing target hook:

- Target Hook: bool TARGET_ADDR_SPACE_SUBSET_P (addr_space_t superset, 
addr_space_t subset)

If A is not a subset of B and B is not a subset of A, we can conclude that
they are either disjoint or overlapping. According to the standard draft
(section 3.1.3):

"For any two address spaces, either the address spaces must be
disjoint, they must be equivalent, or one must be a subset of
the other. Other forms of overlapping are not permitted."

Therefore, A & B could only be disjoint, i.e., not aliased to each other.
We should be able to write: 

  if (MEM_ADDR_SPACE (mem) != MEM_ADDR_SPACE (x))
    {
      if (!targetm.addr_space.subset_p (MEM_ADDR_SPACE (mem), MEM_ADDR_SPACE (x))
          && !targetm.addr_space.subset_p (MEM_ADDR_SPACE (x), MEM_ADDR_SPACE (mem)))
        return 0;
      else
        return 1;
    }

Is this correct?

Thanks,
Bingfeng Mei



RE: Derive more alias information from named address space

2011-09-19 Thread Bingfeng Mei
Thanks. I will prepare a patch.

Bingfeng

> -Original Message-
> From: Ulrich Weigand [mailto:uweig...@de.ibm.com]
> Sent: 19 September 2011 12:56
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Derive more alias information from named address space
> 
> Bingfeng Mei wrote:
> 
> > Therefore, A & B could only be disjoint, i.e., not aliased to each
> other.
> > We should be able to write:
> >
> >   if (MEM_ADDR_SPACE (mem) != MEM_ADDR_SPACE (x))
> >   {
> > if (!targetm.addr_space.subset_p (MEM_ADDR_SPACE (mem),
> MEM_ADDR_SPACE (x))
> >&& !targetm.addr_space.subset_p (MEM_ADDR_SPACE (x),
> MEM_ADDR_SPACE (mem)))
> >   return 0;
> > else
> >   return 1;
> >   }
> >
> > Is this correct?
> 
> Yes, this looks correct to me ...
> 
> Bye,
> Ulrich
> 
> --
>   Dr. Ulrich Weigand
>   GNU Toolchain for Linux on System z and Cell BE
>   ulrich.weig...@de.ibm.com




Wrong documentation of TARGET_ADDR_SPACE_SUBSET_P

2011-09-23 Thread Bingfeng Mei
Hi, 
I noticed that the following description differs from how the spu & m32c ports use it. 

In internal manual:

-- Target Hook: bool TARGET_ADDR_SPACE_SUBSET_P (addr_space_t superset, addr_space_t subset)
Define this to return whether the subset named address space is contained within the
superset named address space. Pointers to a named address space that is a subset
of another named address space will be converted automatically without a cast if
used together in arithmetic operations. Pointers to a superset address space can be
converted to pointers to a subset address space via explicit casts.

In spu & m32c ports:
m32c_addr_space_subset_p (addr_space_t subset, addr_space_t superset)
spu_addr_space_subset_p (addr_space_t subset, addr_space_t superset)

I believe the documentation is wrong. The first argument is the subset and the second
one is the superset.  Should I submit a patch?

Cheers,
Bingfeng Mei



Not conform to c90?

2011-10-04 Thread Bingfeng Mei
Hello,
According to 
http://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/Zero-Length.html#Zero-Length
A zero-length array should have a length of 1 in c90.

But I tried 

#include <stdio.h>

struct
{
  char a[0];
} ZERO;

void main()
{
  int a[0];
  printf ("size = %d\n", sizeof(ZERO));
}

Compiled with gcc 4.7
~/work/install-x86/bin/gcc test.c -O2 -std=c90

size = 0

I noticed the following statement in GCC document.
"As a quirk of the original implementation of zero-length arrays, 
sizeof evaluates to zero." 

Does it mean GCC just does not conform to c90 in this respect?

Thanks,
Bingfeng Mei



RE: Not conform to c90?

2011-10-04 Thread Bingfeng Mei
Thank you very much. I misunderstood the document. 

Bingfeng

> -Original Message-
> From: Jonathan Wakely [mailto:jwakely@gmail.com]
> Sent: 04 October 2011 12:48
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Not conform to c90?
> 
> On 4 October 2011 12:09, Bingfeng Mei wrote:
> > Hello,
> > According to http://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/Zero-
> Length.html#Zero-Length
> > A zero-length array should have a length of 1 in c90.
> 
> I think you've misunderstood that page.  You cannot have a zero-length
> array in C90, what that page says is that in strict C90 you would have
> to create an array of length 1 as a workaround.  It's not saying
> sizeof(char[0]) is 1.
> 
> GNU C and C99 allow you to have a zero-length array.
> 
> > But I tried
> >
> > struct
> > {
> >  char a[0];
> > } ZERO;
> >
> > void main()
> > {
> >  int a[0];
> >  printf ("size = %d\n", sizeof(ZERO));
> > }
> >
> > Compiled with gcc 4.7
> > ~/work/install-x86/bin/gcc test.c -O2 -std=c90
> >
> > size = 0
> 
> If you add -pedantic you'll discover that program isn't valid in C90.
> 
> > I noticed the following statement in GCC document.
> > "As a quirk of the original implementation of zero-length arrays,
> > sizeof evaluates to zero."
> >
> > Does it mean GCC just does not conform to c90 in this respect?
> 
> C90 doesn't allow zero length arrays, so you're trying to evaluate a
> GNU extension in terms of a standard.  I'm not sure what you expect to
> happen.




RE: Porting 64-bit target on 32-bit host

2011-10-10 Thread Bingfeng Mei
I believe that a 64-bit target on a 32-bit host is not supported by GCC.
You need a lot of hacking to do so.

Check this thread.
http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00908.html

Bingfeng Mei

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Huang Ping
> Sent: 10 October 2011 11:29
> To: gcc@gcc.gnu.org
> Subject: Porting 64-bit target on 32-bit host
> 
> Hi, all
> 
> I'm porting a 64-bit target gcc on a 32-bit i386 host. I have set
> need_64bit_hwint to yes in config.gcc. But it fails when building
> libgcc.
> Then I did a simple test. test case like this:
> int test ()
> {
>    return 0;
> }
> 
> I use cc1 to compile it with -fdump-tree-all. The 003t.original dump file
> shows:
> {
>    return 1900544;
> }
> 
> I guess the compiler may take constant 0 as TImode, and read the
> adjacent word in memory. But I'm not sure. Could someone give some
> advice?
> Thanks.
> 
> Ping




RE: Porting 64-bit target on 32-bit host

2011-10-10 Thread Bingfeng Mei
Well, I just switched to 64-bit host and everything is fine.

Bingfeng

> -Original Message-
> From: harder...@gmail.com [mailto:harder...@gmail.com] On Behalf Of
> Huang Ping
> Sent: 10 October 2011 16:55
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Porting 64-bit target on 32-bit host
> 
> 2011/10/10 Bingfeng Mei :
> > I believe that 64-bit target on 32-bit host is not supported by GCC.
> > You need a lot of hackings to do so.
> >
> > Check this thread.
> > http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00908.html
> 
> Then how did you solve your problem in this thread?
> do many hackings on 32-bit host or change to 64-bit host?




Why doesn't GCC generate conditional move for COND_EXPR?

2011-10-24 Thread Bingfeng Mei
Hello,
I noticed that COND_EXPR is not expanded to a conditional move
as MIN_EXPR/MAX_EXPR are (assuming movmodecc is available).
I wonder why not?

I have some loops that fail tree vectorization, but still contain
COND_EXPRs from the tree ifcvt pass. In the end, the generated code
is worse than if I hadn't turned -ftree-vectorize on.  This
is on our private port.
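A minimal example of the shape of code I mean (illustrative only; our real
loops are target specific): after tree if-conversion the select remains as a
COND_EXPR rather than becoming a conditional move.

void adjust (int *a, const int *b, int n)
{
  int i;
  for (i = 0; i < n; i++)
    a[i] = (b[i] > 0) ? a[i] + 1 : a[i] - 1;
}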

Thanks,
Bingfeng Mei



RE: Why doesn't GCC generate conditional move for COND_EXPR?

2011-10-25 Thread Bingfeng Mei
Thanks, Andrew. I also implemented a quick patch on our port (based on GCC
4.5).
I noticed it now produces better code for our applications. Maybe eliminating
control flow at an earlier stage helps the other optimizing passes. Currently, the tree
if-conversion pass is not turned on by default (only with tree vectorization
or some other passes). Maybe it is worth making it the default at -O2 (for those
processors that support conditional moves)?

Cheers,
Bingfeng

> -Original Message-
> From: Andrew Pinski [mailto:pins...@gmail.com]
> Sent: 24 October 2011 17:20
> To: Richard Guenther
> Cc: Bingfeng Mei; gcc@gcc.gnu.org
> Subject: Re: Why doesn't GCC generate conditional move for COND_EXPR?
> 
> On Mon, Oct 24, 2011 at 7:00 AM, Richard Guenther
>  wrote:
> > On Mon, Oct 24, 2011 at 2:55 PM, Bingfeng Mei 
> wrote:
> >> Hello,
> >> I noticed that COND_EXPR is not expanded to conditional move
> >> as MIN_EXPR/MAX_EXPR are (assuming movmodecc is available).
> >> I wonder why not?
> >>
> >> I have some loop that fails tree vectorization, but still contains
> >> COND_EXPR from tree ifcvt pass. In the end, the generated code
> >> is worse than if I don't turned -ftree-vectorize on.  This
> >> is on our private port.
> >
> > Because nobody touched COND_EXPR expansion since ages.
> 
> I have a patch which I will be submitting next week or so that does
> this expansion correctly.  In fact I have a few patches which improves
> the generation of COND_EXPR in simple cases (in PHI-OPT).
> 
> Thanks,
> Andrew Pinski



SLP vectorizer on non-loop?

2011-11-01 Thread Bingfeng Mei
Hello,
I have one example with two very similar loops. The cunrolli pass unrolls one loop
completely but not the other, based on slightly different cost estimations. The
not-unrolled loop gets SLP-vectorized and then unrolled by the "cunroll" pass, whereas
the other, already unrolled loop cannot be vectorized since it is not a loop any more.
In the end, there is a big difference in performance between the two loops.

My question is why SLP vectorization has to be performed on a loop (it is a
sub-pass under pass_tree_loop). Conceptually, can't it be done on any basic block?
Our port is still stuck at 4.5, but I checked 4.7 and it seems to be the same. I also
checked the functions in tree-vect-slp.c. They use a lot of loop_vinfo structures, but
in some places they check whether loop_vinfo exists and use an alternative otherwise.
I tried to add an extra SLP pass after pass_tree_loop, but it didn't work. I wonder
how easy it is to make SLP work for non-loops.

Thanks,
Bingfeng Mei

Broadcom UK

void foo (int *__restrict__ temp_hist_buffer, 
  int * __restrict__ p_hist_buff, 
  int *__restrict__ p_input)
{
  int i;
  for(i=0;i<4;i++)
 temp_hist_buffer[i]=p_hist_buff[i];

  for(i=0;i<4;i++)
 temp_hist_buffer[i+4]=p_input[i];

}




RE: SLP vectorizer on non-loop?

2011-11-01 Thread Bingfeng Mei
Ira,
Thank you very much for the quick answer. I will check 4.7 on x86-64
to see the difference from our port. Are there significant changes
between 4.5 & 4.7 regarding SLP?

Cheers,
Bingfeng

> -Original Message-
> From: Ira Rosen [mailto:i...@il.ibm.com]
> Sent: 01 November 2011 11:13
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: SLP vectorizer on non-loop?
> 
> 
> 
> gcc-ow...@gcc.gnu.org wrote on 01/11/2011 12:41:32 PM:
> 
> > Hello,
> > I have one example with two very similar loops. cunrolli pass
> > unrolls one loop completely
> > but not the other based on slightly different cost estimations. The
> > not-unrolled loop
> > get SLP-vectorized, then unrolled by "cunroll" pass, whereas the
> > other unrolled loop cannot
> > be vectorized since it is not a loop any more.  In the end, there is
> > big difference of
> > performance between two loops.
> >
> 
> Here what I see with the current trunk on x86_64 with -O3 (with the two
> loops split into different functions):
> 
> The first loop, the one that doesn't get unrolled by cunrolli, gets
> loop
> vectorized with -fno-vect-cost-model. With the cost model the
> vectorization
> fails because the number of iterations is not sufficient (the
> vectorizer
> tries to apply loop peeling in order to align the accesses), the loop
> gets
> later unrolled by cunroll and the basic block gets vectorized by SLP.
> 
> The second loop, unrolled by cunrolli, also gets vectorized by SLP.
> 
> The *.optimized dumps look similar:
> 
> 
> :
>   vect_var_.14_48 = MEM[(int *)p_hist_buff_9(D)];
>   MEM[(int *)temp_hist_buffer_5(D)] = vect_var_.14_48;
>   return;
> 
> 
> :
>   vect_var_.7_57 = MEM[(int *)p_input_10(D)];
>   MEM[(int *)temp_hist_buffer_6(D) + 16B] = vect_var_.7_57;
>   return;
> 
> 
> > My question is why SLP vectorization has to be performed on loop (it
> > is a sub-pass under
> > pass_tree_loop). Conceptually, cannot it be done on any basic block?
> > Our port are still
> > stuck at 4.5. But I checked 4.7, it seems still the same. I also
> > checked functions in
> > tree-vect-slp.c. They use a lot of loop_vinfo structures. But in
> > some places it checks
> > whether loop_vinfo exists to use it or other alternative. I tried to
> > add an extra SLP
> > pass after pass_tree_loop, but it didn't work. I wonder how easy to
> > make SLP works for
> > non-loop.
> 
> SLP vectorization works both on loops (in vectorize pass) and on basic
> blocks (in slp-vectorize pass).
> 
> Ira
> 
> >
> > Thanks,
> > Bingfeng Mei
> >
> > Broadcom UK
> >
> > void foo (int *__restrict__ temp_hist_buffer,
> >   int * __restrict__ p_hist_buff,
> >   int *__restrict__ p_input)
> > {
> >   int i;
> >   for(i=0;i<4;i++)
> >  temp_hist_buffer[i]=p_hist_buff[i];
> >
> >   for(i=0;i<4;i++)
> >  temp_hist_buffer[i+4]=p_input[i];
> >
> > }
> >
> >
> 




Bug in Tree to RTL expansion?

2011-12-08 Thread Bingfeng Mei
Hi,
I experienced a code generation bug with 4.5 (yes, our
port is still stuck at 4.5.4). Since the concerned code
is full of our target-specific code, it is not easy
to demonstrate the error with x86 or ARM. 

Here is what happens in expanding process. The following is a 
piece of optimized tree code to be expanded to RTL.

  # ptr_h2_493 = PHI 
  ...
  D.13598_218 = MEM[base: ptr_h2_493, offset: 8];
  D.13599_219 = (long int) D.13598_218;
  ...
  ptr_h2_310 = ptr_h2_493 + 16;
  ...
  D.13634_331 = D.13599_219 * D.13538_179;
  cor3_332 = D.13635_339 + D.13634_331;
  ...

When expanding to RTL, the coalescing algorithm will coalesce 
ptr_h2_310 & ptr_h2_493 to one register:

;; ptr_h2_310 = ptr_h2_493 + 16;
(insn 364 363 0 (set (reg/v/f:SI 282 [ ptr_h2 ])
(plus:SI (reg/v/f:SI 282 [ ptr_h2 ])
(const_int 16 [0x10]))) -1 (nil))

GCC 4.5 (fp_gcc 2.3.x) doesn't expand statements one-by-one 
as GCC 4.4 (fp_gcc 2.2.x) does. So when GCC expands the
following statement,

cor3_332 = D.13635_339 + D.13634_331;

it then in turn expands each operand by going back to 
expand previous relevant statements. 

 D.13598_218 = MEM[base: ptr_h2_493, offset: 8];
 D.13599_219 = (long int) D.13598_218;
 ...
 D.13634_331 = D.13599_219 * D.13538_179;

The problem is that the compiler doesn't take into account the fact that
ptr_h2_493/ptr_h2_310 has been modified; it still expands the above
statement as it is.

(insn 380 379 381 (set (reg:HI 558)
(mem:HI (plus:SI (reg/v/f:SI 282 [ ptr_h2 ])
(const_int 8 [0x8])) [0 S2 A8])) -1 (nil))
...
(insn 382 381 383 (set (reg:SI 557)
(mult:SI (sign_extend:SI (reg:HI 558))
(sign_extend:SI (reg:HI 559 -1 (nil))

This seems to me quite a basic issue. I cannot believe the testsuites
and other applications do not expose more errors.

What I am not sure about is whether the coalescing algorithm or the expanding
procedure is wrong here. If ptr_h2_493 and ptr_h2_310 were not coalesced
into the same register, the code would be compiled correctly. Or if the expanding
procedure checked the data flow, it would also be OK. Which one should I
look at? Or is this a known issue, fixed in 4.6/4.7?

Thanks,
Bingfeng Mei



RE: Bug in Tree to RTL expansion?

2011-12-08 Thread Bingfeng Mei
Richard,
Thanks. -fno-tree-ter does work around the problem. I did look
at the dump info about coalescing, but it doesn't give much detail. I think
I have to take a closer look at TER and the coalescing algorithm.

Regards,
Bingfeng

> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: 08 December 2011 12:10
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org; Michael Matz
> Subject: Re: Bug in Tree to RTL expansion?
> 
> On Thu, Dec 8, 2011 at 12:34 PM, Bingfeng Mei  wrote:
> > Hi,
> > I experienced a code generation bug with 4.5 (yes, our
> > port is still stuck at 4.5.4). Since the concerned code
> > is full of our target-specific code, it is not easy
> > to demonstrate the error with x86 or ARM.
> >
> > Here is what happens in expanding process. The following is a
> > piece of optimized tree code to be expanded to RTL.
> >
> >  # ptr_h2_493 = PHI 
> >  ...
> >  D.13598_218 = MEM[base: ptr_h2_493, offset: 8];
> >  D.13599_219 = (long int) D.13598_218;
> >  ...
> >  ptr_h2_310 = ptr_h2_493 + 16;
> >  ...
> >  D.13634_331 = D.13599_219 * D.13538_179;
> >  cor3_332 = D.13635_339 + D.13634_331;
> >  ...
> >
> > When expanding to RTL, the coalescing algorithm will coalesce
> > ptr_h2_310 & ptr_h2_493 to one register:
> >
> > ;; ptr_h2_310 = ptr_h2_493 + 16;
> > (insn 364 363 0 (set (reg/v/f:SI 282 [ ptr_h2 ])
> >        (plus:SI (reg/v/f:SI 282 [ ptr_h2 ])
> >            (const_int 16 [0x10]))) -1 (nil))
> >
> > GCC 4.5 (fp_gcc 2.3.x) doesn't expand statements one-by-one
> > as GCC 4.4 (fp_gcc 2.2.x) does. So when GCC expands the
> > following statement,
> >
> > cor3_332 = D.13635_339 + D.13634_331;
> >
> > it then in turn expands each operand by going back to
> > expand previous relevant statements.
> >
> >  D.13598_218 = MEM[base: ptr_h2_493, offset: 8];
> >  D.13599_219 = (long int) D.13598_218;
> >  ...
> >  D.13634_331 = D.13599_219 * D.13538_179;
> >
> > The problem is that compiler doesn't take account into fact that
> > ptr_h2_493|ptr_h2_310 has been modified. Still expand the above
> > statement as it is.
> >
> > (insn 380 379 381 (set (reg:HI 558)
> >        (mem:HI (plus:SI (reg/v/f:SI 282 [ ptr_h2 ])
> >                (const_int 8 [0x8])) [0 S2 A8])) -1 (nil))
> > ...
> > (insn 382 381 383 (set (reg:SI 557)
> >        (mult:SI (sign_extend:SI (reg:HI 558))
> >            (sign_extend:SI (reg:HI 559 -1 (nil))
> >
> > This seems to me quite a basic issue. I cannot believe testsuites
> > and other applications do not expose more errors.
> >
> > What I am not sure is whether the coalescing algorithm or the
> expanding
> > procedure is wrong here. If ptr_h2_493 and ptr_h2_310 are not
> coalesced
> > to use the same register, it should be correctly compiled. Or
> expanding
> > procedure checks data flow, it should be also OK. Which one should I
> > I look at? Or is this a known issue and fixed in 4.6/4.7?
> 
> TER should not happen for D.13598_218 = MEM[base: ptr_h2_493, offset:
> 8]; because it conflicts with the coalesce.  Thus, -fno-tree-ter
> should
> fix your issue.  You may look at the -fdump-rtl-expand-details dump
> to learn about the coalescing decisions.
> 
> I'm not sure we fixed a bug that looks like the above.  With 4.5
> the 'MEM' is a TARGET_MEM_REF tree.
> 
> Micha should be most familiar with evolutions in this code.
> 
> Richard.
> 
> > Thanks,
> > Bingfeng Mei
> >




RE: Bug in Tree to RTL expansion?

2011-12-09 Thread Bingfeng Mei
Michael,
Thanks for your help. I struggled to understand tree-ssa-ter.c.
Please see questions below. 

I also tried the tree-ssa-ter.c from the trunk. Same results.

Bingfeng

> -Original Message-
> From: Michael Matz [mailto:m...@suse.de]
> Sent: 08 December 2011 13:50
> To: Richard Guenther
> Cc: Bingfeng Mei; gcc@gcc.gnu.org
> Subject: Re: Bug in Tree to RTL expansion?
> 
> Hi,
> 
> On Thu, 8 Dec 2011, Richard Guenther wrote:
> 
> > > What I am not sure is whether the coalescing algorithm or the
> > > expanding procedure is wrong here.
> 
> The forwarding of _218 is wrong.  TER shouldn't have marked it as being
> able to forward (check the expand-detailed dump for the "Replacing
> Expressions" section).  Obviously it does think it can forward it, so
> something is wrong in tree-ssa-ter.c.
> 
> If you can't come up with a testcase that fails with some available
> cross
> compiler (though I wonder why, the tree-ssa parts of the compiler are
> not
> that target dependend, so maybe you can show similar forwarding with an
> adjusted testcase for something publically available) you have to debug
> it
> yourself (I'm right now not aware of any known bug in 4.5 regarding
> this
> particular situation).
> 
> There should be a call to kill_expr on the statement
>   ptr_h2_310 = ptr_h2_493 + 16;

I tracked how TER is executed.
kill_expr is called, but the kill_lists are already all empty because

mark_replaceable -> finished_with_expr clears all the kill_lists.

In addition, once replaceable_expressions is set by mark_replaceable, it
doesn't seem to ever be cleared by kill_expr or any other function.
replaceable_expressions is the only data structure passed to the expand
pass.


> which should have killed the expression MEM[ptr_h2_493] (and hence _218)
> from the available expressions.
> 
> 
> Ciao,
> Michael.


RE: Bug in Tree to RTL expansion?

2011-12-09 Thread Bingfeng Mei
OK, don't bother. I think I understand TER and my issue now.
It comes from a misfix of widening multiplication; I found that there
is a new pass doing this from 4.6 onwards. I am going to backport that
to my target.
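
For reference, the source shape that triggers this (16-bit operands
multiplied into a 32-bit result, matching the sign_extend/mult:SI RTL
above) is roughly the following; this is a rough illustration, not our
original test case:

int widen_mul (const short *a, const short *b)
{
  /* The HImode loads are sign-extended and multiplied in SImode; this is
     the kind of multiply the new 4.6 pass is meant to turn into a
     widening multiply, assuming the target provides such a pattern.  */
  return *a * *b;
}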

Thanks,
Bingfeng

> -Original Message-
> From: Bingfeng Mei
> Sent: 09 December 2011 14:34
> To: 'Michael Matz'; Richard Guenther
> Cc: gcc@gcc.gnu.org
> Subject: RE: Bug in Tree to RTL expansion?
> 
> Michael,
> Thanks for your help. I struggled to understand tree-ssa-ter.c.
> Please see questions below.
> 
> I also tried the tree-ssa-ter.c from the trunk. Same results.
> 
> Bingfeng
> 
> > -Original Message-
> > From: Michael Matz [mailto:m...@suse.de]
> > Sent: 08 December 2011 13:50
> > To: Richard Guenther
> > Cc: Bingfeng Mei; gcc@gcc.gnu.org
> > Subject: Re: Bug in Tree to RTL expansion?
> >
> > Hi,
> >
> > On Thu, 8 Dec 2011, Richard Guenther wrote:
> >
> > > > What I am not sure is whether the coalescing algorithm or the
> > > > expanding procedure is wrong here.
> >
> > The forwarding of _218 is wrong.  TER shouldn't have marked it as
> being
> > able to forward (check the expand-detailed dump for the "Replacing
> > Expressions" section).  Obviously it does think it can forward it, so
> > something is wrong in tree-ssa-ter.c.
> >
> > If you can't come up with a testcase that fails with some available
> > cross
> > compiler (though I wonder why, the tree-ssa parts of the compiler are
> > not
> > that target dependend, so maybe you can show similar forwarding with
> an
> > adjusted testcase for something publically available) you have to
> debug
> > it
> > yourself (I'm right now not aware of any known bug in 4.5 regarding
> > this
> > particular situation).
> >
> > There should be a call to kill_expr on the statement
> >   ptr_h2_310 = ptr_h2_493 + 16;
> 
> I tracked into how TER is executed.
> kill_expr is called but the kill_list are already all empty because
> 
> mark_replaceable -> finished_with_expr  clear all the kill_list.
> 
> In addition, once replaceable_expressions is set by mark_replaceable.
> It doesn't seem
> it is ever cleared due to kill_expr or any other function.
> replaceable_expression
> is the only data structure passed to expand pass.
> 
> 
> > which should have killed the expression MEM[ptr_h2_493] (and hence
> _218)
> > from the available expressions.
> >
> >
> > Ciao,
> > Michael.


libtool error in building GCC

2009-07-21 Thread Bingfeng Mei
Hello,
I am experiencing the following error when building the TRUNK version of our
port. I am not familiar with libtool. In 4.4, GCC produces its own libtool
under the libstdc++-v3 directory and other similar directories. But I cannot
track how the libtool is generated. Even if I remove libtool under the
libstdc++-v3 directory and rerun make, it cannot regenerate libtool.
Examining config.log, config.status and Makefile doesn't help me either. So I
am really lost as to what is going wrong in the 4.5 trunk. Any help is
greatly appreciated.
 
Thanks,
Bingfeng Mei
 
 
/bin/sh ../libtool --tag CXX --tag disable-shared --mode=compile 
/projects/firepath/tools/work/bmei/gcc-head/build/./gcc/xgcc -shared-libgcc 
-B/projects/firepath/tools/work/bmei/gcc-head/build/./gcc -nostdinc++ 
-L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libstdc++-v3/src
 
-L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libstdc++-v3/src/.libs
 -nostdinc 
-B/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/newlib/ 
-isystem 
/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/newlib/targ-include
 -isystem /projects/firepath/tools/work/bmei/gcc-head/src/newlib/libc/include 
-B/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libgloss/firepath
 
-L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libgloss/libnosys
 -L/projects/firepath/tools/work/bmei/gcc-head/src/libgloss/firepath 
-B/home/bmei/work/gcc-head/install/firepath-elf/bin/ 
-B/home/bmei/work/gcc-head/install/firepath-elf/lib/ -isystem 
/home/bmei/work/gcc-head/install/firepath-elf/include -isystem 
/home/bmei/work/gcc-head/install/firepath-elf/sys-include
-I/projects/firepath/tools/work/bmei/gcc-head/src/libstdc++-v3/../gcc 
-I/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libstdc++-v3/include/firepath-elf
 
-I/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libstdc++-v3/include
 -I/projects/firepath/tools/work/bmei/gcc-head/src/libstdc++-v3/libsupc++  
-fno-implicit-templates  -Wall -Wextra -Wwrite-strings -Wcast-qual  
-fdiagnostics-show-location=once  -ffunction-sections -fdata-sections  -g -O2  
-c -o array_type_info.lo 
../../../../src/libstdc++-v3/libsupc++/array_type_info.cc
/bin/sh: ../libtool: No such file or directory
make[4]: *** [array_type_info.lo] Error 127
make[4]: Leaving directory 
`/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libstdc++-v3/libsupc++

 
The following is my configuration command:
 
CC="gcc -m32" CFLAGS="-g" ../src/configure 
--prefix=/home/bmei/work/gcc-head/install --enable-languages=c,c++ 
--disable-nls --target=firepath-elf --with-newlib 
--with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1 
--with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3.0  
--disable-libssp --with-headers --enable-checking --enable-multilib



RE: libtool error in building GCC

2009-07-21 Thread Bingfeng Mei
Just ignore my previous mail. I found that the error is because we failed to
import the new 4.5 directory libstdc++-v3/python into our repository.


> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Bingfeng Mei
> Sent: 21 July 2009 12:41
> To: gcc@gcc.gnu.org
> Subject: libtool error in building GCC
> 
> Hello,
> I am experiencing the following error when building TRUNK 
> version of our port.
> I am not familar with libtool. In 4.4, GCC produces its own 
> libtools under 
> libstdc++v3 directory and other similar directories. But I 
> cannot track 
> how the libtool is generated. Even I remove libtool under 
> libstdc++-v3 directory
> and rerun make and it cannot regenerate libtool again. 
> Examining config.log, 
> config.status and Makefile doesn't help me either. So I 
> really get lost what
> is going wrong in 4.5 trunk. Any help is greatly appreciated.
>  
> Thanks,
> Bingfeng Mei
>  
>  
> /bin/sh ../libtool --tag CXX --tag disable-shared 
> --mode=compile 
> /projects/firepath/tools/work/bmei/gcc-head/build/./gcc/xgcc 
> -shared-libgcc 
> -B/projects/firepath/tools/work/bmei/gcc-head/build/./gcc 
> -nostdinc++ 
> -L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e
> lf/libstdc++-v3/src 
> -L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e
> lf/libstdc++-v3/src/.libs -nostdinc 
> -B/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e
> lf/newlib/ -isystem 
> /projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf
> /newlib/targ-include -isystem 
> /projects/firepath/tools/work/bmei/gcc-head/src/newlib/libc/in
> clude 
> -B/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e
> lf/libgloss/firepath 
> -L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e
> lf/libgloss/libnosys 
> -L/projects/firepath/tools/work/bmei/gcc-head/src/libgloss/fir
> epath -B/home/bmei/work/gcc-head/install/firepath-elf/bin/ 
> -B/home/bmei/work/gcc-head/install/firepath-elf/lib/ -isystem 
> /home/bmei/work/gcc-head/install/firepath-elf/include 
> -isystem 
> /home/bmei/work/gcc-head/install/firepath-elf/sys-include
> -I/projects/firepath/tools/work/bmei/gcc-head/src/libstdc++-v3
> /../gcc 
> -I/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e
> lf/libstdc++-v3/include/firepath-elf 
> -I/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e
> lf/libstdc++-v3/include 
> -I/projects/firepath/tools/work/bmei/gcc-head/src/libstdc++-v3
> /libsupc++  -fno-implicit-templates  -Wall -Wextra 
> -Wwrite-strings -Wcast-qual  -fdiagnostics-show-location=once 
>  -ffunction-sections -fdata-sections  -g -O2  -c -o 
> array_type_info.lo 
> ../../../../src/libstdc++-v3/libsupc++/array_type_info.cc
> /bin/sh: ../libtool: No such file or directory
> make[4]: *** [array_type_info.lo] Error 127
> make[4]: Leaving directory 
> `/projects/firepath/tools/work/bmei/gcc-head/build/firepath-el
> f/libstdc++-v3/libsupc++
> 
>  
> The following is my configuration command:
>  
> CC="gcc -m32" CFLAGS="-g" ../src/configure 
> --prefix=/home/bmei/work/gcc-head/install 
> --enable-languages=c,c++ --disable-nls --target=firepath-elf 
> --with-newlib 
> --with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2
> .4.1 
> --with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3
> .0  --disable-libssp --with-headers --enable-checking 
> --enable-multilib
> 
> 
> 


RE: PRE_DEC, POST_INC

2009-08-07 Thread Bingfeng Mei
I asked a similar question regarding PRE_INC/POST_INC quite a while ago,
and there was quite a lot of discussion. I haven't checked whether the
situation has changed.
 http://gcc.gnu.org/ml/gcc/2007-11/msg00032.html

Bingfeng Mei

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Ramana Radhakrishnan
> Sent: 07 August 2009 14:11
> To: Florent Defay
> Cc: gcc@gcc.gnu.org
> Subject: Re: PRE_DEC, POST_INC
> 
> On Fri, Aug 7, 2009 at 1:33 PM, Florent 
> Defay wrote:
> > Hi,
> >
> > I am working on a new port.
> >
> > The target machine supports post-increment and pre-decrement
> > addressing modes. These modes are twice faster than indexed mode.
> > It is important for us that GCC consider them well.
> 
> 
> GCC does support generation of pre{post}-inc {dec}.  GCC's auto-inc
> detector works at a basic block level and attempts to generate
> auto-inc operations within a basic block . Look at auto-inc-dec.c for
> more information.It is an area which could do with some improvement
> and work , however no one's found the time to do it yet.
> 
> >
> > I wrote emails to gcc-help and I was told that GCC was not 
> so good at
> > pre/post-dec/increment since few targets support these modes.
> >
> > I would like to know if there is a good way to make pre-dec and
> > post-inc modes have more priority than indexed mode.
> > Is there current work for dealing with this issue ?
> 
> I am assuming you already have basic generation of auto-incs and you
> have your definitions for legitimate{legitimize}_address all set up
> correctly.
> 
> In this case you could start by tweaking your address costs macros.
> Getting that right can help you get going in the right direction with
> the current state of the art. In a previous life while maintaining a
> private port of GCC I've dabbled with a few patches posted by Joern
> Reneccke with regards to auto-inc-dec that worked well for me in
> improving code generation on some of the benchmarks. You should be
> able to get them out using Google.
> 
> There are a number of bugzilla entries in the database that cover a
> number of cases for auto-inc generation and some ideas on what can be
> done to improve this. You might be better off searching in that as
> well. One of the problems upto 4.3 was that the ivopts and the loop
> optimizers didn't care too much about auto-increment addressing and
> thereby pessimizing this in favour of using index addressing.  There
> have been a few patches that were being discussed in in the recent
> past by Bernd Schmidt and Zdenek attempting to address auto-inc
> generation for loop ivopts but I'm not sure if these have gone into
> trunk yet.
> 
> Hope this helps.
> 
> 
> cheers
> Ramana
> 
> 


RE: IRA undoing scheduling decisions

2009-08-25 Thread Bingfeng Mei
I can confirm this in our private port too, though in a slightly different form.

r2 = 7
[r0] = r2
r0 = 4
[r1] = r0

Bingfeng 

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Charles J. Tabony
> Sent: 25 August 2009 00:56
> To: gcc@gcc.gnu.org
> Subject: IRA undoing scheduling decisions
> 
> Fellow GCC developers,
> 
> I am seeing a performance regression on the port I maintain, 
> and I would appreciate some pointers.
> 
> When I compile the following code
> 
> void f(int *x, int *y){
>   *x = 7;
>   *y = 4;
> }
> 
> with GCC 4.3.2, I get the desired sequence of instructions.  
> I'll call it sequence A:
> 
> r0 = 7
> r1 = 4
> [x] = r0
> [y] = r1
> 
> When I compile the same code with GCC 4.4.0, I get a sequence 
> that is lower performance for my target machine.  I'll call 
> it sequence B:
> 
> r0 = 7
> [x] = r0
> r0 = 4
> [y] = r0
> 
> I see the same difference between GCC 4.3.2 and 4.4.0 when 
> compiling for PowerPC, MIPS, ARM, and FR-V.
> 
> When I look at the RTL dumps, I see that the first scheduling 
> pass always produces sequence A, across all targets and GCC 
> versions I tried.  In GCC 4.3.2, sequence A persists 
> throughout the remainder of compilation.  In GCC 4.4.0, for 
> every target, the .ira dump shows that the sequence of 
> instructions has reverted back to sequence B.
> 
> Are there any machine-dependent parameters that I can tune to 
> prevent IRA from transforming sequence A into sequence B?  If 
> not, where can I add a hook to allow this decision to be 
> tuned per machine?
> 
> Is there any more information you would like me to provide?
> 
> Thank you,
> Charles J. Tabony
> 
> 
> 


Restrict qualifier doesn't work in TRUNK.

2009-09-11 Thread Bingfeng Mei
Hello,
I noticed that the restrict qualifier no longer works properly in trunk.
In the following example, the memory accesses of a, b and c no longer have
different alias sets attached. Instead, the generic alias set 2 is used
for all accesses.

I remember the alias analysis part has had some big changes since 4.5
branched. Is it still in an unstable phase, or are there new hooks I should
use for my port?

Cheers,
Bingfeng Mei


void foo (int * __restrict__ a, int * __restrict__ b, int * __restrict__ c)
{
   int i;
   for(i = 0; i < 100; i+=4)
 {
   a[i] = b[i] * c[i];
   a[i+1] = b[i+1] * c[i+1];
   a[i+2] = b[i+2] * c[i+2];
   a[i+3] = b[i+3] * c[i+3];
 }
}   

Before expand:

foo (int * restrict a, int * restrict b, int * restrict c)
{
  long unsigned int D.3213;
  long unsigned int ivtmp.32;
  long unsigned int ivtmp.31;
  long unsigned int ivtmp.29;
  int D.3168;
  int D.3167;
  int D.3165;
  int D.3160;
  int D.3159;
  int D.3157;
  int D.3152;
  int D.3151;
  int D.3149;
  int D.3143;
  int D.3142;
  int D.3140;

  # BLOCK 2 freq:385
  # PRED: ENTRY [100.0%]  (fallthru,exec)
  ivtmp.29_77 = (long unsigned int) b_9(D);
  ivtmp.31_74 = (long unsigned int) c_14(D);
  ivtmp.32_83 = (long unsigned int) a_5(D);
  D.3213_85 = ivtmp.32_83 + 400;
  # SUCC: 3 [100.0%]  (fallthru,exec)

  # BLOCK 3 freq:9615
  # PRED: 3 [96.0%]  (true,exec) 2 [100.0%]  (fallthru,exec)
  # ivtmp.29_79 = PHI <ivtmp.29_78(3), ivtmp.29_77(2)>
  # ivtmp.31_76 = PHI <ivtmp.31_75(3), ivtmp.31_74(2)>
  # ivtmp.32_73 = PHI <ivtmp.32_82(3), ivtmp.32_83(2)>
  D.3140_11 = MEM[index: ivtmp.29_79];
  D.3142_16 = MEM[index: ivtmp.31_76];
  D.3143_17 = D.3142_16 * D.3140_11;
  MEM[index: ivtmp.32_73] = D.3143_17;
  D.3149_26 = MEM[index: ivtmp.29_79, offset: 4];
  D.3151_31 = MEM[index: ivtmp.31_76, offset: 4];
  D.3152_32 = D.3151_31 * D.3149_26;
  MEM[index: ivtmp.32_73, offset: 4] = D.3152_32;
  D.3157_41 = MEM[index: ivtmp.29_79, offset: 8];
  D.3159_46 = MEM[index: ivtmp.31_76, offset: 8];
  D.3160_47 = D.3159_46 * D.3157_41;
  MEM[index: ivtmp.32_73, offset: 8] = D.3160_47;
  D.3165_56 = MEM[index: ivtmp.29_79, offset: 12];
  D.3167_61 = MEM[index: ivtmp.31_76, offset: 12];
  D.3168_62 = D.3167_61 * D.3165_56;
  MEM[index: ivtmp.32_73, offset: 12] = D.3168_62;
  ivtmp.29_78 = ivtmp.29_79 + 16;
  ivtmp.31_75 = ivtmp.31_76 + 16;
  ivtmp.32_82 = ivtmp.32_73 + 16;
  if (ivtmp.32_82 != D.3213_85)
    goto <bb 3>;
  else
    goto <bb 4>;
  # SUCC: 3 [96.0%]  (true,exec) 4 [4.0%]  (false,exec)

  # BLOCK 4 freq:385
  # PRED: 3 [4.0%]  (false,exec)
  return;
  # SUCC: EXIT [100.0%] 

}

Part of RTL

...
(insn 40 39 41 4 sms-6.c:11 (set (reg:SI 157)
(mem:SI (reg:SI 151 [ ivtmp.31 ]) [2 S4 A32])) -1 (nil))

(insn 41 40 42 4 sms-6.c:11 (set (reg:SI 158)
(mem:SI (reg:SI 152 [ ivtmp.29 ]) [2 S4 A32])) -1 (nil))

(insn 42 41 43 4 sms-6.c:11 (set (reg:SI 159)
(mult:SI (reg:SI 157)
(reg:SI 158))) -1 (nil))

(insn 43 42 44 4 sms-6.c:11 (set (mem:SI (reg:SI 150 [ ivtmp.32 ]) [2 S4 A32])
(reg:SI 159)) -1 (nil))

(insn 44 43 45 4 sms-6.c:12 (set (reg:SI 160)
(mem:SI (plus:SI (reg:SI 151 [ ivtmp.31 ])
(const_int 4 [0x4])) [2 S4 A32])) -1 (nil))

(insn 45 44 46 4 sms-6.c:12 (set (reg:SI 161)
(mem:SI (plus:SI (reg:SI 152 [ ivtmp.29 ])
(const_int 4 [0x4])) [2 S4 A32])) -1 (nil))

(insn 46 45 47 4 sms-6.c:12 (set (reg:SI 162)
(mult:SI (reg:SI 160)
(reg:SI 161))) -1 (nil))

(insn 47 46 48 4 sms-6.c:12 (set (mem:SI (plus:SI (reg:SI 150 [ ivtmp.32 ])
(const_int 4 [0x4])) [2 S4 A32])
(reg:SI 162)) -1 (nil))

(insn 48 47 49 4 sms-6.c:13 (set (reg:SI 163)
(mem:SI (plus:SI (reg:SI 151 [ ivtmp.31 ])
(const_int 8 [0x8])) [2 S4 A32])) -1 (nil))

(insn 49 48 50 4 sms-6.c:13 (set (reg:SI 164)
(mem:SI (plus:SI (reg:SI 152 [ ivtmp.29 ])
(const_int 8 [0x8])) [2 S4 A32])) -1 (nil))
...


RE: help on - how to specify architecture information to gcc

2009-09-21 Thread Bingfeng Mei
You should check how to construct a DFA for your target architecture.
Look at "Specifying processor pipeline description" in the GCC internals
manual and check how other architectures do it.


-Bingfeng 

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of ddmetro
> Sent: 21 September 2009 12:52
> To: gcc@gcc.gnu.org
> Subject: help on - how to specify architecture information to gcc
> 
> 
> Hi All,
>  Our project is to optimize instruction scheduling in gcc. It
> requires us to specify architecture information
> (basically number of cycles per instruction, stall and branch delays)
> to gcc, to optimize structural hazard detection.
> 
> Problem: Is there any specific format in which we can specify this
> information to gcc? Is it possible to embed this additional
> architecture specific detail, in .md files?
> 
> Target language for which optimization is being done: C
> Target machine architecture: i686
> GCC version: 4.4.1
> 
>  If none of the above options work, we were planning to put
> the information manually in a file and make gcc read it each time it
> loads. Any suggestions/comments on this approach?
> 
>  Couldn't find a related thread. Hence a new one.
> 
> Thanking All,
> - Dhiraj.
> -- 
> View this message in context: 
> http://www.nabble.com/help-on---how-to-specify-architecture-in
> formation-to-gcc-tp25530300p25530300.html
> Sent from the gcc - Dev mailing list archive at Nabble.com.
> 
> 
> 


Issues of the latest trunk with LTO merges

2009-10-12 Thread Bingfeng Mei
Hello,
I ran into an issue with the LTO merges when updating to the current trunk.
The problem is that my target calls a few functions/uses some data structures
in the gcc directory: c_language, pragma_lex, c_register_pragma, etc.

gcc -m32  -g -DIN_GCC -DCROSS_DIRECTORY_STRUCTURE  -W -Wall -Wwrite-strings 
-Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wmissing-format-attribute 
-pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings 
-Wold-style-definition -Wc++-compat -fno-common  -DHAVE_CONFIG_H  -o lto1 \
lto/lto-lang.o lto/lto.o lto/lto-elf.o attribs.o main.o 
tree-browser.o libbackend.a ../libcpp/libcpp.a ../libdecnumber/libdecnumber.a   
-L/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/lib 
-L/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/lib -lmpfr -lgmp 
-rdynamic -ldl  -L../zlib -lz 
-L/projects/firepath/tools/work/bmei/packages/libelf/lib -lelf 
../libcpp/libcpp.a   ../libiberty/libiberty.a ../libdecnumber/libdecnumber.a  
-lelf

When linking lto1 in the above step, I consequently get many link errors.
I tried to add some extra object files like c-common.o, c-pragma.o, etc. into
lto/Make-lang.in, but more link errors are produced. One problem is that the
lto code redefines some data that exists in the main code: flag_no_builtin,
flag_isoc99, lang_hooks, etc., which prevents it from linking with the object
files in the main directory.

What is the clean solution for this? Thanks in advance.

Cheers,
Bingfeng Mei




RE: Issues of the latest trunk with LTO merges

2009-10-12 Thread Bingfeng Mei
Richard,
Doesn't REGISTER_TARGET_PRAGMAS need to call c_register_pragma, etc., if we
want to specify target-specific pragmas? It becomes part of libbackend.a,
which is linked into lto1. One solution I see is to put these calls into a
separate file so that the linker won't produce undefined references when
they are not actually used by lto1.
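
For context, the kind of registration involved looks roughly like this
(the handler and pragma names here are only illustrative, along the lines
of the v850 "ghs" pragmas):

static void
my_ghs_section_pragma (struct cpp_reader *pfile ATTRIBUTE_UNUSED)
{
  /* Parse the pragma operands with pragma_lex () and record them.  */
}

void
my_register_target_pragmas (void)
{
  c_register_pragma ("ghs", "section", my_ghs_section_pragma);
}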

Thanks,
Bingfeng

> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com] 
> Sent: 12 October 2009 15:34
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Issues of the latest trunk with LTO merges
> 
> On Mon, Oct 12, 2009 at 4:31 PM, Bingfeng Mei 
>  wrote:
> > Hello,
> > I ran into an issue with the LTO merges when updating to 
> current trunk.
> > The problem is that my target calls a few functions/uses 
> some data structures
> > in the gcc directory: c_language, paragma_lex, 
> c_register_pragma, etc.
> >
> > gcc -m32  -g -DIN_GCC -DCROSS_DIRECTORY_STRUCTURE  -W -Wall 
> -Wwrite-strings -Wcast-qual -Wstrict-prototypes 
> -Wmissing-prototypes -Wmissing-format-attribute -pedantic 
> -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings 
> -Wold-style-definition -Wc++-compat -fno-common  
> -DHAVE_CONFIG_H  -o lto1 \
> >                lto/lto-lang.o lto/lto.o lto/lto-elf.o 
> attribs.o main.o tree-browser.o libbackend.a 
> ../libcpp/libcpp.a ../libdecnumber/libdecnumber.a   
> -L/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/lib 
> -L/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/lib 
> -lmpfr -lgmp -rdynamic -ldl  -L../zlib -lz 
> -L/projects/firepath/tools/work/bmei/packages/libelf/lib 
> -lelf ../libcpp/libcpp.a   ../libiberty/libiberty.a 
> ../libdecnumber/libdecnumber.a  -lelf
> >
> > When compiling for lto1 in above step, I have many linking 
> errors consequently.
> > I tried to add some extra object files like c-common.o, 
> c-pragma.o, etc into
> > lto/Make-lang.in. More linking errors are produced. One 
> problem is that lto
> > code redefines some data exist in the main code: 
> flag_no_builtin, flag_isoc99
> > lang_hooks, etc, which prevent it from linking with object 
> files in main directory.
> >
> > What is the clean solution for this? Thanks in advance.
> 
> You should not use C frontend specific stuff when not building
> the C frontend.
> 
> Richard.
> 
> > Cheers,
> > Bingfeng Mei
> >
> >
> >
> 
> 


LTO question

2009-10-13 Thread Bingfeng Mei
Hello,
I just had my first taste of the latest LTO merge on our port.
The compiler is configured with LTO enabled and builds correctly.
I tried the following example: 
 
a.c
extern void foo(int);
int main()
{  foo(20);
  return 1;
}  

b.c
#include <stdio.h>
void foo(int c)
{
  printf("Hello world: %d\n", c);
}

compiled with:
firepath-elf-gcc -flto a.c b.c -save-temps -O2
 
I expected foo to be inlined into main, but it is not. Both main and foo
are present in a.s, while b.s contains the LTO code.

Did I miss something here? Are there new hooks I should specify or tune for
LTO? I checked the up-to-date internals manual and found nothing.
 
Thanks,
Bingfeng Mei
 



RE: LTO question

2009-10-13 Thread Bingfeng Mei
Thanks. It works. I thought -fwhole-program was used with --combine and that
both were replaced by -flto. Now it seems that -flto is the equivalent of
--combine, and -fwhole-program is still important.
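
So for the example above, the working command line presumably becomes
something like:

firepath-elf-gcc -flto -fwhole-program a.c b.c -save-temps -O2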

Bingfeng  

> -Original Message-
> From: Diego Novillo [mailto:dnovi...@google.com] 
> Sent: 13 October 2009 14:30
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: LTO question
> 
> On Tue, Oct 13, 2009 at 08:47, Bingfeng Mei  wrote:
> 
> > a.c
> > extern void foo(int);
> > int main()
> > {  foo(20);
> >  return 1;
> > }
> >
> > b.c
> > #include 
> > void foo(int c)
> > {
> >  printf("Hello world: %d\n", c);
> > }
> >
> > compiled with:
> > firepath-elf-gcc -flto a.c b.c -save-temps -O2
> >
> > I expected that foo should be inlined into main, but not.  
> Both functions
> >  of main and foo are present in a.s, while b.s contains the 
> LTO code.
> 
> Try adding -fwhole-program.
> 
> 
> Diego.
> 
> 


RE: LTO question

2009-10-13 Thread Bingfeng Mei
 

> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com] 
> Sent: 13 October 2009 16:15
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: LTO question
> 
> On Tue, Oct 13, 2009 at 2:47 PM, Bingfeng Mei 
>  wrote:
> > Hello,
> > I just had the first taste with the latest LTO merge on our port.
> > Compiler is configured with LTO enabled and built correctly.
> > I tried the following example:
> >
> > a.c
> > extern void foo(int);
> > int main()
> > {  foo(20);
> >  return 1;
> > }
> >
> > b.c
> > #include 
> > void foo(int c)
> > {
> >  printf("Hello world: %d\n", c);
> > }
> >
> > compiled with:
> > firepath-elf-gcc -flto a.c b.c -save-temps -O2
> >
> > I expected that foo should be inlined into main, but not.  
> Both functions
> >  of main and foo are present in a.s, while b.s contains the 
> LTO code.
> >
> > Did I miss something here? Are there new hooks I should 
> specify or tune for
> > LTO? I checked the up-to-date internal manual and found nothing.
> 
> non-inline declared functions are inlined at -O2 only if doing so
> does not increase program size.  Use -O3 or -finline-functions.

But the function is only called once here. Inlining it should always
decrease size unless my cost function is terribly wrong. I will check how
other targets such as x86 behave here.


> 
> Richard.
> 
> > Thanks,
> > Bingfeng Mei
> >
> >
> >
> 
> 


RE: Turning off unrolling to certain loops

2009-10-15 Thread Bingfeng Mei
Hello,
I faced a similar issue a while ago. I filed a bug report
(http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712). In the end,
I implemented a simple tree-level unrolling pass in our port
which uses all the existing infrastructure. It works quite well for
our purpose, but I hesitated to submit the patch because it contains
our not-very-elegant #pragma unroll implementation.

The following two functions do the unrolling. I insert the pass
just after the loop_prefetch pass (4.4).

Cheers,
Bingfeng Mei


/* Perfect unrolling of a loop */
static void tree_unroll_perfect_loop (struct loop *loop, unsigned factor,
  edge exit)
{
  sbitmap wont_exit;
  edge e;
  bool ok;
  unsigned i;
  VEC (edge, heap) *to_remove = NULL;
  
  /* Unroll the loop and remove the exits in all iterations except for the
 last one.  */
  wont_exit = sbitmap_alloc (factor);
  sbitmap_ones (wont_exit);
  RESET_BIT (wont_exit, factor - 1);

  ok = gimple_duplicate_loop_to_header_edge
  (loop, loop_latch_edge (loop), factor - 1,
   wont_exit, exit, &to_remove, DLTHE_FLAG_UPDATE_FREQ);
  free (wont_exit);
  gcc_assert (ok);

  for (i = 0; VEC_iterate (edge, to_remove, i, e); i++)
{
  ok = remove_path (e);
  gcc_assert (ok);
}
  VEC_free (edge, heap, to_remove);
  update_ssa (TODO_update_ssa);
  
#ifdef ENABLE_CHECKING
  verify_flow_info ();
  verify_dominators (CDI_DOMINATORS);
  verify_loop_structure ();
  verify_loop_closed_ssa ();
#endif
}


  
/* Go through all the loops:
     1. Determine the unrolling factor.
     2. Unroll loops under different conditions:
        -- perfect loop: no extra copy of the original loop.
        -- other loops: keep the original version of the loop to execute
           the remaining iterations.  */
static unsigned int rest_of_tree_unroll (void)
{
  loop_iterator li;
  struct loop *loop;
  unsigned ninsns, unroll_factor;
  HOST_WIDE_INT est_niter;
  struct tree_niter_desc desc;
  bool unrolled = false;

  initialize_original_copy_tables ();
  
  /* Scan the loops, inner ones first.  */
  FOR_EACH_LOOP (li, loop, LI_FROM_INNERMOST)
  {
 
 est_niter = estimated_loop_iterations_int (loop, false);
 ninsns = tree_num_loop_insns (loop, &eni_size_weights);
 unroll_factor = determine_unroll_factor (loop, ninsns, &desc, est_niter);
 if (unroll_factor != 1)
 {
   tree niters = number_of_exit_cond_executions(loop);
   
   bool perfect_unrolling = false;
   if (niters != NULL_TREE && niters != chrec_dont_know
       && TREE_CODE (niters) == INTEGER_CST)
     {
       int num_iters = tree_low_cst (niters, 1);
       if ((num_iters % unroll_factor) == 0)
         perfect_unrolling = true;
     }

   /* If the number of iterations can be divided by the unrolling
      factor, we have perfect unrolling.  */
   if (perfect_unrolling)
     tree_unroll_perfect_loop (loop, unroll_factor, single_dom_exit (loop));
   else
     tree_unroll_loop (loop, unroll_factor, single_dom_exit (loop), &desc);
   unrolled = true;
 }
  }  
  
  free_original_copy_tables ();
  
  /* Need to rebuild the call graph if new function calls are created due to
     loop unrolling.
     FIXME: rebuilding the cgraph will lose some information, like the
     reason for not inlining.  */
  if(unrolled)  
rebuild_cgraph_edges();
  /*debug_cgraph();*/
  
  return 0;
}

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Jean Christophe Beyler
> Sent: 14 October 2009 19:29
> To: Zdenek Dvorak
> Cc: gcc@gcc.gnu.org
> Subject: Re: Turning off unrolling to certain loops
> 
> Ok, I've actually gone a different route. Instead of waiting for the
> middle end to perform this, I've directly modified the parser stage to
> unroll the loop directly there.
> 
> Basically, I take the parser of the for and modify how it adds the
> various statements. Telling it to, instead of doing in the
> c_finish_loop :
> 
>   if (body)
> add_stmt (body);
>   if (clab)
> add_stmt (build1 (LABEL_EXPR, void_type_node, clab));
>   if (incr)
> add_stmt (incr);
> ...
> 
> I tell it to add multiple copies of body and incr and the at the end
> add in the loop the rest of it. I've also added support to remove
> further unrolling to these modified loops and will be handling the
> "No-unroll" pragma. I then let the rest of the optimization passes,
> fuse the incrementations together if possible,  etc.
> 
> The initial results are quite good and seem to work and 
> produce good code.
> 
> Currently, there are two possibilities :
> 
> - If the loop is not in the form we want, for example:
> 
> for (;i {
> ...
> }
> 
> Do we still unroll even though we have to trust the user that the
> number of unrolling will not break the semantics ?
> 

RE: Turning off unrolling to certain loops

2009-10-15 Thread Bingfeng Mei
Jc,
How did you implement #pragma unroll?  I checked other compilers: the
pragma should govern the next immediate loop. It took me a while to
find a not-so-elegant way to do that. I also implemented #pragma ivdep.
This information is supposed to be passed through both the tree and RTL
levels and survive all GCC optimizations. I still have problems handling
the combination of unroll and ivdep.

Bingfeng

> -Original Message-
> From: fearyours...@gmail.com [mailto:fearyours...@gmail.com] 
> On Behalf Of Jean Christophe Beyler
> Sent: 15 October 2009 16:34
> To: Zdenek Dvorak
> Cc: Bingfeng Mei; gcc@gcc.gnu.org
> Subject: Re: Turning off unrolling to certain loops
> 
> You are entirely correct, I hadn't thought that through enough.
> 
> So I backtracked and have just merged what Bingfeng Mei has done with
> your code and have now a corrected version of the loop unrolling.
> 
> What I did was directly modified tree_unroll_loop to handle the case
> of a perfect unroll or not internally and then used something similar
> to what you had done around that. I added what I think is needed to
> stop unrolling of those loops in later passes.
> 
> I'll be starting my tests but I can port it to 4.5 if you are
> interested to see what I did.
> Jc
> 
> On Thu, Oct 15, 2009 at 6:00 AM, Zdenek Dvorak 
>  wrote:
> > Hi,
> >
> >> I faced a similar issue a while ago. I filed a bug report
> >> (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712) In the end,
> >> I implemented a simple tree-level unrolling pass in our port
> >> which uses all the existing infrastructure. It works quite well for
> >> our purpose, but I hesitated to submit the patch because 
> it contains
> >> our not-very-elegannt #prgama unroll implementation.
> >
> > could you please do so anyway?  Even if there are some 
> issues with the
> > #prgama unroll implementation, it could serve as a basis of a usable
> > patch.
> >
> >> /* Perfect unrolling of a loop */
> >> static void tree_unroll_perfect_loop (struct loop *loop, 
> unsigned factor,
> >>                 edge exit)
> >> {
> >> ...
> >> }
> >>
> >>
> >>
> >> /* Go through all the loops:
> >>      1. Determine unrolling factor
> >>      2. Unroll loops in different conditions
> >>         -- perfect loop: no extra copy of original loop
> >>         -- other loops: the original version of loops to 
> execute the remaining iterations
> >> */
> >> static unsigned int rest_of_tree_unroll (void)
> >> {
> > ...
> >>        tree niters = number_of_exit_cond_executions(loop);
> >>
> >>        bool perfect_unrolling = false;
> >>        if(niters != NULL_TREE && niters!= chrec_dont_know 
> && TREE_CODE(niters) == INTEGER_CST){
> >>          int num_iters = tree_low_cst(niters, 1);
> >>          if((num_iters % unroll_factor) == 0)
> >>            perfect_unrolling = true;
> >>        }
> >>
> >>        /* If no. of iterations can be divided by unrolling 
> factor, we have perfect unrolling */
> >>        if(perfect_unrolling){
> >>          tree_unroll_perfect_loop(loop, unroll_factor, 
> single_dom_exit(loop));
> >>        }
> >>        else{
> >>          tree_unroll_loop (loop, unroll_factor, 
> single_dom_exit (loop), &desc);
> >>        }
> >
> > It would be better to move this test to tree_unroll_loop, and not
> > duplicate its code in tree_unroll_perfect_loop.
> >
> > Zdenek
> >
> 
> 


RE: Turning off unrolling to certain loops

2009-10-16 Thread Bingfeng Mei
The basic idea is the same as what is described in this thread:
 http://gcc.gnu.org/ml/gcc/2008-05/msg00436.html
But I made many changes to Alex's method.

Pragmas are encoded into the names of helper functions because those
are not optimized out by the tree-level optimizers. These pseudo
functions are either consumed by target-builtins.c if the pragma is
only used at the tree level (unroll), or replaced in target-builtins.c
with a special RTL NOTE (ivdep).

To ensure the right scope of these loop pragmas, I also modified
c_parser_for_statement, c_parser_while_statement, etc., to check the
loop level. I define that these pragmas only control the next
innermost loop. Once the right scope of the pragma is determined,
I actually generate two helper functions for each pragma; the second
one closes the scope of the pragma.

When the pragma is used, I just search backwards for the preceding
helper function (tree level) or special note (RTL level).
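
To make that concrete, here is roughly what the parser conceptually
emits (the helper names below are invented for illustration; our port
generates its own internal declarations):

extern void __pragma_unroll_4_start (void); /* factor encoded in the name */
extern void __pragma_unroll_end (void);     /* closes the pragma's scope */

void scale (int *a, const int *b, int n)
{
  int i;
  /* The source had "#pragma unroll 4" immediately before this loop.  */
  __pragma_unroll_4_start ();
  for (i = 0; i < n; i++)
    a[i] += b[i];
  __pragma_unroll_end ();
}

The unrolling pass reads the factor from the name of the nearest preceding
start helper, and the end helper keeps the pragma from leaking into any
following loop.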


Bingfeng

> -Original Message-
> From: fearyours...@gmail.com [mailto:fearyours...@gmail.com] 
> On Behalf Of Jean Christophe Beyler
> Sent: 15 October 2009 17:27
> To: Bingfeng Mei
> Cc: Zdenek Dvorak; gcc@gcc.gnu.org
> Subject: Re: Turning off unrolling to certain loops
> 
> I implemented it like this:
> 
> - I modified c_parser_for_statement to include a pragma tree node in
> the loop with the unrolling request as an argument
> 
> - Then during my pass to handle unrolling, I parse the loop 
> to find the pragma.
>  - I retrieve the unrolling factor and use a merge of Zdenek's
> code and yours to perform either a perfect unrolling or  and register
> it in the loop structure
> 
>   - During the following passes that handle loop unrolling, I
> look at that variable in the loop structure to see if yes or no, we
> should allow the normal execution of the unrolling
> 
>   - During the expand, I transform the pragma into a note that
> will allow me to remove any unrolling at that point.
> 
> That is what I did and it seems to work quite well.
> 
> Of course, I have a few things I am currently considering:
> - Is there really a position that is better for the pragma node in
> the tree representation ?
> - Can other passes actually move that node out of a given loop
> before I register it in the loop ?
> - Should I, instead of keeping that node through the tree
> optimizations, actually remove it and later on add it just before
> expansion ?
> - Can an RTL pass remove notes or move them out of a loop ?
> - Can the tree node or note change whether or not an optimization
> takes place?
> - It is important to note that after the registration, the pragma
> node or note are actually just there to say "don't do anything",
> therefore, the number of nodes or notes in the loop is not important
> as long as they are not moved out.
> 
> Those are my current concerns :-), I can give more 
> information if requested,
> Jc
> 
> PS: What exactly was your solution to this issue?
> 
> 
> On Thu, Oct 15, 2009 at 12:11 PM, Bingfeng Mei 
>  wrote:
> > Jc,
> > How did you implement #pragma unroll?  I checked other 
> compilers. The
> > pragma should govern the next immediate loop. It took me a while to
> > find a not-so-elegant way to do that. I also implemented 
> #pragma ivdep.
> > These information are supposed to be passed through both 
> tree and RTL
> > levels and suvive all GCC optimization. I still have 
> problem in handling
> > combination of unroll and ivdep.
> >
> > Bingfeng
> >
> >> -Original Message-
> >> From: fearyours...@gmail.com [mailto:fearyours...@gmail.com]
> >> On Behalf Of Jean Christophe Beyler
> >> Sent: 15 October 2009 16:34
> >> To: Zdenek Dvorak
> >> Cc: Bingfeng Mei; gcc@gcc.gnu.org
> >> Subject: Re: Turning off unrolling to certain loops
> >>
> >> You are entirely correct, I hadn't thought that through enough.
> >>
> >> So I backtracked and have just merged what Bingfeng Mei 
> has done with
> >> your code and have now a corrected version of the loop unrolling.
> >>
> >> What I did was directly modified tree_unroll_loop to 
> handle the case
> >> of a perfect unroll or not internally and then used 
> something similar
> >> to what you had done around that. I added what I think is needed to
> >> stop unrolling of those loops in later passes.
> >>
> >> I'll be starting my tests but I can port it to 4.5 if you are
> >> interested to see what I did.
> >> Jc
> >>
> >> On Thu, Oct 15, 2009 at 6:00 AM, Zdenek Dvorak
> >>  wrote:
> >

How to avoid a tree node being garbage collected after C frontend?

2009-11-09 Thread Bingfeng Mei
Hello,
I need to pass a tree node (a section name from pragma processing)
from the C frontend to the main GCC body (it is used in
TARGET_INSERT_ATTRIBUTES). I store the node in a global pointer array
declared in target.c. But the tree node is garbage collected at the end
of the c-parser pass, which causes an ICE later on. I am not familiar
with the GC part at all. How do I prevent this from happening?

I checked other targets. It seems v850 uses almost the same approach
to implement its section-name pragma; I am not sure if it has the same
problem. Also, the issue is very sensitive to certain conditions. For
example, with the -save-temps option the bug disappears.

Thanks,
Bingfeng Mei


RE: How to avoid a tree node being garbage collected after C frontend?

2009-11-10 Thread Bingfeng Mei
Ian, 
Thanks. I tried to follow the examples, but it still doesn't work. 
Here is the related code:

in target-c.c:
extern GTY(()) tree pragma_ghs_sections[GHS_SECTION_COUNT];

...

pragma_ghs_sections[sec_num] = copy_node (sec_name);


in target.c:

...
  section_name = pragma_ghs_sections[sec_num];

  if (section_name == NULL_TREE)
return;

  DECL_SECTION_NAME(decl) = section_name;
...


When I watch the memory that pragma_ghs_sections[sec_num] points to,
it is modified by:

#0  0x006cb3e7 in memset () from /lib/tls/libc.so.6
#1  0xc4f0 in ?? ()
#2  0x08120da4 in poison_pages () at ../../src/gcc/ggc-page.c:1854
#3  0x08120ee6 in ggc_collect () at ../../src/gcc/ggc-page.c:1945
#4  0x080f3692 in c_parser_translation_unit (parser=0xf7f92834) at 
../../src/gcc/c-parser.c:978
#5  0x08103bd7 in c_parse_file () at ../../src/gcc/c-parser.c:8290

So target.c won't get the correct section_name.

What is wrong here? I understood that the GTY marker tells the GC that
this global pointer holds references to GC-allocated memory.


Thanks for any input,
Bingfeng 

> -Original Message-
> From: Ian Lance Taylor [mailto:i...@google.com] 
> Sent: 09 November 2009 19:00
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: How to avoid a tree node being garbage collected 
> after C frontend?
> 
> "Bingfeng Mei"  writes:
> 
> > I need to pass a tree node (section name from processing pragmas)
> > from C frontend to main GCC body (used in 
> TARGET_INSERT_ATTRIBUTES). 
> > I store the node in a global pointer array delcared in target.c.
> > But the tree node is garbage collected in the end of c-parser
> > pass, and causes an ICE later on. I am not familiar with GC part 
> > at all. How to prevent this from hanppening?
> 
> Mark the global variable with GTY(()).  See many many existing
> examples.
> 
> Ian
> 
> 


RE: How to avoid a tree node being garbage collected after C frontend?

2009-11-10 Thread Bingfeng Mei
Thanks, it works. I should have read the internals manual
more carefully :-)
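
For the record, the pattern that ended up working looks roughly like this
(a sketch with invented names rather than our actual sources; the real
array is pragma_ghs_sections in target-c.c):

/* In the target file that gengtype processes (the file also has to be
   listed in the GTFILES machinery for the port).  */
static GTY(()) tree my_pragma_sections[4]; /* GC root: entries survive collection */

void
my_record_section_name (int sec_num, tree sec_name)
{
  my_pragma_sections[sec_num] = sec_name;
}

/* At the end of the file, include the header generated by gengtype.  */
#include "gt-target-c.h"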

Cheers,
Bingfeng 

> -Original Message-
> From: Basile STARYNKEVITCH [mailto:bas...@starynkevitch.net] 
> Sent: 10 November 2009 12:20
> To: Bingfeng Mei
> Cc: Ian Lance Taylor; gcc@gcc.gnu.org
> Subject: Re: How to avoid a tree node being garbage collected 
> after C frontend?
> 
> Bingfeng Mei wrote:
> > Ian, 
> > Thanks. I tried to follow the examples, but it still doesn't work. 
> > Here is the related code:
> > 
> > in target-c.c:
> > extern GTY(()) tree pragma_ghs_sections[GHS_SECTION_COUNT];
> > 
> 
> Perhaps you need to make sure that target-c.c is processed by 
> gengtype, and that it does include the generated 
> gt-target-c.h file.
> 
> Regards.
> 
> 
> -- 
> Basile STARYNKEVITCH http://starynkevitch.net/Basile/
> email: basilestarynkevitchnet mobile: +33 6 8501 2359
> 8, rue de la Faiencerie, 92340 Bourg La Reine, France
> *** opinions {are only mines, sont seulement les miennes} ***
> 
> 


Is this patch of vector shift in 4.5?

2009-11-10 Thread Bingfeng Mei
Hello, Andrew,

I am wondering whether the patch you mentioned has
made it into 4.5:
http://gcc.gnu.org/ml/gcc/2009-02/msg00381.html

We would like to support it in our port if the frontend
has been adapted to support it.

Thanks,
Bingfeng 



RE: help on - adding a new pass to gcc

2009-11-10 Thread Bingfeng Mei
Did you add your new object file to the OBJS-common list in Makefile.in?

Bingfeng



> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of ddmetro
> Sent: 10 November 2009 16:25
> To: gcc@gcc.gnu.org
> Subject: help on - adding a new pass to gcc
> 
> 
> Hi All,
>  We are adding a new pass for - structural hazard 
> optimization - in
> gcc.
>  We have added a rtl_opt_pass variable(pass_sched3) 
> declaration in
> tree-pass.h and defined the same in a new file - sched-by-category.c
>  We then added a target in gcc/Makefile.in, as follows:
> sched-by-category.o : sched-by-category.c $(CONFIG_H) $(SYSTEM_H)
> coretypes.h $(TM_H) \
>$(RTL_H) $(SCHED_INT_H) $(REGS_H) hard-reg-set.h 
> $(FLAGS_H) insn-config.h
> \
>$(FUNCTION_H) $(INSN_ATTR_H) $(TOPLEV_H) $(RECOG_H) 
> except.h $(PARAMS_H)
> \
>$(TM_P_H) $(TARGET_H) $(CFGLAYOUT_H) $(TIMEVAR_H) tree-pass.h  \
>$(DBGCNT_H)
> 
>  We are getting an error in passes.c - undefined reference to
> 'pass_sched3'
>  Kindly guide us as to what is wrong in our approach 
> of adding a new
> file to gcc build.
> 
> Thanking You,
>   Dhiraj.
> -- 
> View this message in context: 
> http://old.nabble.com/help-on---adding-a-new-pass-to-gcc-tp262
> 86452p26286452.html
> Sent from the gcc - Dev mailing list archive at Nabble.com.
> 
> 
> 


Loop pragmas dilemma

2009-11-18 Thread Bingfeng Mei
Hi,
Due to the pressing requirements of our target processor/application, I
am implementing several popular loop pragmas in our private port.
I've already implemented "unroll" and "ivdep", and am now working
on "loop_count" to give GCC hints about the number of iterations.

The problem I am now facing is that GCC has many loop optimizations
at both the tree and RTL levels that change loop properties. For example,
loop versioning with unrolling done by the predcom pass, and loop fission
done by the graphite passes. This makes loop_count simply wrong for the
transformed loop(s). What is the best strategy? Updating the loop_count
pragma to track the changed loops, or disabling these loop optimizations
altogether in the presence of a loop pragma?

To a lesser extent, loop optimizations also affect the other loop pragmas.
For example, I have to disable the cunroll pass in the presence of #pragma
unroll because it is confusing for the user.

Does anyone know how other compilers, e.g., icc, handle
such issues? 

Thanks for any input,
Bingfeng Mei

Broadcom UK



RE: Worth balancing the tree before scheduling?

2009-11-25 Thread Bingfeng Mei
Hello,
It seems to me that tree balancing risk of producing wrong result due
to overflow of subexpression. 

Say a = INT_MIN, b = 10, c = 10, d = INT_MAX. 

If
((a + b) + c) + d))

becomes 
((a + b) + (c + d))

c + d will overflow and the original won't. So the behaviour of
two are different. Though the architecture may manage to produce
correct result, it is undefined I think. 
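
A minimal compilable illustration of the point (the function names are
just for illustration):

int sum_left (int a, int b, int c, int d)
{
  /* With a = INT_MIN, b = 10, c = 10, d = INT_MAX, no intermediate sum
     overflows here.  */
  return ((a + b) + c) + d;
}

int sum_balanced (int a, int b, int c, int d)
{
  /* With the same inputs, c + d overflows: undefined behaviour for
     signed int, even if the hardware happens to wrap to the same
     value.  */
  return (a + b) + (c + d);
}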


Cheers,
Bingfeng



> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Ian Bolton
> Sent: 20 November 2009 15:05
> To: gcc@gcc.gnu.org
> Subject: Worth balancing the tree before scheduling?
> 
> From some simple experiments (see below), it appears as 
> though GCC aims
> to
> create a lop-sided tree when there are constants involved 
> (func1 below),
> but a balanced tree when there aren't (func2 below).
> 
> Our assumption is that GCC likes having constants all near to 
> each other
> to
> aid with tree-based optimisations, but I'm fairly sure that, when it
> comes
> to scheduling, it would be better to have a balanced tree, so 
> sched has
> more
> choices about what to schedule next?
> 
> The impact of limiting sched's options can be seen if we look at the
> pseudo-assembly produced by GCC for our architecture:
> 
> func1:
> LSHIFT  $c3,$c1,3 # tmp137, a,
> ADD $c2,$c2,$c3   # tmp138, b, tmp137
> ADD $c1,$c2,$c1   #, tmp138, a
> 
> We think it would be better to avoid using the temporary:
> 
> func1:
> ADD $c2,$c1,$c2 # tmp137, a, b
> LSHIFT  $c1,$c1,3   # tmp138, a,
> ADD $c1,$c2,$c1 # , tmp137, tmp138
> 
> As it currently stands, sched doesn't have the option to do 
> this because
> its input (shown in func.c.172r.asmcons below) is arranged 
> such that the
> first add depends on the shift and the second add depends on the first
> add.
> 
> If the tree were balanced, sched would have the option to do the add
> first.
> And, providing the logic was there in sched, we could make it 
> choose to
> schedule such that we limit the number of temporaries used.
> 
> Maybe one of the RTL passes prior to scheduling has the potential to
> balance the tree/RTL, but just isn't on our architecture?
> 
> ==
> func.c:
> --
> int func1 (int a, int b)
> {
>   /* the original expression */
>   return a + b + (a << 3);
> }
>  
> 
> int func2 (int a, int b, int c)
> {
>   /* the original expression */
>   return a + b + (a << c);
> }
>  
> 
> ==
> 
> ==
> func.c.129t.supress_extend:
> --
> ;; Function func1 (func1)
>  
> func1 (int a, int b)
> {
> :
>   return (b + (a << 3)) + a;
> }
> 
> func2 (int a, int b, int c)
> {
> :
>   return (b + a) + (a << c);
> }
> 
>  
> ==
> 
> ==
> func.c.172r.asmcons:
> --
> 
> ;; Function func1 (func1)
> 
> ;; Pred edge  ENTRY [100.0%]  (fallthru)
> (note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
>  
> (insn 2 5 3 2 func.c:2 (set (reg/v:SI 134 [ a ])
> (reg:SI 1 $c1 [ a ])) 45 {*movsi} (expr_list:REG_DEAD 
> (reg:SI 1
> $c1 [ a ])
> (nil)))
>  
> (note 3 2 4 2 NOTE_INSN_DELETED)
>  
> (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
>  
> (insn 7 4 8 2 func.c:2 (set (reg:SI 137)
> (ashift:SI (reg/v:SI 134 [ a ])
> (const_int 3 [0x3]))) 80 {ashlsi3} (nil))
>  
> (insn 8 7 9 2 func.c:2 (set (reg:SI 138)
> (plus:SI (reg:SI 2 $c2 [ b ])
> (reg:SI 137))) 65 {*addsi3} (expr_list:REG_DEAD 
> (reg:SI 137)
> (expr_list:REG_DEAD (reg:SI 2 $c2 [ b ])
> (nil
>  
> 
> (note 9 8 14 2 NOTE_INSN_DELETED)
>  
> 
> (insn 14 9 20 2 func.c:5 (set (reg/i:SI 1 $c1)
> (plus:SI (reg:SI 138)
> (reg/v:SI 134 [ a ]))) 65 {*addsi3} (expr_list:REG_DEAD
> (reg:SI 138)
> (expr_list:REG_DEAD (reg/v:SI 134 [ a ])
> (nil
>  
> 
> (insn 20 14 0 2 func.c:5 (use (reg/i:SI 1 $c1)) -1 (nil))
> 
> ;; Function func2 (func2)
> 
> ;; Pred edge  ENTRY [100.0%]  (fallthru)
> (note 6 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
>  
> 
> (insn 2 6 3 2 func.c:8 (set (reg/v:SI 134 [ a ])
> (reg:SI 1 $c1 [ a ])) 45 {*movsi} (expr_list:REG_DEAD 
> (reg:SI 1
> $c1 [ a ])
> (nil)))
>  
> 
> (note 3 2 4 2 NOTE_INSN_DELETED)
>  
> 
> (note 4 3 5 2 NOTE_INSN_DELETED)
>  
> 
> (note 5 4 8 2 NOTE_INSN_FUNCTION_BEG)
>  
> 
> (insn 8 5 9 2 func.c:8 (set (reg:SI 138)
> (plus:SI (reg:SI 2 $c2 [ b ])
> (reg/v:SI 134 [ a ]))) 65 {*addsi3} (expr_list:REG_DEAD
> (reg:SI 2 $c2 [ b ])
> (nil)))
>  
> 
> (insn 9 8 10 2 func.c:8 (set (reg:SI 139)
> (ashift:SI (reg/v:SI 134 [ a ])
> (reg:SI 3 $c3 [ c ]))) 80 {ashlsi3} (expr_list:REG_DEAD
> (reg/v:SI 134 [ a ])
> (expr_list:REG_DEAD (reg:SI 3 $c3 [ 

RE: HELP: data dependence

2009-12-03 Thread Bingfeng Mei
Data dependence analysis is done in sched-deps.c. You can have a look
at the build_intra_loop_deps function in ddg.c (which constructs the data
dependence graph for the modulo scheduler) to see how it is used.

Bingfeng

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Jianzhang Peng
> Sent: 03 December 2009 09:56
> To: gcc@gcc.gnu.org
> Subject: HELP: data dependence
> 
> I want to get  data dependence information about an basic block, which
> contains RTLs.
> What functions or data structure should I use ?
> 
> thanks
> 
> -- 
> Jianzhang Peng
> 
> 


Unnecessary PRE optimization

2009-12-23 Thread Bingfeng Mei
Hello,
I encountered an issue with the PRE optimization, which created worse
code than no optimization.

This is the test function:

void foo(int *data, int *m_v4w, int num)
{
  int i;
  int m0;
  for( i=0; i

RE: Unnecessary PRE optimization

2009-12-23 Thread Bingfeng Mei
-O2 

> -Original Message-
> From: Steven Bosscher [mailto:stevenb@gmail.com] 
> Sent: 23 December 2009 12:01
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org; dber...@dberlin.org
> Subject: Re: Unnecessary PRE optimization
> 
> On Wed, Dec 23, 2009 at 12:49 PM, Bingfeng Mei 
>  wrote:
> > Hello,
> > I encounter an issue with PRE optimization, which created worse
> 
> Is this at -O2 or -O3?
> 
> Ciao!
> Steven
> 
> 


RE: Unnecessary PRE optimization

2009-12-23 Thread Bingfeng Mei
Do you mean that if TARGET_ADDRESS_COST (non-x86) is defined properly,
this should be fixed? Or does it require an extra patch?

Bingfeng

> -Original Message-
> From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On 
> Behalf Of Paolo Bonzini
> Sent: 23 December 2009 13:28
> To: Steven Bosscher
> Cc: Bingfeng Mei; gcc@gcc.gnu.org; dber...@dberlin.org
> Subject: Re: Unnecessary PRE optimization
> 
> On 12/23/2009 01:01 PM, Steven Bosscher wrote:
> > On Wed, Dec 23, 2009 at 12:49 PM, Bingfeng 
> Mei  wrote:
> >> Hello,
> >> I encounter an issue with PRE optimization, which created worse
> >
> > Is this at -O2 or -O3?
> 
> I think this could be fixed if fwprop propagated addresses 
> into loops; 
> it doesn't because it made performance worse on x86.  The 
> real reason is 
> "address_cost on x86 sucks and nobody knows how to fix it 
> exactly", but 
> the performance hit was bad enough that we (Steven Bosscher and I) 
> decided to put that hack into fwprop.
> 
> Paolo
> 
> 


RE: Unnecessary PRE optimization

2009-12-23 Thread Bingfeng Mei
It seems that just commenting out this check in fwprop.c should work.
 
 /* Do not propagate loop invariant definitions inside the loop.  */
/*  if (DF_REF_BB (def)->loop_father != DF_REF_BB (use)->loop_father)
return;*/

Bingfeng

> -Original Message-
> From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On 
> Behalf Of Paolo Bonzini
> Sent: 23 December 2009 15:01
> To: Bingfeng Mei
> Cc: Steven Bosscher; gcc@gcc.gnu.org; dber...@dberlin.org
> Subject: Re: Unnecessary PRE optimization
> 
> On 12/23/2009 03:27 PM, Bingfeng Mei wrote:
> > Do you mean if TARGET_ADDRES_COST (non-x86) is defined properly,
> > this should be fixed?  Or it requires extra patch?
> 
> No, if TARGET_ADDRESS_COST was fixed for x86 (and of course defined 
> properly for your target), we could fix this very easily.
> 
> Paolo
> 
> 


RE: PowerPC : GCC2 optimises better than GCC4???

2010-01-04 Thread Bingfeng Mei
I can confirm that our target also generates GOOD code for this case.
Maybe this is an EABI or target-specific thing, where the struct/union is
forced to memory.

Bingfeng
Broadcom Uk

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Andrew Haley
> Sent: 04 January 2010 16:08
> To: gcc@gcc.gnu.org
> Subject: Re: PowerPC : GCC2 optimises better than GCC4???
> 
> On 01/04/2010 12:07 PM, Jakub Jelinek wrote:
> > On Mon, Jan 04, 2010 at 12:18:50PM +0100, Steven Bosscher wrote:
> >>On Mon, Jan 4, 2010 at 12:02 PM, Andrew Haley 
>  wrote:
> >>> This optimization is done by the first RTL cse pass.  I 
> can't understand
> >>> why it's not being done for your target.  I guess this will need a
> >>> powerpc expert.
> >>
> >> Known bug, see http://gcc.gnu.org/PR22141
> > 
> > That's unrelated.  PR22141 is about (lack of) merging of 
> adjacent stores of
> > constant values into memory, but there are no memory stores 
> involved here,
> > everything is in registers, so PR22141 patch will make zero 
> difference here.
> > 
> > IMHO we really should have some late tree pass that 
> converts adjacent
> > bitfield operations into integral operations on 
> non-bitfields (likely with
> > alias set of the whole containing aggregate), as at the RTL 
> level many cases
> > are simply too many instructions for combine etc. to 
> optimize them properly,
> > while at the tree level it could be simpler.
> 
> Yabbut, how come RTL cse can handle it in x86_64, but PPC not?
> 
> Andrew.
> 
> 


RE: GCC-How does the coding style affect the insv pattern recognization?

2010-01-13 Thread Bingfeng Mei
Your instruction is likely too specific to be picked up by GCC.
You may use an intrinsic for it.

Bingfeng 

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of fanqifei
> Sent: 12 January 2010 12:50
> To: gcc@gcc.gnu.org
> Subject: GCC-How does the coding style affect the insv 
> pattern recognization?
> 
> Hi,
> I am working on a micro controller and trying to port 
> gcc(4.3.2) for it.
> There is insv instruction in our micro controller and I have add
> define_insn to machine description file.
> However, the insv instruction can only be generated when the code
> is written like below.  If the code is written using logical shift and
> or operators, the insv instruction will not be generated.
> For the statement: x= (x&0xFF00) | ((i<<16)&0x00FF);
> 6 RTL instructions are generated after combine pass and 8
> instructions are generated in the assembly file.
> Paolo Bonzini said that insv instruction might be synthesized
> later by combine. But combine only works on at most 3 instructions and
> insv is not generated in such case.
> So exactly when will the insv pattern be recognized and how does
> the coding style affect it?
> Is there any open bug report about this?
> 
> struct test_foo {
> unsigned int a:18;
> unsigned int b:2;
> unsigned int c:12;
> };
> 
> struct test_foo x;
> 
> unsigned int foo()
> {
> unsigned int a=x.b;
> x.b=2;
> return a;
> }
> 
> Thanks!
> fanqifei
> 
> 


RE: GCC-How does the coding style affect the insv pattern recognization?

2010-01-13 Thread Bingfeng Mei
Oops, I didn't know that. Anyway, I won't count on GCC to
reliably pick up these complex patterns.  In our port, we
implemented clz/ffs/etc. as intrinsics even though they are present as
standard patterns.

Bingfeng

> -Original Message-
> From: fanqifei [mailto:fanqi...@gmail.com] 
> Sent: 13 January 2010 10:26
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: GCC-How does the coding style affect the insv 
> pattern recognization?
> 
> 2010/1/13 Bingfeng Mei :
> > Your instruction is likely too specific to be picked up by GCC.
> > You may use an intrinisc for it.
> >
> > Bingfeng
> 
> but insv is a standard pattern name.
> the semantics of expression x= (x&0xFF00) | ((i<<16)&0x00FF);
> is exactly what insv can do.
> I all tried mips gcc cross compiler, and ins is also not generated.
> Intrinsic is a way to resolve this though. Maybe there is no 
> other better way.
> 
> BTW,
> There is a special case(the bit position is 0):
> 235: f0 97 fc mvi a9 -0x4;  #move immediate to reg
> 238: ff e9 94 and a9 a14 a9;
> 23b: f0 95 02 or a9 0x2;
> The above three instructions can be replaced by mvi and insv. But the
> fact is not in the combine pass.
> 
> Qifei Fan
> 
> 


A bug on 32-bit host?

2010-01-22 Thread Bingfeng Mei
Hello,
I am tracking a bug and found the lshift_value function in
expmed.c questionable (both HEAD and GCC 4.4).

Suppose HOST_BITS_PER_WIDE_INT = 32, bitpos = 0,
and bitsize = 33; then the following expression is wrong:

high =  (v >> (HOST_BITS_PER_WIDE_INT - bitpos)) 
& ((1 << (bitpos + bitsize - HOST_BITS_PER_WIDE_INT)) - 1);

Shifting v right by 32 bits on a 32-bit machine is undefined. On i386,
v >> 32 results in v, which is not the intention of the function.
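
A stand-alone illustration (not GCC code; the exact result of the undefined
shift is only what I observe on i386, where the hardware masks the shift
count to 5 bits):

#include <stdio.h>

int main (void)
{
  unsigned int v = 0xdeadbeefu;
  int count = 32;                 /* runtime value, so the compiler cannot fold it */
  /* Undefined in C: shift count equals the width of the type.  On i386
     the count is taken modulo 32, so this behaves like v >> 0.  */
  printf ("%#x\n", v >> count);   /* typically prints 0xdeadbeef */
  return 0;
}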

Cheers,
Bingfeng Mei



static rtx
lshift_value (enum machine_mode mode, rtx value, int bitpos, int bitsize)
{
  unsigned HOST_WIDE_INT v = INTVAL (value);
 
  HOST_WIDE_INT low, high;
 
  if (bitsize < HOST_BITS_PER_WIDE_INT)
v &= ~((HOST_WIDE_INT) -1 << bitsize);
 
  if (bitpos < HOST_BITS_PER_WIDE_INT)
{
  low = v << bitpos;
  /* Obtain value by shifting and set zeros for remaining part*/
  if((bitpos + bitsize) > HOST_BITS_PER_WIDE_INT)
high =  (v >> (HOST_BITS_PER_WIDE_INT - bitpos)) 
& ((1 << (bitpos + bitsize - HOST_BITS_PER_WIDE_INT)) - 1);
  else
high = 0;  
}
  else
{
  low = 0;
  high = v << (bitpos - HOST_BITS_PER_WIDE_INT);
}
 
  return immed_double_const (low, high, mode);
}



RE: A bug on 32-bit host?

2010-01-22 Thread Bingfeng Mei
Oops, that is embarrassing. Usually any local changes are marked with
#ifdef in our port.  I should double-check next time before I report an issue.
Thanks.

Bingfeng

> -Original Message-
> From: Ian Lance Taylor [mailto:i...@google.com] 
> Sent: 22 January 2010 15:04
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: A bug on 32-bit host?
> 
> "Bingfeng Mei"  writes:
> 
> >   /* Obtain value by shifting and set zeros for remaining part*/
> >   if((bitpos + bitsize) > HOST_BITS_PER_WIDE_INT)
> > high =  (v >> (HOST_BITS_PER_WIDE_INT - bitpos)) 
> > & ((1 << (bitpos + bitsize - 
> HOST_BITS_PER_WIDE_INT)) - 1);
> 
> That is not what expmed.c looks like on mainline or on gcc 4.4 branch.
> You must have a local change.
> 
> Ian
> 
> 


RE: GCC calling GNU assembler

2010-02-03 Thread Bingfeng Mei
GCC just emits the string in your asm expression literally, together with the
other assembly code generated by the compiler. Only in the next step is the
assembler invoked by the GCC driver.

Typically, hard register numbers are not used, so that GCC can do register
allocation for the inline assembly operands.
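
For example (a sketch, not tied to any particular target; the "mov" mnemonic,
the operand order and the copy_value function are only illustrative), operand
constraints let GCC pick the registers itself:

static inline int copy_value (int src)
{
  int dst;
  /* "r" constraints ask the compiler to allocate registers, so the asm
     composes with normal register allocation instead of pinning r0/r1.  */
  asm ("mov %0, %1" : "=r" (dst) : "r" (src));
  return dst;
}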

Bingfeng 

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Nikola Ikonic
> Sent: 03 February 2010 09:27
> To: gcc@gcc.gnu.org
> Subject: GCC calling GNU assembler
> 
> Hello all,
> 
> Could anybody please answer me on following question:
> 
> where is GCC callin assembler where it recognizes assembler code in C
> function? For example, let's say that there is this line in C code:
> 
> asm("mov r1,r0");
> 
> So, the parser parses this as an assembler string. But where, in GCC
> code, is assembler called to process this string?
> Or maybe the question is where this "mov r1, r0" string is passed to
> assembler. Anyway, I think you got my question.
> 
> Thanks in advance!
> 
> Best regards,
>Nikola
> 
> 


RE: Function versioning tests?

2010-02-22 Thread Bingfeng Mei
Hi,
GCC 4.5 already contains such a patch:
http://gcc.gnu.org/ml/gcc-patches/2009-03/msg01186.html
If you are working on the 4.4 branch, you can just apply the patch without problems.

Cheers,
Bingfeng

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Ian Bolton
> Sent: 19 February 2010 17:09
> To: gcc@gcc.gnu.org
> Subject: Function versioning tests?
> 
> Hi there,
> 
> I've changed our private port of GCC to give versioned functions
> better names (rather than T.0, T.1), and was wondering if there
> are any existing tests that push function-versioning to the limit,
> so I can test whether my naming scheme is sound.
> 
> Failing that, I'd appreciate some pointers on how I might make
> such a test.  I know I need to be passing a constant in as a
> parameter, but I don't know what other criteria are required to
> make it trigger.
> 
> Cheers,
> Ian
> 
> 


Issue in combine pass.

2010-03-25 Thread Bingfeng Mei
Hello, 
I experienced an ICE for no-scevccp-outer-16.c in our port. It does not seem to
occur in other ports, so I couldn't file a bug report.

Basically, the problem appears after the following transformations in
expand_compound_operation (combine.c).

Enter expand_compound_operation 
x:
(zero_extend:SI (subreg:HI (plus:V4HI (reg:V4HI 143 [ vect_var_.65 ])
(reg:V4HI 142 [ vect_var_.65 ])) 0))

tem = gen_lowpart (mode, XEXP (x, 0));
tem = (subreg:SI (plus:V4HI (reg:V4HI 143 [ vect_var_.65 ])
(reg:V4HI 142 [ vect_var_.65 ])) 0)

tem = simplify_shift_const (NULL_RTX, ASHIFT, mode,
  tem, modewidth - pos - len);
tem = (subreg:SI (ashift:V4HI (plus:V4HI (reg:V4HI 143 [ vect_var_.65 ])
(reg:V4HI 142 [ vect_var_.65 ]))
(const_int 16 [0x10])) 0)   

tem = simplify_shift_const (NULL_RTX, unsignedp ? LSHIFTRT : ASHIFTRT,
  mode, tem, modewidth - len);

/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c:
 In function 'main':
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c:59:1:
 internal compiler error: in trunc_int_for_mode, at explow.c:56
Please submit a full bug report,

#0  internal_error (gmsgid=0xe9aa77 "in %s, at %s:%d") at 
../../src/gcc/diagnostic.c:707
#1  0x005acf23 in fancy_abort (file=0xea8453 "../../src/gcc/explow.c", 
line=56, function=0xea8440 "trunc_int_for_mode") at 
../../src/gcc/diagnostic.c:763
#2  0x0060528b in trunc_int_for_mode (c=65535, mode=V4HImode) at 
../../src/gcc/explow.c:56
#3  0x005edf24 in gen_int_mode (c=65535, mode=V4HImode) at 
../../src/gcc/emit-rtl.c:459
#4  0x00cf22d9 in simplify_and_const_int (x=0x0, mode=V4HImode, 
varop=0x2a957a8420, constop=65535) at ../../src/gcc/combine.c:9038
#5  0x00cf462f in simplify_shift_const_1 (code=LSHIFTRT, 
result_mode=SImode, varop=0x2a957a0600, orig_count=16) at 
../../src/gcc/combine.c:10073
#6  0x00cf47cf in simplify_shift_const (x=0x0, code=LSHIFTRT, 
result_mode=SImode, varop=0x2a957a8408, count=16) at 
../../src/gcc/combine.c:10122
#7  0x00cebbf9 in expand_compound_operation (x=0x2a95789c20) at 
../../src/gcc/combine.c:6517
#8  0x00ce8afe in combine_simplify_rtx (x=0x2a95789c20, 
op0_mode=HImode, in_dest=0) at ../../src/gcc/combine.c:5535
#9  0x00ce6da5 in subst (x=0x2a95789c20, from=0x2a95781ba0, 
to=0x2a957a83a8, in_dest=0, unique_copy=0) at ../../src/gcc/combine.c:4884
#10 0x00ce6b53 in subst (x=0x2a957a0660, from=0x2a95781ba0, 
to=0x2a957a83a8, in_dest=0, unique_copy=0) at ../../src/gcc/combine.c:4812
#11 0x00ce13ed in try_combine (i3=0x2a957a1678, i2=0x2a957a1630, 
i1=0x0, new_direct_jump_p=0x7fbfffeafc) at ../../src/gcc/combine.c:2963
...


It seems to me that both gen_lowpart and simplify_shift_const do the wrong
thing in handling vector types. (zero_extend:SI (subreg:HI (V4HI)) is not equal
to (subreg:SI (V4HI)), is it?  simplify_shift_const produces (ashift:V4HI
(V4HI..) (16)), which is not right either. Does shifting a vector by a
constant value mean shifting every element of the vector, or treating the
vector as a single entity? The internals manual is not very clear about that.


Thanks,
Bingfeng Mei


Release novops attribute for external use?

2010-04-12 Thread Bingfeng Mei
Hello,
One of our engineers requested a feature so that the
compiler can avoid reloading variables after a function
call if the callee is known not to write to memory. It should
cut considerable code size in our applications. I found that
the existing "pure" and "const" attributes cannot meet his requirements
because the call is optimized away if the function doesn't return
a value. I almost started to implement a new attribute
in our own port, only to find out that the "novops" attribute is
exactly what we want. Why is "novops" limited to
internal use? Does it have any other implications? Could
we release this attribute for external use as well?

Thanks,
Bingfeng Mei


RE: Release novops attribute for external use?

2010-04-13 Thread Bingfeng Mei
Something like printf (though I read somewhere that the glibc extensions to
printf make it non-pure).

Bingfeng  

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Andrew Haley
> Sent: 12 April 2010 17:34
> To: gcc@gcc.gnu.org
> Subject: Re: Release novops attribute for external use?
> 
> On 04/12/2010 05:27 PM, Bingfeng Mei wrote:
> > Hello,
> > One of our engineers requested a feature so that
> > compiler can avoid to re-load variables after a function
> > call if it is known not to write to memory. It should 
> > slash considerable code size in our applications. I found
> > the existing "pure" and "const" cannot meet his requirements
> > because the function is optimized out if it doesn't return
> > a value.
> 
> If a function doesn't write to memory and it doesn't return a
> value, what is the point of calling it?
> 
> Andrew.
> 
> 


RE: Release novops attribute for external use?

2010-04-13 Thread Bingfeng Mei
> 
> Surely printf writes to global memory (it clobbers the stdout FILE*)
> 
OK, the point is not whether printf is pure or not. Rather, if the
programmer knows that a callee such as printf contains no
memory access that affects operations inside the caller, he
would like a way to tell the compiler so it can optimize the code. Our engineer
gave the following example:

void myfunc(MyStruct *myStruct)
{
  int a, b;
  a = myStruct->a;
  printf("a=%d\n", a);
  b = 2*myStruct->a;  /* I would like the compiler to act as if
                         I had written b = 2*a; */
  ...
}
Providing such an attribute may be potentially dangerous, but it is just
like the "restrict" qualifier and some other attributes, putting the
responsibility for correctness on the programmer. "novops" seems to achieve
that effect, though its semantics don't match exactly what I described.


> As for the original question - novops is internal only because its
> semantics is purely internal and changes with internal aliasing
> changes.
> 
> Now, we still lack a compelling example to see what exact semantics
> you are requesting?  I suppose it might be close to a pure but
> volatile function?  Which you could simulate by
> 
> dummy = pure_fn ();
> asm ("" : "g" (dummy));
> 
> or even
> 
> volatile int dummy = pure_fn ();

These two methods still generate extra code to reload the variables.

Bingfeng


> 
> Richard.
> 
> > Bingfeng
> >
> >> -Original Message-
> >> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On
> >> Behalf Of Andrew Haley
> >> Sent: 12 April 2010 17:34
> >> To: gcc@gcc.gnu.org
> >> Subject: Re: Release novops attribute for external use?
> >>
> >> On 04/12/2010 05:27 PM, Bingfeng Mei wrote:
> >> > Hello,
> >> > One of our engineers requested a feature so that
> >> > compiler can avoid to re-load variables after a function
> >> > call if it is known not to write to memory. It should
> >> > slash considerable code size in our applications. I found
> >> > the existing "pure" and "const" cannot meet his requirements
> >> > because the function is optimized out if it doesn't return
> >> > a value.
> >>
> >> If a function doesn't write to memory and it doesn't return a
> >> value, what is the point of calling it?
> >>
> >> Andrew.
> >>
> >>
> >
> 
> 


RE: Release novops attribute for external use?

2010-04-13 Thread Bingfeng Mei
Thanks! I forgot to declare the function as pure. The empty asm
seems to be a clever trick to keep the call from being optimized away.
I shall tell our engineers to use this instead of implementing a new
attribute.

Bingfeng

> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com] 
> Sent: 13 April 2010 11:25
> To: Bingfeng Mei
> Cc: Andrew Haley; gcc@gcc.gnu.org
> Subject: Re: Release novops attribute for external use?
> 
> On Tue, Apr 13, 2010 at 12:23 PM, Richard Guenther
>  wrote:
> > On Tue, Apr 13, 2010 at 12:15 PM, Bingfeng Mei 
>  wrote:
> >>>
> >>> Surely printf writes to global memory (it clobbers the 
> stdout FILE*)
> >>>
> >> OK, the point is not about whether printf is pure or not. 
> Instead, if
> >> programmer knows the callee function such as printf contains no
> >> memory access that affects operations inside caller 
> function, and he
> >> would like to have a way to optimize the code. Our 
> engineer gave following
> >> example:
> >>
> >>    void myfunc(MyStruct *myStruct)
> >>    {
> >>      int a,b;
> >>      a = myStruct->a;
> >>      printf("a=%d\n",a);
> >>      b = 2*mystruct->a;      // I would like to have the 
> compiler acting as if I had written b = 2*a;
> >>     ...
> >>    }
> >> Providing such attribute may be potentially dangerous. But 
> it is just
> >> like "restrict" qualifier and some other attributes, 
> putting responsibilty
> >> of correctness on the programmer. "novops" seems to 
> achieve that effect,
> >> though its semantics doesn't match exactly what I described.
> >
> > Indeed.  IPA pointer analysis will probably figure it out
> > automagically - that *myStruct didn't escape the unit.
> > Being able to annotate incoming pointers this way would
> > maybe be useful.
> >
> >>> As for the original question - novops is internal only because its
> >>> semantics is purely internal and changes with internal aliasing
> >>> changes.
> >>>
> >>> Now, we still lack a compelling example to see what exact 
> semantics
> >>> you are requesting?  I suppose it might be close to a pure but
> >>> volatile function?  Which you could simulate by
> >>>
> >>> dummy = pure_fn ();
> >>> asm ("" : "g" (dummy));
> >>>
> >>> or even
> >>>
> >>> volatile int dummy = pure_fn ();
> >>
> >> These two methods still generate extra code to reload variables
> >
> > The latter works for me (ok, the store to dummy is retained):
> >
> > extern int myprintf(int) __attribute__((pure));
> > int myfunc (int *p)
> > {
> >  int a;
> >  a = *p;
> >  volatile int dummy = myprintf(a);
> >  return a + *p;
> > }
> >
> > myfunc:
> > .LFB0:
> >        pushq   %rbx
> > .LCFI0:
> >        subq    $16, %rsp
> > .LCFI1:
> >        movl    (%rdi), %ebx
> >        movl    %ebx, %edi
> >        call    myprintf
> >        movl    %eax, 12(%rsp)
> >        leal    (%rbx,%rbx), %eax
> >        addq    $16, %rsp
> > .LCFI2:
> >        popq    %rbx
> > .LCFI3:
> >        ret
> >
> > so we load from %rdi only once.
> 
> And
> 
> extern int myprintf(int) __attribute__((pure));
> int myfunc (int *p)
> {
>   int a;
>   a = *p;
>   int dummy = myprintf(a);
>   asm ("" : : "g" (dummy));
>   return a + *p;
> }
> 
> produces
> 
> myfunc:
> .LFB0:
> pushq   %rbx
> .LCFI0:
> movl(%rdi), %ebx
> movl%ebx, %edi
> callmyprintf
> leal(%rbx,%rbx), %eax
> popq%rbx
> .LCFI1:
> ret
> 
> even better.
> 
> Richard.
> 
> 


Which target has working modulo scheduling?

2008-10-17 Thread Bingfeng Mei
Hello,
I tried to enable modulo scheduling for our VLIW target. It fails even for the
simplest loop. I would like to have a look at how GCC produces a schedule for
other targets. I know that modulo scheduling relies on the doloop_end pattern to
identify a pipelineable loop. There are only a handful of targets supporting
doloop_end. Which among them are known to work well with modulo scheduling?
Thanks in advance.

Cheers,
Bingfeng Mei

Broadcom UK



Is there any plan for "data propagation from Tree SSA to RTL" to be in GCC mainline?

2008-11-03 Thread Bingfeng Mei
Hello,
I found the current modulo pipelining very inefficient for many loops. One reason
is the primitive cross-iteration memory dependency analysis. The
add_inter_loop_mem_dep function in ddg.c simply draws a true dependence between
every write and read pair. This is quite inadequate, since many loops read from
memory at the beginning of the loop body and write to memory at the end. As a
result, we obtain a schedule no better than list scheduling.


I am aware of the work on propagating tree-level dependency info to RTL
(http://sysrun.haifa.il.ibm.com/hrl/greps2007/papers/melnik-propagation-greps2007.pdf).
It should help a lot in improving memory dependency analysis. Is there any
plan for this work to make it into GCC mainline? Thanks in advance.

Kind Regards,
Bingfeng Mei

Broadcom UK


RE: Is there any plan for "data propagation from Tree SSA to RTL" to be in GCC mainline?

2008-11-11 Thread Bingfeng Mei
I found the GSoC project and patch here (2007 only):
http://code.google.com/soc/2007/gcc/appinfo.html?csaid=E0FEBB869A5F65A8

Is this patch only for propagating data dependency info, or does it include
propagating alias info as well?

Bingfeng


> -Original Message-
> From: Andrey Belevantsev 
> [mailto:[EMAIL PROTECTED] On Behalf Of Andrey Belevantsev
> Sent: 09 November 2008 20:31
> To: Diego Novillo
> Cc: Steven Bosscher; Bingfeng Mei; gcc@gcc.gnu.org; 
> [EMAIL PROTECTED]; Daniel Berlin
> Subject: Re: Is there any plan for "data propagation from 
> Tree SSA to RTL" to be in GCC mainline?
> 
> Diego Novillo wrote:
> > On Sun, Nov 9, 2008 at 06:38, Steven Bosscher 
> <[EMAIL PROTECTED]> wrote:
> > 
> >> Wasn't there a GSoC project for this last year?  And this year?
> >>
> >> It'd be interesting to hear if anything came out of that...
> > 
> > Nothing came of that, unfortunately.
> There are two patches, actually.  The patch of propagating data 
> dependences to RTL is ready and working, it wasn't (at that time) 
> committed just because it was initially completed during stage3.  The 
> patch for propagating alias info wasn't finished within the scope of 
> this year's GSoC, unfortunately, and I take it more as my 
> fault than a 
> student's fault, as I failed to help him locally with 
> organizing his work.
> 
> We are nevertheless trying to put some work into finishing 
> this patch. 
> As it is not completed yet, I don't have a subject to 
> discuss.  I hope 
> that before the next stage1 we'll manage to finish the patches and to 
> unify them before submitting, as the mechanism they use for 
> mapping MEMs 
> to trees is the same.  If we'd not finish the second patch, 
> we'll submit 
> the first one anyways.
> 
> Sorry for not writing this earlier -- I've had a few busy 
> months (mostly 
> finishing and defending ph.d. thesis :)
> 
> Andrey
> 
> 


RE: Is there any plan for "data propagation from Tree SSA to RTL" to be in GCC mainline?

2008-11-13 Thread Bingfeng Mei
I found it quite hard to merge the patch into the current trunk HEAD since many
things have changed.  Do you know which revision it was based on?  I would like to
test whether it is effective in solving the memory dependency issue in
SMS. Thanks.

Bingfeng

> -Original Message-
> From: Andrey Belevantsev [mailto:[EMAIL PROTECTED] 
> Sent: 11 November 2008 13:53
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Is there any plan for "data propagation from 
> Tree SSA to RTL" to be in GCC mainline?
> 
> Bingfeng Mei wrote:
> > I found the the GsoC project and patch here (only 2007)
> > 
> http://code.google.com/soc/2007/gcc/appinfo.html?csaid=E0FEBB8
> 69A5F65A8
> > 
> > Is this patch only for propagating data dependency or does 
> it include propagating alias info as well?
> The patch at http://gcc.gnu.org/ml/gcc/2007-12/msg00240.html 
> (I presume 
> this is the same patch, I'm just giving you the link to its 
> submission 
> to the GCC ML) only does propagating data dependency info.
> 
> Andrey
> 
> 


RE: generate assembly mnemonic depending the resource allocation

2008-12-03 Thread Bingfeng Mei
You can use C statements to return a modified template string, for example:
(define_insn "addsi3"
  [(set (match_operand:SI 0 "general_register_operand" "=d")
(plus:SI (match_operand:SI 1 "general_register_operand" "d")
 (match_operand:SI 2 "general_register_operand" "d")))]
  ""
  {
    /* slot_used (insn) is a hypothetical helper returning the slot the
       scheduler assigned to this insn.  */
    switch (slot_used (insn))
      {
      case 0:
        return "add-slot0 %0, %1, %2";
      case 1:
        return "add-slot1 %0, %1, %2";
      case 2:
        return "add-slot2 %0, %1, %2";
      default:
        gcc_unreachable ();
      }
  }
 [(set_attr "type" "alu")
   (set_attr "mode" "SI")
   (set_attr "length"   "1")])

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On 
> Behalf Of Alex Turjan
> Sent: 03 December 2008 10:34
> To: gcc@gcc.gnu.org
> Subject: generate assembly mnemonic depending the resource allocation
> 
> Hi all,
> Im building a gcc target for a vliw machine that can execute 
> the same instruction on different resources (slots) and 
> depending on which resources are allocate the instruction 
> must have a different mnemonic. Is it possible in gcc to have 
> for the same define_insn constraints (depending on the 
> allocated architecture resources) different assembly instructions?
> 
> Here is an example:
> Consider the following addSI RTL pattern:
> (define_insn "addsi3"
>   [(set (match_operand:SI 0 "general_register_operand" "=d")
> (plus:SI (match_operand:SI 1 "general_register_operand" "d")
>  (match_operand:SI 2 
> "general_register_operand" "d")))]
>   ""
>   "add %0,%1,%2%"
>  [(set_attr "type" "alu")
>(set_attr "mode" "SI")
>(set_attr "length"   "1")])
> 
> On my target machine "alu" is a reservation that occupies one 
> of the following 3 slots: "slot1|slot2|slot3" and, I need to 
> generate assembly code with different mnemonic depending on 
> which slot the instruction was scheduled:
> 
> add-slot1 %0,%1,%2% // if scheduled on slot 1
> add-slot2 %0,%1,%2% // if scheduled on slot 2
> add-slot3 %0,%1,%2% // if scheduled on slot 3
> 
> Alex
> 
> 


Bug in optimize_bitfield_assignment_op()?

2008-12-18 Thread Bingfeng Mei
Hello,
Our GCC port for our own VLIW processor tracks mainline weekly.  Test
991118-1.c has been failing since two weeks ago.  The following is a simplified
version of 991118-1.c. After some investigation, I found that the following
statement is expanded to RTL incorrectly.

;; tmp2.field = () () ((long long int) 
tmp2.field ^ 0x8765412345678);

(insn 9 8 10 
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/991118-1.c:23
 (set (reg/f:SI 88)
(symbol_ref:SI ("tmp2") [flags 0x2] )) -1 
(nil))

(insn 10 9 11 
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/991118-1.c:23
 (set (reg:DI 89)
(mem/s/j/c:DI (reg/f:SI 88) [0+0 S8 A64])) -1 (nil))

(insn 11 10 12 
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/991118-1.c:23
 (set:DI (reg:DI 90)
(const_int 284280 [0x45678])) -1 (nil)) < wrong constant

(insn 12 11 13 
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/991118-1.c:23
 (set (reg:DI 91)
(xor:DI (reg:DI 89)
(reg:DI 90))) -1 (nil))

(insn 13 12 0 
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/991118-1.c:23
 (set (mem/s/j/c:DI (reg/f:SI 88) [0+0 S8 A64])
(reg:DI 91)) -1 (nil))

Insn 11 only preserves the lower 20 bits of the 52-bit constant.  Further
investigation shows the problem arises in the
optimize_bitfield_assignment_op function (expr.c).

...
case BIT_XOR_EXPR:
  if (TREE_CODE (op1) != INTEGER_CST)
break;
  value = expand_expr (op1, NULL_RTX, GET_MODE (str_rtx), EXPAND_NORMAL);
  value = convert_modes (GET_MODE (str_rtx),
 TYPE_MODE (TREE_TYPE (op1)), value,
 TYPE_UNSIGNED (TREE_TYPE (op1)));

  /* We may be accessing data outside the field, which means
 we can alias adjacent data.  */
  if (MEM_P (str_rtx))
{
  str_rtx = shallow_copy_rtx (str_rtx);
  set_mem_alias_set (str_rtx, 0);
  set_mem_expr (str_rtx, 0);
}

  binop = TREE_CODE (src) == BIT_IOR_EXPR ? ior_optab : xor_optab;
  if (bitpos + bitsize != GET_MODE_BITSIZE (GET_MODE (str_rtx)))
{
  rtx mask = GEN_INT (((unsigned HOST_WIDE_INT) 1 << bitsize)   
  < Suspected bug
  - 1);
  value = expand_and (GET_MODE (str_rtx), value, mask,
  NULL_RTX);
}
  value = expand_shift (LSHIFT_EXPR, GET_MODE (str_rtx), value,
build_int_cst (NULL_TREE, bitpos),
NULL_RTX, 1);
  result = expand_binop (GET_MODE (str_rtx), binop, str_rtx,
 value, str_rtx, 1, OPTAB_WIDEN);


Here bitpos = 0 and bitsize = 52.  HOST_WIDE_INT for our port is 32 bits,
though the 64-bit long long type is supported.  The marked statement produces a
mask of 0xfffff, which causes the upper 32 bits to be removed later.  Is this a
potential bug, or did I miss something?
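
A stand-alone illustration of the suspected problem (not GCC code; it assumes a
32-bit unsigned HOST_WIDE_INT, and the exact result of the undefined shift is
only what x86 happens to do, namely masking the count to 5 bits):

#include <stdio.h>

int main (void)
{
  unsigned int one = 1;     /* stands in for (unsigned HOST_WIDE_INT) 1 */
  int bitsize = 52;         /* runtime value, so the compiler cannot fold it */
  /* Undefined in C: shift count exceeds the width of the type.  On x86 the
     count is taken modulo 32, so the result is (1 << 20) - 1 = 0xfffff
     rather than a 52-bit all-ones mask -- matching the truncated constant
     seen in insn 11 above.  */
  unsigned int mask = (one << bitsize) - 1;
  printf ("mask = %#x\n", mask);
  return 0;
}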

I also tried an older version (from more than two weeks ago). There this function
is not called at all, so correct code is produced.


Cheers,
Bingfeng 

Broadcom UK


[IRA] New register allocator question

2009-01-02 Thread Bingfeng Mei
Hello,
I recently ported our GCC to the new IRA by following mainline development.  The
only interface I added is IRA_COVER_CLASSES. Our architecture has a predicate
register file. When a predicate register has to be spilled, the new IRA produces
inferior code compared to the old register allocator.  The old allocator first
tries to spill to the general register file, which is far cheaper on our
architecture than spilling to memory. The new IRA always spills the predicate
register directly to memory.
 
#define IRA_COVER_CLASSES   \
{  \
  GR_REGS, PR_REGS, M_REGS, BXBC_REGS, LIM_REG_CLASSES  \
}
 
Apart from the above macro, what other interfaces/parameters can I tune to change
this behaviour in the new IRA?  Thanks in advance.
 
Happy New Year,
Bingfeng Mei
 
Broadcom UK.



RE: [IRA] New register allocator question

2009-01-02 Thread Bingfeng Mei
I found that if I define a new cover class that includes both GR_REGS and PR_REGS,
the issue is solved. The new IRA spills the predicate register to a general
register first instead of to memory.  Is this the right approach?

 #define IRA_COVER_CLASSES   \
 {  \
   GRPR_REGS, M_REGS, BXBC_REGS, LIM_REG_CLASSES  \
 }

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Bingfeng Mei
> Sent: 02 January 2009 11:50
> To: gcc@gcc.gnu.org
> Cc: Vladimir Makarov
> Subject: [IRA] New register allocator question
> 
> Hello,
> I recently ported our GCC to new IRA by following mainline 
> development.  The only interface I added is 
> IRA_COVER_CLASSES. Our architecture has predicate register 
> file. When predicate register has to be spilled, the new IRA 
> produces inferior code to the old register allocator.  The 
> old allocator first tries to spill to general register file, 
> which is far cheaper on our architecture than spilling to 
> memory. The IRA always spills the predicate register to 
> memory directly.
>  
> #define IRA_COVER_CLASSES   \
> {  \
>   GR_REGS, PR_REGS, M_REGS, BXBC_REGS, LIM_REG_CLASSES  \
> }
>  
> Apart from above macro, what other interfaces/parameters I 
> can tune to change this behaviour in new IRA?  Thanks in advance.
>  
> Happy New Year,
> Bingfeng Mei
>  
> Broadcom UK.
> 
> 
> 


Document error on TARGET_ASM_NAMED_SECTION ?

2009-01-22 Thread Bingfeng Mei
Hello,
According to the current GCC internals manual:
http://gcc.gnu.org/onlinedocs/gccint/File-Framework.html#index-TARGET_005fASM_005fNAMED_005fSECTION-4335

- Target Hook: void TARGET_ASM_NAMED_SECTION (const char *name, unsigned int 
flags, unsigned int align)

Output assembly directives to switch to section name. The section should 
have attributes as specified by flags, which is a bit mask of the SECTION_* 
flags defined in output.h. If align is nonzero, it contains an alignment in 
bytes to be used for the section, otherwise some target default should be used. 
Only targets that must specify an alignment within the section directive need 
pay attention to align - we will still use ASM_OUTPUT_ALIGN. 


But actually the third argument should be "tree decl" instead of "unsigned
int align". The following is the default hook:

default_elf_asm_named_section (const char *name, unsigned int flags,
   tree decl ATTRIBUTE_UNUSED)

Is this an error, or am I missing something?

Cheers,
Bingfeng Mei


Solve transitive closure issue in modulo scheduling

2009-01-30 Thread Bingfeng Mei
Hello,
I am trying to make modulo scheduling work more efficiently for our VLIW target. I
found that one serious issue preventing the current SMS algorithm from achieving
high IPC is the so-called "transitive closure" problem, where the scheduling
window is calculated using only direct predecessors and successors. Because SMS is
not an iterative algorithm, this may cause failures in finding a valid schedule.
Without splitting rows, some simple loops just cannot be scheduled no matter
how big the II is. With splitting rows, a schedule can be found, but only at a
bigger II. The GCC wiki (http://gcc.gnu.org/wiki/SwingModuloScheduling) lists this
as a TODO. Is there any work going on about this issue (the last wiki update
was one year ago)? If no one is working on it, I plan to do it. My idea is to
use the MinDist algorithm described in B. Rau's classic paper "Iterative Modulo
Scheduling" (http://www.hpl.hp.com/techreports/94/HPL-94-115.html). The same
algorithm can also be used to compute a better RecMII. The biggest concern is the
complexity of computing the MinDist matrix, which is O(N^3), where N is the number
of nodes in the loop. I remember the GCC coding guide says somewhere "never write
a quadratic algorithm" :-) Is this an absolute requirement? If so, I will keep it
as our target-specific code (we are less concerned about compilation time).
Otherwise, I will try to make it more generic to see if it can make it into
mainline in 4.5. Any comments?
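
For concreteness, here is a rough sketch of the MinDist computation I have in mind
(my reading of Rau's paper, not existing GCC code; the fixed-size arrays and the
NEG_INF sentinel are simplifications for illustration).  The edge weight is
delay - II * distance, and a Floyd-Warshall-style all-pairs longest path gives the
matrix; a positive diagonal entry means the candidate II violates a recurrence.
The scheduling window of a node can then be bounded using MinDist against every
already-scheduled node, not just direct neighbours.

#include <stdbool.h>

#define MAX_NODES 64
#define NEG_INF   (-1000000)   /* "no path"; assumed more negative than any real path */

static bool
compute_mindist (int n, int ii,
                 int delay[MAX_NODES][MAX_NODES],
                 int distance[MAX_NODES][MAX_NODES],
                 bool has_edge[MAX_NODES][MAX_NODES],
                 int mindist[MAX_NODES][MAX_NODES])
{
  int i, j, k;

  /* Initialise with single-edge weights.  */
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      mindist[i][j] = has_edge[i][j]
                      ? delay[i][j] - ii * distance[i][j]
                      : NEG_INF;

  /* All-pairs longest path: O(N^3).  */
  for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
        if (mindist[i][k] != NEG_INF
            && mindist[k][j] != NEG_INF
            && mindist[i][k] + mindist[k][j] > mindist[i][j])
          mindist[i][j] = mindist[i][k] + mindist[k][j];

  /* A positive-weight cycle means this II cannot satisfy the recurrence.  */
  for (i = 0; i < n; i++)
    if (mindist[i][i] > 0)
      return false;

  return true;
}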

Cheers,
Bingfeng Mei

Broadcom UK 




Difference between vec_shl_<mode> and ashl<mode>3

2009-02-10 Thread Bingfeng Mei
Hello,
Could anyone explain to me what the difference is between the vec_shl_<mode> and
ashl<mode>3 patterns? It seems to me that both shift a vector operand 1
by a scalar operand 2.  I tried to understand some targets' implementations,
e.g., ia64 as follows, and cannot grasp their difference. Does the "whole
vector shift" of vec_shl mean treating the vector as one long scalar?  Thanks in
advance.

(define_insn "lshr3"
  [(set (match_operand:VECINT24 0 "gr_register_operand" "=r")
(lshiftrt:VECINT24
  (match_operand:VECINT24 1 "gr_register_operand" "r")
  (match_operand:DI 2 "gr_reg_or_5bit_operand" "rn")))]
  ""
  "pshr.u %0 = %1, %2"
  [(set_attr "itanium_class" "mmshf")])

(define_expand "vec_shr_"
  [(set (match_operand:VECINT 0 "gr_register_operand" "")
(lshiftrt:DI (match_operand:VECINT 1 "gr_register_operand" "")
     (match_operand:DI 2 "gr_reg_or_6bit_operand" "")))]
  ""
{
  operands[0] = gen_lowpart (DImode, operands[0]);
  operands[1] = gen_lowpart (DImode, operands[1]);
})

Cheers,
Bingfeng Mei
Broadcom UK


RE: Difference between vec_shl_<mode> and ashl<mode>3

2009-02-10 Thread Bingfeng Mei
Ian,
Thanks for the prompt reply.  Just out of curiosity: isn't this naming convention
for shift instructions inconsistent with other patterns? For example, we can
define add<mode>3 and GCC will automatically use it for vectorization or
in a plus expression of two vector types. Why do shifts need special names?

Bingfeng

> -Original Message-
> From: Ian Lance Taylor [mailto:i...@google.com] 
> Sent: 10 February 2009 14:31
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Difference between vec_shl_ and 
> ashl3
> 
> "Bingfeng Mei"  writes:
> 
> > Could anyone explain to me what is difference between
> > vec_shl_ and ashl3 patterns? It 
> seems to me
> > that both shift a vector operand 1 with scalar operand 2.
> 
> The difference is that with a vector mode gcc will look for 
> the standard
> name vec_shl_MODE, and with a non-vector mode gcc will look for the
> standard name lshlMODE or ashlMODE.
> 
> > I tried to understand some targets' implemenation, e.g., ia64 as
> > follows, and cannot grasp their difference.
> 
> The name which matters is vec_shr_.  The fact that the 
> ia64 names
> the real insn mode3 does not imply that that insn name 
> is actually
> used by anything.  vec_shr_ is a define_expand which is expands
> into a pattern which is recognized by the mode3 insn.  
> The name of
> the mode3 insn could change or be removed and everything would
> work.
> 
> Ian
> 
> 


Native support for vector shift

2009-02-24 Thread Bingfeng Mei
Hello,
For the targets that support vectors, we can write the following code:

typedef short  V4H  __attribute__ ((vector_size (8)));

V4H tst(V4H a, V4H b){
  return a + b;
}

Other operators such as -, *, |, &, ^ etc are also supported.  However, vector 
shift
is not supported by frontend, including both scalar and vector second operands. 

V4H tst(V4H a, V4H b){
  return a << 3;
}

V4H tst(V4H a, V4H b){
  return a << b;
}

Currently, we have to use intrinsics to support such shifts. Isn't the syntax of
vector shift intuitive enough to be supported natively? Someone may argue it
breaks the C language, but vectors are a GCC extension anyway, and support for
vector add/sub/etc. already extends C syntax. Any thoughts? Sorry if this issue
has been raised in the past.

Greetings,
Bingfeng Mei

Broadcom UK


RE: Instrument gcc

2009-02-24 Thread Bingfeng Mei
Did you compile with -O0?  A function may be inlined and a symbol may be 
optimized away with -O1 and above.

Bingfeng 

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Vincent R.
> Sent: 24 February 2009 15:38
> To: gcc@gcc.gnu.org
> Subject: Instrument gcc
> 
> Hi,
> 
> even if I am simple mortal I would like to understand or at 
> least follow
> what is going on with gcc.
> Generally when I run gdb and try to breakpoint inside a 
> function I get a
> undefined symbol or something like that.
> I suppose this is because gcc is not a simple static exe but 
> depends on
> other binaries (g++, cpp, ...).
> So my question is how can I debug step by step gcc ?
> Let's say for instance I want to breakpoint the function
> init_exception_processing located in gcc/gcc/cp
> and related to c++ exceptions
> 
> This GDB was configured as "i486-linux-gnu"...
> (gdb) b init_exception_processing
> Function "init_exception_processing" not defined.
> Make breakpoint pending on future shared library load? (y or [n])
> 
> What is the magical trick to be able to follow what is going on.
> 
> Thanks
> 
> 
> 


RE: Native support for vector shift

2009-02-24 Thread Bingfeng Mei
Yes, at least the first case (scalar operand 2) is supported by valarray.
http://www.reading.ac.uk/SerDepts/su/Topic/Pgram/PgSWC+FP01/Workshop/stdlib/stdref/val_6244.htm#Non-member%20Binary%20Operators

Additionally, if we follow the valarray guideline, GCC should also support code
like:

V4H a, c;
short b;

c = a + b; 

Instead of using
c = a + (V4H){b, b, b, b};

This can be useful.

> -Original Message-
> From: Joseph Myers [mailto:jos...@codesourcery.com] 
> Sent: 24 February 2009 18:52
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Native support for vector shift
> 
> On Tue, 24 Feb 2009, Bingfeng Mei wrote:
> 
> > Currently, we have to use intrinsics to support such shift. 
> Isn't syntax 
> > of vector shift intuitive enough to be supported natively? 
> Someone may 
> > argue it breaks the C language. But vector is a GCC 
> extension anyway. 
> > Support for vector add/sub/etc already break C syntax. Any thought? 
> 
> The general guideline we've followed for C vector extensions 
> is "like C++ 
> valarray".  Does it support this?  (This isn't an absolute 
> rule in either 
> direction, but a useful guide and a set of semantics that have been 
> well-tested in practice.)
> 
> -- 
> Joseph S. Myers
> jos...@codesourcery.com
> 
> 


RE: Native support for vector shift

2009-02-24 Thread Bingfeng Mei
Yes, I am aware of both types of vector shift. Our VLIW target
actually supports both, and I have implemented all the related patterns
in our port. But it would still be nice to allow the programmer to
use vector shifts explicitly, preferably both types.

Bingfeng

> -Original Message-
> From: Michael Meissner [mailto:meiss...@linux.vnet.ibm.com] 
> Sent: 24 February 2009 21:07
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Native support for vector shift
> 
> On Tue, Feb 24, 2009 at 06:15:37AM -0800, Bingfeng Mei wrote:
> > Hello,
> > For the targets that support vectors, we can write the 
> following code:
> > 
> > typedef short  V4H  __attribute__ ((vector_size (8)));
> > 
> > V4H tst(V4H a, V4H b){
> >   return a + b;
> > }
> > 
> > Other operators such as -, *, |, &, ^ etc are also 
> supported.  However, vector shift
> > is not supported by frontend, including both scalar and 
> vector second operands. 
> > 
> > V4H tst(V4H a, V4H b){
> >   return a << 3;
> > }
> > 
> > V4H tst(V4H a, V4H b){
> >   return a << b;
> > }
> > 
> > Currently, we have to use intrinsics to support such shift. 
> Isn't syntax of vector
> > shift intuitive enough to be supported natively? Someone 
> may argue it breaks the
> > C language. But vector is a GCC extension anyway. Support 
> for vector add/sub/etc
> > already break C syntax. Any thought? Sorry if this issue 
> had been raised in past.
> 
> Note, internally there are two different types of vector 
> shift.  Some machines
> support a vector shift by a scalar, some machines support a 
> vector shift by a
> vector.  One future machine (x86_64 with -msse5) can support 
> both types of
> vector shifts.
> 
> The auto vectorizer now can deal with both types:
> 
>   for (i = 0; i < n; i++)
> a[i] = b[i] << c
> 
> will generate a vector shift by a scalar on machines with 
> that support, and
> splat the scalar into a vector for the second set of machines.
> 
> If the machine only has vector shift by a scalar, the auto 
> vectorizer will not
> generate a vector shift for:
> 
>   for (i = 0; i < n; i++)
> a[i] = b[i] << c[i]
> 
> Internally, the compiler uses the standard shift names for 
> vector shift by a
> scalar (i.e. ashl, ashr, lshl), and a v 
> prefix for the vector
> by vector shifts (i.e. vashl, vashr, vlshl).
> 
> The rotate patterns are also similar.
> 
> -- 
> Michael Meissner, IBM
> 4 Technology Place Drive, MS 2203A, Westford, MA, 01886, USA
> meiss...@linux.vnet.ibm.com
> 
> 


Why are these two functions compiled differently?

2009-03-03 Thread Bingfeng Mei
Hello,
I came across the following example and its .final_cleanup dump. To me, both
functions should produce the same code, but tst1 actually requires two
extra sign-extend instructions compared with tst2. Is this a C semantics thing,
or does GCC compile the first case over-conservatively?

Cheers,
Bingfeng Mei
Broadcom UK

 
#define A  255

int tst1(short a, short b){
  if(a > (b - A))
return 0;
  else
return 1;  

}


int tst2(short a, short b){
  short c = b - A;
  if(a > c)
return 0;
  else
return 1;  

}


.final_cleanup
;; Function tst1 (tst1)

tst1 (short int a, short int b)
{
:
  return (int) b + -254 > (int) a;

}



;; Function tst2 (tst2)

tst2 (short int a, short int b)
{
:
  return (short int) ((short unsigned int) b + 65281) >= a;

}





RE: Why are these two functions compiled differently?

2009-03-03 Thread Bingfeng Mei
Should I file a bug report? If it is not a C semantics thing, GCC certainly 
produces unnecessarily big code. 

.file   "tst.c"
.text
.p2align 4,,15
.globl tst1
.type   tst1, @function
tst1:
.LFB0:
.cfi_startproc
movswl  %si,%esi
movswl  %di,%edi
xorl%eax, %eax
subl$254, %esi
cmpl%edi, %esi
setg%al
ret
.cfi_endproc
.LFE0:
.size   tst1, .-tst1
.p2align 4,,15
.globl tst2
.type   tst2, @function
tst2:
.LFB1:
.cfi_startproc
subw$255, %si
xorl%eax, %eax
cmpw%di, %si
setge   %al
ret
.cfi_endproc
.LFE1:
.size   tst2, .-tst2
.ident  "GCC: (GNU) 4.4.0 20090218 (experimental) [trunk revision 
143368]"
.section.note.GNU-stack,"",@progbits
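
On closer inspection (a small check, assuming a 16-bit short, a 32-bit int, and
GCC's documented modulo behaviour for out-of-range signed conversions), the two
functions are not strictly equivalent, because tst2 truncates b - 255 back to
short before the comparison:

#include <stdio.h>

#define A 255

int tst1(short a, short b) { return !(a > (b - A)); }

int tst2(short a, short b) { short c = b - A; return !(a > c); }

int main(void)
{
  short a = -32700, b = -32768;
  /* In tst1, b - 255 = -33023 as an int; in tst2 it wraps to 32513 as a short. */
  printf("tst1 = %d, tst2 = %d\n", tst1(a, b), tst2(a, b));  /* prints 0 and 1 */
  return 0;
}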

> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com] 
> Sent: 03 March 2009 15:16
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org; John Redford
> Subject: Re: Why are these two functions compiled differently?
> 
> On Tue, Mar 3, 2009 at 4:06 PM, Bingfeng Mei 
>  wrote:
> > Hello,
> > I came across the following example and their 
> .final_cleanup files. To me, both functions should produce 
> the same code. But tst1 function actually requires two extra 
> sign_extend instructions compared with tst2. Is this a C 
> semantics thing, or GCC mis-compile (over-conservatively) in 
> the first case.
> 
> Both transformations are already done by the fronted (or fold), likely
> shorten_compare is quilty for tst1 and fold_unary for tst2 (which
> folds (short)((int)b - (int)A).
> 
> Richard.
> 
> > Cheers,
> > Bingfeng Mei
> > Broadcom UK
> >
> >
> > #define A  255
> >
> > int tst1(short a, short b){
> >  if(a > (b - A))
> >    return 0;
> >  else
> >    return 1;
> >
> > }
> >
> >
> > int tst2(short a, short b){
> >  short c = b - A;
> >  if(a > c)
> >    return 0;
> >  else
> >    return 1;
> >
> > }
> >
> >
> > .final_cleanup
> > ;; Function tst1 (tst1)
> >
> > tst1 (short int a, short int b)
> > {
> > :
> >  return (int) b + -254 > (int) a;
> >
> > }
> >
> >
> >
> > ;; Function tst2 (tst2)
> >
> > tst2 (short int a, short int b)
> > {
> > :
> >  return (short int) ((short unsigned int) b + 65281) >= a;
> >
> > }
> >
> >
> >
> >
> 
> 


Is const_int zero extended or sign-extended?

2009-03-12 Thread Bingfeng Mei
Hello,
I am confused by one very basic concept :).  In the following rtx expression,
if the const_int is 32-bit and DImode is 64-bit, will the const_int be
sign-extended or zero-extended? In other words, is the content of reg:DI 95
0xfffffffffffffff9 or 0x00000000fffffff9 after this instruction?

(set:DI (reg:DI 95)
(const_int -7 [0xfffffff9]))
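
(For concreteness, the two candidate interpretations in plain C, assuming a
32-bit int and a 64-bit long long:)

#include <stdio.h>

int main (void)
{
  int v = -7;                                    /* bit pattern 0xfffffff9 */
  long long sext = (long long) v;                /* sign-extended to 64 bits */
  unsigned long long zext = (unsigned int) v;    /* zero-extended to 64 bits */
  printf ("sign-extended: %#llx\n", (unsigned long long) sext);
  printf ("zero-extended: %#llx\n", zext);
  return 0;
}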

Thanks,
Bingfeng Mei



Understand BLKmode and returning structure in register.

2009-03-13 Thread Bingfeng Mei
Hello,
I came across an issue regarding BLKmode and returning a structure in a register.
For the following code, I try to return the structure in a register instead of
memory.
 
extern void abort();
typedef struct {
  short x;
  short y;
} COMPLEX;
 
COMPLEX foo (void) __attribute__ ((noinline));
COMPLEX foo (void)
{
  COMPLEX  x;  
 
  x.x = -7;
  x.y = -7;
 
  return x;
}
 

int main(){
  COMPLEX x = foo();
  if(x.y != -7)
abort();
}

 
In the foo function, compute_record_mode sets the mode for struct
COMPLEX to BLKmode, partly because STRICT_ALIGNMENT is 1 on my target. In the
TARGET_RETURN_IN_MEMORY hook, I return 1 for BLKmode types and 0 otherwise for
small sizes (<8), like MIPS. Thus, this structure is still returned through
memory, which is not very efficient. More importantly, the ABI is NOT FIXED under
such a scheme: if an assembly programmer writes a function returning a
structure, how does he know whether the structure will be treated as BLKmode or
not? So he doesn't know whether to pass the result through memory or a
register. Do I understand correctly?

On the other hand, if I return 0 based only on the struct type's size,
regardless of BLKmode, GCC produces very inefficient code. For
example, the stack setup code in foo is still generated even though it is totally
unnecessary.
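
For reference, the size-only variant I tried looks roughly like this (a sketch of
hypothetical target code, using GCC-internal types, not an exact copy of our
hook):

/* Decide purely from the type's size, ignoring BLKmode.
   int_size_in_bytes returns -1 for variable-sized types.  */
static bool
my_return_in_memory (const_tree type, const_tree fntype ATTRIBUTE_UNUSED)
{
  HOST_WIDE_INT size = int_size_in_bytes (type);
  return size == -1 || size > 8;
}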

Only when I set STRICT_ALIGNMENT to 0 can the structure be passed through a
register in an efficient way. Unfortunately, our machine is strictly aligned,
so I cannot really do that.

Any suggestion? 

Thanks,
Bingfeng Mei
Broadcom UK 



RE: Understand BLKmode and returning structure in register.

2009-03-13 Thread Bingfeng Mei
I found that compiling for MIPS with -mabi=n32 produces exactly this kind of
inefficient code. With -mabi=n32, mips_return_in_memory returns 0 if the size is
small, regardless of BLKmode.

.type   foo, @function
foo:
.frame  $sp,16,$31  # vars= 16, regs= 0/0, args= 0, gp= 0

addiu   $sp,$sp,-16
li  $2,-7   # 0xfff9
sh  $2,0($sp)
sh  $2,2($sp)
ld  $3,0($sp)
addiu   $sp,$sp,16
dsrl$4,$3,32
andi$4,$4,0x
dsrl$3,$3,48
dsll$4,$4,32
dsll$2,$3,48
j   $31
or  $2,$2,$4

.entmain
.type   main, @function
main:

addiu   $sp,$sp,-48
sd  $31,40($sp)
jal foo
nop

dsra$3,$2,32
dsrl$2,$2,48
sh  $3,18($sp)
sh  $2,16($sp)
lw  $2,16($sp)
sll $3,$2,16
sw  $2,0($sp)
sra $3,$3,16
li  $2,-7   # 0xfff9
bne $3,$2,$L8
ld  $31,40($sp)

j   $31
addiu   $sp,$sp,48

$L8:
jal abort
nop


With the old ABI, the produced code is much simpler, but the structure is returned
through memory; mips_return_in_memory returns 1 because the structure type is
BLKmode.

foo:
li  $3,-7   # 0xfff9
move$2,$4
sh  $3,0($4)
j   $31
sh  $3,2($4)

.entmain
.type   main, @function
main:
.frame  $sp,32,$31  # vars= 8, regs= 1/0, args= 16, gp= 0

addiu   $sp,$sp,-32
sw  $31,28($sp)
jal foo
addiu   $4,$sp,16

lh  $3,18($sp)
li  $2,-7   # 0xfff9
bne $3,$2,$L8
nop

lw  $31,28($sp)
nop
j   $31
addiu   $sp,$sp,32

$L8:
jal abort
nop

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On 
> Behalf Of Bingfeng Mei
> Sent: 13 March 2009 16:35
> To: gcc@gcc.gnu.org
> Cc: Adrian Ashley
> Subject: Understand BLKmode and returning structure in register.
> 
> Hello,
> I came across an issue regarding BLKmode and returning 
> structure in register.  For following code,  I try to return 
> the structure in register instead of memory. 
>  
> extern void abort();
> typedef struct {
>   short x;
>   short y;
> } COMPLEX;
>  
> COMPLEX foo (void) __attribute__ ((noinline));
> COMPLEX foo (void)
> {
>   COMPLEX  x;  
>  
>   x.x = -7;
>   x.y = -7;
>  
>   return x;
> }
>  
> 
> int main(){
>   COMPLEX x = foo();
>   if(x.y != -7)
> abort();
> }
> 
>  
> In foo function, compute_record_mode function will set the 
> mode for struct COMPLEX as BLKmode partly because 
> STRICT_ALIGNMENT is 1 on my target. In 
> TARGET_RETURN_IN_MEMORY hook, I return 1 for BLKmode type and 
> 0 otherwise for small size (<8) (like MIPS). Thus, this 
> structure is still returned through memory, which is not very 
> efficient. More importantly, ABI is NOT FIXED under such 
> situation. If an assembly code programmer writes a function 
> returning a structure. How does he know the structure will be 
> treated as BLKmode or otherwise? So he doesn't know whether 
> to pass result through memory or register. Do I understand correctly?
> 
> On the other hand, if I return 0 only according to struct 
> type's size regardless BLKmode or not, GCC will produces very 
> inefficient code. For example, stack setup code in foo is 
> still generated even it is totally unnecessary.
> 
> Only when I set STRICT_ALIGNMENT to 0, the structure can be 
> passed through register in an efficient way. Unfortunately, 
> our machine is strictly aligned and I cannot really do that. 
> 
> Any suggestion? 
> 
> Thanks,
> Bingfeng Mei
> Broadcom UK 
> 
> 
> 


Typo or intended?

2009-03-16 Thread Bingfeng Mei
Hello,
I just updated our port to include the last 2-3 weeks of GCC development. I
noticed a large number of test failures at -O1 in tests that use a user-defined
data type (based on a special register file of our processor). All variables of
such a type are now spilled to memory, which we don't allow at -O1 because it is
too expensive. After investigation, I found that the following new code
causes the trouble. I don't quite understand the purpose of the new code, but
I don't see what is special about -O1 in terms of register allocation in
comparison with higher optimization levels. If I change it to (optimize < 1),
everything is fine as before. I am starting to wonder whether (optimize <= 1) is a
typo or intended. Thanks in advance.

Cheers,
Bingfeng Mei
Broadcom UK

  if ((! flag_caller_saves && ALLOCNO_CALLS_CROSSED_NUM (a) != 0)
  /* For debugging purposes don't put user defined variables in
 callee-clobbered registers.  */
  || (optimize <= 1   <-  why 
include -O1? 
  && (attrs = REG_ATTRS (regno_reg_rtx [ALLOCNO_REGNO (a)])) != NULL
  && (decl = attrs->decl) != NULL
  && VAR_OR_FUNCTION_DECL_P (decl)
  && ! DECL_ARTIFICIAL (decl)))
{
  IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
call_used_reg_set);
  IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
call_used_reg_set);
}
  else if (ALLOCNO_CALLS_CROSSED_NUM (a) != 0)
{
  IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
no_caller_save_reg_set);
  IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
temp_hard_reg_set);
  IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
no_caller_save_reg_set);
  IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
temp_hard_reg_set);
}


RE: Understand BLKmode and returning structure in register.

2009-03-17 Thread Bingfeng Mei
Thanks for the reply. There should be more opportunities for strictly aligned
machines. In my example, the structure is a local variable allocated on the stack.
I don't see why it is marked as BLKmode. The compiler has full freedom to make it
aligned and use DImode instead.

Bingfeng

> -Original Message-
> From: Richard Sandiford [mailto:rdsandif...@googlemail.com] 
> Sent: 16 March 2009 22:14
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org; Adrian Ashley
> Subject: Re: Understand BLKmode and returning structure in register.
> 
> "Bingfeng Mei"  writes:
> > In foo function, compute_record_mode function will set the mode for
> > struct COMPLEX as BLKmode partly because STRICT_ALIGNMENT is 1 on my
> > target. In TARGET_RETURN_IN_MEMORY hook, I return 1 for BLKmode type
> > and 0 otherwise for small size (<8) (like MIPS).  Thus, 
> this structure
> > is still returned through memory, which is not very efficient. More
> > importantly, ABI is NOT FIXED under such situation. If an assembly
> > code programmer writes a function returning a structure. How does he
> > know the structure will be treated as BLKmode or otherwise? So he
> > doesn't know whether to pass result through memory or register. Do I
> > understand correctly?
> 
> Yes.  I think having TARGET_RETURN_IN_MEMORY depend on 
> internal details
> like the RTL mode is often seen as an historical mistake.  As you say,
> the ABI should be defined directly by the type instead.
> 
> Unfortunately, once you start using a mode, it's difficult to stop
> using a mode without breaking compatibility.  So one of the 
> main reasons
> the MIPS port still uses the mode is because no-one dares touch it.
> 
> Likewise, it's now difficult to change the mode attached to a 
> structure
> (which could potentially make structure accesses more 
> efficient) without
> accidentally breaking someone's ABI.
> 
> > On the other hand, if I return 0 only according to struct 
> type's size
> > regardless BLKmode or not, GCC will produces very inefficient
> > code. For example, stack setup code in foo is still 
> generated even it
> > is totally unnecessary.
> 
> Yeah, there's definitely room for improvement here.  And as you say,
> it's already a problem for MIPS.  I think it's just one of 
> those things
> that doesn't occur often enough in critical code for anyone to have
> spent time optimising it.
> 
> Richard
> 
> 


RE: Is const_int zero extended or sign-extended?

2009-03-17 Thread Bingfeng Mei
I am tracking a bug and am not sure whether it is a generic GCC bug or whether
my port is at fault.

To access the structure below,

typedef struct {
  long int p_x, p_y;
} Point;
...
p1.p_x = -1;
...

It is expanded to follwing RTL
 ;; p1.p_x = -1;

(insn 19 18 20 
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/2808-1.c:14
 (set (reg:DI 98)
(ior:DI (reg/v:DI 87 [ p1 ])
(const_int -1 [0xffffffff]))) -1 (nil))

(insn 20 19 0 
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/2808-1.c:14
 (set (reg/v:DI 87 [ p1 ])
(reg:DI 98)) -1 (nil))

According to your explanation, (reg:DI 98) will get -1 (0xffffffffffffffff)
after insn 19, which is wrong. Am I right?

Thanks,
Bingfeng

> -Original Message-
> From: Dave Korn [mailto:dave.korn.cyg...@googlemail.com] 
> Sent: 12 March 2009 17:53
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Is const_int zero extended or sign-extended?
> 
> Bingfeng Mei wrote:
> > Hello, I am confused by one very basic concept :).  In the 
> following rtx
> > expression, if const_int is 32-bit and DImode is 64-bit, 
> will the const_int
> > sign-extended or zero-extended. In other word, is the 
> content of reg:DI 95
> > 0x9 or 0x9 after this instruction?
> > 
> > (set:DI (reg:DI 95) (const_int -7 [0xfff9]))
> > 
> > Thanks, Bingfeng Mei
> > 
> 
>   IIUC in the absence of any explicit extension operation, a 
> const_int is
> taken to be whatever size the object it is assigned to, with 
> the value given
> by the signed decimal interpretation.  That RTL sets reg 95 
> to a DImode -7.
> 
>   Is this part of a larger problem?
> 
> cheers,
>   DaveK
> 
> 


RE: Typo or intended?

2009-03-24 Thread Bingfeng Mei
That's fine. It seems that other targets don't have such an issue. Our target is
too specialized, and it is
still a private port. I can just use optimize < 1 here. Thanks,

Bingfeng 

> -Original Message-
> From: Vladimir Makarov [mailto:vmaka...@redhat.com] 
> Sent: 23 March 2009 19:40
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Typo or intended?
> 
> Bingfeng Mei wrote:
> > Hello,
> > I just updated our porting to include last 2-3 weeks of GCC 
> developments. I noticed a large number of test failures at 
> -O1 that use a user-defined data type (based on a special 
> register file of our processor). All variables of such type 
> are now spilled to memory which we don't allow at -O1 because 
> it is too expensive. After investigation, I found that it is 
> the following new code causes the trouble. I don't quite 
> understand the function of the new code, but I don't see 
> what's special for -O1 in terms of register allocation in 
> comparison with higher optimizing levels. If I change it to 
> (optimize < 1), everthing is fine as before. I start to 
> wonder whether (optimize <= 1) is a typo or intended. Thanks 
> in advance.
> >
> >   
> Sorry for the delay with the answer.  I was on vacation last week.
> 
> As Andrew Haley guess, it was intended.  I thought that improving 
> debugging for -O1 is also important (more important than 
> optimization).  
> Although GCC manual says
> 
>  With `-O', the compiler tries to reduce code size and execution
>  time, without performing any optimizations that take a great deal
>  of compilation time.
> 
> it also says
> 
> `-O' also turns on `-fomit-frame-pointer' on machines where doing
>  so does not interfere with debugging.
> 
> Therefore I've decided to do analogous thing for the patch.  
> May be I am 
> wrong.  We could do this only for -O0 if people really want 
> this which I 
> am not sure about.
> > Cheers,
> > Bingfeng Mei
> > Broadcom UK
> >
> >   if ((! flag_caller_saves && ALLOCNO_CALLS_CROSSED_NUM (a) != 0)
> >       /* For debugging purposes don't put user defined variables in
> >          callee-clobbered registers.  */
> >       || (optimize <= 1                      <- why include -O1?
> >           && (attrs = REG_ATTRS (regno_reg_rtx[ALLOCNO_REGNO (a)])) != NULL
> >           && (decl = attrs->decl) != NULL
> >           && VAR_OR_FUNCTION_DECL_P (decl)
> >           && ! DECL_ARTIFICIAL (decl)))
> >     {
> >       IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
> >                         call_used_reg_set);
> >       IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
> >                         call_used_reg_set);
> >     }
> >   else if (ALLOCNO_CALLS_CROSSED_NUM (a) != 0)
> >     {
> >       IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
> >                         no_caller_save_reg_set);
> >       IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
> >                         temp_hard_reg_set);
> >       IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
> >                         no_caller_save_reg_set);
> >       IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
> >                         temp_hard_reg_set);
> >     }
> >   
> 
> 
> 


gcc99 inlining rules

2009-03-31 Thread Bingfeng Mei
Hello, 
I found the following code doesn't compile with gcc 4.4 and -std=c99. Does this 
behaviour conform to the standard? 
 
inline int foo(){
  return 10;
}
 
int main(int argc, char **argv){
  return foo();
}

I googled the C99 inlining rules, quoted below. They don't seem to say that such 
code cannot be compiled. 
 
C99 inline rules

The specification for "inline" is section 6.7.4 of the C99 standard (ISO/IEC 
9899:1999). This isn't freely available, but you can buy a PDF of it from ISO 
relatively cheaply.

* A function where all the declarations (including the definition) mention 
  "inline" and never "extern". There must be a definition in the same 
  translation unit. No stand-alone object code is emitted. You can (must?) 
  have a separate (not inline) definition in another translation unit, and 
  the compiler might choose either that or the inline definition.

  Such functions may not contain modifiable static variables, and may not 
  refer to static variables or functions elsewhere in the source file where 
  they are declared.

* A function where at least one declaration mentions "inline", but where 
  some declaration doesn't mention "inline" or does mention "extern". There 
  must be a definition in the same translation unit. Stand-alone object code 
  is emitted (just like a normal function) and can be called from other 
  translation units in your program.

  The same constraint about statics above applies here, too.

* A function defined "static inline". A local definition may be emitted if 
  required. You can have multiple definitions in your program, in different 
  translation units, and it will still work. This is the same as the GNU C 
  rules.
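
To make those quoted rules concrete, here is a minimal single-file sketch of 
the three cases (my own illustration, not part of the quoted text; foo1, foo2 
and foo3 are made-up names):

/* Case 1: all declarations say "inline", never "extern".
   This is only an inline definition: no stand-alone object code for
   foo1 is emitted here, so an out-of-line call would need an external
   definition in another translation unit.  */
inline int foo1 (void) { return 1; }

/* Case 2: a file-scope declaration with "extern" makes the definition
   below an external definition; object code for foo2 is emitted and it
   can be called from anywhere.  */
extern int foo2 (void);
inline int foo2 (void) { return 2; }

/* Case 3: "static inline".  A local definition may be emitted if
   required; each translation unit can carry its own copy.  */
static inline int foo3 (void) { return 3; }

int main (void)
{
  /* foo1 is deliberately not called: with only an inline definition in
     this file, the call might not link.  */
  return foo2 () + foo3 ();
}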


Cheers,
Bingfeng Mei



RE: gcc99 inlining rules

2009-03-31 Thread Bingfeng Mei
Link error. 

/tmp/ccqpP1D1.o: In function `main':
tst.c:(.text+0x15): undefined reference to `foo'
collect2: ld returned 1 exit status


As Joseph said, I found the original text in the C99 standard, section 6.7.4.

"
EXAMPLE The declaration of an inline function with external linkage can result 
in either an external
definition, or a definition available for use only within the translation unit. 
A file scope declaration with
extern creates an external definition. The following example shows an entire 
translation unit.
inline double fahr(double t)
{
return (9.0 * t) / 5.0 + 32.0;
}
inline double cels(double t)
{
return (5.0 * (t - 32.0)) / 9.0;
}
extern double fahr(double); // creates an external definition
double convert(int is_fahr, double temp)
{
/* A translator may perform inline substitutions */
return is_fahr ? cels(temp) : fahr(temp);
}
8 Note that the definition of fahr is an external definition because fahr is 
also declared with extern, but
the definition of cels is an inline definition. Because cels has external 
linkage and is referenced, an
external definition has to appear in another translation unit (see 6.9); the 
inline definition and the external
definition are distinct and either may be used for the call. "

I understand now that the GCC implementation conforms to C99, but I don't see the 
rationale behind it :-). Anyway, this is not a gcc dev question any more.
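
For completeness, here is a minimal sketch of how the original snippet can be 
made to link under -std=c99, following the fahr pattern above (my own example, 
not taken from the standard or the thread):

inline int foo (void)
{
  return 10;
}

/* As with fahr above: a file-scope declaration with extern turns the
   inline definition of foo into an external definition, so object code
   is emitted and the call from main links.  Declaring foo as
   "static inline" instead would work too.  */
extern int foo (void);

int main (int argc, char **argv)
{
  return foo ();
}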

Cheers,
Bingfeng
> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com] 
> Sent: 31 March 2009 15:32
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: gcc99 inlining rules
> 
> On Tue, Mar 31, 2009 at 4:24 PM, Bingfeng Mei 
>  wrote:
> > Hello,
> > I found the following code doesn't compile with gcc 4.4 and 
> > -std=c99. Does this behaviour conform to the standard?
> >
> > inline int foo(){
> >  return 10;
> > }
> >
> > int main(int argc, char **argv){
> >  return foo();
> > }
> 
> It works for me.  What is your error?
> 
> Richard.
> 
> 

