Re: [PATCH v2, rtl-optimization]: Fix PR54457, [x32] Fail to combine 64bit index + constant

2012-09-30 Thread Richard Sandiford
Uros Bizjak  writes:
> On Thu, Sep 27, 2012 at 8:20 PM, Jakub Jelinek  wrote:
>> On Thu, Sep 27, 2012 at 08:04:58PM +0200, Uros Bizjak wrote:
>>> After some off-line discussion with Richard, attached is v2 of the patch.
>>>
>>> 2012-09-27  Uros Bizjak  
>>>
>>> PR rtl-optimization/54457
>>> * simplify-rtx.c (simplify_subreg):
>>>   Simplify (subreg:SI (op:DI (x:DI) (y:DI)) 0)
>>>   to (op:SI (subreg:SI (x:DI) 0) (subreg:SI (y:DI) 0)).
>>
>> Is that a good idea even for WORD_REGISTER_OPERATIONS targets?
>
> I have bootstrapped and regtested [1] the patch on
> alphaev68-pc-linux-gnu, a WORD_REGISTER_OPERATIONS target, and there
> were no additional failures.

Thanks.  Given Jakub's question/concern, I'd really prefer a third
opinion rather than approving it myself, but... OK if no-one objects
within 24hrs.

Richard


Re: RFC: LRA for x86/x86-64 [1/9]

2012-09-30 Thread Richard Sandiford
Hi Vlad,

Vladimir Makarov  writes:
> @@ -2973,11 +2973,11 @@ cleanup_subreg_operands (rtx insn)
>  df_insn_rescan (insn);
>  }
>  
> -/* If X is a SUBREG, replace it with a REG or a MEM,
> -   based on the thing it is a subreg of.  */
> +/* If X is a SUBREG, try to replace it with a REG or a MEM, based on
> +   the thing it is a subreg of.  Do it anyway if FINAL_P.  */
>  
>  rtx
> -alter_subreg (rtx *xp)
> +alter_subreg (rtx *xp, bool final_p)
>  {
>rtx x = *xp;
>rtx y = SUBREG_REG (x);
> @@ -3001,16 +3001,19 @@ alter_subreg (rtx *xp)
>  offset += difference % UNITS_PER_WORD;
>  }
>  
> -  *xp = adjust_address (y, GET_MODE (x), offset);
> +  if (final_p)
> + *xp = adjust_address (y, GET_MODE (x), offset);
> +  else
> + *xp = adjust_address_nv (y, GET_MODE (x), offset);
>  }
>else
>  {
>rtx new_rtx = simplify_subreg (GET_MODE (x), y, GET_MODE (y),
> -  SUBREG_BYTE (x));
> +  SUBREG_BYTE (x));
>  
>if (new_rtx != 0)
>   *xp = new_rtx;
> -  else if (REG_P (y))
> +  else if (final_p && REG_P (y))
>   {
> /* Simplify_subreg can't handle some REG cases, but we have to.  */
> unsigned int regno;

Could you add a bit more commentary to explain the MEM case?
The REG handling obviously matches the comment at the head of the function:
if FINAL_P, we replace a (subreg (reg)) with a (reg) even in cases where
simplify_subreg wouldn't.  If !FINAL_P we leave things be.

But in the MEM case it's more a verify vs. don't verify thing.  I assume
the idea is that LRA wants to see the rtl for invalid addresses and have
an opportunity to make them valid (because LRA works on rtl rather than an
internal representation, like you said elsewhere).  It would be nice to
make that more explicit here.

Thanks,
Richard


[RFC] Make vectorizer to skip loops with small iteration estimate

2012-09-30 Thread Jan Hubicka
Hi,
the point of the following patch is to make the vectorizer not vectorize the
following testcase with profile feedback:

int a[1];
int i=5;
int k=2;
int val;
__attribute__ ((noinline,noclone))
test()
{
  int j;
  for(j=0;j<k;j++)
    a[0]=val;
}

The runtime profitability check used by the vectorizer is:

 SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC   (A)
 where
 SIC = scalar iteration cost, VIC = vector iteration cost,
 VOC = vector outside cost, VF = vectorization factor,
 PL_ITERS = prologue iterations, EP_ITERS = epilogue iterations,
 SOC = scalar outside cost for run time cost model check.

From (A) the vectorizer computes the minimum number of iterations needed
for vectorization to be profitable.

This value is used for both
1) the decision whether the number of iterations is too low (when the max
   number of iterations is known), and
2) the runtime decision whether we want to take the vectorized path
   or the scalar path.

The vectorized loop looks like:
  k.1_10 = k;
  if (k.1_10 > 0)
{
  pretmp_2 = val;
  niters.8_4 = (unsigned int) k.1_10;
  bnd.9_13 = niters.8_4 >> 2;
  ratio_mult_vf.10_1 = bnd.9_13 << 2;
  _18 = niters.8_4 <= 3;
  _19 = ratio_mult_vf.10_1 == 0;
  _20 = _19 | _18;
  if (_20 != 0)
scalar loop
  else
vector prologue
}

 So the unvectorized cost is
 SIC * niters

 The cost of the vectorized path is
 SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
 The cost of the scalar path through the vectorized loop is
 SIC * niters + SOC

   It makes sense to vectorize if
   SIC * niters > SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC   (B)
   That is, in the optimal case where we actually vectorize, the overall
   speed of the vectorized loop, including the runtime check, is better.

   It makes sense to take the vector path if
   SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC   (C)
   because by that point the runtime check (and thus SOC) has been paid
   whether the scalar or the vector path is taken.
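
   As a minimal sketch (a hypothetical helper, not the vectorizer's actual
   code), the two conditions written out in C with names mirroring the
   formulas above:

#include <stdbool.h>

struct vect_costs
{
  int sic;        /* scalar iteration cost  */
  int vic;        /* vector iteration cost  */
  int soc;        /* scalar outside cost    */
  int voc;        /* vector outside cost    */
  int vf;         /* vectorization factor   */
  int pl_iters;   /* prologue iterations    */
  int ep_iters;   /* epilogue iterations    */
};

/* (B): vectorizing pays off overall, runtime check included.  */
static bool
worth_vectorizing (struct vect_costs c, int niters)
{
  int vect_iters = (niters - c.pl_iters - c.ep_iters) / c.vf;
  return c.sic * niters > c.soc + c.vic * vect_iters + c.voc;
}

/* (C): once SOC has been spent on the runtime check, the vector path
   still beats the scalar path.  */
static bool
worth_taking_vector_path (struct vect_costs c, int niters)
{
  int vect_iters = (niters - c.pl_iters - c.ep_iters) / c.vf;
  return c.sic * niters > c.vic * vect_iters + c.voc;
}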

   The attached patch implements formula (C) and uses it to determine the
   decision based on the number-of-iterations estimate (which is usually
   provided by the feedback).

   As a reality check, I tried my testcase.

   9: Cost model analysis:
 Vector inside of loop cost: 1
 Vector prologue cost: 7
 Vector epilogue cost: 2
 Scalar iteration cost: 1
 Scalar outside cost: 6
 Vector outside cost: 9
 prologue iterations: 0
 epilogue iterations: 2
 Calculated minimum iters for profitability: 4

   9:   Profitability threshold = 3

   9:   Profitability estimated iterations threshold = 20

   This is overrated.  In reality the loop starts to be beneficial at about
   4 iterations.  I guess the values are kind of wrong.

   The vector inside-of-loop cost and the scalar iteration cost seem to ignore
   the fact that the loops do contain some control flow that should account
   for at least one extra cycle.

   The vector prologue cost seems a bit overrated for one pack operation.

   Of course this is a very simple benchmark; in reality vectorization can be
   a lot more harmful by complicating more complex control flow.

   So I guess we have two options:
1) go with the new formula and try to make the cost model a bit more realistic.
2) stay with the original formula, which is quite close to reality, but I
   think more by accident.

2) Even when the loop iterates 2 times, it is estimated at 4 iterations by
   estimated_stmt_executions_int with the profile feedback.
   The reason is the loop_ch pass.  Given a rolled loop with an exit
   probability of 30%, it proceeds by duplicating the header with the
   original probabilities.  This makes the loop be executed with 60%
   probability.  Because the loop body counts remain the same (and they
   should), the expected number of iterations increases as the count on the
   entry edge to the header decreases.

   I wonder what to do about this.  Obviously without path profiling
   loop_ch cannot really do a good job.  We can artificially make the
   header more likely to succeed, which matches reality, but that requires
   non-trivial loop profile updating.

   We can also simply record the iteration bound into the loop structure
   and ignore the fact that the profile is not realistic.

   Finally, we can duplicate loop headers before profiling.  I implemented
   that via an early_ch pass executed only with profile generation or
   feedback.  I guess it makes sense to do, even if it breaks the assumption
   that we should do strictly -Os generation on paths where

Index: tree-vect-loop.c
===
--- tree-vect-loop.c(revision 191852)
+++ tree-vect-loop.c(working copy)
@@ -1243,6 +1243,8 @@ vect_analyze_loop_operations (loop_vec_i
   unsigned int th;
   bool only_slp_in_loop = true, ok;
   HOST_WIDE_INT max_niter;
+  HOST_WIDE_INT estimated_niter;
+  int min_profitable_estimate;
 
   if (vect_print_dump_info (REPORT_DETAILS))
 fprintf (vect_dump, "=== vect_analyze_loop_operations ===");
@@ -1436,7 +1438,8 @@ vect_analyze_loop_operations (loop_vec_i
  vector stmts depends on VF.  */
   vect_update_slp_costs_according_to_vf (loop_vinfo);
 
-  min_profita

[v3] update docs w.r.t PR 54577

2012-09-30 Thread Jonathan Wakely
PR libstdc++/54577
* doc/xml/manual/status_cxx2011.xml: N2350 changes are missing from
sequence containers.
* doc/html/*: Regenerate.

Committed to trunk.
commit 56e855e46beb016fcf4f9f293abbb774a9285a46
Author: Jonathan Wakely 
Date:   Sun Sep 30 12:35:19 2012 +0100

PR libstdc++/54577
* doc/xml/manual/status_cxx2011.xml: N2350 changes are missing from
sequence containers.
* doc/html/*: Regenerate.

diff --git a/libstdc++-v3/doc/xml/manual/status_cxx2011.xml 
b/libstdc++-v3/doc/xml/manual/status_cxx2011.xml
index 324d1e2..1e149f0 100644
--- a/libstdc++-v3/doc/xml/manual/status_cxx2011.xml
+++ b/libstdc++-v3/doc/xml/manual/status_cxx2011.xml
@@ -1418,10 +1418,12 @@ particular release.
   
 
 
+  
   23.3.3
   Class template deque
-  Y
-  
+  Partial
+  insert and erase members do not
+ take const_iterator arguments (N2350).
 
 
   23.3.4
@@ -1430,22 +1432,28 @@ particular release.
   
 
 
+  
   23.3.5
   Class template list
-  Y
-  
+  Partial
+  insert and erase members do not
+ take const_iterator arguments (N2350).
 
 
+  
   23.3.6
   Class template vector
-  Y
-  
+  Partial
+  insert and erase members do not
+ take const_iterator arguments (N2350).
 
 
+  
   23.3.7
   Class vector
-  Y
-  
+  Partial
+  insert and erase members do not
+ take const_iterator arguments (N2350).
 
 
   23.4


Re: [PATCH] Fix instability of -fschedule-insn for x86

2012-09-30 Thread Uros Bizjak
On Tue, Sep 18, 2012 at 1:31 PM, Uros Bizjak  wrote:

>> This patch aims to fix all stability issues related to using the first
>> scheduler in gcc for the x86 target (there are several reported issues
>> related to this problem).
>>
>> The main idea of this activity is to give users the possibility to
>> safely turn on the first scheduler for their code. In some cases this
>> could positively affect performance, especially for in-order Atom.
>>
>> Below is a short description of the proposed changes.
>
>> 2012-09-18  Yuri Rumyantsev  
>>
>> * config/i386/i386.c (ix86_dep_by_shift_count_body) : Add
>> check on reload_completed since it can be invoked before
>> register allocation phase in 1st scheduler.
>> (ia32_multipass_dfa_lookahead) : Do not use dfa_lookahead for 1st
>> Scheduler to save compile time.
>> (ix86_sched_reorder) : Do not perform ready list reordering for 1st
>> Scheduler to save compile time.
>> (insn_is_function_arg) : New function. Returns true if lhs of insn is
>> HW function argument register.
>> (add_parameter_dependencies) : New function. Add output dependencies
>> for a chain of adjacent function arguments, but only if there is a move
>> to likely spilled HW registers. Return the first argument if at least one
>> dependence was added, or NULL otherwise.
>> (avoid_func_arg_motion) : New function. Add output or anti dependency
>> from insn to first_arg to restrict code motion.
>> (add_dependee_for_func_arg) : New function. Avoid cross block motion 
>> of
>> function argument through adding dependency from the first non-jump
>> insn in bb.
>> (ix86_dependencies_evaluation_hook) : New function. Hook for schedule1:
>> avoid motion of function arguments passed in likely spilled
>> HW registers.
>> (ix86_adjust_priority) : New function. Hook for schedule1: set 
>> priority
>> of moves from likely spilled HW registers to maximum to schedule them
>> as soon as possible.
>> (ix86_sched_init_global): Do not perform multipass scheduling for 1st
>> Scheduler to save compile time.
>
> I would kindly ask a scheduler expert to review the patch from the
> scheduler functionality POV.

I have received an opinion from Vladimir in an off-line discussion, quoted below:

--quote--
I think, it is ok.

  Switching off first cycle multipass scheduling is ok.  It is mostly
useful when the order of insns issued on the same cycle is important
(mostly VLIW or quasi-VLIW processors).

  Other solutions are necessary to decrease spills and avoid reload
crashes (cannot find a register in a class) when the 1st insn
scheduling is on.  I don't think it fully avoids the possibility of
reload crashes, but it takes into account most of the cases resulting in
the crashes and makes the crash possibility really negligible.
Register-pressure-sensitive insn scheduling decreased the possibility.
This patch will make it negligible.  And LRA will solve all the remaining
cases of the crashes.

  I don't much like the loss of freedom to move argument insns with
likely spilled hard-regs relative to each other, as they are chained in the
original order, but it is debatable because it still decreases the
possibility of spills.

  In overall, the patch is ok for me.
--/quote--

Based on this opinion, the patch is OK for mainline, if there are no
objections from other x86 maintainers in the next couple of days
(48h). However, please watch for possible fallout from the patch,
compile-time ICEs and performance problems. x86 and the scheduler didn't
play well together in the past, but your patch and (in the near
future) LRA seem to fix all these problems.

Thanks,
Uros.


Re: [PATCH] Changes in mode switching

2012-09-30 Thread Uros Bizjak
On Thu, Sep 20, 2012 at 8:35 AM, Uros Bizjak  wrote:
> On Thu, Sep 20, 2012 at 8:06 AM, Vladimir Yakovlev  
> wrote:
>> The compiler with the patch and without post_reload.patch is built and works
>> successfully. Its only failure is the avx-vzeroupper-3 test, because of a
>> post-reload problem.
>
> Ok, can you please elaborate a bit on this failure? Perhaps someone has
> an idea why reload moves unspec_volatile around?

LRA will eventually replace reload in the near future [1]; does LRA
also move the unspec_volatile vzeroupper around?

[1] http://gcc.gnu.org/ml/gcc-patches/2012-09/msg01862.html

Uros.


Re: [Patch,avr]: Ad PR rtl-optimization/52543: Undo the MEM->UNSPEC hack

2012-09-30 Thread Georg-Johann Lay

Denis Chertykov schrieb:

Georg-Johann Lay:

PR52543 made it necessary to represent loads from the non-generic address
spaces as UNSPEC instead of as MEM, to avoid gross code bloat.

http://gcc.gnu.org/PR52543

lower-subreg's cost model is still broken: it assumes that any loads from MEM
are from the generic address space and does not take address spaces into
account.

This patch undoes the changes from SVN r185605

http://gcc.gnu.org/viewcvs?view=revision&revision=185605

and installs a different but less intrusive hack around PR52543:

targetm.mode_dependent_address_p has an address space parameter so that the
backend can pretend all non-generic addresses are mode-dependent.

This keeps lower-subreg.c from splitting the loads, so it is possible to
represent the loads as MEM and there is no longer a need to represent them as
UNSPECs.
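
A minimal sketch (assumed shape only, relying on GCC-internal types; not the
actual avr.c implementation) of how such a hook can pretend that non-generic
addresses are mode-dependent:

/* Claim that every address in a non-generic address space is
   mode-dependent, so that lower-subreg.c will not split wide loads
   from those spaces.  */
static bool
avr_mode_dependent_address_p (const_rtx addr ATTRIBUTE_UNUSED,
                              addr_space_t as)
{
  return as != ADDR_SPACE_GENERIC;
}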

This patch is still not an optimal solution but the code is much closer to a
clean solution now.

Ok for trunk?


You can apply it.

Denis.


I also applied the following change:

http://gcc.gnu.org/viewcvs?view=revision&revision=191825

* config/avr/avr.md (adjust_len): Add lpm.
(reload_in): Use avr_out_lpm for output.  Use "lpm" for
adjust_len.
* config/avr/avr-protos.h (avr_out_lpm): New prototype.
* config/avr/avr.c (avr_out_lpm): Make global.
(adjust_insn_length): Handle ADJUST_LEN_LPM.

The reload_in insns used the wrong output functions.

Notice that this change is just cosmetic, because the secondary
reloads for the non-generic spaces are ignored.  That is: despite
avr_secondary_reload, REG <- MEM input reloads are not mapped to their
secondary reload insn, and the mov insn for that load is used instead.


This leads to a situation where the insn output function is not supplied
with the needed clobber register; thus the avr_find_unused_d_reg
function is needed to work around that.


Sigh. Even more FIXMEs in the avr backend...

Denis, do you know why the secondary reloads requested by 
avr_secondary_reload are ignored?  I see calls to this hook and 
sri->icode is set to the right insn code but ignored afterwards.


The only calls to that hook with the right operands are from ira cost 
computation.


From the internals I don't see why it is skipped and the responsiveness 
in the gcc-help@ list on such topics is zero :-(


A test case is

$ avr-gcc -mmcu=atmega128 -S -std=gnu99 ...

int read (const __flash1 int *p)
{
return *p;
}


Johann


Re: [Patch, Fortran, OOP] PR 54667: gimplification failure with c_f_pointer

2012-09-30 Thread Thomas Koenig

Hi Janus,



Regtested on x86_64-unknown-linux-gnu. Ok for trunk?


This looks all right to me (although I'm not really an expert :-)

OK, and thanks for the patch!

Thomas






[Patch, Fortran, committed] Re-enable some class array checks in the testsuite

2012-09-30 Thread Janus Weil
Hi all,

I have just committed as obvious a patch which re-enables a few class
array checks in the testsuite (removing some FIXMEs):

http://gcc.gnu.org/viewcvs?view=revision&revision=191867

Those tests had been disabled for the 4.6 release, when class arrays
were not supported yet, cf.
http://gcc.gnu.org/ml/fortran/2011-02/msg00096.html.

Cheers,
Janus


Profile housekeeping 4/n (scale_loop_profile cleanup)

2012-09-30 Thread Jan Hubicka
Hi,
when writing scale_loop_profile I forgot about scale_loop_frequencies, which
has already been in the tree for a few years.  The functions are slightly
different.  While scale_loop_frequencies only scales the frequency of each BB
in the loop, scale_loop_profile also takes care of reducing the iteration
count to a known bound.

For this reason I kept them both, and I am not really able to think of better
names for them.  I moved them to the same place.  I also simplified
scale_loop_profile to call scale_loop_frequencies to take care of the final
update.

Honza

* cfgloop.c (scale_loop_profile): Move to...
* cfgloopmanip.c (scale_loop_profile): .. here; use
scale_loop_frequencies.
(loopify): Use RDIV.
Index: cfgloop.c
===
--- cfgloop.c   (revision 191850)
+++ cfgloop.c   (working copy)
@@ -1666,121 +1666,3 @@ loop_exits_from_bb_p (struct loop *loop,
 
   return false;
 }
-
-/* Scale the profile estiamte within loop by SCALE.
-   If ITERATION_BOUND is non-zero, scale even further if loop is predicted
-   to iterate too many times.  */
-void
-scale_loop_profile (struct loop *loop, int scale, int iteration_bound)
-{
-  gcov_type iterations = expected_loop_iterations_unbounded (loop);
-  basic_block *bbs;
-  unsigned int i;
-  edge e;
-  edge_iterator ei;
-
-  if (dump_file && (dump_flags & TDF_DETAILS))
-fprintf (dump_file, ";; Scaling loop %i with scale %f, "
-"bounding iterations to %i from guessed %i\n",
-loop->num, (double)scale / REG_BR_PROB_BASE,
-iteration_bound, (int)iterations);
-
-  /* See if loop is predicted to iterate too many times.  */
-  if (iteration_bound && iterations > 0
-  && RDIV (iterations * scale, REG_BR_PROB_BASE) > iteration_bound)
-{
-  /* Fixing loop profile for different trip count is not trivial; the exit
-probabilities has to be updated to match and frequencies propagated 
down
-to the loop body.
-
-We fully update only the simple case of loop with single exit that is
-either from the latch or BB just before latch and leads from BB with
-simple conditional jump.   This is OK for use in vectorizer.  */
-  e = single_exit (loop);
-  if (e)
-   {
- edge other_e;
- int freq_delta;
- gcov_type count_delta;
-
-  FOR_EACH_EDGE (other_e, ei, e->src->succs)
-   if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE))
-   && e != other_e)
- break;
-
- /* Probability of exit must be 1/iterations.  */
- freq_delta = EDGE_FREQUENCY (e);
- e->probability = REG_BR_PROB_BASE / iteration_bound;
- other_e->probability = inverse_probability (e->probability);
- freq_delta -= EDGE_FREQUENCY (e);
-
- /* Adjust counts accordingly.  */
- count_delta = e->count;
- e->count = apply_probability (e->src->count, e->probability);
- other_e->count = apply_probability (e->src->count, 
other_e->probability);
- count_delta -= e->count;
-
- /* If latch exists, change its frequency and count, since we changed
-probability of exit.  Theoretically we should update everything 
from
-source of exit edge to latch, but for vectorizer this is enough.  
*/
- if (loop->latch
- && loop->latch != e->src)
-   {
- loop->latch->frequency += freq_delta;
- if (loop->latch->frequency < 0)
-   loop->latch->frequency = 0;
- loop->latch->count += count_delta;
- if (loop->latch->count < 0)
-   loop->latch->count = 0;
-   }
-   }
-
-  /* Roughly speaking we want to reduce the loop body profile by the
-the difference of loop iterations.  We however can do better if
-we look at the actual profile, if it is available.  */
-  scale = RDIV (iteration_bound * scale, iterations);
-  if (loop->header->count)
-   {
- gcov_type count_in = 0;
-
- FOR_EACH_EDGE (e, ei, loop->header->preds)
-   if (e->src != loop->latch)
- count_in += e->count;
-
- if (count_in != 0)
-   scale = RDIV (count_in * iteration_bound * REG_BR_PROB_BASE, 
loop->header->count);
-   }
-  else if (loop->header->frequency)
-   {
- int freq_in = 0;
-
- FOR_EACH_EDGE (e, ei, loop->header->preds)
-   if (e->src != loop->latch)
- freq_in += EDGE_FREQUENCY (e);
-
- if (freq_in != 0)
-   scale = RDIV (freq_in * iteration_bound * REG_BR_PROB_BASE, 
loop->header->frequency);
-   }
-  if (!scale)
-   scale = 1;
-}
-
-  if (scale == REG_BR_PROB_BASE)
-return;
-
-  /* Scale the actual probabilities.  */
-  bbs = get_loop_body (loop);
-  for (i = 0; i < loop->num_nodes; i++)
-{
-  basic_block bb = bbs[i];
-
-  bb->count = RDIV (bb->count * scale, REG_BR_PROB_BASE);
-

Re: __gnu_cxx::rope: __uninitialized_fill_n_a error

2012-09-30 Thread Jonathan Wakely
This fixes a lookup failure when using ropes with an allocator
declared outside namespace std, introduced by
http://gcc.gnu.org/ml/libstdc++/2004-07/msg00157.html

* include/ext/ropeimpl.h (__uninitialized_fill_n_a): Fix using
declaration.
* testsuite/ext/rope/5.cc: New.

Tested x86_64-linux, committed to trunk.
commit 293275915de97fc9a627d27a4b8d3143f398486e
Author: Jonathan Wakely 
Date:   Sun Sep 30 15:53:11 2012 +0100

* include/ext/ropeimpl.h (__uninitialized_fill_n_a): Fix using
declaration.
* testsuite/ext/rope/5.cc: New.

diff --git a/libstdc++-v3/include/ext/ropeimpl.h 
b/libstdc++-v3/include/ext/ropeimpl.h
index 3ee0610..5a68c18 100644
--- a/libstdc++-v3/include/ext/ropeimpl.h
+++ b/libstdc++-v3/include/ext/ropeimpl.h
@@ -58,7 +58,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   using std::basic_ostream;
   using std::__throw_length_error;
   using std::_Destroy;
-  using std::uninitialized_fill_n;
+  using std::__uninitialized_fill_n_a;
 
   // Set buf_start, buf_end, and buf_ptr appropriately, filling tmp_buf
   // if necessary.  Assumes _M_path_end[leaf_index] and leaf_pos are correct.
diff --git a/libstdc++-v3/testsuite/ext/rope/5.cc 
b/libstdc++-v3/testsuite/ext/rope/5.cc
new file mode 100644
index 000..73e8294
--- /dev/null
+++ b/libstdc++-v3/testsuite/ext/rope/5.cc
@@ -0,0 +1,26 @@
+// Copyright (C) 2012 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// .
+
+// rope (SGI extension)
+// http://gcc.gnu.org/ml/libstdc++/2012-09/msg00204.html
+
+// { dg-do compile }
+
+#include 
+#include 
+
+__gnu_cxx::rope > r(10, 'a');


Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Richard Guenther
On Sat, Sep 29, 2012 at 10:26 PM, Steven Bosscher  wrote:
> Hi Vlad,
>
> Thanks for the testing and the logs. You must have good hardware, your
> timings are all ~3 times faster than mine :-)
>
> On Sat, Sep 29, 2012 at 3:01 AM, Vladimir Makarov  wrote:
>> --32-bit
>> Reload:
>> 581.85user 29.91system 27:15.18elapsed 37%CPU (0avgtext+0avgdata
>> LRA:
>> 629.67user 24.16system 24:31.08elapsed 44%CPU (0avgtext+0avgdata
>
> This is a ~8% slowdown.
>
>
>> --64-bit:---
>> Reload:
>> 503.26user 36.54system 30:16.62elapsed 29%CPU (0avgtext+0avgdata
>> LRA:
>> 598.70user 30.90system 27:26.92elapsed 38%CPU (0avgtext+0avgdata
>
> This is a ~19% slowdown

I think both measurements run into swap (low CPU utilization); from the LRA
numbers I'd say that LRA uses less memory, but the timings are somewhat
useless with the swapping.

>> Here is the numbers for PR54146 on the same machine with -O1 only for
>> 64-bit (compiler reports error for -m32).
>
> Right, the test case is for 64-bits only, I think it's preprocessed
> code for AMD64.
>
>> Reload:
>> 350.40user 21.59system 17:09.75elapsed 36%CPU (0avgtext+0avgdata
>> LRA:
>> 468.29user 21.35system 15:47.76elapsed 51%CPU (0avgtext+0avgdata
>
> This is a ~34% slowdown.
>
> To put it in another perspective, here are my timings of trunk vs lra
> (both checkouts done today):
>
> trunk:
>  integrated RA   : 181.68 (24%) usr   1.68 (11%) sys 183.43
> (24%) wall  643564 kB (20%) ggc
>  reload  :  11.00 ( 1%) usr   0.18 ( 1%) sys  11.17 (
> 1%) wall   32394 kB ( 1%) ggc
>  TOTAL : 741.6414.76   756.41
>   3216164 kB
>
> lra branch:
>  integrated RA   : 174.65 (16%) usr   1.33 ( 8%) sys 176.33
> (16%) wall  643560 kB (20%) ggc
>  reload  : 399.69 (36%) usr   2.48 (15%) sys 402.69
> (36%) wall   41852 kB ( 1%) ggc
>  TOTAL :1102.0616.05  1120.83
>   3231738 kB
>
> That's a 49% slowdown. The difference is completely accounted for by
> the timing difference between reload and LRA.
> (Timings done on gcc17, which is AMD Opteron(tm) Processor 8354 with
> 15GB ram, so swapping is no issue.)
>
> It looks like the reload timevar is used for LRA. Why not have
> multiple timevars, one per phase of LRA? Sth like the patch below
> would be nice. This gives me the following timings:
>
>  integrated RA   : 189.34 (16%) usr   1.84 (11%) sys 191.18
> (16%) wall  643560 kB (20%) ggc
>  LRA non-specific:  59.82 ( 5%) usr   0.22 ( 1%) sys  60.12 (
> 5%) wall   18202 kB ( 1%) ggc
>  LRA virtuals eliminatenon:  56.79 ( 5%) usr   0.03 ( 0%) sys  56.80 (
> 5%) wall   19223 kB ( 1%) ggc
>  LRA reload inheritance  :   6.41 ( 1%) usr   0.01 ( 0%) sys   6.42 (
> 1%) wall1665 kB ( 0%) ggc
>  LRA create live ranges  : 175.30 (15%) usr   2.14 (13%) sys 177.44
> (15%) wall2761 kB ( 0%) ggc
>  LRA hard reg assignment : 130.85 (11%) usr   0.20 ( 1%) sys 131.17
> (11%) wall   0 kB ( 0%) ggc
>  LRA coalesce pseudo regs:   2.54 ( 0%) usr   0.00 ( 0%) sys   2.55 (
> 0%) wall   0 kB ( 0%) ggc
>  reload  :   6.73 ( 1%) usr   0.20 ( 1%) sys   6.92 (
> 1%) wall   0 kB ( 0%) ggc
>
> so the LRA "slowness" (for lack of a better word) appears to be due to
> scalability problems in all sub-passes.

It would be nice to see if LRA just has a larger constant cost factor
compared to reload or whether it has higher algorithmic complexity.

> The code size changes are impressive, but I think that this kind of
> slowdown should be addressed before making LRA the default for any
> target.

Certainly if it shows higher complexity; not sure about the constant factor
(but for sure improvements are welcome).

I suppose there is the option to revert back to reload by default for
x86_64 as well for 4.8, right?  That is, do both reload and LRA
co-exist for each target or is it a definite decision target by target?

Thanks,
Richard.

> Ciao!
> Steven
>
>
>
>
> Index: lra-assigns.c
> ===
> --- lra-assigns.c   (revision 191858)
> +++ lra-assigns.c   (working copy)
> @@ -1261,6 +1261,8 @@ lra_assign (void)
>bitmap_head insns_to_process;
>bool no_spills_p;
>
> +  timevar_push (TV_LRA_ASSIGN);
> +
>init_lives ();
>sorted_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ());
>sorted_reload_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ());
> @@ -1312,5 +1314,6 @@ lra_assign (void)
>free (sorted_pseudos);
>free (sorted_reload_pseudos);
>finish_lives ();
> +  timevar_pop (TV_LRA_ASSIGN);
>return no_spills_p;
>  }
> Index: lra.c
> ===
> --- lra.c   (revision 191858)
> +++ lra.c   (working copy)
> @@ -2193,6 +2193,7 @@ lra (FILE *f)
>
>lra_dump_file = f;
>
> +  timevar_p

Re: [Patch,avr]: Ad PR rtl-optimization/52543: Undo the MEM->UNSPEC hack

2012-09-30 Thread Denis Chertykov
2012/9/30 Georg-Johann Lay :
> Denis Chertykov schrieb:
>>
>> Georg-Johann Lay:
>>
>>> PR52543 required to represent a load from non-generic address spaces as
>>> UNSPEC
>>> instead of as MEM to avoid a gross code bloat.
>>>
>>> http://gcc.gnu.org/PR52543
>>>
>>> lower-subreg's cost model is still broken: It assumes that any loads from
>>> MEM
>>> are from the generic address space and does not care for address spaces
>>> in its
>>> cost model.
>>>
>>> This patch undoes the changes from SVN r185605
>>>
>>> http://gcc.gnu.org/viewcvs?view=revision&revision=185605
>>>
>>> and installs a different but less intrusive hack around PR52543:
>>>
>>> targetm.mode_dependent_address_p has an address space parameter so that
>>> the
>>> backend can pretend all non-generic addresses are mode-dependent.
>>>
>>> This keeps lower-subreg.c from splitting the loads, and it is possible to
>>> represent the loads as MEM and there is no more the need to represent
>>> them as
>>> UNSPECs.
>>>
>>> This patch is still not an optimal solution but the code is much closer
>>> to a
>>> clean solution now.
>>>
>>> Ok for trunk?
>>
>>
>> You can apply it.
>>
>> Denis.
>
>
> I also applied the following change:
>
> http://gcc.gnu.org/viewcvs?view=revision&revision=191825
>
> * config/avr/avr.md (adjust_len): Add lpm.
> (reload_in): Use avr_out_lpm for output.  Use "lpm" for
> adjust_len.
> * config/avr/avr-protos.h (avr_out_lpm): New prototype.
> * config/avr/avr.c (avr_out_lpm): Make global.
> (adjust_insn_length): Handle ADJUST_LEN_LPM.
>
> The reload_in insns used the wrong output functions.
>
> Notice that this change is just a cosmetic change because the secondary
> reload for the non-generic spaces are ignored.  That is:  despite
> avr_secondary_reload, REG <- MEM input reloads are not mapped to their
> secondary reload insn and the mov insn for that load is used.
>
> This leads to a situation where the insn output function is not supplied
> with the needed clobber register, thus the avr_find_unused_d_reg function is
> needed to work around that.

What would happen if there is no unused d register?

> Denis, do you know why the secondary reloads requested by
> avr_secondary_reload are ignored?  I see calls to this hook and sri->icode
> is set to the right insn code but ignored afterwards.
>
> The only calls to that hook with the right operands are from ira cost
> computation.

I tried to use secondary reloads a few years ago (maybe 5 or 7).
I definitely remember only one thing: secondary reload should be
avoided as long as possible.
The best way to gain knowledge about it is GDB ;-)

>
> From the internals I don't see why it is skipped and the responsiveness in
> the gcc-help@ list on such topics is zero :-(

IMHO it's a question for gcc@, not for gcc-help@.


Denis.


Re: [rtl] combine a vec_concat of 2 vec_selects from the same vector

2012-09-30 Thread Marc Glisse

On Sat, 29 Sep 2012, Eric Botcazou wrote:


this patch lets the compiler try to rewrite:

(vec_concat (vec_select x [a]) (vec_select x [b]))

as:

vec_select x [a b]

or even just "x" if appropriate.

[...]

OK, but:

1) Always add a comment describing the simplification when you add one,

2) The condition:


+   if (GET_MODE (XEXP (trueop0, 0)) == mode
+   && INTVAL (XVECEXP (XEXP (trueop1, 1), 0, 0))
+  - INTVAL (XVECEXP (XEXP (trueop0, 1), 0, 0)) == 1)
+ return XEXP (trueop0, 0);


can be simplified: if GET_MODE (XEXP (trueop0, 0)) == mode, then XEXP
(trueop0, 0) is a 2-element vector so the only possible case is (0,1).
That would probably even be more correct since you don't test CONST_INT_P for
the indices, while the test is done in the VEC_SELECT case.


It looks like I was trying to be clever by replacing 2 understandable 
tests with a single more obscure one, bad idea.



Why not generalizing to all kinds of VEC_SELECTs instead of just scalar ones?


Ok, I changed the patch a bit to handle arbitrary VEC_SELECTs, and moved 
the identity recognition to VEC_SELECT handling (where it belonged). 
Testing with non-scalar VEC_SELECTs was limited though, because they are 
not that easy to generate. Also, the identity case is the only one where 
it actually optimized. To handle more cases, I'd have to look through 
several layers of VEC_SELECTs, which gets a bit complicated (for instance, 
the permutation 0,1,3,2 will appear as a vec_concat of a 
vec_select(v,[0,1]) and a vec_select(vec_select(v,[2,3]),[1,0]), or worse 
with a vec_concat in the middle). It also didn't optimize 3,2,3,2, 
possibly because that meant substituting the same rtx twice (I didn't go 
that far in gdb). Then there is also the vec_duplicate case (I should try 
to replace vec_duplicate with vec_concat in simplify-rtx to see what 
happens...). Still, the identity case is nice to have.
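
For reference, a testcase-style function (not part of the patch; how it is
expanded is an assumption based on the description above) that exercises the
0,1,3,2 permutation:

typedef double v4df __attribute__ ((__vector_size__ (32)));

v4df
perm_0132 (v4df x)
{
  /* Element order 0,1,3,2: per the description above, this tends to reach
     the RTL simplifier as a vec_concat of a vec_select of the low half and
     a reversed vec_select of the high half.  */
  v4df xx = { x[0], x[1], x[3], x[2] };
  return xx;
}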


Thanks for your comments.

bootstrap+testsuite on x86_64-linux-gnu with default languages.

2012-09-09  Marc Glisse  

gcc/
* simplify-rtx.c (simplify_binary_operation_1) :
Detect the identity.
: Handle VEC_SELECTs from the same vector.

gcc/testsuite/
* gcc.target/i386/vect-rebuild.c: New testcase.


--
Marc Glisse

Index: gcc/testsuite/gcc.target/i386/vect-rebuild.c
===
--- gcc/testsuite/gcc.target/i386/vect-rebuild.c(revision 0)
+++ gcc/testsuite/gcc.target/i386/vect-rebuild.c(revision 0)
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-O -mavx -fno-tree-forwprop" } */
+
+typedef double v2df __attribute__ ((__vector_size__ (16)));
+typedef double v4df __attribute__ ((__vector_size__ (32)));
+
+v2df f1 (v2df x)
+{
+  v2df xx = { x[0], x[1] };
+  return xx;
+}
+
+v4df f2 (v4df x)
+{
+  v4df xx = { x[0], x[1], x[2], x[3] };
+  return xx;
+}
+
+v2df g (v2df x)
+{
+  v2df xx = { x[1], x[0] };
+  return xx;
+}
+
+v2df h (v4df x)
+{
+  v2df xx = { x[2], x[3] };
+  return xx;
+}
+
+/* { dg-final { scan-assembler-not "unpck" } } */
+/* { dg-final { scan-assembler-times "\tv?permilpd\[ \t\]" 1 } } */
+/* { dg-final { scan-assembler-times "\tv?extractf128\[ \t\]" 1 } } */

Property changes on: gcc/testsuite/gcc.target/i386/vect-rebuild.c
___
Added: svn:keywords
   + Author Date Id Revision URL
Added: svn:eol-style
   + native

Index: gcc/simplify-rtx.c
===
--- gcc/simplify-rtx.c  (revision 191865)
+++ gcc/simplify-rtx.c  (working copy)
@@ -3239,20 +3239,37 @@ simplify_binary_operation_1 (enum rtx_co
  rtx x = XVECEXP (trueop1, 0, i);
 
  gcc_assert (CONST_INT_P (x));
  RTVEC_ELT (v, i) = CONST_VECTOR_ELT (trueop0,
   INTVAL (x));
}
 
  return gen_rtx_CONST_VECTOR (mode, v);
}
 
+ /* Recognize the identity.  */
+ if (GET_MODE (trueop0) == mode)
+   {
+ bool maybe_ident = true;
+ for (int i = 0; i < XVECLEN (trueop1, 0); i++)
+   {
+ rtx j = XVECEXP (trueop1, 0, i);
+ if (!CONST_INT_P (j) || INTVAL (j) != i)
+   {
+ maybe_ident = false;
+ break;
+   }
+   }
+ if (maybe_ident)
+   return trueop0;
+   }
+
  /* If we build {a,b} then permute it, build the result directly.  */
  if (XVECLEN (trueop1, 0) == 2
  && CONST_INT_P (XVECEXP (trueop1, 0, 0))
  && CONST_INT_P (XVECEXP (trueop1, 0, 1))
  && GET_CODE (trueop0) == VEC_CONCAT
  && GET_CODE (XEXP (trueop0, 0)) == VEC_CONCAT
  && GET_MODE (XEXP (trueop0, 0)) == mode
  && GET_CODE (XEXP (trueop0,

Re: [Patch, Fortran, OOP] PR 54667: gimplification failure with c_f_pointer

2012-09-30 Thread Janus Weil
Hi Thomas,

>> Regtested on x86_64-unknown-linux-gnu. Ok for trunk?
>
>
> This looks all right to me (although I'm not really an expert :-)
>
> OK, and thanks for the patch!

thanks for the review. Committed as r191870.

Cheers,
Janus


[SH] PR 50457 - Add additional atomic models

2012-09-30 Thread Oleg Endo
Hello,

This implements the changes as proposed in the PR, albeit with some small
differences:

* I decided to go for a more verbose option name '-matomic-model'
instead of just '-matomic'.  

* In addition to the soft-tcb model I've also added a soft-imask model.
Interrupt-flipping atomics might not be the best choice, but they are easy
to set up and get started with.

* There is a new atomic model parameter 'strict' to prohibit mixing of
atomic model sequences on SH4A.

There are no functional changes to the already existing soft and hard
atomics, except that '-msoft-atomic' is now mapped to
'-matomic-model=soft-gusa' and '-mhard-atomic' has been removed.

Tested on rev 191865 with 'make all' and 'make info dvi pdf' and by
compiling a couple of functions that use atomics and eyeballing the asm
output.

OK?

Cheers,
Oleg

ChangeLog:

PR target/50457
* config/sh/sh.opt (matomic-model): New option.
(msoft-atomic): Mark as deprecated and alias to 
matomic-model=soft-gusa.
(mhard-atomic): Delete.
* config/sh/predicates.md (gbr_displacement): New predicate.
* config/sh/sh-protos.h (sh_atomic_model): New struct.
(selected_atomic_model): New declaration.
(TARGET_ATOMIC_ANY, TARGET_ATOMIC_STRICT, 
TARGET_ATOMIC_SOFT_GUSA, TARGET_ATOMIC_HARD_LLCS, 
TARGET_ATOMIC_SOFT_TCB, TARGET_ATOMIC_SOFT_TCB_GBR_OFFSET_RTX, 
TARGET_ATOMIC_SOFT_IMASK): New macros.
* config/sh/linux.h (SUBTARGET_OVERRIDE_OPTIONS): Adapt setting 
to default atomic model.
* config/sh/sh.c (selected_atomic_model_): New global variable.
(selected_atomic_model, parse_validate_atomic_model_option): New
functions.
(sh_option_override): Replace atomic selection checks with call 
to parse_validate_atomic_model_option.
* config/sh/sh.h (TARGET_ANY_ATOMIC, UNSUPPORTED_ATOMIC_OPTIONS,
UNSUPPORTED_HARD_ATOMIC_CPU): Delete.
(DRIVER_SELF_SPECS): Remove atomic checks.
config/sh/sync.md: Update documentation comments.
(atomic_compare_and_swap, atomic_exchange, 
atomic_fetch_, atomic_fetch_nand, 
atomic__fetch, atomic_nand_fetch): Use
TARGET_ATOMIC_ANY as condition.  Add TARGET_ATOMIC_STRICT check
for SH4A case.  Handle new TARGET_ATOMIC_SOFT_TCB and
TARGET_ATOMIC_SOFT_IMASK cases.
(atomic_test_and_set): Handle new TARGET_ATOMIC_SOFT_TCB and 
TARGET_ATOMIC_SOFT_IMASK cases.
(atomic_compare_and_swapsi_hard, atomic_exchangesi_hard, 
atomic_fetch_si_hard, atomic_fetch_nandsi_hard, 
atomic__fetchsi_hard, atomic_nand_fetchsi_hard): 
Add TARGET_ATOMIC_STRICT check.
(atomic_compare_and_swap_hard, atomic_exchange_hard,
atomic_fetch__hard, 
atomic_fetch_nand_hard, 
atomic__fetch_hard, 
atomic_nand_fetch_hard, atomic_test_and_set_hard): Use
TARGET_ATOMIC_HARD_LLCS condition.
(atomic_compare_and_swap_soft, atomic_exchange_soft,
atomic_fetch__soft,
atomic_fetch_nand_soft, 
atomic__fetch_soft,
atomic_nand_fetch_soft, atomic_test_and_set_soft): Append 
_gusa to the insn names and use TARGET_ATOMIC_SOFT_GUSA as 
condition.
(atomic_compare_and_swap_soft_tcb, 
atomic_exchange_soft_tcb, 
atomic_fetch__soft_tcb, 
atomic_fetch_nand_soft_tcb, 
atomic__fetch_soft_tcb, 
atomic_nand_fetch_soft_tcb, atomic_test_and_set_soft_tcb):
New insns.
(atomic_compare_and_swap_soft_imask, 
atomic_exchange_soft_imask, 
atomic_fetch__soft_imask, 
atomic_fetch_nand_soft_imask, 
atomic__fetch_soft_imask, 
atomic_nand_fetch_soft_imask, 
atomic_test_and_set_soft_imask): New insns.
* doc/invoke.texi (SH Options): Document new matomic-model 
option.  Remove msoft-atomic and mhard-atomic options.
Index: gcc/config/sh/sh.opt
===
--- gcc/config/sh/sh.opt	(revision 191865)
+++ gcc/config/sh/sh.opt	(working copy)
@@ -320,12 +320,12 @@
 Follow Renesas (formerly Hitachi) / SuperH calling conventions
 
 msoft-atomic
-Target Report Var(TARGET_SOFT_ATOMIC)
-Use gUSA software atomic sequences
+Target Undocumented Alias(matomic-model=, soft-gusa, none)
+Deprecated.  Use -matomic= instead to select the atomic model
 
-mhard-atomic
-Target Report Var(TARGET_HARD_ATOMIC)
-Use hardware atomic sequences
+matomic-model=
+Target Report RejectNegative Joined Var(sh_atomic_model_str)
+Specify the model for atomic operations
 
 mtas
 Target Report RejectNegative Var(TARGET_ENABLE_TAS)
Index: gcc/config/sh/predicates.md
===
--- gcc/config/sh/predicates.md	(revision 191865)
+++ gcc/config/sh/predicates.md	(working copy)
@@ -1071,3 +1071,19 @@
 
   return false;
 })
+
+;; A predicate that determines whether a given c

Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Steven Bosscher
On Sun, Sep 30, 2012 at 6:01 PM, Richard Guenther
 wrote:
>>> --64-bit:---
>>> Reload:
>>> 503.26user 36.54system 30:16.62elapsed 29%CPU (0avgtext+0avgdata
>>> LRA:
>>> 598.70user 30.90system 27:26.92elapsed 38%CPU (0avgtext+0avgdata
>>
>> This is a ~19% slowdown
>
> I think both measurements run into swap (low CPU utilization), from the LRA
> numbers I'd say that LRA uses less memory but the timings are somewhat
> useless with the swapping.

Not on gcc17. It has almost no swap to begin with, but the max.
resident size is less than half of the machine's RAM (~7GB max.
resident vs 16GB machine RAM). It obviously has to do with memory
behavior, but it's probably more a matter of size (>200,000 basic
blocks, >600,000 pseudos, etc., basic blocks with livein/liveout sets
with a cardinality in the 10,000s, etc.), not swapping.


> It would be nice to see if LRA just has a larger constant cost factor
> compared to reload or if it has bigger complexity.

It is complexity in all typical measures of size (number of basic blocks,
number of insns, and so on); that's easily verified with
artificial test cases.


>> The code size changes are impressive, but I think that this kind of
>> slowdown should be addressed before making LRA the default for any
>> target.
>
> Certainly if it shows bigger complexity, not sure for the constant factor
> (but for sure improvements are welcome).
>
> I suppose there is the option to revert back to reload by default for
> x86_64 as well for 4.8, right?  That is, do both reload and LRA
> co-exist for each target or is it a definite decision target by target?

Do you really want to have two such bug-sensitive paths through the compiler?

Ciao!
Steven


Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Andi Kleen
Richard Guenther  writes:
>
> I think both measurements run into swap (low CPU utilization), from the LRA
> numbers I'd say that LRA uses less memory but the timings are somewhat
> useless with the swapping.

On Linux I would normally recommend to use

/usr/bin/time -f 'real=%e user=%U system=%S share=%P%% maxrss=%M ins=%I
outs=%O mfaults=%R waits=%w'

instead of plain time. It gives you much more information
(especially maxrss and waits), so it's easier to reliably tell whether you
have a memory problem or not.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Steven Bosscher
Hi,


To look at it in yet another way:

>  integrated RA   : 189.34 (16%) usr
>  LRA non-specific:  59.82 ( 5%) usr
>  LRA virtuals eliminatenon:  56.79 ( 5%) usr
>  LRA create live ranges  : 175.30 (15%) usr
>  LRA hard reg assignment : 130.85 (11%) usr

The IRA pass is slower than the next-slowest pass (tree PRA) by almost
a factor of 2.5.  Each of the individually-measured *phases* of LRA is
slower than the complete IRA *pass*.  These 5 timevars together make up
52% of all compile time.

IRA already has scalability problems, let's not add more of that with LRA.

Ciao!
Steven


Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Richard Guenther
On Sun, Sep 30, 2012 at 6:52 PM, Steven Bosscher  wrote:
> Hi,
>
>
> To look at it in yet another way:
>
>>  integrated RA   : 189.34 (16%) usr
>>  LRA non-specific:  59.82 ( 5%) usr
>>  LRA virtuals eliminatenon:  56.79 ( 5%) usr
>>  LRA create live ranges  : 175.30 (15%) usr
>>  LRA hard reg assignment : 130.85 (11%) usr
>
> The IRA pass is slower than the next-slowest pass (tree PRA) by almost
> a factor 2.5.  Each of the individually-measured *phases* of LRA is
> slower than the complete IRA *pass*. These 5 timevars together make up
> for 52% of all compile time.

That figure indeed makes IRA + LRA look bad.  Did you by chance identify
anything obvious that can be done to improve the situation?

Thanks,
Richard.

> IRA already has scalability problems, let's not add more of that with LRA.
>
> Ciao!
> Steven


[PATCH] Update line numbers in testsuite

2012-09-30 Thread Andreas Schwab
Committed.

Andreas.

* gcc.dg/ucnid-8.c: Update line number.
* gcc.dg/torture/pr51106-2.c: Likewise.

diff --git a/gcc/testsuite/gcc.dg/torture/pr51106-2.c 
b/gcc/testsuite/gcc.dg/torture/pr51106-2.c
index 49dcdd0..80328a9 100644
--- a/gcc/testsuite/gcc.dg/torture/pr51106-2.c
+++ b/gcc/testsuite/gcc.dg/torture/pr51106-2.c
@@ -12,4 +12,4 @@ lab:
   return 0;
 }
 
-/* { dg-warning "probably doesn.t match constraints" "" { target *-*-* } 8 } */
+/* { dg-warning "probably doesn.t match constraints" "" { target *-*-* } 9 } */
diff --git a/gcc/testsuite/gcc.dg/ucnid-8.c b/gcc/testsuite/gcc.dg/ucnid-8.c
index 66cdbc5..0e48a7f 100644
--- a/gcc/testsuite/gcc.dg/ucnid-8.c
+++ b/gcc/testsuite/gcc.dg/ucnid-8.c
@@ -13,4 +13,4 @@ void f (int b) { int \u00e9[b]; } /* { dg-warning "variable 
length array 'U0
 void g (static int \u00e9); /* { dg-error "storage class specified for 
parameter 'U00e9'" } */
 
 struct s2 { int \u00e1; } \u00e9 = { { 0 } }; /* { dg-warning "braces around 
scalar initializer" } */
-/* { dg-warning "near initialization for 'U00e9\\.U00e1'" "UCN 
diag" { target *-*-* } 14 } */
+/* { dg-warning "near initialization for 'U00e9\\.U00e1'" "UCN 
diag" { target *-*-* } 15 } */
-- 
1.7.12.2


-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


[m68k] Cleanup bitfield insns

2012-09-30 Thread Andreas Schwab
This cleanup removes the handling for MEM expressions that can never
happen.  This fixes an ICE that happens because register_operand also
matches (SUBREG (REG)), which isn't caught when only checking for REG.
Tested on m68k-linux and committed.

Andreas.

* config/m68k/m68k.md: Add names to bitfield insert and extract
insns.
(*insv_8_16_reg): Remove constraints and conditions that assume
that operand 0 could be a MEM.
(*extzv_8_16_reg, *extv_8_16_reg): Likewise, for operand 1.

Index: config/m68k/m68k.md
===
--- config/m68k/m68k.md (revision 191870)
+++ config/m68k/m68k.md (working copy)
@@ -5603,7 +5603,7 @@
 ; The move is allowed to be odd byte aligned, because that's still faster
 ; than an odd byte aligned bit-field instruction.
 ;
-(define_insn ""
+(define_insn "*insv_32_mem"
   [(set (zero_extract:SI (match_operand:QI 0 "memory_operand" "+o")
 (const_int 32)
 (match_operand:SI 1 "const_int_operand" "n"))
@@ -5619,32 +5619,17 @@
   return "move%.l %2,%0";
 })
 
-(define_insn ""
-  [(set (zero_extract:SI (match_operand:SI 0 "register_operand" "+do")
+(define_insn "*insv_8_16_reg"
+  [(set (zero_extract:SI (match_operand:SI 0 "register_operand" "+d")
 (match_operand:SI 1 "const_int_operand" "n")
 (match_operand:SI 2 "const_int_operand" "n"))
(match_operand:SI 3 "register_operand" "d"))]
   "TARGET_68020 && TARGET_BITFIELD
&& (INTVAL (operands[1]) == 8 || INTVAL (operands[1]) == 16)
-   && INTVAL (operands[2]) % INTVAL (operands[1]) == 0
-   && (GET_CODE (operands[0]) == REG
-   || ! mode_dependent_address_p (XEXP (operands[0], 0),
-  MEM_ADDR_SPACE (operands[0])))"
+   && INTVAL (operands[2]) % INTVAL (operands[1]) == 0"
 {
-  if (REG_P (operands[0]))
-{
-  if (INTVAL (operands[1]) + INTVAL (operands[2]) != 32)
-return "bfins %3,%0{%b2:%b1}";
-}
-  else
-operands[0] = adjust_address (operands[0],
- INTVAL (operands[1]) == 8 ? QImode : HImode,
- INTVAL (operands[2]) / 8);
-
-  if (GET_CODE (operands[3]) == MEM)
-operands[3] = adjust_address (operands[3],
- INTVAL (operands[1]) == 8 ? QImode : HImode,
- (32 - INTVAL (operands[1])) / 8);
+  if (INTVAL (operands[1]) + INTVAL (operands[2]) != 32)
+return "bfins %3,%0{%b2:%b1}";
 
   if (INTVAL (operands[1]) == 8)
 return "move%.b %3,%0";
@@ -5659,7 +5644,7 @@
 ; The move is allowed to be odd byte aligned, because that's still faster
 ; than an odd byte aligned bit-field instruction.
 ;
-(define_insn ""
+(define_insn "*extzv_32_mem"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=rm")
(zero_extract:SI (match_operand:QI 1 "memory_src_operand" "oS")
 (const_int 32)
@@ -5675,34 +5660,20 @@
   return "move%.l %1,%0";
 })
 
-(define_insn ""
+(define_insn "*extzv_8_16_reg"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=&d")
-   (zero_extract:SI (match_operand:SI 1 "register_operand" "do")
+   (zero_extract:SI (match_operand:SI 1 "register_operand" "d")
 (match_operand:SI 2 "const_int_operand" "n")
 (match_operand:SI 3 "const_int_operand" "n")))]
   "TARGET_68020 && TARGET_BITFIELD
&& (INTVAL (operands[2]) == 8 || INTVAL (operands[2]) == 16)
-   && INTVAL (operands[3]) % INTVAL (operands[2]) == 0
-   && (GET_CODE (operands[1]) == REG
-   || ! mode_dependent_address_p (XEXP (operands[1], 0),
-  MEM_ADDR_SPACE (operands[1])))"
+   && INTVAL (operands[3]) % INTVAL (operands[2]) == 0"
 {
   cc_status.flags |= CC_NOT_NEGATIVE;
-  if (REG_P (operands[1]))
-{
-  if (INTVAL (operands[2]) + INTVAL (operands[3]) != 32)
-   return "bfextu %1{%b3:%b2},%0";
-}
-  else
-operands[1]
-  = adjust_address (operands[1], SImode, INTVAL (operands[3]) / 8);
+  if (INTVAL (operands[2]) + INTVAL (operands[3]) != 32)
+return "bfextu %1{%b3:%b2},%0";
 
   output_asm_insn ("clr%.l %0", operands);
-  if (GET_CODE (operands[0]) == MEM)
-operands[0] = adjust_address (operands[0],
- INTVAL (operands[2]) == 8 ? QImode : HImode,
- (32 - INTVAL (operands[1])) / 8);
-
   if (INTVAL (operands[2]) == 8)
 return "move%.b %1,%0";
   return "move%.w %1,%0";
@@ -5715,7 +5686,7 @@
 ; The move is allowed to be odd byte aligned, because that's still faster
 ; than an odd byte aligned bit-field instruction.
 ;
-(define_insn ""
+(define_insn "*extv_32_mem"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=rm")
(sign_extract:SI (match_operand:QI 1 "memory_src_operand" "oS")
 (const_int 32)
@@ -5731,28 +5702,17 @@
   return "mov

[committed] Skip gcc.dg/torture/pr53922.c on 32-bit hppa-*-hpux*

2012-09-30 Thread John David Anglin
The test gcc.dg/torture/pr53922.c fails on 32-bit HP-UX because there
isn't support for undefined weak symbols.  We skip rather than xfail
to avoid the warning that would otherwise arise.

Tested on hppa2.0w-hp-hpux11.11 and hppa64-hp-hpux11.11.  Committed
to trunk and 4.7 branch.

Dave
-- 
J. David Anglin  dave.ang...@nrc-cnrc.gc.ca
National Research Council of Canada  (613) 990-0752 (FAX: 952-6602)

2012-09-30  John David Anglin  

PR target/54083
* gcc.dg/torture/pr53922.c: Skip on 32-bit hppa-*-hpux*.

Index: gcc.dg/torture/pr53922.c
===
--- gcc.dg/torture/pr53922.c(revision 191314)
+++ gcc.dg/torture/pr53922.c(working copy)
@@ -1,5 +1,6 @@
 /* { dg-do run } */
 /* { dg-require-weak "" } */
+/* { dg-skip-if "No undefined weak" { hppa*-*-hpux* && { ! lp64 } } { "*" } { 
"" } } */
 
 int x(int a)
 {


[SPARC] Fix recent and older thinkos

2012-09-30 Thread Eric Botcazou
This fixes a recent thinko introduced by the double-int rewrite and 
responsible for the following failures:

FAIL: gcc.target/sparc/pdist-2.c scan-tree-dump optimized "return 475"
FAIL: gcc.target/sparc/pdist-3.c execution test

as well as an older one spotted by Bernd, whereby the compiler emits non-
canonical RTL for the stack adjustments issued for the epilogue.  The patch 
also contains a patchlet for reorg.c that performs a bit of manual CSE:
 
XVECEXP (PATTERN (insn), 0, 0)  --> delay_insn

and which makes it more obvious why you can take JUMP_LABEL (delay_insn).

Tested on SPARC/Linux, applied on the mainline.


2012-09-30  Eric Botcazou  

* reorg.c (relax_delay_slots): Use delay_insn consistently.

* config/sparc/sparc.c (gen_stack_pointer_dec): Delete.
(sparc_expand_epilogue): Use gen_stack_pointer_inc and adjust.
(sparc_flat_expand_epilogue): Likewise.
(emit_and_preserve): Likewise.
(sparc_fold_builtin): Fix thinko in latest change.


-- 
Eric Botcazou

Index: reorg.c
===
--- reorg.c	(revision 191796)
+++ reorg.c	(working copy)
@@ -3432,9 +3432,8 @@ relax_delay_slots (rtx first)
 	reorg_redirect_jump (insn, other_target);
 	}
 
-  /* Now look only at cases where we have filled a delay slot.  */
-  if (!NONJUMP_INSN_P (insn)
-	  || GET_CODE (PATTERN (insn)) != SEQUENCE)
+  /* Now look only at cases where we have a filled delay slot.  */
+  if (!NONJUMP_INSN_P (insn) || GET_CODE (PATTERN (insn)) != SEQUENCE)
 	continue;
 
   pat = PATTERN (insn);
@@ -3494,9 +3493,8 @@ relax_delay_slots (rtx first)
 	}
 
   /* Now look only at the cases where we have a filled JUMP_INSN.  */
-  if (!JUMP_P (XVECEXP (PATTERN (insn), 0, 0))
-	  || ! (condjump_p (XVECEXP (PATTERN (insn), 0, 0))
-		|| condjump_in_parallel_p (XVECEXP (PATTERN (insn), 0, 0
+  if (!JUMP_P (delay_insn)
+	  || !(condjump_p (delay_insn) || condjump_in_parallel_p (delay_insn)))
 	continue;
 
   target_label = JUMP_LABEL (delay_insn);
Index: config/sparc/sparc.c
===
--- config/sparc/sparc.c	(revision 191796)
+++ config/sparc/sparc.c	(working copy)
@@ -4976,18 +4976,6 @@ gen_stack_pointer_inc (rtx increment)
 increment));
 }
 
-/* Generate a decrement for the stack pointer.  */
-
-static rtx
-gen_stack_pointer_dec (rtx decrement)
-{
-  return gen_rtx_SET (VOIDmode,
-		  stack_pointer_rtx,
-		  gen_rtx_MINUS (Pmode,
- stack_pointer_rtx,
- decrement));
-}
-
 /* Expand the function prologue.  The prologue is responsible for reserving
storage for the frame, saving the call-saved registers and loading the
GOT register if needed.  */
@@ -5258,17 +5246,17 @@ sparc_expand_epilogue (bool for_eh)
   else if (sparc_leaf_function_p)
 {
   if (size <= 4096)
-	emit_insn (gen_stack_pointer_dec (GEN_INT (-size)));
+	emit_insn (gen_stack_pointer_inc (GEN_INT (size)));
   else if (size <= 8192)
 	{
-	  emit_insn (gen_stack_pointer_dec (GEN_INT (-4096)));
-	  emit_insn (gen_stack_pointer_dec (GEN_INT (4096 - size)));
+	  emit_insn (gen_stack_pointer_inc (GEN_INT (4096)));
+	  emit_insn (gen_stack_pointer_inc (GEN_INT (size - 4096)));
 	}
   else
 	{
 	  rtx reg = gen_rtx_REG (Pmode, 1);
-	  emit_move_insn (reg, GEN_INT (-size));
-	  emit_insn (gen_stack_pointer_dec (reg));
+	  emit_move_insn (reg, GEN_INT (size));
+	  emit_insn (gen_stack_pointer_inc (reg));
 	}
 }
 }
@@ -5318,17 +5306,17 @@ sparc_flat_expand_epilogue (bool for_eh)
   emit_insn (gen_blockage ());
 
   if (size <= 4096)
-	emit_insn (gen_stack_pointer_dec (GEN_INT (-size)));
+	emit_insn (gen_stack_pointer_inc (GEN_INT (size)));
   else if (size <= 8192)
 	{
-	  emit_insn (gen_stack_pointer_dec (GEN_INT (-4096)));
-	  emit_insn (gen_stack_pointer_dec (GEN_INT (4096 - size)));
+	  emit_insn (gen_stack_pointer_inc (GEN_INT (4096)));
+	  emit_insn (gen_stack_pointer_inc (GEN_INT (size - 4096)));
 	}
   else
 	{
 	  rtx reg = gen_rtx_REG (Pmode, 1);
-	  emit_move_insn (reg, GEN_INT (-size));
-	  emit_insn (gen_stack_pointer_dec (reg));
+	  emit_move_insn (reg, GEN_INT (size));
+	  emit_insn (gen_stack_pointer_inc (reg));
 	}
 }
 }
@@ -10131,7 +10119,7 @@ sparc_fold_builtin (tree fndecl, int n_a
 	  && TREE_CODE (arg2) == INTEGER_CST)
 	{
 	  bool overflow = false;
-	  double_int di_arg2 = TREE_INT_CST (arg2);
+	  double_int result = TREE_INT_CST (arg2);
 	  double_int tmp;
 	  unsigned i;
 
@@ -10147,13 +10135,13 @@ sparc_fold_builtin (tree fndecl, int n_a
 	  if (tmp.is_negative ())
 		tmp = tmp.neg_with_overflow (&neg2_ovf);
 
-	  tmp = di_arg2.add_with_sign (tmp, false, &add2_ovf);
+	  result = result.add_with_sign (tmp, false, &add2_ovf);
 	  overflow |= neg1_ovf | neg2_ovf | add1_ovf | add2_ovf;
 	}
 
 	  gcc_assert (!overflow);
 
-	  return build_int_cst_wide (rtype, tmp.low, tmp.high)

Tweak IRA checks for singleton register classes

2012-09-30 Thread Richard Sandiford
IRA has code to check whether there is only a single acceptable register
for a given operand.  This code uses conditions like:

  ira_class_hard_regs_num[cl] != 0
  && (ira_class_hard_regs_num[cl] <= ira_reg_class_max_nregs[cl][mode])

i.e. the number of registers needed to store the mode is >=
the number of allocatable registers in the class.  Then:

  ira_class_hard_regs[cl][0]

gives the register in question.
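
As a minimal sketch (hypothetical helpers, not code from the patch, relying
on the arrays declared in ira.h), the old test and the new lookup side by
side:

/* Old style: a class/mode pair specifies a single register when MODE
   needs at least as many registers as the class has allocatable
   registers at all.  */
static int
old_singleton_reg (enum reg_class cl, enum machine_mode mode)
{
  if (ira_class_hard_regs_num[cl] != 0
      && ira_class_hard_regs_num[cl] <= ira_reg_class_max_nregs[cl][mode])
    return ira_class_hard_regs[cl][0];
  return -1;
}

/* New style: a precomputed table that also copes with classes such as
   MD_REGS or ACC_REGS, where more than one register is allocatable but
   only one of them accepts MODE.  */
static int
new_singleton_reg (enum reg_class cl, enum machine_mode mode)
{
  return ira_class_singleton[cl][mode];
}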

MIPS has a slightly strange situation in which HI can only be allocated
alongside LO; it can't be allocated independently.  At the moment,
HI and LO have their own register classes (MD0_REG and MD1_REG,
with the mapping depending on endianness) and MD_REGS is used when both
HI and LO are required.  There is also ACC_REGS, which is equivalent to
MD_REGS when the DSP ASE is not being used.  MD_REGS and ACC_REGS are
already mapped to constraints.

Having MD0_REG and MD1_REG leads to some confusing costs and makes
HI and LO irregular WRT the DSP ASE accumulator registers.  I've been
experimenting with patches to remove these classes and just have MD_REGS.
I wanted to get to a situation where this change has no effect on cc1 .ii
files for -mno-dsp; the patch below is one of those needed to get to that
stage.

MD_REGS has only one SImode register.  As described above, the same goes
for ACC_REGS unless the DSP ASE is being used.  However, both classes
fail the check above because HI (which doesn't accept SImode) is also
allocatable.  That is, the classes have two allocatable registers,
but only one of them can be used for SImode.

The patch below adds a new array for tracking which class/mode
combinations specify a single register, and for recording which
register that is.  The net effect will be the same on almost all
targets.
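
As a rough illustration (hypothetical helper names, not part of the patch),
a caller that today writes the two-part test quoted above could instead
consult the new array directly:

  /* Hypothetical sketch only.  ira_class_singleton[CL][M] is -1 unless
     class CL has exactly one allocatable register usable for mode M,
     in which case it is that register's number.  */
  static inline bool
  class_has_single_reg_p (enum reg_class cl, enum machine_mode mode)
  {
    return ira_class_singleton[cl][mode] >= 0;
  }

  static inline int
  class_single_reg (enum reg_class cl, enum machine_mode mode)
  {
    /* Previously obtained as ira_class_hard_regs[cl][0], guarded by
       the two-part test; now recorded directly.  */
    return ira_class_singleton[cl][mode];
  }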

I deliberately didn't change:

  for (p2 = &reg_class_subclasses[cl2][0];
   *p2 != LIM_REG_CLASSES; p2++)
if (ira_class_hard_regs_num[*p2] > 0
&& (ira_reg_class_max_nregs[*p2][mode]
<= ira_class_hard_regs_num[*p2]))
  cost = MAX (cost, ira_register_move_cost[mode][cl1][*p2]);

  for (p1 = &reg_class_subclasses[cl1][0];
   *p1 != LIM_REG_CLASSES; p1++)
if (ira_class_hard_regs_num[*p1] > 0
&& (ira_reg_class_max_nregs[*p1][mode]
<= ira_class_hard_regs_num[*p1]))
  cost = MAX (cost, ira_register_move_cost[mode][*p1][cl2]);

from ira_init_register_move_cost because that had more effect
than I was expecting and wasn't needed for the MIPS patch.
It could be done as a follow-up if I ever find time...

I checked that this produced no difference in assembly output for
a set of x86_64 gcc .ii files (tested with -O2 -march=native on gcc20).
Also tested on x86_64-linux-gnu (including -m32) and mipsisa64-elf.
OK to install?

Richard


gcc/
* ira.h (target_ira): Add x_ira_class_singleton.
(ira_class_singleton): New macro.
* ira.c (setup_prohibited_class_mode_regs): Set up ira_class_singleton.
* ira-build.c (update_conflict_hard_reg_costs): Use
ira_class_singleton to check for classes with a single
allocatable register.
* ira-lives.c (ira_implicitly_set_insn_hard_regs): Likewise.
(single_reg_class): Likewise.  When more than one class is specified,
check whether they have the same singleton register.
(process_single_reg_class_operands): Require single_reg_class
to return NO_REGS or a class with a single allocatable register.
Obtain that register from ira_class_singleton.

Index: gcc/ira.h
===
--- gcc/ira.h   2012-09-30 12:56:14.344185269 +0100
+++ gcc/ira.h   2012-09-30 17:45:14.964463976 +0100
@@ -79,6 +79,10 @@ struct target_ira {
  class.  */
   int x_ira_class_hard_regs_num[N_REG_CLASSES];
 
+  /* If class CL has a single allocatable register of mode M,
+ index [CL][M] gives the number of that register, otherwise it is -1.  */
+  short x_ira_class_singleton[N_REG_CLASSES][MAX_MACHINE_MODE];
+
   /* Function specific hard registers can not be used for the register
  allocation.  */
   HARD_REG_SET x_ira_no_alloc_regs;
@@ -117,6 +121,8 @@ #define ira_class_hard_regs \
   (this_target_ira->x_ira_class_hard_regs)
 #define ira_class_hard_regs_num \
   (this_target_ira->x_ira_class_hard_regs_num)
+#define ira_class_singleton \
+  (this_target_ira->x_ira_class_singleton)
 #define ira_no_alloc_regs \
   (this_target_ira->x_ira_no_alloc_regs)
 
Index: gcc/ira.c
===
--- gcc/ira.c   2012-09-30 12:56:14.344185269 +0100
+++ gcc/ira.c   2012-09-30 19:20:32.555409864 +0100
@@ -1451,16 +1451,21 @@ setup_reg_class_nregs (void)
 
 
 
-/* Set up IRA_PROHIBITED_CLASS_MODE_REGS.  */
+/* Set up IRA_PROHIBITED_CLASS_MODE_REGS and IRA_CLASS_SINGLETON.
+   This function is called once 

PR 53889: Add __gthread_recursive_mutex_destroy

2012-09-30 Thread Jonathan Wakely
There is no __gthread_recursive_mutex_destroy function in the gthreads API.

Trying to use __gthread_mutex_destroy fails to compile on platforms
where the mutex
types are different. To avoid resource leaks libstdc++ needs to hack
around the missing function with overloaded functions and SFINAE
tricks to detect how a recursive mutex can be destroyed.

This patch extends the gthreads API to include
__gthread_recursive_mutex_destroy, defining it for each gthread model,
and removing the hacks from libstdc++.
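
For the POSIX model the new function can simply forward to the plain mutex
destroy, since the recursive mutex type is a pthread_mutex_t there as well;
a sketch of that addition (the gthr-posix.h hunk itself is not quoted below):

  static inline int
  __gthread_recursive_mutex_destroy (__gthread_recursive_mutex_t *__mutex)
  {
    return __gthread_mutex_destroy (__mutex);
  }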

libgcc:

PR other/53889
* gthr.h (__gthread_recursive_mutex_destroy): Document new required
function.
* gthr-posix.h (__gthread_recursive_mutex_destroy): Define.
* gthr-single.h (__gthread_recursive_mutex_destroy): Likewise.
* config/gthr-rtems.h (__gthread_recursive_mutex_destroy): Likewise.
* config/gthr-vxworks.h (__gthread_recursive_mutex_destroy): Likewise.
* config/i386/gthr-win32.h (__gthread_recursive_mutex_destroy):
Likewise.
* config/mips/gthr-mipssde.h (__gthread_recursive_mutex_destroy):
Likewise.
* config/pa/gthr-dce.h (__gthread_recursive_mutex_destroy): Likewise.
* config/s390/gthr-tpf.h (__gthread_recursive_mutex_destroy): Likewise.

libstdc++-v3:

PR other/53889
* include/std/mutex (__recursive_mutex_base::~__recursive_mutex_base):
Use __gthread_recursive_mutex_destroy.
(__recursive_mutex_base::_S_destroy): Remove.
(__recursive_mutex_base::_S_destroy_win32): Likewise.
* include/ext/concurrence.h (__recursive_mutex::~__recursive_mutex):
Use __gthread_recursive_mutex_destroy.
(__recursive_mutex::_S_destroy): Remove.
(__recursive_mutex::_S_destroy_win32): Likewise.


Tested x86_64-linux.

Are the libgcc parts OK for trunk?
commit 37d75fef68222e92c4b58870dcfeeb3679e3c718
Author: Jonathan Wakely 
Date:   Sun Sep 30 19:00:51 2012 +0100

libgcc:

PR other/53889
* gthr.h (__gthread_recursive_mutex_destroy): Document new required
function.
* gthr-posix.h (__gthread_recursive_mutex_destroy): Define.
* gthr-single.h (__gthread_recursive_mutex_destroy): Likewise.
* config/gthr-rtems.h (__gthread_recursive_mutex_destroy): Likewise.
* config/gthr-vxworks.h (__gthread_recursive_mutex_destroy): Likewise.
* config/i386/gthr-win32.h (__gthread_recursive_mutex_destroy):
Likewise.
* config/mips/gthr-mipssde.h (__gthread_recursive_mutex_destroy):
Likewise.
* config/pa/gthr-dce.h (__gthread_recursive_mutex_destroy): Likewise.
* config/s390/gthr-tpf.h (__gthread_recursive_mutex_destroy): Likewise.

libstdc++-v3:

PR other/53889
* include/std/mutex (__recursive_mutex_base::~__recursive_mutex_base):
Use __gthread_recursive_mutex_destroy.
(__recursive_mutex_base::_S_destroy): Remove.
(__recursive_mutex_base::_S_destroy_win32): Likewise.
* include/ext/concurrence.h (__recursive_mutex::~__recursive_mutex):
Use __gthread_recursive_mutex_destroy.
(__recursive_mutex::_S_destroy): Remove.
(__recursive_mutex::_S_destroy_win32): Likewise.

diff --git a/libgcc/config/gthr-rtems.h b/libgcc/config/gthr-rtems.h
index c5bd522..50bdd9f 100644
--- a/libgcc/config/gthr-rtems.h
+++ b/libgcc/config/gthr-rtems.h
@@ -1,8 +1,7 @@
 /* RTEMS threads compatibility routines for libgcc2 and libobjc.
by: Rosimildo da Silva( rdasi...@connecttel.com ) */
 /* Compile this one with gcc.  */
-/* Copyright (C) 1997, 1999, 2000, 2002, 2003, 2005, 2008, 2009
-   Free Software Foundation, Inc.
+/* Copyright (C) 1997-2012 Free Software Foundation, Inc.
 
 This file is part of GCC.
 
@@ -150,6 +149,12 @@ __gthread_recursive_mutex_unlock (__gthread_recursive_mutex_t *__mutex)
 return rtems_gxx_recursive_mutex_unlock( __mutex );
 }
 
+static inline int
+__gthread_recursive_mutex_destroy (__gthread_recursive_mutex_t *__mutex)
+{
+return rtems_gxx_mutex_destroy( __mutex );
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/libgcc/config/gthr-vxworks.h b/libgcc/config/gthr-vxworks.h
index 63116c4..b48c5ac 100644
--- a/libgcc/config/gthr-vxworks.h
+++ b/libgcc/config/gthr-vxworks.h
@@ -1,7 +1,6 @@
 /* Threads compatibility routines for libgcc2 and libobjc for VxWorks.  */
 /* Compile this one with gcc.  */
-/* Copyright (C) 1997, 1999, 2000, 2008, 2009, 2011
-   Free Software Foundation, Inc.
+/* Copyright (C) 1997-2012 Free Software Foundation, Inc.
Contributed by Mike Stump .
 
 This file is part of GCC.
@@ -111,6 +110,12 @@ __gthread_recursive_mutex_unlock (__gthread_recursive_mutex_t *mutex)
   return __gthread_mutex_unlock (mutex);
 }
 
+static inline int
+__gthread_recursive_mutex_destroy (__gthread_recursive_mutex_t *__mutex)
+{
+  return __gthread_mutex_destroy (__mutex);
+}
+
 /* pthread_once is complicated enough that it's implemented
out-of-line.  See config/vxl

profitable_hard_regs vs. PR 48435

2012-09-30 Thread Richard Sandiford
This is another patch needed for the MIPS MD_REGS change described here:

http://gcc.gnu.org/ml/gcc-patches/2012-09/msg01992.html

The profitable_hard_regs set used during IRA colouring used to be restricted
to registers that are valid for the allocno's mode.  That caused problems
for multi-register modes that can only start with an even register (say),
because profitable_hard_regs would only include the even start registers,
not the pairing odd registers.  Vlad fixed it with:

2011-04-08  Vladimir Makarov  

PR inline-asm/48435
* ira-color.c (setup_profitable_hard_regs): Add comments.
Don't take prohibited hard regs into account.
(setup_conflict_profitable_regs): Rename to
get_conflict_profitable_regs.
(check_hard_reg_p): Check prohibited hard regs.

However, one effect of that change is that if register R belongs to class CL
but can never be used anywhere in a register of mode M, it will still be
included in profitable_hard_regs.  That's the case with MD_REGS and
register HI on MIPS.

The patch below is a half-way house between the original behaviour
and the post-48435 one.  It restricts profitable_hard_regs to registers
that can be used for the allocno's mode, but doesn't restrict it to
starting registers.
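
On the ira-color.c side the change then boils down to seeding the set from
the new array; roughly (variable names assumed from setup_profitable_hard_regs,
not the exact committed hunk):

  /* Sketch only: start profitable_hard_regs for an allocno A of class
     ACLASS and mode MODE from the registers that can appear anywhere
     in a MODE register of that class, rather than from all allocatable
     registers of the class.  */
  COPY_HARD_REG_SET (ALLOCNO_COLOR_DATA (a)->profitable_hard_regs,
                     ira_useful_class_mode_regs[aclass][mode]);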

Most of the ira.c change is reindentation, so I've included a -b
diff as well.

As with the patch linked above, I checked that this produced no
difference in assembly output for a set of x86_64 gcc .ii files
(tested with -O2 -march=native on gcc20).  Also tested on
x86_64-linux-gnu (including -m32) and mipsisa64-elf.  OK to install?

Richard


gcc/
* ira-int.h (target_ira_int): Add x_ira_useful_class_mode_regs.
(ira_useful_class_mode_regs): New macro.
* ira.c (clarify_prohibited_class_mode_regs): Set up
ira_useful_class_mode_regs.
* ira-color.c (setup_profitable_hard_regs): Use it to initialise
profitable_hard_regs.

Index: gcc/ira-int.h
===
--- gcc/ira-int.h   2012-09-30 18:59:09.463447170 +0100
+++ gcc/ira-int.h   2012-09-30 19:21:59.395407339 +0100
@@ -816,6 +816,20 @@ struct target_ira_int {
  values for given mode are zero.  */
  HARD_REG_SET x_ira_prohibited_class_mode_regs[N_REG_CLASSES][NUM_MACHINE_MODES];
 
+  /* Index [CL][M] contains R if R appears somewhere in a register of the form:
+
+ (reg:M R'), R' not in x_ira_prohibited_class_mode_regs[CL][M]
+
+ For example, if:
+
+ - (reg:M 2) is valid and occupies two registers;
+ - register 2 belongs to CL; and
+ - register 3 belongs to the same pressure class as CL
+
+ then (reg:M 2) contributes to [CL][M] and registers 2 and 3 will be
+ in the set.  */
+  HARD_REG_SET x_ira_useful_class_mode_regs[N_REG_CLASSES][NUM_MACHINE_MODES];
+
   /* The value is number of elements in the subsequent array.  */
   int x_ira_important_classes_num;
 
@@ -902,6 +916,8 @@ #define ira_class_hard_reg_index \
   (this_target_ira_int->x_ira_class_hard_reg_index)
 #define ira_prohibited_class_mode_regs \
   (this_target_ira_int->x_ira_prohibited_class_mode_regs)
+#define ira_useful_class_mode_regs \
+  (this_target_ira_int->x_ira_useful_class_mode_regs)
 #define ira_important_classes_num \
   (this_target_ira_int->x_ira_important_classes_num)
 #define ira_important_classes \
Index: gcc/ira.c
===
--- gcc/ira.c   2012-09-30 19:20:32.555409864 +0100
+++ gcc/ira.c   2012-09-30 19:21:59.396407339 +0100
@@ -1495,29 +1495,36 @@ clarify_prohibited_class_mode_regs (void
 
   for (cl = (int) N_REG_CLASSES - 1; cl >= 0; cl--)
 for (j = 0; j < NUM_MACHINE_MODES; j++)
-  for (k = ira_class_hard_regs_num[cl] - 1; k >= 0; k--)
-   {
- hard_regno = ira_class_hard_regs[cl][k];
- if (TEST_HARD_REG_BIT (ira_prohibited_class_mode_regs[cl][j], hard_regno))
-   continue;
- nregs = hard_regno_nregs[hard_regno][j];
-  if (hard_regno + nregs > FIRST_PSEUDO_REGISTER)
-{
-  SET_HARD_REG_BIT (ira_prohibited_class_mode_regs[cl][j],
-hard_regno);
-   continue;
-}
- pclass = ira_pressure_class_translate[REGNO_REG_CLASS (hard_regno)];
- for (nregs-- ;nregs >= 0; nregs--)
-   if (((enum reg_class) pclass
-!= ira_pressure_class_translate[REGNO_REG_CLASS
-(hard_regno + nregs)]))
+  {
+   CLEAR_HARD_REG_SET (ira_useful_class_mode_regs[cl][j]);
+   for (k = ira_class_hard_regs_num[cl] - 1; k >= 0; k--)
+ {
+   hard_regno = ira_class_hard_regs[cl][k];
+   if (TEST_HARD_REG_BIT (ira_prohibited_class_mode_regs[cl][j], hard_regno))
+ continue;
+   nregs = hard_regno_nregs[hard_regno][j];
+   if (hard_regno + nregs > FIRST_PSEUDO_REGISTER)
  {
 

Re: [Patch,avr]: Ad PR rtl-optimization/52543: Undo the MEM->UNSPEC hack

2012-09-30 Thread Georg-Johann Lay

Denis Chertykov wrote:

Georg-Johann wrote:

Denis Chertykov wrote:

Georg-Johann Lay wrote:


PR52543 required representing a load from the non-generic address spaces as
UNSPEC instead of as MEM to avoid gross code bloat.

http://gcc.gnu.org/PR52543

lower-subreg's cost model is still broken: it assumes that any load from
MEM is from the generic address space and does not account for address
spaces in its cost model.

This patch undoes the changes from SVN r185605

http://gcc.gnu.org/viewcvs?view=revision&revision=185605

and installs a different but less intrusive hack around PR52543:

targetm.mode_dependent_address_p has an address space parameter so that
the
backend can pretend all non-generic addresses are mode-dependent.

This keeps lower-subreg.c from splitting the loads, so it is possible to
represent the loads as MEM and there is no longer a need to represent them
as UNSPECs.
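
A minimal sketch of that hack (not the exact avr.c code): the hook simply
claims that every address in a non-generic address space is mode-dependent,
so lower-subreg.c leaves such MEMs alone.

  static bool
  avr_mode_dependent_address_p (const_rtx addr ATTRIBUTE_UNUSED,
                                addr_space_t as)
  {
    return !ADDR_SPACE_GENERIC_P (as);
  }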

This patch is still not an optimal solution but the code is much closer
to a
clean solution now.

Ok for trunk?


You can apply it.

Denis.


I also applied the following change:

http://gcc.gnu.org/viewcvs?view=revision&revision=191825

* config/avr/avr.md (adjust_len): Add lpm.
(reload_in): Use avr_out_lpm for output.  Use "lpm" for
adjust_len.
* config/avr/avr-protos.h (avr_out_lpm): New prototype.
* config/avr/avr.c (avr_out_lpm): Make global.
(adjust_insn_length): Handle ADJUST_LEN_LPM.

The reload_in insns used the wrong output functions.

Notice that this change is just a cosmetic change because the secondary
reloads for the non-generic spaces are ignored.  That is:  despite
avr_secondary_reload, REG <- MEM input reloads are not mapped to their
secondary reload insn and the mov insn for that load is used.

This leads to a situation where the insn output function is not supplied
with the needed clobber register, thus the avr_find_unused_d_reg function is
needed to work around that.


What would happen if there is no unused d-reg?


In that case the output function backs up a d-reg in tmp_reg to get one
d-register free, similar to the situation in other output functions that
cook up a scratch; avr.c:output_reload_in_const is one example.


It is the else case in avr.c:avr_out_lpm(), %5 stands for tmp_reg:

  xop[4] = GEN_INT (segment);
  xop[3] = avr_find_unused_d_reg (insn, lpm_addr_reg_rtx);

  if (xop[3] != NULL_RTX)
{
  avr_asm_len ("ldi %3,%4" CR_TAB
   "out %i6,%3", xop, plen, 2);
}
  else if (segment == 1)
{
  avr_asm_len ("clr %5" CR_TAB
   "inc %5" CR_TAB
   "out %i6,%5", xop, plen, 3);
}
  else
{
  avr_asm_len ("mov %5,%2"   CR_TAB
   "ldi %2,%4"   CR_TAB
   "out %i6,%2"  CR_TAB
   "mov %2,%5", xop, plen, 4);
}


Denis, do you know why the secondary reloads requested by
avr_secondary_reload are ignored?  I see calls to this hook and sri->icode
is set to the right insn code but ignored afterwards.

The only calls to that hook with the right operands are from ira cost
computation.


I tried to use secondary reloads a few years ago (maybe 5 or 7).
I definitely remember only one thing: secondary reload should be
avoided as long as possible.


Currently each mov has to be decorated with moving the segment to RAMPZ 
and (depending on target) restoring RAMPZ afterwards.


GCC has no concept of a segmented layout and there is no way to describe 
that.


One way is to hack with UNSPEC and bypass ira/reload altogether, but IMO
that is no good solution.  Besides, that only works because the mov
insns have special constraints (there will be no writes to flash, flash
does not change after load time, etc.)



The better way to get knowledge about it is GDB ;-)


I think reload.c:push_secondary_reload() should be the right place, but
it does not call targetm.secondary_reload, so no secondary is generated.


It's hard to tell which place is responsible for bypassing the call to
the hook.



From the internals I don't see why it is skipped and the responsiveness in
the gcc-help@ list on such topics is zero :-(


IMHO  it's a question to gcc@ not to gcc-help@


Ok, I will try my luck again.

Do you have an idea for a better approach, i.e. not setting RAMPZ over and
over again?


One idea is the mode_switching pass, but I did not check whether it is worth the effort.


Re: [patch, mips] Patch for new mips triplet - mips-mti-elf

2012-09-30 Thread Richard Sandiford
"Steve Ellcey "  writes:
> diff --git a/gcc/testsuite/gcc.target/mips/pr37362.c 
> b/gcc/testsuite/gcc.target/mips/pr37362.c
> index a378366..da34b9d 100644
> --- a/gcc/testsuite/gcc.target/mips/pr37362.c
> +++ b/gcc/testsuite/gcc.target/mips/pr37362.c
> @@ -1,5 +1,5 @@
>  /* mips*-sde-elf doesn't have 128-bit long doubles.  */
> -/* { dg-do compile { target { ! mips*-sde-elf } } } */
> +/* { dg-do compile { target { ! mips*-sde-elf mips*-mti-elf } } } */
>  /* { dg-options "-march=mips64r2 -mabi=n32" } */
>  
>  typedef float TFtype __attribute__((mode(TF)));

Sorry for only noticing now, but this produced:

ERROR: gcc.target/mips/pr37362.c  -O0 : syntax error in target selector "target 
! mips*-sde-elf mips*-mti-elf" for " dg-do 2 compile { target { ! mips*-sde-elf 
mips*-mti-elf } } "
...

We need another set of braces.  Tested on mipsisa64-elf and applied.

Richard


gcc/testsuite/
* gcc.target/mips/pr37362.c: Fix target selector.

Index: gcc/testsuite/gcc.target/mips/pr37362.c
===
--- gcc/testsuite/gcc.target/mips/pr37362.c 2012-09-30 19:46:09.0 +0100
+++ gcc/testsuite/gcc.target/mips/pr37362.c 2012-09-30 19:49:05.146360070 +0100
@@ -1,5 +1,5 @@
 /* mips*-sde-elf doesn't have 128-bit long doubles.  */
-/* { dg-do compile { target { ! mips*-sde-elf mips*-mti-elf } } } */
+/* { dg-do compile { target { ! { mips*-sde-elf mips*-mti-elf } } } } */
 /* { dg-options "-march=mips64r2 -mabi=n32" } */
 
 typedef float TFtype __attribute__((mode(TF)));


[PATCH] Fix PR middle-end/54759

2012-09-30 Thread Dehao Chen
Hi,

This patch fixes the bug when comparing location to UNKNOWN_LOC.

Bootstrapped and passed gcc regression test.

Okay for trunk?

Thanks,
Dehao

2012-09-30  Dehao Chen  

PR middle-end/54759
* gcc/tree-vect-loop-manip.c (slpeel_make_loop_iterate_ntimes): Use
LOCATION_LOCUS to compare with UNKNOWN_LOCATION.
(slpeel_tree_peel_loop_to_edge): Likewise.
* gcc/tree-vectorizer.c (vectorize_loops): Likewise.
Index: gcc/tree-vect-loop-manip.c
===
--- gcc/tree-vect-loop-manip.c  (revision 191876)
+++ gcc/tree-vect-loop-manip.c  (working copy)
@@ -793,7 +793,7 @@ slpeel_make_loop_iterate_ntimes (struct loop *loop
   loop_loc = find_loop_location (loop);
   if (dump_file && (dump_flags & TDF_DETAILS))
 {
-  if (loop_loc != UNKNOWN_LOC)
+  if (LOCATION_LOCUS (loop_loc) != UNKNOWN_LOC)
 fprintf (dump_file, "\nloop at %s:%d: ",
  LOC_FILE (loop_loc), LOC_LINE (loop_loc));
   print_gimple_stmt (dump_file, cond_stmt, 0, TDF_SLIM);
@@ -1248,7 +1248,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop,
   loop_loc = find_loop_location (loop);
   if (dump_file && (dump_flags & TDF_DETAILS))
 {
-  if (loop_loc != UNKNOWN_LOC)
+  if (LOCATION_LOCUS (loop_loc) != UNKNOWN_LOC)
 fprintf (dump_file, "\n%s:%d: note: ",
  LOC_FILE (loop_loc), LOC_LINE (loop_loc));
   fprintf (dump_file, "tree_duplicate_loop_to_edge_cfg failed.\n");
Index: gcc/tree-vectorizer.c
===
--- gcc/tree-vectorizer.c   (revision 191876)
+++ gcc/tree-vectorizer.c   (working copy)
@@ -192,7 +192,7 @@ vectorize_loops (void)
loop_vec_info loop_vinfo;
 
vect_location = find_loop_location (loop);
-if (vect_location != UNKNOWN_LOC
+if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOC
 && vect_verbosity_level > REPORT_NONE)
  fprintf (vect_dump, "\nAnalyzing loop at %s:%d\n",
 LOC_FILE (vect_location), LOC_LINE (vect_location));
@@ -203,7 +203,7 @@ vectorize_loops (void)
if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
  continue;
 
-if (vect_location != UNKNOWN_LOC
+if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOC
 && vect_verbosity_level > REPORT_NONE)
   fprintf (vect_dump, "\n\nVectorizing loop at %s:%d\n",
 LOC_FILE (vect_location), LOC_LINE (vect_location));


[PATCH, i386]: Fix spurious testsuite failure in gcc.target/i386/pad-10.c

2012-09-30 Thread Uros Bizjak
Hello!

Recently, gcc became smart enough to merge:

leal(%rdi), %eax
addl%esi, %eax

into

leal(%rsi,%rdi), %eax

The generated sequence (without ret) becomes shorter than 4 instructions:

cmpl$1, %esi
leal(%rsi,%rdi), %eax
je  .L8
nop
nop
ret

and nop insertion for ATOM was triggered.

The following patch changes the arithmetic to again generate the correct
number of instructions, without compromising the intention of the test.

leal(%rsi), %eax
subl%edi, %eax
cmpl$1, %esi
je  .L8
ret

2012-09-30  Uros Bizjak  

* gcc.target/i386/pad-10.c (foo2): Return x - z.

Tested on x86_64-pc-linux-gnu, committed to mainline SVN.

Uros.

Index: gcc.target/i386/pad-10.c
===
--- gcc.target/i386/pad-10.c(revision 191866)
+++ gcc.target/i386/pad-10.c(working copy)
@@ -15,5 +15,5 @@
   return z;
 }
   else
-return x + z;
+return x - z;
 }


Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Steven Bosscher
On Sun, Sep 30, 2012 at 7:03 PM, Richard Guenther
 wrote:
> On Sun, Sep 30, 2012 at 6:52 PM, Steven Bosscher  
> wrote:
>> Hi,
>>
>>
>> To look at it in yet another way:
>>
>>>  integrated RA   : 189.34 (16%) usr
>>>  LRA non-specific:  59.82 ( 5%) usr
>>>  LRA virtuals eliminatenon:  56.79 ( 5%) usr
>>>  LRA create live ranges  : 175.30 (15%) usr
>>>  LRA hard reg assignment : 130.85 (11%) usr
>>
>> The IRA pass is slower than the next-slowest pass (tree PRA) by almost
>> a factor 2.5.  Each of the individually-measured *phases* of LRA is
>> slower than the complete IRA *pass*. These 5 timevars together make up
>> for 52% of all compile time.
>
> That figure indeed makes IRA + LRA look bad.  Did you by chance identify
> anything obvious that can be done to improve the situation?

Not really. It was what I was looking/hoping for with the multiple
timevars, but no cheese.

Ciao!
Steven


Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Vladimir Makarov

On 12-09-30 1:03 PM, Richard Guenther wrote:

On Sun, Sep 30, 2012 at 6:52 PM, Steven Bosscher  wrote:

Hi,


To look at it in yet another way:


  integrated RA   : 189.34 (16%) usr
  LRA non-specific:  59.82 ( 5%) usr
  LRA virtuals eliminatenon:  56.79 ( 5%) usr
  LRA create live ranges  : 175.30 (15%) usr
  LRA hard reg assignment : 130.85 (11%) usr

The IRA pass is slower than the next-slowest pass (tree PRA) by almost
a factor 2.5.  Each of the individually-measured *phases* of LRA is
slower than the complete IRA *pass*. These 5 timevars together make up
for 52% of all compile time.

That figure indeed makes IRA + LRA look bad.  Did you by chance identify
anything obvious that can be done to improve the situation?


  As I wrote, I don't see that LRA has a problem right now because even
on an 8GB machine, GCC with LRA is 10% faster than GCC with reload from a
real-time point of view (not to mention that LRA generates 15% smaller
code).  And real time is what really matters for users.


  But I think that the LRA cpu time problem for this test can be fixed.
But I don't think I can fix it within 2 weeks.  So if people believe that
the current LRA behaviour on this PR is a stopper to including it into
gcc4.8, then we should postpone its inclusion until gcc4.9, when I hope to
fix it.


  As for IRA, IRA uses the Chaitin-Briggs algorithm, which scales worse than
most other optimizations.  So the bigger the test, the bigger the share of
IRA in compilation time.  I don't believe that somebody can achieve better
code using other, faster RA algorithms.  LLVM has no such problem because
even their new RA (a big improvement for llvm3.0) is not based on the CB
algorithm.  It is still based on a modification of linear-scan RA.  It
would be interesting to check how other compilers behave on this test.
Particularly the Intel one is most interesting (but I have doubts that it
will do well because I saw programs where the Intel compiler with
optimizations struggled for more than 40 mins on a file compilation).


  Still, we can improve IRA behaviour with simple solutions like using a
fast algorithm (currently used for -O0) for huge functions, or by
implementing division of a function into smaller regions (but that is hard
to implement and it will not work well for tests where most pseudos have
very long live ranges).  I will work on it when I am less busy.


  About 14 years ago, in my Cygnus time, I worked on some problem from a
customer.  GCC was not able to compile a big program at that time.
Fixing GCC would have required a lot of effort.  Finally, the customer
modified its code to generate smaller functions and the problem was gone.  I
mean that we could spend a lot of effort fixing corner cases while ignoring
improvements for the majority of users.  But it seems to me that I'll have
to work on these PRs.




Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Vladimir Makarov

On 12-09-30 4:42 PM, Steven Bosscher wrote:

On Sun, Sep 30, 2012 at 7:03 PM, Richard Guenther
 wrote:

On Sun, Sep 30, 2012 at 6:52 PM, Steven Bosscher  wrote:

Hi,


To look at it in yet another way:


  integrated RA   : 189.34 (16%) usr
  LRA non-specific:  59.82 ( 5%) usr
  LRA virtuals eliminatenon:  56.79 ( 5%) usr
  LRA create live ranges  : 175.30 (15%) usr
  LRA hard reg assignment : 130.85 (11%) usr

The IRA pass is slower than the next-slowest pass (tree PRA) by almost
a factor 2.5.  Each of the individually-measured *phases* of LRA is
slower than the complete IRA *pass*. These 5 timevars together make up
for 52% of all compile time.

That figure indeed makes IRA + LRA look bad.  Did you by chance identify
anything obvious that can be done to improve the situation?

Not really. It was what I was looking/hoping for with the multiple
timevars, but no cheese.

I spent a lot of time speeding up the LRA code, so I don't think there is a
simple solution.  The problem can be solved by using simpler
algorithms, which results in generation of worse code.  It is not one
week of work, even for me.




Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Steven Bosscher
On Mon, Oct 1, 2012 at 12:44 AM, Vladimir Makarov  wrote:
>   Actually, I don't see there is a problem with LRA right now.  I think we
> should first solve the whole-compiler memory footprint problem for this
> test because cpu utilization is very small for this test.  On my machine
> with 8GB, the maximal resident space reaches almost 8GB.

Sure. But note that up to IRA, the max. resident memory size of the
test case is "only" 3.6 GB. IRA/reload allocate more than 4GB,
doubling the footprint. If you want to solve that first, that'd be
great of course...

Ciao!
Steven


Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Steven Bosscher
On Mon, Oct 1, 2012 at 12:50 AM, Vladimir Makarov  wrote:
>   As I wrote, I don't see that LRA has a problem right now because even on
> an 8GB machine, GCC with LRA is 10% faster than GCC with reload from a
> real-time point of view (not to mention that LRA generates 15% smaller
> code).  And real time is what really matters for users.

For me, those compile times I reported *are* real times.

But you are right that the test case is a bit extreme. Before GCC 4.8
other parts of the compiler also choked on it. Still, the test case
comes from real user's code (combination of Eigen library with MPFR),
and it shows scalability problems in LRA (and IRA) that one can't just
"explain away" with an "RA is just expensive" claim. The test case for
PR26854 is Brad Lucier's Scheme interpreter, that is also real user's
code.

FWIW, I had actually expected IRA to do extremely well on this test case
because IRA is supposed to be a regional allocator and I had expected
that would help for scalability. But most of the region data
structures in IRA are designed to hold whole functions (e.g. several
per-region arrays of size max_reg_num / max_insn_uid / ...) and that
appears to be a problem for IRA's memory footprint. Perhaps something
similar is going on with LRA?

Ciao!
Steven


Re: [SH] PR 50457 - Add additional atomic models

2012-09-30 Thread Kaz Kojima
Oleg Endo  wrote:
> This implements the changes as proposed PR, albeit with some small
> differences:

> --- gcc/config/sh/sh.c(revision 191865)
> +++ gcc/config/sh/sh.c(working copy)
[snip]
> +  std::vector<std::string> tokens;
> +  for (std::stringstream ss (str); ss.good (); )
> +  {
> +tokens.push_back (std::string ());
> +std::getline (ss, tokens.back (), ',');
> +  }

Can we use C++ in .c files already?  I couldn't find other examples
in the current gcc.

Regards,
kaz


Re: [SH] PR 50457 - Add additional atomic models

2012-09-30 Thread Oleg Endo
On Mon, 2012-10-01 at 08:38 +0900, Kaz Kojima wrote:
> Oleg Endo  wrote:
> > This implements the changes as proposed PR, albeit with some small
> > differences:
> 
> > --- gcc/config/sh/sh.c  (revision 191865)
> > +++ gcc/config/sh/sh.c  (working copy)
> [snip]
> > +  std::vector<std::string> tokens;
> > +  for (std::stringstream ss (str); ss.good (); )
> > +  {
> > +tokens.push_back (std::string ());
> > +std::getline (ss, tokens.back (), ',');
> > +  }
> 
> Can we use C++ in .c files already?  I couldn't find other examples
> in the current gcc.
> 

The existing .c files are compiled as C++ already.  There was a
discussion not long ago whether the .c files should be renamed to .cc or
not.  If I remember correctly, the conclusion was that existing .c files
remain .c, while newly added files should be .cc.  gcc/double-int.c
would probably be one of the recent examples.

Cheers,
Oleg




Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Vladimir Makarov

On 12-09-30 7:15 PM, Steven Bosscher wrote:

On Mon, Oct 1, 2012 at 12:50 AM, Vladimir Makarov  wrote:

   As I wrote, I don't see that LRA has a problem right now because even on
an 8GB machine, GCC with LRA is 10% faster than GCC with reload from a
real-time point of view (not to mention that LRA generates 15% smaller code).
And real time is what really matters for users.

For me, those compile times I reported *are* real times.
Sorry, I missed your data (it was buried in the calculations of percentages
from my data).  I saw that on my machine maxrss was 8GB with a lot of
page faults and small cpu utilization (about 30%).  I guess you used a
16GB machine and 16GB is enough for this test.  Ok, I'll work on this
problem, although I think it will take some time to solve it or make it
more tolerable.  Still, I think it is not right to pay attention only
to compilation time.  See my reasons below.

But you are right that the test case is a bit extreme. Before GCC 4.8
other parts of the compiler also choked on it. Still, the test case
comes from real user's code (combination of Eigen library with MPFR),
and it shows scalability problems in LRA (and IRA) that one can't just
"explain away" with an "RA is just expensive" claim. The test case for
PR26854 is Brad Lucier's Scheme interpreter, that is also real user's
code.


   I myself wrote a few interpreters, so I looked at the code of the Scheme
interpreter.


   It seems to me it is computer-generated code.  So the first
solution would be to generate a few functions instead of one.  Generating a
huge function is not wise for performance-critical applications because
compilers in such corner cases use simpler, faster optimization algorithms
that generate worse code.  By the way, I could solve the
compilation time problem by using simpler algorithms and harming
performance.  The author would be happy with the compilation speed but would
be disappointed by, say, a 10% slower interpreter.  I don't think that is a
solution to the problem; it is creating a bigger problem.  It seems to me I
have to do this :)  Or if I tell him that by waiting 40% more time he can
get 15% smaller code, I guess he would prefer this.  Of course it is an
interesting problem to speed up the compiler, but we don't look at the whole
picture when we solve compilation time by hurting performance.


  The scalability problem is a problem of computer-generated programs, and
usually there is a simpler and better solution for it: generating
smaller functions.


  By the way, I also found that the author uses label values (computed
gotos).  It is not the best solution, although there are still a lot of
articles recommending it.  One switch is faster on modern computers.  Anton
Ertl proposed using several switches (one switch after each interpreter
insn) for better branch prediction, but I found this works worse than the
one-switch solution, at least for my interpreters.
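
For readers unfamiliar with the two dispatch styles being compared, here is
a toy sketch (a hypothetical three-opcode interpreter, nothing to do with
the Scheme code from the PR) of one-switch dispatch versus label-values
(computed goto) dispatch:

  #include <stdio.h>

  enum { OP_INC, OP_DEC, OP_HALT };

  static int
  run_switch (const unsigned char *pc)
  {
    int acc = 0;
    for (;;)
      switch (*pc++)            /* one central switch per dispatched insn */
        {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
  }

  static int
  run_labels (const unsigned char *pc)
  {
    /* GNU C label values: one computed goto after every insn.  */
    static void *table[] = { &&inc, &&dec, &&halt };
    int acc = 0;
    goto *table[*pc++];
   inc:  acc++; goto *table[*pc++];
   dec:  acc--; goto *table[*pc++];
   halt: return acc;
  }

  int
  main (void)
  {
    static const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf ("%d %d\n", run_switch (prog), run_labels (prog));
    return 0;
  }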





Re: [SH] PR 50457 - Add additional atomic models

2012-09-30 Thread Kaz Kojima
Oleg Endo  wrote:
> The existing .c files are compiled as C++ already.  There was a
> discussion not long go whether the .c files should be renamed to .cc or
> not.  If I remember correctly, the conclusion was that existing .c files
> remain .c, while files newly added should be .cc.  gcc/double-int.c
> would probably one of the recent examples.

Ah, I've missed that argument.  Then can we use STL classes in .c now?

Regards,
kaz


Re: [SH] PR 50457 - Add additional atomic models

2012-09-30 Thread Gabriel Dos Reis
On Sun, Sep 30, 2012 at 7:23 PM, Kaz Kojima  wrote:
> Oleg Endo  wrote:
>> The existing .c files are compiled as C++ already.  There was a
>> discussion not long go whether the .c files should be renamed to .cc or
>> not.  If I remember correctly, the conclusion was that existing .c files
>> remain .c, while files newly added should be .cc.  gcc/double-int.c
>> would probably one of the recent examples.
>
> Ah, I've missed that argument.  Then can we use STL classes in .c now?

Yes.

The existing file extensions ".c" do not mean "C" only.

-- Gaby


Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Vladimir Makarov

On 12-09-28 1:48 PM, Andi Kleen wrote:

Steven Bosscher  writes:


On Fri, Sep 28, 2012 at 12:56 AM, Vladimir Makarov  wrote:

   Any comments and proposals are appreciated.  Even if GCC community
decides that it is too late to submit it to gcc4.8, the earlier reviews
are always useful.

I would like to see some benchmark numbers, both for code quality and
compile time impact for the most notorious compile time hog PRs for
large routines where IRA performs poorly (e.g. PR54146, PR26854).

I would be interested in some numbers how much the new XMM spilling
helps on x86 and how it affects code size.

I have some results which I got after implementation of spilling into 
SSE regs:


Average code size change:   Corei7   Bulldozer
SPECInt 32-bit              -0.15%   -0.14%
SPECFP  32-bit              -0.36%   -0.24%
SPECInt 64-bit              -0.03%   -0.07%
SPECFP  64-bit              -0.11%   -0.09%

Rate change:                Corei7   Bulldozer
SPECInt 32-bit              +0.6%    -1.2%
SPECFP  32-bit              +0.3%     0%
SPECInt 64-bit               0%       0%
SPECFP  64-bit               0%       0%

  I used -O3 -mtune=corei7 -march=corei7 for Corei7 and -O3
-mtune=bdver1 -march=bdver1 for the Bulldozer processor.  Additionally, I
enabled inter-unit moves for Bulldozer when the optimization works,
because without this, spilling general regs into SSE regs is not
possible.


Re: [SH] PR 50457 - Add additional atomic models

2012-09-30 Thread Kaz Kojima
Gabriel Dos Reis  wrote:
>> Ah, I've missed that argument.  Then can we use STL classes in .c now?
> 
> Yes.
> 
> The existing file extensions ".c" do not mean "C" only.

Thanks for clarification.

Oleg, the patch is OK.

Regards,
kaz


Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Jakub Jelinek
On Sun, Sep 30, 2012 at 06:50:50PM -0400, Vladimir Makarov wrote:
>   But I think that the LRA cpu time problem for this test can be fixed.
> But I don't think I can fix it within 2 weeks.  So if people believe
> that the current LRA behaviour on this PR is a stopper to including it
> into gcc4.8, then we should postpone its inclusion until gcc4.9, when
> I hope to fix it.

I think this testcase shouldn't be a show stopper for LRA inclusion into
4.8, but something to look at for stage3.

I think a lot of GCC passes have scalability issues on that testcase,
that is why it must be compiled with -O1 and not higher optimization
options, so perhaps it would be enough to choose a faster algorithm
generating worse code for the huge functions and -O1.

And I agree it is primarily a bug in the generator that it creates such huge
functions, which can't perform very well.

Jakub


Re: [PATCH] Add option for dumping to stderr (issue6190057)

2012-09-30 Thread Sharad Singhai
Resend to gcc-patches

I have addressed the comments by fixing all the minor issues,
bootstrapped and tested on x86_64. I did the recommended reshuffling
by moving non-tree code from tree-dump.c into a new file dumpfile.c.

I committed two successive revisions
r191883 Main patch with the dump infrastructure changes. However, I
accidentally left out a new file, dumpfile.c.
r191884 Added dumpfile.c, and did the renaming of dump_* functions
from gimple-pretty-print.[ch].

As things stand right now, r191883 is broken because of the missing
file 'dumpfile.c', which the very next commit fixes. Anyone who got the
broken revision r191883, please svn update. I am really very sorry
about that.

I have a couple more minor patches which deal with renaming; I plan to
address those later.

Thanks,
Sharad

> On Thu, Sep 27, 2012 at 9:10 AM, Xinliang David Li 
> wrote:
>>
>> On Thu, Sep 27, 2012 at 4:35 AM, Sharad Singhai 
>> wrote:
>> > Thanks for the review. A couple of comments inline:
>> >
>> >> Some minor issues:
>> >>
>> >> * c/c-decl.c (c_write_global_declarations): Use different
>> >> method to
>> >> determine if the dump has ben initialized.
>> >> * cp/decl2.c (cp_write_global_declarations): Ditto.
>> >> * testsuite/gcc.target/i386/vect-double-1.c: Fix test.
>> >>
>> >> these subdirs all have their separate ChangeLog entry from where the
>> >> directory name is omitted.
>> >>
>> >> Index: tree-dump.c
>> >> ===
>> >> --- tree-dump.c (revision 191490)
>> >> +++ tree-dump.c (working copy)
>> >> @@ -24,9 +24,11 @@ along with GCC; see the file COPYING3.  If not see
>> >>  #include "coretypes.h"
>> >>  #include "tm.h"
>> >>  #include "tree.h"
>> >> +#include "gimple-pretty-print.h"
>> >>  #include "splay-tree.h"
>> >>  #include "filenames.h"
>> >>  #include "diagnostic-core.h"
>> >> +#include "rtl.h"
>> >>
>> >> what do you need gimple-pretty-print.h and rtl.h for?
>> >>
>> >> +
>> >> +extern void dump_bb (FILE *, basic_block, int, int);
>> >> +
>> >>
>> >> that should be declared in some header
>> >>
>> >> +/* Dump gimple statement GS with SPC indentation spaces and
>> >> +   EXTRA_DUMP_FLAGS on the dump streams if DUMP_KIND is enabled.  */
>> >> +
>> >> +void
>> >> +dump_gimple_stmt (int dump_kind, int extra_dump_flags, gimple gs, int
>> >> spc)
>> >> +{
>> >>
>> >> the gimple stuff really belongs in to gimple-pretty-print.c
>> >
>> > This dump_gimple_stmt () is just a dispatcher, which uses internal
>> > data structure such as dump streams/flags. If I move it into
>> > gimple-pretty-print.c, then I would have to export those
>> > streams/flags. I was hoping to avoid it by keeping all dump_* ()
>> > methods together in dumpfile.c (earlier in tree-dump.c). Thus, later
>> > one could just make dump_file/dump_flags static when all the passes
>> > have converted to this scheme.
>> >
>>
>> You can make the flags/streams global but only expose them via inline
>> accessors in the header file.
>>
>> David
>>
>> >>
>> >> (parts of tree-dump.c should be moved to a new file dumpfile.c)
>> >>
>> >> +/* Dump tree T using EXTRA_DUMP_FLAGS on dump streams if DUMP_KIND is
>> >> +   enabled.  */
>> >> +
>> >> +void
>> >> +dump_generic_expr (int dump_kind, int extra_dump_flags, tree t)
>> >> +{
>> >>
>> >> belongs to tree-pretty-print.c (to where the routines are it calls)
>> >
>> > This is again a dispatcher for dump_generic_expr () which writes to
>> > the appropriate stream depending upon dump_kind.
>> >
>> >>
>> >> +int
>> >> +dump_start (int phase, int *flag_ptr)
>> >> +{
>> >>
>> >> perfect candidate for dumpfile.c
>> >>
>> >> You can do this re-shuffling as followup, but please try to not include
>> >> rtl.h
>> >> or gimple-pretty-print.h from tree-dump.c.  Thus re-shuffling required
>> >> by that
>> >> do now.  tree-dump.c should only know about dumping 'tree'.
>> >
>> > Okay, I have moved relevant methods into dumpfile.c.
>> >
>> >>
>> >> Index: tree-dump.h
>> >> ===
>> >> --- tree-dump.h (revision 191490)
>> >> +++ tree-dump.h (working copy)
>> >> @@ -22,6 +22,7 @@ along with GCC; see the file COPYING3.  If not see
>> >>  #ifndef GCC_TREE_DUMP_H
>> >>  #define GCC_TREE_DUMP_H
>> >>
>> >> +#include "input.h"
>> >>
>> >> probably no longer required.
>> >>
>> >> Index: dumpfile.h
>> >> ===
>> >> --- dumpfile.h  (revision 191490)
>> >> +++ dumpfile.h  (working copy)
>> >> @@ -22,6 +22,9 @@ along with GCC; see the file COPYING3.  If not see
>> >>  #ifndef GCC_DUMPFILE_H
>> >>  #define GCC_DUMPFILE_H 1
>> >>
>> >> +#include "coretypes.h"
>> >> +#include "input.h"
>> >>
>> >> likewise for input.h.
>> >>
>> >> Index: testsuite/gcc.target/i386/vect-double-1.c
>> >> ===
>> >> --- testsuite/gcc.target/i386/vect-double-1.c   (revision 191490)
>> >> +++ tests