Re: [PATCH v8] PR middle-end/60281

2014-04-18 Thread Jakub Jelinek
On Fri, Apr 18, 2014 at 12:21:50PM +0800, lin zuojian wrote:
> Here is the patch after Jakub's review; Jakub helped with the
> coding style.

> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,12 @@
> +2014-04-18  Lin Zuojian  
> +   PR middle-end/60281

Extra line missing before the PR line.
Otherwise this is ok for trunk and for 4.9.1 after a while on the trunk (a
week or two).

Thanks.

Jakub


Re: [C PATCH] Warn if switch has boolean value (PR c/60439)

2014-04-18 Thread Jakub Jelinek
On Fri, Apr 18, 2014 at 07:49:22AM +0200, Marc Glisse wrote:
> On Fri, 18 Apr 2014, Marek Polacek wrote:
> 
> >This patch implements a new warning that warns when controlling
> >expression of a switch has boolean value.  (Intentionally I don't
> >warn if the controlling expression is (un)signed:1 bit-field.)
> >I guess the question is if this should be enabled by default or
> >deserves some new warning option.  Since clang does the former,
> >I did it too and currently this warning is enabled by default.
> 
> It can be enabled by -Wsome-name which is itself enabled by default but
> at least gives the possibility to use -Wno-some-name, -Werror=some-name,
> etc. No? I believe Manuel insists regularly that no new warning should
> use 0 (and old ones should progressively lose it).

Yeah, completely agreed.  It can be enabled by default, but it should still be
controlled by a warning switch.
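For context, such a default-on-but-switchable warning is expressed in c-family/c.opt as an option entry with Init(1); a sketch (the name Wswitch-bool and the exact wording are illustrative, not necessarily what was eventually committed):

```
Wswitch-bool
C ObjC C++ ObjC++ Var(warn_switch_bool) Init(1) Warning
Warn about a switch whose controlling expression has boolean value
```

With an entry like this the warning stays on by default, yet users get -Wno-switch-bool and -Werror=switch-bool for free.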

Jakub


Re: Add testcase for PR lto/60820

2014-04-18 Thread Dominique Dhumieres
> this is a stand-alone testcase for that PR.
> Committed to mainline.
> ...

The test fails on darwin with

/opt/gcc/work/gcc/testsuite/gcc.dg/lto/pr60820_0.c:13:1: warning: alias 
definitions not supported in Mach-O; ignored

TIA

Dominique


Re: GCC's -fsplit-stack disturbing Mach's vm_allocate

2014-04-18 Thread Samuel Thibault
Samuel Thibault, le Thu 17 Apr 2014 00:03:45 +0200, a écrit :
> Thomas Schwinge, le Wed 09 Apr 2014 09:36:42 +0200, a écrit :
> > Well, the first step is to verify that TARGET_THREAD_SPLIT_STACK_OFFSET
> > and similar configury is correct for the Hurd,
> 
> I have added the corresponding field, so we can just use the same offset
> as on Linux.

I have uploaded packages on http://people.debian.org/~sthibault/tmp/ so
Svante can try setting TARGET_THREAD_SPLIT_STACK_OFFSET to 0x30 with
them.

Samuel


Re: [build] PR 43538: Don't overwrite CXXFLAGS_FOR_TARGET in config/mt-gnu

2014-04-18 Thread Marc Glisse

Ping
http://gcc.gnu.org/ml/gcc-patches/2014-01/msg01480.html

On Thu, 23 Jan 2014, Marc Glisse wrote:


Hello,

Although setting CFLAGS_FOR_TARGET before compiling gcc works fine, 
CXXFLAGS_FOR_TARGET is ignored. I don't see any good reason for that.


I tested the patch by doing a regular bootstrap+testsuite on 
x86_64-unknown-linux-gnu. I also did a non-bootstrap build where I set 
CXXFLAGS_FOR_TARGET and checked that it now propagates to libstdc++ and 
others.


config/ChangeLog:

2014-01-23  Marc Glisse  

PR target/43538
* mt-gnu: Don't reset CXXFLAGS_FOR_TARGET.


--
Marc Glisse

Index: config/mt-gnu
===
--- config/mt-gnu   (revision 209514)
+++ config/mt-gnu   (working copy)
@@ -1 +1 @@
-CXXFLAGS_FOR_TARGET = $(CXXFLAGS) -D_GNU_SOURCE
+CXXFLAGS_FOR_TARGET += -D_GNU_SOURCE
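The practical difference: `=` discards whatever CXXFLAGS_FOR_TARGET the user set, while `+=` extends it. A minimal GNU make sketch (hypothetical flag values; `+=` appends to variables inherited from the environment):

```make
# demo.mk - run as: CXXFLAGS_FOR_TARGET="-O2 -g" make -sf demo.mk
# With the old 'CXXFLAGS_FOR_TARGET = $(CXXFLAGS) -D_GNU_SOURCE' the
# user's "-O2 -g" would be thrown away; with '+=' it is kept.
CXXFLAGS_FOR_TARGET += -D_GNU_SOURCE

show:
	@echo '$(CXXFLAGS_FOR_TARGET)'
```

Run as shown above, the target prints "-O2 -g -D_GNU_SOURCE", i.e. the user's flags survive and reach libstdc++ and the other target libraries.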


Re: [Ping] [Patch C++] RFC Fix PR57958

2014-04-18 Thread Dinar Temirbulatov
Hello,
Here is another version of the fix. This time, I added a
complete_type_or_else call just before aggregate_value_p. Bootstrapped
and tested on x86_64-pc-linux-gnu with no new regressions.  Ok to apply
to trunk?

Thanks, Dinar.


On Mon, Apr 7, 2014 at 11:47 PM, Jason Merrill  wrote:
> On 04/07/2014 03:46 PM, Jason Merrill wrote:
>>
>> I guess we need to call complete_type before aggregate_value_p.
>
>
> complete_type_or_else, actually.
>
> Jason
>
>
2014-04-18  Dinar Temirbulatov  

PR c++/57958
* semantics.c (apply_deduced_return_type): Complete non-void type
before estimating whether the type is aggregate.


fix.patch
Description: Binary data


Re: [Patch, avr] Propagate -mrelax gcc driver flag to assembler

2014-04-18 Thread Senthil Kumar Selvaraj

On Sat, Apr 12, 2014 at 06:36:01PM +0200, Georg-Johann Lay wrote:
> Senthil Kumar Selvaraj schrieb:
> >This patch modifies AVR target's ASM spec to pass -mlink-relax to the
> >assembler if -mrelax is passed to the compiler driver. This was already
> >being passed on to the linker, this patch merely makes the assembler
> >also aware of it.
> >
> >The corresponding patch in binutils to handle the -mlink-relax option is
> >already committed in the binutils repo. I'm not sure how to manage
> >running a newer gcc with an older version of binutils though - how is this
> >generally handled?
> 
> The right place is gcc/configure.ac and have a macro defined depending on
> whether gas supports -mlink-relax.
> 
> 
> Same should be done for -mrmw, IMO, for similar reasons, e.g. something like
> 
> case "$target" in
>   ...
>   avr-*-*)
>   ...
> gcc_GAS_CHECK_FEATURE([-mrmw option], gcc_cv_as_avr_mrmw,,
>   [-mrmw], [.text],,
>   [AC_DEFINE(HAVE_AS_AVR_MRMW_OPTION, 1,
>   [Define if your assembler supports -mrmw option.])])
> 
> or
> 
> gcc_GAS_CHECK_FEATURE([-mrmw option], gcc_cv_as_avr_mrmw,,
>   [-mrmw], [.text],,,)
> if test x$gcc_cv_as_avr_mrmw = xyes; then
>   AC_DEFINE(HAVE_AS_AVR_MRMW_OPTION, 1,
> [Define if your assembler supports the -mrmw option.])
> 

Thanks, Johann. The patch below adds the configury check for -mlink-relax,
along with the change to ASM_SPEC to propagate the -mrelax flag. I
modified the original patch a bit to make it easier to handle
conditional additions to SPECs (inspired by the sparc target).

> 
> However, the gcc-4_9-branch has already been created...

Yes, but I figured it would be useful anyway - if this eventually gets
backported to 4_9, for example.

If the below patch looks ok, could someone commit please? I don't have
commit access.

Regards
Senthil

gcc/ChangeLog

2014-04-18  Senthil Kumar Selvaraj  

* config/avr/avr.h: Pass on mlink-relax to assembler.
* configure.ac: Test for mlink-relax support in assembler.
* configure: Regenerate.

diff --git gcc/config/avr/avr.h gcc/config/avr/avr.h
index 78434ec..b4e3eb1 100644
--- gcc/config/avr/avr.h
+++ gcc/config/avr/avr.h
@@ -512,7 +512,28 @@ extern const char *avr_device_to_sp8 (int argc, const char 
**argv);
 %{!fenforce-eh-specs:-fno-enforce-eh-specs} \
 %{!fexceptions:-fno-exceptions}"
 
-#define ASM_SPEC "%:device_to_as(%{mmcu=*:%*}) "
+#ifdef HAVE_AS_RELAX_OPTION
+#define ASM_RELAX_SPEC "%{mrelax:-mlink-relax}"
+#else
+#define ASM_RELAX_SPEC ""
+#endif
+
+#define ASM_SPEC "%:device_to_as(%{mmcu=*:%*})\
+%(asm_relax)"
+
+/* This macro defines names of additional specifications to put in the specs
+   that can be used in various specifications like CC1_SPEC.  Its definition
+   is an initializer with a subgrouping for each command option.
+
+   Each subgrouping contains a string constant, that defines the
+   specification name, and a string constant that used by the GCC driver
+   program.
+
+   Do not define this macro if it does not need to do anything.  */
+
+#define EXTRA_SPECS \
+  { "asm_relax",   ASM_RELAX_SPEC }
+
   
 #define LINK_SPEC "\
 %{mrelax:--relax\
diff --git gcc/configure gcc/configure
index bfb1525..7815038 100755
--- gcc/configure
+++ gcc/configure
@@ -24142,6 +24142,39 @@ $as_echo "#define HAVE_AS_JSRDIRECT_RELOCS 1" 
>>confdefs.h
 fi
 ;;
 
+  avr-*-*)
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking assembler for 
-mlink-relax option" >&5
+$as_echo_n "checking assembler for -mlink-relax option... " >&6; }
+if test "${gcc_cv_as_avr_relax+set}" = set; then :
+  $as_echo_n "(cached) " >&6
+else
+  gcc_cv_as_avr_relax=no
+  if test x$gcc_cv_as != x; then
+$as_echo '.text' > conftest.s
+if { ac_try='$gcc_cv_as $gcc_cv_as_flags -mlink-relax -o conftest.o 
conftest.s >&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }
+then
+   gcc_cv_as_avr_relax=yes
+else
+  echo "configure: failed program was" >&5
+  cat conftest.s >&5
+fi
+rm -f conftest.o conftest.s
+  fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $gcc_cv_as_avr_relax" >&5
+$as_echo "$gcc_cv_as_avr_relax" >&6; }
+if test $gcc_cv_as_avr_relax = yes; then
+
+$as_echo "#define HAVE_AS_RELAX_OPTION 1" >>confdefs.h
+
+fi
+  ;;
+
   cris-*-*)
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking assembler for 
-no-mul-bug-abort option" >&5
 $as_echo_n "checking assembler for -no-mul-bug-abort option... " >&6; }
diff --git gcc/configure.ac gcc/configure.ac
index d7cae6c..cfa862d 100644
--- gcc/configure.ac
+++ gcc/configure.ac
@@ -3579,6 +3579,13 @@ case "$target" in
   [Define if your assembler supports the lituse_jsrdirect relocation.])])
 ;;
 
+  avr-*-*)
+gcc_GAS_CHECK_FEATURE([-mlink-relax option], gcc_cv_as_avr_relax,,
+  

Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.

2014-04-18 Thread Evgeny Stupachenko
Hi,

Merged with current master, the patch passes bootstrap and gives the
expected gains.
Patch and new tests are attached.

ChangeLog:

2014-04-18  Evgeny Stupachenko  

* tree-vect-data-refs.c (vect_grouped_store_supported): New
check for stores group of length 3.
(vect_permute_store_chain): New permutations for stores group of
length 3.
(vect_grouped_load_supported): New check for loads group of length 3.
(vect_permute_load_chain): New permutations for loads group of length 3.
* tree-vect-stmts.c (vect_model_store_cost): Change cost
of vec_perm_shuffle for the new permutations.
(vect_model_load_cost): Ditto.

ChangeLog for testsuite:

2014-04-18  Evgeny Stupachenko  

   PR tree-optimization/52252
   * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
   * gcc.dg/vect/pr52252-st.c: Test on stores group of size 3.

Evgeny

On Thu, Mar 6, 2014 at 6:44 PM, Evgeny Stupachenko  wrote:
> Missed attachment.
>
> On Thu, Mar 6, 2014 at 6:42 PM, Evgeny Stupachenko  wrote:
>> I've separated the patch into 2: cost model tuning and load/store
>> groups parallelism.
>> SLM tuning was partially introduced in the patch:
>> http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00226.html
>> The patch introducing vectorization for load/store groups of size 3 attached.
>>
>> Is it ok for stage1?
>>
>> ChangeLog:
>>
>> 2014-03-06  Evgeny Stupachenko  
>>
>>* tree-vect-data-refs.c (vect_grouped_store_supported): New
>>check for stores group of length 3.
>>(vect_permute_store_chain): New permutations for stores group of
>>length 3.
>>(vect_grouped_load_supported): New check for loads group of length 3.
>>(vect_permute_load_chain): New permutations for loads group of length 
>> 3.
>>* tree-vect-stmts.c (vect_model_store_cost): Change cost
>>of vec_perm_shuffle for the new permutations.
>>(vect_model_load_cost): Ditto.
>>
>>
>>
>> On Tue, Feb 11, 2014 at 7:19 PM, Richard Biener  wrote:
>>> On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>>>
 Missed patch attached in plain-text.

 I have copyright assignment on file with the FSF covering work on GCC.

 Load/store groups of length 3 are the most frequent non-power-of-2
 case; they are used in RGB image processing (like the test case in PR52252).
 For sure we can extend the patch to length 5 and more. However, this
 potentially affects performance on some other architectures and
 requires larger testing, so length 3 is just the first step. The
 algorithm in the patch could be modified for the general case in several
 steps.

 I understand that the patch should wait for the stage 1, however since
 its ready we can discuss it right now and make some changes (like
 general size of group).
>>>
>>> Other than that I'd like to see a vectorizer hook querying the cost of a
>>> vec_perm_const expansion instead of adding vec_perm_shuffle
>>> (thus requires the constant shuffle mask to be passed as well
>>> as the vector type).  That's more useful for other uses that
>>> would require (arbitrary) shuffles.
>>>
>>> Didn't look at the rest of the patch yet - queued in my review
>>> pipeline.
>>>
>>> Thanks,
>>> Richard.
>>>
 Thanks,
 Evgeny

 On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener  wrote:
 >
 > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
 >
 > > Hi,
 > >
 > > The patch gives an expected 3 times gain for the test case in the 
 > > PR52252
 > > (and even 6 times for AVX2).
 > > It passes make check and bootstrap on x86.
 > > spec2000/spec2006 got no regressions/gains on x86.
 > >
 > > Is this patch ok?
 >
 > I've worked on generalizing the permutation support in the light
 > of the availability of the generic shuffle support in the IL
 > but hit some road-blocks in the way code-generation works for
 > group loads with permutations (I don't remember if I posted all patches).
 >
 > This patch seems to be to a slightly different place but it again
 > special-cases a specific permutation.  Why's that?  Why can't we
 > support groups of size 7 for example?  So - can this be generalized
 > to support arbitrary non-power-of-two load/store groups?
 >
 > Other than that the patch has to wait for stage1 to open again,
 > of course.  And it misses a testcase.
 >
 > Btw, do you have a copyright assignment on file with the FSF covering
 > work on GCC?
 >
 > Thanks,
 > Richard.
 >
 > > ChangeLog:
 > >
 > > 2014-02-11  Evgeny Stupachenko  
 > >
 > > * target.h (vect_cost_for_stmt): Defining new cost 
 > > vec_perm_shuffle.
 > > * tree-vect-data-refs.c (vect_grouped_store_supported): New
 > > check for stores group of length 3.
 > > (vect_permute_store_chain): New permutations for stores group 

Re: [PATCH v8] PR middle-end/60281

2014-04-18 Thread lin zuojian
Hi Jakub,

On Fri, Apr 18, 2014 at 09:09:17AM +0200, Jakub Jelinek wrote:
> Extra line missing before the PR line.
Should I post a PATCH v9, or could someone add the missing line when
committing the patch?

--
Regards
lin zuojian


Re: Remove obsolete Solaris 9 support

2014-04-18 Thread Eric Botcazou
> But for the Solaris 9 stuff, it is crystal clear that this cannot occur on
> Solaris 10 and up (no single-threaded case anymore since libthread.so.1
> has been folded into libc.so.1).  Ok to remove this part?

OK for the "Solaris 9 - single-threaded" part.

-- 
Eric Botcazou


[PATCH] RTEMS thread model configuration

2014-04-18 Thread Sebastian Huber
From: Sebastian Huber 

The command line to build GCC for RTEMS virtually always contained
'--enable-threads'.  This patch helps avoid this extra configure
parameter and makes the GCC build a bit more user-friendly for RTEMS.

This patch should be applied to GCC 4.9 branch and master.

2014-04-18  Sebastian Huber  

* config.gcc (*-*-rtems*): Default to 'rtems' thread model.
Enable selection of 'posix' or no thread model.
---
 gcc/config.gcc | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 3c55c88..93d5994 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -791,7 +791,13 @@ case ${target} in
   ;;
 *-*-rtems*)
   case ${enable_threads} in
-yes) thread_file='rtems' ;;
+"" | yes | rtems) thread_file='rtems' ;;
+posix) thread_file='posix' ;;
+no) ;;
+*)
+  echo 'Unknown thread configuration for RTEMS'
+  exit 1
+  ;;
   esac
   tmake_file="${tmake_file} t-rtems"
   extra_options="${extra_options} rtems.opt"
-- 
1.8.1.4



Re: [PATCH] Do not run IPA transform phases multiple times

2014-04-18 Thread Jakub Jelinek
On Wed, Apr 16, 2014 at 03:28:59PM +, Zamyatin, Igor wrote:
> Likely after this was checked in appeared following on x86
> 
> FAIL: gcc.dg/vect/vect-simd-clone-11.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-11.c -flto -ffat-lto-objects (test for 
> excess errors)
> FAIL: gcc.dg/vect/vect-simd-clone-12.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-12.c -flto -ffat-lto-objects (test for 
> excess errors)
> FAIL: gcc.dg/vect/vect-simd-clone-1.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-1.c -flto -ffat-lto-objects (test for 
> excess errors)
> FAIL: gcc.dg/vect/vect-simd-clone-2.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-2.c -flto -ffat-lto-objects (test for 
> excess errors)
> FAIL: gcc.dg/vect/vect-simd-clone-3.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-3.c -flto -ffat-lto-objects (test for 
> excess errors)
> FAIL: gcc.dg/vect/vect-simd-clone-4.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-4.c -flto -ffat-lto-objects (test for 
> excess errors)
> FAIL: gcc.dg/vect/vect-simd-clone-5.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-5.c -flto -ffat-lto-objects (test for 
> excess errors)
> FAIL: gcc.dg/vect/vect-simd-clone-6.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-6.c -flto -ffat-lto-objects (test for 
> excess errors)
> FAIL: gcc.dg/vect/vect-simd-clone-7.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-7.c -flto -ffat-lto-objects (test for 
> excess errors)
> FAIL: gcc.dg/vect/vect-simd-clone-8.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-8.c -flto -ffat-lto-objects (test for 
> excess errors)
> FAIL: gcc.dg/vect/vect-simd-clone-9.c -flto -ffat-lto-objects (internal 
> compiler error)
> FAIL: gcc.dg/vect/vect-simd-clone-9.c -flto -ffat-lto-objects (test for 
> excess errors)

Yeah, it is in the assert added in this patch:
977   gcc_assert (!old_version_node->ipa_transforms_to_apply.exists ());
in cgraph_function_versioning.
pass_omp_simd_clone is a late IPA pass which needs to perform
cgraph_function_versioning, and the ICE is in lto1 when the old_version_node
has been read from the LTO IL from the object file, and
ipa_transforms_to_apply contains tons of various transforms, but I suppose
that during late IPA passes they are no longer performed.

Martin, can you please fix this up?

Jakub


Re: [PATCH v8] PR middle-end/60281

2014-04-18 Thread Jakub Jelinek
On Fri, Apr 18, 2014 at 06:05:35PM +0800, lin zuojian wrote:
> Hi Jakub,
> 
> On Fri, Apr 18, 2014 at 09:09:17AM +0200, Jakub Jelinek wrote:
> > Extra line missing before the PR line.
> Should I post a PATCH v9, or could someone add the missing line when
> committing the patch?

No need for that, that was meant for next time.  Anyway, I'll try to do an
arm scratch bootstrap/regtest of your patch and if all goes well, will check
it in (but as that may take a while, might do that only on Monday).

Jakub


Re: [PATCH] Try to coalesce for unary and binary ops

2014-04-18 Thread Eric Botcazou
> Now the question is what does this tell us?  Not re-using
> the same pseudo as op and target is always better?

I think that "generally better" is the most appropriate wording, there is even 
a specific pass (web.c) to that effect.

-- 
Eric Botcazou


Re: debug container patch

2014-04-18 Thread Jonathan Wakely
On 17 April 2014 21:43, François Dumont wrote:
> Hi
>
> Here is a patch to globally enhance debug containers implementation.
>
> I have isolated all the special member function code in a base class so that
> in C++11 we can use default implementations for the debug containers. This
> way the implementation is simpler and inherits the noexcept qualifications.

Excellent.

> I had to add an _IsCpp11AllocatorAware template parameter to this new
> type for types that are not yet C++11-allocator aware. We will be able to
> simplify it later.

Minor: we use "cxx" rather than "cpp", leaving "cpp" to
unambiguously refer to the preprocessor, so I would use _IsCxx11...
for that.

> I also noticed that in std/c++11/debug.cc we have some methods qualified
> with noexcept while in C++03 user code those methods will have a throw()
> qualification. Is that fine?

Yes, an empty throw() is compatible with noexcept(true).

I'll review the rest of the patch over the weekend, thanks!


Re: [PATCH] Fix up rotate expansion (take 2)

2014-04-18 Thread Jakub Jelinek
On Wed, Apr 16, 2014 at 10:22:52PM -0400, DJ Delorie wrote:
> > +   {
> > + other_amount
> > +   = simplify_gen_unary (NEG, GET_MODE (op1),
> > + op1, GET_MODE (op1));
> > + other_amount
> > +   = simplify_gen_binary (AND, GET_MODE (op1),
> > +  other_amount,
> > +  GEN_INT (GET_MODE_PRECISION (mode)
> > +   - 1));
> > +   }
> >  
> >   shifted = force_reg (mode, shifted);
> >  
> 
> causes an ICE in gcc.c-torture/execute/20020226-1.c, which we tried to
> avoid by adding an "addneghi" pattern (which itself has a bug, hence me
> investigating).

I don't see why you would need such a pattern.  Here is what happens
on x86_64 if I forcefully ignore the rotate patterns to exercise this code:

#8  0x00a20bd7 in expand_simple_unop (mode=SImode, code=NEG, 
op0=0x719f7600, target=0x719fc2c0, unsignedp=0)
at ../../gcc/optabs.c:2511
#9  0x007f180c in force_operand (value=0x719fd040, 
target=0x719fc2c0) at ../../gcc/expr.c:7212
#10 0x007f1457 in force_operand (value=0x719ddd68, target=0x0) at 
../../gcc/expr.c:7154
#11 0x007c8509 in force_reg (mode=SImode, x=0x719ddd68) at 
../../gcc/explow.c:683
#12 0x007dd502 in convert_move (to=0x719fc2a0, from=0x719ddd68, 
unsignedp=1) at ../../gcc/expr.c:607
#13 0x007ddde0 in convert_modes (mode=QImode, oldmode=SImode, 
x=0x719ddd68, unsignedp=1) at ../../gcc/expr.c:798
#14 0x00a1ddba in expand_binop_directly (mode=SImode, 
binoptab=ashl_optab, op0=0x719f78c0, op1=0x719ddd68, target=0x0, 
unsignedp=1, methods=OPTAB_LIB_WIDEN, last=0x719f4f78) at 
../../gcc/optabs.c:1437
#15 0x00a1e18e in expand_binop (mode=SImode, binoptab=ashl_optab, 
op0=0x719f78c0, op1=0x719ddd68, target=0x0, unsignedp=1, 
methods=OPTAB_LIB_WIDEN) at ../../gcc/optabs.c:1546
#16 0x007d12c9 in expand_shift_1 (code=LSHIFT_EXPR, mode=SImode, 
shifted=0x719f78c0, amount=0x719ddd68, target=0x0, unsignedp=1)
at ../../gcc/expmed.c:2287
#17 0x007d1209 in expand_shift_1 (code=RROTATE_EXPR, mode=SImode, 
shifted=0x719f78c0, amount=0x719f7600, target=0x0, unsignedp=1)
at ../../gcc/expmed.c:2275

i.e. the other_amount expression (and (neg ()) (const_int)), unless it
is a general_operand (which usually shouldn't be the case), is first forced
into a register with whatever mode it has (GET_MODE (op1)); forcing it
into a register forces even the operands of the AND using force_operand,
thus a separate NEG and a separate AND insn, and then it is finally
mode-converted (zero extended or truncated) into whatever mode the shift
pattern requires.

Jakub


Re: [C PATCH] Warn if switch has boolean value (PR c/60439)

2014-04-18 Thread Steven Bosscher
On Fri, Apr 18, 2014 at 7:30 AM, Marek Polacek wrote:
> + if (TREE_CODE (type) == BOOLEAN_TYPE
> + || exp_code == TRUTH_ANDIF_EXPR
> + || exp_code == TRUTH_AND_EXPR
> + || exp_code == TRUTH_ORIF_EXPR
> + || exp_code == TRUTH_OR_EXPR
> + || exp_code == TRUTH_XOR_EXPR
> + || exp_code == TRUTH_NOT_EXPR
> + || exp_code == EQ_EXPR
> + || exp_code == NE_EXPR
> + || exp_code == LE_EXPR
> + || exp_code == GE_EXPR
> + || exp_code == LT_EXPR
> + || exp_code == GT_EXPR)

Is there a TREE_CODE_CLASS or a #define for this?

Ciao!
Steven


Re: calloc = malloc + memset

2014-04-18 Thread Jakub Jelinek
On Sun, Mar 23, 2014 at 05:13:46PM +0100, Marc Glisse wrote:
> 2014-03-23  Marc Glisse  
> 
> PR tree-optimization/57742
> gcc/
> * tree-ssa-strlen.c (get_string_length): Ignore malloc.
> (handle_builtin_malloc, handle_builtin_memset): New functions.
>   (strlen_optimize_stmt): Call them.
>   * passes.def: Move strlen after loop+dom.
> gcc/testsuite/
> * g++.dg/tree-ssa/calloc.C: New testcase.
> * gcc.dg/tree-ssa/calloc-1.c: Likewise.
> * gcc.dg/tree-ssa/calloc-2.c: Likewise.
>   * gcc.dg/strlenopt-9.c: Adapt.

Some CL lines are indented by 8 spaces instead of tabs.
The passes.def change makes me a little bit nervous, but if it works,
perhaps.

> --- gcc/testsuite/g++.dg/tree-ssa/calloc.C(revision 0)
> +++ gcc/testsuite/g++.dg/tree-ssa/calloc.C(working copy)
> @@ -0,0 +1,35 @@
> +/* { dg-do compile { target c++11 } } */
> +/* { dg-options "-O3 -fdump-tree-optimized" } */
> +
> +#include 
> +#include 
> +#include 
> +
> +void g(void*);
> +inline void* operator new(std::size_t sz)
> +{
> +  void *p;
> +
> +  if (sz == 0)
> +sz = 1;
> +
> +  // Slightly modified from the libsupc++ version, that one has 2 calls
> +  // to malloc which makes it too hard to optimize.
> +  while ((p = std::malloc (sz)) == 0)
> +{
> +  std::new_handler handler = std::get_new_handler ();
> +  if (! handler)
> +throw std::bad_alloc();
> +  handler ();
> +}
> +  return p;
> +}
> +
> +void f(void*p,int n){
> +  new(p)std::vector(n);
> +}
> +
> +/* { dg-final { scan-tree-dump-times "calloc" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-not "malloc" "optimized" } } */
> +/* { dg-final { scan-tree-dump-not "memset" "optimized" } } */
> +/* { dg-final { cleanup-tree-dump "optimized" } } */

This looks to me way too much fragile, any time the libstdc++
or glibc headers change a little bit, you might need to adjust the
dg-final directives.  Much better would be if you just provided
the prototypes yourself and subset of the std::vector you really need for
the testcase.  You can throw some class or int, it doesn't have to be
std::bad_alloc, etc.

> --- gcc/testsuite/gcc.dg/strlenopt-9.c(revision 208772)
> +++ gcc/testsuite/gcc.dg/strlenopt-9.c(working copy)
> @@ -11,21 +11,21 @@ fn1 (int r)
>   optimized away.  */
>return strchr (p, '\0');
>  }
>  
>  __attribute__((noinline, noclone)) size_t
>  fn2 (int r)
>  {
>char *p, q[10];
>strcpy (q, "abc");
>p = r ? "a" : q;
> -  /* String length for p varies, therefore strlen below isn't
> +  /* String length is constant for both alternatives, and strlen is
>   optimized away.  */
>return strlen (p);

Is this because of jump threading?

> --- gcc/testsuite/gcc.dg/tree-ssa/calloc-1.c  (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/calloc-1.c  (working copy)
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#include 
> +#include 

Even this I find unsafe.  The strlenopt*.c tests use their own
strlenopt.h header for a reason; you might just add a calloc
prototype in there and use that header.

> --- gcc/tree-ssa-strlen.c (revision 208772)
> +++ gcc/tree-ssa-strlen.c (working copy)
> @@ -496,20 +496,23 @@ get_string_length (strinfo si)
>   case BUILT_IN_STPCPY_CHK:
> gcc_assert (lhs != NULL_TREE);
> loc = gimple_location (stmt);
> si->endptr = lhs;
> si->stmt = NULL;
> lhs = fold_convert_loc (loc, size_type_node, lhs);
> si->length = fold_convert_loc (loc, size_type_node, si->ptr);
> si->length = fold_build2_loc (loc, MINUS_EXPR, size_type_node,
>   lhs, si->length);
> break;
> + case BUILT_IN_MALLOC:
> +   break;
> +   /* calloc always has si->length set.  */

I guess it would be better to talk about BUILT_IN_CALLOC here, and align
the comment with case and default column, to make it clear the comment
is not about BUILT_IN_MALLOC.  Or is it?
But, see below.

>   default:

>   if (!si->dont_invalidate)
> {
>   ao_ref r;
> + // Do not use si->length.

As the whole file uses /* ... */ comments, can you use them here too?

>   ao_ref_init_from_ptr_and_size (&r, si->ptr, NULL_TREE);
>   if (stmt_may_clobber_ref_p_1 (stmt, &r))
> {
>   set_strinfo (i, NULL);
>   free_strinfo (si);
>   continue;
> }
> }
>   si->dont_invalidate = false;
>   nonempty = true;
> @@ -1608,20 +1612,90 @@ handle_builtin_strcat (enum built_in_fun
>   {
> laststmt.stmt = stmt;
> laststmt.len = srclen;
> laststmt.stridx = dsi->idx;
>   }
>  }
>else if (dump_file && (dump_flags & TDF_DETAILS) != 0)
>  fprintf (dump_file, "not possible.\n");
>  }
>  
> +/* Handle a call to malloc or calloc.  */
> +
> +static void
> +handle_builtin_malloc (enum 

Re: [PATCH] Try to coalesce for unary and binary ops

2014-04-18 Thread Steven Bosscher
On Thu, Apr 17, 2014 at 4:00 PM, Michael Matz wrote:
>> And to have sth that TER not immediately un-does we have
>> to disable TER which conveniently happens for coalesced
>> SSA names.
>
> So, instead TER should be improved to not disturb the incoming instruction
> order (except where secondary effects of expanding larger trees can be
> had).  Changing the coalescing set to disable some bad parts in a later
> pass doesn't sound very convincing :)

IMHO TER should be improved to *do* disturb the order of the incoming
instructions, to reduce register pressure. There are test cases I've
looked at (pathological cases, I'll admit) where TER forwarded loads
to stores and blew up register pressure.

Alternatively: Do what has to be done to enable sched1 for ix86/x86_64...

Ciao!
Steven


Re: [PATCH] Do not run IPA transform phases multiple times

2014-04-18 Thread Martin Jambor
On Fri, Apr 18, 2014 at 12:11:45PM +0200, Jakub Jelinek wrote:
> On Wed, Apr 16, 2014 at 03:28:59PM +, Zamyatin, Igor wrote:
> > Likely after this was checked in appeared following on x86
> > 
> > FAIL: gcc.dg/vect/vect-simd-clone-11.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-11.c -flto -ffat-lto-objects (test for 
> > excess errors)
> > FAIL: gcc.dg/vect/vect-simd-clone-12.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-12.c -flto -ffat-lto-objects (test for 
> > excess errors)
> > FAIL: gcc.dg/vect/vect-simd-clone-1.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-1.c -flto -ffat-lto-objects (test for 
> > excess errors)
> > FAIL: gcc.dg/vect/vect-simd-clone-2.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-2.c -flto -ffat-lto-objects (test for 
> > excess errors)
> > FAIL: gcc.dg/vect/vect-simd-clone-3.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-3.c -flto -ffat-lto-objects (test for 
> > excess errors)
> > FAIL: gcc.dg/vect/vect-simd-clone-4.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-4.c -flto -ffat-lto-objects (test for 
> > excess errors)
> > FAIL: gcc.dg/vect/vect-simd-clone-5.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-5.c -flto -ffat-lto-objects (test for 
> > excess errors)
> > FAIL: gcc.dg/vect/vect-simd-clone-6.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-6.c -flto -ffat-lto-objects (test for 
> > excess errors)
> > FAIL: gcc.dg/vect/vect-simd-clone-7.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-7.c -flto -ffat-lto-objects (test for 
> > excess errors)
> > FAIL: gcc.dg/vect/vect-simd-clone-8.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-8.c -flto -ffat-lto-objects (test for 
> > excess errors)
> > FAIL: gcc.dg/vect/vect-simd-clone-9.c -flto -ffat-lto-objects (internal 
> > compiler error)
> > FAIL: gcc.dg/vect/vect-simd-clone-9.c -flto -ffat-lto-objects (test for 
> > excess errors)
> 
> Yeah, it is in the assert added in this patch:
> 977 gcc_assert (!old_version_node->ipa_transforms_to_apply.exists ());
> in cgraph_function_versioning.
> pass_omp_simd_clone is a late IPA pass which needs to perform
> cgraph_function_versioning, and the ICE is in lto1 when the old_version_node
> has been read from the LTO IL from the object file, and
> ipa_transforms_to_apply contains tons of various transforms, but I suppose
> that during late IPA passes they are no longer performed.
> 
> Martin, can you please fix this up?
> 
>   Jakub

I am aware this problem has been reported but my problem is that I
cannot reproduce it anywhere.  The tests pass on x86_64 host and the
only i686 host I use is gcc45 on gcc farm where these tests are
skipped.  It is very likely that the patch below is the proper fix but
it's cumbersome to propose patches without having the testcase.  And
of course I cannot test it because the vector is empty for me in all
cases.

I'll try to look for some new i686 machine somewhere, although such
machines are rather scarce these days.

Martin


2014-04-18  Martin Jambor  

* cgraphclones.c (cgraph_function_versioning): Copy
ipa_transforms_to_apply instead of asserting it is empty.

Index: src/gcc/cgraphclones.c
===
--- src.orig/gcc/cgraphclones.c
+++ src/gcc/cgraphclones.c
@@ -974,7 +974,9 @@ cgraph_function_versioning (struct cgrap
 cgraph_copy_node_for_versioning (old_version_node, new_decl,
 redirect_callers, bbs_to_copy);
 
-  gcc_assert (!old_version_node->ipa_transforms_to_apply.exists ());
+  if (old_version_node->ipa_transforms_to_apply.exists ())
+new_version_node->ipa_transforms_to_apply
+  = old_version_node->ipa_transforms_to_apply.copy ();
   /* Copy the OLD_VERSION_NODE function tree to the new version.  */
   tree_function_versioning (old_decl, new_decl, tree_map, false, args_to_skip,
skip_return, bbs_to_copy, new_entry_block);


Re: [PATCH] Do not run IPA transform phases multiple times

2014-04-18 Thread Jakub Jelinek
On Fri, Apr 18, 2014 at 01:33:54PM +0200, Martin Jambor wrote:
> I am aware this problem has been reported but my problem is that I
> cannot reproduce it anywhere.  The tests pass on x86_64 host and the
> only i686 host I use is gcc45 on gcc farm where these tests are
> skipped.  It is very likely that the patch below is the proper fix but

It reproduces on x86_64 too, I guess the reason why you aren't seeing this
is that you might have too old assembler that doesn't support
avx2 instructions (you actually don't need avx2 hw to reproduce, any
x86_64 or i686 just with gas that supports avx2 should be enough).

Jakub


Re: [C PATCH] Warn if switch has boolean value (PR c/60439)

2014-04-18 Thread Marc Glisse

On Fri, 18 Apr 2014, Steven Bosscher wrote:


On Fri, Apr 18, 2014 at 7:30 AM, Marek Polacek wrote:

+ if (TREE_CODE (type) == BOOLEAN_TYPE
+ || exp_code == TRUTH_ANDIF_EXPR
+ || exp_code == TRUTH_AND_EXPR
+ || exp_code == TRUTH_ORIF_EXPR
+ || exp_code == TRUTH_OR_EXPR
+ || exp_code == TRUTH_XOR_EXPR
+ || exp_code == TRUTH_NOT_EXPR
+ || exp_code == EQ_EXPR
+ || exp_code == NE_EXPR
+ || exp_code == LE_EXPR
+ || exp_code == GE_EXPR
+ || exp_code == LT_EXPR
+ || exp_code == GT_EXPR)


Is there a TREE_CODE_CLASS or a #define for this?


truth_value_p

--
Marc Glisse


Re: [PATCH] Try to coalesce for unary and binary ops

2014-04-18 Thread Richard Biener
On April 18, 2014 1:30:59 PM CEST, Steven Bosscher  
wrote:
>On Thu, Apr 17, 2014 at 4:00 PM, Michael Matz wrote:
>>> And to have sth that TER not immediately un-does we have
>>> to disable TER which conveniently happens for coalesced
>>> SSA names.
>>
>> So, instead TER should be improved to not disturb the incoming
>instruction
>> order (except where secondary effects of expanding larger trees can
>be
>> had).  Changing the coalescing set to disable some bad parts in a
>later
>> pass doesn't sound very convincing :)
>
>IMHO TER should be improved to *do* disturb the order of the incoming
>instructions, to reduce register pressure. There are test cases I've
>looked at (pathological cases, I'll admit) where TER forwarded loads
>to stores and blew up register pressure.

I am looking at doing something like that by scheduling gimple stmts to that 
effect before RTL expansion. But TER undoes most of the effort, unfortunately. 
TER itself won't be able to perform scheduling due to what it is designed to do 
(simulate RTL expansion from GENERIC).

Richard.

>Alternatively: Do what has to be done to enable sched1 for
>ix86/x86_64...
>
>Ciao!
>Steven




Re: [C PATCH] Warn if switch has boolean value (PR c/60439)

2014-04-18 Thread Marek Polacek
On Fri, Apr 18, 2014 at 01:20:59PM +0200, Steven Bosscher wrote:
> On Fri, Apr 18, 2014 at 7:30 AM, Marek Polacek wrote:
> > + if (TREE_CODE (type) == BOOLEAN_TYPE
> > + || exp_code == TRUTH_ANDIF_EXPR
> > + || exp_code == TRUTH_AND_EXPR
> > + || exp_code == TRUTH_ORIF_EXPR
> > + || exp_code == TRUTH_OR_EXPR
> > + || exp_code == TRUTH_XOR_EXPR
> > + || exp_code == TRUTH_NOT_EXPR
> > + || exp_code == EQ_EXPR
> > + || exp_code == NE_EXPR
> > + || exp_code == LE_EXPR
> > + || exp_code == GE_EXPR
> > + || exp_code == LT_EXPR
> > + || exp_code == GT_EXPR)
> 
> Is there a TREE_CODE_CLASS or a #define for this?

Good question.  I was looking for something nicer and found nothing,
no TREE_CODE_CLASS or #define.
But now that I'm looking again, I found truth_value_p...  Thanks.

2014-04-18  Marek Polacek  

PR c/60439
* doc/invoke.texi: Document -Wswitch-bool.
c/
* c-typeck.c (c_start_case): Warn if switch condition has boolean
value.
c-family/
* c.opt (Wswitch-bool): New option.
testsuite/
* gcc.dg/pr60439.c: New test.

diff --git gcc/c-family/c.opt gcc/c-family/c.opt
index 390c056..9089496 100644
--- gcc/c-family/c.opt
+++ gcc/c-family/c.opt
@@ -529,6 +529,10 @@ Wswitch-enum
 C ObjC C++ ObjC++ Var(warn_switch_enum) Warning
 Warn about all enumerated switches missing a specific case
 
+Wswitch-bool
+C ObjC Warning Init(1)
+Warn about switches with boolean controlling expression
+
 Wmissing-format-attribute
 C ObjC C++ ObjC++ Alias(Wsuggest-attribute=format)
 ;
diff --git gcc/c/c-typeck.c gcc/c/c-typeck.c
index 65aad45..1b37f83 100644
--- gcc/c/c-typeck.c
+++ gcc/c/c-typeck.c
@@ -9344,6 +9344,15 @@ c_start_case (location_t switch_loc,
   else
{
  tree type = TYPE_MAIN_VARIANT (orig_type);
+ tree e = exp;
+
+ while (TREE_CODE (e) == COMPOUND_EXPR)
+   e = TREE_OPERAND (e, 1);
+
+ if (TREE_CODE (type) == BOOLEAN_TYPE
+ || truth_value_p (TREE_CODE (e)))
+   warning_at (switch_cond_loc, OPT_Wswitch_bool,
+   "switch condition has boolean value");
 
  if (!in_system_header_at (input_location)
  && (type == long_integer_type_node
diff --git gcc/doc/invoke.texi gcc/doc/invoke.texi
index 8004da8..04e1c41 100644
--- gcc/doc/invoke.texi
+++ gcc/doc/invoke.texi
@@ -268,7 +268,7 @@ Objective-C and Objective-C++ Dialects}.
 -Wstrict-aliasing=n @gol -Wstrict-overflow -Wstrict-overflow=@var{n} @gol
 -Wsuggest-attribute=@r{[}pure@r{|}const@r{|}noreturn@r{|}format@r{]} @gol
 -Wmissing-format-attribute @gol
--Wswitch  -Wswitch-default  -Wswitch-enum -Wsync-nand @gol
+-Wswitch  -Wswitch-default  -Wswitch-enum -Wswitch-bool -Wsync-nand @gol
 -Wsystem-headers  -Wtrampolines  -Wtrigraphs  -Wtype-limits  -Wundef @gol
 -Wuninitialized  -Wunknown-pragmas  -Wno-pragmas @gol
 -Wunsuffixed-float-constants  -Wunused  -Wunused-function @gol
@@ -3822,6 +3822,12 @@ between @option{-Wswitch} and this option is that this 
option gives a
 warning about an omitted enumeration code even if there is a
 @code{default} label.
 
+@item -Wswitch-bool
+@opindex Wswitch-bool
+@opindex Wno-switch-bool
+Warn whenever a @code{switch} statement has an index of boolean type.
+This warning is enabled by default for C programs.
+
 @item -Wsync-nand @r{(C and C++ only)}
 @opindex Wsync-nand
 @opindex Wno-sync-nand
diff --git gcc/testsuite/gcc.dg/pr60439.c gcc/testsuite/gcc.dg/pr60439.c
index e69de29..26e7c25 100644
--- gcc/testsuite/gcc.dg/pr60439.c
+++ gcc/testsuite/gcc.dg/pr60439.c
@@ -0,0 +1,112 @@
+/* PR c/60439 */
+/* { dg-do compile } */
+
+typedef _Bool bool;
+extern _Bool foo (void);
+
+void
+f1 (const _Bool b)
+{
+  switch (b) /* { dg-warning "switch condition has boolean value" } */
+case 1:
+  break;
+}
+
+void
+f2 (int a, int b)
+{
+  switch (a && b) /* { dg-warning "switch condition has boolean value" } */
+case 1:
+  break;
+  switch ((bool) (a && b)) /* { dg-warning "switch condition has boolean 
value" } */
+case 1:
+  break;
+  switch ((a && b) || a) /* { dg-warning "switch condition has boolean value" 
} */
+case 1:
+  break;
+}
+
+void
+f3 (int a)
+{
+  switch (!!a) /* { dg-warning "switch condition has boolean value" } */
+case 1:
+  break;
+  switch (!a) /* { dg-warning "switch condition has boolean value" } */
+case 1:
+  break;
+}
+
+void
+f4 (void)
+{
+  switch (foo ()) /* { dg-warning "switch condition has boolean value" } */
+case 1:
+  break;
+}
+
+void
+f5 (int a)
+{
+  switch (a == 3) /* { dg-warning "switch condition has boolean value" } */
+case 1:
+  break;
+  switch (a != 3) /* { dg-warning "switch condition has boolean value" } */
+case 1:
+  break;
+  switch (a > 3) /* { dg-warning "switch condition has boolean value" } */
+case 1:
+  break;
+  switch (a < 3

Re: [PATH, SH] Small builtin_strlen improvement

2014-04-18 Thread Oleg Endo
Sorry for the delayed reply.

On Mon, 2014-03-31 at 09:44 +0200, Christian Bruel wrote:
> On 03/30/2014 11:02 PM, Oleg Endo wrote:
> > Hi,
> >
> > On Wed, 2014-03-26 at 08:58 +0100, Christian Bruel wrote:
> >
> >> This patch adds a few instructions to the inlined builtin_strlen to
> >> unroll the remaining bytes of the word-at-a-time loop. This makes it
> >> possible to have 2 distinct execution paths (no fall-thru in the
> >> byte-at-a-time loop), allowing block alignment assignment. This
> >> partially improves the problem reported by Oleg in [Bug target/60539]
> >> New: [SH] builtin string functions ignore loop and label alignment
> > Actually, my original concern was the (mis)alignment of the 4 byte inner
> > loop.  AFAIR it's better for the SH pipeline if the first insn of a loop
> > is 4 byte aligned.
> 
> yes, this is why I haven't closed the PR. IMHO the problem with the
> non-aligned loop stems from the generic alignment code in final.c.
> Changing branch frequencies has quite an impact on BB reordering as well.
> Further tuning of the static branch estimates, or of the LOOP_ALIGN
> macro, is needed. 

OK, I've updated PR 60539 accordingly.

> Note that my branch estimates in this code are very empirical; some
> dynamic profiling benchmarks would be nice as well.
> My point was just that forcing a local .align in this code is a
> workaround, as we should be able to rely on the generic reordering/align
> code for this. So the tuning of loop alignment is a more global issue
> (and well exhibited here indeed).

I think that those two are separate issues.  I've opened a new PR 60884
for this.  Let's continue the discussions and experiments there.

Cheers,
Oleg




Re: [Ping] [Patch C++] RFC Fix PR57958

2014-04-18 Thread Dinar Temirbulatov
I found a typo in the testcase header and fixed it.  Ok to apply to trunk?
thanks, Dinar.

On Fri, Apr 18, 2014 at 1:13 PM, Dinar Temirbulatov
 wrote:
> Hello,
> Here is another version of the fix. This time, I added a
> complete_type_or_else call just before aggregate_value_p. Bootstrapped
> and tested on x86_64-pc-linux-gnu with no new regressions.  Ok to
> apply
> to trunk?
>  Thanks, Dinar.
>
>
> On Mon, Apr 7, 2014 at 11:47 PM, Jason Merrill  wrote:
>> On 04/07/2014 03:46 PM, Jason Merrill wrote:
>>>
>>> I guess we need to call complete_type before aggregate_value_p.
>>
>>
>> complete_type_or_else, actually.
>>
>> Jason
>>
>>
2014-04-18  Dinar Temirbulatov  

PR c++/57958
* semantics.c (apply_deduced_return_type): Complete non-void type
before estimating whether the type is aggregate.


fix.patch
Description: Binary data


Re: [PATCH] Do not run IPA transform phases multiple times

2014-04-18 Thread Martin Jambor
On Fri, Apr 18, 2014 at 01:49:36PM +0200, Jakub Jelinek wrote:
> It reproduces on x86_64 too, I guess the reason why you aren't seeing this
> is that you might have too old assembler that doesn't support
> avx2 instructions (you actually don't need avx2 hw to reproduce, any
> x86_64 or i686 just with gas that supports avx2 should be enough).
> 

I see, with that information I have managed to reproduce the failures
and now am convinced the patch (re-posted below) is indeed the correct
thing to do.  I am going to bootstrap it over the weekend (can't do it
earlier to test these testcases).  Honza, is it OK for trunk if it
passes?

Thanks,

Martin


2014-04-18  Martin Jambor  

* cgraphclones.c (cgraph_function_versioning): Copy
ipa_transforms_to_apply instead of asserting it is empty.

Index: src/gcc/cgraphclones.c
===
--- src.orig/gcc/cgraphclones.c
+++ src/gcc/cgraphclones.c
@@ -974,7 +974,9 @@ cgraph_function_versioning (struct cgrap
 cgraph_copy_node_for_versioning (old_version_node, new_decl,
 redirect_callers, bbs_to_copy);
 
-  gcc_assert (!old_version_node->ipa_transforms_to_apply.exists ());
+  if (old_version_node->ipa_transforms_to_apply.exists ())
+new_version_node->ipa_transforms_to_apply
+  = old_version_node->ipa_transforms_to_apply.copy ();
   /* Copy the OLD_VERSION_NODE function tree to the new version.  */
   tree_function_versioning (old_decl, new_decl, tree_map, false, args_to_skip,
skip_return, bbs_to_copy, new_entry_block);



Re: [PATCH, rs6000, 4.8, 4.9, trunk] Fix little endian behavior of vec_merge[hl] for V4SI/V4SF with VSX

2014-04-18 Thread David Edelsohn
On Thu, Apr 17, 2014 at 11:06 PM, Bill Schmidt
 wrote:
> Hi,
>
> I missed a case in the vector API work for little endian.  When VSX is
> enabled, the vec_mergeh and vec_mergel interfaces for 4x32 vectors are
> translated into xxmrghw and xxmrglw.  The patterns for these were not
> adjusted for little endian.  This patch fixes this and adds tests for
> V4SI and V4SF modes when VSX is available.
>
> Bootstrapped and tested on 4.8, 4.9, and trunk for
> powerpc64le-unknown-linux-gnu with no regressions.  Tests are still
> ongoing for powerpc64-unknown-linux-gnu.  Provided those complete
> without regressions, is this fix ok for trunk, 4.9, and 4.8?
>
> Thanks,
> Bill
>
>
> [gcc]
>
> 2014-04-17  Bill Schmidt  
>
> * config/rs6000/vsx.md (vsx_xxmrghw_): Adjust for
> little-endian.
> (vsx_xxmrglw_): Likewise.
>
> [gcc/testsuite]
>
> 2014-04-17  Bill Schmidt  
>
> * gcc.dg/vmx/merge-vsx.c: Add V4SI and V4SF tests.
> * gcc.dg/vmx/merge-vsx-be-order.c: Likewise.

This is okay for trunk, 4.9.1 and 4.8.3.  Note that the 4.9 branch is
frozen at the moment.

Thanks, David


[PATCH] Fix omp declare simd cloning (PR tree-optimization/60823)

2014-04-18 Thread Jakub Jelinek
Hi!

This patch fixes the adjustments performed by ipa_simd_modify_function_body
on omp declare simd clones.
Previously we've been trying to replace typically SSA_NAMEs with underlying
PARM_DECLs of the to be replaced arguments with loads/stores from/to
array refs (that will be hopefully vectorized) right around the referencing
stmt, but:
1) this can't really work well if there is any life range overlap in SSA_NAMEs
   with the same underlying PARM_DECL
2) PHIs weren't handled at all (neither PHI arguments, nor lhs of the PHIs)
3) for addressable PARM_DECLs the code pretty much assumed the same thing
   can be done too

This patch instead adjusts all SSA_NAMEs with SSA_NAME_VAR equal to the to
be replaced PARM_DECLs to a new underlying VAR_DECL, only changes the
(D) SSA_NAME to a load done at the start of the entry block, and for
addressable PARM_DECLs adjusts them such that they don't have to be
regimplified (as we replace say address of a PARM_DECL which is a
gimple_min_invariant with array ref with variable index which is not
gimple_min_invariant, we need to force the addresses into SSA_NAMEs).

The tree-dfa.c fix is what I've discovered while writing the patch,
if htab_find_slot_with_hash (..., NO_INSERT) fails to find something
in the hash table (most likely not actually needed by the patch, discovered
that just because the patch was buggy initially), it returns NULL rather
than address of some slot which will contain NULL.

Bootstrapped/regtested on x86_64-linux and i686-linux.

Richard, does this look reasonable?

2014-04-18  Jakub Jelinek  

PR tree-optimization/60823
* omp-low.c (ipa_simd_modify_function_body): Go through
all SSA_NAMEs and for those referring to vector arguments
which are going to be replaced adjust SSA_NAME_VAR and,
if it is a default definition, change it into a non-default
definition assigned at the beginning of function from new_decl.
(ipa_simd_modify_stmt_ops): Rewritten.
* tree-dfa.c (set_ssa_default_def): When removing default def,
check for NULL loc instead of NULL *loc.

* c-c++-common/gomp/pr60823-1.c: New test.
* c-c++-common/gomp/pr60823-2.c: New test.
* c-c++-common/gomp/pr60823-3.c: New test.

--- gcc/omp-low.c.jj2014-04-17 14:48:59.076025713 +0200
+++ gcc/omp-low.c   2014-04-18 12:00:16.666701773 +0200
@@ -11281,45 +11281,53 @@ static tree
 ipa_simd_modify_stmt_ops (tree *tp, int *walk_subtrees, void *data)
 {
   struct walk_stmt_info *wi = (struct walk_stmt_info *) data;
-  if (!SSA_VAR_P (*tp))
+  struct modify_stmt_info *info = (struct modify_stmt_info *) wi->info;
+  tree *orig_tp = tp;
+  if (TREE_CODE (*tp) == ADDR_EXPR)
+tp = &TREE_OPERAND (*tp, 0);
+  struct ipa_parm_adjustment *cand = NULL;
+  if (TREE_CODE (*tp) == PARM_DECL)
+cand = ipa_get_adjustment_candidate (&tp, NULL, info->adjustments, true);
+  else
 {
-  /* Make sure we treat subtrees as a RHS.  This makes sure that
-when examining the `*foo' in *foo=x, the `foo' get treated as
-a use properly.  */
-  wi->is_lhs = false;
-  wi->val_only = true;
   if (TYPE_P (*tp))
*walk_subtrees = 0;
-  return NULL_TREE;
-}
-  struct modify_stmt_info *info = (struct modify_stmt_info *) wi->info;
-  struct ipa_parm_adjustment *cand
-= ipa_get_adjustment_candidate (&tp, NULL, info->adjustments, true);
-  if (!cand)
-return NULL_TREE;
-
-  tree t = *tp;
-  tree repl = make_ssa_name (TREE_TYPE (t), NULL);
-
-  gimple stmt;
-  gimple_stmt_iterator gsi = gsi_for_stmt (info->stmt);
-  if (wi->is_lhs)
-{
-  stmt = gimple_build_assign (unshare_expr (cand->new_decl), repl);
-  gsi_insert_after (&gsi, stmt, GSI_SAME_STMT);
-  SSA_NAME_DEF_STMT (repl) = info->stmt;
 }
+
+  tree repl = NULL_TREE;
+  if (cand)
+repl = unshare_expr (cand->new_decl);
   else
 {
-  /* You'd think we could skip the extra SSA variable when
-wi->val_only=true, but we may have `*var' which will get
-replaced into `*var_array[iter]' and will likely be something
-not gimple.  */
-  stmt = gimple_build_assign (repl, unshare_expr (cand->new_decl));
-  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+  if (tp != orig_tp)
+   {
+ *walk_subtrees = 0;
+ bool modified = info->modified;
+ info->modified = false;
+ walk_tree (tp, ipa_simd_modify_stmt_ops, wi, wi->pset);
+ if (!info->modified)
+   {
+ info->modified = modified;
+ return NULL_TREE;
+   }
+ info->modified = modified;
+ repl = *tp;
+   }
+  else
+   return NULL_TREE;
 }
 
-  if (!useless_type_conversion_p (TREE_TYPE (*tp), TREE_TYPE (repl)))
+  if (tp != orig_tp)
+{
+  repl = build_fold_addr_expr (repl);
+  gimple stmt
+   = gimple_build_assign (make_ssa_name (TREE_TYPE (repl), NULL), repl);
+  repl = gimple_assign_lhs (stmt);
+   

Re: [PATCH] Introduce MODE_SIZE mode attribute

2014-04-18 Thread H.J. Lu
On Mon, Jan 6, 2014 at 7:20 AM, Jakub Jelinek  wrote:
> On Sat, Jan 04, 2014 at 12:37:57AM +0100, Jakub Jelinek wrote:
>> That is certainly doable (as attached), but strangely if the patch (that I've
>> already committed) is reverted and this one applied, the .text savings are
>> much smaller.
>>
>> Here are .text and .rodata readelf -WS lines from x86_64 (first 4 pairs) and
>> i686 (last 4 pairs) builds, always vanilla trunk before r206312, that plus
>> r206312 patch, without r206312 but with attached patch, with both r206312
>> and attached patch.  So, for .text size the best is both patches, but
>> for .rodata patches just r206312.  I'll try to look at details why this is so
>> next week.
>
> The difference is I think caused by the way gencondition.c works.
> As the array with the conditions is a toplevel array, __builtin_constant_p
> is folded there already during the parsing, after folding the conditions.
>

For some reason, it triggered:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60887

-- 
H.J.


Re: [Ping] [Patch C++] RFC Fix PR57958

2014-04-18 Thread Jason Merrill

On 04/18/2014 08:53 AM, Dinar Temirbulatov wrote:

+  if (TREE_CODE (TREE_TYPE(result)) != VOID_TYPE)


Space after TREE_TYPE.  OK with that change.

Jason



Re: [Ping] [Patch C++] RFC Fix PR57958

2014-04-18 Thread Paolo Carlini


Hi,

On 18 aprile 2014 17:20:40 CEST, Jason Merrill  wrote:
>On 04/18/2014 08:53 AM, Dinar Temirbulatov wrote:
>> +  if (TREE_CODE (TREE_TYPE(result)) != VOID_TYPE)
>
>Space after TREE_TYPE.  OK with that change.

I'm traveling and I can't really check, but I seem to remember that we do have 
a VOID_TYPE_P.

Paolo



Re: [Ping] [Patch C++] RFC Fix PR57958

2014-04-18 Thread Jason Merrill

On 04/18/2014 11:39 AM, Paolo Carlini wrote:

I'm traveling and I can't really check, but I seem to remember that we do have 
a VOID_TYPE_P.


True, I suppose that would be preferable.

Jason




[COMMITTED] Fix silly error in aarch64_register_move_cost

2014-04-18 Thread Richard Henderson
Building mainline I got

> .../aarch64.c:4879:134: error: invalid conversion from ‘reg_class_t {aka 
> int}’ to ‘machine_mode’ [-fpermissive]
>if (! TARGET_SIMD && GET_MODE_SIZE (from) == 128 && GET_MODE_SIZE (to) == 
> 128)

Sure enough, TO and FROM are not modes.  Did mainline just change away from
permissive or something?  It surely seems like we ought to have seen this
earlier...

Anyway, applied as obvious to all active branches.


r~
* config/aarch64/aarch64.c (aarch64_register_move_cost): Pass a mode
to GET_MODE_SIZE, not a reg_class_t.



diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index a3147ee..7b6c2b3 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -4847,9 +4847,11 @@ aarch64_address_cost (rtx x ATTRIBUTE_UNUSED,
 }
 
 static int
-aarch64_register_move_cost (enum machine_mode mode ATTRIBUTE_UNUSED,
-   reg_class_t from, reg_class_t to)
+aarch64_register_move_cost (enum machine_mode mode,
+   reg_class_t from_i, reg_class_t to_i)
 {
+  enum reg_class from = (enum reg_class) from_i;
+  enum reg_class to = (enum reg_class) to_i;
   const struct cpu_regmove_cost *regmove_cost
 = aarch64_tune_params->regmove_cost;
 
@@ -4875,8 +4877,7 @@ aarch64_register_move_cost (enum machine_mode mode 
ATTRIBUTE_UNUSED,
  secondary reload.  A general register is used as a scratch to move
  the upper DI value and the lower DI value is moved directly,
  hence the cost is the sum of three moves. */
-
-  if (! TARGET_SIMD && GET_MODE_SIZE (from) == 128 && GET_MODE_SIZE (to) == 
128)
+  if (! TARGET_SIMD && GET_MODE_SIZE (mode) == 128)
 return regmove_cost->GP2FP + regmove_cost->FP2GP + regmove_cost->FP2FP;
 
   return regmove_cost->FP2FP;


Re: [RFC] Add aarch64 support for ada

2014-04-18 Thread Richard Henderson
On 04/16/2014 12:55 AM, Eric Botcazou wrote:
>> Similarly with the HAVE_GNAT_ALTERNATE_STACK stuff.  There aren't any
>> linux hosts that don't support sigaltstack, so why is this
>> conditionalized?
> 
> Hum, I didn't know that Android also used the alternate stack...  OK, let's 
> use it unconditionally on Linux then, except for IA-64 which is a totally 
> different beast.  Can you change the patch accordingly?
> 

How about this?  I added a check vs MINSIGSTKSZ just in case, and updated the
commentary a bit.  While 16K is 2*SIGSTKSZ for i686, it certainly isn't for
powerpc64.  But since things are working as-is, I thought the revision is 
clearer.

Ok?


r~
* init.c [__linux__] (HAVE_GNAT_ALTERNATE_STACK): New define.
(__gnat_alternate_stack): Enable for all linux except ia64.


diff --git a/gcc/ada/init.c b/gcc/ada/init.c
index c3824ab..48319d6 100644
--- a/gcc/ada/init.c
+++ b/gcc/ada/init.c
@@ -556,9 +556,14 @@ __gnat_error_handler (int sig, siginfo_t *si 
ATTRIBUTE_UNUSED, void *ucontext)
   Raise_From_Signal_Handler (exception, msg);
 }
 
-#if defined (i386) || defined (__x86_64__) || defined (__powerpc__)
-/* This must be in keeping with System.OS_Interface.Alternate_Stack_Size.  */
-char __gnat_alternate_stack[16 * 1024]; /* 2 * SIGSTKSZ */
+#ifndef __ia64__
+#define HAVE_GNAT_ALTERNATE_STACK 1
+/* This must be in keeping with System.OS_Interface.Alternate_Stack_Size.
+   It must be larger than MINSIGSTKSZ and hopefully near 2 * SIGSTKSZ.  */
+# if 16 * 1024 < MINSIGSTKSZ
+#  error "__gnat_alternate_stack too small"
+# endif
+char __gnat_alternate_stack[16 * 1024];
 #endif
 
 #ifdef __XENO__
@@ -612,7 +617,7 @@ __gnat_install_handler (void)
 sigaction (SIGBUS,  &act, NULL);
   if (__gnat_get_interrupt_state (SIGSEGV) != 's')
 {
-#if defined (i386) || defined (__x86_64__) || defined (__powerpc__)
+#ifdef HAVE_GNAT_ALTERNATE_STACK
   /* Setup an alternate stack region for the handler execution so that
 stack overflows can be handled properly, avoiding a SEGV generation
 from stack usage by the handler itself.  */


LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Aaron Sawdey

Honza,
  Seeing your recent patches relating to inliner heuristics for LTO, I 
thought I should mention some related work I'm doing.


By way of introduction, I've recently joined the IBM LTC's PPC Toolchain 
team, working on gcc performance.


We have not generally seen good results using LTO on IBM power processors 
and one of the problems seems to be excessive inlining that results in the 
generation of excessive spill code. So, I have set out to tackle this by 
doing some analysis at the time of the inliner pass to compute something 
analogous to register pressure, which is then used to shut down inlining of 
routines that have a lot of pressure.


The analysis is basically a liveness analysis on the SSA names per basic 
block and looking for the maximum number live in any block. I've been using 
"liveness pressure" as a shorthand name for this.


This can then be used in two ways.
1) want_inline_function_to_all_callers_p at present always says to inline 
things that have only one call site without regard to size or what this may 
do to the register allocator downstream. In particular, BZ2_decompress in 
bzip2 gets inlined and this causes the pressure reported downstream for the 
int register class to increase 10x. Looking at some combination of pressure 
in caller/callee may help avoid this kind of situation.
2) I also want to experiment with adding the liveness pressure in the 
callee into the badness calculation in edge_badness used by 
inline_small_functions. The idea here is to try to inline functions that 
are less likely to cause register allocator difficulty downstream first.


I am just at the point of getting a prototype working, I will get a patch 
you could take a look at posted next week. In the meantime, do you have any 
comments or feedback?


Thanks,
   Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Jan Hubicka
Hello,
> Honza,
>   Seeing your recent patches relating to inliner heuristics for LTO,
> I thought I should mention some related work I'm doing.
> 
> By way of introduction, I've recently joined the IBM LTC's PPC
> Toolchain team, working on gcc performance.
> 
> We have not generally seen good results using LTO on IBM power
> processors and one of the problems seems to be excessive inlining
> that results in the generation of excessive spill code. So, I have
> set out to tackle this by doing some analysis at the time of the
> inliner pass to compute something analogous to register pressure,
> which is then used to shut down inlining of routines that have a lot
> of pressure.

This is interesting.  I sort of planned to add register pressure logic
but always thought it is somewhat hard to do at the GIMPLE level in a way
that would work for all CPUs.
> 
> The analysis is basically a liveness analysis on the SSA names per
> basic block and looking for the maximum number live in any block.
> I've been using "liveness pressure" as a shorthand name for this.

I believe this is usually called width
> 
> This can then be used in two ways.
> 1) want_inline_function_to_all_callers_p at present always says to
> inline things that have only one call site without regard to size or
> what this may do to the register allocator downstream. In
> particular, BZ2_decompress in bzip2 gets inlined and this causes the
> pressure reported downstream for the int register class to increase
> 10x. Looking at some combination of pressure in caller/callee may
> help avoid this kind of situation.
> 2) I also want to experiment with adding the liveness pressure in
> the callee into the badness calculation in edge_badness used by
> inline_small_functions. The idea here is to try to inline functions
> that are less likely to cause register allocator difficulty
> downstream first.

Sounds interesting.  I am very curious if you can get consistent improvements
with this.  I only implemented logic for large stack frames, but in C++ code
it often seems to do more harm than good.

If you find examples of bad inlining, can you also file them in bugzilla?
Perhaps the individual cases could be handled better by improving IRA.

Honza
> 
> I am just at the point of getting a prototype working, I will get a
> patch you could take a look at posted next week. In the meantime, do
> you have any comments or feedback?
> 
> Thanks,
>Aaron
> 
> -- 
> Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
> 050-2/C113  (507) 253-7520 home: 507/263-0782
> IBM Linux Technology Center - PPC Toolchain


Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Xinliang David Li
Do you witness similar problems with LTO +FDO?

My concern is it can be tricky to get the register pressure estimate
right. The register pressure problem is created by downstream
components (code motions etc) but only exposed by the inliner.  If you
want to get it 'right' (i.e., not exposing the problems), you will
need to bake the knowledge of the downstream components (possibly
bugs) into the analysis which might not be a good thing to do longer
term.

David

On Fri, Apr 18, 2014 at 9:43 AM, Aaron Sawdey
 wrote:
> Honza,
>   Seeing your recent patches relating to inliner heuristics for LTO, I
> thought I should mention some related work I'm doing.
>
> By way of introduction, I've recently joined the IBM LTC's PPC Toolchain
> team, working on gcc performance.
>
> We have not generally seen good results using LTO on IBM power processors
> and one of the problems seems to be excessive inlining that results in the
> generation of excessive spill code. So, I have set out to tackle this by
> doing some analysis at the time of the inliner pass to compute something
> analogous to register pressure, which is then used to shut down inlining of
> routines that have a lot of pressure.
>
> The analysis is basically a liveness analysis on the SSA names per basic
> block and looking for the maximum number live in any block. I've been using
> "liveness pressure" as a shorthand name for this.
>
> This can then be used in two ways.
> 1) want_inline_function_to_all_callers_p at present always says to inline
> things that have only one call site without regard to size or what this may
> do to the register allocator downstream. In particular, BZ2_decompress in
> bzip2 gets inlined and this causes the pressure reported downstream for the
> int register class to increase 10x. Looking at some combination of pressure
> in caller/callee may help avoid this kind of situation.
> 2) I also want to experiment with adding the liveness pressure in the callee
> into the badness calculation in edge_badness used by inline_small_functions.
> The idea here is to try to inline functions that are less likely to cause
> register allocator difficulty downstream first.
>
> I am just at the point of getting a prototype working, I will get a patch
> you could take a look at posted next week. In the meantime, do you have any
> comments or feedback?
>
> Thanks,
>Aaron
>
> --
> Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
> 050-2/C113  (507) 253-7520 home: 507/263-0782
> IBM Linux Technology Center - PPC Toolchain
>


Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Xinliang David Li
On Fri, Apr 18, 2014 at 10:26 AM, Jan Hubicka  wrote:
> Hello,
>> Honza,
>>   Seeing your recent patches relating to inliner heuristics for LTO,
>> I thought I should mention some related work I'm doing.
>>
>> By way of introduction, I've recently joined the IBM LTC's PPC
>> Toolchain team, working on gcc performance.
>>
>> We have not generally seen good results using LTO on IBM power
>> processors and one of the problems seems to be excessive inlining
>> that results in the generation of excessive spill code. So, I have
>> set out to tackle this by doing some analysis at the time of the
>> inliner pass to compute something analogous to register pressure,
>> which is then used to shut down inlining of routines that have a lot
>> of pressure.
>
> This is interesting.  I sort of planned to add register pressure logic
> but always thought it is somewhat hard to do at the GIMPLE level in a way
> that would work for all CPUs.
>>
>> The analysis is basically a liveness analysis on the SSA names per
>> basic block and looking for the maximum number live in any block.
>> I've been using "liveness pressure" as a shorthand name for this.
>
> I believe this is usually called width
>>
>> This can then be used in two ways.
>> 1) want_inline_function_to_all_callers_p at present always says to
>> inline things that have only one call site without regard to size or
>> what this may do to the register allocator downstream. In
>> particular, BZ2_decompress in bzip2 gets inlined and this causes the
>> pressure reported downstream for the int register class to increase
>> 10x. Looking at some combination of pressure in caller/callee may
>> help avoid this kind of situation.
>> 2) I also want to experiment with adding the liveness pressure in
>> the callee into the badness calculation in edge_badness used by
>> inline_small_functions. The idea here is to try to inline functions
>> that are less likely to cause register allocator difficulty
>> downstream first.
>
> Sounds interesting.  I am very curious if you can get consistent improvements
> with this.  I only implemented logic for large stack frames, but in C++ code
> it seems often to do more harm than good.
>
> If you find examples of bad inlining, can you also fill it into bugzilla?
> Perhaps the individual cases could be handled better by improving IRA.

yes -- I think this is the right time to do regardless.

David


>
> Honza
>>
>> I am just at the point of getting a prototype working, I will get a
>> patch you could take a look at posted next week. In the meantime, do
>> you have any comments or feedback?
>>
>> Thanks,
>>Aaron
>>
>> --
>> Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
>> 050-2/C113  (507) 253-7520 home: 507/263-0782
>> IBM Linux Technology Center - PPC Toolchain


C++ PATCH for c++/60872 (bogus error converting pointer to restricted pointer)

2014-04-18 Thread Jason Merrill
When building up the internal representation of a conversion, G++ was 
trying to apply 'restrict' to void, which doesn't make sense.


Tested x86_64-pc-linux-gnu, applying to trunk.
commit 53c6cb04ee081b959788da69d36b8d4db1e3d442
Author: Jason Merrill 
Date:   Thu Apr 17 09:22:15 2014 -0400

	PR c++/60872
	* call.c (standard_conversion): Don't try to apply restrict to void.

diff --git a/gcc/cp/call.c b/gcc/cp/call.c
index 7dbe935..fbd2f83 100644
--- a/gcc/cp/call.c
+++ b/gcc/cp/call.c
@@ -1196,9 +1196,10 @@ standard_conversion (tree to, tree from, tree expr, bool c_cast_p,
 	   && TREE_CODE (TREE_TYPE (from)) != FUNCTION_TYPE)
 	{
 	  tree nfrom = TREE_TYPE (from);
+	  /* Don't try to apply restrict to void.  */
+	  int quals = cp_type_quals (nfrom) & ~TYPE_QUAL_RESTRICT;
 	  from = build_pointer_type
-	(cp_build_qualified_type (void_type_node, 
-			  cp_type_quals (nfrom)));
+	(cp_build_qualified_type (void_type_node, quals));
 	  conv = build_conv (ck_ptr, from, conv);
 	}
   else if (TYPE_PTRDATAMEM_P (from))
diff --git a/gcc/testsuite/g++.dg/ext/restrict2.C b/gcc/testsuite/g++.dg/ext/restrict2.C
new file mode 100644
index 000..f053210
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/restrict2.C
@@ -0,0 +1,8 @@
+// PR c++/60872
+// { dg-options "" }
+
+typedef double *__restrict T;
+void f(T* p)
+{
+  void *p2 = p;
+}


Re: C++ PATCH for DR 1571 (reference binding)

2014-04-18 Thread Jason Merrill

On 02/25/2014 04:27 PM, Jason Merrill wrote:

Getting the reference binding rules for C++11 right (in the standard)
has taken quite a few iterations.  I'm pretty happy with the latest
wording, which deals with user-defined conversions by recursing on the
result of the conversion.  This patch implements those rules.


The earlier patch broke Firefox and was reverted; this patch avoids that 
regression.


Tested x86_64-pc-linux-gnu, applying to trunk.

commit 17d24bce33f71316e2b1f0a4a96deb29d09de0be
Author: jason 
Date:   Tue Feb 25 21:27:51 2014 +

	DR 1571
	* call.c (reference_binding): Recurse on user-defined conversion.

diff --git a/gcc/cp/call.c b/gcc/cp/call.c
index fbd2f83..8c55c32 100644
--- a/gcc/cp/call.c
+++ b/gcc/cp/call.c
@@ -1684,20 +1684,30 @@ reference_binding (tree rto, tree rfrom, tree expr, bool c_cast_p, int flags,
   if (!conv)
 return NULL;
 
+  if (conv->user_conv_p)
+{
+  /* If initializing the temporary used a conversion function,
+	 recalculate the second conversion sequence.  */
+  for (conversion *t = conv; t; t = next_conversion (t))
+	if (t->kind == ck_user
+	&& DECL_CONV_FN_P (t->cand->fn))
+	  {
+	tree ftype = TREE_TYPE (TREE_TYPE (t->cand->fn));
+	int sflags = (flags|LOOKUP_NO_CONVERSION)&~LOOKUP_NO_TEMP_BIND;
+	conversion *new_second
+	  = reference_binding (rto, ftype, NULL_TREE, c_cast_p,
+   sflags, complain);
+	if (!new_second)
+	  return NULL;
+	return merge_conversion_sequences (t, new_second);
+	  }
+}
+
   conv = build_conv (ck_ref_bind, rto, conv);
   /* This reference binding, unlike those above, requires the
  creation of a temporary.  */
   conv->need_temporary_p = true;
-  if (TYPE_REF_IS_RVALUE (rto))
-{
-  conv->rvaluedness_matches_p = 1;
-  /* In the second case, if the reference is an rvalue reference and
-	 the second standard conversion sequence of the user-defined
-	 conversion sequence includes an lvalue-to-rvalue conversion, the
-	 program is ill-formed.  */
-  if (conv->user_conv_p && next_conversion (conv)->kind == ck_rvalue)
-	conv->bad_p = 1;
-}
+  conv->rvaluedness_matches_p = TYPE_REF_IS_RVALUE (rto);
 
   return conv;
 }
diff --git a/gcc/testsuite/g++.dg/cpp0x/overload3.C b/gcc/testsuite/g++.dg/cpp0x/overload3.C
index 2d95783..0eecabd 100644
--- a/gcc/testsuite/g++.dg/cpp0x/overload3.C
+++ b/gcc/testsuite/g++.dg/cpp0x/overload3.C
@@ -13,5 +13,5 @@ struct wrap
 int main()
 {
   wrap w;
-  f(w);// { dg-error "lvalue" }
+  f(w);// { dg-error "" }
 }
diff --git a/gcc/testsuite/g++.dg/cpp0x/rv-init1.C b/gcc/testsuite/g++.dg/cpp0x/rv-init1.C
new file mode 100644
index 000..2e8d4f7
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/rv-init1.C
@@ -0,0 +1,26 @@
+// Core DR 1604/1571/1572
+// { dg-require-effective-target c++11 }
+
+struct Banana { };
+struct Enigma { operator const Banana(); };
+struct Doof { operator Banana&(); };
+void enigmatic() {
+  typedef const Banana ConstBanana;
+  Banana &&banana1 = ConstBanana(); // { dg-error "" }
+  Banana &&banana2 = Enigma();  // { dg-error "" }
+  Banana &&banana3 = Doof();// { dg-error "" }
+}
+
+class A {
+public:
+  operator volatile int &();
+};
+A a;
+
+const int & ir1a = a.operator volatile int&(); // { dg-error "" }
+const int & ir2a = a;			   // { dg-error "" }
+
+struct X {
+  operator int&();
+} x;
+int&& rri2 = X();		// { dg-error "" }


Re: calloc = malloc + memset

2014-04-18 Thread Marc Glisse

Thanks for the comments!

On Fri, 18 Apr 2014, Jakub Jelinek wrote:


The passes.def change makes me a little bit nervous, but if it works,
perhaps.


Would you prefer running the pass twice? I thought there would be less 
resistance to moving the pass than duplicating it. By the way, I think 
even passes we run only once should have the required functions 
implemented so they can be run several times (at least most of them), in 
case users want to do that in plugins. I was surprised when I tried adding 
a second strlen pass and the compiler refused.



--- gcc/testsuite/g++.dg/tree-ssa/calloc.C  (revision 0)
+++ gcc/testsuite/g++.dg/tree-ssa/calloc.C  (working copy)
@@ -0,0 +1,35 @@
+/* { dg-do compile { target c++11 } } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+#include 
+#include 
+#include 
+
+void g(void*);
+inline void* operator new(std::size_t sz)
+{
+  void *p;
+
+  if (sz == 0)
+sz = 1;
+
+  // Slightly modified from the libsupc++ version, that one has 2 calls
+  // to malloc which makes it too hard to optimize.
+  while ((p = std::malloc (sz)) == 0)
+{
+  std::new_handler handler = std::get_new_handler ();
+  if (! handler)
+throw std::bad_alloc();
+  handler ();
+}
+  return p;
+}
+
+void f(void*p,int n){
+  new(p)std::vector(n);
+}
+
+/* { dg-final { scan-tree-dump-times "calloc" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-not "malloc" "optimized" } } */
+/* { dg-final { scan-tree-dump-not "memset" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */


This looks to me way too much fragile, any time the libstdc++
or glibc headers change a little bit, you might need to adjust the
dg-final directives.  Much better would be if you just provided
the prototypes yourself and subset of the std::vector you really need for
the testcase.  You can throw some class or int, it doesn't have to be
std::bad_alloc, etc.


I don't understand what seems so fragile to you. There is a single 
function in the .optimized dump, which just calls calloc in a loop. It 
doesn't seem that likely that a change in glibc/libstdc++ would make an 
extra memset pop up. A change in libstdc++ could easily prevent the 
optimization completely (I'd like to hope we can avoid that, half of the 
purpose of the testcase was making sure libstdc++ didn't change in a bad 
way), but I don't really see how it could keep it in a way that requires 
tweaking dg-final.


While trying to write a standalone version, I again hit many missed 
optimizations, getting such nice things in the .optimized dump as:


  _12 = p_13 + sz_7;
  if (_12 != p_13)

or:

  _12 = p_13 + sz_7;
  _30 = (unsigned long) _12;
  _9 = p_13 + 4;
  _10 = (unsigned long) _9;
  _11 = _30 - _10;
  _22 = _11 /[ex] 4;
  _21 = _22;
  _40 = _21 + 1;
  _34 = _40 * 4;

It is embarrassing... I hope the combiner GSoC will work well and we can 
just add a dozen patterns to handle this before 4.10.



--- gcc/testsuite/gcc.dg/strlenopt-9.c  (revision 208772)
+++ gcc/testsuite/gcc.dg/strlenopt-9.c  (working copy)
@@ -11,21 +11,21 @@ fn1 (int r)
  optimized away.  */
   return strchr (p, '\0');
 }

 __attribute__((noinline, noclone)) size_t
 fn2 (int r)
 {
   char *p, q[10];
   strcpy (q, "abc");
   p = r ? "a" : q;
-  /* String length for p varies, therefore strlen below isn't
+  /* String length is constant for both alternatives, and strlen is
  optimized away.  */
   return strlen (p);


Is this because of jump threading?


It is PRE that turns:

  if (r_4(D) == 0)
goto ;
  else
goto ;

  :
  goto ;

  :

  :
  # p_1 = PHI <&q(5), "a"(3)>
  _5 = __builtin_strlen (p_1);

into:

  if (r_4(D) == 0)
goto ;
  else
goto ;

  :
  _7 = __builtin_strlen (&q);
  pretmp_8 = _7;
  goto ;

  :

  :
  # p_1 = PHI <&q(5), "a"(3)>
  # prephitmp_9 = PHI 
  _5 = prephitmp_9;

It says:

Found partial redundancy for expression 
{call_expr<__builtin_strlen>,p_1}@.MEM_3 (0005)



--- gcc/testsuite/gcc.dg/tree-ssa/calloc-1.c(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/calloc-1.c(working copy)
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+#include 
+#include 


Even this I find unsafe.  The strlenopt*.c tests use its custom
strlenopt.h header for a reason, you might just add a calloc
prototype in there and use that header.


Might as well use __builtin_* then.


+/* Handle a call to malloc or calloc.  */
+
+static void
+handle_builtin_malloc (enum built_in_function bcode, gimple_stmt_iterator *gsi)
+{
+  gimple stmt = gsi_stmt (*gsi);
+  tree lhs = gimple_call_lhs (stmt);
+  gcc_assert (get_stridx (lhs) == 0);
+  int idx = new_stridx (lhs);
+  tree length = NULL_TREE;
+  if (bcode == BUILT_IN_CALLOC)
+length = build_int_cst (size_type_node, 0);


Is this safe?  I mean, if you call int a = 0; ptr = calloc (a, n);
or ptr = calloc (n, a); or ptr = calloc (0, 0); etc., then there is no
zero termination.  Sure, calling strlen or strchr etc. on th

Inliner heuristics TLC 2/n - dump stats about inliner behaviour

2014-04-18 Thread Jan Hubicka
Hi,
this patch adds dumping of some summary stats that I used to generate
data for 
http://hubicka.blogspot.ca/2014/04/devirtualization-in-c-part-5-feedback.html
and found relatively enlightening, so I think they are useful for mainline,
too.

Bootstrapped/regtested x86_64-linux, committed.

* ipa-inline.c (spec_rem): New static variable.
(dump_overall_stats): New function.
(dump_inline_stats): New function.
Index: ipa-inline.c
===
--- ipa-inline.c(revision 209490)
+++ ipa-inline.c(working copy)
@@ -127,6 +127,7 @@ along with GCC; see the file COPYING3.
 static int overall_size;
 static gcov_type max_count;
 static sreal max_count_real, max_relbenefit_real, half_int_min_real;
+static gcov_type spec_rem;
 
 /* Return false when inlining edge E would lead to violating
limits on function unit growth or stack usage growth.  
@@ -1533,6 +1534,7 @@ resolve_noninline_speculation (fibheap_t
  ? node->global.inlined_to : node;
   bitmap updated_nodes = BITMAP_ALLOC (NULL);
 
+  spec_rem += edge->count;
   cgraph_resolve_speculation (edge, NULL);
   reset_edge_caches (where);
   inline_update_overall_summary (where);
@@ -1996,6 +1998,130 @@ inline_to_all_callers (struct cgraph_nod
   return false;
 }
 
+/* Output overall time estimate.  */
+static void
+dump_overall_stats (void)
+{
+  HOST_WIDEST_INT sum_weighted = 0, sum = 0;
+  struct cgraph_node *node;
+
+  FOR_EACH_DEFINED_FUNCTION (node)
+if (!node->global.inlined_to
+   && !node->alias)
+  {
+   int time = inline_summary (node)->time;
+   sum += time;
+   sum_weighted += time * node->count;
+  }
+  fprintf (dump_file, "Overall time estimate: "
+  HOST_WIDEST_INT_PRINT_DEC" weighted by profile: "
+  HOST_WIDEST_INT_PRINT_DEC"\n", sum, sum_weighted);
+}
+
+/* Output some useful stats about inlining.  */
+
+static void
+dump_inline_stats (void)
+{
+  HOST_WIDEST_INT inlined_cnt = 0, inlined_indir_cnt = 0;
+  HOST_WIDEST_INT inlined_virt_cnt = 0, inlined_virt_indir_cnt = 0;
+  HOST_WIDEST_INT noninlined_cnt = 0, noninlined_indir_cnt = 0;
+  HOST_WIDEST_INT noninlined_virt_cnt = 0, noninlined_virt_indir_cnt = 0;
+  HOST_WIDEST_INT  inlined_speculative = 0, inlined_speculative_ply = 0;
+  HOST_WIDEST_INT indirect_poly_cnt = 0, indirect_cnt = 0;
+  HOST_WIDEST_INT reason[CIF_N_REASONS][3];
+  int i;
+  struct cgraph_node *node;
+
+  memset (reason, 0, sizeof (reason));
+  FOR_EACH_DEFINED_FUNCTION (node)
+  {
+struct cgraph_edge *e;
+for (e = node->callees; e; e = e->next_callee)
+  {
+   if (e->inline_failed)
+ {
+   reason[(int) e->inline_failed][0] += e->count;
+   reason[(int) e->inline_failed][1] += e->frequency;
+   reason[(int) e->inline_failed][2] ++;
+   if (DECL_VIRTUAL_P (e->callee->decl))
+ {
+   if (e->indirect_inlining_edge)
+ noninlined_virt_indir_cnt += e->count;
+   else
+ noninlined_virt_cnt += e->count;
+ }
+   else
+ {
+   if (e->indirect_inlining_edge)
+ noninlined_indir_cnt += e->count;
+   else
+ noninlined_cnt += e->count;
+ }
+ }
+   else
+ {
+   if (e->speculative)
+ {
+   if (DECL_VIRTUAL_P (e->callee->decl))
+ inlined_speculative_ply += e->count;
+   else
+ inlined_speculative += e->count;
+ }
+   else if (DECL_VIRTUAL_P (e->callee->decl))
+ {
+   if (e->indirect_inlining_edge)
+ inlined_virt_indir_cnt += e->count;
+   else
+ inlined_virt_cnt += e->count;
+ }
+   else
+ {
+   if (e->indirect_inlining_edge)
+ inlined_indir_cnt += e->count;
+   else
+ inlined_cnt += e->count;
+ }
+ }
+  }
+for (e = node->indirect_calls; e; e = e->next_callee)
+  if (e->indirect_info->polymorphic)
+   indirect_poly_cnt += e->count;
+  else
+   indirect_cnt += e->count;
+  }
+  if (max_count)
+{
+  fprintf (dump_file,
+  "Inlined " HOST_WIDEST_INT_PRINT_DEC " + speculative "
+  HOST_WIDEST_INT_PRINT_DEC " + speculative polymorphic "
+  HOST_WIDEST_INT_PRINT_DEC " + previously indirect "
+  HOST_WIDEST_INT_PRINT_DEC " + virtual "
+  HOST_WIDEST_INT_PRINT_DEC " + virtual and previously indirect "
+  HOST_WIDEST_INT_PRINT_DEC "\n" "Not inlined "
+  HOST_WIDEST_INT_PRINT_DEC " + previously indirect "
+  HOST_WIDEST_INT_PRINT_DEC " + virtual "
+  HOST_WIDEST_INT_PRINT_DEC " + virtual and previously indirect "
+   

Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Aaron Sawdey
On Fri, 2014-04-18 at 19:26 +0200, Jan Hubicka wrote:
> This is interesting.  I sort of planned to add register pressure logic
> but always thought it is somewhat hard to do at the GIMPLE level in a way
> that would work for all CPUs.

Yes, this is just meant to try to measure something that is
representative of register pressure. Different architectures would
probably want different thresholds for this.

> If you find examples of bad inlining, can you also fill it into bugzilla?
> Perhaps the individual cases could be handled better by improving IRA.

Yes, I can do that for the case that's happening in bzip2.

  Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Aaron Sawdey
What I've observed on power is that LTO alone reduces performance and
LTO+FDO is not significantly different than FDO alone.

I agree that an exact estimate of the register pressure would be a
difficult problem. I'm hoping that something that approximates potential
register pressure downstream will be sufficient to help inlining
decisions. 

  Aaron

On Fri, 2014-04-18 at 10:36 -0700, Xinliang David Li wrote:
> Do you witness similar problems with LTO +FDO?
> 
> My concern is it can be tricky to get the register pressure estimate
> right. The register pressure problem is created by downstream
> components (code motions etc) but only exposed by the inliner.  If you
> want to get it 'right' (i.e., not exposing the problems), you will
> need to bake the knowledge of the downstream components (possibly
> bugs) into the analysis which might not be a good thing to do longer
> term.
> 
> David
> 
> On Fri, Apr 18, 2014 at 9:43 AM, Aaron Sawdey
>  wrote:
> > Honza,
> >   Seeing your recent patches relating to inliner heuristics for LTO, I
> > thought I should mention some related work I'm doing.
> >
> > By way of introduction, I've recently joined the IBM LTC's PPC Toolchain
> > team, working on gcc performance.
> >
> > We have not generally seen good results using LTO on IBM power processors
> > and one of the problems seems to be excessive inlining that results in the
> > generation of excessive spill code. So, I have set out to tackle this by
> > doing some analysis at the time of the inliner pass to compute something
> > analogous to register pressure, which is then used to shut down inlining of
> > routines that have a lot of pressure.
> >
> > The analysis is basically a liveness analysis on the SSA names per basic
> > block and looking for the maximum number live in any block. I've been using
> > "liveness pressure" as a shorthand name for this.
> >
> > This can then be used in two ways.
> > 1) want_inline_function_to_all_callers_p at present always says to inline
> > things that have only one call site without regard to size or what this may
> > do to the register allocator downstream. In particular, BZ2_decompress in
> > bzip2 gets inlined and this causes the pressure reported downstream for the
> > int register class to increase 10x. Looking at some combination of pressure
> > in caller/callee may help avoid this kind of situation.
> > 2) I also want to experiment with adding the liveness pressure in the callee
> > into the badness calculation in edge_badness used by inline_small_functions.
> > The idea here is to try to inline functions that are less likely to cause
> > register allocator difficulty downstream first.
> >
> > I am just at the point of getting a prototype working, I will get a patch
> > you could take a look at posted next week. In the meantime, do you have any
> > comments or feedback?
> >
> > Thanks,
> >Aaron
> >
> > --
> > Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
> > 050-2/C113  (507) 253-7520 home: 507/263-0782
> > IBM Linux Technology Center - PPC Toolchain
> >
> 

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



[RFC][PATCH] RL78 - Add predicates to reduce code bloat when accessing volatile memory.

2014-04-18 Thread Richard Hulme

Hi,

This patch adds predicates (taken and renamed from the MSP430 backend) 
to allow more efficient code generation when accessing volatile memory, 
turning this (for example):


movw r10, #240
movw ax, r10
movw hl, ax
mov a, [hl]
or  a, #32
mov [hl], a

into this:

mov a, !240
or  a, #32
mov !240, a

Regards,

Richard.

2014-04-18  Richard Hulme  

* config/rl78/predicates.md (rl78_volatile_memory_operand): New
  (rl78_general_operand): New
  (rl78_nonimmediate_operand): New
  (rl78_any_operand): Now includes volatile memory
  (rl78_nonfar_operand): Likewise
  (rl78_nonfar_nonimm_operand): Likewise

---
 gcc/config/rl78/predicates.md |   26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/gcc/config/rl78/predicates.md b/gcc/config/rl78/predicates.md
index e564f43..29e3922 100644
--- a/gcc/config/rl78/predicates.md
+++ b/gcc/config/rl78/predicates.md
@@ -18,18 +18,38 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; .
 
-(define_predicate "rl78_any_operand"
+(define_predicate "rl78_volatile_memory_operand"
+  (and (match_code "mem")
+   (match_test ("memory_address_addr_space_p (GET_MODE (op), XEXP (op, 0), MEM_ADDR_SPACE (op))")))
+)
+
+; TRUE for any valid general operand.  We do this because
+; general_operand refuses to match volatile memory refs.
+
+(define_predicate "rl78_general_operand"
   (ior (match_operand 0 "general_operand")
+   (match_operand 0 "rl78_volatile_memory_operand"))
+)
+
+; Likewise for nonimmediate_operand.
+
+(define_predicate "rl78_nonimmediate_operand"
+  (ior (match_operand 0 "nonimmediate_operand")
+   (match_operand 0 "rl78_volatile_memory_operand"))
+)
+
+(define_predicate "rl78_any_operand"
+  (ior (match_operand 0 "rl78_general_operand")
(match_code "mem,const_int,const_double,reg"))
 )

 (define_predicate "rl78_nonfar_operand"
-  (and (match_operand 0 "general_operand")
+  (and (match_operand 0 "rl78_general_operand")
(not (match_test "rl78_far_p (op)")))
 )

 (define_predicate "rl78_nonfar_nonimm_operand"
-  (and (match_operand 0 "nonimmediate_operand")
+  (and (match_operand 0 "rl78_nonimmediate_operand")
(not (match_test "rl78_far_p (op)")))
 )

--
1.7.9.5



Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Jan Hubicka
> What I've observed on power is that LTO alone reduces performance and
> LTO+FDO is not significantly different than FDO alone.
On SPEC2k6?

This is quite surprising; for our (well, SUSE's) SPEC testers (AMD64), LTO seems
an off-noise win on SPEC2k6:
http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/recent.html
http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006/recent.html

I do not see why PPC should be significantly more constrained by register
pressure.  

I do not have a head-to-head comparison of FDO and FDO+LTO for SPEC;
http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006-patched-FDO/index.html
shows a noticeable drop in calculix and gamess.
Martin profiled calculix and tracked it down to a loop that is not trained
but hot in the reference run.  That makes it optimized for size.  

http://dromaeo.com/?id=219677,219672,219965,219877
compares Firefox's dromaeo runs with default build, LTO, FDO and LTO+FDO
Here the benefits of LTO and FDO seems to add up nicely.
> 
> I agree that an exact estimate of the register pressure would be a
> difficult problem. I'm hoping that something that approximates potential
> register pressure downstream will be sufficient to help inlining
> decisions. 

Yep, register pressure and I-cache overhead estimates are used for inline
decisions by some compilers.

I am mostly concerned about the metric suffering from the GIGO principle if we mix
together too many estimates that are somewhat wrong by their nature. This is
why I mostly tried to focus on size/time estimates and not add too many other
metrics. But perhaps it is time to experiment with these, since obviously we
have pushed the current infrastructure mostly to its limits.

Honza
> 
>   Aaron
> 
> On Fri, 2014-04-18 at 10:36 -0700, Xinliang David Li wrote:
> > Do you witness similar problems with LTO +FDO?
> > 
> > My concern is it can be tricky to get the register pressure estimate
> > right. The register pressure problem is created by downstream
> > components (code motions etc) but only exposed by the inliner.  If you
> > want to get it 'right' (i.e., not exposing the problems), you will
> > need to bake the knowledge of the downstream components (possibly
> > bugs) into the analysis which might not be a good thing to do longer
> > term.
> > 
> > David
> > 
> > On Fri, Apr 18, 2014 at 9:43 AM, Aaron Sawdey
> >  wrote:
> > > Honza,
> > >   Seeing your recent patches relating to inliner heuristics for LTO, I
> > > thought I should mention some related work I'm doing.
> > >
> > > By way of introduction, I've recently joined the IBM LTC's PPC Toolchain
> > > team, working on gcc performance.
> > >
> > > We have not generally seen good results using LTO on IBM power processors
> > > and one of the problems seems to be excessive inlining that results in the
> > > generation of excessive spill code. So, I have set out to tackle this by
> > > doing some analysis at the time of the inliner pass to compute something
> > > analogous to register pressure, which is then used to shut down inlining 
> > > of
> > > routines that have a lot of pressure.
> > >
> > > The analysis is basically a liveness analysis on the SSA names per basic
> > > block and looking for the maximum number live in any block. I've been 
> > > using
> > > "liveness pressure" as a shorthand name for this.
> > >
> > > This can then be used in two ways.
> > > 1) want_inline_function_to_all_callers_p at present always says to inline
> > > things that have only one call site without regard to size or what this 
> > > may
> > > do to the register allocator downstream. In particular, BZ2_decompress in
> > > bzip2 gets inlined and this causes the pressure reported downstream for 
> > > the
> > > int register class to increase 10x. Looking at some combination of 
> > > pressure
> > > in caller/callee may help avoid this kind of situation.
> > > 2) I also want to experiment with adding the liveness pressure in the 
> > > callee
> > > into the badness calculation in edge_badness used by 
> > > inline_small_functions.
> > > The idea here is to try to inline functions that are less likely to cause
> > > register allocator difficulty downstream first.
> > >
> > > I am just at the point of getting a prototype working, I will get a patch
> > > you could take a look at posted next week. In the meantime, do you have 
> > > any
> > > comments or feedback?
> > >
> > > Thanks,
> > >Aaron
> > >
> > > --
> > > Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
> > > 050-2/C113  (507) 253-7520 home: 507/263-0782
> > > IBM Linux Technology Center - PPC Toolchain
> > >
> > 
> 
> -- 
> Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
> 050-2/C113  (507) 253-7520 home: 507/263-0782
> IBM Linux Technology Center - PPC Toolchain


Inliner heuristics TLC 3/n

2014-04-18 Thread Jan Hubicka
Hi,
this patch makes the FDO inliner more aggressive about inlining function
calls that are considered hot.  This is based on the observation that
INLINE_INSNS_AUTO is the most common reason for inlining not happening
(20.5% for Firefox, where 63.2% of calls are not inlinable because the body
is not available, and 66% for GCC).

With this patch the INLINE_HINT_known_hot hint is added to edges that were
determined to be hot by the profile and for which there is at least a 50%
chance that the caller will invoke the call during its execution.

With this hint we now ignore both limits - this is because the greedy algorithm
driven by the speed/size_cost metric should work pretty well here, but we may want
to revisit it (i.e. add INLINE_INSNS_FDO or so).  I am erring on the aggressive
side so that we collect some data on when the profile is a win or a loss.

Bootstrapped/regtested x86_64-linux, committed.

Honza

* ipa-inline.h (INLINE_HINT_known_hot): New hint.
* ipa-inline-analysis.c (dump_inline_hints): Dump it.
(do_estimate_edge_time): Compute it.
* ipa-inline.c (want_inline_small_function_p): Bypass
INLINE_INSNS_AUTO/SINGLE limits for calls that are known
to be hot.
Index: ipa-inline.h
===
--- ipa-inline.h(revision 209489)
+++ ipa-inline.h(working copy)
@@ -68,7 +68,9 @@ enum inline_hints_vals {
   INLINE_HINT_cross_module = 64,
   /* If array indexes of loads/stores become known there may be room for
  further optimization.  */
-  INLINE_HINT_array_index = 128
+  INLINE_HINT_array_index = 128,
+  /* We know that the callee is hot by profile.  */
+  INLINE_HINT_known_hot = 256
 };
 typedef int inline_hints;
 
Index: ipa-inline-analysis.c
===
--- ipa-inline-analysis.c   (revision 209489)
+++ ipa-inline-analysis.c   (working copy)
@@ -671,6 +671,11 @@ dump_inline_hints (FILE *f, inline_hints
   hints &= ~INLINE_HINT_array_index;
   fprintf (f, " array_index");
 }
+  if (hints & INLINE_HINT_known_hot)
+{
+  hints &= ~INLINE_HINT_known_hot;
+  fprintf (f, " known_hot");
+}
   gcc_assert (!hints);
 }
 
@@ -3666,6 +3671,17 @@ do_estimate_edge_time (struct cgraph_edg
&known_aggs);
   estimate_node_size_and_time (callee, clause, known_vals, known_binfos,
   known_aggs, &size, &min_size, &time, &hints, es->param);
+
+  /* When we have profile feedback, we can quite safely identify hot
+ edges and for those we disable size limits.  Don't do that when
+ probability that caller will call the callee is low however, since it
+ may hurt optimization of the caller's hot path.  */
+  if (edge->count && cgraph_maybe_hot_edge_p (edge)
+  && (edge->count * 2
+  > (edge->caller->global.inlined_to
+? edge->caller->global.inlined_to->count : edge->caller->count)))
+hints |= INLINE_HINT_known_hot;
+
   known_vals.release ();
   known_binfos.release ();
   known_aggs.release ();
Index: ipa-inline.c
===
--- ipa-inline.c(revision 209522)
+++ ipa-inline.c(working copy)
@@ -578,18 +578,21 @@ want_inline_small_function_p (struct cgr
  inline cnadidate.  At themoment we allow inline hints to
  promote non-inline function to inline and we increase
  MAX_INLINE_INSNS_SINGLE 16fold for inline functions.  */
-  else if (!DECL_DECLARED_INLINE_P (callee->decl)
+  else if ((!DECL_DECLARED_INLINE_P (callee->decl)
+  && (!e->count || !cgraph_maybe_hot_edge_p (e)))
   && inline_summary (callee)->min_size - inline_edge_summary (e)->call_stmt_size
  > MAX (MAX_INLINE_INSNS_SINGLE, MAX_INLINE_INSNS_AUTO))
 {
   e->inline_failed = CIF_MAX_INLINE_INSNS_AUTO_LIMIT;
   want_inline = false;
 }
-  else if (DECL_DECLARED_INLINE_P (callee->decl)
+  else if ((DECL_DECLARED_INLINE_P (callee->decl) || e->count)
   && inline_summary (callee)->min_size - inline_edge_summary (e)->call_stmt_size
  > 16 * MAX_INLINE_INSNS_SINGLE)
 {
-  e->inline_failed = CIF_MAX_INLINE_INSNS_AUTO_LIMIT;
+  e->inline_failed = (DECL_DECLARED_INLINE_P (callee->decl)
+ ? CIF_MAX_INLINE_INSNS_SINGLE_LIMIT
+ : CIF_MAX_INLINE_INSNS_AUTO_LIMIT);
   want_inline = false;
 }
   else
@@ -606,6 +609,7 @@ want_inline_small_function_p (struct cgr
   && growth >= MAX_INLINE_INSNS_SINGLE
   && ((!big_speedup
&& !(hints & (INLINE_HINT_indirect_call
+ | INLINE_HINT_known_hot
  | INLINE_HINT_loop_iterations
  | INLINE_HINT_array_index
  | INLINE_HINT_loop_stride)))
@@ -630,6 +634,7 @@ want_inline_small_function_p (struct cgr
 

Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Jan Hubicka
> What I've observed on power is that LTO alone reduces performance and
> LTO+FDO is not significantly different than FDO alone.
> 
> I agree that an exact estimate of the register pressure would be a
> difficult problem. I'm hoping that something that approximates potential
> register pressure downstream will be sufficient to help inlining
> decisions. 

One (orthogonal) way to deal with this problem would also be to disable
inlining of functions called once when the edge frequency is low.
I.e. adding to check_callers something like
edge->frequency > CGRAPH_FREQ_BASE / 2
if you want to disqualify all calls that have only 50% chance that they
will be called during function invocation.

Does something like that help in your cases?

It would help in the case Linus complained about
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

The difficulty here is that disabling inlining on not-so-important paths
may prevent SRA and other optimizations so it may in turn also penalize
the hot path. I saw this in some cases where EH cleanup code was optimized
for size.

Perhaps SRA can also be extended to handle cases where non-SRAable code is
on a cold path?

Honza


Re: [COMMITTED] Fix silly error in aarch64_register_move_cost

2014-04-18 Thread Jeff Law

On 04/18/14 10:02, Richard Henderson wrote:

Building mainline I got


.../aarch64.c:4879:134: error: invalid conversion from ‘reg_class_t {aka int}’ 
to ‘machine_mode’ [-fpermissive]
if (! TARGET_SIMD && GET_MODE_SIZE (from) == 128 && GET_MODE_SIZE (to) == 
128)


Sure enough, TO and FROM are not modes.  Did mainline just change away from
permissive or something?  It surely seems like we ought to have seen this
earlier...
I haven't looked at GET_MODE_XXX, but most, if not all, of the RTL checking
is disabled because it's too expensive.


I'm hoping that David's work will take us to a place where we can do 
more static checking on the RTL bits too.


jeff


[C PATCH] Fix up diagnostics (PR c/25801)

2014-04-18 Thread Marek Polacek
Here's a patch that tidies up the diagnostics given for arithmetic on a
pointer to an incomplete type.  Firstly, we ought not to say "unknown
structure" when the type is not a structure; secondly, we shouldn't report
the identical error twice.
(It might make sense to just say "arithmetic on" instead of
"increment/decrement of", but I'm not changing that now.)

Regtested on x86_64-unknown-linux-gnu, bootstrapped on
powerpc64-unknown-linux-gnu, ok for trunk?

2014-04-18  Marek Polacek  

PR c/25801
* c-typeck.c (c_size_in_bytes): Don't call error.
(pointer_diff): Remove comment.
(build_unary_op): Improve error messages.

* gcc.dg/pr25801.c: New test.

diff --git gcc/c/c-typeck.c gcc/c/c-typeck.c
index 65aad45..bc977a7 100644
--- gcc/c/c-typeck.c
+++ gcc/c/c-typeck.c
@@ -1760,15 +1760,10 @@ c_size_in_bytes (const_tree type)
 {
   enum tree_code code = TREE_CODE (type);
 
-  if (code == FUNCTION_TYPE || code == VOID_TYPE || code == ERROR_MARK)
+  if (code == FUNCTION_TYPE || code == VOID_TYPE || code == ERROR_MARK
+  || !COMPLETE_OR_VOID_TYPE_P (type))
 return size_one_node;
 
-  if (!COMPLETE_OR_VOID_TYPE_P (type))
-{
-  error ("arithmetic on pointer to an incomplete type");
-  return size_one_node;
-}
-
   /* Convert in case a char is more than one unit.  */
   return size_binop_loc (input_location, CEIL_DIV_EXPR, TYPE_SIZE_UNIT (type),
 size_int (TYPE_PRECISION (char_type_node)
@@ -3528,7 +3523,6 @@ pointer_diff (location_t loc, tree op0, tree op1)
   if (!COMPLETE_OR_VOID_TYPE_P (TREE_TYPE (TREE_TYPE (orig_op1
 error_at (loc, "arithmetic on pointer to an incomplete type");
 
-  /* This generates an error if op0 is pointer to incomplete type.  */
   op1 = c_size_in_bytes (target_type);
 
   if (pointer_to_zero_sized_aggr_p (TREE_TYPE (orig_op1)))
@@ -4002,16 +3996,18 @@ build_unary_op (location_t location,
 
if (typecode == POINTER_TYPE)
  {
-   /* If pointer target is an undefined struct,
+   /* If pointer target is an incomplete type,
   we just cannot know how to do the arithmetic.  */
if (!COMPLETE_OR_VOID_TYPE_P (TREE_TYPE (argtype)))
  {
if (code == PREINCREMENT_EXPR || code == POSTINCREMENT_EXPR)
  error_at (location,
-   "increment of pointer to unknown structure");
+   "increment of pointer to an incomplete type %qT",
+   TREE_TYPE (argtype));
else
  error_at (location,
-   "decrement of pointer to unknown structure");
+   "decrement of pointer to an incomplete type %qT",
+   TREE_TYPE (argtype));
  }
else if (TREE_CODE (TREE_TYPE (argtype)) == FUNCTION_TYPE
 || TREE_CODE (TREE_TYPE (argtype)) == VOID_TYPE)
diff --git gcc/testsuite/gcc.dg/pr25801.c gcc/testsuite/gcc.dg/pr25801.c
index e69de29..70b4ef8 100644
--- gcc/testsuite/gcc.dg/pr25801.c
+++ gcc/testsuite/gcc.dg/pr25801.c
@@ -0,0 +1,19 @@
+/* PR c/25801 */
+/* { dg-do compile } */
+
+int (*a)[];
+struct S *s;
+
+void
+f (void)
+{
+  a++; /* { dg-error "increment of pointer to an incomplete type" } */
+  ++a; /* { dg-error "increment of pointer to an incomplete type" } */
+  a--; /* { dg-error "decrement of pointer to an incomplete type" } */
+  --a; /* { dg-error "decrement of pointer to an incomplete type" } */
+
+  s++; /* { dg-error "increment of pointer to an incomplete type" } */
+  ++s; /* { dg-error "increment of pointer to an incomplete type" } */
+  s--; /* { dg-error "decrement of pointer to an incomplete type" } */
+  --s; /* { dg-error "decrement of pointer to an incomplete type" } */
+}

Marek


Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Xinliang David Li
On Fri, Apr 18, 2014 at 12:27 PM, Jan Hubicka  wrote:
>> What I've observed on power is that LTO alone reduces performance and
>> LTO+FDO is not significantly different than FDO alone.
> On SPEC2k6?
>
> This is quite surprising; for our (well, SUSE's) spec testers (AMD64) LTO
> seems an off-noise win on SPEC2k6
> http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/recent.html
> http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006/recent.html
>
> I do not see why PPC should be significantly more constrained by register
> pressure.
>
> I do not have a head-to-head comparison of FDO and FDO+LTO for SPEC
> http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006-patched-FDO/index.html
> shows noticeable drop in calculix and gamess.
> Martin profiled calculix and tracked it down to a loop that is not trained
> but hot in the reference run.  That makes it optimized for size.
>
> http://dromaeo.com/?id=219677,219672,219965,219877
> compares Firefox's dromaeo runs with default build, LTO, FDO and LTO+FDO
> Here the benefits of LTO and FDO seems to add up nicely.
>>
>> I agree that an exact estimate of the register pressure would be a
>> difficult problem. I'm hoping that something that approximates potential
>> register pressure downstream will be sufficient to help inlining
>> decisions.
>
> Yep, register pressure and I-cache overhead estimates are used for inline
> decisions by some compilers.
>
> I am mostly concerned about the metric suffering from the GIGO principle if we
> mix together too many estimates that are somewhat wrong by their nature. This
> is why I mostly tried to focus on size/time estimates and not add too many
> other metrics. But perhaps it is time to experiment with these, since
> obviously we pushed the current infrastructure mostly to its limits.
>

I like the word GIGO here. Getting inline signals right requires deep
analysis (including interprocedural analysis). Different signals/hints
may also come with different quality and thus different weights.

Another challenge is how to quantify cycle savings/overhead more
precisely. With that, we can abandon the threshold based scheme -- any
callsite with a net saving will be considered.

David


> Honza
>>
>>   Aaron
>>
>> On Fri, 2014-04-18 at 10:36 -0700, Xinliang David Li wrote:
>> > Do you witness similar problems with LTO +FDO?
>> >
>> > My concern is it can be tricky to get the register pressure estimate
>> > right. The register pressure problem is created by downstream
>> > components (code motions etc) but only exposed by the inliner.  If you
>> > want to get it 'right' (i.e., not exposing the problems), you will
>> > need to bake the knowledge of the downstream components (possibly
>> > bugs) into the analysis which might not be a good thing to do longer
>> > term.
>> >
>> > David
>> >
>> > On Fri, Apr 18, 2014 at 9:43 AM, Aaron Sawdey
>> >  wrote:
>> > > Honza,
>> > >   Seeing your recent patches relating to inliner heuristics for LTO, I
>> > > thought I should mention some related work I'm doing.
>> > >
>> > > By way of introduction, I've recently joined the IBM LTC's PPC Toolchain
>> > > team, working on gcc performance.
>> > >
>> > > We have not generally seen good results using LTO on IBM power processors
>> > > and one of the problems seems to be excessive inlining that results in 
>> > > the
>> > > generation of excessive spill code. So, I have set out to tackle this by
>> > > doing some analysis at the time of the inliner pass to compute something
>> > > analogous to register pressure, which is then used to shut down inlining 
>> > > of
>> > > routines that have a lot of pressure.
>> > >
>> > > The analysis is basically a liveness analysis on the SSA names per basic
>> > > block and looking for the maximum number live in any block. I've been 
>> > > using
>> > > "liveness pressure" as a shorthand name for this.
>> > >
>> > > This can then be used in two ways.
>> > > 1) want_inline_function_to_all_callers_p at present always says to inline
>> > > things that have only one call site without regard to size or what this 
>> > > may
>> > > do to the register allocator downstream. In particular, BZ2_decompress in
>> > > bzip2 gets inlined and this causes the pressure reported downstream for 
>> > > the
>> > > int register class to increase 10x. Looking at some combination of 
>> > > pressure
>> > > in caller/callee may help avoid this kind of situation.
>> > > 2) I also want to experiment with adding the liveness pressure in the 
>> > > callee
>> > > into the badness calculation in edge_badness used by 
>> > > inline_small_functions.
>> > > The idea here is to try to inline functions that are less likely to cause
>> > > register allocator difficulty downstream first.
>> > >
>> > > I am just at the point of getting a prototype working, I will get a patch
>> > > you could take a look at posted next week. In the meantime, do you have 
>> > > any
>> > > comments or feedback?
>> > >
>> > > Thanks,
>> > >Aaron
>> > >
>> > > --
>> > > Aaron Sa

Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Xinliang David Li
On Fri, Apr 18, 2014 at 12:51 PM, Jan Hubicka  wrote:
>> What I've observed on power is that LTO alone reduces performance and
>> LTO+FDO is not significantly different than FDO alone.
>>
>> I agree that an exact estimate of the register pressure would be a
>> difficult problem. I'm hoping that something that approximates potential
>> register pressure downstream will be sufficient to help inlining
>> decisions.
>
> One (orthogonal) way to deal with this problem would also be to disable
> inlining of functions called once when the edge frequency is low.
> I.e. adding to check_callers something like
> edge->frequency > CGRAPH_FREQ_BASE / 2
> if you want to disqualify all calls that have only 50% chance that they
> will be called during function invocation.
>
> Does something like that help in your cases?
>
> It would help in the case Linus complained about
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194
>
> The difficulty here is that disabling inlining on not-so-important paths
> may prevent SRA and other optimizations so it may in turn also penalize
> the hot path. I saw this in some cases where EH cleanup code was optimized
> for size.


Yes. The callsite may be cold, but the profile-scaled callee body may
still be hot. Inlining the callee allows more context to be passed.
Similarly for hot callers, cold callees may also expose more
information (e.g., better alias info) to the caller.

David

>
> Perhaps SRA can also be extended to handle cases where non-SRAable code is
> on a cold path?
>
> Honza


Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Jan Hubicka
> On Fri, Apr 18, 2014 at 12:27 PM, Jan Hubicka  wrote:
> >> What I've observed on power is that LTO alone reduces performance and
> >> LTO+FDO is not significantly different than FDO alone.
> > On SPEC2k6?
> >
> > This is quite surprising; for our (well, SUSE's) spec testers (AMD64) LTO
> > seems an off-noise win on SPEC2k6
> > http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/recent.html
> > http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006/recent.html
> >
> > I do not see why PPC should be significantly more constrained by register
> > pressure.
> >
> > I do not have a head-to-head comparison of FDO and FDO+LTO for SPEC
> > http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006-patched-FDO/index.html
> > shows noticeable drop in calculix and gamess.
> > Martin profiled calculix and tracked it down to a loop that is not trained
> > but hot in the reference run.  That makes it optimized for size.
> >
> > http://dromaeo.com/?id=219677,219672,219965,219877
> > compares Firefox's dromaeo runs with default build, LTO, FDO and LTO+FDO
> > Here the benefits of LTO and FDO seems to add up nicely.
> >>
> >> I agree that an exact estimate of the register pressure would be a
> >> difficult problem. I'm hoping that something that approximates potential
> >> register pressure downstream will be sufficient to help inlining
> >> decisions.
> >
> > Yep, register pressure and I-cache overhead estimates are used for inline
> > decisions by some compilers.
> >
> > I am mostly concerned about the metric suffering from the GIGO principle
> > if we mix together too many estimates that are somewhat wrong by their
> > nature. This is why I mostly tried to focus on size/time estimates and not
> > add too many other metrics. But perhaps it is time to experiment with
> > these, since obviously we pushed the current infrastructure mostly to its
> > limits.
> >
> 
> I like the word GIGO here. Getting inline signals right requires deep
> analysis (including interprocedural analysis). Different signals/hints
> may also come with different quality and thus different weights.
> 
> Another challenge is how to quantify cycle savings/overhead more
> precisely. With that, we can abandon the threshold based scheme -- any
> callsite with a net saving will be considered.

Inline hints are intended to do this - at the moment we bump the limits up
when we estimate big speedups from the inlining, and with today's patch and FDO
we bypass the thresholds when we know from FDO that the call matters.

Concerning your other email, indeed we should consider heavy callees (in Open64
terminology) that consume a lot of time, and not skip those call sites.  An
easy way would be to replace the maybe_hot_edge predicate by a maybe_hot_call
that simply multiplies the count and estimated time.  (We probably ought to get
rid of the time capping and use wider arithmetic too.)

I wonder if that is not too local and if we should not try to estimate the
cumulative time of the function and get more aggressive about inlining over the
whole path leading to hot code.

Honza


Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Xinliang David Li
On Fri, Apr 18, 2014 at 2:16 PM, Jan Hubicka  wrote:
>> On Fri, Apr 18, 2014 at 12:27 PM, Jan Hubicka  wrote:
>> >> What I've observed on power is that LTO alone reduces performance and
>> >> LTO+FDO is not significantly different than FDO alone.
>> > On SPEC2k6?
>> >
>> > This is quite surprising; for our (well, SUSE's) spec testers (AMD64) LTO
>> > seems an off-noise win on SPEC2k6
>> > http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/recent.html
>> > http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006/recent.html
>> >
>> > I do not see why PPC should be significantly more constrained by register
>> > pressure.
>> >
>> > I do not have a head-to-head comparison of FDO and FDO+LTO for SPEC
>> > http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006-patched-FDO/index.html
>> > shows noticeable drop in calculix and gamess.
>> > Martin profiled calculix and tracked it down to a loop that is not trained
>> > but hot in the reference run.  That makes it optimized for size.
>> >
>> > http://dromaeo.com/?id=219677,219672,219965,219877
>> > compares Firefox's dromaeo runs with default build, LTO, FDO and LTO+FDO
>> > Here the benefits of LTO and FDO seems to add up nicely.
>> >>
>> >> I agree that an exact estimate of the register pressure would be a
>> >> difficult problem. I'm hoping that something that approximates potential
>> >> register pressure downstream will be sufficient to help inlining
>> >> decisions.
>> >
>> > Yep, register pressure and I-cache overhead estimates are used for inline
>> > decisions by some compilers.
>> >
>> > I am mostly concerned about the metric suffering from the GIGO principle
>> > if we mix together too many estimates that are somewhat wrong by their
>> > nature. This is why I mostly tried to focus on size/time estimates and not
>> > add too many other metrics. But perhaps it is time to experiment with
>> > these, since obviously we pushed the current infrastructure mostly to its
>> > limits.
>> >
>>
>> I like the word GIGO here. Getting inline signals right requires deep
>> analysis (including interprocedural analysis). Different signals/hints
>> may also come with different quality and thus different weights.
>>
>> Another challenge is how to quantify cycle savings/overhead more
>> precisely. With that, we can abandon the threshold based scheme -- any
>> callsite with a net saving will be considered.
>
> Inline hints are intended to do this - at the moment we bump the limits up
> when we estimate big speedups from the inlining, and with today's patch and
> FDO we bypass the thresholds when we know from FDO that the call matters.
>
> Concerning your other email, indeed we should consider heavy callees (in
> Open64 terminology) that consume a lot of time, and not skip those call
> sites.  An easy way would be to replace the maybe_hot_edge predicate by a
> maybe_hot_call that simply multiplies the count and estimated time.  (We
> probably ought to get rid of the time capping and use wider arithmetic too.)

That's what we did in the Google branches. We had two heuristics -- hot
caller and hot callee heuristics.

1) For the hot caller heuristic, other simple analyses are checked: a) the
global working set size; b) a callsite argument check -- a very simple
check to guess whether inlining this callsite would sharpen analysis.

2) We have not tuned the hot callee heuristic by doing more analysis --
simply turning it on using hotness does not make a noticeable difference.
Other hints are needed.

David



>
> I wonder if that is not too local and if we should not try to estimate the
> cumulative time of the function and get more aggressive about inlining over
> the whole path leading to hot code.
>
> Honza


Re: [VRP][PATCH] Improve value range for loop index

2014-04-18 Thread Kugan
Ping?

On 10/04/14 06:07, Kugan wrote:
> Value range propagation simplifies convergence in vrp_visit_phi_node by
> setting the minimum to TYPE_MIN when the computed minimum is smaller than
> the previous minimum. This can, however, result in pessimistic value
> ranges in some cases.
> 
> for example,
> 
>   unsigned int i;
>   for (i = 0; i < 8; i++)
>   {
> 
>   }
> 
>   # ivtmp_19 = PHI 
>   ...
>   :
>   ivtmp_17 = ivtmp_19 - 1;
>   if (ivtmp_17 != 0)
>   
>   goto ;
> 
> The min value of ivtmp_19 is simplified to 0 (in tree-vrp.c:8465) whereas
> it should have been 1. This prevents correct value ranges from being
> calculated for ivtmp_17 in the example.
> 
> We should be able to see the step (the difference from the previous minimum
> to the computed minimum), and if there is scope for more iterations (the
> computed minimum is greater than the step), we should be able to set the
> minimum to do one more iteration and converge to the right minimum value.
> 
> Attached patch fixes this. Is this OK for stage-1?
> 
> Bootstrapped and regression tested on X86_64-unknown-linux-gnu with no
> new regressions.
> 
> Thanks,
> Kugan
> 
> gcc/
> 
> +2014-04-09  Kugan Vivekanandarajah  
> +
> + * tree-vrp.c (vrp_visit_phi_node): Improve value ranges of loop
> + index when simplifying convergence towards minimum.
> +
> 
> gcc/testsuite
> 
> +2014-04-09  Kugan Vivekanandarajah  
> +
> + * gcc.dg/tree-ssa/vrp91.c: New test.
> +
>