Re: s390: SImode pointers vs LR

2015-06-01 Thread Andreas Krebbel
On 05/30/2015 02:57 AM, DJ Delorie wrote:
> In config/s390/s390.c we accept addresses that are SImode:
> 
>   if (!REG_P (base)
> || (GET_MODE (base) != SImode
> && GET_MODE (base) != Pmode))
>   return false;
> 
> However, there doesn't seem to be anything in the s390's opcodes that
> masks the top half of address registers in 64-bit mode, the SImode
> convention seems to just be a convention for addresses in the first
> 4Gb.
> 
> So... what happens if gcc uses a subreg to load the lower half of a
> register (via LR), leaving the upper half with random bits in it, then
> uses that register as an address?  I could see no code that checked
> for this, and I have a fairly large and ungainly test case that says
> it breaks :-(
> 
> My local solution was to just disallow "SImode" as an address in
> s390_decompose_address, which forces gcc to do an explicit SI->DI
> conversion to clear the upper bits, and it seems to work, but I wonder
> if it's the ideal solution...

We also have address style operands which are used as shift count operands but 
never as addresses.
Accepting SImode there was necessary to prevent reload from messing things up:
https://gcc.gnu.org/ml/gcc-patches/2006-03/msg01495.html

I don't know if this is still necessary though.

-Andreas-



Re: Better info for combine results in worse code generated

2015-06-01 Thread Segher Boessenkool
On Mon, Jun 01, 2015 at 11:33:18AM +0930, Alan Modra wrote:
> Unifying andsi_mask with anddi_mask, and the fact that constraints for
> const_int see VOIDmode rather than the operand mode is why we get
> rldicr rather than rlwinm.  Easily fixed by separating the si/di
> patterns, and with a little more work I may even be able to keep them
> together.

Maybe just swapping T to be before S will do what you want, already?

> In and3 expander I think you want the following since
> and64_2_operand covers the extra double-rotate cases, not all DImode.
> 
> -  if ((mode == DImode && !and64_2_operand (operands[2], mode))
> -  || (mode != DImode && !and_operand (operands[2], mode)))
> +  if (!and_operand (operands[2], mode)
> +  && (mode != DImode || !and64_2_operand (operands[2], 
> mode)))

and64_2_operand includes all of and_operand.  I agree it is a mess.

> In and3_imm_mask_dot and and3_imm_mask_dot2.  Typo?
> -   && any_mask_operand (operands[2], mode)"
> +   && !any_mask_operand (operands[2], mode)"

Thinko; that whole line should just be removed.  We prefer e.g. "rlwinm"
over "andi.", but "andi." over "rlwinm.".  I'll do a patch.

> Seems to
> me we should omit !logical_const_operand from those insn predicates.

I tried to have no patterns overlap (where it matters at all); this helps
preserve your sanity if you try to move patterns around in the .md files
(first match wins).

> > I don't think it is a good idea to optimise code based on assumptions
> > of what SImode SETs will do to the dest seen as DImode, without making
> > those assumptions explicit in the RTL.
> 
> I agree.

Ah, good to hear.

> Do you intend to

I don't have firm, short-term plans to remove ancient design, no.

> get rid of WORD_REGISTER_OPERATIONS,

rs6000 should not define it.  What e.g. does it mean for mullw?  Or,
worse, mulhw?  Pretty much anything with "w" in its name is problematic.

[ What does this macro mean for rotates or right shifts at all?  The docs
don't say. ]

> POINTERS_EXTEND_UNSIGNED,

Not related to this.

> PUSH_ROUNDING,

I would love to see this disappear.  But, not related, for rs6000
anyway.

> SHORT_IMMEDIATES_SIGN_EXTEND,

Not sure about this one.

> and LOAD_EXTEND_OP?  ;-)

With this macro you can just "opt out" for the hard cases.


In any case: I am working on a patch that unifies all rotate-and-mask
patterns, fixes the rtx_cost for those, and gets rid of 2rld completely.
If you really want to disallow most SImode ANDs, it won't be hard to
add that, much easier than with the current code.


Segher


Re: s390: SImode pointers vs LR

2015-06-01 Thread Jeff Law

On 06/01/2015 06:23 AM, Andreas Krebbel wrote:

On 05/30/2015 02:57 AM, DJ Delorie wrote:

In config/s390/s390.c we accept addresses that are SImode:

   if (!REG_P (base)
  || (GET_MODE (base) != SImode
  && GET_MODE (base) != Pmode))
return false;

However, there doesn't seem to be anything in the s390's opcodes that
masks the top half of address registers in 64-bit mode, the SImode
convention seems to just be a convention for addresses in the first
4Gb.

So... what happens if gcc uses a subreg to load the lower half of a
register (via LR), leaving the upper half with random bits in it, then
uses that register as an address?  I could see no code that checked
for this, and I have a fairly large and ungainly test case that says
it breaks :-(

My local solution was to just disallow "SImode" as an address in
s390_decompose_address, which forces gcc to do an explicit SI->DI
conversion to clear the upper bits, and it seems to work, but I wonder
if it's the ideal solution...


We also have address style operands which are used as shift count operands but 
never as addresses.
Accepting SImode there was necessary to prevent reload from messing things up:
https://gcc.gnu.org/ml/gcc-patches/2006-03/msg01495.html

I don't know if this is still necessary though.
Looks like a hack to me.  Sadly, no testcase in that BZ, though 
presumably since it was a bootstrap failure one could checkout a 
development tree from that era and see the failure.


jeff


Regression: template function accessing a temporary through a lambda

2015-06-01 Thread Sebastien Alaiwan
Hi folks,
the following C++ snippet used to compile, but doesn't anymore with current gcc 
trunk.
Hope this helps!

Thanks,
Sebastien Alaiwan

-
// simple.cpp
int getValueOfSix()
{
  return 6;
}

template
void f()
{
  auto const c = getValueOfSix();

  auto lambda = [&] ()
  {
auto a = c;
auto b = c;
  };
}

void g()
{
  f<0>();
}

-
Here are the errors I get:

$ g++ -c -std=c++11 simple.cpp

simple.cpp: In instantiation of ‘f():: [with int A = 0]’:
simple.cpp:12:19:   required from ‘struct f() [with int A = 0]::’
simple.cpp:16:3:   required from ‘void f() [with int A = 0]’
simple.cpp:21:8:   required from here
simple.cpp:15:14: error: redeclaration of ‘const int& c’
 auto b = c;
  ^
simple.cpp:14:14: note: ‘const int& c’ previously declared here
 auto a = c;
  ^
simple.cpp: In instantiation of ‘void f() [with int A = 0]’:
simple.cpp:21:8:   required from here
simple.cpp:12:17: sorry, unimplemented: non-trivial designated initializers not 
supported
   auto lambda = [&] ()
 ^




Strange insn rtx is emitted in a custom backend

2015-06-01 Thread Lev Yudalevich
While working on a custom backend for quite a standard RISC-like
architecture, I defined 'high'/'lo_sum' patterns as follows:

(define_insn "mov_high"
  [(set (match_operand:SI 0 "register_operand" "=r")
  (high:SI (match_operand:SI 1 "immediate_operand" "i")))]
  ""
  "mvup #high(%1), %0"
  [(set_attr "insn" "alu")
   (set_attr "length" "4")]
)

(define_insn "add_low"
  [(set (match_operand:SI 0 "register_operand" "=r")
  (lo_sum:SI (match_operand:SI 1 "register_operand" "0")
   (match_operand:SI 2 "symbolic_operand" "")))]
   (match_operand:SI 2 "immediate_operand" "i")))]
  ""
  "add3 %0, #low(%2), %0"
  [(set_attr "insn" "alu")
   (set_attr "length" "4")]
)


Despite having the patters above (along with usual 'plus' patterns)
I'm getting the following rtx emitted at the expansion stage:

(insn 53 52 63 2 (set (reg:SI 62)
   (const:SI (plus:SI (symbol_ref:SI [flags 0x6])
  (const_int
64 [0x40] -1
(nil))

which can not be recognized unless I define a pattern like this:

(define_insn "*make_gcc_happy"
  [(set (match_operand:SI 0 "register_operand" "=r")
  (const:SI (plus:SI (match_operand:SI 1 "symbolic_operand" "")
 (match_operand:SI 2
"const_int_operand" ""]
  ""
  "mvup #high(%1), %0  \\n  add3 %0, #low(%1+%2), %0"
  [(set_attr "insn" "alu")
   (set_attr "length" "8")]
)


I don't understand why the rtx above is emitted as a whole single
instruction, without using the high/lo_sum patterns.
I'd very appreciate any help, hints or pointing me to a possibly
missing/erroneous spots.

Sincerely,
Lev.


Re: Strange insn rtx is emitted in a custom backend

2015-06-01 Thread Jeff Law

On 06/01/2015 12:51 PM, Lev Yudalevich wrote:

While working on a custom backend for quite a standard RISC-like
architecture, I defined 'high'/'lo_sum' patterns as follows:

(define_insn "mov_high"
   [(set (match_operand:SI 0 "register_operand" "=r")
   (high:SI (match_operand:SI 1 "immediate_operand" "i")))]
   ""
   "mvup #high(%1), %0"
   [(set_attr "insn" "alu")
(set_attr "length" "4")]
)

(define_insn "add_low"
   [(set (match_operand:SI 0 "register_operand" "=r")
   (lo_sum:SI (match_operand:SI 1 "register_operand" "0")
(match_operand:SI 2 "symbolic_operand" "")))]
(match_operand:SI 2 "immediate_operand" "i")))]
   ""
   "add3 %0, #low(%2), %0"
   [(set_attr "insn" "alu")
(set_attr "length" "4")]
)


Despite having the patters above (along with usual 'plus' patterns)
I'm getting the following rtx emitted at the expansion stage:

(insn 53 52 63 2 (set (reg:SI 62)
(const:SI (plus:SI (symbol_ref:SI [flags 0x6])
   (const_int
64 [0x40] -1
 (nil))

which can not be recognized unless I define a pattern like this:
You should debug precisely where that came from.  You're missing 
something in your backend to tell GCC that such constants are not valid.


You should probably look at how expansion occurs for your tests on other 
ports that make heavy use of high/lo_sum.  sparc & hppa come immediately 
to mind, but I'm sure there's others.


jeff



Re: confirm subscribe to gcc@gcc.gnu.org

2015-06-01 Thread Rich Felker


Static PIE support in GCC

2015-06-01 Thread Rich Felker
A feature I've been interested in getting upstream in GCC for a while
now is support for producing static-linked PIE executables for Linux.
In the model I'm working with, static PIE executables are ET_DYN
format with no PT_INTERP, and are intended to contain only relative
type relocations (no symbol references). A custom crt1 start file
named rcrt1.o is responsible for processing these relative relocations
before passing execution to the libc entry point code. I have as part
of musl libc an implementation of rcrt1.o (for all targets musl
presently supports) that's working for this model. The way it works is
completely analogous to what OpenBSD has done in their fork of GCC
(see http://www.openbsd.org/papers/asiabsdcon2015-pie-slides.pdf), but
aside from adopting the 'r' prefix for crt that they used, which I did
for some level of consistency, my work on static PIE has been
completely independent of the development of this feature in OpenBSD.

While OpenBSD's motivations for static PIE seem to be purely security
focused, I'm also interested in static PIE as a form of executable
that can be used on NOMMU targets. My motivation for doing the
relocations in the start file, rather than with an external program
interpreter, is both to reduce runtime cost on very small systems, and
to make deployment easier. For musl users, one of the main benefits of
static linking is that the resulting binary can be run on systems
without any additional runtime files installed.

Unfortunately, producing static PIE binaries with GCC is not as simple
as passing -static -pie when linking. The linker arguments I've been
using to test this so far have been:

-shared -Wl,-Bstatic -Wl,-Bsymbolic

and adding the rcrt1.o to the beginning of the inputs. This looks like
something of a hack, and on the GCC command line I would say it is, at
least for -shared which is being used both to suppress the default
crt1 file and to produce ET_DYN output without PT_INTERP. On the other
hand,-Bstatic is just being used to suppress use of .so files to
satisfy -l dependencies, and -Bsymbolic to produce relative
relocations in the output instead of symbol references.

Thankfully this gets a lot less ugly if you put it in the specs. Just
replacing:

#define LINK_PIE_SPEC "%{pie:-pie} "

with:

#define LINK_PIE_SPEC "%{pie:%{static:-shared -Bsymbolic;:-pie}} "

causes -static -pie to invoke the linker in a manner which matches the
desired static PIE model. Aside from this, a per-target addition to
STARTFILE_SPEC is needed to make GCC choose rcrt1.o instead of Scrt1.o
when -static is used with -pie, and to change the logic for crtbegin
so that -pie's choice of crtbeginS.o overrides -static's crtbeginT.o,
and likewise for crtend.

Before proposing anything in the way of patches I'd like some feedback
on whether this approach is acceptable for upstreaming in GCC. The
obvious alternative to the LINK_PIE_SPEC change is making ld accept
-static -pie and do "the right thing" on its side, but the startfile
changes needed on the GCC side are the same either way.

Rich


GNU Tools Cauldron 2015 - Call for Participation

2015-06-01 Thread Jan Hubicka
Hello,
A reminder about this year's Cauldron (http://gcc.gnu.org/wiki/cauldron2015).
The webpage was just updated with a list of accepted talks and BoFs as well
as a brief review of the schedule to let you plan your travel.

With 19 of very relevant talks and 10 BoFs we will have a busy schedule.  We
are still able to accept last minute submissions to be scheduled in parallel
with other presentations.  To register your abstract, send e-mail to
tools-cauldron-ad...@googlegroups.com.  Please do so early so we can adjust
schedule accordingly. If you submitted an abstract but it is not in the list,
please let us know, too.

If you intend to participate, but not necessarily present, please let us know
as soon as possible.  This will allow us to plan for space and food.  Send a
message to tools-cauldron-ad...@googlegroups.com stating your intent to
participate.

Jan


Re: Better info for combine results in worse code generated

2015-06-01 Thread Alan Modra
On Mon, Jun 01, 2015 at 08:39:05AM -0500, Segher Boessenkool wrote:
> On Mon, Jun 01, 2015 at 11:33:18AM +0930, Alan Modra wrote:
> > Unifying andsi_mask with anddi_mask, and the fact that constraints for
> > const_int see VOIDmode rather than the operand mode is why we get
> > rldicr rather than rlwinm.  Easily fixed by separating the si/di
> > patterns, and with a little more work I may even be able to keep them
> > together.
> 
> Maybe just swapping T to be before S will do what you want, already?

Nope.

Index: gcc/config/rs6000/predicates.md
===
--- gcc/config/rs6000/predicates.md (revision 223878)
+++ gcc/config/rs6000/predicates.md (working copy)
@@ -764,7 +764,11 @@
 
   if (TARGET_POWERPC64)
 {
-  /* Fail if the mask is not 32-bit.  */
+  /* Fail if the mask is not 32-bit.  Note: If constraints are
+implemented using mask_operand then they will never fail this
+test.  const_ints are VOIDmode, which is what is seen here
+when called from a constraint.  When called as a predicate,
+the match_operand mode is seen.  */
   if (mode == DImode && (c & ~(unsigned HOST_WIDE_INT) 0x) != 0)
return 0;
 
The above, part of a patch I was writing to fix these problems, is why
putting T before S doesn't work.  eg. "T" matches 0x8000,
which is good for SImode where you're really masking with 0x8000,
but using rlwinm for the same constant in DImode would of course mask
off the top 32 bits.

> > In and3 expander I think you want the following since
> > and64_2_operand covers the extra double-rotate cases, not all DImode.
> > 
> > -  if ((mode == DImode && !and64_2_operand (operands[2], mode))
> > -  || (mode != DImode && !and_operand (operands[2], mode)))
> > +  if (!and_operand (operands[2], mode)
> > +  && (mode != DImode || !and64_2_operand (operands[2], 
> > mode)))
> 
> and64_2_operand includes all of and_operand.  I agree it is a mess.

but and64_2_operand doesn't include all of and_operand!

> > In and3_imm_mask_dot and and3_imm_mask_dot2.  Typo?
> > -   && any_mask_operand (operands[2], mode)"
> > +   && !any_mask_operand (operands[2], mode)"
> 
> Thinko; that whole line should just be removed.  We prefer e.g. "rlwinm"
> over "andi.", but "andi." over "rlwinm.".  I'll do a patch.

OK, I have a patch too..

> > get rid of WORD_REGISTER_OPERATIONS,
> 
> rs6000 should not define it.  What e.g. does it mean for mullw?  Or,
> worse, mulhw?  Pretty much anything with "w" in its name is problematic.

In many places WORD_REGISTER_OPERATIONS is used, it is saying "don't
trust the high bits".  At the moment we definitely do need it defined!

> and gets rid of 2rld completely.

That's a good idea.  If we want it still, and I think we do, just
implement the two rldicl/r in the expander.

-- 
Alan Modra
Australia Development Lab, IBM


Data race in PGO profile collection for multi-process program

2015-06-01 Thread Pengfei Yuan
Hi,

I am trying PGO on Nginx, which has a main process and several worker processes.
I find that the collected profile data files only contain information
for the main process, which is probably a data race (the main process
exits immediately after worker processes exit).
How can I solve this problem? Should I switch to the sampling-based AutoFDO?

Regards,
Yuan, Pengfei


Re: Data race in PGO profile collection for multi-process program

2015-06-01 Thread Xinliang David Li
Using AutoFDO is one way. For PGO, you may want to to try using
__gcov_dump interface to explicitly control the timing and order of
the profile dump --- i.e., invoke __gcov_dump in main process after
work processes exit and before the main process exits.

David

On Mon, Jun 1, 2015 at 8:08 PM, Pengfei Yuan <0xcool...@gmail.com> wrote:
> Hi,
>
> I am trying PGO on Nginx, which has a main process and several worker 
> processes.
> I find that the collected profile data files only contain information
> for the main process, which is probably a data race (the main process
> exits immediately after worker processes exit).
> How can I solve this problem? Should I switch to the sampling-based AutoFDO?
>
> Regards,
> Yuan, Pengfei


Re: Data race in PGO profile collection for multi-process program

2015-06-01 Thread Pengfei Yuan
Thank you very much for the advice. I will try __gcov_dump first.

Yuan

2015-06-02 12:14 GMT+08:00 Xinliang David Li :
> Using AutoFDO is one way. For PGO, you may want to to try using
> __gcov_dump interface to explicitly control the timing and order of
> the profile dump --- i.e., invoke __gcov_dump in main process after
> work processes exit and before the main process exits.


g++ regression: template function accessing a temporary through a lambda

2015-06-01 Thread ace17
Hi folks,
the following C++ snippet used to compile, but doesn't anymore with current gcc 
trunk.
Hope this helps!

Thanks,
Sebastien Alaiwan

-
// simple.cpp
int getValueOfSix()
{
  return 6;
}

template
void f()
{
  auto const c = getValueOfSix();

  auto lambda = [&] ()
  {
auto a = c;
auto b = c;
  };
}

void g()
{
  f<0>();
}


RE: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE

2015-06-01 Thread Ajit Kumar Agarwal


-Original Message-
From: Jeff Law [mailto:l...@redhat.com] 
Sent: Friday, May 29, 2015 9:24 PM
To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [RFC] Design and Implementation for Path Splitting for Loop with 
Conditional IF-THEN-ELSE

On 05/16/2015 06:49 AM, Ajit Kumar Agarwal wrote:
> I have Designed and  implemented with the following design for the path 
> splitting of the loops with conditional IF-THEN-ELSE.
> The implementation has gone through the bootstrap for Microblaze 
> target along DEJA GNU regressions tests and running the MIBench/EEMBC 
> benchmarks. There is no regression seen in Deja GNU tests and Deja C++ 
> tests results are better with the path splitting changes(lesser 
> failures). The Mibench/EEMBC benchmarks are run and no performance 
> degradation is seen and the performance improvement in telcom_gsm( 
> Mibench benchmarks) of 9.15% and rgbcmy01_lite(EEMBC
> benchmarks) the performance improvement of 9.4% is observed.
>
> Proposal on the below were made earlier on path splitting and the reference 
> is attached.
>>The first thing I notice is you haven't included any tests for the new 
>>optimization.  We're trying really hard these days to have tests in the suite 
>>for this kind of >>new optimization/feature work.

>>I don't offhand know if any of the benchmarks you cite above are free-enough 
>>to derive a testcase from.  But one trick many of us use is to instrument the 
pass and compile some known free software (often gcc
>>itself) to find triggering code and use that to generate tests for the new 
>>transformation.

I will add tests in the suite. I could see many existing tests in the suite 
also get triggered with this optimization.

>
> Design changes.
>
> 1. The dominators of the block with conditional IF statements say BB1 
> are found and the join node of the IF-THEN-ELSE inside the loops is found on 
> the blocks dominated by the BB1 and are not successor of BB1 are the join 
> node.
This isn't necessarily my area of strength, but it doesn't seem right to 
me.   I guess it would work if there aren't any more nodes in the loop 
after the join block, except for the loop latch, but it doesn't see right in 
the general case.  It might work if you also verify that the IF block is the 
immediate dominator of potential join node.



>
> 2. The Join node is same as the source of the loops latch basic blocks.
>>OK.  This is why your initial algorithm in #1 works.
Thanks.

>
> 3. If the above conditional in (1) and (2) are found the Join node  
> same as the source of the Loop latch node is moved into predecessors 
> and the Join node ( Source of the Loop latch node) is made empty statement 
> block with only the phi nodes.
>>Hopefully you do this by duplicating the join node and manipulating the CFG 
>>so that the original predecessors of the join each go to their copy of the 
>>join >>node.  The copied join nodes unconditionally then pass control to the 
>>original join node.

Yes.

>
> 4. In the process of moving the Join node into its predecessors the 
> result of the phi node in the Join node propagated with the corresponding phi 
> arguments  based on which predecessor it came from in the Join blocks and 
> move into its predecessors.
>>Right.  Note there's a phi-only cprop pass which is designed to do this 
>>propagation and in theory should be very fast.

Thanks. 

> 5. The Dominator INFO is updated after performing the steps of (1) (2) (3) 
> and (4).
>>Right.  Getting the sequencing of the various updates can be tricky. 
>>Even more so if you try to do it as a part of jump threading because you have 
>>to (for example) deal with unreachable blocks in the CFG.


>
> 6. The Join which is same as the source of the Loop latch node is made empty 
> with only the phi node in the Join node.
>>Right.

Thanks.

>
> 7. The implementation is done in Jump threading phase of the machine 
> independent optimization on tree based
> representation. The path splitting is called after the Jump threading 
> optimization is performed.
> The following files are modifed.
>
>  a) tree-vrp.c
>  b) tree-ssa-threadedge.c
>  c) tree-cfg.c
>  d) tree-ssa-threadedge.h
>  e) tree-cfg.h
>  f) cfghooks.c
>
> The diff is attached along with this mail and pasted below.
>>I would recommend this as a separate pass.  One of the goals for GCC 6 
>>is to break out the threader into its own pass, if you're piggybacking 
>>on the threader it just makes this task harder.

>>Also by implementing this as a distinct pass you have more freedom to 
>>put it where it really belongs in the optimization pipeline.  Given the 
>>duplication may end up exposing partial redundancies, dead code & 
>>propagation opportunities, it's probably best placed before DOM.  That 
>>also eliminates the need for running phi-only cprop because DOM will 
>>take care of it for you.

>>When jump threading is