configure --program-suffix could change gccjit library path?

2015-07-02 Thread Basile Starynkevitch
Hello All,

Imagine that someone (e.g. a distribution packager) wants to have
several versions of GCCJIT, just as they can have several versions of
gcc; concretely, GCC 5.2 alongside GCC 6.1.

I believe the --program-suffix argument of configure would be
useful here.

So he/she would compile GCC 5.2 with a --program-suffix=-5 argument to
configure, and would like the gcc compiler to go in
/usr/local/bin/gcc-5 and the GCCJIT library to go in
/usr/local/lib/libgccjit-5.so, so that users of GCCJIT 5.2 would link with -lgccjit-5.

And he/she would also compile GCC 6.1 with a --program-suffix=-6
argument to configure, and would like the gcc compiler to go in
/usr/local/bin/gcc-6 and the GCCJIT library to go in
/usr/local/lib/libgccjit-6.so,
so that users of GCCJIT 6.1 would link with -lgccjit-6.

Having two different GCCJIT libraries is IMHO a legitimate wish (likewise, one 
can 
easily have several versions of LLVM libraries on Debian).

AFAIU, --program-suffix is not yet understood by GCCJIT's configure
machinery.

This is a wish, since I don't know autoconf well enough to be able to
propose any patch.

Or perhaps there is some existing configure switch already related to
location of libgccjit?
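To make the wish concrete, here is roughly what such a packager would run. This is only a sketch: the flag set is illustrative, and the suffixed libgccjit-5.so shown in the comments is the *desired* outcome, not what the current configure machinery produces.

```shell
# Hypothetical packager recipe for a suffixed GCC 5.2 with the JIT.
# --enable-host-shared is required to build libgccjit.
mkdir build-gcc5 && cd build-gcc5
../gcc-5.2.0/configure --prefix=/usr/local \
    --program-suffix=-5 \
    --enable-languages=c,c++,jit \
    --enable-host-shared
make -j4
make install   # installs /usr/local/bin/gcc-5; the wish is that it
               # would also install /usr/local/lib/libgccjit-5.so,
               # linkable via -lgccjit-5
```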

Regards.
-- 
Basile Starynkevitch  http://starynkevitch.net/Basile/
France




Transformation from SEME(Single Entry Multiple Exit) to SESE(Single Entry Single Exit)

2015-07-02 Thread Ajit Kumar Agarwal
All:

A region with a single entry and multiple exits disables traditional loop
optimizations. The presence of short-circuit evaluation also turns the CFG into
a SEME (Single Entry, Multiple Exit) region. The transformation from SEME to
SESE (Single Entry, Single Exit) regions enables many loop optimizations.

An approach such as node splitting, which turns SEME regions into SESE
regions, is an important CFG transformation that benefits both loops and
conditionals.

The loop transformations in LLVM perform node splitting to convert SEME
regions to SESE regions. The presence of break and goto statements inside
loops makes the CFG unstructured, turning it into a SEME region. Converting
such unstructured control flow into structured control flow enables many loop
transformations.
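As a hand-written illustration of the idea (not GCC's or LLVM's actual output; the function names and the flag-based restructuring are invented for this sketch), a loop with a break has two exits, and introducing a flag folds the early exit into the single loop condition:

```c
#include <stddef.h>

/* SEME: the loop can be left through the break OR through i == n,
   so the loop region has two exits. */
long sum_until_negative_seme(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] < 0)
            break;          /* second exit out of the loop */
        sum += a[i];
    }
    return sum;
}

/* SESE: a flag routes the early exit through the loop header, so the
   region now has a single exit and is friendlier to loop optimizers. */
long sum_until_negative_sese(const int *a, size_t n)
{
    long sum = 0;
    int done = 0;
    for (size_t i = 0; i < n && !done; i++) {
        if (a[i] < 0)
            done = 1;       /* request exit; tested only at the header */
        else
            sum += a[i];
    }
    return sum;
}
```

Both functions compute the same value; only the shape of the control flow differs.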

I would like to implement a transformation phase on loops, run before any
loop optimization pass, that transforms an unstructured CFG into a structured
CFG as LLVM does.

Does GCC already have such transformation passes on loops? Please share your
thoughts.

Thanks & Regards
Ajit


Re: making the new if-converter not mangle IR that is already vectorizer-friendly

2015-07-02 Thread Alan Lawrence

Abe wrote:




Hi, pleased to meet you :)




As some of you already know, at SARC we are working on a new "if converter" to 
help convert
simple "if"-based blocks of code that appear inside loops into an 
autovectorizer-friendly form
that closely resembles the C ternary operator ["c ? x : y"].  GCC already has 
such a converter,
but it is off by default, in part because it is unsafe: if enabled, it can 
cause certain code
to be transformed in such a way that it malfunctions even though the 
non-converted code worked
just fine with the same inputs.  The new converter, originally by my teammate 
Sebastian Pop,
is safer [almost-always safe *]; we are working on getting it into good-enough 
shape that the
always-safe transformations can be turned on by default whenever the 
autovectorizer is on.

* Always safe for stores, sometimes a little risky for loads:
   speculative loads might cause multithreaded programs with
   insufficient locking to fail due to writes by another thread
   being "lost"/"missed", even though the same program works OK
   "by luck" when compiled without if-conversion of loads.
   This risk comes mainly/only from what the relevant literature
   calls a "half hammock": an "if" with a "then" section but no
   "else" section [or effectively vice-versa, e.g. an empty "then"
   and a non-empty "else"].  In this case, e.g. "if (c)  X[x] = Y[y];"
   with no attached "else" section is risky to fully if-convert
   in the event of the code being compiled running multithreaded
   and not having been written with all the locking it really needs.
   Respectively, e.g. "if (c)  ; /* empty ''then'' */  else  X[x] = Y[y];".


For the unenlightened, can you outline the problem with this code sequence 
(i.e. the expected transformation that makes it unsafe)? I would hope your 
scratchpad patch would turn this into something like


a1 = c ? &Y[y] : &scratch;
temp = *a1;
a2 = c ? &X[x] : &scratch;
*a2 = temp;

which seems OK to me - so is the scratchpad approach going away?
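Alan's hypothesised output can be written as plain C to check its scalar semantics. This is an illustrative sketch only: the `scratch` variable and the function shape are invented here, not taken from the patch.

```c
static int scratch;   /* dummy location that absorbs the untaken path */

/* Unconditional-memory-access form of "if (c)  X[x] = Y[y];":
   both the load and the store always happen, but when c is false
   they are redirected to the scratchpad. */
void half_hammock_converted(int c, int *X, int x, const int *Y, int y)
{
    const int *a1 = c ? &Y[y] : (const int *)&scratch;
    int temp = *a1;                       /* always loads something */
    int *a2 = c ? &X[x] : &scratch;
    *a2 = temp;                           /* always stores somewhere */
}
```

When c is false the function merely shuffles garbage through the scratchpad and leaves X untouched, which is why this form looked safe to Alan.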

(The problem that things might be read in a different order *across* the 
elements of a vector, I can see, but that belongs in the domain of the 
vectorizer itself, not if-conversion, I would think?)



One of the reasons the new if converter has not yet been submitted
for incorporation into GCC`s trunk is that it still has some
performance regressions WRT the old converter, and most of those
are "true regressions", i.e. not just because the old converter
was less safe and the additional safety is what is causing the loss,
but rather because there is more work to do before the patch is ready.

As of this writing, the new if converter sometimes tries
to "convert" something that is already vectorizer-friendly,
and in doing so it renders that code now-NOT-vectorizer-friendly.


Can you give an example? My understanding was that the existing vectorizer 
bailed out pretty much straightaway if the number of basic blocks in the loop 
was not exactly 2 (for inner loops) or 5 (for outermost loops, i.e. containing 
exactly one inner loop)...that seems to rule out vectorization of *any* kind of 
conditional execution, that the if-converter might convert?



Thanks, Alan



Consideration of Cost associated with SEME regions.

2015-07-02 Thread Ajit Kumar Agarwal
All:

The Cost Calculation for a candidate to Spill in the Integrated Register 
Allocator(IRA) considers only the SESE regions.
The Cost Calculation in the IRA should consider the SEME regions into consider 
for spilling decisions. 

The Cost associated with the path that has un-matured exists should be less, 
thus making the more chances of spilling decision
In the path of  un-matured exits. The path that has un-matured (normal )exists 
should be having a higher cost than the cost of un-matured exists and
Spilling decisions has to made accordingly in order to spill inside the less 
frequency path with the un-matured exists than the high frequency
Path with the normal exits.

I would like to propose the above for consideration of cost associated with 
SEME regions in IRA.

Thoughts?

Thanks & Regards
Ajit


Re: GCC 5.1.1 Status Report (2015-06-22)

2015-07-02 Thread Matthew Wahab

On 22/06/15 12:56, Richard Biener wrote:


I plan to release GCC 5.2.0 around July 10th which means a release
candidate being done around July 3rd.

Please check your open regression bugs for ones that are eligible for
backporting.  Also please help getting the P1 bug count to zero
(there is still the ARM aligned argument passing ABI issue).


I'd like to get the fix for PR target/65697 (weak memory barriers for __sync builtins 
on ARMv8) into GCC-5.2.


The backported patches for the AArch64 back-end have been submitted:
https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01937.html

There's a reviewer for the AArch64 patches but I still need a reviewer for the first 
patch, which touches the middle-end and several back-ends.


Matthew


RE: Consideration of Cost associated with SEME regions.

2015-07-02 Thread Ajit Kumar Agarwal
Sorry for the typo error. I meant exits instead of exists.

The below is corrected.

The cost calculation for a spill candidate in the Integrated Register 
Allocator (IRA) considers only SESE regions. The cost calculation in the IRA 
should take SEME regions into consideration for spilling decisions as well.

The cost associated with a path that has an un-matured exit should be lower, 
increasing the chances of a spilling decision on that path. A path with a 
normal exit should have a higher cost than one with an un-matured exit, and 
spilling decisions should be made accordingly, so that spills are placed on 
the less frequent path with the un-matured exit rather than on the 
high-frequency path with the normal exit.

I would like to propose the above for consideration of cost associated with 
SEME regions in IRA.

Thoughts?

Thanks & Regards
Ajit



-Original Message-
From: Ajit Kumar Agarwal 
Sent: Thursday, July 02, 2015 3:33 PM
To: vmaka...@redhat.com; l...@redhat.com; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Consideration of Cost associated with SEME regions.

All:

The Cost Calculation for a candidate to Spill in the Integrated Register 
Allocator(IRA) considers only the SESE regions.
The Cost Calculation in the IRA should consider the SEME regions into consider 
for spilling decisions. 

The Cost associated with the path that has un-matured exists should be less, 
thus making the more chances of spilling decision In the path of  un-matured 
exits. The path that has un-matured (normal )exists should be having a higher 
cost than the cost of un-matured exists and Spilling decisions has to made 
accordingly in order to spill inside the less frequency path with the 
un-matured exists than the high frequency Path with the normal exits.

I would like to propose the above for consideration of cost associated with 
SEME regions in IRA.

Thoughts?

Thanks & Regards
Ajit


Re: GCC 5.1.1 Status Report (2015-06-22)

2015-07-02 Thread Richard Biener
On Thu, 2 Jul 2015, Matthew Wahab wrote:

> On 22/06/15 12:56, Richard Biener wrote:
> > 
> > I plan to release GCC 5.2.0 around July 10th which means a release
> > candidate being done around July 3rd.
> > 
> > Please check your open regression bugs for ones that are eligible for
> > backporting.  Also please help getting the P1 bug count to zero
> > (there is still the ARM aligned argument passing ABI issue).
> 
> I'd like to get the fix for PR target/65697 (weak memory barriers for __sync
> builtins on ARMv8) into GCC-5.2.
> 
> The backported patches for the Aarch64 back-end have been submitted:
> https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01937.html
> 
> There's a reviewer for the Aarch64 patches but I still need a reviewer for the
> first patch which touches the middle-end and several back-ends.

The patch is ok to backport.

Thanks,
Richard.


Re: GCC 5.1.1 Status Report (2015-06-22)

2015-07-02 Thread Ramana Radhakrishnan
On Thu, Jul 2, 2015 at 12:03 PM, Matthew Wahab
 wrote:
> On 22/06/15 12:56, Richard Biener wrote:
>>
>>
>> I plan to release GCC 5.2.0 around July 10th which means a release
>> candidate being done around July 3rd.
>>
>> Please check your open regression bugs for ones that are eligible for
>> backporting.  Also please help getting the P1 bug count to zero
>> (there is still the ARM aligned argument passing ABI issue).
>
>
> I'd like to get the fix for PR target/65697 (weak memory barriers for __sync
> builtins on ARMv8) into GCC-5.2.
>
> The backported patches for the Aarch64 back-end have been submitted:
> https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01937.html
>
> There's a reviewer for the Aarch64 patches but I still need a reviewer for
> the first patch which touches the middle-end and several back-ends.

I'd also like to see the ARM patches backported, please, unless there
are RM objections. It doesn't make sense for the ports to diverge
on something like this at different release points on the branch if
we can avoid it.

Ramana

>
> Matthew


Re: Consideration of Cost associated with SEME regions.

2015-07-02 Thread Vladimir Makarov



On 07/02/2015 07:06 AM, Ajit Kumar Agarwal wrote:

Sorry for the typo error. I meant exits instead of exists.

The below is corrected.

The cost calculation for a spill candidate in the Integrated Register 
Allocator (IRA) considers only SESE regions. The cost calculation in the IRA 
should take SEME regions into consideration for spilling decisions as well.
IRA is a regional allocator: it spills pseudos within a region. Currently 
the regions are loops, as these are the most important for optimization.  
Loops can have more than one exit, so your assumption that IRA works on SESE 
regions is not accurate.


With some work, IRA could be extended to non-loop regions too.  I am not sure 
it would give a significant improvement (LRA partially compensates for this 
through inheritance in EBBs), but it would definitely slow down IRA, whose 
speed depends heavily on the number of regions (which is why a lot was done 
in IRA to decrease the number of regions by merging regions with low register 
pressure).


Of course, it would be interesting to implement non-loop regions in IRA 
and see the results.

The cost associated with a path that has an un-matured exit should be lower, 
increasing the chances of a spilling decision on that path. A path with a 
normal exit should have a higher cost than one with an un-matured exit, and 
spilling decisions should be made accordingly, so that spills are placed on 
the less frequent path with the un-matured exit rather than on the 
high-frequency path with the normal exit.

I would like to propose the above for consideration of cost associated with 
SEME regions in IRA.
IRA uses the standard GCC evaluation of edge and BB execution frequencies, 
either static (see predict.c) or based on an execution profile (see 
*profile.c).  Without a profile this may be inaccurate; maybe it can be 
improved.  To me it makes more sense to work on that code rather than on the 
IRA code alone.




%fs and %gs segments on x86/x86-64

2015-07-02 Thread Armin Rigo
Hi all,

I implemented support for %fs and %gs segment prefixes on the x86 and
x86-64 platforms, in what turns out to be a small patch.

For those not familiar with it, at least on x86-64, %fs and %gs are
two special registers that a user program can ask to have added to the
address in any memory-accessing machine instruction.  This is done with
a one-byte instruction prefix, "%fs:" or "%gs:".  The actual value
stored in these two registers cannot be modified quickly (at least
before the Haswell CPU), but the general idea is that they are rarely
modified.
Speed-wise, though, an instruction like "movq %gs:(%rdx), %rax" runs
at the same speed as a "movq (%rdx), %rax" would.  (I failed to
measure any difference, but I guess that the instruction is one more
byte in length, which means that a large quantity of them would tax
the instruction caches a bit more.)

For reference, the pthread library on x86-64 uses %fs to point to
thread-local variables.  There are a number of special modes in gcc to
already produce instructions like "movq %fs:(16), %rax" to load
thread-local variables (declared with __thread).  However, this
support is special-case only.  The %gs register is free to use.  (On
x86, %gs is used by pthread and %fs is free to use.)
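The __thread mechanism mentioned above can be exercised portably; on x86-64/Linux each access to the variable below compiles to a %fs-relative load or store. The worker function, counts, and the `observed` variable are illustrative, not part of any library API.

```c
#include <pthread.h>

__thread long tls_counter = 0;   /* one copy per thread; %fs-relative on x86-64 */

static long observed;            /* what the worker thread saw in its copy */

static void *bump_and_record(void *arg)
{
    (void)arg;
    for (int i = 0; i < 5; i++)
        tls_counter++;           /* e.g. "addq $1, %fs:tls_counter@tpoff" */
    observed = tls_counter;      /* the worker's private copy now holds 5 */
    return NULL;
}

/* Run one worker thread and report the value of ITS tls_counter. */
long worker_counter_after_run(void)
{
    pthread_t t;
    pthread_create(&t, NULL, bump_and_record, NULL);
    pthread_join(t, NULL);
    return observed;
}
```

The main thread's `tls_counter` is unaffected by the worker's increments, which is exactly the per-thread separation that %fs addressing provides.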


So what I did is add the __seg_fs and __seg_gs address spaces.  They
are used like this, for example:

typedef __seg_gs struct myobject_s {
int a, b, c;
} myobject_t;

You can then use variables of type "struct myobject_s *o1" as regular
pointers, and "myobject_t *o2" as %gs-based pointers.  Accesses to
"o2->a" are compiled to instructions that use the %gs prefix; accesses
to "o1->a" are compiled as usual.  These two pointer types are
incompatible.  The way you obtain %gs-based pointers, or control the
value of %gs itself, is out of the scope of gcc; you do that by using
the correct system calls and by manual arithmetic.  There is no
automatic conversion; the C code can contain casts between the three
address spaces (regular, %fs and %gs) which, like regular pointer
casts, are no-ops.


My motivation comes from the PyPy-STM project ("removing the Global
Interpreter Lock" for this Python interpreter).  In this project, I
want *almost all* pointer manipulations to resolve to different
addresses depending on which thread runs the code.  The idea is to use
mmap() tricks to ensure that the actual memory usage remains
reasonable, by sharing most of the pages (but not all of them) between
each thread's "segment".  So most accesses to a %gs-prefixed address
actually access the same physical memory in all threads; but not all
of them.  This gives me a dynamic way to have a large quantity of data
which every thread can read, and by changing occasionally the mapping
of a single page, I can make some changes be thread-local, i.e.
invisible to other threads.

Of course, the same effect can be achieved in other ways, like
declaring a regular "__thread intptr_t base;" and adding the "base"
explicitly to every pointer access.  Clearly, this would have a large
performance impact.  The %gs solution comes at almost no cost.  The
patched gcc is able to compile the hundreds of MBs of (generated) C
code with systematic %gs usage and seems to work well (with one
exception, see below).


Is there interest in that?  And if so, how to progress?

* The patch included here is very minimal.  It is against the
gcc_5_1_0_release branch but adapting it to "trunk" should be
straightforward.

* I'm unclear if target_default_pointer_address_modes_p() should
return "true" or not in this situation: i386-c.c now defines more than
the default address mode, but the new ones also use pointers of the
same standard size.

* One case in which this patched gcc miscompiles code is found in the
attached bug1.c/bug1.s.  (This case almost never occurs in PyPy-STM,
so I could work around it easily.)  I think that some early, pre-RTL
optimization is to "blame" here, possibly getting confused because the
nonstandard address spaces also use the same size for pointers.  Of
course it is also possible that I messed up somewhere, or that the
whole idea is doomed because many optimizations make a similar
assumption.  Hopefully not: it is the only issue I encountered.

* The extra byte needed for the "%gs:" prefix is not explicitly
accounted for.  Is it only by chance that I did not observe gcc
underestimating how large the code it writes is, and then e.g. use
jump instructions that would be rejected by the assembler?

* For completeness: this is very similar to clang's
__attribute__((address_space(256))), but a few details differ.  (Also,
not to discredit other projects on their competitor's mailing list,
but I had to fix three distinct bugs in LLVM before I could use it.
It contributes to me having more trust in gcc...)


Links for more info about pypy-stm:

* http://morepypy.blogspot.ch/2015/03/pypy-stm-251-released.html
* https://bitbucket.org/pypy/stmgc/src/use-gcc/gcc-seg-gs/
* https://bitbucket.org/pypy/stmgc/src/use-gcc/c8/stmgc.h


Thanks.

Re: GCC 5.1.1 Status Report (2015-06-22)

2015-07-02 Thread Matthew Wahab

On 02/07/15 13:40, Ramana Radhakrishnan wrote:

On Thu, Jul 2, 2015 at 12:03 PM, Matthew Wahab
 wrote:


I'd like to get the fix for PR target/65697 (weak memory barriers for __sync
builtins on ARMv8) into GCC-5.2.



I'd also like to see the ARM patches backported please unless there
are RM objections for it.


The patches are up for review: 
https://gcc.gnu.org/ml/gcc-patches/2015-07/msg00129.html

Matthew




libgomp: Purpose of gomp_thread_pool::last_team?

2015-07-02 Thread Sebastian Huber
Hello,

does anyone know what the purpose of gomp_thread_pool::last_team is? This field 
seems to be used to delay the team destruction in gomp_team_end() in case the 
team has more than one thread and the previous team state has no team 
associated (does this identify the master thread?):

  if (__builtin_expect (thr->ts.team != NULL, 0)
  || __builtin_expect (team->nthreads == 1, 0))
free_team (team);
  else
{
  struct gomp_thread_pool *pool = thr->thread_pool;
  if (pool->last_team)
free_team (pool->last_team);
  pool->last_team = team;
}

Why can you not immediately free the team?

-- 
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax : +49 89 189 47 41-09
E-Mail  : sebastian.huber at embedded-brains.de
PGP : Public key available on request.

This message is not a commercial communication within the meaning of the German EHUG.


Re: libgomp: Purpose of gomp_thread_pool::last_team?

2015-07-02 Thread Jakub Jelinek
On Thu, Jul 02, 2015 at 09:57:20PM +0200, Sebastian Huber wrote:
> does anyone know what the purpose of gomp_thread_pool::last_team is? This 
> field seems to be used to delay the team destruction in gomp_team_end() in 
> case the team has more than one thread and the previous team state has no 
> team associated (identifies this a master thread?):
> 
>   if (__builtin_expect (thr->ts.team != NULL, 0)
>   || __builtin_expect (team->nthreads == 1, 0))
> free_team (team);
>   else
> {
>   struct gomp_thread_pool *pool = thr->thread_pool;
>   if (pool->last_team)
>   free_team (pool->last_team);
>   pool->last_team = team;
> }
> 
> Why can you not immediately free the team?

That was added with
https://gcc.gnu.org/ml/gcc-patches/2008-05/msg01674.html
and the purpose is, for non-nested teams, to make sure all the
threads in the team have moved on to the pool's barrier before
the team's barrier is destroyed.
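The idiom amounts to a one-slot deferred-free cache: the most recently retired team is kept alive for one more round, so any straggler still crossing its barrier never touches freed memory. A generic sketch (the struct layout and names here are hypothetical, not libgomp's actual definitions):

```c
#include <stdlib.h>

struct team {
    int nthreads;                 /* ... the team barrier lives here ... */
};

struct thread_pool {
    struct team *last_team;       /* retired team, possibly still being left */
};

/* Retire TEAM: free the team cached from the PREVIOUS round (by now
   every thread has long since moved past its barrier) and cache TEAM
   instead of freeing it immediately. */
void retire_team(struct thread_pool *pool, struct team *team)
{
    if (pool->last_team)
        free(pool->last_team);
    pool->last_team = team;
}
```

Freeing `team` immediately would be a use-after-free race, since threads may still be inside the team's barrier when the master reaches this point.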

Jakub


Re: making the new if-converter not mangle IR that is already vectorizer-friendly

2015-07-02 Thread Abe

On 7/2/15 4:30 AM, Alan Lawrence wrote:


Hi, pleased to meet you :)


Likewise.  :-)


[Abe wrote:]


* Always safe for stores, sometimes a little risky for loads:
   speculative loads might cause multithreaded programs with
   insufficient locking to fail due to writes by another thread
   being "lost"/"missed", even though the same program works OK
   "by luck" when compiled without if-conversion of loads.
   This risk comes mainly/only from what the relevant literature
   calls a "half hammock": an "if" with a "then" section but no
   "else" section [or effectively vice-versa, e.g. an empty "then"
   and a non-empty "else"].  In this case, e.g. "if (c)  X[x] = Y[y];"
   with no attached "else" section is risky to fully if-convert
   in the event of the code being compiled running multithreaded
   and not having been written with all the locking it really needs.
   Respectively, e.g. "if (c)  ; /* empty ''then'' */  else  X[x] = Y[y];".



[Alan wrote:]


For the unenlightened, can you outline the problem with this code sequence?
(i.e. the expected transformation that makes it unsafe!?)
I would hope your scratchpad patch would turn this into something like



a1 = c ? &Y[y] : &scratch;
temp = *a1;
a2 = c ? &X[x] : &scratch;
*a2 = temp;



which seems OK to me


Yes, you are right.  The problem I was thinking about is not present in the 
above:
in the "'c' is false" case, the vectorized code for the above just wastes some 
effort
by reading garbage from the scratchpad and writing it back to the scratchpad.



so is the scratchpad approach going away?


Not at all.  :-)


My example[s] was/were not written well with regard to expressing what I had in 
mind.
The problem I was thinking about is shown with a scalar destination, e.g.:

  if (c)  foo = X[x];

... which is if-converted into the equivalent of:

  foo = c ? X[x] : foo;


The [perceived/potential] problem with the preceding is that the part that is 
equivalent to "foo = foo;" can _not_ be optimized out, as it normally would be 
for a non-"volatile" "foo", because it is part of a larger vectorized operation 
and this small part cannot be broken out of the whole without breaking 
vectorization.  Therefore, the value of "foo" might be read and then written 
back a few {micro|nano|pico|whatever}-seconds later, which may cause an update 
to the same location by another thread to be overwritten.

That`s why the pathological program-under-compilation is a badly-written 
multithreaded program without enough locking: without the "if something, then 
overwrite" being replaced by "read; if something, write the new value; 
otherwise rewrite the old value", such a program might work correctly through 
"good luck", but with that replacement [i.e. the if-conversion] the chance of 
success [where "success" here basically means not "missing" a write by another 
thread] is lower than it was before.
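In single-threaded code the two forms are equivalent, which is what makes the conversion legal in the sequential model; the hazard Abe describes comes purely from the redundant write-back when the condition is false. A runnable sketch (function shape and names are invented for illustration):

```c
/* Original half hammock: performs NO store at all when c is false. */
void cond_store(int c, int *foo, const int *X, int x)
{
    if (c)
        *foo = X[x];
}

/* If-converted form: ALWAYS stores, re-writing the old value when c is
   false.  Harmless sequentially, but between the read of *foo and the
   write-back, a concurrent update to *foo by another thread can be
   silently overwritten. */
void cond_store_converted(int c, int *foo, const int *X, int x)
{
    *foo = c ? X[x] : *foo;
}
```

A data-race detector such as -fsanitize=thread would flag the converted form in an under-locked multithreaded program even though the original never wrote to *foo on that path.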

However, after some discussion with Sebastian I learned that this is already 
taken care of, i.e. safe:
the "read, then if something then write the new value, and if not that same 
something then rewrite the old value"
replacement strategy is only used for thread-local scalars.  For global scalars 
and static scalars,
we treat the scalar as if it were the first element of a length=1 array and 
don`t have this problem.

In other words, the problem about which I was concerned is not going to be triggered by 
e.g. "if (c)  x = ..."
which lacks an attached "else  x = ..." in a multithreaded program without 
enough locking just because 'x' is global/static.

The only remaining case to consider is if some code being compiled takes the address of 
something thread-local and then "gives"
that pointer to another thread.  Even for _that_ extreme case, Sebastian says 
that the gimplifier will detect this
"address has been taken" situation and do the right thing, so that the new if 
converter also does the right thing.

TLDR: Abe was being too paranoid; to the best of our knowledge, it`s OK as-is.  
;-)


[Abe wrote:]


One of the reasons the new if converter has not yet been submitted
for incorporation into GCC`s trunk is that it still has some
performance regressions WRT the old converter, and most of those
are "true regressions", i.e. not just because the old converter
was less safe and the additional safety is what is causing the loss,
but rather because there is more work to do before the patch is ready.



As of this writing, the new if converter sometimes tries
to "convert" something that is already vectorizer-friendly,
and in doing so it renders that code now-NOT-vectorizer-friendly.



[Alan wrote:]


Can you give an example?


The test cases in the GCC tree at "gcc.dg/vect/pr61194.c" and 
"gcc.dg/vect/vect-mask-load-1.c"
currently show this: the new if-converter "converts" something that`s 
already vectorizer-friendly,
messing up the IR in the process and thus disabling vectorization for the test 
case in question.