GCC 4.7.0 Status Report (2011-09-09)

2011-09-09 Thread Jakub Jelinek
Status
======

The trunk is in Stage 1, which, if we follow roughly the 4.6
release schedule, should end around end of October.
At this point I'd like to gather the status of the various
development branches that haven't been merged into trunk yet
and whether it is possible to merge them with such a schedule
or whether e.g. a two-week delay would help them.
In particular, is the transactional-memory branch mergeable within
a month and a half, at least some parts of the cxx-mem-model branch,
bitfield lowering?  What is the status of lra, reload-2a, pph,
cilkplus, gupc (I assume at least some of these are 4.8+ material)?


Quality Data
============

Priority    #   Change from Last Report
--------  ---   -----------------------
P1          6   +  6
P2         95   + 10
P3         59   + 56
--------  ---   -----------------------
Total     160   + 72


Previous Report
===============

http://gcc.gnu.org/ml/gcc/2011-03/msg00178.html

The next status report will be sent by Joseph.


Re: GCC 4.7.0 Status Report (2011-09-09)

2011-09-09 Thread Richard Guenther
On Fri, Sep 9, 2011 at 9:09 AM, Jakub Jelinek  wrote:
> Status
> ==
>
> The trunk is in Stage 1, which, if we follow roughly the 4.6
> release schedule, should end around end of October.
> At this point I'd like to gather the status of the various
> development branches that haven't been merged into trunk yet
> and whether it is possible to merge them with such a schedule
> or whether e.g. a two-week delay would help them.
> In particular, is the transactional-memory branch mergeable within
> a month and a half, at least some parts of the cxx-mem-model branch,
> bitfield lowering?  What is the status of lra, reload-2a, pph,
> cilkplus, gupc (I assume at least some of these are 4.8+ material)?

Bitfield lowering is not going to happen (at least not completely) unless
I can find some more time to work on it.  Instead I want to finally
make no-longer-sign-extending sizetypes happen for 4.7, which currently
only waits on Ada frontend issues.

Btw, end of October will then already be 7 1/2 months' worth of Stage 1.

Richard.


should sync builtins be full optimization barriers?

2011-09-09 Thread Paolo Bonzini

Hi all,

sync builtins are described in the documentation as being full memory
barriers, with the possible exception of __sync_lock_test_and_set. 
However, GCC is not enforcing the fact that they are also full 
_optimization_ barriers.  The RTL produced by builtins does not in 
general include a memory optimization barrier such as a set of 
(mem/v:BLK (scratch:P)).


This can cause problems with lock-free algorithms, for example this:

http://libdispatch.macosforge.org/trac/ticket/35

This can be solved either in generic code, by wrapping sync builtins
(before and after) with an asm("":::"memory"), or in the individual
machine descriptions, by adding a memory barrier in parallel to the
locked instructions or the ll/sc instructions.


Is the above analysis correct?  Or should the users put explicit
compiler barriers?


Paolo


Re: should sync builtins be full optimization barriers?

2011-09-09 Thread Jakub Jelinek
On Fri, Sep 09, 2011 at 10:07:30AM +0200, Paolo Bonzini wrote:
> sync builtins are described in the documentation as being full
> memory barriers, with the possible exception of
> __sync_lock_test_and_set. However, GCC is not enforcing the fact
> that they are also full _optimization_ barriers.  The RTL produced
> by builtins does not in general include a memory optimization
> barrier such as a set of (mem/v:BLK (scratch:P)).
> 
> This can cause problems with lock-free algorithms, for example this:
> 
> http://libdispatch.macosforge.org/trac/ticket/35
> 
> This can be solved either in generic code, by wrapping sync builtins
> (before and after) with an asm("":::"memory"), or in the individual
> machine descriptions by adding a memory barrier in parallel to the
> locked instructions or with the ll/sc instructions.
> 
> Is the above analysis correct?  Or should the users put explicit
> compiler barriers?

I'd say they should be optimization barriers too (and at the tree level
they I think work that way, being represented as function calls), so if
they don't act as memory barriers in RTL, the *.md patterns should be
fixed.  The only exception should be IMHO the __SYNC_MEM_RELAXED
variants - if the CPU can reorder memory accesses across them at will,
why shouldn't the compiler be able to do the same as well?

Jakub


Re: should sync builtins be full optimization barriers?

2011-09-09 Thread Paolo Bonzini

On 09/09/2011 10:17 AM, Jakub Jelinek wrote:

>> Is the above analysis correct?  Or should the users put explicit
>> compiler barriers?
>
> I'd say they should be optimization barriers too (and at the tree level
> they I think work that way, being represented as function calls), so if
> they don't act as memory barriers in RTL, the *.md patterns should be
> fixed.  The only exception should be IMHO the __SYNC_MEM_RELAXED
> variants - if the CPU can reorder memory accesses across them at will,
> why shouldn't the compiler be able to do the same as well?


Agreed, so we have a bug in all released versions of GCC. :(

Paolo


Re: GCC 4.7.0 Status Report (2011-09-09)

2011-09-09 Thread Andrew MacLeod

On 09/09/2011 03:09 AM, Jakub Jelinek wrote:


> In particular, is the transactional-memory branch mergeable within
> a month and a half, at least some parts of the cxx-mem-model branch,


There will certainly be some parts of the branch which would be
appropriate for merging with mainline in October.  We ought to at least
have the new __sync_mem builtins available to replace the old ones, and
the testing infrastructure.  I'm not sure we will have *all* the
infrastructure in place, but it should be pretty close if not.  It's also
fairly low risk.


Andrew


Re: Comparison of GCC-4.6.1 and LLVM-2.9 on x86/x86-64 targets

2011-09-09 Thread Vladimir Makarov

On 09/07/2011 12:23 PM, Vladimir Makarov wrote:

On 09/07/2011 11:55 AM, Xinliang David Li wrote:

Why is lto/whole program mode not used in LLVM for peak performance
comparison? (of course, peak performance should really use FDO..)

Thanks for the feedback.  I did not manage to use LTO for LLVM as
described at


http://llvm.org/docs/LinkTimeOptimization.html#lto

I am getting 'file not recognized: File format not recognized' during
the linking stage.


You are probably right that I should use -Ofast without -flto for gcc
then.  Although I don't think it would significantly change GCC's peak
performance.  Still, I am going to run SPEC2000 without -flto and post
the data (probably next week).


As for FDO, unfortunately for some tests SPEC uses different training
sets, which sometimes gives wrong info to the subsequent optimizations.


I do not look at this comparison as finished work and am going to run
more SPEC2000 tests and change the results if there are serious,
reasonable objections to the current comparison.
I've added -Ofast without -flto -fwhole-program for GCC as well and
updated the graphs:


http://vmakarov.fedorapeople.org/spec/



Re: should sync builtins be full optimization barriers?

2011-09-09 Thread Andrew MacLeod

On 09/09/2011 04:17 AM, Jakub Jelinek wrote:

> On Fri, Sep 09, 2011 at 10:07:30AM +0200, Paolo Bonzini wrote:
>
>> sync builtins are described in the documentation as being full
>> memory barriers, with the possible exception of
>> __sync_lock_test_and_set. However, GCC is not enforcing the fact
>> that they are also full _optimization_ barriers.  The RTL produced
>> by builtins does not in general include a memory optimization
>> barrier such as a set of (mem/v:BLK (scratch:P)).
>>
>> This can cause problems with lock-free algorithms, for example this:
>>
>> http://libdispatch.macosforge.org/trac/ticket/35
>>
>> This can be solved either in generic code, by wrapping sync builtins
>> (before and after) with an asm("":::"memory"), or in the individual
>> machine descriptions by adding a memory barrier in parallel to the
>> locked instructions or the ll/sc instructions.
>>
>> Is the above analysis correct?  Or should the users put explicit
>> compiler barriers?
>
> I'd say they should be optimization barriers too (and at the tree level
> they I think work that way, being represented as function calls), so if
> they don't act as memory barriers in RTL, the *.md patterns should be
> fixed.  The only exception should be IMHO the __SYNC_MEM_RELAXED
> variants - if the CPU can reorder memory accesses across them at will,
> why shouldn't the compiler be able to do the same as well?

Yeah, some of this is part of the ongoing C++0x work... the memory model
parameter is going to allow certain types of code movement in optimizers
based on whether it's an acquire operation, a release operation, neither,
or both.  It is ongoing and hopefully we will eventually have proper
consistency.  The older __sync builtins are eventually going to invoke
the new __sync_mem routines and their new patterns, but will fall back to
the old ones if new patterns aren't specified.


In the case of your program, this would in fact be a valid
transformation, I believe...  __sync_lock_test_and_set is documented to
only have ACQUIRE semantics.  This does not guarantee that a store BEFORE
the operation will be visible in another thread, which means it is
possible to reorder it.  (A summary of the different modes can be found
at http://gcc.gnu.org/wiki/Atomic/GCCMM/Optimizations.)  So you would
require a barrier before this code anyway for the behaviour you are
looking for.  Once the new routines are available and implemented, you
could simply specify the SEQ_CST model and then it should, in theory,
work properly with a barrier being emitted for you.


I don't see anything in this pattern, however, that would enforce acquire
mode and prevent the reverse operation: moving something from after it to
before it... so there may be a bug there anyway.


And I suspect most people actually expect all the old __sync routines to 
be full optimization barriers all the time...  maybe we should consider 
just doing that...


Andrew


Re: Comparison of GCC-4.6.1 and LLVM-2.9 on x86/x86-64 targets

2011-09-09 Thread Vladimir Makarov

On 09/08/2011 04:47 AM, Jakub Jelinek wrote:

> On Wed, Sep 07, 2011 at 11:15:39AM -0400, Vladimir Makarov wrote:
>
>>    This year I used -Ofast -flto -fwhole-program instead of
>> -O3 for GCC and -O3 -ffast-math for LLVM for comparison of peak
>> performance.  I could improve GCC performance even more by using
>> other GCC possibilities (like support of AVX insns, Graphite optimizations
>> and even some experimental stuff like LIPO) but I wanted to give LLVM
>> some chances too.  Probably an experienced user in LLVM could improve
>> LLVM performance too.  So I think it is a fair comparison.
>
> -march=native in addition would be nice to see, that can make significant
> difference, especially on AVX capable CPUs.  I guess LLVM equivalent would
> be -march=corei7 -mtune=corei7 and, if it works, -mavx too (though, the only
> time I've tried LLVM 2.9 it crashed on almost anything with -mavx).

Yes, Jakub.  It would be better to use corei7 with AVX for GCC.
Unfortunately, the most recent tuning that LLVM 2.9 supports is core2,
so I used -march=core2 for the comparison on x86-64.  I think it would
be unfair to use corei7 and AVX for GCC without using them for LLVM.


I mostly tried to compare the state of general optimizations in GCC and
LLVM.  There are a lot of other aspects of the compilers which we could
compare, and I did not do that.  But for me it is obvious that Apple
loses a lot by not using modern versions of GCC.





Re: should sync builtins be full optimization barriers?

2011-09-09 Thread Paolo Bonzini

On 09/09/2011 04:22 PM, Andrew MacLeod wrote:



> Yeah, some of this is part of the ongoing C++0x work... the memory model
> parameter is going to allow certain types of code movement in optimizers
> based on whether it's an acquire operation, a release operation, neither,
> or both.  It is ongoing and hopefully we will eventually have proper
> consistency.  The older __sync builtins are eventually going to invoke
> the new __sync_mem routines and their new patterns, but will fall back to
> the old ones if new patterns aren't specified.
>
> In the case of your program, this would in fact be a valid
> transformation, I believe...  __sync_lock_test_and_set is documented to
> only have ACQUIRE semantics.


Yes, that's true.  However, there's nothing special in the compiler to
handle __sync_lock_test_and_set differently (optimization-wise) from,
say, __sync_fetch_and_add.



> I don't see anything in this pattern, however, that would enforce acquire
> mode and prevent the reverse operation: moving something from after it to
> before it... so there may be a bug there anyway.


Yes.


> And I suspect most people actually expect all the old __sync routines to
> be full optimization barriers all the time...  maybe we should consider
> just doing that...


That would be very nice.  I would like to introduce that kind of data 
structure in QEMU, too. :)


Paolo


Re: Comparison of GCC-4.6.1 and LLVM-2.9 on x86/x86-64 targets

2011-09-09 Thread Jakub Jelinek
On Fri, Sep 09, 2011 at 10:26:22AM -0400, Vladimir Makarov wrote:
> Yes, Jakub.  It would be better to use corei7 with avx for GCC.
> Unfortunately, the last tuning which llvm 2.9 supports is core2
> therefore I used -march=core2 for comparison on x86-64.  So I think
> it would be unfair to use corei7 and avx for GCC without using it
> for LLVM.

LLVM 2.9 seems to accept -march=corei7 (though maybe it just accepts it
and tunes for core2 anyway, I haven't checked), but doesn't accept
-march=corei7-avx.

I wonder for which CPUs LLVM actually tunes, because
e.g. when I looked at Phoronix benchmarks (PovRay in particular),
GCC on that particular "benchmark" lost to LLVM because the configury
uses -march=k8 -mtune=k8 for x86_64-linux unconditionally, which wasn't
the best tuning for contemporary Intel CPUs, while LLVM apparently
didn't show much difference between k8 and corei7 tuning, see
http://phoronix.com/forums/showthread.php?59341-AMD-Llano-Compiler-Performance&p=224367#post224367

Jakub


Re: GCC 4.7.0 Status Report (2011-09-09)

2011-09-09 Thread Vladimir Makarov

On 09/09/2011 03:09 AM, Jakub Jelinek wrote:

> Status
> ======
>
> The trunk is in Stage 1, which, if we follow roughly the 4.6
> release schedule, should end around end of October.
> At this point I'd like to gather the status of the various
> development branches that haven't been merged into trunk yet
> and whether it is possible to merge them with such a schedule
> or whether e.g. a two-week delay would help them.
> In particular, is the transactional-memory branch mergeable within
> a month and a half, at least some parts of the cxx-mem-model branch,
> bitfield lowering?  What is the status of lra, reload-2a, pph,
> cilkplus, gupc (I assume at least some of these are 4.8+ material)?

LRA is a long project.  In the best case it will be ready for 4.8, but
more probably for 4.9.




Re: question on find_if_case_2 in ifcvt.c

2011-09-09 Thread Jeff Law

On 09/08/2011 08:20 PM, Amker.Cheng wrote:

> Hi,
> In ifcvt.c's function find_if_case_2, it uses cheap_bb_rtx_cost_p to
> judge the conversion.
>
> Function cheap_bb_rtx_cost_p checks whether the total insn_rtx_cost on
> non-jump insns in basic block BB is less than MAX_COST.
>
> So the question is why it uses cheap_bb_rtx_cost_p even when we know
> the ELSE is predicted, which means there is benefit from this
> conversion anyway.

Not necessarily.  This transformation speculates insns from the
ELSE path, so there's a cost every time we mispredict the branch.




> Second, should cheap_bb_rtx_cost_p be tuned to "check whether the
> total insn_rtx_cost on non-jump insns in basic block BB is no larger
> than MAX_COST", to prefer straight-line instructions over a branch
> even when they have the same cost?

Perhaps, but it's a corner case and I doubt it matters too much.

I have a pending patch which twiddles this code so that it takes into 
account the weight of the prediction.  This is important, particularly 
in the case where the ELSE is not predicted -- we're paying an awful 
cost for the speculation in that case.

jeff




gcc-4.6-20110909 is now available

2011-09-09 Thread gccadmin
Snapshot gcc-4.6-20110909 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.6-20110909/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.6 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_6-branch 
revision 178740

You'll find:

 gcc-4.6-20110909.tar.bz2 Complete GCC

  MD5=85e1d6a9d3e6eb8a9cebd231f7196f5b
  SHA1=7b701a49f48d544f5b37077f8caf22bdee5bc67e

Diffs from 4.6-20110902 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.6
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: Comparison of GCC-4.6.1 and LLVM-2.9 on x86/x86-64 targets

2011-09-09 Thread Lawrence Crowl
On 9/7/11, Vladimir Makarov  wrote:
> Some people asked me to do comparison of  GCC-4.6 and LLVM-2.9 (both
> released this spring) as I did GCC-LLVM comparison in previous year.
>
> You can find it on http://vmakarov.fedorapeople.org/spec under
> 2011 GCC-LLVM comparison tab entry.

The format of these graphs exaggerates differences.  The reason is
that our hind brains cannot help but compare the heights of bars and
ignore the non-zero bases.  In short, non-zero-based graphs are lies.
So, please 0-base all the graphs.  The graphs should show compilation
time from 0 up, execution time from 0 up, SPEC score from 0 up, etc.
A consequence is that you will get rid of the "change" graphs.

In my mind, an interesting graph would plot the execution
time of the benchmarks as a function of their compile time.
Such a graph would show, in particular, what you
buy or lose by changing compilers and/or optimization/debug levels.

-- 
Lawrence Crowl


Re: should sync builtins be full optimization barriers?

2011-09-09 Thread Geert Bosch

On Sep 9, 2011, at 04:17, Jakub Jelinek wrote:

> I'd say they should be optimization barriers too (and at the tree level
> they I think work that way, being represented as function calls), so if
> they don't act as memory barriers in RTL, the *.md patterns should be
> fixed.  The only exception should be IMHO the __SYNC_MEM_RELAXED
> variants - if the CPU can reorder memory accesses across them at will,
> why shouldn't the compiler be able to do the same as well?

They are different concepts. If a program runs on a single processor,
all memory operations will appear to be sequentially consistent, even if
the CPU reorders them at the hardware level.  However, compiler 
optimizations can still cause multiple threads to see the accesses 
as not sequentially consistent. 

For example, for atomic objects accessed only from a single processor
(but possibly multiple threads), you'd not want the compiler to reorder
memory accesses to global variables across the atomic operations, but
you wouldn't have to emit the expensive fences.

For the C++0x atomic types there are:

void A::store(C desired, memory_order order = memory_order_seq_cst) volatile;
void A::store(C desired, memory_order order = memory_order_seq_cst);

where the first variant (with order = memory_order_relaxed) 
would allow fences to be omitted, while still preventing the compiler from
reordering memory accesses, IIUC.

To be honest, I can't quite see the use of completely unordered
atomic operations, where we do not even prohibit compiler optimizations.
It would seem that if we guarantee that a variable will not be accessed
concurrently from any other thread, we wouldn't need the operation
to be atomic in the first place.  That said, it's quite likely I'm
missing something here.

For Ada, all atomic accesses are always memory_order_seq_cst, and we
just care about being able to optimize accesses if we know they'll be
done from the same processor. For the C++11 model, thinking about
the semantics of any memory orders other than memory_order_seq_cst
and their interaction with operations with different ordering semantics
makes my head hurt.

Regards,
  -Geert


Re: should sync builtins be full optimization barriers?

2011-09-09 Thread Paolo Bonzini
On Sat, Sep 10, 2011 at 03:09, Geert Bosch  wrote:
> For example, for atomic objects accessed only from a single processor
> (but  possibly multiple threads), you'd not want the compiler to reorder
> memory accesses to global variables across the atomic operations, but
> you wouldn't have  to emit the expensive fences.

I am not 100% sure, but I tend to disagree.  The original bug report
can be represented as

   node->next = NULL [relaxed];
   xchg(tail, node) [seq_cst];

and the problem was that the two operations were swapped.  But that's
not a problem with the first access, but rather with the second.  So
it should be fine if the [relaxed] access does not include a barrier,
because it relies on the [seq_cst] access to provide one later.

Paolo


Re: should sync builtins be full optimization barriers?

2011-09-09 Thread Jakub Jelinek
On Fri, Sep 09, 2011 at 09:09:27PM -0400, Geert Bosch wrote:
> To be honest, I can't quite see the use of completely unordered
> atomic operations, where we do not even prohibit compiler optimizations.
> It would seem that if we guarantee that a variable will not be accessed
> concurrently from any other thread, we wouldn't need the operation
> to be atomic in the first place.  That said, it's quite likely I'm
> missing something here.

E.g. OpenMP #pragma omp atomic just documents that the operation performed
on the variable is atomic, but has no requirement of being any kind of
barrier for stores/loads to/from other memory locations.  That is what I'd
like to use relaxed sync operations for.  Say
  var2 = 5;
#pragma omp atomic update
  var = var + 6;
  var3 = 7;
only guarantees that you atomically increment var by 6; the var2 store can
happen after it, or the var3 store before it (only stores/loads of var
itself must stay before/after the atomic operation in program order, and
you don't need any barriers for the rest).

Of course if you use atomic operations for locking etc. you want to
serialize other memory accesses too (say acquire, or release, or full
barriers).

Jakub