GCC 4.7.0 Status Report (2011-09-09)
Status
======

The trunk is in Stage 1, which, if we follow roughly the 4.6
release schedule, should end around the end of October.
At this point I'd like to gather the status of the various
development branches that haven't been merged into trunk yet,
and whether it is possible to merge them with such a schedule
or whether e.g. a two-week delay would help them.
In particular, is the transactional-memory branch mergeable within
a month and a half, at least some parts of the cxx-mem-model branch,
bitfield lowering?  What is the status of lra, reload-2a, pph,
cilkplus, gupc (I assume at least some of these are 4.8+ material)?

Quality Data
============

Priority        #       Change from Last Report
--------       ---      -----------------------
P1               6      +  6
P2              95      + 10
P3              59      + 56
               ---
Total          160      + 72

Previous Report
===============

http://gcc.gnu.org/ml/gcc/2011-03/msg00178.html

The next status report will be sent by Joseph.
Re: GCC 4.7.0 Status Report (2011-09-09)
On Fri, Sep 9, 2011 at 9:09 AM, Jakub Jelinek wrote:
> Status
> ======
>
> The trunk is in Stage 1, which, if we follow roughly the 4.6
> release schedule, should end around the end of October.
> At this point I'd like to gather the status of the various
> development branches that haven't been merged into trunk yet,
> and whether it is possible to merge them with such a schedule
> or whether e.g. a two-week delay would help them.
> In particular, is the transactional-memory branch mergeable within
> a month and a half, at least some parts of the cxx-mem-model
> branch, bitfield lowering?  What is the status of lra, reload-2a,
> pph, cilkplus, gupc (I assume at least some of these are 4.8+
> material)?

Bitfield lowering is not going to happen (well, not completely at
least) unless I can find some more time to work on it.  Instead I
want to finally make no-longer-sign-extending sizetypes happen for
4.7, which currently only waits on Ada frontend issues.

Btw, the end of October will then already be 7 1/2 months' worth of
stage1.

Richard.
should sync builtins be full optimization barriers?
Hi all,

sync builtins are described in the documentation as being full memory
barriers, with the possible exception of __sync_lock_test_and_set.
However, GCC does not enforce that they are also full _optimization_
barriers: the RTL produced for the builtins does not in general
include a memory optimization barrier such as a set of
(mem/v:BLK (scratch:P)).

This can cause problems with lock-free algorithms, for example this
one:

   http://libdispatch.macosforge.org/trac/ticket/35

This can be solved either in generic code, by wrapping the sync
builtins (before and after) with an asm("":::"memory"), or in the
individual machine descriptions, by adding a memory barrier in
parallel to the locked instructions or to the ll/sc instructions.

Is the above analysis correct?  Or should users put in explicit
compiler barriers?

Paolo
Re: should sync builtins be full optimization barriers?
On Fri, Sep 09, 2011 at 10:07:30AM +0200, Paolo Bonzini wrote:
> sync builtins are described in the documentation as being full
> memory barriers, with the possible exception of
> __sync_lock_test_and_set.  However, GCC does not enforce that they
> are also full _optimization_ barriers: the RTL produced for the
> builtins does not in general include a memory optimization barrier
> such as a set of (mem/v:BLK (scratch:P)).
>
> This can cause problems with lock-free algorithms, for example this
> one:
>
>    http://libdispatch.macosforge.org/trac/ticket/35
>
> This can be solved either in generic code, by wrapping the sync
> builtins (before and after) with an asm("":::"memory"), or in the
> individual machine descriptions, by adding a memory barrier in
> parallel to the locked instructions or to the ll/sc instructions.
>
> Is the above analysis correct?  Or should users put in explicit
> compiler barriers?

I'd say they should be optimization barriers too (and at the tree
level I think they work that way, being represented as function
calls), so if they don't act as memory barriers in RTL, the *.md
patterns should be fixed.  The only exception should be IMHO the
__SYNC_MEM_RELAXED variants - if the CPU can reorder memory accesses
across them at will, why shouldn't the compiler be able to do the
same as well?

	Jakub
Re: should sync builtins be full optimization barriers?
On 09/09/2011 10:17 AM, Jakub Jelinek wrote:
>> Is the above analysis correct?  Or should users put in explicit
>> compiler barriers?
>
> I'd say they should be optimization barriers too (and at the tree
> level I think they work that way, being represented as function
> calls), so if they don't act as memory barriers in RTL, the *.md
> patterns should be fixed.  The only exception should be IMHO the
> __SYNC_MEM_RELAXED variants - if the CPU can reorder memory
> accesses across them at will, why shouldn't the compiler be able
> to do the same as well?

Agreed, so we have a bug in all released versions of GCC. :(

Paolo
Re: GCC 4.7.0 Status Report (2011-09-09)
On 09/09/2011 03:09 AM, Jakub Jelinek wrote:
> In particular, is the transactional-memory branch mergeable within
> a month and a half, at least some parts of the cxx-mem-model
> branch,

There will certainly be some parts of the branch which would be
appropriate for merging with mainline in October.  We ought to at
least have the new __sync_mem builtins available to replace the old
ones, and the testing infrastructure.  I'm not sure we will have
*all* the infrastructure in place, but it should be pretty close if
not.  It's also fairly low risk.

Andrew
Re: Comparison of GCC-4.6.1 and LLVM-2.9 on x86/x86-64 targets
On 09/07/2011 12:23 PM, Vladimir Makarov wrote:
> On 09/07/2011 11:55 AM, Xinliang David Li wrote:
>> Why is lto/whole program mode not used in LLVM for peak
>> performance comparison?  (Of course, peak performance should
>> really use FDO..)
>
> Thanks for the feedback.  I did not manage to use LTO for LLVM as
> described on
> http://llvm.org/docs/LinkTimeOptimization.html#lto
> I am getting 'file not recognized: File format not recognized'
> during the linkage pass.
>
> You are probably right that I should use -Ofast without -flto for
> gcc then, although I don't think it would significantly change
> GCC's peak performance.  Still, I am going to run SPEC2000 without
> -flto and post the data (probably next week).
>
> As for FDO, unfortunately for some tests SPEC uses different
> training sets, and that sometimes gives wrong info for the further
> optimizations.
>
> I do not look at this comparison as finished work, and I am going
> to run more SPEC2000 tests and change the results if there are
> serious, reasonable objections to the current comparison.

I've added -Ofast without -flto -fwhole-program for GCC as well and
updated the graphs:

   http://vmakarov.fedorapeople.org/spec/
Re: should sync builtins be full optimization barriers?
On 09/09/2011 04:17 AM, Jakub Jelinek wrote:
> On Fri, Sep 09, 2011 at 10:07:30AM +0200, Paolo Bonzini wrote:
>> sync builtins are described in the documentation as being full
>> memory barriers, with the possible exception of
>> __sync_lock_test_and_set.  However, GCC does not enforce that
>> they are also full _optimization_ barriers: the RTL produced for
>> the builtins does not in general include a memory optimization
>> barrier such as a set of (mem/v:BLK (scratch:P)).
>>
>> This can cause problems with lock-free algorithms, for example
>> this one:
>>
>>    http://libdispatch.macosforge.org/trac/ticket/35
>>
>> This can be solved either in generic code, by wrapping the sync
>> builtins (before and after) with an asm("":::"memory"), or in the
>> individual machine descriptions, by adding a memory barrier in
>> parallel to the locked instructions or to the ll/sc instructions.
>>
>> Is the above analysis correct?  Or should users put in explicit
>> compiler barriers?
>
> I'd say they should be optimization barriers too (and at the tree
> level I think they work that way, being represented as function
> calls), so if they don't act as memory barriers in RTL, the *.md
> patterns should be fixed.  The only exception should be IMHO the
> __SYNC_MEM_RELAXED variants - if the CPU can reorder memory
> accesses across them at will, why shouldn't the compiler be able
> to do the same as well?

Yeah, some of this is part of the ongoing C++0x work... the memory
model parameter is going to allow certain types of code movement in
the optimizers, based on whether it's an acquire operation, a release
operation, neither, or both.  It is ongoing, and hopefully we will
eventually have proper consistency.

The older __sync builtins are eventually going to invoke the new
__sync_mem routines and their new patterns, but will fall back to the
old ones if new patterns aren't specified.

In the case of your program, this would in fact be a valid
transformation, I believe... __sync_lock_test_and_set is documented
to only have ACQUIRE semantics.  That does not guarantee that a store
BEFORE the operation will be visible in another thread, which means
it is possible to reorder it.  (A summary of the different modes can
be found at http://gcc.gnu.org/wiki/Atomic/GCCMM/Optimizations.)
So you would require a barrier before this code anyway for the
behaviour you are looking for.

Once the new routines are available and implemented, you could simply
specify the SEQ_CST model, and then it should, in theory, work
properly, with a barrier being emitted for you.

I don't see anything in this pattern, however, that would enforce
acquire mode and prevent the reverse operation - moving something
from after to before it - so there may be a bug there anyway.  And I
suspect most people actually expect all the old __sync routines to be
full optimization barriers all the time... maybe we should consider
just doing that...

Andrew
Re: Comparison of GCC-4.6.1 and LLVM-2.9 on x86/x86-64 targets
On 09/08/2011 04:47 AM, Jakub Jelinek wrote:
> On Wed, Sep 07, 2011 at 11:15:39AM -0400, Vladimir Makarov wrote:
>> This year I used -Ofast -flto -fwhole-program instead of -O3 for
>> GCC, and -O3 -ffast-math for LLVM, for comparison of peak
>> performance.  I could improve GCC performance even more by using
>> other GCC possibilities (like support of AVX insns, Graphite
>> optimizations, and even some experimental stuff like LIPO), but I
>> wanted to give LLVM some chances too.  Probably an experienced
>> LLVM user could improve LLVM performance as well.  So I think it
>> is a fair comparison.
>
> -march=native in addition would be nice to see; that can make a
> significant difference, especially on AVX-capable CPUs.  I guess
> the LLVM equivalent would be -march=corei7 -mtune=corei7 and, if
> it works, -mavx too (though, the only time I've tried LLVM 2.9 it
> crashed on almost anything with -mavx).

Yes, Jakub.  It would be better to use corei7 with avx for GCC.
Unfortunately, the latest tuning which llvm 2.9 supports is core2,
therefore I used -march=core2 for comparison on x86-64.  So I think
it would be unfair to use corei7 and avx for GCC without using them
for LLVM.

I mostly tried to compare the state of general optimizations in GCC
and LLVM.  There are a lot of other aspects of the compilers which we
could compare, and I did not do that.  But for me it is obvious that
Apple loses a lot by not using modern versions of GCC.
Re: should sync builtins be full optimization barriers?
On 09/09/2011 04:22 PM, Andrew MacLeod wrote:
> Yeah, some of this is part of the ongoing C++0x work... the memory
> model parameter is going to allow certain types of code movement in
> the optimizers, based on whether it's an acquire operation, a
> release operation, neither, or both.  It is ongoing, and hopefully
> we will eventually have proper consistency.
>
> The older __sync builtins are eventually going to invoke the new
> __sync_mem routines and their new patterns, but will fall back to
> the old ones if new patterns aren't specified.
>
> In the case of your program, this would in fact be a valid
> transformation, I believe... __sync_lock_test_and_set is documented
> to only have ACQUIRE semantics.

Yes, that's true.  However, there's nothing special in the compiler
to handle __sync_lock_test_and_set differently (optimization-wise)
from, say, __sync_fetch_and_add.

> I don't see anything in this pattern, however, that would enforce
> acquire mode and prevent the reverse operation - moving something
> from after to before it - so there may be a bug there anyway.

Yes.

> And I suspect most people actually expect all the old __sync
> routines to be full optimization barriers all the time... maybe we
> should consider just doing that...

That would be very nice.  I would like to introduce that kind of data
structure in QEMU, too. :)

Paolo
Re: Comparison of GCC-4.6.1 and LLVM-2.9 on x86/x86-64 targets
On Fri, Sep 09, 2011 at 10:26:22AM -0400, Vladimir Makarov wrote:
> Yes, Jakub.  It would be better to use corei7 with avx for GCC.
> Unfortunately, the latest tuning which llvm 2.9 supports is core2,
> therefore I used -march=core2 for comparison on x86-64.  So I think
> it would be unfair to use corei7 and avx for GCC without using them
> for LLVM.

LLVM 2.9 seems to accept -march=corei7 (though maybe it just accepts
it and tunes for core2 anyway, I haven't checked); it doesn't accept
-march=corei7-avx.

I wonder which CPUs LLVM actually tunes for, because e.g. when I
looked at Phoronix benchmarks (PovRay in particular), GCC on that
particular "benchmark" lost to LLVM because the configury uses
-march=k8 -mtune=k8 for x86_64-linux unconditionally, which wasn't
the best tuning for contemporary Intel CPUs, while LLVM apparently
didn't show much difference between k8 and corei7 tuning; see
http://phoronix.com/forums/showthread.php?59341-AMD-Llano-Compiler-Performance&p=224367#post224367

	Jakub
Re: GCC 4.7.0 Status Report (2011-09-09)
On 09/09/2011 03:09 AM, Jakub Jelinek wrote:
> Status
> ======
>
> The trunk is in Stage 1, which, if we follow roughly the 4.6
> release schedule, should end around the end of October.
> At this point I'd like to gather the status of the various
> development branches that haven't been merged into trunk yet,
> and whether it is possible to merge them with such a schedule
> or whether e.g. a two-week delay would help them.
> In particular, is the transactional-memory branch mergeable within
> a month and a half, at least some parts of the cxx-mem-model
> branch, bitfield lowering?  What is the status of lra, reload-2a,
> pph, cilkplus, gupc (I assume at least some of these are 4.8+
> material)?

LRA is a long project.  In the best case it will be ready for 4.8,
but most probably for 4.9.
Re: question on find_if_case_2 in ifcvt.c
On 09/08/2011 08:20 PM, Amker.Cheng wrote:
> Hi,
> In ifcvt.c's function find_if_case_2, it uses cheap_bb_rtx_cost_p
> to judge the conversion.  Function cheap_bb_rtx_cost_p checks
> whether the total insn_rtx_cost on non-jump insns in basic block BB
> is less than MAX_COST.
>
> So the question is: why use cheap_bb_rtx_cost_p even when we know
> the ELSE is predicted, which means there is benefit from this
> conversion anyway?

Not necessarily.  This transformation is speculating insns from the
ELSE path, so there's a cost every time we mispredict the branch.

> Second, should cheap_bb_rtx_cost_p be tuned to "check whether the
> total insn_rtx_cost on non-jump insns in basic block BB is no
> larger than MAX_COST", so as to prefer normal instructions over a
> branch even when they have the same cost?

Perhaps; it's a corner case and I doubt it matters too much.

I have a pending patch which twiddles this code so that it takes into
account the weight of the prediction.  This is important,
particularly in the case where the ELSE is not predicted - we're
paying an awful cost for the speculation in that case.

jeff
gcc-4.6-20110909 is now available
Snapshot gcc-4.6-20110909 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.6-20110909/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for
details.

This snapshot has been generated from the GCC 4.6 SVN branch
with the following options:
  svn://gcc.gnu.org/svn/gcc/branches/gcc-4_6-branch revision 178740

You'll find:

 gcc-4.6-20110909.tar.bz2     Complete GCC

  MD5=85e1d6a9d3e6eb8a9cebd231f7196f5b
  SHA1=7b701a49f48d544f5b37077f8caf22bdee5bc67e

Diffs from 4.6-20110902 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the
LATEST-4.6 link is updated and a message is sent to the gcc list.
Please do not use a snapshot before it has been announced that way.
Re: Comparison of GCC-4.6.1 and LLVM-2.9 on x86/x86-64 targets
On 9/7/11, Vladimir Makarov wrote:
> Some people asked me to do a comparison of GCC-4.6 and LLVM-2.9
> (both released this spring), as I did a GCC-LLVM comparison in the
> previous year.
>
> You can find it on http://vmakarov.fedorapeople.org/spec under
> the 2011 GCC-LLVM comparison tab entry.

The format of these graphs exaggerates differences.  The reason is
that our hind brains cannot help but compare the heights of the bars
and ignore the non-zero bases.  In short, non-zero-based graphs are
lies.

So, please 0-base all the graphs.  The graphs should show compilation
time from 0 up, execution time from 0 up, SPEC score from 0 up, etc.
A consequence is that you will get rid of the "change" graphs.

In my mind, an interesting graph would plot the execution time of the
benchmarks as a function of their compile time.  This graph would
show you, in particular, what you buy or lose by changing compilers
and/or optimization/debug levels.

-- 
Lawrence Crowl
Re: should sync builtins be full optimization barriers?
On Sep 9, 2011, at 04:17, Jakub Jelinek wrote:
> I'd say they should be optimization barriers too (and at the tree
> level I think they work that way, being represented as function
> calls), so if they don't act as memory barriers in RTL, the *.md
> patterns should be fixed.  The only exception should be IMHO the
> __SYNC_MEM_RELAXED variants - if the CPU can reorder memory
> accesses across them at will, why shouldn't the compiler be able
> to do the same as well?

They are different concepts.  If a program runs on a single
processor, all memory operations will appear to be sequentially
consistent, even if the CPU reorders them at the hardware level.
However, compiler optimizations can still cause multiple threads to
see the accesses as not sequentially consistent.

For example, for atomic objects accessed only from a single processor
(but possibly multiple threads), you'd not want the compiler to
reorder memory accesses to global variables across the atomic
operations, but you wouldn't have to emit the expensive fences.

For the C++0x atomic types there are:

  void A::store(C desired,
                memory_order order = memory_order_seq_cst) volatile;
  void A::store(C desired,
                memory_order order = memory_order_seq_cst);

where the first variant (with order = memory_order_relaxed) would
allow fences to be omitted, while still preventing the compiler from
reordering memory accesses, IIUC.

To be honest, I can't quite see the use of completely unordered
atomic operations, where we do not even prohibit compiler
optimizations.  It would seem that if we guarantee that a variable
will not be accessed concurrently from any other thread, we wouldn't
need the operation to be atomic in the first place.  That said, it's
quite likely I'm missing something here.

For Ada, all atomic accesses are always memory_order_seq_cst, and we
just care about being able to optimize accesses if we know they'll
be done from the same processor.  For the C++11 model, thinking about
the semantics of any memory orders other than memory_order_seq_cst,
and their interaction with operations with different ordering
semantics, makes my head hurt.

Regards,
  -Geert
Re: should sync builtins be full optimization barriers?
On Sat, Sep 10, 2011 at 03:09, Geert Bosch wrote:
> For example, for atomic objects accessed only from a single
> processor (but possibly multiple threads), you'd not want the
> compiler to reorder memory accesses to global variables across the
> atomic operations, but you wouldn't have to emit the expensive
> fences.

I am not 100% sure, but I tend to disagree.  The original bug report
can be represented as

   node->next = NULL;    [relaxed]
   xchg(tail, node);     [seq_cst]

and the problem was that the two operations were swapped.  But that's
not a problem with the first access, but rather with the second.  So
it should be fine if the [relaxed] access does not include a barrier,
because it relies on the [seq_cst] access providing it later.

Paolo
Re: should sync builtins be full optimization barriers?
On Fri, Sep 09, 2011 at 09:09:27PM -0400, Geert Bosch wrote:
> To be honest, I can't quite see the use of completely unordered
> atomic operations, where we do not even prohibit compiler
> optimizations.  It would seem that if we guarantee that a variable
> will not be accessed concurrently from any other thread, we
> wouldn't need the operation to be atomic in the first place.  That
> said, it's quite likely I'm missing something here.

E.g. OpenMP's #pragma omp atomic just documents that the operation
performed on the variable is atomic, but has no requirement that it
be any kind of barrier for stores/loads to/from other memory
locations.  That is what I'd like to use relaxed sync operations for.
Say

  var2 = 5;
  #pragma omp atomic update
  var = var + 6;
  var3 = 7;

only guarantees that you atomically increment var by 6; the var2
store can happen after it, or the var3 store before it (only var
stores/loads must stay before/after the atomic operation in program
order, but you don't need any barriers for the rest).  Of course, if
you use atomic operations for locking etc., you want to serialize
other memory accesses too (say with acquire, release, or full
barriers).

	Jakub