configure --program-suffix could change gccjit library path?
Hello All,

Imagine that someone (e.g. a distribution packager) wants to have several versions of gccjit, just as he can have several versions of gcc -- concretely, GCC 5.2 alongside GCC 6.1. I would believe that the --program-suffix argument of configure would be useful here.

So he/she would compile GCC 5.2 with a --program-suffix=-5 argument to configure, and would like the gcc compiler to go in /usr/local/bin/gcc-5 and the GCCJIT library to go in /usr/local/lib/libgccjit-5.so, so users of GCCJIT 5.2 would link with -lgccjit-5. And he/she would also compile GCC 6.1 with a --program-suffix=-6 argument to configure, and would like the gcc compiler to go in /usr/local/bin/gcc-6 and the GCCJIT library to go in /usr/local/lib/libgccjit-6.so, so users of GCCJIT 6.1 would link with -lgccjit-6.

Having two different GCCJIT libraries is IMHO a legitimate wish (likewise, one can easily have several versions of the LLVM libraries on Debian). AFAIU, --program-suffix is not yet understood by the GCCJIT configuration machinery. This is a wish, since I don't know autoconf well enough to be able to propose any patch. Or perhaps there is some existing configure switch already related to the location of libgccjit?

Regards.
-- Basile Starynkevitch http://starynkevitch.net/Basile/ France
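For concreteness, the wished-for workflow might look like the sketch below. The suffixed libgccjit install paths are the *desired* outcome Basile describes, not something GCC's configure currently does; the source directories and prefix are hypothetical.

```shell
# Hypothetical side-by-side builds (sketch only).
# --enable-host-shared and --enable-languages=jit are real requirements
# for building libgccjit; the suffixed library paths are the wish.
../gcc-5.2.0/configure --program-suffix=-5 --enable-languages=jit,c++ \
    --enable-host-shared --prefix=/usr/local
make && make install   # wish: installs /usr/local/lib/libgccjit-5.so

../gcc-6.1.0/configure --program-suffix=-6 --enable-languages=jit,c++ \
    --enable-host-shared --prefix=/usr/local
make && make install   # wish: installs /usr/local/lib/libgccjit-6.so

# Client code would then pick the matching version at link time:
#   cc myjit.c -lgccjit-5     or     cc myjit.c -lgccjit-6
```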
Transformation from SEME (Single Entry Multiple Exit) to SESE (Single Entry Single Exit)
All:

Single-entry multiple-exit (SEME) regions disable traditional loop optimizations; the presence of short-circuit evaluation also turns the CFG into a single-entry multiple-exit form. Transforming SEME (single entry, multiple exits) regions into SESE (single entry, single exit) regions enables many loop optimizations. An approach like node splitting, which converts SEME regions into SESE regions, is an important transformation on the CFG with respect to loops and conditionals; the loop transformations in LLVM perform node splitting to convert SEME regions into SESE regions.

The presence of break and goto statements inside loops makes the CFG unstructured, turning it into a SEME region; converting such unstructured control flow into structured control flow enables many loop transformations. I would like to implement a transformation phase on loops, run before any loop optimization pass, that transforms an unstructured CFG into a structured CFG, as LLVM does. Does GCC already have such transformation passes on loops? Please share your thoughts.

Thanks & Regards
Ajit
Re: making the new if-converter not mangle IR that is already vectorizer-friendly
Hi, pleased to meet you :)

Abe wrote:
> As some of you already know, at SARC we are working on a new "if
> converter" to help convert simple "if"-based blocks of code that appear
> inside loops into an autovectorizer-friendly form that closely resembles
> the C ternary operator ["c ? x : y"]. GCC already has such a converter,
> but it is off by default, in part because it is unsafe: if enabled, it
> can cause certain code to be transformed in such a way that it
> malfunctions even though the non-converted code worked just fine with
> the same inputs. The new converter, originally by my teammate Sebastian
> Pop, is safer [almost-always safe *]; we are working on getting it into
> good-enough shape that the always-safe transformations can be turned on
> by default whenever the autovectorizer is on.
>
> * Always safe for stores, sometimes a little risky for loads:
> speculative loads might cause multithreaded programs with insufficient
> locking to fail due to writes by another thread being "lost"/"missed",
> even though the same program works OK "by luck" when compiled without
> if-conversion of loads. This risk comes mainly/only from what the
> relevant literature calls a "half hammock": an "if" with a "then"
> section but no "else" section [or effectively vice-versa, e.g. an empty
> "then" and a non-empty "else"]. In this case, e.g. "if (c) X[x] = Y[y];"
> with no attached "else" section is risky to fully if-convert in the
> event of the code being compiled running multithreaded and not having
> been written with all the locking it really needs. Respectively, e.g.
> "if (c) ; /* empty ''then'' */ else X[x] = Y[y];".

For the unenlightened, can you outline the problem with this code sequence? (i.e. the expected transformation that makes it unsafe!?) I would hope your scratchpad patch would turn this into something like

  a1 = c ? &Y[y] : &scratch;
  temp = *a1;
  a2 = c ? &X[x] : &scratch;
  *a2 = temp;

which seems OK to me - so is the scratchpad approach going away?
(The problem that things might be read in a different order *across* the elements of a vector, I can see, but that belongs in the domain of the vectorizer itself, not if-conversion, I would think?)

> One of the reasons the new if converter has not yet been submitted for
> incorporation into GCC's trunk is that it still has some performance
> regressions WRT the old converter, and most of those are "true
> regressions", i.e. not just because the old converter was less safe and
> the additional safety is what is causing the loss, but rather because
> there is more work to do before the patch is ready. As of this writing,
> the new if converter sometimes tries to "convert" something that is
> already vectorizer-friendly, and in doing so it renders that code
> now-NOT-vectorizer-friendly.

Can you give an example? My understanding was that the existing vectorizer bailed out pretty much straightaway if the number of basic blocks in the loop was not exactly 2 (for inner loops) or 5 (for outermost loops, i.e. containing exactly one inner loop)... that seems to rule out vectorization of *any* kind of conditional execution that the if-converter might convert?

Thanks, Alan
Consideration of Cost associated with SEME regions.
All: The Cost Calculation for a candidate to Spill in the Integrated Register Allocator(IRA) considers only the SESE regions. The Cost Calculation in the IRA should consider the SEME regions into consider for spilling decisions. The Cost associated with the path that has un-matured exists should be less, thus making the more chances of spilling decision In the path of un-matured exits. The path that has un-matured (normal )exists should be having a higher cost than the cost of un-matured exists and Spilling decisions has to made accordingly in order to spill inside the less frequency path with the un-matured exists than the high frequency Path with the normal exits. I would like to propose the above for consideration of cost associated with SEME regions in IRA. Thoughts? Thanks & Regards Ajit
Re: GCC 5.1.1 Status Report (2015-06-22)
On 22/06/15 12:56, Richard Biener wrote:
> I plan to release GCC 5.2.0 around July 10th which means a release
> candidate being done around July 3rd.
>
> Please check your open regression bugs for ones that are eligible for
> backporting. Also please help getting the P1 bug count to zero
> (there is still the ARM aligned argument passing ABI issue).

I'd like to get the fix for PR target/65697 (weak memory barriers for __sync builtins on ARMv8) into GCC-5.2.

The backported patches for the Aarch64 back-end have been submitted:
https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01937.html

There's a reviewer for the Aarch64 patches, but I still need a reviewer for the first patch, which touches the middle-end and several back-ends.

Matthew
RE: Consideration of Cost associated with SEME regions.
Sorry for the typo error: I meant "exits" instead of "exists". The below is corrected.

The cost calculation for a candidate to spill in the Integrated Register Allocator (IRA) considers only SESE regions. The cost calculation in the IRA should take SEME regions into consideration for spilling decisions. The cost associated with a path that has un-matured exits should be lower, thus making a spilling decision more likely in the path of un-matured exits. A path that has a normal exit should have a higher cost than one with an un-matured exit, and spilling decisions have to be made accordingly, in order to spill inside the lower-frequency path with the un-matured exits rather than the higher-frequency path with the normal exits. I would like to propose the above for consideration of the cost associated with SEME regions in IRA. Thoughts?

Thanks & Regards
Ajit

-----Original Message-----
From: Ajit Kumar Agarwal
Sent: Thursday, July 02, 2015 3:33 PM
To: vmaka...@redhat.com; l...@redhat.com; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Consideration of Cost associated with SEME regions.

All: The Cost Calculation for a candidate to Spill in the Integrated Register Allocator(IRA) considers only the SESE regions. The Cost Calculation in the IRA should consider the SEME regions into consider for spilling decisions. The Cost associated with the path that has un-matured exists should be less, thus making the more chances of spilling decision In the path of un-matured exits. The path that has un-matured (normal )exists should be having a higher cost than the cost of un-matured exists and Spilling decisions has to made accordingly in order to spill inside the less frequency path with the un-matured exists than the high frequency Path with the normal exits. I would like to propose the above for consideration of cost associated with SEME regions in IRA. Thoughts? Thanks & Regards Ajit
Re: GCC 5.1.1 Status Report (2015-06-22)
On Thu, 2 Jul 2015, Matthew Wahab wrote:
> On 22/06/15 12:56, Richard Biener wrote:
> >
> > I plan to release GCC 5.2.0 around July 10th which means a release
> > candidate being done around July 3rd.
> >
> > Please check your open regression bugs for ones that eligible for
> > backporting. Also please help getting the P1 bug count to zero
> > (there is still the ARM aligned argument passing ABI issue).
>
> I'd like to get the fix for PR target/65697 (weak memory barriers for __sync
> builtins on ARMv8) into GCC-5.2.
>
> The backported patches for the Aarch64 back-end have been submitted:
> https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01937.html
>
> There's a reviewer for the Aarch64 patches but I still need a reviewer for
> the first patch which touches the middle-end and several back-ends.

The patch is ok to backport.

Thanks,
Richard.
Re: GCC 5.1.1 Status Report (2015-06-22)
On Thu, Jul 2, 2015 at 12:03 PM, Matthew Wahab wrote:
> On 22/06/15 12:56, Richard Biener wrote:
>>
>> I plan to release GCC 5.2.0 around July 10th which means a release
>> candidate being done around July 3rd.
>>
>> Please check your open regression bugs for ones that eligible for
>> backporting. Also please help getting the P1 bug count to zero
>> (there is still the ARM aligned argument passing ABI issue).
>
> I'd like to get the fix for PR target/65697 (weak memory barriers for __sync
> builtins on ARMv8) into GCC-5.2.
>
> The backported patches for the Aarch64 back-end have been submitted:
> https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01937.html
>
> There's a reviewer for the Aarch64 patches but I still need a reviewer for
> the first patch which touches the middle-end and several back-ends.

I'd also like to see the ARM patches backported please, unless there are RM objections. It doesn't make sense for them to diverge with something like this at different release points on the branch if we can avoid it.

Ramana

> Matthew
Re: Consideration of Cost associated with SEME regions.
On 07/02/2015 07:06 AM, Ajit Kumar Agarwal wrote:
> Sorry for the typo error. I meant exits instead of exists. The below is
> corrected.
>
> The Cost Calculation for a candidate to Spill in the Integrated Register
> Allocator (IRA) considers only the SESE regions. The Cost Calculation in
> the IRA should take the SEME regions into consideration for spilling
> decisions.

IRA is a regional allocator: it spills pseudos within a region. Currently the regions are loops, since those are the most important for optimization, and loops can have more than one exit. So your assumption that IRA works on SESE regions is not accurate. IRA could, with some work, be extended to non-loop regions too. I am not sure it would give a significant improvement (LRA partially compensates for this by inheritance in EBBs), but it would definitely slow down IRA, whose speed heavily depends on the number of regions (therefore a lot was done in IRA to decrease the number of regions by merging regions with low register pressure). Of course, it would be interesting to implement non-loop regions in IRA and see the results.

> The Cost associated with the path that has un-matured exits should be
> less, thus making the more chances of spilling decision in the path of
> un-matured exits. The path that has normal exit should be having a
> higher cost than the cost of un-matured exit and Spilling decisions has
> to made accordingly in order to spill inside the less frequency path
> with the un-matured exits than the high frequency Path with the normal
> exits. I would like to propose the above for consideration of cost
> associated with SEME regions in IRA.

IRA uses the standard GCC evaluation of edge and basic-block execution frequencies, either static (see predict.c) or based on an execution profile (see *profile.c). Without profile usage it might be inaccurate; maybe it can be improved. To me it makes more sense to work on that code instead of working on IRA code only.
%fs and %gs segments on x86/x86-64
Hi all,

I implemented support for %fs and %gs segment prefixes on the x86 and x86-64 platforms, in what turns out to be a small patch. For those not familiar with it: at least on x86-64, %fs and %gs are two special registers whose value a user program can ask to be added to the address in any memory-accessing machine instruction. This is done with a one-byte instruction prefix, "%fs:" or "%gs:". The actual value stored in these two registers cannot be modified quickly (at least before the Haswell CPU), but the general idea is that they are rarely modified. Speed-wise, though, an instruction like "movq %gs:(%rdx), %rax" runs at the same speed as a "movq (%rdx), %rax" would. (I failed to measure any difference, but I guess that the instruction is one more byte in length, which means that a large quantity of them would tax the instruction caches a bit more.)

For reference, the pthread library on x86-64 uses %fs to point to thread-local variables. There are a number of special modes in gcc that already produce instructions like "movq %fs:(16), %rax" to load thread-local variables (declared with __thread). However, this support is special-case only, and the %gs register is free to use. (On x86, %gs is used by pthread and %fs is free to use.)

So what I did is to add the __seg_fs and __seg_gs address spaces. They are used like this, for example:

  typedef __seg_gs struct myobject_s { int a, b, c; } myobject_t;

You can then use variables of type "struct myobject_s *o1" as regular pointers, and "myobject_t *o2" as %gs-based pointers. Accesses to "o2->a" are compiled to instructions that use the %gs prefix; accesses to "o1->a" are compiled as usual. These two pointer types are incompatible. How you obtain %gs-based pointers, or control the value of %gs itself, is out of the scope of gcc; you do that by using the correct system calls and by manual arithmetic.
There is no automatic conversion; the C code can contain casts between the three address spaces (regular, %fs and %gs) which, like regular pointer casts, are no-ops.

My motivation comes from the PyPy-STM project ("removing the Global Interpreter Lock" for this Python interpreter). In this project, I want *almost all* pointer manipulations to resolve to different addresses depending on which thread runs the code. The idea is to use mmap() tricks to ensure that the actual memory usage remains reasonable, by sharing most of the pages (but not all of them) between each thread's "segment". So most accesses to a %gs-prefixed address actually access the same physical memory in all threads, but not all of them. This gives me a dynamic way to have a large quantity of data which every thread can read, and by occasionally changing the mapping of a single page, I can make some changes thread-local, i.e. invisible to other threads.

Of course, the same effect can be achieved in other ways, like declaring a regular "__thread intptr_t base;" and adding the "base" explicitly to every pointer access. Clearly, this would have a large performance impact; the %gs solution comes at almost no cost. The patched gcc is able to compile the hundreds of MBs of (generated) C code with systematic %gs usage and seems to work well (with one exception, see below). Is there interest in that? And if so, how to progress?

* The patch included here is very minimal. It is against the gcc_5_1_0_release branch, but adapting it to trunk should be straightforward.

* I'm unclear whether target_default_pointer_address_modes_p() should return "true" or not in this situation: i386-c.c now defines more than the default address mode, but the new ones also use pointers of the same standard size.

* One case in which this patched gcc miscompiles code is found in the attached bug1.c/bug1.s. (This case almost never occurs in PyPy-STM, so I could work around it easily.)
I think that some early, pre-RTL optimization is to "blame" here, possibly getting confused because the nonstandard address spaces also use the same size for pointers. Of course it is also possible that I messed up somewhere, or that the whole idea is doomed because many optimizations make a similar assumption. Hopefully not: it is the only issue I encountered.

* The extra byte needed for the "%gs:" prefix is not explicitly accounted for. Is it only by chance that I did not observe gcc underestimating how large the code it writes is, and then e.g. using jump instructions that would be rejected by the assembler?

* For completeness: this is very similar to clang's __attribute__((address_space(256))), but a few details differ. (Also, not to discredit other projects on a competitor's mailing list, but I had to fix three distinct bugs in llvm before I could use it. It contributes to me having more trust in gcc...)

Links for more info about pypy-stm:

* http://morepypy.blogspot.ch/2015/03/pypy-stm-251-released.html
* https://bitbucket.org/pypy/stmgc/src/use-gcc/gcc-seg-gs/
* https://bitbucket.org/pypy/stmgc/src/use-gcc/c8/stmgc.h

Than
Re: GCC 5.1.1 Status Report (2015-06-22)
On 02/07/15 13:40, Ramana Radhakrishnan wrote: On Thu, Jul 2, 2015 at 12:03 PM, Matthew Wahab wrote: I'd like to get the fix for PR target/65697 (weak memory barriers for __sync builtins on ARMv8) into GCC-5.2. I'd also like to see the ARM patches backported please unless there are RM objections for it. The patches are up for review: https://gcc.gnu.org/ml/gcc-patches/2015-07/msg00129.html Matthew
libgomp: Purpose of gomp_thread_pool::last_team?
Hello, does anyone know what the purpose of gomp_thread_pool::last_team is? This field seems to be used to delay the team destruction in gomp_team_end() in case the team has more than one thread and the previous team state has no team associated (does this identify a master thread?):

  if (__builtin_expect (thr->ts.team != NULL, 0)
      || __builtin_expect (team->nthreads == 1, 0))
    free_team (team);
  else
    {
      struct gomp_thread_pool *pool = thr->thread_pool;
      if (pool->last_team)
        free_team (pool->last_team);
      pool->last_team = team;
    }

Why can you not immediately free the team?

-- Sebastian Huber, embedded brains GmbH
Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone : +49 89 189 47 41-16
Fax : +49 89 189 47 41-09
E-Mail : sebastian.huber at embedded-brains.de
PGP : Public key available on request.

This message is not a business communication within the meaning of the EHUG.
Re: libgomp: Purpose of gomp_thread_pool::last_team?
On Thu, Jul 02, 2015 at 09:57:20PM +0200, Sebastian Huber wrote:
> does anyone know what the purpose of gomp_thread_pool::last_team is? This
> field seems to be used to delay the team destruction in gomp_team_end() in
> case the team has more than one thread and the previous team state has no
> team associated (identifies this a master thread?):
>
>   if (__builtin_expect (thr->ts.team != NULL, 0)
>       || __builtin_expect (team->nthreads == 1, 0))
>     free_team (team);
>   else
>     {
>       struct gomp_thread_pool *pool = thr->thread_pool;
>       if (pool->last_team)
>         free_team (pool->last_team);
>       pool->last_team = team;
>     }
>
> Why can you not immediately free the team?

That was added with https://gcc.gnu.org/ml/gcc-patches/2008-05/msg01674.html and the purpose is, for non-nested teams, to make sure all the threads in the team move on to the pool's barrier before the team's barrier is destroyed.

Jakub
Re: making the new if-converter not mangle IR that is already vectorizer-friendly
On 7/2/15 4:30 AM, Alan Lawrence wrote:
> Hi, pleased to meet you :)

Likewise. :-)

[Abe wrote:]
> * Always safe for stores, sometimes a little risky for loads:
> speculative loads might cause multithreaded programs with insufficient
> locking to fail due to writes by another thread being "lost"/"missed",
> even though the same program works OK "by luck" when compiled without
> if-conversion of loads. This risk comes mainly/only from what the
> relevant literature calls a "half hammock": an "if" with a "then"
> section but no "else" section [or effectively vice-versa, e.g. an empty
> "then" and a non-empty "else"]. In this case, e.g. "if (c) X[x] = Y[y];"
> with no attached "else" section is risky to fully if-convert in the
> event of the code being compiled running multithreaded and not having
> been written with all the locking it really needs. Respectively, e.g.
> "if (c) ; /* empty ''then'' */ else X[x] = Y[y];".

[Alan wrote:]
> For the unenlightened, can you outline the problem with this code
> sequence? (i.e. the expected transformation that makes it unsafe!?) I
> would hope your scratchpad patch would turn this into something like
>
>   a1 = c ? &Y[y] : &scratch;
>   temp = *a1;
>   a2 = c ? &X[x] : &scratch;
>   *a2 = temp;
>
> which seems OK to me

Yes, you are right. The problem I was thinking about is not present in the above: in the "'c' is false" case, the vectorized code for the above just wastes some effort by reading garbage from the scratchpad and writing it back to the scratchpad.

> so is the scratchpad approach going away?

Not at all. :-) My examples were not written well with regard to expressing what I had in mind. The problem I was thinking about is shown with a scalar destination, e.g.:

  if (c) foo = X[x];

... which is if-converted into the equivalent of:

  foo = c ? X[x] : foo;

The [perceived/potential] problem with the preceding is that the part that is equivalent to "foo = foo;" can _not_ be optimized out, as it normally would be for a non-"volatile" "foo", because it is part of a larger vectorized operation and this small part cannot be broken out of the whole without breaking vectorization. Therefore, the value of "foo" might be read and then written back a few {micro|nano|pico|whatever}-seconds later, which may cause an update to the same location to be overwritten.

That's why the pathological program-under-compilation is a badly-written multithreaded program without enough locking: without the "if something, then overwrite" being replaced by "read, then if something then write the new value, and if not that same something then rewrite the old value", the badly-written multithreaded program might work correctly through "good luck", but with the replacement [i.e. the if-conversion] the chances of success [where "success" here basically means not "missing" a write by another thread] are lower than they were before.

However, after some discussion with Sebastian I learned that this is already taken care of, i.e. safe: the "read, then if something then write the new value, and if not that same something then rewrite the old value" replacement strategy is only used for thread-local scalars. For global scalars and static scalars, we treat the scalar as if it were the first element of a length-1 array and don't have this problem. In other words, the problem about which I was concerned is not going to be triggered by e.g. "if (c) x = ..." which lacks an attached "else x = ..." in a multithreaded program without enough locking, just because 'x' is global/static. The only remaining case to consider is if some code being compiled takes the address of something thread-local and then "gives" that pointer to another thread.
Even for _that_ extreme case, Sebastian says that the gimplifier will detect this "address has been taken" situation and do the right thing, such that the new if-converter also does the right thing. TL;DR: Abe was being too paranoid; to the best of our knowledge, it's OK as-is. ;-)

[Abe wrote:]
> One of the reasons the new if converter has not yet been submitted for
> incorporation into GCC's trunk is that it still has some performance
> regressions WRT the old converter, and most of those are "true
> regressions", i.e. not just because the old converter was less safe and
> the additional safety is what is causing the loss, but rather because
> there is more work to do before the patch is ready. As of this writing,
> the new if converter sometimes tries to "convert" something that is
> already vectorizer-friendly, and in doing so it renders that code
> now-NOT-vectorizer-friendly.

[Alan wrote:]
> Can you give an example?

The test cases in the GCC tree at "gcc.dg/vect/pr61194.c" and "gcc.dg/vect/vect-mask-load-1.c" currently test as follows: the new if-converter is "converting" something that's already vectorizer-friendly, messing up the IR in the process and thus disabling vectorization for the test case in question.