Unrolling factor heuristics for Loop Unrolling
Hello All:

Loop unrolling without a good unrolling-factor heuristic can itself become a performance bottleneck. A heuristic based on the minimum initiation interval (MII) is quite useful for improving ILP. The MII, derived from the recurrence and resource calculations on the data dependency graph, together with register pressure, can be used to drive the unrolling factor. To achieve better ILP for a given schedule, loop unrolling and scheduling are interdependent; this interplay has been widely used in the software-pipelining literature, along with the more granular list and trace scheduling.

The recurrence calculation based on loop-carried dependencies, and the resource calculation based on simultaneous resource use as modelled by the reservation table, give a good basis for computing the unrolling factor; both are already captured in the MII calculation. Along with the MII, register pressure should also be considered, which gives a better heuristic for the unrolling factor.

The main advantage of this heuristic is that it can be implemented at the code-generation level, whereas loop unrolling is currently done much earlier. Keeping the current implementation, where unrolling is performed in the loop optimizer, the heuristic above can then be applied again at the code-generation level, using the accurate register-pressure information available to the register allocator. This looks like a feasible solution, and it is what I intend to propose: loop unrolling at the optimizer level plus unrolling at the code-generation level. This two-level unrolling should overcome the shortcomings of unrolling only at the optimizer level.

The SPEC benchmarks are better candidates for evaluating these heuristics than MiBench and EEMBC.

Thanks & Regards
Ajit
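As a rough illustration of the proposal above, here is a minimal C sketch of how such a heuristic could be computed. The helpers rec_mii(), res_mii(), reg_pressure() and target_num_regs() are hypothetical placeholders for the DDG recurrence analysis, the reservation-table resource analysis and the register-allocator pressure estimate; the formula is only one plausible reading of the idea, not an existing GCC interface.

  /* Hypothetical helpers standing in for the analyses described above.  */
  struct loop_info;                               /* opaque loop descriptor */
  extern int rec_mii (struct loop_info *loop);    /* II bound from loop-carried deps */
  extern int res_mii (struct loop_info *loop);    /* II bound from the reservation table */
  extern int reg_pressure (struct loop_info *loop, int unroll);
  extern int target_num_regs (void);

  static int
  unroll_factor_heuristic (struct loop_info *loop, int max_unroll)
  {
    int rec = rec_mii (loop);
    int res = res_mii (loop);
    /* MII = max (recurrence-constrained II, resource-constrained II).  */
    int mii = rec > res ? rec : res;
    if (mii <= 0)
      mii = 1;

    /* If recurrences dominate (rec > res), each iteration leaves resource
       slack; unrolling by roughly ceil (MII / ResMII) iterations can fill it.  */
    int factor = res > 0 ? (mii + res - 1) / res : 1;
    if (factor < 1)
      factor = 1;
    if (factor > max_unroll)
      factor = max_unroll;

    /* Back off while the estimated register pressure of the unrolled body
       exceeds the registers available to the allocator.  */
    while (factor > 1 && reg_pressure (loop, factor) > target_num_regs ())
      factor--;

    return factor;
  }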
Re: Unrolling factor heuristics for Loop Unrolling
On Thu, 2015-02-12 at 10:09 +, Ajit Kumar Agarwal wrote: > Hello All: > > The Loop unrolling without good unrolling factor heuristics becomes the > performance bottleneck. The Unrolling factor heuristics based on minimum > Initiation interval is quite useful with respect to better ILP. The minimum > Initiation interval based on recurrence and resource calculation on Data > Dependency Graph along with the register pressure can be used to add the > unrolling factor heuristics. To achieve better ILP with the given schedule, > the Loops unrolling and the scheduling are inter dependent and has been > widely used in Software Pipelining Literature along with the more granular > List and Trace Scheduling. > > The recurrence calculation based on the Loop carried dependencies and the > resource allocation based on the simultaneous access of the resources > Using the reservation table will give good heuristics with respect to > calculation of unrolling factor. This has been taken care in the > MII interval Calculation. > > Along with MII, the register pressure should also be considered in the > calculation of heuristics for unrolling factor. > > This enable better heuristics with respect to unrolling factor. The main > advantage of the above heuristics for unrolling factor is that it can be > Implemented in the Code generation Level. Currently Loop unrolling is done > much before the code generation. Let's go by the current implementation > Of doing Loop unrolling optimization at the Loop optimizer level and > unrolling happens. After the Current unrolling at the optimizer level the > above heuristics > Can be used to do the unrolling at the Code generation Level with the > accurate Register pressure calculation as done in the register allocator and > the > Unrolling is done at the code generation level. This looks feasible solution > which I am going to propose for the above unrolling heuristics. > > This enables the Loop unrolling done at the Optimizer Level + at the Code > Generation Level. This double level of Loop unrolling is quite useful. > This will overcome the shortcomings of the Loop unrolling at the optimizer > level. > > The SPEC benchmarks are the better candidates for the above heuristics > instead of Mibench and EEMBC. Not taking register pressure into account when unrolling (and doing other optimizations/choices) is an old problem. See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=20969 Cheers, Oleg
libcc1 dependencies
Hello,

I am trying to build a cross-compiler for ARM, as I have done for years. I am keen on not having dependencies on shared libraries, so that the compiler can be used on multiple systems. Up until 4.9 the only dependency was libc. I tried the gcc-5-20150208 snapshot, and now I get dependencies on the build compiler's libstdc++ and libgcc_s. This is doubly annoying because that is not the default compiler of the OS, so those paths might not exist and there is a risk of mixing versions. The rest of GCC adheres to the configuration, but for libcc1 this seems to be lacking. I use this configuration to statically link the C++ runtime: --with-host-libstdcxx="-Wl,-Bstatic,`g++ --print-file-name libstdc++.a`,`g++ --print-file-name libsupc++.a`,-Bdynamic"

*) I tried setting -static-libstdc++ via LDFLAGS, but libtool links the explicit libraries with the path to the build compiler (not the system library).

*) I tried configuring libcc1 separately: /home/build/toolchain-arm-none-eabi-5.0-5.0.0/gcc/libcc1/configure --prefix=/opt/toolchain-5.0 --build=i686-pc-linux-gnu --host=i686-pc-linux-gnu --target=arm-none-eabi Note that this is an out-of-source-tree build! The result is an error: /home/build/toolchain-arm-none-eabi-5.0-5.0.0/gcc/libcc1/findcomp.cc:20:20: fatal error: config.h: No such file or directory

I would like to at least be able to statically link libstdc++. Ideally the reference to libgcc_s would be gone as well, or at least point to the system library. Right now I don't see a way to do that.

Kind Regards,
Norbert Lange

(I sent one mail as HTML; apologies if this message appears twice)
Function outlining and partial Inlining
Hello All:

Large functions are an important part of high-performance applications, and they contribute to performance bottlenecks in many respects. Some large hot functions are frequently executed, but many regions inside them are cold. A large function blocks inlining because of code-size constraints. Such cold regions inside hot, large functions can be extracted out: this is function outlining. Breaking large functions into smaller function segments allows the remaining hot part to be inlined at the call site, or at least helps partial inlining.

The LLVM compiler has functionality and optimizations for function outlining based on regions such as basic blocks, superblocks and hyperblocks, which are extracted into smaller function segments, thus enabling partial inlining and inlining at the call site. This optimization is a good candidate for profile-guided optimization and relies on the profile feedback data seen by the compiler; without profile information, this kind of function outlining is not very useful.

We already do a lot of optimization around polymorphism, and also indirect-call promotion based on profile feedback over the call-graph profile. Are we doing a function-outlining optimization in GCC, with respect to inlining and partial inlining based on profile feedback data? If it is already implemented in GCC, can I get a pointer to the code and the scope of this function-outlining optimization? If it is not implemented, can I propose adding a function-outlining optimization to GCC?

Thoughts please?

Thanks & Regards
Ajit
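To make the transformation concrete, here is a hand-written C illustration of the idea (not GCC or LLVM output): a hot function whose cold error-handling region is extracted into a separate helper, leaving a small hot part that can be inlined or partially inlined. report_error() and dump_state() are hypothetical.

  extern void report_error (int *buf, int n, int sum);   /* hypothetical */
  extern void dump_state (int *buf, int n);              /* hypothetical */

  /* Original: a hot loop plus a rarely executed region whose size can
     block inlining of the whole function.  */
  int
  process (int *buf, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      sum += buf[i];                       /* hot */
    if (__builtin_expect (sum < 0, 0))
      {                                    /* cold */
        report_error (buf, n, sum);
        dump_state (buf, n);
        sum = 0;
      }
    return sum;
  }

  /* After outlining, the cold region lives in its own (cold, noinline)
     helper and the remaining hot part is small enough to inline.  */
  static void __attribute__ ((noinline, cold))
  process_cold_part (int *buf, int n, int sum)
  {
    report_error (buf, n, sum);
    dump_state (buf, n);
  }

  static inline int
  process_split (int *buf, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      sum += buf[i];
    if (__builtin_expect (sum < 0, 0))
      {
        process_cold_part (buf, n, sum);
        sum = 0;
      }
    return sum;
  }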
Re: Postpone expanding va_arg until pass_stdarg
Hi, On Wed, 11 Feb 2015, Tom de Vries wrote: > > My idea was to not generate temporaries and hence copies for > > non-scalar types, but rather construct the "result" of va_arg directly > > into the original LHS (that would then also trivially solve the > > problem of nno-copyable types). > > The copy mentioned here is of ap, not of the result of va_arg. Whoops, I misread, yes. Thanks. > > > I'm not really sure yet why std_gimplify_va_arg_expr has a part > > > commented out. Michael, can you comment? > > > > I think I did that because of SSA form. The old sequence calculated > > > >vatmp = valist; > >vatmp = vatmp + boundary-1 > >vatmp = vatmp & -boundary > > > > (where the local variable in that function 'valist_tmp' is the tree > > VAR_DECL 'vatmp') and then continue to use valist_tmp. When in SSA form > > the gimplifier will rewrite this into: > > > >vatmp_1 = valist; > >vatmp_2 = vatmp_1 + boundary-1 > >vatmp_3 = vatmp_2 & -boundary > > > > but the local valist_tmp variable will continue to be the VAR_DECL, not > > the vatmp_3 ssa name. Basically whenever one gimplifies a MODIFY_EXPR > > while in SSA form it's suspicious. So the new code simply build the > > expression: > > > >((valist + bound-1) & -bound) > > > > gimplifies that into an rvalue (most probably an SSA name) and uses that > > to go on generating code by making valist_tmp be that returned rvalue. > > > > I think you'll find that removing that code will make the SSA verifier > > scream or generate invalid code with -m32 when that hook is used. > > > > Thanks for the detailed explanation. I'm not sure I understand the > problem well enough, so I'll try to trigger it and investigate. Actually the above fails to mention what the real problem is :-) The problem is that the local variable valist_tmp will be used to generate further code after the above expression is generated. Without my patch it will continue to point to the VAR_DECL, not to the SSA name that actually holds the computed value in the generated code. Ciao, Michael.
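For reference, the alignment computation being discussed, ((valist + bound-1) & -bound), is the usual round-up-to-a-power-of-two idiom; written as plain C it is simply:

  #include <stdint.h>

  /* Round 'valist' up to the next multiple of 'boundary' (a power of two);
     this is the single expression the new code gimplifies into an rvalue.  */
  static inline uintptr_t
  align_up (uintptr_t valist, uintptr_t boundary)
  {
    return (valist + boundary - 1) & -boundary;
  }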
unaligned memory access for vectorization
Hello All:

Unaligned array accesses are a blocking factor for vectorization, because unaligned loads and stores are costly operations for SIMD instructions. To enable vectorization in the presence of unaligned array accesses, loop peeling is combined with loop multiversioning: one version of the loop handles the iterations with unaligned accesses and is left non-vectorized, while the other version is vectorized for aligned accesses. With loop multiversioning it is possible to avoid generating unaligned moves.

Can I know the scope of this optimization and get a pointer to the code in GCC where it is implemented? If it is not implemented, it would be good to have.

Thoughts please?

Thanks & Regards
Ajit
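A hand-written C illustration of the idea (not the code GCC emits): version the loop on the runtime alignment of the pointers, so that only the aligned version needs to be vectorized and no unaligned SIMD moves are generated; a peeling prologue that runs scalar iterations until the store pointer reaches the alignment boundary achieves a similar effect.

  #include <stdint.h>

  void
  scale (float *restrict a, const float *restrict b, int n)
  {
    /* Loop versioning on alignment: take the vectorizable version only
       when both pointers are 16-byte aligned.  */
    if ((((uintptr_t) a | (uintptr_t) b) & 15) == 0)
      {
        for (int i = 0; i < n; i++)     /* aligned: vectorized */
          a[i] = 2.0f * b[i];
      }
    else
      {
        for (int i = 0; i < n; i++)     /* unaligned: scalar fallback */
          a[i] = 2.0f * b[i];
      }
  }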
Failure to dlopen libgomp due to static TLS data
Hello,

we're running into a problem related to the use of initial-exec access to TLS variables in dynamically-loaded libraries. Now, in general, this is actually not supported. However, there seems to be an "unofficial" extension that allows selected system libraries to use small amounts of static TLS space, so that critical variables can be defined to use the initial-exec model even in dynamically-loaded libraries.

One example of a system library that does this is libgomp, the OpenMP support library provided with GCC. Here's an email thread from the gcc mailing lists debating the use of the initial-exec model:

[gomp] Avoid -Wl,-z,nodlopen (PR libgomp/28482)
https://gcc.gnu.org/ml/gcc-patches/2007-05/msg00097.html

The idea why this is supposed to work is that glibc/ld.so will always allocate a small amount of surplus static TLS data space at startup. As long as the total amount of initial-exec TLS variables defined in dynamically-loaded libraries fits into that extra space, everything is supposed to work out fine. This could be ensured by allowing only certain defined system libraries to use this extension.

However, in fact there is a *second* restriction, which may cause loading a library requiring static TLS to fail, *even if* there still is enough surplus space. This is due to the following check in dl-open.c:dl_open_worker:

  /* For static TLS we have to allocate the memory here and
     now.  This includes allocating memory in the DTV.  But we
     cannot change any DTV other than our own.  So, if we
     cannot guarantee that there is room in the DTV we don't
     even try it and fail the load.

     XXX We could track the minimum DTV slots allocated in
     all threads.  */
  if (! RTLD_SINGLE_THREAD_P && imap->l_tls_modid > DTV_SURPLUS)
    _dl_signal_error (0, "dlopen", NULL, N_("\
cannot load any more object with static TLS"));

This is a seriously problematic condition for the use case described above. There is no reasonable way a system library can ensure that, when it is loaded via dlopen, it gets assigned a module ID not larger than DTV_SURPLUS (which currently equals 14).

Specifically, we've had a bug report from a major ISV that one of their large applications fails to load a plugin via dlopen with the above error message, which turned out to be because:
- the plugin uses OpenMP and is thus implicitly linked against libgomp
- the main application does not use libgomp, so it gets loaded at dlopen
- at this point, some 150 libraries are already in use
- many of those libraries define (regular!) TLS variables

Therefore, the TLS module ID of the (indirectly loaded) libgomp ends up being larger than 14, and the dlopen fails. It doesn't seem to be the case that the ISV is doing anything "wrong" here; the problem is caused solely by the interaction of glibc and libgomp.

It seems to me that something ought to be fixed here. Either the use of initial-exec variables simply isn't reliably supportable, but then not even system libraries like libgomp should use it. Or else, glibc *wants* to support that use case, but then it should do so in a way that reliably works as long as system libraries adhere to conditions that are in their power to implement.

Thinking along the latter lines, it seems the dl_open_worker check may be overly conservative:

  For static TLS we have to allocate the memory here and
  now.  This includes allocating memory in the DTV.

It is not obvious to me that this second sentence is actually true.

It *is* true that *given the current implementation*, we would fail if the DTV were not allocated. This is because init_one_static_tls (in nptl/allocatestack.c) does:

  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
  dtv[map->l_tls_modid].pointer.val = dest;
  dtv[map->l_tls_modid].pointer.is_static = true;

which would simply crash if the DTV were not allocated.

However, I'm not sure why we have to do that at this point. Variables accessed via the initial-exec model do not actually use the DTV, since the linker resolves the offsets in the static TLS block directly as offsets relative to the thread pointer, without using the DTV.

Of course, if such a variable were to be *also* accessed via a normal general-dynamic (or local-dynamic) access, *then* we'd need the DTV. But at this point, the __tls_get_addr routine would get involved, which would have the chance to set up the DTV entry on the fly, and (re-)allocate DTV space as needed. It's just that the current implementation of __tls_get_addr implicitly assumes it is never called for static TLS modules, and would (wrongly) also allocate the TLS data area.

If __tls_get_addr were changed to also work on static TLS modules (i.e. only allocate the DTV and have it point to the pre-allocated static TLS data area in such cases), then we wouldn't have to init the DTV in init_one_static_tls, and then we could do wi
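A minimal reproducer sketch for the scenario above (file names and build lines are hypothetical; whether the dlopen actually fails depends on how many TLS-using libraries are already loaded): a plugin that forces the initial-exec model for one of its TLS variables, loaded with dlopen() from a main program.

  /* plugin.c -- build with: gcc -shared -fPIC plugin.c -o plugin.so  */
  static __thread int counter __attribute__ ((tls_model ("initial-exec")));

  int
  plugin_bump (void)
  {
    return ++counter;      /* IE access: fixed offset from the thread pointer */
  }

  /* main.c -- build with: gcc main.c -ldl
     With enough TLS-using libraries loaded first, dlopen() can fail with
     "cannot load any more object with static TLS".  */
  #include <dlfcn.h>
  #include <stdio.h>

  int
  main (void)
  {
    void *h = dlopen ("./plugin.so", RTLD_NOW);
    if (!h)
      fprintf (stderr, "dlopen failed: %s\n", dlerror ());
    return h ? 0 : 1;
  }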
Re: Failure to dlopen libgomp due to static TLS data
On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote: > we're running into a problem related to use of initial-exec access to > TLS variables in dynamically-loaded libraries. Now, in general, this > is actually not supported. However, there seems to an "inofficial" > extension that allows selected system libraries to use small amounts > of static TLS space to allow critical variables to be defined to use > the initial-exec model even in dynamically-loaded libraries. You can always LD_PRELOAD libgomp or link the main app with it if you need it. Otherwise, sure, there is no guarantee it will work, but usually it does, and the performance difference is significant enough to make it worthwhile. Making libgomp -Wl,-z,nodlopen would just make it a problem for everyone, even when it works fine for most people. And, the restriction you are mentioning is there only if !RTLD_SINGLE_THREAD_P, so you can also avoid it by dlopening libgomp before you spawn the first threads rather than after that. Jakub
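A sketch of the second workaround (assuming the usual Linux soname libgomp.so.1): load libgomp eagerly from the main application before any threads are created, so the dlopen-time static-TLS/DTV restriction never applies. The LD_PRELOAD alternative needs no code at all: LD_PRELOAD=libgomp.so.1 ./app.

  #include <dlfcn.h>

  /* Runs before main() and thus before the application creates threads;
     RTLD_GLOBAL lets the plugin's implicit libgomp dependency resolve to
     this early copy.  Failure is simply ignored on systems without libgomp.
     Build with -ldl.  */
  __attribute__ ((constructor))
  static void
  load_libgomp_early (void)
  {
    (void) dlopen ("libgomp.so.1", RTLD_NOW | RTLD_GLOBAL);
  }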
Re: Failure to dlopen libgomp due to static TLS data
On Thu, Feb 12, 2015 at 3:18 PM, Ulrich Weigand wrote: > Hello, > > we're running into a problem related to use of initial-exec access to > TLS variables in dynamically-loaded libraries. Now, in general, this > is actually not supported. However, there seems to an "inofficial" > extension that allows selected system libraries to use small amounts > of static TLS space to allow critical variables to be defined to use > the initial-exec model even in dynamically-loaded libraries. This sounds v. similar to the discussion here. https://sourceware.org/ml/libc-alpha/2014-10/msg00134.html though my brain is too frazzled today to remember what the conclusion was. regards Ramana > > One example of a system library that does this is libgomp, the OpenMP > support library provided with GCC. Here's an email thread from the > gcc mailing lists debating the use of the initial-exec model: > > [gomp] Avoid -Wl,-z,nodlopen (PR libgomp/28482) > https://gcc.gnu.org/ml/gcc-patches/2007-05/msg00097.html > > The idea why this is supposed to work is that glibc/ld.so will always > allocate a small amount of surplus static TLS data space at startup. > As long as the total amount of initial-exec TLS variables defined in > dynamically-loaded libraries fits into that extra space, everything > is supposed to work out fine. This could be ensured by allowing > only certain defined system libraries to use this extension. > > However, in fact there is a *second* restriction, which may cause > loading a library requiring static TLS to fail, *even if* there > still is enough surplus space. This is due to the following check > in dl-open.c:dl_open_worker: > > /* For static TLS we have to allocate the memory here and > now. This includes allocating memory in the DTV. But we > cannot change any DTV other than our own. So, if we > cannot guarantee that there is room in the DTV we don't > even try it and fail the load. > > XXX We could track the minimum DTV slots allocated in > all threads. */ > if (! RTLD_SINGLE_THREAD_P && imap->l_tls_modid > DTV_SURPLUS) > _dl_signal_error (0, "dlopen", NULL, N_("\ > cannot load any more object with static TLS")); > > This is a seriously problematic condition for the use case described > above. There is no reasonable way a system library can ensure that, > when it is loaded via dlopen, it gets assigned a module ID not larger > than DTV_SURPLUS (which currently equals 14). > > Specifically, we've had a bug report from a major ISV that one of > their large applications fails to load a plugin via dlopen with > the above error message, which turned out to be because: > - the plugin uses OpenMP and is thus implicitly linked against libgomp > - the main application does not use libgomp, so it gets loaded at dlopen > - at this point, some 150 libraries are already in use > - many of those libraries define (regular!) TLS variables > > Therefore, the TLS module ID of the (indirectly loaded) libgomp ends > up being larger than 14, and the dlopen fails. It doesn't seem to be > the case that the ISV is doing anything "wrong" here; the problem is > caused solely by the interaction of glibc and libgomp. > > It seems to me that something ought to be fixed here. Either the use > of initial-exec variables simply isn't reliably supportable, but then > not even system libraries like libgomp should use it. Or else, glibc > *wants* to support that use case, but then it should do so in a way > that reliably works as long as system libraries adhere to conditions > that are in their power to implement. 
> > Thinking along the latter lines, it seems the dl_open_worker check > may be overly conservative: > > For static TLS we have to allocate the memory here and > now. This includes allocating memory in the DTV. > > It is not obvious to me that this second sentence is actually true. > > It *is* true that *given the current implementation*, we would fail > if the DTV were not allocated. This is because init_one_static_tls > (in nptl/allocatestack.c) does: > > /* Fill in the DTV slot so that a later LD/GD access will find it. */ > dtv[map->l_tls_modid].pointer.val = dest; > dtv[map->l_tls_modid].pointer.is_static = true; > > which would simply crash if the DTV were not allocated. > > However, I'm not sure why we have to do that at this point. Variables > accessed via the initial-exec model do not actually use the DTV, since > the linker resolves the offsets in the static TLS block directly as > offsets relative to the thread pointer, without using the DTV. > > Of course, if such a variable were to be *also* accessed via a normal > general-dynamic (or local-dynamic) access, *then* we'd need the DTV. > But at this point, the __tls_get_addr routine would get involved, > which would have the chance to set up the DTV entry on the fly, and > (re-)allocate DTV space as needed. It's
Re: Failure to dlopen libgomp due to static TLS data
There's a pending patch for glibc that addresses this issue among others: https://sourceware.org/ml/libc-alpha/2014-11/msg00469.html ([BZ#17090/17620/17621]: fix DTV race, assert, and DTV_SURPLUS Static TLS limit) Alexander
Re: GCC 5.0 and OpenMP 4.0 accelerator : Adapteva/Parallella board
Hi, On Wed, Feb 11, 2015 at 21:33:47 -0800, Nicholas Yue wrote: > I would like to find out if this is the correct forum to > ask/discuss about GCC 5's OpenMP 4.0 implementation, in particular > the new accelerator feature which from what I understand, allows the > compute to be offloaded to external GPU/accelerator. > > I have a Parallella board (ARM dual core) which has an Adapteva > chip (16 cores) and I would like to build a GCC 5 version for it. > > I recall that the Adapteva is a supported CPU with GCC. Currently offloading to Epiphany targets is not supported by GCC. To support it, one needs to implement at least 2 things: 1. mkoffload tool, like gcc/config/i386/intelmic-mkoffload.c or gcc/config/nvptx/mkoffload.c 2. libgomp plugin, like liboffloadmic/plugin/libgomp-plugin-intelmic.cpp or libgomp/plugin/plugin-nvptx.c -- Ilya
Re: Failure to dlopen libgomp due to static TLS data
Alexander Monakov wrote: > > There's a pending patch for glibc that addresses this issue among others: > https://sourceware.org/ml/libc-alpha/2014-11/msg00469.html > > ([BZ#17090/17620/17621]: fix DTV race, assert, and DTV_SURPLUS Static TLS > limit) Ah, indeed, that would fix the issue! Thanks for pointing this out. I see that the latest revision: https://sourceware.org/ml/libc-alpha/2014-11/msg00590.html has been pinged a couple of times already, so let me add another ping :-) Bye, Ulrich -- Dr. Ulrich Weigand GNU/Linux compilers and toolchain ulrich.weig...@de.ibm.com
Re: GCC 5.0 and OpenMP 4.0 accelerator : Adapteva/Parallella board
On Thu, Feb 12, 2015 at 06:42:17PM +0300, Ilya Verbin wrote: > Hi, > > On Wed, Feb 11, 2015 at 21:33:47 -0800, Nicholas Yue wrote: > > I would like to find out if this is the correct forum to > > ask/discuss about GCC 5's OpenMP 4.0 implementation, in particular > > the new accelerator feature which from what I understand, allows the > > compute to be offloaded to external GPU/accelerator. > > > > I have a Parallella board (ARM dual core) which has an Adapteva > > chip (16 cores) and I would like to build a GCC 5 version for it. > > > > I recall that the Adapteva is a supported CPU with GCC. > > Currently offloading to Epiphany targets is not supported by GCC. > > To support it, one needs to implement at least 2 things: > > 1. mkoffload tool, like gcc/config/i386/intelmic-mkoffload.c or > gcc/config/nvptx/mkoffload.c > > 2. libgomp plugin, like liboffloadmic/plugin/libgomp-plugin-intelmic.cpp or > libgomp/plugin/plugin-nvptx.c And likely 3. port libgomp to the epiphany which supposedly doesn't have pthread support, but some other way to spawn threads (this is similar to nvptx). Jakub
Re: Failure to dlopen libgomp due to static TLS data
On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote: > Hello, > > we're running into a problem related to use of initial-exec access to > TLS variables in dynamically-loaded libraries. Now, in general, this > is actually not supported. However, there seems to an "inofficial" > extension that allows selected system libraries to use small amounts > of static TLS space to allow critical variables to be defined to use > the initial-exec model even in dynamically-loaded libraries. This usage is supposed to be deprecated. Why isn't libgomp using TLSDESC/gnu2 model? Rich
Re: Failure to dlopen libgomp due to static TLS data
On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote: > On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote: > > Hello, > > > > we're running into a problem related to use of initial-exec access to > > TLS variables in dynamically-loaded libraries. Now, in general, this > > is actually not supported. However, there seems to an "inofficial" > > extension that allows selected system libraries to use small amounts > > of static TLS space to allow critical variables to be defined to use > > the initial-exec model even in dynamically-loaded libraries. > > This usage is supposed to be deprecated. Why isn't libgomp using > TLSDESC/gnu2 model? Because it is significantly slower. Jakub
Re: Failure to dlopen libgomp due to static TLS data
On Thu, Feb 12, 2015 at 05:11:45PM +0100, Jakub Jelinek wrote: > On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote: > > On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote: > > > Hello, > > > > > > we're running into a problem related to use of initial-exec access to > > > TLS variables in dynamically-loaded libraries. Now, in general, this > > > is actually not supported. However, there seems to an "inofficial" > > > extension that allows selected system libraries to use small amounts > > > of static TLS space to allow critical variables to be defined to use > > > the initial-exec model even in dynamically-loaded libraries. > > > > This usage is supposed to be deprecated. Why isn't libgomp using > > TLSDESC/gnu2 model? > > Because it is significantly slower. Seems very unlikely. If storage is allocated in static TLS, TLSDESC is almost indistinguishable from IE in performance, even when you run artificial benchmarks that do nothing but hammer TLS access. When it gets allocated in dynamic TLS, it's somewhat slower, but still unlikely to matter for most usage IMO. Do you have actual numbers showing that TLSDESC is too slow for libgomp? Rich
Re: Failure to dlopen libgomp due to static TLS data
On Thu, Feb 12, 2015 at 8:11 AM, Jakub Jelinek wrote: > On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote: >> On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote: >> > Hello, >> > >> > we're running into a problem related to use of initial-exec access to >> > TLS variables in dynamically-loaded libraries. Now, in general, this >> > is actually not supported. However, there seems to an "inofficial" >> > extension that allows selected system libraries to use small amounts >> > of static TLS space to allow critical variables to be defined to use >> > the initial-exec model even in dynamically-loaded libraries. >> >> This usage is supposed to be deprecated. Why isn't libgomp using >> TLSDESC/gnu2 model? > > Because it is significantly slower. And TLSDESC/gnu2 model isn't implemented for x32. There are no tests for TLSDESC/gnu2 model in glibc. I have no idea whether it works in glibc master on x86-32 or x86-64 today. -- H.J.
Re: Function outlining and partial Inlining
> Hello All: > > The large functions are the important part of high performance application. > They contribute to performance bottleneck with many > respect. Some of the large hot functions are frequently executed but many > regions inside the functions are cold regions. The large > Function blocks the function inlining to happen before of the code size > constraints. > > Such cold regions inside the hot large functions can be extracted out and > form the function outlining. Thus breaking the large functions > Into smaller function segments which causes the functions to be inlined at > the caller site or helps in partial inlining. > > LLVM Compiler has the functionality and the optimizations for function > outlining based on regions like basic blocks, superblocks and > Hyperblocks which gets extracted out into smaller function segments and thus > enabling the partial inlining and function inlining to happen > At the caller site. > > This optimization is the good case of profile guided optimizations and based > on the profile feedback data by the Compiler. > Without profile information the above function outlining optimizations will > not be useful. > > We are doing lot of optimization regarding polymorphism and also the indirect > icall promotion based on the profile feedback on the > Callgraph profile. > > Are we doing the function outlining optimization in GCC with respect to > function inline and partial inline based on profile feedback > Data. If not this optimization can be implemented. If already implemented in > GCC Can I know any pointer for such code in GCC and the > Scope of this function outlining optimization. The outlining pass is called ipa-split. The heuristic used is, however, quite simplistic: it looks for a very specific case where you have a small header of a function containing a conditional, and it splits after that. It does use the profile. Any work on improving the heuristics or providing interesting testcases to consider would be welcome. I think the LLVM pass does pretty much the same analysis, minus the profile-feedback considerations. After splitting, LLVM will inline the header into all callers, while GCC leaves this to the decision of the inliner heuristics, which may just merge the function back into one block. The actual outlining logic is contained in tree-inline.c and is also used by OpenMP. Honza > > If not implemented , Can I propose to have the optimization like function > outlining in GCC. > > Thoughts Please? > > Thanks & Regards > Ajit
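A small C illustration of the shape ipa-split looks for, as described above (hand-written code; the ".part.0" name is only indicative of how GCC tends to name split parts): a cheap header containing a conditional fast path, followed by a larger body that can be split out so the header alone is inlined into callers.

  struct cache { int last_key; int last_val; int table[1024]; };

  int
  lookup (struct cache *c, int key)
  {
    /* Small header containing a conditional: the likely fast path.  */
    if (c->last_key == key)
      return c->last_val;

    /* Everything below is a candidate to be split into something like
       lookup.part.0; the header above is then cheap to inline.  */
    int val = 0;
    for (int i = 0; i < 1024; i++)
      if (((c->table[i] ^ key) & 7) == 0)
        val += c->table[i];
    c->last_key = key;
    c->last_val = val;
    return val;
  }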
Re: Failure to dlopen libgomp due to static TLS data
On Thu, Feb 12, 2015 at 08:56:26AM -0800, H.J. Lu wrote: > On Thu, Feb 12, 2015 at 8:11 AM, Jakub Jelinek wrote: > > On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote: > >> On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote: > >> > Hello, > >> > > >> > we're running into a problem related to use of initial-exec access to > >> > TLS variables in dynamically-loaded libraries. Now, in general, this > >> > is actually not supported. However, there seems to an "inofficial" > >> > extension that allows selected system libraries to use small amounts > >> > of static TLS space to allow critical variables to be defined to use > >> > the initial-exec model even in dynamically-loaded libraries. > >> > >> This usage is supposed to be deprecated. Why isn't libgomp using > >> TLSDESC/gnu2 model? > > > > Because it is significantly slower. > > And TLSDESC/gnu2 model isn't implemented for x32. > There are no tests for TLSDESC/gnu2 model in glibc. > I have no ideas if it works in glibc master on x86-32 or > x86-64 today. Then fixing this should be a priority, IMO. Broken libraries using IE model "for performance" are a problem that's not going to go away until TLSDESC gets properly adopted. Rich
Re: Failure to dlopen libgomp due to static TLS data
On 02/12/2015 04:16 PM, Rich Felker wrote: > On Thu, Feb 12, 2015 at 05:11:45PM +0100, Jakub Jelinek wrote: >> On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote: >>> >>> This usage is supposed to be deprecated. Why isn't libgomp using >>> TLSDESC/gnu2 model? >> >> Because it is significantly slower. > > Seems very unlikely. If storage is allocated in static TLS, TLSDESC is > almost indistinguishable from IE in performance, even when you run > artificial benchmarks that do nothing but hammer TLS access. When it > gets allocated in dynamic TLS, it's somewhat slower, but still > unlikely to matter for most usage IMO. The problem I'm seeing is that dynamic TLS is always used even when not necessary, and that hurts Java (which accesses TLS 128k times in the first 500ms or so of execution). According to lxo his patch fixes that. Andrew.
gcc-4.8-20150212 is now available
Snapshot gcc-4.8-20150212 is now available on ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20150212/ and on various mirrors, see http://gcc.gnu.org/mirrors.html for details. This snapshot has been generated from the GCC 4.8 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_8-branch revision 220665 You'll find: gcc-4.8-20150212.tar.bz2 Complete GCC MD5=7cceff112b4dfca602d1264326b37ab5 SHA1=f1963df7da0e82372f8e3c6ec5400f3775e4c05c Diffs from 4.8-20150129 are available in the diffs/ subdirectory. When a particular snapshot is ready for public consumption the LATEST-4.8 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
Re: Postpone expanding va_arg until pass_stdarg
On 12-02-15 14:57, Michael Matz wrote:
> Hi,
>
> On Wed, 11 Feb 2015, Tom de Vries wrote:
>
> > > My idea was to not generate temporaries and hence copies for
> > > non-scalar types, but rather construct the "result" of va_arg directly
> > > into the original LHS (that would then also trivially solve the
> > > problem of nno-copyable types).
> >
> > The copy mentioned here is of ap, not of the result of va_arg.
>
> Whoops, I misread, yes. Thanks.

Hi,

Btw, I'm not happy about the ap copies, but I haven't been able to get rid of them.

> > > > I'm not really sure yet why std_gimplify_va_arg_expr has a part
> > > > commented out. Michael, can you comment?
> > >
> > > I think I did that because of SSA form. The old sequence calculated
> > >
> > >   vatmp = valist;
> > >   vatmp = vatmp + boundary-1
> > >   vatmp = vatmp & -boundary
> > >
> > > (where the local variable in that function 'valist_tmp' is the tree
> > > VAR_DECL 'vatmp') and then continue to use valist_tmp. When in SSA form
> > > the gimplifier will rewrite this into:
> > >
> > >   vatmp_1 = valist;
> > >   vatmp_2 = vatmp_1 + boundary-1
> > >   vatmp_3 = vatmp_2 & -boundary
> > >
> > > but the local valist_tmp variable will continue to be the VAR_DECL, not
> > > the vatmp_3 ssa name. Basically whenever one gimplifies a MODIFY_EXPR
> > > while in SSA form it's suspicious. So the new code simply build the
> > > expression:
> > >
> > >   ((valist + bound-1) & -bound)
> > >
> > > gimplifies that into an rvalue (most probably an SSA name) and uses that
> > > to go on generating code by making valist_tmp be that returned rvalue.
> > >
> > > I think you'll find that removing that code will make the SSA verifier
> > > scream or generate invalid code with -m32 when that hook is used.
> >
> > Thanks for the detailed explanation. I'm not sure I understand the
> > problem well enough, so I'll try to trigger it and investigate.
>
> Actually the above fails to mention what the real problem is :-) The
> problem is that the local variable valist_tmp will be used to generate
> further code after the above expression is generated. Without my patch
> it will continue to point to the VAR_DECL, not to the SSA name that
> actually holds the computed value in the generated code.

I have not been able to reproduce this problem (with a bootstrap build on x86_64 for all languages, and {unix/,unix/-m32} testing), so I've dropped this bit for now.

I've pushed the latest status to vries/expand-va-arg-at-pass-stdarg. -ftree-stdarg-opt (the va_list_gpr/fpr_size optimization) has been re-enabled again. I needed the patch "Always check phi-ops in optimize_va_list_gpr_fpr_size" for that.

With a similar bootstrap and reg-test as described above, there's only one failure left:
...
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg "f15: va_list escapes 0, needs to save [148] GPR units and [1-9][0-9]* FPR units"
...
And this is due to the ap copy, which is classified as escape. [ We're still expanding ifn_va_arg before the va_list_gpr/fpr_size optimization. ]

Thanks,
- Tom
Re: Failure to dlopen libgomp due to static TLS data
On Thu, Feb 12, 2015 at 06:23:12PM +, Andrew Haley wrote: > On 02/12/2015 04:16 PM, Rich Felker wrote: > > On Thu, Feb 12, 2015 at 05:11:45PM +0100, Jakub Jelinek wrote: > >> On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote: > >>> > >>> This usage is supposed to be deprecated. Why isn't libgomp using > >>> TLSDESC/gnu2 model? > >> > >> Because it is significantly slower. > > > > Seems very unlikely. If storage is allocated in static TLS, TLSDESC is > > almost indistinguishable from IE in performance, even when you run > > artificial benchmarks that do nothing but hammer TLS access. When it > > gets allocated in dynamic TLS, it's somewhat slower, but still > > unlikely to matter for most usage IMO. > > The problem I'm seeing is that dynamic TLS is always used even when not > necessary, and that hurts Java (which accesses TLS 128k times in the first > 500ms or so of execution). According to lxo his patch fixes that. Given those numbers, each access would need to be taking 38ns to consume even 1% of the cpu time being spent. I would guess accesses are closer to 5ns for TLSDESC in static area and 10-15ns for dynamic. So I don't think this is a bottleneck. Rich
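Spelling out the arithmetic behind that 38 ns figure (taking "128k" as 131,072 accesses):

  1% of 500 ms            = 5 ms
  5 ms / 131,072 accesses ≈ 38 ns per access

so even a several-fold per-access slowdown would stay well under a percent of the quoted startup time.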
Re: Failure to dlopen libgomp due to static TLS data
On Thu, Feb 12, 2015 at 12:07:24PM -0500, Rich Felker wrote: > On Thu, Feb 12, 2015 at 08:56:26AM -0800, H.J. Lu wrote: > > On Thu, Feb 12, 2015 at 8:11 AM, Jakub Jelinek wrote: > > > On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote: > > >> On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote: > > >> > Hello, > > >> > > > >> > we're running into a problem related to use of initial-exec access to > > >> > TLS variables in dynamically-loaded libraries. Now, in general, this > > >> > is actually not supported. However, there seems to an "inofficial" > > >> > extension that allows selected system libraries to use small amounts > > >> > of static TLS space to allow critical variables to be defined to use > > >> > the initial-exec model even in dynamically-loaded libraries. > > >> > > >> This usage is supposed to be deprecated. Why isn't libgomp using > > >> TLSDESC/gnu2 model? > > > > > > Because it is significantly slower. > > > > And TLSDESC/gnu2 model isn't implemented for x32. > > There are no tests for TLSDESC/gnu2 model in glibc. > > I have no ideas if it works in glibc master on x86-32 or > > x86-64 today. > > Then fixing this should be a priority, IMO. Broken libraries using IE > model "for performance" are a problem that's not going to go away > until TLSDESC gets properly adopted. I posted support for TLSDESC on powerpc back in 2009 (search for powerpc _tls_get_addr call optimization). The patch wasn't reviewed, and I didn't push it because my benchmark tests didn't show much of a gain. Quite possibly I wasn't using the right benchmark. -- Alan Modra Australia Development Lab, IBM
Re: Failure to dlopen libgomp due to static TLS data
On Fri, Feb 13, 2015 at 10:12:11AM +1030, Alan Modra wrote: > On Thu, Feb 12, 2015 at 12:07:24PM -0500, Rich Felker wrote: > > On Thu, Feb 12, 2015 at 08:56:26AM -0800, H.J. Lu wrote: > > > On Thu, Feb 12, 2015 at 8:11 AM, Jakub Jelinek wrote: > > > > On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote: > > > >> On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote: > > > >> > Hello, > > > >> > > > > >> > we're running into a problem related to use of initial-exec access to > > > >> > TLS variables in dynamically-loaded libraries. Now, in general, this > > > >> > is actually not supported. However, there seems to an "inofficial" > > > >> > extension that allows selected system libraries to use small amounts > > > >> > of static TLS space to allow critical variables to be defined to use > > > >> > the initial-exec model even in dynamically-loaded libraries. > > > >> > > > >> This usage is supposed to be deprecated. Why isn't libgomp using > > > >> TLSDESC/gnu2 model? > > > > > > > > Because it is significantly slower. > > > > > > And TLSDESC/gnu2 model isn't implemented for x32. > > > There are no tests for TLSDESC/gnu2 model in glibc. > > > I have no ideas if it works in glibc master on x86-32 or > > > x86-64 today. > > > > Then fixing this should be a priority, IMO. Broken libraries using IE > > model "for performance" are a problem that's not going to go away > > until TLSDESC gets properly adopted. > > I posted support for TLSDESC on powerpc back in 2009 (search for > powerpc _tls_get_addr call optimization). The patch wasn't reviewed, > and I didn't push it because my benchmark tests didn't show a much of > a gain. Quite possibly I wasn't using the right benchmark. Were you measuring static-allocated TLSDESC vs non-TLSDESC GD model? That's the case where there should be a "big" difference, though I'm still somewhat skeptical of the benefits in real-world usage cases. I think Alexandre Oliva had a tool along with the original paper to measure the performance, and I did some simple testing myself a while back I could dig up the source for. Rich
Re: Failure to dlopen libgomp due to static TLS data
On Thu, Feb 12, 2015 at 06:55:30PM -0500, Rich Felker wrote: > On Fri, Feb 13, 2015 at 10:12:11AM +1030, Alan Modra wrote: > > I posted support for TLSDESC on powerpc back in 2009 (search for > > powerpc _tls_get_addr call optimization). The patch wasn't reviewed, > > and I didn't push it because my benchmark tests didn't show a much of > > a gain. Quite possibly I wasn't using the right benchmark. > > Were you measuring static-allocated TLSDESC vs non-TLSDESC GD model? > That's the case where there should be a "big" difference, though I'm > still somewhat skeptical of the benefits in real-world usage cases. I can't remember, sorry, it was too long ago. -- Alan Modra Australia Development Lab, IBM