Unrolling factor heuristics for Loop Unrolling

2015-02-12 Thread Ajit Kumar Agarwal
Hello All:

Loop unrolling without good unrolling factor heuristics becomes a performance
bottleneck. Unrolling factor heuristics based on the minimum initiation
interval (MII) are quite useful for achieving better ILP. The MII, computed
from the recurrence and resource constraints on the data dependency graph,
can be combined with register pressure to drive the unrolling factor
heuristics. For a given schedule, loop unrolling and scheduling are
interdependent when it comes to achieving better ILP; this interplay has been
widely used in the software pipelining literature, along with the more
granular list and trace scheduling.

The recurrence calculation, based on the loop-carried dependencies, and the
resource calculation, based on simultaneous access to resources as recorded
in the reservation table, give good heuristics for calculating the unrolling
factor. Both are already taken care of in the MII calculation.
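
A minimal sketch of the MII calculation referred to above, following the usual
software pipelining definitions: the resource bound is the largest ratio of
uses per iteration to available units, the recurrence bound is the largest
ratio of cycle latency to iteration distance, and the MII is the larger of the
two. The function names and the flat-array inputs are assumptions made for
illustration; they do not correspond to any GCC interface.

```c
#include <assert.h>

/* Ceiling division for a non-negative numerator and positive denominator. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Resource-constrained bound: for each resource class, the number of uses
   per loop iteration divided by the number of units of that resource. */
int res_mii(const int *uses, const int *units, int nres)
{
    int mii = 1;
    for (int i = 0; i < nres; i++) {
        int bound = ceil_div(uses[i], units[i]);
        if (bound > mii)
            mii = bound;
    }
    return mii;
}

/* Recurrence-constrained bound: for each dependence cycle, the total latency
   around the cycle divided by its total iteration distance. */
int rec_mii(const int *latency, const int *distance, int ncycles)
{
    int mii = 1;
    for (int i = 0; i < ncycles; i++) {
        int bound = ceil_div(latency[i], distance[i]);
        if (bound > mii)
            mii = bound;
    }
    return mii;
}

/* The MII is the larger of the two lower bounds. */
int min_initiation_interval(int res, int rec)
{
    return res > rec ? res : rec;
}
```

For example, with two resource classes used 4 and 2 times per iteration on 2
and 1 units, and two dependence cycles of latency 3 and 4 with distances 1 and
2, the resource bound is 2, the recurrence bound is 3, and the MII is 3.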

Along with the MII, register pressure should also be considered when
calculating the unrolling factor heuristics.
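
One way such a heuristic could look is sketched below. It is an assumption
made for illustration, not the proposed implementation. The MII is taken as a
fraction (for example 3/2), so unrolling by u gives an integer interval of
ceil(u * MII) for u original iterations. The sketch picks the smallest u that
minimizes the interval per original iteration and stops as soon as the
estimated register demand, assumed here to grow linearly with u, exceeds the
available registers.

```c
#include <assert.h>

/* Hypothetical heuristic: choose an unroll factor in [1, max_unroll].
   mii_num / mii_den is the fractional MII of one original iteration.
   regs_per_iter is the assumed register demand of one iteration and
   avail_regs the number of allocatable registers. */
int choose_unroll_factor(int mii_num, int mii_den,
                         int regs_per_iter, int avail_regs, int max_unroll)
{
    assert(mii_num > 0 && mii_den > 0);
    assert(regs_per_iter > 0 && max_unroll >= 1);

    int best_u = 1;
    int best_ii = (mii_num + mii_den - 1) / mii_den;    /* II for u = 1 */

    for (int u = 2; u <= max_unroll; u++) {
        /* Register pressure cap: stop once the unrolled body would spill. */
        if (u * regs_per_iter > avail_regs)
            break;
        /* Integer II of the body unrolled u times. */
        int ii = (u * mii_num + mii_den - 1) / mii_den;
        /* Compare ii / u against best_ii / best_u without floating point. */
        if (ii * best_u < best_ii * u) {
            best_ii = ii;
            best_u = u;
        }
    }
    return best_u;
}
```

With an MII of 3/2, 4 registers per iteration and 16 available registers, the
sketch returns 2: the unrolled body needs 3 cycles for 2 iterations instead
of 2 cycles for each one. With only 6 available registers it returns 1.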

This enables better heuristics for the unrolling factor. The main advantage
of these heuristics is that they can be implemented at the code generation
level. Currently, loop unrolling is done well before code generation. My
proposal is to keep the current implementation, in which unrolling is done at
the loop optimizer level, and then to apply the above heuristics to unroll
again at the code generation level, using the accurate register pressure
calculation done in the register allocator. This looks like a feasible
solution, and it is what I am going to propose for the above unrolling
heuristics.

This enables loop unrolling at both the optimizer level and the code
generation level. This two-level loop unrolling is quite useful and will
overcome the shortcomings of loop unrolling at the optimizer level.

The SPEC benchmarks are better candidates for the above heuristics than
MiBench and EEMBC.

Thanks & Regards
Ajit


Re: Unrolling factor heuristics for Loop Unrolling

2015-02-12 Thread Oleg Endo
On Thu, 2015-02-12 at 10:09 +, Ajit Kumar Agarwal wrote:
> Hello All:
> 
> The Loop unrolling without good unrolling factor heuristics becomes the 
> performance bottleneck. The Unrolling factor heuristics based on minimum 
> Initiation interval is quite useful with respect to better ILP.  The minimum 
> Initiation interval based on recurrence and resource calculation on Data 
> Dependency Graph  along with the register pressure can be used to add the 
> unrolling factor heuristics. To achieve better ILP with the given schedule,
> the Loops unrolling and the scheduling are inter dependent and has been 
> widely used in Software Pipelining Literature along with the more granular
> List and Trace Scheduling.
> 
> The recurrence calculation based on the Loop carried dependencies and the 
> resource allocation based on the simultaneous access of the resources 
> Using the reservation table will give good heuristics with respect to 
> calculation of unrolling factor. This has been taken care in the
> MII interval Calculation.
> 
> Along with MII, the register pressure should also be  considered in the 
> calculation of heuristics for unrolling factor.
> 
> This enable better heuristics with respect to unrolling factor. The main 
> advantage of the above heuristics for unrolling factor is that it can be 
> Implemented in the Code generation Level. Currently Loop unrolling is done 
> much before the code generation. Let's go by the current implementation
> Of doing Loop unrolling optimization at the Loop optimizer level and 
> unrolling happens. After the Current unrolling at the optimizer level the 
> above heuristics
> Can be  used to do the unrolling at the Code generation Level with the 
> accurate Register pressure calculation as done in the register allocator and 
> the
> Unrolling is done at the code generation level. This looks feasible solution 
> which I am going to propose for the above unrolling heuristics.
> 
> This enables the Loop unrolling done at the Optimizer Level  +  at the Code 
> Generation Level. This double level of Loop unrolling is quite useful.
> This will overcome the shortcomings of the Loop unrolling at the optimizer 
> level.
> 
> The SPEC benchmarks are the better candidates for the above heuristics 
> instead of Mibench and EEMBC.

Not taking register pressure into account when unrolling (and doing
other optimizations/choices) is an old problem.  See also:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=20969

Cheers,
Oleg



libcc1 dependencies

2015-02-12 Thread la...@chello.at
Hello,

I am trying to build a cross-compiler for ARM, as I have done for years. I am
keen on not having dependencies on libraries, so that the compiler can be
used on multiple systems.
Up until 4.9 the only dependency was libc. I tried using the gcc-5-20150208
snapshot, and now I get dependencies on the build compiler's libstdc++ and
libgcc_s. This is doubly annoying because the build compiler is not the
default compiler of the OS, so those paths might not exist and there is a
chance of mixing versions.

The rest of GCC adheres to the configuration, but for libcc1 this seems to
be lacking. I use this configuration to statically link C++:
--with-host-libstdcxx="-Wl,-Bstatic,`g++ --print-file-name libstdc++.a`,`g++
--print-file-name libsupc++.a`,-Bdynamic"

*) I tried setting -static-libstdc++ via LDFLAGS, but libtool links the
explicit libraries with the path to the build compiler (not the system library)

*) I tried compiling libcc1 separately
/home/build/toolchain-arm-none-eabi-5.0-5.0.0/gcc/libcc1/configure
--prefix=/opt/toolchain-5.0 --build=i686-pc-linux-gnu --host=i686-pc-linux-gnu
--target=arm-none-eabi

Note that this is an out-of-source-tree build!
The result is an error:
/home/build/toolchain-arm-none-eabi-5.0-5.0.0/gcc/libcc1/findcomp.cc:20:20:
fatal error: config.h: No such file or directory

I would like to at least be able to statically link libstdc++. Ideally the
reference to libgcc_s would be gone as well, or at least point to the system
library. Right now I cannot find a way to do that.

Kind Regards,
Norbert Lange

(I sent one mail as HTML; apologies if this message appears twice)


Function outlining and partial Inlining

2015-02-12 Thread Ajit Kumar Agarwal
Hello All:

Large functions are an important part of high-performance applications, and
they contribute to performance bottlenecks in many respects. Some large hot
functions are executed frequently, yet many regions inside them are cold.
A large function also prevents inlining from happening because of the code
size constraints.

Such cold regions inside hot large functions can be extracted into separate
functions (function outlining). This breaks a large function into smaller
function segments, which allows it to be inlined at the call site or helps
with partial inlining.
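
A hand-written illustration of the transformation is sketched below. The
rarely taken branch is moved into a separate function marked with the GCC
attributes noinline and cold, which leaves a small hot function that is a
good inlining candidate. All function names and the counter are hypothetical;
an outlining pass would perform this split automatically from profile data.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how often the cold path runs; present only for illustration. */
int cold_calls;

/* Outlined cold region: kept out of line so it does not add to the size of
   the hot function. */
__attribute__((noinline, cold))
static int handle_negative(int x)
{
    (void) x;
    cold_calls++;
    return 0;               /* clamp negative input to zero */
}

/* Hot remainder: small enough to be inlined into its callers. */
static inline int clamp_nonneg(int x)
{
    if (__builtin_expect(x < 0, 0))
        return handle_negative(x);
    return x;
}

/* Caller with a hot loop; only the cheap test is expected to stay inline. */
long sum_clamped(const int *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += clamp_nonneg(v[i]);
    return sum;
}
```

Summing the array {1, -2, 3} with sum_clamped gives 4 and enters the cold
function once.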

The LLVM compiler has functionality and optimizations for function outlining
based on regions such as basic blocks, superblocks and hyperblocks. These
regions are extracted into smaller function segments, thus enabling partial
inlining and function inlining at the call site.

This optimization is a good case for profile-guided optimization, since it is
based on the profile feedback data available to the compiler. Without profile
information, the above function outlining optimizations will not be useful.

We are already doing a lot of optimization regarding polymorphism, and also
indirect call promotion, based on the profile feedback in the callgraph
profile.

Are we doing the function outlining optimization in GCC, with respect to
function inlining and partial inlining, based on profile feedback data? If it
is already implemented in GCC, can I get a pointer to the code and learn the
scope of this function outlining optimization?

If it is not implemented, can I propose adding a function outlining
optimization to GCC?

Thoughts Please?

Thanks & Regards
Ajit



Re: Postpone expanding va_arg until pass_stdarg

2015-02-12 Thread Michael Matz
Hi,

On Wed, 11 Feb 2015, Tom de Vries wrote:

> > My idea was to not generate temporaries and hence copies for 
> > non-scalar types, but rather construct the "result" of va_arg directly 
> > into the original LHS (that would then also trivially solve the 
> > problem of non-copyable types).
> 
> The copy mentioned here is of ap, not of the result of va_arg.

Whoops, I misread, yes.  Thanks.

> > > I'm not really sure yet why std_gimplify_va_arg_expr has a part
> > > commented out. Michael, can you comment?
> > 
> > I think I did that because of SSA form.  The old sequence calculated
> > 
> >   vatmp = valist;
> >   vatmp = vatmp + boundary-1
> >   vatmp = vatmp & -boundary
> > 
> > (where the local variable in that function 'valist_tmp' is the tree
> > VAR_DECL 'vatmp') and then continue to use valist_tmp.  When in SSA form
> > the gimplifier will rewrite this into:
> > 
> >   vatmp_1 = valist;
> >   vatmp_2 = vatmp_1 + boundary-1
> >   vatmp_3 = vatmp_2 & -boundary
> > 
> > but the local valist_tmp variable will continue to be the VAR_DECL, not
> > the vatmp_3 ssa name.  Basically whenever one gimplifies a MODIFY_EXPR
> > while in SSA form it's suspicious.  So the new code simply builds the
> > expression:
> > 
> >   ((valist + bound-1) & -bound)
> > 
> > gimplifies that into an rvalue (most probably an SSA name) and uses that
> > to go on generating code by making valist_tmp be that returned rvalue.
> > 
> > I think you'll find that removing that code will make the SSA verifier
> > scream or generate invalid code with -m32 when that hook is used.
> > 
> 
> Thanks for the detailed explanation. I'm not sure I understand the 
> problem well enough, so I'll try to trigger it and investigate.

Actually the above fails to mention what the real problem is :-)  The 
problem is that the local variable valist_tmp will be used to generate 
further code after the above expression is generated.  Without my patch it 
will continue to point to the VAR_DECL, not to the SSA name that actually 
holds the computed value in the generated code.
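
The arithmetic of the quoted expression can be checked in isolation. The
sketch below only illustrates the rounding that ((valist + bound-1) & -bound)
performs and says nothing about the gimplifier itself. The helper name
align_up is an assumption, and the boundary is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of boundary.  For an unsigned power of
   two, -boundary has all bits set from the boundary bit upwards, so the mask
   clears the low bits after the addition has stepped past them. */
uintptr_t align_up(uintptr_t addr, uintptr_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (addr + boundary - 1) & -boundary;
}
```

For instance, an address of 13 with a boundary of 8 is rounded up to 16, and
an address that is already aligned is returned unchanged.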


Ciao,
Michael.


unaligned memory access for vectorization

2015-02-12 Thread Ajit Kumar Agarwal
Hello All:

Unaligned array accesses are a blocking factor in vectorization, because
unaligned loads and stores are costly operations for SIMD instructions.

To enable vectorization in the presence of unaligned array accesses, loop
peeling is used to create multiple versions of the loop: a non-vectorized
loop for the iterations that access unaligned elements, and a loop that can
be vectorized with aligned accesses. With this loop multiversioning, no
unaligned moves need to be generated.
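
A hand-written sketch of peeling for alignment follows. Scalar iterations run
until the address reaches an assumed 16-byte vector boundary, then a main loop
processes whole vectors' worth of elements, and a scalar epilogue handles the
remainder. The vector width and the function name are assumptions. The main
loop is written as plain C that a vectorizer could turn into aligned SIMD
operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16u       /* assumed SIMD vector width in bytes */

void add_one(float *a, size_t n)
{
    size_t i = 0;

    /* Peeled prologue: scalar iterations until &a[i] is vector aligned. */
    while (i < n && ((uintptr_t) &a[i] % VEC_BYTES) != 0) {
        a[i] += 1.0f;
        i++;
    }

    /* Main loop: every chunk starts at an aligned address, so aligned
       vector loads and stores can be used. */
    const size_t lanes = VEC_BYTES / sizeof(float);
    for (; i + lanes <= n; i += lanes)
        for (size_t j = 0; j < lanes; j++)
            a[i + j] += 1.0f;

    /* Epilogue: leftover elements that do not fill a whole vector. */
    for (; i < n; i++)
        a[i] += 1.0f;
}
```

Calling add_one on a pointer that is offset by one element from the start of
a buffer exercises the prologue, and every element in the range is still
incremented exactly once.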


Can I know the scope of the above optimization and get a pointer to the code
in GCC where this optimization is implemented? If it is not implemented, it
would be good to have this optimization.

Thoughts Please?

Thanks & Regards
Ajit 




Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Ulrich Weigand
Hello,

we're running into a problem related to use of initial-exec access to
TLS variables in dynamically-loaded libraries.  Now, in general, this
is actually not supported.  However, there seems to be an "unofficial"
extension that allows selected system libraries to use small amounts
of static TLS space to allow critical variables to be defined to use
the initial-exec model even in dynamically-loaded libraries.

One example of a system library that does this is libgomp, the OpenMP
support library provided with GCC.  Here's an email thread from the
gcc mailing lists debating the use of the initial-exec model:

[gomp] Avoid -Wl,-z,nodlopen (PR libgomp/28482)
https://gcc.gnu.org/ml/gcc-patches/2007-05/msg00097.html

The reason this is supposed to work is that glibc/ld.so will always
allocate a small amount of surplus static TLS data space at startup.
As long as the total amount of initial-exec TLS variables defined in
dynamically-loaded libraries fits into that extra space, everything
is supposed to work out fine.  This could be ensured by allowing
only certain defined system libraries to use this extension.

However, in fact there is a *second* restriction, which may cause
loading a library requiring static TLS to fail, *even if* there
still is enough surplus space.  This is due to the following check
in dl-open.c:dl_open_worker:

  /* For static TLS we have to allocate the memory here and
     now.  This includes allocating memory in the DTV.  But we
     cannot change any DTV other than our own.  So, if we
     cannot guarantee that there is room in the DTV we don't
     even try it and fail the load.

     XXX We could track the minimum DTV slots allocated in
     all threads.  */
  if (! RTLD_SINGLE_THREAD_P && imap->l_tls_modid > DTV_SURPLUS)
    _dl_signal_error (0, "dlopen", NULL, N_("\
cannot load any more object with static TLS"));

This is a seriously problematic condition for the use case described
above.  There is no reasonable way a system library can ensure that,
when it is loaded via dlopen, it gets assigned a module ID not larger
than DTV_SURPLUS (which currently equals 14).

Specifically, we've had a bug report from a major ISV that one of
their large applications fails to load a plugin via dlopen with
the above error message, which turned out to be because:
- the plugin uses OpenMP and is thus implicitly linked against libgomp
- the main application does not use libgomp, so it gets loaded at dlopen
- at this point, some 150 libraries are already in use
- many of those libraries define (regular!) TLS variables

Therefore, the TLS module ID of the (indirectly loaded) libgomp ends
up being larger than 14, and the dlopen fails.  It doesn't seem to be
the case that the ISV is doing anything "wrong" here; the problem is
caused solely by the interaction of glibc and libgomp.
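
The failure mode can be modelled in a few lines. This is a toy model of the
condition quoted above, not glibc code. The helper names are hypothetical,
and the assumption that module IDs are handed out sequentially in load order,
starting at 1, is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

#define DTV_SURPLUS 14u     /* value cited in the message */

/* Models the dl_open_worker check: a static-TLS object may be loaded if the
   process is still single threaded or its module ID fits in the surplus. */
bool static_tls_load_allowed(bool single_threaded, unsigned modid)
{
    return single_threaded || modid <= DTV_SURPLUS;
}

/* Simplified module ID assignment: one ID per object that defines TLS,
   allocated in load order. */
unsigned next_tls_modid(unsigned tls_modules_already_loaded)
{
    return tls_modules_already_loaded + 1;
}
```

Under this model a multi-threaded process that has already loaded 40 objects
with TLS cannot dlopen a static-TLS library, whatever the remaining static
TLS space, while the same load succeeds before any thread has been created.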

It seems to me that something ought to be fixed here.  Either the use
of initial-exec variables simply isn't reliably supportable, but then
not even system libraries like libgomp should use it.  Or else, glibc
*wants* to support that use case, but then it should do so in a way
that reliably works as long as system libraries adhere to conditions
that are in their power to implement.

Thinking along the latter lines, it seems the dl_open_worker check
may be overly conservative:

For static TLS we have to allocate the memory here and
now.  This includes allocating memory in the DTV.

It is not obvious to me that this second sentence is actually true.

It *is* true that *given the current implementation*, we would fail
if the DTV were not allocated.  This is because init_one_static_tls
(in nptl/allocatestack.c) does:

  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
  dtv[map->l_tls_modid].pointer.val = dest;
  dtv[map->l_tls_modid].pointer.is_static = true;

which would simply crash if the DTV were not allocated.

However, I'm not sure why we have to do that at this point.  Variables
accessed via the initial-exec model do not actually use the DTV, since
the linker resolves the offsets in the static TLS block directly as
offsets relative to the thread pointer, without using the DTV.
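
The two access models can be requested per variable with GCC's tls_model
attribute, as in the sketch below. The variable and function names are
hypothetical. In position-independent code the first variable is addressed at
a fixed offset from the thread pointer, while the second goes through
__tls_get_addr and the DTV; when the code is linked into an executable, the
linker may relax both to the cheaper form.

```c
#include <assert.h>

/* Initial-exec: the offset in the static TLS block is resolved at load
   time, and accesses do not consult the DTV. */
__thread int ie_counter __attribute__((tls_model("initial-exec")));

/* Global-dynamic: in a shared object, accesses call __tls_get_addr, which
   looks the module up in the DTV. */
__thread int gd_counter __attribute__((tls_model("global-dynamic")));

int bump_ie(void)
{
    return ++ie_counter;
}

int bump_gd(void)
{
    return ++gd_counter;
}
```

Comparing the assembly generated for bump_ie and bump_gd with -fPIC shows the
difference between the two access sequences.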

Of course, if such a variable were to be *also* accessed via a normal
general-dynamic (or local-dynamic) access, *then* we'd need the DTV.
But at this point, the __tls_get_addr routine would get involved,
which would have the chance to set up the DTV entry on the fly, and
(re-)allocate DTV space as needed.  It's just that the current
implementation of __tls_get_addr implicitly assumes it is never
called for static TLS modules, and would (wrongly) also allocate the
TLS data area.

If __tls_get_addr were changed to also work on static TLS modules
(i.e. only allocate the DTV and have it point to the pre-allocated
static TLS data area in such cases), then we wouldn't have to init
the DTV in init_one_static_tls, and then we could do wi

Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Jakub Jelinek
On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote:
> we're running into a problem related to use of initial-exec access to
> TLS variables in dynamically-loaded libraries.  Now, in general, this
> is actually not supported.  However, there seems to an "inofficial"
> extension that allows selected system libraries to use small amounts
> of static TLS space to allow critical variables to be defined to use
> the initial-exec model even in dynamically-loaded libraries.

You can always LD_PRELOAD libgomp or link the main app with it if you need
it.  Otherwise, sure, there is no guarantee it will work, but usually it
does, and the performance difference is significant enough to make it
worthwhile.  Making libgomp -Wl,-z,nodlopen would just make it a problem for
everyone, even when it works fine for most people.
And, the restriction you are mentioning is there only if
!RTLD_SINGLE_THREAD_P, so you can also avoid it by dlopening libgomp before
you spawn the first threads rather than after that.

Jakub


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Ramana Radhakrishnan
On Thu, Feb 12, 2015 at 3:18 PM, Ulrich Weigand  wrote:
> Hello,
>
> we're running into a problem related to use of initial-exec access to
> TLS variables in dynamically-loaded libraries.  Now, in general, this
> is actually not supported.  However, there seems to an "inofficial"
> extension that allows selected system libraries to use small amounts
> of static TLS space to allow critical variables to be defined to use
> the initial-exec model even in dynamically-loaded libraries.


This sounds very similar to the discussion here.

https://sourceware.org/ml/libc-alpha/2014-10/msg00134.html

though my brain is too frazzled today to remember what the conclusion was.

regards
Ramana



Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Alexander Monakov
There's a pending patch for glibc that addresses this issue among others:
https://sourceware.org/ml/libc-alpha/2014-11/msg00469.html

([BZ#17090/17620/17621]: fix DTV race, assert, and DTV_SURPLUS Static TLS
limit)

Alexander


Re: GCC 5.0 and OpenMP 4.0 accelerator : Adapteva/Parallella board

2015-02-12 Thread Ilya Verbin
Hi,

On Wed, Feb 11, 2015 at 21:33:47 -0800, Nicholas Yue wrote:
> I would like to find out if this is the correct forum to
> ask/discuss about GCC 5's OpenMP 4.0 implementation, in particular
> the new accelerator feature which from what I understand, allows the
> compute to be offloaded to external GPU/accelerator.
> 
> I have a Parallella board (ARM dual core) which has an Adapteva
> chip (16 cores) and I would like to build a GCC 5 version for it.
> 
> I recall that the Adapteva is a supported CPU with GCC.

Currently offloading to Epiphany targets is not supported by GCC.

To support it, one needs to implement at least 2 things:

1. mkoffload tool, like gcc/config/i386/intelmic-mkoffload.c or
gcc/config/nvptx/mkoffload.c

2. libgomp plugin, like liboffloadmic/plugin/libgomp-plugin-intelmic.cpp or
libgomp/plugin/plugin-nvptx.c

  -- Ilya


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Ulrich Weigand
Alexander Monakov wrote:
> 
> There's a pending patch for glibc that addresses this issue among others:
> https://sourceware.org/ml/libc-alpha/2014-11/msg00469.html
> 
> ([BZ#17090/17620/17621]: fix DTV race, assert, and DTV_SURPLUS Static TLS
> limit)

Ah, indeed, that would fix the issue!  Thanks for pointing this out.

I see that the latest revision:
https://sourceware.org/ml/libc-alpha/2014-11/msg00590.html
has been pinged a couple of times already, so let me add another ping :-)

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  ulrich.weig...@de.ibm.com



Re: GCC 5.0 and OpenMP 4.0 accelerator : Adapteva/Parallella board

2015-02-12 Thread Jakub Jelinek
On Thu, Feb 12, 2015 at 06:42:17PM +0300, Ilya Verbin wrote:
> Hi,
> 
> On Wed, Feb 11, 2015 at 21:33:47 -0800, Nicholas Yue wrote:
> > I would like to find out if this is the correct forum to
> > ask/discuss about GCC 5's OpenMP 4.0 implementation, in particular
> > the new accelerator feature which from what I understand, allows the
> > compute to be offloaded to external GPU/accelerator.
> > 
> > I have a Parallella board (ARM dual core) which has an Adapteva
> > chip (16 cores) and I would like to build a GCC 5 version for it.
> > 
> > I recall that the Adapteva is a supported CPU with GCC.
> 
> Currently offloading to Epiphany targets is not supported by GCC.
> 
> To support it, one needs to implement at least 2 things:
> 
> 1. mkoffload tool, like gcc/config/i386/intelmic-mkoffload.c or
> gcc/config/nvptx/mkoffload.c
> 
> 2. libgomp plugin, like liboffloadmic/plugin/libgomp-plugin-intelmic.cpp or
> libgomp/plugin/plugin-nvptx.c

And likely
3. port libgomp to the epiphany which supposedly doesn't have pthread
support, but some other way to spawn threads (this is similar to nvptx).

Jakub


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Rich Felker
On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote:
> Hello,
> 
> we're running into a problem related to use of initial-exec access to
> TLS variables in dynamically-loaded libraries.  Now, in general, this
> is actually not supported.  However, there seems to an "inofficial"
> extension that allows selected system libraries to use small amounts
> of static TLS space to allow critical variables to be defined to use
> the initial-exec model even in dynamically-loaded libraries.

This usage is supposed to be deprecated. Why isn't libgomp using
TLSDESC/gnu2 model?

Rich


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Jakub Jelinek
On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote:
> On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote:
> > Hello,
> > 
> > we're running into a problem related to use of initial-exec access to
> > TLS variables in dynamically-loaded libraries.  Now, in general, this
> > is actually not supported.  However, there seems to an "inofficial"
> > extension that allows selected system libraries to use small amounts
> > of static TLS space to allow critical variables to be defined to use
> > the initial-exec model even in dynamically-loaded libraries.
> 
> This usage is supposed to be deprecated. Why isn't libgomp using
> TLSDESC/gnu2 model?

Because it is significantly slower.

Jakub


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Rich Felker
On Thu, Feb 12, 2015 at 05:11:45PM +0100, Jakub Jelinek wrote:
> On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote:
> > On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote:
> > > Hello,
> > > 
> > > we're running into a problem related to use of initial-exec access to
> > > TLS variables in dynamically-loaded libraries.  Now, in general, this
> > > is actually not supported.  However, there seems to an "inofficial"
> > > extension that allows selected system libraries to use small amounts
> > > of static TLS space to allow critical variables to be defined to use
> > > the initial-exec model even in dynamically-loaded libraries.
> > 
> > This usage is supposed to be deprecated. Why isn't libgomp using
> > TLSDESC/gnu2 model?
> 
> Because it is significantly slower.

Seems very unlikely. If storage is allocated in static TLS, TLSDESC is
almost indistinguishable from IE in performance, even when you run
artificial benchmarks that do nothing but hammer TLS access. When it
gets allocated in dynamic TLS, it's somewhat slower, but still
unlikely to matter for most usage IMO. Do you have actual numbers
showing that TLSDESC is too slow for libgomp?

Rich


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread H.J. Lu
On Thu, Feb 12, 2015 at 8:11 AM, Jakub Jelinek  wrote:
> On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote:
>> On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote:
>> > Hello,
>> >
>> > we're running into a problem related to use of initial-exec access to
>> > TLS variables in dynamically-loaded libraries.  Now, in general, this
>> > is actually not supported.  However, there seems to an "inofficial"
>> > extension that allows selected system libraries to use small amounts
>> > of static TLS space to allow critical variables to be defined to use
>> > the initial-exec model even in dynamically-loaded libraries.
>>
>> This usage is supposed to be deprecated. Why isn't libgomp using
>> TLSDESC/gnu2 model?
>
> Because it is significantly slower.

And the TLSDESC/gnu2 model isn't implemented for x32.
There are no tests for the TLSDESC/gnu2 model in glibc.
I have no idea whether it works in glibc master on x86-32 or
x86-64 today.


-- 
H.J.


Re: Function outlining and partial Inlining

2015-02-12 Thread Jan Hubicka
> Hello All:
> 
> The large functions are the important part of high performance application. 
> They contribute to performance bottleneck with many
> respect. Some of the large hot functions are frequently executed but many 
> regions inside the functions are cold regions. The large
> Function blocks the function inlining  to happen before of the code size 
> constraints.
> 
> Such cold regions inside the hot large functions can be extracted out and 
> form the function outlining. Thus breaking the large functions
> Into smaller function segments which causes the functions to be inlined at 
> the caller site or helps in partial inlining.
> 
> LLVM Compiler has the functionality and the optimizations for function 
> outlining based on regions like basic blocks, superblocks and
> Hyperblocks which gets extracted out into smaller function segments and thus 
> enabling the partial inlining and function inlining to happen
> At the caller site.
> 
> This optimization is the good case of profile guided optimizations and based 
> on the profile feedback data by the Compiler.
> Without profile information the above function outlining optimizations will 
> not be useful.
> 
> We are doing lot of optimization regarding polymorphism and also the indirect 
> icall promotion based on the profile feedback on the 
> Callgraph profile.
> 
> Are we doing the function outlining optimization in GCC with respect to 
> function inline and partial inline based on profile feedback
> Data. If not this optimization can be implemented. If already implemented in 
> GCC  Can I  know any pointer for such code in GCC and the 
> Scope of this function outlining optimization.

The outlining pass is called ipa-split.  The heuristic used is however quite
simplistic: it looks for the very specific case where a function has a small
header containing a conditional, and splits after that.  It does use the
profile.

Any work on improving the heuristics or providing interesting testcases to
consider would be welcome.

I think the LLVM pass is doing pretty much the same analysis minus the
profile feedback considerations.  After splitting, LLVM will inline the
header into all callers, while GCC leaves this to the inliner heuristics,
which may just merge the function back into one block.

The actual outlining logic is contained in tree-inline.c and is also used by
OpenMP.

Honza
> 
> If not implemented , Can I propose to have the optimization like function 
> outlining in GCC.
> 
> Thoughts Please?
> 
> Thanks & Regards
> Ajit


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Rich Felker
On Thu, Feb 12, 2015 at 08:56:26AM -0800, H.J. Lu wrote:
> On Thu, Feb 12, 2015 at 8:11 AM, Jakub Jelinek  wrote:
> > On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote:
> >> On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote:
> >> > Hello,
> >> >
> >> > we're running into a problem related to use of initial-exec access to
> >> > TLS variables in dynamically-loaded libraries.  Now, in general, this
> >> > is actually not supported.  However, there seems to be an "unofficial"
> >> > extension that allows selected system libraries to use small amounts
> >> > of static TLS space to allow critical variables to be defined to use
> >> > the initial-exec model even in dynamically-loaded libraries.
> >>
> >> This usage is supposed to be deprecated. Why isn't libgomp using
> >> TLSDESC/gnu2 model?
> >
> > Because it is significantly slower.
> 
> And TLSDESC/gnu2 model isn't implemented for x32.
> There are no tests for TLSDESC/gnu2 model in glibc.
> I have no idea if it works in glibc master on x86-32 or
> x86-64 today.

Then fixing this should be a priority, IMO. Broken libraries using IE
model "for performance" are a problem that's not going to go away
until TLSDESC gets properly adopted.

Rich
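
For readers less familiar with the access models being argued about, here is a
minimal sketch using GCC's __thread extension and the tls_model attribute.  The
variable and function names are hypothetical.  In a shared library built with
-fPIC, the first variable would normally get the general-dynamic model (or TLS
descriptors with -mtls-dialect=gnu2 on x86), while the second is forced to
initial-exec and therefore needs static TLS space when the library is loaded
with dlopen.

```c
/* Default model: chosen by the compiler from -fPIC, visibility, etc. */
__thread int hits_default;

/* Forced initial-exec: fast access, but it consumes static TLS space,
   which is what makes dlopen of such a library fragile. */
__thread int hits_ie __attribute__((tls_model("initial-exec")));

/* Each thread sees and increments its own copy of the counters. */
int bump_default(void)
{
    return ++hits_default;
}

int bump_ie(void)
{
    return ++hits_ie;
}
```

Both functions behave identically from the caller's point of view; only the
generated access sequence and the loader requirements differ.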


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Andrew Haley
On 02/12/2015 04:16 PM, Rich Felker wrote:
> On Thu, Feb 12, 2015 at 05:11:45PM +0100, Jakub Jelinek wrote:
>> On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote:
>>>
>>> This usage is supposed to be deprecated. Why isn't libgomp using
>>> TLSDESC/gnu2 model?
>>
>> Because it is significantly slower.
> 
> Seems very unlikely. If storage is allocated in static TLS, TLSDESC is
> almost indistinguishable from IE in performance, even when you run
> artificial benchmarks that do nothing but hammer TLS access. When it
> gets allocated in dynamic TLS, it's somewhat slower, but still
> unlikely to matter for most usage IMO.

The problem I'm seeing is that dynamic TLS is always used even when not
necessary, and that hurts Java (which accesses TLS 128k times in the first
500ms or so of execution).  According to lxo his patch fixes that.

Andrew.


gcc-4.8-20150212 is now available

2015-02-12 Thread gccadmin
Snapshot gcc-4.8-20150212 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20150212/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.8 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_8-branch 
revision 220665

You'll find:

 gcc-4.8-20150212.tar.bz2 Complete GCC

  MD5=7cceff112b4dfca602d1264326b37ab5
  SHA1=f1963df7da0e82372f8e3c6ec5400f3775e4c05c

Diffs from 4.8-20150129 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.8
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: Postpone expanding va_arg until pass_stdarg

2015-02-12 Thread Tom de Vries

On 12-02-15 14:57, Michael Matz wrote:

Hi,

On Wed, 11 Feb 2015, Tom de Vries wrote:


My idea was to not generate temporaries and hence copies for
non-scalar types, but rather construct the "result" of va_arg directly
into the original LHS (that would then also trivially solve the
problem of non-copyable types).


The copy mentioned here is of ap, not of the result of va_arg.


Whoops, I misread, yes.  Thanks.



Hi,

Btw, I'm not happy about the ap copies, but I haven't been able to get rid of 
them.


I'm not really sure yet why std_gimplify_va_arg_expr has a part
commented out. Michael, can you comment?


I think I did that because of SSA form.  The old sequence calculated

vatmp = valist;
vatmp = vatmp + boundary-1
vatmp = vatmp & -boundary

(where the local variable in that function 'valist_tmp' is the tree
VAR_DECL 'vatmp') and then continued to use valist_tmp.  When in SSA form
the gimplifier will rewrite this into:

vatmp_1 = valist;
vatmp_2 = vatmp_1 + boundary-1
vatmp_3 = vatmp_2 & -boundary

but the local valist_tmp variable will continue to be the VAR_DECL, not
the vatmp_3 SSA name.  Basically, whenever one gimplifies a MODIFY_EXPR
while in SSA form it's suspicious.  So the new code simply builds the
expression:

((valist + bound-1) & -bound)

gimplifies that into an rvalue (most probably an SSA name) and uses that
to go on generating code by making valist_tmp be that returned rvalue.

I think you'll find that removing that code will make the SSA verifier
scream or generate invalid code with -m32 when that hook is used.
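
The expression above is the usual round-up-to-alignment idiom.  A standalone C
sketch of it follows, using the hypothetical name align_up and assuming that
boundary is a power of two (the idiom does not work otherwise):

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of boundary.
   boundary must be a nonzero power of two. */
static uintptr_t align_up(uintptr_t addr, uintptr_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    /* For unsigned values, -boundary equals ~(boundary - 1),
       i.e. a mask that clears the low bits. */
    return (addr + boundary - 1) & -boundary;
}
```

For example, align_up(13, 8) yields 16, and an already aligned value such as 16
is returned unchanged.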



Thanks for the detailed explanation. I'm not sure I understand the
problem well enough, so I'll try to trigger it and investigate.


Actually the above fails to mention what the real problem is :-)  The
problem is that the local variable valist_tmp will be used to generate
further code after the above expression is generated.  Without my patch it
will continue to point to the VAR_DECL, not to the SSA name that actually
holds the computed value in the generated code.



I have not been able to reproduce this problem (with a bootstrap build on x86_64 
for all languages, and {unix/,unix/-m32} testing), so I've dropped this bit for now.


I've pushed the latest status to vries/expand-va-arg-at-pass-stdarg.

-ftree-stdarg-opt (the va_list_gpr/fpr_size optimization) has been re-enabled.
I needed patch "Always check phi-ops in optimize_va_list_gpr_fpr_size"
for that.


With a similar bootstrap and reg-test as described above, there's only one 
failure left:

...
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg "f15: va_list escapes 0, 
needs to save [148] GPR units and [1-9][0-9]* FPR units"

...
And this is due to the ap copy, which is classified as escape.

[ We're still expanding ifn_va_arg before the va_list_gpr/fpr_size 
optimization. ]

Thanks,
- Tom


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Rich Felker
On Thu, Feb 12, 2015 at 06:23:12PM +, Andrew Haley wrote:
> On 02/12/2015 04:16 PM, Rich Felker wrote:
> > On Thu, Feb 12, 2015 at 05:11:45PM +0100, Jakub Jelinek wrote:
> >> On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote:
> >>>
> >>> This usage is supposed to be deprecated. Why isn't libgomp using
> >>> TLSDESC/gnu2 model?
> >>
> >> Because it is significantly slower.
> > 
> > Seems very unlikely. If storage is allocated in static TLS, TLSDESC is
> > almost indistinguishable from IE in performance, even when you run
> > artificial benchmarks that do nothing but hammer TLS access. When it
> > gets allocated in dynamic TLS, it's somewhat slower, but still
> > unlikely to matter for most usage IMO.
> 
> The problem I'm seeing is that dynamic TLS is always used even when not
> necessary, and that hurts Java (which accesses TLS 128k times in the first
> 500ms or so of execution).  According to lxo his patch fixes that.

Given those numbers, each access would need to be taking 38ns to
consume even 1% of the cpu time being spent. I would guess accesses
are closer to 5ns for TLSDESC in static area and 10-15ns for dynamic.
So I don't think this is a bottleneck.

Rich
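
The 38ns figure can be checked with a few lines of arithmetic.  The sketch
below assumes that 128k means 128*1024 accesses and that the window is exactly
500ms; the function names are made up for illustration.

```c
/* Per-access time budget (in ns) at which `accesses` TLS accesses
   would consume `fraction` of a window of `window_ms` milliseconds. */
double ns_per_access_budget(double window_ms, double accesses, double fraction)
{
    double window_ns = window_ms * 1e6;   /* 1 ms = 1e6 ns */
    return window_ns * fraction / accesses;
}

/* Fraction of the window spent in TLS accesses, given a per-access cost. */
double cpu_share(double window_ms, double accesses, double ns_per_access)
{
    return accesses * ns_per_access / (window_ms * 1e6);
}
```

Under those assumptions the budget for a 1% share comes out at roughly 38ns per
access; reading 128k as 128,000 gives roughly 39ns instead.  At 15ns per access
the share is about 0.4% of the window.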


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Alan Modra
On Thu, Feb 12, 2015 at 12:07:24PM -0500, Rich Felker wrote:
> On Thu, Feb 12, 2015 at 08:56:26AM -0800, H.J. Lu wrote:
> > On Thu, Feb 12, 2015 at 8:11 AM, Jakub Jelinek  wrote:
> > > On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote:
> > >> On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote:
> > >> > Hello,
> > >> >
> > >> > we're running into a problem related to use of initial-exec access to
> > >> > TLS variables in dynamically-loaded libraries.  Now, in general, this
> > >> > is actually not supported.  However, there seems to be an "unofficial"
> > >> > extension that allows selected system libraries to use small amounts
> > >> > of static TLS space to allow critical variables to be defined to use
> > >> > the initial-exec model even in dynamically-loaded libraries.
> > >>
> > >> This usage is supposed to be deprecated. Why isn't libgomp using
> > >> TLSDESC/gnu2 model?
> > >
> > > Because it is significantly slower.
> > 
> > And TLSDESC/gnu2 model isn't implemented for x32.
> > There are no tests for TLSDESC/gnu2 model in glibc.
> > I have no idea if it works in glibc master on x86-32 or
> > x86-64 today.
> 
> Then fixing this should be a priority, IMO. Broken libraries using IE
> model "for performance" are a problem that's not going to go away
> until TLSDESC gets properly adopted.

I posted support for TLSDESC on powerpc back in 2009 (search for
powerpc _tls_get_addr call optimization).  The patch wasn't reviewed,
and I didn't push it because my benchmark tests didn't show much of
a gain.  Quite possibly I wasn't using the right benchmark.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Rich Felker
On Fri, Feb 13, 2015 at 10:12:11AM +1030, Alan Modra wrote:
> On Thu, Feb 12, 2015 at 12:07:24PM -0500, Rich Felker wrote:
> > On Thu, Feb 12, 2015 at 08:56:26AM -0800, H.J. Lu wrote:
> > > On Thu, Feb 12, 2015 at 8:11 AM, Jakub Jelinek  wrote:
> > > > On Thu, Feb 12, 2015 at 11:09:59AM -0500, Rich Felker wrote:
> > > >> On Thu, Feb 12, 2015 at 04:18:57PM +0100, Ulrich Weigand wrote:
> > > >> > Hello,
> > > >> >
> > > >> > we're running into a problem related to use of initial-exec access to
> > > >> > TLS variables in dynamically-loaded libraries.  Now, in general, this
> > > >> > is actually not supported.  However, there seems to be an "unofficial"
> > > >> > extension that allows selected system libraries to use small amounts
> > > >> > of static TLS space to allow critical variables to be defined to use
> > > >> > the initial-exec model even in dynamically-loaded libraries.
> > > >>
> > > >> This usage is supposed to be deprecated. Why isn't libgomp using
> > > >> TLSDESC/gnu2 model?
> > > >
> > > > Because it is significantly slower.
> > > 
> > > And TLSDESC/gnu2 model isn't implemented for x32.
> > > There are no tests for TLSDESC/gnu2 model in glibc.
> > > I have no idea if it works in glibc master on x86-32 or
> > > x86-64 today.
> > 
> > Then fixing this should be a priority, IMO. Broken libraries using IE
> > model "for performance" are a problem that's not going to go away
> > until TLSDESC gets properly adopted.
> 
> I posted support for TLSDESC on powerpc back in 2009 (search for
> powerpc _tls_get_addr call optimization).  The patch wasn't reviewed,
> and I didn't push it because my benchmark tests didn't show much of
> a gain.  Quite possibly I wasn't using the right benchmark.

Were you measuring static-allocated TLSDESC vs non-TLSDESC GD model?
That's the case where there should be a "big" difference, though I'm
still somewhat skeptical of the benefits in real-world usage cases.

I think Alexandre Oliva had a tool along with the original paper to
measure the performance, and I did some simple testing myself a while
back I could dig up the source for.

Rich


Re: Failure to dlopen libgomp due to static TLS data

2015-02-12 Thread Alan Modra
On Thu, Feb 12, 2015 at 06:55:30PM -0500, Rich Felker wrote:
> On Fri, Feb 13, 2015 at 10:12:11AM +1030, Alan Modra wrote:
> > I posted support for TLSDESC on powerpc back in 2009 (search for
> > powerpc _tls_get_addr call optimization).  The patch wasn't reviewed,
> > and I didn't push it because my benchmark tests didn't show much of
> > a gain.  Quite possibly I wasn't using the right benchmark.
> 
> Were you measuring static-allocated TLSDESC vs non-TLSDESC GD model?
> That's the case where there should be a "big" difference, though I'm
> still somewhat skeptical of the benefits in real-world usage cases.

I can't remember, sorry, it was too long ago.

-- 
Alan Modra
Australia Development Lab, IBM