Re: lto-plugin: mismatch between ld's architecture and GCC's configure --host
Hi!

On Mon, 14 Oct 2013 12:15:41 +0200, Richard Biener wrote:
> I suppose nobody thought of this but I wouldn't call it a scenario that
> is desired to support either ;)

Why not support this scenario? Have you seen the patches I posted yesterday? They cause no changes for builds that do not use the new option I added.

Regards, Thomas
[gomp4] Building binaries for offload.
Hello,

Let me summarize the current understanding of host binary linking, as well as target binary building and linking.

We put code that is supposed to be offloaded into dedicated sections, with names starting with gnu.target_lto_.

At link time (I mean, link time of the host app) we:
1. Generate a dedicated data section in each binary (executable or DSO), which will be a placeholder for the offloading stuff.
2. Generate an __OPENMP_TARGET__ (weak, hidden) symbol, which will point to the start of the section mentioned in the previous item.

This section should contain at least:
1. Number of targets
2. Size of the offload symbols table
[ Repeat per `number of targets' ]
2. Name of target
3. Offset to the beginning of the image to offload to that target
4. Size of the image
5. Offload symbols table

The offload symbols table will contain information about the addresses of offloadable symbols, in order to create the host<->target address mapping at runtime. To get the list of target addresses we need a dedicated interface call to the libgomp plugin, something like getTargetAddresses (), which will query the target for the list of addresses (accompanied by symbol names). To provide this information, the target DSO should contain a similar table mapping symbols to addresses.

An application is going to have a single instance of libgomp, which in turn means we'll have a single splay tree holding the mapping information (host -> target) for all DSOs and the executable. When GOMP_target* is called, a pointer to the table of the current execution module is passed to libgomp along with a pointer to the routine (or global). libgomp in turn:
1. Checks in the splay tree whether the address of the given pointer (to the table) exists. If not, then the given table is not yet initialized; libgomp initializes it (see below) and inserts the table's address into the splay tree.
2. Performs a lookup for the (host) address in the table provided and extracts the target address.
3.
After the target address is found, we perform an API call (passing that address) to the given device.

We have at least two approaches to solving the host->target mapping.

I. Preserve the order in which symbols appear. Table row: [ address, size ]; for routines, size is 1. To initialize the table we need two arrays, of host and of target addresses, and the order of appearance of objects in these arrays must be the same. That makes the mapping easy: we just find the index of a given address in the array of host addresses, then dereference the array of target addresses with the index found. The problem is that this is unlikely to work when LTO of the host is on. I am also not sure that the order of handling objects on the target is the same as on the host.

II. Store a symbol identifier along with the address. Table row: [ symbol_name, address, size ]; for routines, size is 1. To construct the table of host addresses, at link time we put the addresses of all symbols (marked at compile time with a dedicated attribute) into the table, accompanied by the symbol names (which serve as keys). During initialization of the table we create the host->target address mapping using the symbol names as keys.

The last thing I wanted to summarize: compiling target code. We have two approaches here:
1. Perform WPA and extract the sections marked as target into a separate object file, then call the target compiler on that object file to produce the binary. As mentioned by Jakub, this approach will complicate debugging.
2. Pass the fat object files directly to the target compiler (one CU at a time). So for every object file we are going to call GCC twice: the host GCC, which will compile all host code of the CU, and the target GCC, which will compile all target code of the CU.

I vote for option #2, since the WPA-based approach complicates debugging. What do you guys think?

-- Thanks, K
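The symbol-name-keyed lookup of approach II can be sketched in plain C. This is a hypothetical illustration only: the struct layout, field names, and the helper below are mine, not the actual gomp4 table ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the offload symbols table (approach II):
   [ symbol_name, address, size ], with size == 1 for routines.  */
struct offload_sym
{
  const char *name;  /* key shared between host and target tables */
  void *addr;        /* address of the symbol in this image */
  size_t size;       /* 1 for routines, object size for data */
};

/* Find the address recorded for a given symbol name, as the table
   initialization step would do for every host-side row when building
   the host->target mapping.  Returns NULL if the name is absent.  */
static void *
lookup_by_name (const struct offload_sym *tab, size_t n, const char *name)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (tab[i].name, name) == 0)
      return tab[i].addr;
  return NULL;
}
```

Unlike approach I, this lookup does not depend on the order in which either compiler emitted the symbols, which is why it should survive host-side LTO reordering.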
Re: programming language that does not inhibit further optimization by gcc
GCC does value analysis similar to what you mention; you'll find it under the -fdump-tree-vrp options. To provide extra information you can add range checks, which GCC will pick up on. If you know a value is small, use a small integer type and GCC will pick up the range of values that can be assigned to it.

What are the problems you're trying to solve? Is it a low-memory system you're running on? If you're after performance, add restrict to your parameters, and either use unions to get around aliasing or do what the Linux dev team does with -fno-strict-aliasing.

Regarding threading: I think trying to use multiple threads without having to learn thread libraries is a bit of a gamble. Threading is difficult even in high-level languages, and you should have a good background before approaching it.

For struct packing, I suppose you could just order your entries largest-first, which is one approach, but it's rather like the 0-1 knapsack problem.

On 15 October 2013 01:31, Albert Abramson wrote:
> I have been looking everywhere online and talking to other coders at
> every opportunity about this, but cannot find a complete answer.
> Different languages have different obstacles to complete optimization.
> Software developers often have to drop down into non-portable
> Assembly because they can't get the performance or small size of
> hand-optimized Assembly for their particular platform.
>
> The C language has the alias issue that limits the hoisting of loads.
> Unless the programmer specifies that two arrays will never overlap
> using the 'restrict' keyword, the compiler may not be able to handle
> operations on arrays efficiently because of the unlikely event that
> the arrays could overlap. Most/all languages also demand the
> appearance of serialization of instructions and memory operations, as
> well as extreme correctness in even the most unlikely circumstances,
> even where the programmer may not need them.
> > Is there a language out there (similar to Fortran or a dialect of C) > that doesn't inhibit the compiler from taking advantage of every > optimization possible? Is there some way to provide a C/C++ compiler > with extra information about variables and programs so that it can > maximize performance or minimize size? For example: > > int age = 21;//[0, 150) setting maximum limits, compiler could use byte > int > int outsideTemp = 20;//[-273, 80] > float ERA = 297; //[0, 1000, 3] [min, max, digits of > accuracy needed] > > Better yet, allow some easier way of spawning multiple threads without > have to learn all of the Boost libraries, OpenCL, or OpenGL. In other > words, is there yet a language that is designed only for performance > that places no limits on compiler optimizations? Is there a language > that allows the compiler to pack struct variables in tighter by > reorganizing those values, etc? > > If not, is it possible to put together some dialect of C/C++ that > replaces Assembly outright? > > -- > Max Abramson > “In the end, more than freedom, they wanted security. They wanted a > comfortable life, and they lost it all – security, comfort, and > freedom. When the Athenians finally wanted not to give to society but > for society to give to them, when the freedom they wished for most was > freedom from responsibility, then Athens ceased to be free and was > never free again.” --Sir Edward Gibbon
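The range annotations Albert asks for can be partially approximated in plain C today, along the lines of the advice above: pick a small unsigned type and add an explicit range check, and GCC's value-range propagation (visible with -fdump-tree-vrp) can exploit both. A minimal sketch (the function is hypothetical, invented for illustration):

```c
#include <assert.h>

/* With `age' declared unsigned char, VRP already knows it lies in
   [0, 255]; the explicit check narrows it to [0, 149] on the
   fall-through path, so the compiler can prove `age / 200' is
   always 0 there and fold the division away.  */
static int
classify_age (unsigned char age)
{
  if (age >= 150)
    return -1;          /* outside the documented [0, 150) range */
  return age / 200;     /* provably 0 once VRP has the range */
}
```

This gives the optimizer the same [min, max] facts as Albert's proposed comment syntax, just expressed through types and control flow.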
Re: wide-int branch timings
On Tue, Oct 15, 2013 at 1:12 AM, Mike Stump wrote: > So, here is a comparison of the time required to do a make -j15 of a > --disable-bootstrap --enable-checking=none --enable-languages=c,c++ style > compiler. The base compiler is a --enable-checking=none > --enable-languages=c,c++,lto style compiler, which is > 1b2bf75690af8115739ebba710a44d05388c7a1a (aka trunk@202797) from git. The > wide branch compiler is 4529820913813b810860784382f975ea8e6be61d (aka > wide-int@203462) from git. The software compiled in both cases is the base > compiler described above. > > Net result, around 2.6% regression in user time, and 0.4% in elapsed time. > The raw data is below, just in case one is interested. This is on Ubuntu > 12.04.3 system with 12GB ram with 8 cores. Btw, more interesting are testcases that put a heavy load on the alias machinery, like (many) (nested) loops with a lot of memory references. Like the testcase in PR39326. If you profile that you will see some of the double_int routines high in the profile which means on the branch wide_int routines should start to show up. I didn't expect visible differences for a bootstrap, but you proved me wrong :( Btw, with parallel make a single file getting a lot slower can be masked by parallelism completely, so I take timings with -j with a grain of salt. Thanks, Richard. 
> wide branch: > > 1760.94user 145.78system 5:06.23elapsed 622%CPU (0avgtext+0avgdata > 2317824maxresident)k > 32976inputs+5713232outputs (1487major+72639003minor)pagefaults 0swaps > 1758.53user 145.40system 5:06.66elapsed 620%CPU (0avgtext+0avgdata > 2317808maxresident)k > 1104inputs+5713240outputs (9major+72644909minor)pagefaults 0swaps > 1751.91user 145.77system 5:05.27elapsed 621%CPU (0avgtext+0avgdata > 2317808maxresident)k > 0inputs+5713232outputs (0major+72652872minor)pagefaults 0swaps > 1751.29user 145.78system 5:06.15elapsed 619%CPU (0avgtext+0avgdata > 2317808maxresident)k > 8inputs+5713256outputs (0major+72647952minor)pagefaults 0swaps > 1755.10user 145.26system 5:02.74elapsed 627%CPU (0avgtext+0avgdata > 2317808maxresident)k > 96inputs+5713264outputs (1major+72642787minor)pagefaults 0swaps > > base: > > 1708.71user 145.02system 5:04.98elapsed 607%CPU (0avgtext+0avgdata > 2317824maxresident)k > 0inputs+5713448outputs (0major+72602789minor)pagefaults 0swaps > 1707.43user 145.56system 5:05.24elapsed 607%CPU (0avgtext+0avgdata > 2317808maxresident)k > 0inputs+5713424outputs (0major+72606028minor)pagefaults 0swaps > 1711.61user 145.53system 5:03.49elapsed 611%CPU (0avgtext+0avgdata > 2317808maxresident)k > 160inputs+5713424outputs (6major+72614090minor)pagefaults 0swaps > 1712.64user 145.25system 5:02.98elapsed 613%CPU (0avgtext+0avgdata > 2317808maxresident)k > 0inputs+5713432outputs (0major+72599974minor)pagefaults 0swaps > 1708.81user 144.66system 5:01.61elapsed 614%CPU (0avgtext+0avgdata > 2317808maxresident)k > 24inputs+5713448outputs (0major+72599501minor)pagefaults 0swaps
Re: wide-int branch timings
On Tue, Oct 15, 2013 at 2:10 PM, Richard Biener wrote: > On Tue, Oct 15, 2013 at 1:12 AM, Mike Stump wrote: >> So, here is a comparison of the time required to do a make -j15 of a >> --disable-bootstrap --enable-checking=none --enable-languages=c,c++ style >> compiler. The base compiler is a --enable-checking=none >> --enable-languages=c,c++,lto style compiler, which is >> 1b2bf75690af8115739ebba710a44d05388c7a1a (aka trunk@202797) from git. The >> wide branch compiler is 4529820913813b810860784382f975ea8e6be61d (aka >> wide-int@203462) from git. The software compiled in both cases is the base >> compiler described above. >> >> Net result, around 2.6% regression in user time, and 0.4% in elapsed time. >> The raw data is below, just in case one is interested. This is on Ubuntu >> 12.04.3 system with 12GB ram with 8 cores. > > Btw, more interesting are testcases that put a heavy load on the alias > machinery, like (many) (nested) loops with a lot of memory references. > Like the testcase in PR39326. If you profile that you will see some > of the double_int routines high in the profile which means on the > branch wide_int routines should start to show up. > > I didn't expect visible differences for a bootstrap, but you proved me > wrong :( Btw, with parallel make a single file getting a lot slower can > be masked by parallelism completely, so I take timings with -j > with a grain of salt. 
For example for get_ref_base_and_extent the adds to bit_offset (even though initially of addr_wide_int kind) end up unoptimized, exposing if (len_822 > 2) goto ; else goto ; : xprecision_819 = (unsigned int) D.54901_818; if (xprecision_819 > 127) goto ; else goto ; : D.54899_838 = D.54922_816->base.u.bits.unsigned_flag; D.54900_839 = (signop) D.54899_838; len_840 = wi::force_to_size (&MEM[(struct wide_int_ref_storage *)&yi].scratch, val_823, len_822, xprecision_819, 128, D.54900_839); : # val_1543 = PHI # len_1542 = PHI <2(93), len_840(95), len_822(94)> MEM[(struct generic_wide_int *)&yi].val = val_1543; MEM[(struct generic_wide_int *)&yi].len = len_1542; MEM[(struct generic_wide_int *)&yi].precision = 128; D.54871_813 = wi::add_large (&MEM[(struct fixed_wide_int_storage *)&D.54875].D.43191.val, &MEM[(const struct fixed_wide_int_storage *)&bit_offset].val, D.54872_808, val_1543, len_1542, 128, 1, 0B); MEM[(unsigned int *)&D.54875 + 24B] = D.54871_813; __builtin_memcpy (&bit_offset, &D.54875, 28); goto (); one issue you can clearly see is that too much of the temporaries (like here the wide_int_ref yi that is created for the tree) ends up being addressable. That's because its data is embedded and passed to add_large (instead of what you'd say is "ref" storage, refering to storage elsewhere). Which is because of the canonicalization mismatch between tree, wide-int and RTX I guess. Not sure where the memcpy comes from in the above code - seems that bit_offset += TREE_OPERAND (exp, 2); builds a temporary bit_offset + TREE_OPERAND (exp, 2) that is then copied to bit_offset and this copy cannot be elided. That said, how do cc1 binary sizes compare branch vs. trunk at the last merge point? Richard.
Re: wide-int branch timings
On Tue, Oct 15, 2013 at 2:41 PM, Richard Biener wrote: > On Tue, Oct 15, 2013 at 2:10 PM, Richard Biener > wrote: >> On Tue, Oct 15, 2013 at 1:12 AM, Mike Stump wrote: >>> So, here is a comparison of the time required to do a make -j15 of a >>> --disable-bootstrap --enable-checking=none --enable-languages=c,c++ style >>> compiler. The base compiler is a --enable-checking=none >>> --enable-languages=c,c++,lto style compiler, which is >>> 1b2bf75690af8115739ebba710a44d05388c7a1a (aka trunk@202797) from git. The >>> wide branch compiler is 4529820913813b810860784382f975ea8e6be61d (aka >>> wide-int@203462) from git. The software compiled in both cases is the base >>> compiler described above. >>> >>> Net result, around 2.6% regression in user time, and 0.4% in elapsed time. >>> The raw data is below, just in case one is interested. This is on Ubuntu >>> 12.04.3 system with 12GB ram with 8 cores. >> >> Btw, more interesting are testcases that put a heavy load on the alias >> machinery, like (many) (nested) loops with a lot of memory references. >> Like the testcase in PR39326. If you profile that you will see some >> of the double_int routines high in the profile which means on the >> branch wide_int routines should start to show up. >> >> I didn't expect visible differences for a bootstrap, but you proved me >> wrong :( Btw, with parallel make a single file getting a lot slower can >> be masked by parallelism completely, so I take timings with -j >> with a grain of salt. 
> > For example for get_ref_base_and_extent the adds to bit_offset > (even though initially of addr_wide_int kind) end up unoptimized, > exposing > > if (len_822 > 2) > goto ; > else > goto ; > > : > xprecision_819 = (unsigned int) D.54901_818; > if (xprecision_819 > 127) > goto ; > else > goto ; > > : > D.54899_838 = D.54922_816->base.u.bits.unsigned_flag; > D.54900_839 = (signop) D.54899_838; > len_840 = wi::force_to_size (&MEM[(struct wide_int_ref_storage > *)&yi].scratch, val_823, len_822, xprecision_819, 128, D.54900_839); > > : > # val_1543 = PHI *)&yi].scratch(95), val_823(94)> > # len_1542 = PHI <2(93), len_840(95), len_822(94)> > MEM[(struct generic_wide_int *)&yi].val = val_1543; > MEM[(struct generic_wide_int *)&yi].len = len_1542; > MEM[(struct generic_wide_int *)&yi].precision = 128; > D.54871_813 = wi::add_large (&MEM[(struct fixed_wide_int_storage > *)&D.54875].D.43191.val, &MEM[(const struct fixed_wide_int_storage > *)&bit_offset].val, D.54872_808, val_1543, len_1542, 128, 1, 0B); > MEM[(unsigned int *)&D.54875 + 24B] = D.54871_813; > __builtin_memcpy (&bit_offset, &D.54875, 28); > goto (); That was built with host G++ 4.6, with trunk you see it more obvious: : # SR.574_214 = PHI <_507(69), &MEM[(struct wide_int_ref_storage *)&yi].scratch(70), _507(68)> # SR.575_810 = PHI MEM[(struct generic_wide_int *)&yi] = SR.574_214; MEM[(struct generic_wide_int *)&yi + 8B] = SR.575_810; MEM[(struct generic_wide_int *)&yi + 12B] = 128; _468 = wi::add_large (&MEM[(struct fixed_wide_int_storage *)&D.52085].val, &MEM[(const struct fixed_wide_int_storage *)&bit_offset].val, _463, SR.574_214, SR.575_810, 128, 1, 0B); MEM[(unsigned int *)&D.52085 + 24B] = _468; yi ={v} {CLOBBER}; MEM[(struct generic_wide_int *)&bit_offset] = MEM[(struct generic_wide_int *)&D.52085]; D.52085 ={v} {CLOBBER}; goto (); even though yi dies after the call to wi::add_large we cannot remove the pointless initializations of its members as its address escapes. Richard.
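The address-escape problem Richard describes can be reproduced in miniature outside GCC: once a pointer into a local aggregate is passed to an out-of-line function, the aggregate must stay addressable and its member stores cannot be deleted, even though it dies immediately afterwards. A purely illustrative stand-alone example (the names only mimic the wide-int dump; none of this is GCC code):

```c
#include <assert.h>

/* Out-of-line callee: taking a raw pointer is what forces the
   caller's temporary to be addressable, just as passing &yi's
   embedded storage to wi::add_large does in the dump above.  */
__attribute__ ((noinline)) static long
sum_large (const long *p, unsigned len)
{
  long s = 0;
  for (unsigned i = 0; i < len; i++)
    s += p[i];
  return s;
}

long
add_offsets (long a, long b)
{
  struct { long val[2]; unsigned len; } yi;  /* plays the role of `yi' */
  yi.val[0] = a;
  yi.val[1] = b;
  yi.len = 2;
  /* yi.val escapes into sum_large, so the compiler must keep yi in
     memory and cannot scalarize it or remove the stores above, even
     though yi is dead right after this call.  */
  return sum_large (yi.val, yi.len);
}
```

A "ref"-style storage scheme, where the wide-int reference points at data owned elsewhere instead of embedding it, would avoid creating such an escaping temporary in the first place.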
Compilation flags in libgfortran
Hi all!

Is there any particular reason that the matmul* modules from libgfortran are compiled with -O2 -ftree-vectorize? I see some regressions on the Atom processor after r202980 (http://gcc.gnu.org/ml/gcc-cvs/2013-09/msg00846.html). Why not just use -O3 for those modules?

Thanks, Igor
Re: programming language that does not inhibit further optimization by gcc
On Mon, Oct 14, 2013 at 5:31 PM, Albert Abramson wrote: > > Is there a language out there (similar to Fortran or a dialect of C) > that doesn't inhibit the compiler from taking advantage of every > optimization possible? Sure: Fortran. > Is there some way to provide a C/C++ compiler > with extra information about variables and programs so that it can > maximize performance or minimize size? For example: > > int age = 21;//[0, 150) setting maximum limits, compiler could use byte > int > int outsideTemp = 20;//[-273, 80] > float ERA = 297; //[0, 1000, 3] [min, max, digits of > accuracy needed] Hmmm, OK, that kind of thing is available in PL/1 and, I think, in Ada. But as far as I know it doesn't help compilers very much in practice. Ian
Re: function attributes
Hi Ian,

Thanks for the reply.

On Fri, Oct 11, 2013 at 10:31 PM, Ian Lance Taylor wrote:
> On Fri, Oct 11, 2013 at 9:20 AM, Nagaraju Mekala wrote:
>>
>> I observed that in the rs6000 port longcall is implemented by using
>> the CALL_LONG define.
>> #define CALL_LONG 0x0008 /* always call indirect */
>> In the md file they check the operand against CALL_LONG:
>> if (INTVAL (operands[3]) & CALL_LONG)
>>   operands[1] = rs6000_longcall_ref (operands[1]);
>> In my port I don't have such a thing to compare against. Can we somehow
>> parse the tree chain and check the attributes of the functions?
>
> Look at init_cumulative_args in rs6000.c to see how CALL_LONG is set
> based on the function attribute.

I was able to get the function attribute in the init_cumulative_args function; I used the fndecl tree to get the attribute details. But I have failed to stop generating the br instruction; it should print a bk instruction instead. I was unable to relate the super attribute seen in init_cumulative_args to the branch pattern in the md file so that it generates the bk instruction. I have initialized a global variable to 1 when super is detected and check it in my pattern. My branch pattern looks like this:

(define_insn "call_int1"
  [(call (mem (match_operand:SI 0 "call_insn_simple_operand" "ri"))
         (match_operand:SI 1 "" "i"))
   (clobber (reg:SI R_RS))]
  ""
{
  register rtx t = operands[0];
  register rtx t2 = gen_rtx_REG (Pmode, GP_REG_FIRST + RETURN_ADDR_REGNUM);
  if (GET_CODE (t) == SYMBOL_REF) {
      if (super_var ())   /* ---> Here I check the global variable */
        {
          return "bk\tr1,8\;%#";
        }
      else {
          gen_rtx_CLOBBER (VOIDmode, t2);
          return "br\tr1,%0\;%#";

I observed that init_cumulative_args is called first for all the functions; only once they are all done is the pattern above used for all the instructions, so my global variable is not useful. Can you help me emit the bk instruction from the pattern exactly when a super function is called?

> Ian

Thanks, Nagaraju
Re: function attributes
On Tue, Oct 15, 2013 at 8:04 AM, Nagaraju Mekala wrote: > Hi Ian, > > Thanks for the reply. > > On Fri, Oct 11, 2013 at 10:31 PM, Ian Lance Taylor wrote: >> On Fri, Oct 11, 2013 at 9:20 AM, Nagaraju Mekala >> wrote: >>> >>> I observed that in rs6000 port longcall is implemented by using >>> CALL_LONG define. >>> #define CALL_LONG 0x0008 /* always call indirect */ >>> In the md file they are checking the operand with CALL_LONG >>> if (INTVAL (operands[3]) & CALL_LONG) >>> operands[1] = rs6000_longcall_ref (operands[1]); >>> In my port I dont have suchthing to compare. Can we somehow parse the >>> tree chain and check the attributes of the functions.. >> >> Look at init_cumulative_args in rs6000.c to see how CALL_LONG is set >> based on the function attribute. > > I was able to get the function attribute from the init_cumulative_args > function. I have used the fndecl tree to get the attribute details > but I have failed to stop generating br instruction. It should print > bk instruction. > I was unable to relate the super attribute from init_cumulative_args > to the branch pattern in md file to generate bk instruction. > I have intialized a global variable to 1 if super is detected and > checking the same in my pattern. > My branch pattern looks like below > (define_insn "call_int1" > [(call (mem (match_operand:SI 0 "call_insn_simple_operand" "ri")) > (match_operand:SI 1 "" "i")) > (clobber (reg:SI R_RS))] > "" > { > register rtx t = operands[0]; > register rtx t2 = gen_rtx_REG (Pmode, > GP_REG_FIRST + RETURN_ADDR_REGNUM); > if (GET_CODE (t) == SYMBOL_REF) { > if(super_var()) ---> Here I am > checking for global variable > { > return "bk\tr1,8\;%#"; > } > else { > gen_rtx_CLOBBER (VOIDmode, t2); > return "br\tr1,%0\;%#"; > > I observed that init_cumulative_args is called first for all the > functions once they are done then the above pattern for all the > instructions are called so my global variable is not useful. 
> > Can you help me how to exactly emit bk instruction from the pattern > when super function is called. Again I just have to say: look at the rs6000 port. Look at the rs6000 call instruction. Look at how it decides whether to do a longcall or not. Ian
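The rs6000 scheme Ian points at can be paraphrased as: compute a per-call "cookie" from the callee's attributes at expand time and carry it as an extra operand of the call pattern, so the output routine never consults a global. A hedged stand-alone simulation of that flow (CALL_SUPER and both helpers are invented for illustration; in real GCC the attribute check would use lookup_attribute on the fndecl, and the cookie would travel as a CONST_INT operand):

```c
#include <assert.h>
#include <string.h>

#define CALL_SUPER 0x1   /* analogous to rs6000's CALL_LONG bit */

/* Expand-time step: derive the cookie from the callee's attribute
   list.  Stand-in for a lookup_attribute check on the fndecl.  */
static int
compute_call_cookie (const char *attr_list)
{
  return (attr_list && strstr (attr_list, "super")) ? CALL_SUPER : 0;
}

/* Output-time step: the "pattern" tests the cookie carried by this
   particular call, so per-callee information is never lost the way
   a single global flag loses it once expansion has moved on.  */
static const char *
output_call (int cookie)
{
  return (cookie & CALL_SUPER) ? "bk\tr1,8" : "br\tr1,%0";
}
```

The key point is that the decision is made once per call site while the fndecl is still in hand, and the result rides along inside the insn rather than in pass-global state.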
Re: Compilation flags in libgfortran
On 10/15/2013 03:58 PM, Igor Zamyatin wrote: Hi All! Is there any particular reason that matmul* modules from libgfortran are compiled with -O2 -ftree-vectorize? I see some regressions on Atom processor after r202980 (http://gcc.gnu.org/ml/gcc-cvs/2013-09/msg00846.html) Why not just use O3 for those modules? Igor, It helps (:-) to send questions about gfortran and its run time library libgfortran cc'd to fort...@gcc.gnu.org, because not every GNU Fortran maintainer reads gcc@gcc.gnu.org Kind regards, -- Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
Re: Compilation flags in libgfortran
On Tue, Oct 15, 2013 at 4:58 PM, Igor Zamyatin wrote: > Hi All! > > Is there any particular reason that matmul* modules from libgfortran > are compiled with -O2 -ftree-vectorize? Yes, testing showed that it improved performance compared to the default options. See the thread starting at http://gcc.gnu.org/ml/fortran/2005-11/msg00366.html In the almost 8 years (!!) since the patch was merged, I believe the importance of vectorization for utilizing current processors has only increased. [snip] > Why not just use O3 for those modules? Back when the change was made, -ftree-vectorize wasn't enabled by -O3. IIRC I did some tests, and -O3 didn't really improve things beyond what "-O2 -funroll-loops -ftree-vectorize" already did. That was a while ago however, so if somebody (*wink*) would care to redo the benchmarks things might look different with today's GCC on today's hardware. Hope this helps, -- Janne Blomqvist
Re: Cilk Library
On 10/09/13 12:32, Iyer, Balaji V wrote:

Dear Jeff and the rest of Steering committee members,

Thank you very much for approving the license terms of the Cilk Library. I couldn't attach the zipped copy of the patch due to its size, so here is a link to the Cilk library patch that can be applied to the trunk: https://docs.google.com/file/d/0BzEpbbnrYKsSWjBWSkNrVS1SaGs/edit?usp=sharing

Is it OK for trunk? Here are the ChangeLog entries:

ChangeLog:
2013-10-09  Balaji V. Iyer

        * Makefile.def: Add libcilkrts to target_modules. Make libcilkrts
        depend on libstdc++ and libgcc.
        * configure.ac: Added libcilkrts to target binaries.
        * configure: Likewise.
        * Makefile.in: Added libcilkrts related fields to support building it.

libcilkrts/ChangeLog:
2013-10-09  Balaji V. Iyer

        * libcilkrts/Makefile.am: New file. Libcilkrts version 3613.
        * libcilkrts/Makefile.in: Likewise.
        * libcilkrts/README: Likewise.
        * libcilkrts/aclocal.m4: Likewise.
        * libcilkrts/configure: Likewise.
        * libcilkrts/configure.ac: Likewise.
        * libcilkrts/include/cilk/cilk.h: Likewise.
        * libcilkrts/include/cilk/cilk_api.h: Likewise.
        * libcilkrts/include/cilk/cilk_api_linux.h: Likewise.
        * libcilkrts/include/cilk/cilk_stub.h: Likewise.
        * libcilkrts/include/cilk/cilk_undocumented.h: Likewise.
        * libcilkrts/include/cilk/common.h: Likewise.
        * libcilkrts/include/cilk/holder.h: Likewise.
        * libcilkrts/include/cilk/hyperobject_base.h: Likewise.
        * libcilkrts/include/cilk/metaprogramming.h: Likewise.
        * libcilkrts/include/cilk/reducer.h: Likewise.
        * libcilkrts/include/cilk/reducer_file.h: Likewise.
        * libcilkrts/include/cilk/reducer_list.h: Likewise.
        * libcilkrts/include/cilk/reducer_max.h: Likewise.
        * libcilkrts/include/cilk/reducer_min.h: Likewise.
        * libcilkrts/include/cilk/reducer_min_max.h: Likewise.
        * libcilkrts/include/cilk/reducer_opadd.h: Likewise.
        * libcilkrts/include/cilk/reducer_opand.h: Likewise.
        * libcilkrts/include/cilk/reducer_opmul.h: Likewise.
        * libcilkrts/include/cilk/reducer_opor.h: Likewise.
        * libcilkrts/include/cilk/reducer_opxor.h: Likewise.
        * libcilkrts/include/cilk/reducer_ostream.h: Likewise.
        * libcilkrts/include/cilk/reducer_string.h: Likewise.
        * libcilkrts/include/cilktools/cilkscreen.h: Likewise.
        * libcilkrts/include/cilktools/cilkview.h: Likewise.
        * libcilkrts/include/cilktools/fake_mutex.h: Likewise.
        * libcilkrts/include/cilktools/lock_guard.h: Likewise.
        * libcilkrts/include/internal/abi.h: Likewise.
        * libcilkrts/include/internal/cilk_fake.h: Likewise.
        * libcilkrts/include/internal/cilk_version.h: Likewise.
        * libcilkrts/include/internal/inspector-abi.h: Likewise.
        * libcilkrts/include/internal/metacall.h: Likewise.
        * libcilkrts/include/internal/rev.mk: Likewise.
        * libcilkrts/mk/cilk-version.mk: Likewise.
        * libcilkrts/mk/unix-common.mk: Likewise.
        * libcilkrts/runtime/acknowledgements.dox: Likewise.
        * libcilkrts/runtime/bug.cpp: Likewise.
        * libcilkrts/runtime/bug.h: Likewise.
        * libcilkrts/runtime/c_reducers.c: Likewise.
        * libcilkrts/runtime/cilk-abi-cilk-for.cpp: Likewise.
        * libcilkrts/runtime/cilk-abi-vla-internal.c: Likewise.
        * libcilkrts/runtime/cilk-abi-vla-internal.h: Likewise.
        * libcilkrts/runtime/cilk-abi-vla.c: Likewise.
        * libcilkrts/runtime/cilk-abi.c: Likewise.
        * libcilkrts/runtime/cilk-ittnotify.h: Likewise.
        * libcilkrts/runtime/cilk-tbb-interop.h: Likewise.
        * libcilkrts/runtime/cilk_api.c: Likewise.
        * libcilkrts/runtime/cilk_fiber-unix.cpp: Likewise.
        * libcilkrts/runtime/cilk_fiber-unix.h: Likewise.
        * libcilkrts/runtime/cilk_fiber.cpp: Likewise.
        * libcilkrts/runtime/cilk_fiber.h: Likewise.
        * libcilkrts/runtime/cilk_malloc.c: Likewise.
        * libcilkrts/runtime/cilk_malloc.h: Likewise.
        * libcilkrts/runtime/component.h: Likewise.
        * libcilkrts/runtime/doxygen-layout.xml: Likewise.
        * libcilkrts/runtime/doxygen.cfg: Likewise.
        * libcilkrts/runtime/except-gcc.cpp: Likewise.
        * libcilkrts/runtime/except-gcc.h: Likewise.
        * libcilkrts/runtime/except.h: Likewise.
        * libcilkrts/runtime/frame_malloc.c: Likewise.
        * libcilkrts/runtime/frame_malloc.h: Likewise.
        * libcilkrts/runtime/full_frame.c: Likewise.
        * libcilkrts/runtime/full_frame.h: Likewise.
        * libcilkrts/runtime/global_state.cpp: Likewise.
        * libcilkrts/runtime/global_state.h: Likewise.
        * libcilkrts/runtime/jmpbuf.c: Likewise.
        * libcilkrts/runtime/jmpbuf.h: Likewise.
        * libcilkrts/runtime/local_state.c: Likewise.
        * libcilkrts/runtime/local_state
Re: wide-int branch timings
On Oct 15, 2013, at 5:41 AM, Richard Biener wrote:
> That said, how do cc1 binary sizes compare branch vs. trunk at
> the last merge point?

$ size /tmp/gcc-*/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1plus
    text    data     bss      dec     hex filename
14224227   33960 1061304 15319491  e9c1c3 /tmp/gcc-1/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1plus
13973978   33952 1061272 15069202  e5f012 /tmp/gcc-base/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1plus
$ size /tmp/gcc-*/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1
    text    data     bss      dec     hex filename
13146268   33864 1038808 14218940  d8f6bc /tmp/gcc-1/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1
12899907   33856 1038776 13972539  d5343b /tmp/gcc-base/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1
$ bc -l
14224227/13973978
1.01790821482615759091
13146268/12899907
1.01909788962044455049

1.8% and 1.9% bigger in text; 8 bytes bigger in data and 32 bytes bigger in bss.
Re: programming language that does not inhibit further optimization by gcc
Here is the way I understood the goal of your long quest (I may be completely mistaken, since I do not quite get what part of the job you want to leave to the language and what part to its compiler):

"Is there a language that allows the developer to add information about the way a particular program will really use the variables it declares, or the functions it calls, so that this information can be exploited by the compiler to optimize the final binary as far as possible?"

We are developing Cawen (please do not hesitate to take a look at http://www.melvenn.com/en/cawen/why-cawen/), a language that includes C99 and produces C99 source code (it can be considered a precompiling tool for C).

--- Variables:

Cawen gives you the possibility to enrich variables with user information. Your sample code:

int age = 21;         //[0, 150) setting maximum limits, compiler could use byte int
int outsideTemp = 20; //[-273, 80]
float ERA = 297;      //[0, 1000, 3] [min, max, digits of accuracy needed]

can be written in Cawen as:

int age < < min = 0, max = 150 > > = 21;
int outsideTemp < < min = -273, max = 80 > > = 20;
float ERA < < min = 0, max = 1000, accur = 3 > > = 297;

The range properties can later be queried in the Cawen code with age = > min, age = > max, and so on. For example:

int repartition [ age = > max + 1 ];

min and max are not Cawen keywords; you can create as many labels as you want:

int age < < min = 0, max = 150, average = 100 > >;

One can also code things like:

@declare(integer, age, max, 150, min, 0);

It is up to the Cawen coder to implement the @declare macro so that an integer between 0 and 150 is declared as an unsigned char in the generated C code:

unsigned char age;

age = > min and age = > max remain available...

This was for user code. As far as giving hints to the compiler is concerned, Cawen has no compiler of its own and relies entirely on the C compiler.
So that range information can only be used at compile time if the C compiler can make use of it through a specific syntax. In this case, Cawen's preprocessor lets you code your own transformation from its own syntax:

int age < < min = 0, max = 150 > > = 21;

to the C target:

int age whatever_compiler_specific_syntax_(0, 150);

Of course, age = > min and age = > max are still available.

--- Functions:

Here is an example of how Cawen's function template mechanism can be used for optimization. This line appends the first 10 elements of a to b:

@govel::append(a, 10, b); // govel is the first of Cawen's standard libraries

The function will first check whether there are enough elements in a. This check is totally unnecessary if the coder knows that a is equal to "a_string_that_is_more_than_10_char_long". Coding

@govel::append{ !src_check }(a, 10, b)

you can tell Cawen (in govel's code) to skip it. Feel free to create and implement hundreds of templating parameters!

@govel::append{
  !src_check
  size_opt_level = 1
  speed_opt_level = 3
  !memcpy
  debug
  comment = " with a lot of care"
  ...
}(a, 10, b)

Regards,
TS & GC

2013/10/15 Albert Abramson:
> I have been looking everywhere online and talking to other coders at
> every opportunity about this, but cannot find a complete answer.
> Different languages have different obstacles to complete optimization.
> Software developers often have to drop down into non-portable
> Assembly because they can't get the performance or small size of
> hand-optimized Assembly for their particular platform.
>
> The C language has the alias issue that limits the hoisting of loads.
> Unless the programmer specifies that two arrays will never overlap
> using the 'restrict' keyword, the compiler may not be able to handle
> operations on arrays efficiently because of the unlikely event that
> the arrays could overlap.
Most/all languages also demand the > appearance of serialization of instructions and memory operations, as > well as extreme correctness in even the most unlikely circumstances, > even where the programmer may not need them. > > Is there a language out there (similar to Fortran or a dialect of C) > that doesn't inhibit the compiler from taking advantage of every > optimization possible? Is there some way to provide a C/C++ compiler > with extra information about variables and programs so that it can > maximize performance or minimize size? For example: > > int age = 21;//[0, 150) setting maximum limits, compiler could use byte > int > int outsideTemp = 20;//[-273, 80] > float ERA = 297; //[0, 1000, 3] [min, max, digits of > accuracy needed] > > Better yet, allow some easier way of spawning multiple threads without > have to learn all of the Boost libraries, OpenCL, or OpenGL. In other > words, is there yet a language that is designed only for performance > that places no limits on compiler optimizations? Is there a language > that allows the compiler to pack struct variables in tighter by > reorganizing those values, etc? > > If not, i