Re: lto-plugin: mismatch between ld's architecture and GCC's configure --host
Hi!

On Mon, 14 Oct 2013 12:15:41 +0200, Richard Biener wrote:
> I suppose nobody thought of this but I wouldn't call it a scenario that
> is desired to support either ;)

Why not support this scenario? Have you seen the patches I posted yesterday? They cause no changes for builds that do not use the new option I added.

Regards, Thomas
[gomp4] Building binaries for offload.
Hello,

Let me summarize the current understanding of host binary linking, as well as target binary building and linking.

We put code that is supposed to be offloaded into dedicated sections, with names starting with gnu.target_lto_.

At link time (I mean, link time of the host app) we:
1. Generate a dedicated data section in each binary (executable or DSO), which will be a placeholder for the offloading stuff.
2. Generate an __OPENMP_TARGET__ (weak, hidden) symbol, which will point to the start of the section mentioned in the previous item.

This section should contain at least:
1. Number of targets
2. Size of the offload symbols table
[ Repeat per `number of targets' ]
2. Name of target
3. Offset to the beginning of the image to offload to that target
4. Size of the image
5. Offload symbols table

The offload symbols table will contain information about the addresses of offloadable symbols, in order to create the host<->target address mapping at runtime. To get the list of target addresses we need a dedicated interface call to the libgomp plugin, something like getTargetAddresses (), which will query the target for the list of addresses (accompanied by symbol names). To provide this information, the target DSO should contain a similar table mapping symbols to addresses.

An application is going to have a single instance of libgomp, which in turn means we'll have a single splay tree holding the mapping information (host -> target) for all DSOs and the executable. When GOMP_target* is called, a pointer to the table of the current execution module is passed to libgomp along with a pointer to the routine (or global). libgomp in turn:
1. Checks in the splay tree whether the address of the given pointer (to the table) exists. If not, then the given table is not yet initialized; libgomp initializes it (see below) and inserts the table's address into the splay tree.
2. Performs a lookup for the (host) address in the table provided and extracts the target address.
3.
After the target address is found, we perform an API call (passing that address) to the given device.

We have at least two approaches to solving the host->target mapping.

I. Preserve the order in which symbols appear. Table row: [ address, size ]; for routines, size is 1. To initialize the table we need two arrays, of host and of target addresses, and the order of appearance of objects in these arrays must be the same. That makes the mapping easy: we just find the index of a given address in the array of host addresses, then dereference the array of target addresses with the index found. The problem is that this is unlikely to work when LTO of the host is on. I am also not sure that the order of handling objects on the target is the same as on the host.

II. Store a symbol identifier along with the address. Table row: [ symbol_name, address, size ]; for routines, size is 1. To construct the table of host addresses, at link time we put the addresses of all symbols (marked at compile time with a dedicated attribute) into the table, accompanied by the symbol names (which serve as keys). During initialization of the table we create the host->target address mapping using the symbol names as keys.

The last thing I wanted to summarize: compiling target code. We have two approaches here:
1. Perform WPA and extract the sections marked as target into a separate object file, then call the target compiler on that object file to produce the binary. As mentioned by Jakub, this approach will complicate debugging.
2. Pass the fat object files directly to the target compiler (one CU at a time). So for every object file we are going to call GCC twice: the host GCC, which will compile all host code of the CU, and the target GCC, which will compile all target code of the CU.

I vote for option #2, since the WPA-based approach complicates debugging. What do you guys think?

-- Thanks, K
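The symbol-name-keyed lookup of approach II can be sketched in plain C. This is a hypothetical illustration only: the struct layout, field names, and the helper below are mine, not the actual gomp4 table ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the offload symbols table (approach II):
   [ symbol_name, address, size ], with size == 1 for routines.  */
struct offload_sym
{
  const char *name;  /* key shared between host and target tables */
  void *addr;        /* address of the symbol in this image */
  size_t size;       /* 1 for routines, object size for data */
};

/* Find the address recorded for a given symbol name, as the table
   initialization step would do for every host-side row when building
   the host->target mapping.  Returns NULL if the name is absent.  */
static void *
lookup_by_name (const struct offload_sym *tab, size_t n, const char *name)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (tab[i].name, name) == 0)
      return tab[i].addr;
  return NULL;
}
```

Unlike approach I, this lookup does not depend on the order in which either compiler emitted the symbols, which is why it should survive host-side LTO reordering.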
Re: programming language that does not inhibit further optimization by gcc
GCC does value analysis similar to what you mention; you'll find it under the -fdump-tree-vrp options. To provide extra information you can add range checks, which GCC will pick up on. If you know a value is small, use a small integer type and GCC will pick up the range of values that can be assigned to it.

What are the problems you're trying to solve? Is it a low-memory system you're running on? If you're after performance, add restrict to your parameters, and either use unions to get around aliasing or do what the Linux dev team does with -fno-strict-aliasing.

Regarding threading: I think trying to use multiple threads without having to learn thread libraries is a bit of a gamble. Threading is difficult even in high-level languages, and you should have a good background before approaching it.

For struct packing, I suppose you could just order your entries largest-first, which is one approach, but it's rather like the 0-1 knapsack problem.

On 15 October 2013 01:31, Albert Abramson wrote:
> I have been looking everywhere online and talking to other coders at
> every opportunity about this, but cannot find a complete answer.
> Different languages have different obstacles to complete optimization.
> Software developers often have to drop down into non-portable
> Assembly because they can't get the performance or small size of
> hand-optimized Assembly for their particular platform.
>
> The C language has the alias issue that limits the hoisting of loads.
> Unless the programmer specifies that two arrays will never overlap
> using the 'restrict' keyword, the compiler may not be able to handle
> operations on arrays efficiently because of the unlikely event that
> the arrays could overlap. Most/all languages also demand the
> appearance of serialization of instructions and memory operations, as
> well as extreme correctness in even the most unlikely circumstances,
> even where the programmer may not need them.
> > Is there a language out there (similar to Fortran or a dialect of C) > that doesn't inhibit the compiler from taking advantage of every > optimization possible? Is there some way to provide a C/C++ compiler > with extra information about variables and programs so that it can > maximize performance or minimize size? For example: > > int age = 21;//[0, 150) setting maximum limits, compiler could use byte > int > int outsideTemp = 20;//[-273, 80] > float ERA = 297; //[0, 1000, 3] [min, max, digits of > accuracy needed] > > Better yet, allow some easier way of spawning multiple threads without > have to learn all of the Boost libraries, OpenCL, or OpenGL. In other > words, is there yet a language that is designed only for performance > that places no limits on compiler optimizations? Is there a language > that allows the compiler to pack struct variables in tighter by > reorganizing those values, etc? > > If not, is it possible to put together some dialect of C/C++ that > replaces Assembly outright? > > -- > Max Abramson > “In the end, more than freedom, they wanted security. They wanted a > comfortable life, and they lost it all – security, comfort, and > freedom. When the Athenians finally wanted not to give to society but > for society to give to them, when the freedom they wished for most was > freedom from responsibility, then Athens ceased to be free and was > never free again.” --Sir Edward Gibbon
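The range annotations Albert asks for can be partially approximated in plain C today, along the lines of the advice above: pick a small unsigned type and add an explicit range check, and GCC's value-range propagation (visible with -fdump-tree-vrp) can exploit both. A minimal sketch (the function is hypothetical, invented for illustration):

```c
#include <assert.h>

/* With `age' declared unsigned char, VRP already knows it lies in
   [0, 255]; the explicit check narrows it to [0, 149] on the
   fall-through path, so the compiler can prove `age / 200' is
   always 0 there and fold the division away.  */
static int
classify_age (unsigned char age)
{
  if (age >= 150)
    return -1;          /* outside the documented [0, 150) range */
  return age / 200;     /* provably 0 once VRP has the range */
}
```

This gives the optimizer the same [min, max] facts as Albert's proposed comment syntax, just expressed through types and control flow.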
Re: wide-int branch timings
On Tue, Oct 15, 2013 at 1:12 AM, Mike Stump wrote: > So, here is a comparison of the time required to do a make -j15 of a > --disable-bootstrap --enable-checking=none --enable-languages=c,c++ style > compiler. The base compiler is a --enable-checking=none > --enable-languages=c,c++,lto style compiler, which is > 1b2bf75690af8115739ebba710a44d05388c7a1a (aka trunk@202797) from git. The > wide branch compiler is 4529820913813b810860784382f975ea8e6be61d (aka > wide-int@203462) from git. The software compiled in both cases is the base > compiler described above. > > Net result, around 2.6% regression in user time, and 0.4% in elapsed time. > The raw data is below, just in case one is interested. This is on Ubuntu > 12.04.3 system with 12GB ram with 8 cores. Btw, more interesting are testcases that put a heavy load on the alias machinery, like (many) (nested) loops with a lot of memory references. Like the testcase in PR39326. If you profile that you will see some of the double_int routines high in the profile which means on the branch wide_int routines should start to show up. I didn't expect visible differences for a bootstrap, but you proved me wrong :( Btw, with parallel make a single file getting a lot slower can be masked by parallelism completely, so I take timings with -j with a grain of salt. Thanks, Richard. 
> wide branch: > > 1760.94user 145.78system 5:06.23elapsed 622%CPU (0avgtext+0avgdata > 2317824maxresident)k > 32976inputs+5713232outputs (1487major+72639003minor)pagefaults 0swaps > 1758.53user 145.40system 5:06.66elapsed 620%CPU (0avgtext+0avgdata > 2317808maxresident)k > 1104inputs+5713240outputs (9major+72644909minor)pagefaults 0swaps > 1751.91user 145.77system 5:05.27elapsed 621%CPU (0avgtext+0avgdata > 2317808maxresident)k > 0inputs+5713232outputs (0major+72652872minor)pagefaults 0swaps > 1751.29user 145.78system 5:06.15elapsed 619%CPU (0avgtext+0avgdata > 2317808maxresident)k > 8inputs+5713256outputs (0major+72647952minor)pagefaults 0swaps > 1755.10user 145.26system 5:02.74elapsed 627%CPU (0avgtext+0avgdata > 2317808maxresident)k > 96inputs+5713264outputs (1major+72642787minor)pagefaults 0swaps > > base: > > 1708.71user 145.02system 5:04.98elapsed 607%CPU (0avgtext+0avgdata > 2317824maxresident)k > 0inputs+5713448outputs (0major+72602789minor)pagefaults 0swaps > 1707.43user 145.56system 5:05.24elapsed 607%CPU (0avgtext+0avgdata > 2317808maxresident)k > 0inputs+5713424outputs (0major+72606028minor)pagefaults 0swaps > 1711.61user 145.53system 5:03.49elapsed 611%CPU (0avgtext+0avgdata > 2317808maxresident)k > 160inputs+5713424outputs (6major+72614090minor)pagefaults 0swaps > 1712.64user 145.25system 5:02.98elapsed 613%CPU (0avgtext+0avgdata > 2317808maxresident)k > 0inputs+5713432outputs (0major+72599974minor)pagefaults 0swaps > 1708.81user 144.66system 5:01.61elapsed 614%CPU (0avgtext+0avgdata > 2317808maxresident)k > 24inputs+5713448outputs (0major+72599501minor)pagefaults 0swaps
Re: wide-int branch timings
On Tue, Oct 15, 2013 at 2:10 PM, Richard Biener wrote: > On Tue, Oct 15, 2013 at 1:12 AM, Mike Stump wrote: >> So, here is a comparison of the time required to do a make -j15 of a >> --disable-bootstrap --enable-checking=none --enable-languages=c,c++ style >> compiler. The base compiler is a --enable-checking=none >> --enable-languages=c,c++,lto style compiler, which is >> 1b2bf75690af8115739ebba710a44d05388c7a1a (aka trunk@202797) from git. The >> wide branch compiler is 4529820913813b810860784382f975ea8e6be61d (aka >> wide-int@203462) from git. The software compiled in both cases is the base >> compiler described above. >> >> Net result, around 2.6% regression in user time, and 0.4% in elapsed time. >> The raw data is below, just in case one is interested. This is on Ubuntu >> 12.04.3 system with 12GB ram with 8 cores. > > Btw, more interesting are testcases that put a heavy load on the alias > machinery, like (many) (nested) loops with a lot of memory references. > Like the testcase in PR39326. If you profile that you will see some > of the double_int routines high in the profile which means on the > branch wide_int routines should start to show up. > > I didn't expect visible differences for a bootstrap, but you proved me > wrong :( Btw, with parallel make a single file getting a lot slower can > be masked by parallelism completely, so I take timings with -j > with a grain of salt. 
For example for get_ref_base_and_extent the adds to bit_offset (even though initially of addr_wide_int kind) end up unoptimized, exposing if (len_822 > 2) goto ; else goto ; : xprecision_819 = (unsigned int) D.54901_818; if (xprecision_819 > 127) goto ; else goto ; : D.54899_838 = D.54922_816->base.u.bits.unsigned_flag; D.54900_839 = (signop) D.54899_838; len_840 = wi::force_to_size (&MEM[(struct wide_int_ref_storage *)&yi].scratch, val_823, len_822, xprecision_819, 128, D.54900_839); : # val_1543 = PHI # len_1542 = PHI <2(93), len_840(95), len_822(94)> MEM[(struct generic_wide_int *)&yi].val = val_1543; MEM[(struct generic_wide_int *)&yi].len = len_1542; MEM[(struct generic_wide_int *)&yi].precision = 128; D.54871_813 = wi::add_large (&MEM[(struct fixed_wide_int_storage *)&D.54875].D.43191.val, &MEM[(const struct fixed_wide_int_storage *)&bit_offset].val, D.54872_808, val_1543, len_1542, 128, 1, 0B); MEM[(unsigned int *)&D.54875 + 24B] = D.54871_813; __builtin_memcpy (&bit_offset, &D.54875, 28); goto (); one issue you can clearly see is that too much of the temporaries (like here the wide_int_ref yi that is created for the tree) ends up being addressable. That's because its data is embedded and passed to add_large (instead of what you'd say is "ref" storage, refering to storage elsewhere). Which is because of the canonicalization mismatch between tree, wide-int and RTX I guess. Not sure where the memcpy comes from in the above code - seems that bit_offset += TREE_OPERAND (exp, 2); builds a temporary bit_offset + TREE_OPERAND (exp, 2) that is then copied to bit_offset and this copy cannot be elided. That said, how do cc1 binary sizes compare branch vs. trunk at the last merge point? Richard.
Re: wide-int branch timings
On Tue, Oct 15, 2013 at 2:41 PM, Richard Biener wrote: > On Tue, Oct 15, 2013 at 2:10 PM, Richard Biener > wrote: >> On Tue, Oct 15, 2013 at 1:12 AM, Mike Stump wrote: >>> So, here is a comparison of the time required to do a make -j15 of a >>> --disable-bootstrap --enable-checking=none --enable-languages=c,c++ style >>> compiler. The base compiler is a --enable-checking=none >>> --enable-languages=c,c++,lto style compiler, which is >>> 1b2bf75690af8115739ebba710a44d05388c7a1a (aka trunk@202797) from git. The >>> wide branch compiler is 4529820913813b810860784382f975ea8e6be61d (aka >>> wide-int@203462) from git. The software compiled in both cases is the base >>> compiler described above. >>> >>> Net result, around 2.6% regression in user time, and 0.4% in elapsed time. >>> The raw data is below, just in case one is interested. This is on Ubuntu >>> 12.04.3 system with 12GB ram with 8 cores. >> >> Btw, more interesting are testcases that put a heavy load on the alias >> machinery, like (many) (nested) loops with a lot of memory references. >> Like the testcase in PR39326. If you profile that you will see some >> of the double_int routines high in the profile which means on the >> branch wide_int routines should start to show up. >> >> I didn't expect visible differences for a bootstrap, but you proved me >> wrong :( Btw, with parallel make a single file getting a lot slower can >> be masked by parallelism completely, so I take timings with -j >> with a grain of salt. 
> > For example for get_ref_base_and_extent the adds to bit_offset > (even though initially of addr_wide_int kind) end up unoptimized, > exposing > > if (len_822 > 2) > goto ; > else > goto ; > > : > xprecision_819 = (unsigned int) D.54901_818; > if (xprecision_819 > 127) > goto ; > else > goto ; > > : > D.54899_838 = D.54922_816->base.u.bits.unsigned_flag; > D.54900_839 = (signop) D.54899_838; > len_840 = wi::force_to_size (&MEM[(struct wide_int_ref_storage > *)&yi].scratch, val_823, len_822, xprecision_819, 128, D.54900_839); > > : > # val_1543 = PHI *)&yi].scratch(95), val_823(94)> > # len_1542 = PHI <2(93), len_840(95), len_822(94)> > MEM[(struct generic_wide_int *)&yi].val = val_1543; > MEM[(struct generic_wide_int *)&yi].len = len_1542; > MEM[(struct generic_wide_int *)&yi].precision = 128; > D.54871_813 = wi::add_large (&MEM[(struct fixed_wide_int_storage > *)&D.54875].D.43191.val, &MEM[(const struct fixed_wide_int_storage > *)&bit_offset].val, D.54872_808, val_1543, len_1542, 128, 1, 0B); > MEM[(unsigned int *)&D.54875 + 24B] = D.54871_813; > __builtin_memcpy (&bit_offset, &D.54875, 28); > goto (); That was built with host G++ 4.6, with trunk you see it more obvious: : # SR.574_214 = PHI <_507(69), &MEM[(struct wide_int_ref_storage *)&yi].scratch(70), _507(68)> # SR.575_810 = PHI MEM[(struct generic_wide_int *)&yi] = SR.574_214; MEM[(struct generic_wide_int *)&yi + 8B] = SR.575_810; MEM[(struct generic_wide_int *)&yi + 12B] = 128; _468 = wi::add_large (&MEM[(struct fixed_wide_int_storage *)&D.52085].val, &MEM[(const struct fixed_wide_int_storage *)&bit_offset].val, _463, SR.574_214, SR.575_810, 128, 1, 0B); MEM[(unsigned int *)&D.52085 + 24B] = _468; yi ={v} {CLOBBER}; MEM[(struct generic_wide_int *)&bit_offset] = MEM[(struct generic_wide_int *)&D.52085]; D.52085 ={v} {CLOBBER}; goto (); even though yi dies after the call to wi::add_large we cannot remove the pointless initializations of its members as its address escapes. Richard.
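The address-escape problem Richard describes can be reproduced in miniature outside GCC: once a pointer into a local aggregate is passed to an out-of-line function, the aggregate must stay addressable and its member stores cannot be deleted, even though it dies immediately afterwards. A purely illustrative stand-alone example (the names only mimic the wide-int dump; none of this is GCC code):

```c
#include <assert.h>

/* Out-of-line callee: taking a raw pointer is what forces the
   caller's temporary to be addressable, just as passing &yi's
   embedded storage to wi::add_large does in the dump above.  */
__attribute__ ((noinline)) static long
sum_large (const long *p, unsigned len)
{
  long s = 0;
  for (unsigned i = 0; i < len; i++)
    s += p[i];
  return s;
}

long
add_offsets (long a, long b)
{
  struct { long val[2]; unsigned len; } yi;  /* plays the role of `yi' */
  yi.val[0] = a;
  yi.val[1] = b;
  yi.len = 2;
  /* yi.val escapes into sum_large, so the compiler must keep yi in
     memory and cannot scalarize it or remove the stores above, even
     though yi is dead right after this call.  */
  return sum_large (yi.val, yi.len);
}
```

A "ref"-style storage scheme, where the wide-int reference points at data owned elsewhere instead of embedding it, would avoid creating such an escaping temporary in the first place.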
Compilation flags in libgfortran
Hi all!

Is there any particular reason that the matmul* modules from libgfortran are compiled with -O2 -ftree-vectorize? I see some regressions on the Atom processor after r202980 (http://gcc.gnu.org/ml/gcc-cvs/2013-09/msg00846.html). Why not just use -O3 for those modules?

Thanks, Igor
Re: programming language that does not inhibit further optimization by gcc
On Mon, Oct 14, 2013 at 5:31 PM, Albert Abramson wrote: > > Is there a language out there (similar to Fortran or a dialect of C) > that doesn't inhibit the compiler from taking advantage of every > optimization possible? Sure: Fortran. > Is there some way to provide a C/C++ compiler > with extra information about variables and programs so that it can > maximize performance or minimize size? For example: > > int age = 21;//[0, 150) setting maximum limits, compiler could use byte > int > int outsideTemp = 20;//[-273, 80] > float ERA = 297; //[0, 1000, 3] [min, max, digits of > accuracy needed] Hmmm, OK, that kind of thing is available in PL/1 and, I think, in Ada. But as far as I know it doesn't help compilers very much in practice. Ian
Re: function attributes
Hi Ian,

Thanks for the reply.

On Fri, Oct 11, 2013 at 10:31 PM, Ian Lance Taylor wrote:
> On Fri, Oct 11, 2013 at 9:20 AM, Nagaraju Mekala wrote:
>>
>> I observed that in the rs6000 port longcall is implemented by using
>> the CALL_LONG define.
>> #define CALL_LONG 0x0008 /* always call indirect */
>> In the md file they check the operand against CALL_LONG:
>> if (INTVAL (operands[3]) & CALL_LONG)
>>   operands[1] = rs6000_longcall_ref (operands[1]);
>> In my port I don't have such a thing to compare against. Can we somehow
>> parse the tree chain and check the attributes of the functions?
>
> Look at init_cumulative_args in rs6000.c to see how CALL_LONG is set
> based on the function attribute.

I was able to get the function attribute in the init_cumulative_args function; I used the fndecl tree to get the attribute details. But I have failed to stop generating the br instruction; it should print a bk instruction instead. I was unable to relate the super attribute seen in init_cumulative_args to the branch pattern in the md file so that it generates the bk instruction. I have initialized a global variable to 1 when super is detected and check it in my pattern. My branch pattern looks like this:

(define_insn "call_int1"
  [(call (mem (match_operand:SI 0 "call_insn_simple_operand" "ri"))
         (match_operand:SI 1 "" "i"))
   (clobber (reg:SI R_RS))]
  ""
{
  register rtx t = operands[0];
  register rtx t2 = gen_rtx_REG (Pmode, GP_REG_FIRST + RETURN_ADDR_REGNUM);
  if (GET_CODE (t) == SYMBOL_REF) {
      if (super_var ())   /* ---> Here I check the global variable */
        {
          return "bk\tr1,8\;%#";
        }
      else {
          gen_rtx_CLOBBER (VOIDmode, t2);
          return "br\tr1,%0\;%#";

I observed that init_cumulative_args is called first for all the functions; only once they are all done is the pattern above used for all the instructions, so my global variable is not useful. Can you help me emit the bk instruction from the pattern exactly when a super function is called?

> Ian

Thanks, Nagaraju
Re: function attributes
On Tue, Oct 15, 2013 at 8:04 AM, Nagaraju Mekala wrote: > Hi Ian, > > Thanks for the reply. > > On Fri, Oct 11, 2013 at 10:31 PM, Ian Lance Taylor wrote: >> On Fri, Oct 11, 2013 at 9:20 AM, Nagaraju Mekala >> wrote: >>> >>> I observed that in rs6000 port longcall is implemented by using >>> CALL_LONG define. >>> #define CALL_LONG 0x0008 /* always call indirect */ >>> In the md file they are checking the operand with CALL_LONG >>> if (INTVAL (operands[3]) & CALL_LONG) >>> operands[1] = rs6000_longcall_ref (operands[1]); >>> In my port I dont have suchthing to compare. Can we somehow parse the >>> tree chain and check the attributes of the functions.. >> >> Look at init_cumulative_args in rs6000.c to see how CALL_LONG is set >> based on the function attribute. > > I was able to get the function attribute from the init_cumulative_args > function. I have used the fndecl tree to get the attribute details > but I have failed to stop generating br instruction. It should print > bk instruction. > I was unable to relate the super attribute from init_cumulative_args > to the branch pattern in md file to generate bk instruction. > I have intialized a global variable to 1 if super is detected and > checking the same in my pattern. > My branch pattern looks like below > (define_insn "call_int1" > [(call (mem (match_operand:SI 0 "call_insn_simple_operand" "ri")) > (match_operand:SI 1 "" "i")) > (clobber (reg:SI R_RS))] > "" > { > register rtx t = operands[0]; > register rtx t2 = gen_rtx_REG (Pmode, > GP_REG_FIRST + RETURN_ADDR_REGNUM); > if (GET_CODE (t) == SYMBOL_REF) { > if(super_var()) ---> Here I am > checking for global variable > { > return "bk\tr1,8\;%#"; > } > else { > gen_rtx_CLOBBER (VOIDmode, t2); > return "br\tr1,%0\;%#"; > > I observed that init_cumulative_args is called first for all the > functions once they are done then the above pattern for all the > instructions are called so my global variable is not useful. 
> > Can you help me how to exactly emit bk instruction from the pattern > when super function is called. Again I just have to say: look at the rs6000 port. Look at the rs6000 call instruction. Look at how it decides whether to do a longcall or not. Ian
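The rs6000 scheme Ian points at can be paraphrased as: compute a per-call "cookie" from the callee's attributes at expand time and carry it as an extra operand of the call pattern, so the output routine never consults a global. A hedged stand-alone simulation of that flow (CALL_SUPER and both helpers are invented for illustration; in real GCC the attribute check would use lookup_attribute on the fndecl, and the cookie would travel as a CONST_INT operand):

```c
#include <assert.h>
#include <string.h>

#define CALL_SUPER 0x1   /* analogous to rs6000's CALL_LONG bit */

/* Expand-time step: derive the cookie from the callee's attribute
   list.  Stand-in for a lookup_attribute check on the fndecl.  */
static int
compute_call_cookie (const char *attr_list)
{
  return (attr_list && strstr (attr_list, "super")) ? CALL_SUPER : 0;
}

/* Output-time step: the "pattern" tests the cookie carried by this
   particular call, so per-callee information is never lost the way
   a single global flag loses it once expansion has moved on.  */
static const char *
output_call (int cookie)
{
  return (cookie & CALL_SUPER) ? "bk\tr1,8" : "br\tr1,%0";
}
```

The key point is that the decision is made once per call site while the fndecl is still in hand, and the result rides along inside the insn rather than in pass-global state.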
Re: Compilation flags in libgfortran
On 10/15/2013 03:58 PM, Igor Zamyatin wrote: Hi All! Is there any particular reason that matmul* modules from libgfortran are compiled with -O2 -ftree-vectorize? I see some regressions on Atom processor after r202980 (http://gcc.gnu.org/ml/gcc-cvs/2013-09/msg00846.html) Why not just use O3 for those modules? Igor, It helps (:-) to send questions about gfortran and its run time library libgfortran cc'd to fort...@gcc.gnu.org, because not every GNU Fortran maintainer reads gcc@gcc.gnu.org Kind regards, -- Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
Re: Compilation flags in libgfortran
On Tue, Oct 15, 2013 at 4:58 PM, Igor Zamyatin wrote: > Hi All! > > Is there any particular reason that matmul* modules from libgfortran > are compiled with -O2 -ftree-vectorize? Yes, testing showed that it improved performance compared to the default options. See the thread starting at http://gcc.gnu.org/ml/fortran/2005-11/msg00366.html In the almost 8 years (!!) since the patch was merged, I believe the importance of vectorization for utilizing current processors has only increased. [snip] > Why not just use O3 for those modules? Back when the change was made, -ftree-vectorize wasn't enabled by -O3. IIRC I did some tests, and -O3 didn't really improve things beyond what "-O2 -funroll-loops -ftree-vectorize" already did. That was a while ago however, so if somebody (*wink*) would care to redo the benchmarks things might look different with today's GCC on today's hardware. Hope this helps, -- Janne Blomqvist
Re: Cilk Library
On 10/09/13 12:32, Iyer, Balaji V wrote:

Dear Jeff and the rest of Steering committee members,

Thank you very much for approving the license terms of the Cilk Library. I couldn't attach the zipped copy of the patch due to its size, so here is a link to the Cilk library patch that can be applied to the trunk: https://docs.google.com/file/d/0BzEpbbnrYKsSWjBWSkNrVS1SaGs/edit?usp=sharing

Is it OK for trunk? Here are the ChangeLog entries:

ChangeLog:
2013-10-09  Balaji V. Iyer

        * Makefile.def: Add libcilkrts to target_modules. Make libcilkrts
        depend on libstdc++ and libgcc.
        * configure.ac: Added libcilkrts to target binaries.
        * configure: Likewise.
        * Makefile.in: Added libcilkrts related fields to support building it.

libcilkrts/ChangeLog:
2013-10-09  Balaji V. Iyer

        * libcilkrts/Makefile.am: New file. Libcilkrts version 3613.
        * libcilkrts/Makefile.in: Likewise.
        * libcilkrts/README: Likewise.
        * libcilkrts/aclocal.m4: Likewise.
        * libcilkrts/configure: Likewise.
        * libcilkrts/configure.ac: Likewise.
        * libcilkrts/include/cilk/cilk.h: Likewise.
        * libcilkrts/include/cilk/cilk_api.h: Likewise.
        * libcilkrts/include/cilk/cilk_api_linux.h: Likewise.
        * libcilkrts/include/cilk/cilk_stub.h: Likewise.
        * libcilkrts/include/cilk/cilk_undocumented.h: Likewise.
        * libcilkrts/include/cilk/common.h: Likewise.
        * libcilkrts/include/cilk/holder.h: Likewise.
        * libcilkrts/include/cilk/hyperobject_base.h: Likewise.
        * libcilkrts/include/cilk/metaprogramming.h: Likewise.
        * libcilkrts/include/cilk/reducer.h: Likewise.
        * libcilkrts/include/cilk/reducer_file.h: Likewise.
        * libcilkrts/include/cilk/reducer_list.h: Likewise.
        * libcilkrts/include/cilk/reducer_max.h: Likewise.
        * libcilkrts/include/cilk/reducer_min.h: Likewise.
        * libcilkrts/include/cilk/reducer_min_max.h: Likewise.
        * libcilkrts/include/cilk/reducer_opadd.h: Likewise.
        * libcilkrts/include/cilk/reducer_opand.h: Likewise.
        * libcilkrts/include/cilk/reducer_opmul.h: Likewise.
        * libcilkrts/include/cilk/reducer_opor.h: Likewise.
        * libcilkrts/include/cilk/reducer_opxor.h: Likewise.
        * libcilkrts/include/cilk/reducer_ostream.h: Likewise.
        * libcilkrts/include/cilk/reducer_string.h: Likewise.
        * libcilkrts/include/cilktools/cilkscreen.h: Likewise.
        * libcilkrts/include/cilktools/cilkview.h: Likewise.
        * libcilkrts/include/cilktools/fake_mutex.h: Likewise.
        * libcilkrts/include/cilktools/lock_guard.h: Likewise.
        * libcilkrts/include/internal/abi.h: Likewise.
        * libcilkrts/include/internal/cilk_fake.h: Likewise.
        * libcilkrts/include/internal/cilk_version.h: Likewise.
        * libcilkrts/include/internal/inspector-abi.h: Likewise.
        * libcilkrts/include/internal/metacall.h: Likewise.
        * libcilkrts/include/internal/rev.mk: Likewise.
        * libcilkrts/mk/cilk-version.mk: Likewise.
        * libcilkrts/mk/unix-common.mk: Likewise.
        * libcilkrts/runtime/acknowledgements.dox: Likewise.
        * libcilkrts/runtime/bug.cpp: Likewise.
        * libcilkrts/runtime/bug.h: Likewise.
        * libcilkrts/runtime/c_reducers.c: Likewise.
        * libcilkrts/runtime/cilk-abi-cilk-for.cpp: Likewise.
        * libcilkrts/runtime/cilk-abi-vla-internal.c: Likewise.
        * libcilkrts/runtime/cilk-abi-vla-internal.h: Likewise.
        * libcilkrts/runtime/cilk-abi-vla.c: Likewise.
        * libcilkrts/runtime/cilk-abi.c: Likewise.
        * libcilkrts/runtime/cilk-ittnotify.h: Likewise.
        * libcilkrts/runtime/cilk-tbb-interop.h: Likewise.
        * libcilkrts/runtime/cilk_api.c: Likewise.
        * libcilkrts/runtime/cilk_fiber-unix.cpp: Likewise.
        * libcilkrts/runtime/cilk_fiber-unix.h: Likewise.
        * libcilkrts/runtime/cilk_fiber.cpp: Likewise.
        * libcilkrts/runtime/cilk_fiber.h: Likewise.
        * libcilkrts/runtime/cilk_malloc.c: Likewise.
        * libcilkrts/runtime/cilk_malloc.h: Likewise.
        * libcilkrts/runtime/component.h: Likewise.
        * libcilkrts/runtime/doxygen-layout.xml: Likewise.
        * libcilkrts/runtime/doxygen.cfg: Likewise.
        * libcilkrts/runtime/except-gcc.cpp: Likewise.
        * libcilkrts/runtime/except-gcc.h: Likewise.
        * libcilkrts/runtime/except.h: Likewise.
        * libcilkrts/runtime/frame_malloc.c: Likewise.
        * libcilkrts/runtime/frame_malloc.h: Likewise.
        * libcilkrts/runtime/full_frame.c: Likewise.
        * libcilkrts/runtime/full_frame.h: Likewise.
        * libcilkrts/runtime/global_state.cpp: Likewise.
        * libcilkrts/runtime/global_state.h: Likewise.
        * libcilkrts/runtime/jmpbuf.c: Likewise.
        * libcilkrts/runtime/jmpbuf.h: Likewise.
        * libcilkrts/runtime/local_state.c: Likewise.
        * libcilkrts/runtime/local_state
Re: wide-int branch timings
On Oct 15, 2013, at 5:41 AM, Richard Biener wrote:
> That said, how do cc1 binary sizes compare branch vs. trunk at
> the last merge point?

$ size /tmp/gcc-*/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1plus
    text    data     bss      dec     hex filename
14224227   33960 1061304 15319491  e9c1c3 /tmp/gcc-1/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1plus
13973978   33952 1061272 15069202  e5f012 /tmp/gcc-base/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1plus
$ size /tmp/gcc-*/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1
    text    data     bss      dec     hex filename
13146268   33864 1038808 14218940  d8f6bc /tmp/gcc-1/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1
12899907   33856 1038776 13972539  d5343b /tmp/gcc-base/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/cc1
$ bc -l
14224227/13973978
1.01790821482615759091
13146268/12899907
1.01909788962044455049

1.8% and 1.9% bigger in text; 8 bytes bigger in data and 32 bytes bigger in bss.
Re: programming language that does not inhibit further optimization by gcc
Here is the way I understood the goal of your long quest (I may be completely mistaken, since I do not quite get what part of the job you want to leave to the language and what part to its compiler):

"Is there a language that allows the developer to add information about the way a particular program will really use the variables it declares, or the functions it calls, so that this information can be exploited by the compiler to optimize the final binary as far as possible?"

We are developing Cawen (please do not hesitate to take a look at http://www.melvenn.com/en/cawen/why-cawen/), a language that includes C99 and produces C99 source code (it can be considered a precompiling tool for C).

--- Variables:

Cawen gives you the possibility to enrich variables with user information. Your sample code:

int age = 21;         //[0, 150) setting maximum limits, compiler could use byte int
int outsideTemp = 20; //[-273, 80]
float ERA = 297;      //[0, 1000, 3] [min, max, digits of accuracy needed]

can be written in Cawen as:

int age < < min = 0, max = 150 > > = 21;
int outsideTemp < < min = -273, max = 80 > > = 20;
float ERA < < min = 0, max = 1000, accur = 3 > > = 297;

The range properties can later be queried in the Cawen code with age = > min, age = > max, and so on. For example:

int repartition [ age = > max + 1 ];

min and max are not Cawen keywords; you can create as many labels as you want:

int age < < min = 0, max = 150, average = 100 > >;

One can also code things like:

@declare(integer, age, max, 150, min, 0);

It is up to the Cawen coder to implement the @declare macro so that an integer between 0 and 150 is declared as an unsigned char in the generated C code:

unsigned char age;

age = > min and age = > max remain available...

This was for user code. As far as giving hints to the compiler is concerned, Cawen has no compiler of its own and relies entirely on the C compiler.
So that range information can only be used at compile time if the C compiler can make use of it through a specific syntax. In this case, Cawen's preprocessor lets you code your own transformation from its own syntax:

int age < < min = 0, max = 150 > > = 21;

to the C target:

int age whatever_compiler_specific_syntax_(0, 150);

Of course, age = > min and age = > max are still available.

--- Functions:

Here is an example of how Cawen's function template mechanism can be used for optimization. This line appends the first 10 elements of a to b:

@govel::append(a, 10, b); // govel is the first of Cawen's standard libraries

The function will first check whether there are enough elements in a. This check is totally unnecessary if the coder knows that a is equal to "a_string_that_is_more_than_10_char_long". Coding

@govel::append{ !src_check }(a, 10, b)

you can tell Cawen (in govel's code) to skip it. Feel free to create and implement hundreds of templating parameters!

@govel::append{
  !src_check
  size_opt_level = 1
  speed_opt_level = 3
  !memcpy
  debug
  comment = " with a lot of care"
  ...
}(a, 10, b)

Regards,
TS & GC

2013/10/15 Albert Abramson:
> I have been looking everywhere online and talking to other coders at
> every opportunity about this, but cannot find a complete answer.
> Different languages have different obstacles to complete optimization.
> Software developers often have to drop down into non-portable
> Assembly because they can't get the performance or small size of
> hand-optimized Assembly for their particular platform.
>
> The C language has the alias issue that limits the hoisting of loads.
> Unless the programmer specifies that two arrays will never overlap
> using the 'restrict' keyword, the compiler may not be able to handle
> operations on arrays efficiently because of the unlikely event that
> the arrays could overlap.
Most/all languages also demand the > appearance of serialization of instructions and memory operations, as > well as extreme correctness in even the most unlikely circumstances, > even where the programmer may not need them. > > Is there a language out there (similar to Fortran or a dialect of C) > that doesn't inhibit the compiler from taking advantage of every > optimization possible? Is there some way to provide a C/C++ compiler > with extra information about variables and programs so that it can > maximize performance or minimize size? For example: > > int age = 21;//[0, 150) setting maximum limits, compiler could use byte > int > int outsideTemp = 20;//[-273, 80] > float ERA = 297; //[0, 1000, 3] [min, max, digits of > accuracy needed] > > Better yet, allow some easier way of spawning multiple threads without > have to learn all of the Boost libraries, OpenCL, or OpenGL. In other > words, is there yet a language that is designed only for performance > that places no limits on compiler optimizations? Is there a language > that allows the compiler to pack struct variables in tighter by > reorganizing those values, etc? > > If not, i