Re: Build failure in dwarf2out
Paul Thomas wrote: I am being hit by this:

  rf2out.c -o dwarf2out.o
  ../../trunk/gcc/dwarf2out.c: In function `file_name_acquire':
  ../../trunk/gcc/dwarf2out.c:7672: error: `files' undeclared (first use in this function)
  ../../trunk/gcc/dwarf2out.c:7672: error: (Each undeclared identifier is reported only once
  ../../trunk/gcc/dwarf2out.c:7672: error: for each function it appears in.)
  ../../trunk/gcc/dwarf2out.c:7672: error: `i' undeclared (first use in this function)

My guess is that the #define activating that region of code is erroneously triggered. I am running the 2-day (on cygwin with a substandard BIOS) testsuite now.
Re: Call to arms: testsuite failures on various targets
FX Coudert wrote: Hi all, I reviewed this afternoon the postings from the gcc-testresults mailing list for the past month, and we have a couple of gfortran testsuite failures showing up on various targets. Could people with access to said targets (possibly maintainers) please file PRs in bugzilla for each testcase, reporting the error message and/or backtrace? (I'd be happy to be added to the Cc list of these.)

* ia64-suse-linux-gnu: gfortran.dg/vect/vect-4.f90
  FAIL: gfortran.dg/vect/vect-4.f90 -O scan-tree-dump-times Alignment of access forced using peeling 1
  FAIL: gfortran.dg/vect/vect-4.f90 -O scan-tree-dump-times Vectorizing an unaligned access 1

This happens on all reported ia64 targets, including mine. What is expected here? There is no vectorization on ia64, no reason for peeling. The compilation has no problem, and there is no report generated. As far as I know, the vectorization options are ignored. Without unrolling, of course, gfortran doesn't optimize the loop at all, but I assume that's a different question.
Re: Call to arms: testsuite failures on various targets
Dorit Nuzman wrote: FX Coudert wrote: Hi all, I reviewed this afternoon the postings from the gcc-testresults mailing list for the past month, and we have a couple of gfortran testsuite failures showing up on various targets. Could people with access to said targets (possibly maintainers) please file PRs in bugzilla for each testcase, reporting the error message and/or backtrace? (I'd be happy to be added to the Cc list of these.)

* ia64-suse-linux-gnu: gfortran.dg/vect/vect-4.f90
  FAIL: gfortran.dg/vect/vect-4.f90 -O scan-tree-dump-times Alignment of access forced using peeling 1
  FAIL: gfortran.dg/vect/vect-4.f90 -O scan-tree-dump-times Vectorizing an unaligned access 1

These tests should xfail on "vect_no_align" targets. On targets that support misaligned accesses we use peeling to align two datarefs, and generate a misaligned memory access for a third dataref. But on targets that do not support misaligned accesses I expect we just use versioning with a runtime alignment test. Does the following pass for you (I just added "{ xfail vect_no_align }" to the two failing tests)?

  Index: vect-4.f90
  ===
  --- vect-4.f90  (revision 123409)
  +++ vect-4.f90  (working copy)
  @@ -10,7 +10,7 @@
   END
   ! { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } }
  -! { dg-final { scan-tree-dump-times "Alignment of access forced using peeling" 1 "vect" } }
  -! { dg-final { scan-tree-dump-times "Vectorizing an unaligned access" 1 "vect" } }
  +! { dg-final { scan-tree-dump-times "Alignment of access forced using peeling" 1 "vect" { xfail vect_no_align } } }
  +! { dg-final { scan-tree-dump-times "Vectorizing an unaligned access" 1 "vect" { xfail vect_no_align } } }
   ! { dg-final { scan-tree-dump-times "accesses have the same alignment." 1 "vect" } }
   ! { dg-final { cleanup-tree-dump "vect" } }

This patch does change those reports to XFAIL (testsuite report attached).
I suppose any attempts to optimize for ia64, such as load-pair versioning, would be in the dataflow branch, whose location I don't know.

LAST_UPDATED: Obtained from SVN: trunk revision 123799
Native configuration is ia64-unknown-linux-gnu

  === gcc tests ===

  Running target unix
  FAIL: gcc.c-torture/execute/mayalias-2.c compilation, -O3 -g (internal compiler error)
  UNRESOLVED: gcc.c-torture/execute/mayalias-2.c execution, -O3 -g
  FAIL: gcc.c-torture/execute/mayalias-3.c compilation, -O3 -g (internal compiler error)
  UNRESOLVED: gcc.c-torture/execute/mayalias-3.c execution, -O3 -g
  FAIL: gcc.c-torture/execute/va-arg-24.c execution, -O3 -fomit-frame-pointer -funroll-loops
  FAIL: gcc.c-torture/execute/va-arg-24.c execution, -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions
  FAIL: gcc.dg/builtin-apply4.c execution test
  FAIL: gcc.dg/pr30643.c scan-assembler-not undefined
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -O0 compilation failed to produce executable
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -O1 compilation failed to produce executable
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -O2 compilation failed to produce executable
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -O3 -fomit-frame-pointer compilation failed to produce executable
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -O3 -g compilation failed to produce executable
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -Os compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -O0 (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -O0 compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -O1 (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -O1 compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -O2 (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -O2 compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -O3 -fomit-frame-pointer (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -O3 -fomit-frame-pointer compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -O3 -g (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -O3 -g compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -Os (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -Os compilation failed to produce executable
  XPASS: gcc.dg/tree-ssa/loop-1.c scan-assembler-times foo 5
  XPASS: gcc.dg/tree-ssa/update-threading.c scan-tree-dump-times Invalid sum 0
  FAIL: gcc.dg/vect/pr30771.c scan-tree-dump-times vectorized 1 loops 1
  FAIL: gcc.dg/vect/vect-iv-4.c scan-tree-dump-times vectorized 1 loops 1
  FAIL: gcc.dg/vect/vect-iv-9.c scan-tree-dump-times vectorize
Re: Where is gstdint.h
[EMAIL PROTECTED] wrote: Where is gstdint.h? Does it actually exist? libdecnumber seems to use it. The decimal32|64|128.h headers include decNumber.h, which includes decContext.h, which includes gstdint.h.

When you configure libdecnumber (e.g. by running the top-level gcc configure), gstdint.h should be created, by modifying <stdint.h>. Since you said nothing about the conditions under which you had the problem, you can't expect anyone to fix it for you. If you do want it fixed, you should at least file a complete PR. As this is more likely to happen with a poorly supported target, you may have to look into it in more detail than that. When this happened to me, I simply made a copy of <stdint.h> to get over the hump.
Re: Where is gstdint.h
[EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: Where is gstdint.h? Does it actually exist? libdecnumber seems to use it. The decimal32|64|128.h headers include decNumber.h, which includes decContext.h, which includes gstdint.h. When you configure libdecnumber (e.g. by running the top-level gcc configure), gstdint.h should be created, by modifying <stdint.h>. Since you said nothing about the conditions under which you had the problem, you can't expect anyone to fix it for you. If you do want it fixed, you should at least file a complete PR. As this is more likely to happen with a poorly supported target, you may have to look into it in more detail than that. When this happened to me, I simply made a copy of <stdint.h> to get over the hump.

Thanks for the prompt reply. I am doing a 386 build. I could not find it in my build directory, but it is there after all. Sorry, not used to finding files in Linux. Aaron

You can't expect people to guess which 386 build you are doing. Certain 386 builds clearly are not in the "poorly supported" category; others may be.
Re: Where is gstdint.h
[EMAIL PROTECTED] wrote: Tim Prince wrote: [EMAIL PROTECTED] wrote: Where is gstdint.h? Does it actually exist? libdecnumber seems to use it. The decimal32|64|128.h headers include decNumber.h, which includes decContext.h, which includes gstdint.h. When you configure libdecnumber (e.g. by running the top-level gcc configure), gstdint.h should be created, by modifying <stdint.h>. Since you said nothing about the conditions under which you had the problem, you can't expect anyone to fix it for you. If you do want it fixed, you should at least file a complete PR. As this is more likely to happen with a poorly supported target, you may have to look into it in more detail than that. When this happened to me, I simply made a copy of <stdint.h> to get over the hump.

This might happen when you run the top-level gcc configure in its own directory. You may want to try to make a new directory elsewhere and run configure there:

  pwd
  .../my-gcc-source-tree
  mkdir ../build
  cd ../build
  ../my-gcc-source-tree/configure
  make

If you're suggesting trying to build in the top-level directory to see if the same problem occurs, I would expect other problems to arise. If it would help diagnose the problem, and the problem persists for a few weeks, I'd be willing to try it.
Re: Effects of newly introduced -mpcX 80387 precision flag
[EMAIL PROTECTED] wrote: I just (re-)discovered these tables giving maximum known errors in some libm functions when extended precision is enabled: http://people.inf.ethz.ch/gonnet/FPAccuracy/linux/summary.html and when the precision of the mantissa is set to 53 bits (double precision): http://people.inf.ethz.ch/gonnet/FPAccuracy/linux64/summary.html This is from 2002, and indeed, some of the errors in double-precision results are hundreds or thousands of times bigger when the precision is set to 53 bits.

This isn't very helpful. I can't find an indication of whose libm is being tested; it appears to be an unspecified non-standard version of gcc, and a lot of digging would be needed to find out what the tests are. It makes no sense at all for sqrt() to break down with a change in precision mode. Extended precision typically gives a significant improvement in the accuracy of complex math functions, as shown in the Celefunt suite from TOMS. The functions shown, if properly coded for SSE2, should be capable of giving good results independent of x87 precision mode. I understand there is continuing academic research. Arguments have been going on for some time about whether to accept approximate SSE2 math libraries. I personally would not like to see new libraries without some requirement for readable C source and testing. I agree that it would be bad to set 53-bit mode blindly for a library which expects 64-bit mode, but it seems a serious weakness if such a library doesn't take care of precision mode itself. The whole precision-mode issue seems somewhat moot now that years have passed since the last CPUs were made which do not support SSE2, or the equivalent in other CPU families.
Re: Effects of newly introduced -mpcX 80387 precision flag
[EMAIL PROTECTED] wrote: On Apr 29, 2007, at 1:01 PM, Tim Prince wrote: It makes no sense at all for sqrt() to break down with change in precision mode.

If you do an extended-precision (80-bit) sqrt and then round the result again to a double (64-bit), those two roundings will increase the error, sometimes to > 1/2 ulp. To give current results on a machine I have access to, I ran the tests there on:

  vendor_id  : AuthenticAMD
  cpu family : 15
  model      : 33
  model name : Dual Core AMD Opteron(tm) Processor 875

using:

  euler-59% gcc -v
  Using built-in specs.
  Target: x86_64-unknown-linux-gnu
  Configured with: ../configure --prefix=/pkgs/gcc-4.1.2
  Thread model: posix
  gcc version 4.1.2

on an up-to-date RHEL 4.0 server (so whatever libm is offered there), and, indeed, the only differences it found were in 1/x, sqrt(x), and Pi*x, because of double rounding. In other words, the code that went through libm gave identical answers whether running on SSE, x87 (extended precision), or x87 (double precision). I don't know whether there are still math libraries for which Gonnet's 2002 results prevail.

Double rounding ought to be avoided by -mfpmath=sse and permitting builtin_sqrt to do its thing, or by setting 53-bit precision. The latter disables long double. The original URL showed total failure of sqrt(); double rounding only brings an error of .5 ulp, as usually assessed. I don't think the 64-/53-bit double rounding of sqrt can be detected, but of course such double rounding of * can be measured. With Pi, you have various possibilities, according to the precision of the Pi value (including the possibility of the one supplied by the x87 instruction) as well as the two choices of arithmetic precision mode.
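The double-rounding effect discussed above (an 80-bit extended result rounded a second time when stored as a 64-bit double) can be reproduced exactly with rational arithmetic. Here is an illustrative sketch in Python, not from the thread: a round-to-nearest-even at a chosen significand width, applied to a value that sits just above the 53-bit midpoint but exactly on the 64-bit midpoint, so rounding through extended precision loses the last bit.

```python
from fractions import Fraction

def round_to_precision(x: Fraction, p: int) -> Fraction:
    """Round positive x to the nearest value with a p-bit significand, ties to even."""
    # Find e with 2**e <= x < 2**(e+1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    if Fraction(2) ** (e + 1) <= x:
        e += 1
    ulp = Fraction(2) ** (e - p + 1)   # spacing of p-bit values near x
    return round(x / ulp) * ulp        # round() on a Fraction is ties-to-even

# Just above the 53-bit rounding midpoint, exactly on the 64-bit midpoint:
x = 1 + Fraction(1, 2**53) + Fraction(1, 2**64)

once  = round_to_precision(x, 53)                          # one correct rounding
twice = round_to_precision(round_to_precision(x, 64), 53)  # x87 64-bit, then store

print(once == 1 + Fraction(1, 2**52))  # True: the correctly rounded double
print(twice == 1)                      # True: double rounding lost the last bit
```

The same mechanism explains why results routed through an x87 in extended-precision mode can differ from SSE results by an extra fraction of an ulp.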
Re: Successful Build of gcc on Cygwin WinXP SP2
[EMAIL PROTECTED] wrote: Cygcheck version 1.90, compiled on Jan 31 2007. How do I get a later version of Cygwin?

1.90 is the current release version. It seems unlikely that later trial versions have a patch for the stdio.h conflict with C99, or change headers to avoid warnings which by default are fatal. If you want a newer cygwin.dll, read the cygwin mailing list archive for hints, but it doesn't appear to be relevant.
Re: Successful Build of gcc on Cygwin WinXP SP2
[EMAIL PROTECTED] wrote: James, On 5/1/07, Aaron Gray <[EMAIL PROTECTED]> wrote: Hi James, > Successfully built latest gcc on Win XP SP2 with cvs-built cygwin. I was wondering whether you could help to get me to the same point please.

You will need to use Dave Korn's patch for newlib: http://sourceware.org/ml/newlib/2007/msg00292.html

I am getting the following:

  $ patch newlib/libc/include/stdio.h fix-gcc-bootstrap-on-cygwin-patch.diff
  patching file newlib/libc/include/stdio.h
  Hunk #1 succeeded at 475 (offset 78 lines).
  Hunk #2 FAILED at 501.
  Hunk #3 FAILED at 521.
  2 out of 3 hunks FAILED -- saving rejects to file newlib/libc/include/stdio.h.rej

I had to apply the relevant changes manually to the cygwin stdio.h. It doesn't appear to match the version for which Dave made the patch.
Re: What happened to bootstrap-lean?
Gabriel Dos Reis wrote: Andrew Pinski <[EMAIL PROTECTED]> writes:

| > On Fri, 16 Dec 2005, Paolo Bonzini wrote:
| > > Yes. "make bubblestrap" is now called simply "make".
| >
| > Okay, how is "make bootstrap-lean" called these days? ;-)
| >
| > In fact, bootstrap-lean is still documented in install.texi and
| > makefile.texi, but it no longer seems to be present in the Makefile
| > machinery. Could we get this back?
|
| bootstrap-lean is done by doing the following (which I feel is the wrong way):
| Configure with --enable-bootstrap=lean
| and then do a "make bootstrap"

Hmm, does that mean that I would have to reconfigure GCC if I wanted to do "make bootstrap-lean" after a previous configuration and build? I think the answer must be "no", but I'm not sure. -- Gaby

I've not been able to find another way to rebuild (on SuSE 9.2, for example) after applying the weekly patch file. I'm hoping that suggestion works.
Re: Fwd: Windows support dropped from gcc trunk
On 10/14/2015 11:36 AM, Steve Kargl wrote: > On Wed, Oct 14, 2015 at 11:32:52AM -0400, Tim Prince wrote: >> Sorry if someone sees this multiple times; I think it may have been >> stopped by ISP or text-mode filtering: >> >> Since Sept. 26, the partial support for Windows 64-bit has been dropped >> from gcc trunk: >> winnt.c apparently has problems with seh, which prevent bootstrapping, >> and prevent the new gcc from building libraries. >> libgfortran build throws a fatal error on account of lack of support for >> __float128, even if a working gcc is used. >> I didn't see any notification about this; maybe it wasn't a consensus >> decision? >> There are satisfactory pre-built gfortran 5.2 compilers (including >> libgomp, although that is off by default and the testsuite wants acc as >> well as OpenMP) available in cygwin64 (test version) and (apparently) >> mingw-64. >> > The last commit to winnt.c is > > 2015-10-02  Kai Tietz > >     PR target/51726 >     * config/i386/winnt.c (ix86_handle_selectany_attribute): Handle >     selectany within this function without need to keep attribute. >     (i386_pe_encode_section_info): Remove selectany-code. > > Perhaps, contact Kai. > > I added gcc@gcc.gnu.org as this technically isn't a Fortran issue.

The test suite reports hundreds of new ICE instances, all referring to this seh_unwind_emit function:

  /cygdrive/c/users/tim/tim/tim/src/gnu/gcc1/gcc/testsuite/gcc.c-torture/compile/2127-1.c: In function 'foo':
  /cygdrive/c/users/tim/tim/tim/src/gnu/gcc1/gcc/testsuite/gcc.c-torture/compile/2127-1.c:7:1: internal compiler error: in i386_pe_seh_unwind_emit, at config/i386/winnt.c:1137
  Please submit a full bug report,

I will file a bugzilla PR if that is what is wanted, but I wanted to know whether a new configure option is required. As far as I know there were always problems with long double for Windows targets, but the refusal of libgfortran to build on account of it is new. Thanks, Tim
Re: question about -ffast-math implementation
On 6/2/2014 3:00 AM, Andrew Pinski wrote: On Sun, Jun 1, 2014 at 11:09 PM, Janne Blomqvist wrote: On Sun, Jun 1, 2014 at 9:52 AM, Mike Izbicki wrote: I'm trying to copy gcc's behavior with the -ffast-math compiler flag into haskell's ghc compiler. The only documentation I can find about it is at: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html I understand how floating point operations work and have come up with a reasonable list of optimizations to perform. But I doubt it is exhaustive. My question is: where can I find all the gory details about what gcc will do with this flag? I'm perfectly willing to look at source code if that's what it takes. In addition to the official documentation, a nice overview is at https://gcc.gnu.org/wiki/FloatingPointMath Useful, thanks for the pointer Though for the gory details and authoritative answers I suppose you'd have to look into the source code. Also, are there any optimizations that you wish -ffast-math could perform, but for various architectural reasons they don't fit into gcc? There are of course a (nearly endless?) list of optimizations that could be done but aren't (lack of manpower, impractical, whatnot). I'm not sure there are any interesting optimizations that would be dependent on loosening -ffast-math further? I find it difficult to remember how to reconcile differing treatments by gcc and gfortran under -ffast-math; in particular, with respect to -fprotect-parens and -freciprocal-math. The latter appears to comply with Fortran standard. (One thing I wish wouldn't be included in -ffast-math is -fcx-limited-range; the naive complex division algorithm can easily lead to comically poor results.) Which is kinda interesting because the Google folks have been trying to turn on -fcx-limited-range for C++ a few times now. Intel tried to add -complex-limited-range as a default under -fp-model fast=1 but that was shown to be unsatisfactory. 
Now, with the introduction of omp simd directives and pragmas, we have disagreement among various compilers on the relative roles of the directives and the fast-math options. I've submitted PR60117 hoping to get some insight on whether omp simd should disable optimizations otherwise performed by -ffast-math. Intel made the directives override the command-line fast (or "no-fast") settings locally, so that complex-limited-range might be in effect inside the scope of the directive (whether or not you want it). They made changes in the current beta compiler, so it's no longer practical to set standard-compliant options but discard them by pragma in individual for loops. -- Tim Prince
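The complaint above about -fcx-limited-range is easy to demonstrate. An illustrative sketch in Python (not the thread's code): the textbook complex-division formula, which is roughly what the limited-range option licenses a compiler to emit, overflows in its denominator for large operands, while a scaled (Smith-style) division, which CPython's own complex `/` effectively performs, returns the representable quotient.

```python
import math

def naive_cdiv(a: complex, b: complex) -> complex:
    # Textbook formula: the denominator br*br + bi*bi overflows to inf for
    # |b| around 1e154 and up, even though the true quotient is ordinary.
    d = b.real * b.real + b.imag * b.imag
    return complex((a.real * b.real + a.imag * b.imag) / d,
                   (a.imag * b.real - a.real * b.imag) / d)

a = complex(1e200, 1e200)
b = complex(1e200, 1e200)

bad  = naive_cdiv(a, b)   # inf/inf and (inf-inf)/inf -> NaNs
good = a / b              # scaled division survives: (1+0j)

print(bad)
print(good)
```

This is the "comically poor result" hazard: the answer is exactly 1, yet the fast formula produces NaN, which is why enabling -fcx-limited-range by default is contentious.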
Re: Vector permutation only deals with # of vector elements same as mask?
On 2/11/2011 7:30 AM, Bingfeng Mei wrote: Thanks. Another question: is there any plan to vectorize loops like the following one?

  for (i = 127; i >= 0; i--) {
    x[i] = y[i] + z[i];
  }

When I last tried, the Sun compilers could vectorize such loops efficiently (for fairly short loops), with appropriate data definitions. The Sun compilers didn't peel for alignment, to improve performance on longer loops, as gcc and others do. For a case with no data overlaps (float * __restrict__ x, y, z, or Fortran), loop reversal can do the job. gcc has some loop reversal machinery, but I haven't seen it used for vectorization. In a simple case like this, some might argue there's no reason to write a backward loop when it could easily be reversed in source code, and compilers have been seen to make mistakes in reversal. -- Tim Prince
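Why reversal needs the no-overlap guarantee (__restrict__ in C, or Fortran argument rules) can be sketched in a few lines. This Python illustration is mine, not gcc's machinery: with disjoint arrays the backward and forward loops give identical results, but for an aliased recurrence the direction changes the answer, so a compiler may only reverse when it can prove no overlap.

```python
def backward(x, y, z):
    # x[i] = y[i] + z[i] walking from the last element down, as in the post
    for i in range(len(x) - 1, -1, -1):
        x[i] = y[i] + z[i]

def forward(x, y, z):
    for i in range(len(x)):
        x[i] = y[i] + z[i]

# Disjoint arrays: direction is irrelevant, so reversal is legal.
y, z = [1, 2, 3, 4], [10, 20, 30, 40]
xb, xf = [0] * 4, [0] * 4
backward(xb, y, z)
forward(xf, y, z)
print(xb == xf)  # True

# Overlapping storage: the recurrence a[i+1] = a[i] + 1 reads what the
# other direction has already written, so reversal changes the result.
def recur_backward(a):
    for i in range(len(a) - 2, -1, -1):
        a[i + 1] = a[i] + 1

def recur_forward(a):
    for i in range(len(a) - 1):
        a[i + 1] = a[i] + 1

ab, af = [0, 0, 0, 0], [0, 0, 0, 0]
recur_backward(ab)   # [0, 1, 1, 1]
recur_forward(af)    # [0, 1, 2, 3]
print(ab != af)      # True: with aliasing, direction matters
```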
Re: numerical results differ after irrelevant code change
On 5/8/2011 8:25 AM, Michael D. Berger wrote:

> -----Original Message-----
> From: Robert Dewar [mailto:de...@adacore.com]
> Sent: Sunday, May 08, 2011 11:13
> To: Michael D. Berger
> Cc: gcc@gcc.gnu.org
> Subject: Re: numerical results differ after irrelevant code change
> [...]
> This kind of result is quite expected on an x86 using the old-style
> (default) floating point (because of extra precision in intermediate
> results).

How does the extra precision lead to the variable result? Also, is there a way to prevent it? It is a pain in regression testing.

If you don't need to support CPUs over 10 years old, consider -march=pentium4 -mfpmath=sse, or use the 64-bit OS and gcc. Note the resemblance of your quoted differences to DBL_EPSILON from <float.h>. That's 1 ulp relative to 1.0. I have a hard time imagining the nature of real applications which don't need to tolerate differences of 1 ulp. -- Tim Prince
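For the regression-testing pain mentioned above, the usual fix is to compare within a ulp budget rather than for exact equality. A sketch in Python 3.9+ (math.ulp and math.nextafter; this helper is mine, not from the thread):

```python
import math

def within_ulps(a: float, b: float, k: int = 1) -> bool:
    """True if a and b differ by at most k units in the last place."""
    if a == b:
        return True
    tol = k * math.ulp(max(abs(a), abs(b)))
    return abs(a - b) <= tol

# Two results that differ by exactly DBL_EPSILON relative to 1.0, the
# size of the x87-vs-SSE discrepancies described above:
r1 = 1.0
r2 = math.nextafter(1.0, 2.0)        # 1.0 + 2**-52
print(within_ulps(r1, r2))           # True: a 1-ulp wobble is tolerated
print(within_ulps(1.0, 1.0 + 1e-12)) # False: a genuine discrepancy still fails
```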
Re: Profiling gcc itself
On 11/20/2011 11:10 AM, Basile Starynkevitch wrote: On Sun, 20 Nov 2011 03:43:20 -0800 Jeff Evarts wrote: I posted this question at irc://irc.oftc.net/#gcc and they suggested that I pose it here instead. I do some "large-ish" builds (linux, gcc itself, etc.) on a too-regular basis, and I was wondering what could be done to speed things up. A little printf-style checking hints to me that I might be spending the majority of my time in CPP rather than g++, gasm, ld, etc. Has anyone (ever, regularly, or recently) built gcc (g++, gcpp) with profiling turned on? Is it hard? Did you get good results?

I'm not sure the question belongs on gcc@gcc.gnu.org; perhaps gcc-h...@gcc.gnu.org might be a better place.

If you choose to follow such advice, explaining whether other facilities already in gcc, e.g. http://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html, apply to your situation may be useful. -- Tim Prince
Re: C Compiler benchmark: gcc 4.6.3 vs. Intel v11 and others
On 1/19/2012 9:27 AM, willus.com wrote: On 1/19/2012 2:59 AM, Richard Guenther wrote: On Thu, Jan 19, 2012 at 7:37 AM, Marc Glisse wrote: On Wed, 18 Jan 2012, willus.com wrote: For those who might be interested, I've recently benchmarked gcc 4.6.3 (and 3.4.2) vs. Intel v11 and Microsoft (in Windows 7) here: http://willus.com/ccomp_benchmark2.shtml http://en.wikipedia.org/wiki/Microsoft_Windows_SDK#64-bit_development

For the math functions, this is normally more a libc feature, so you might get very different results on different OSes. Then again, by using -ffast-math, you allow the math functions to return any random value, so I can think of ways to make it even faster ;-)

Also for math functions you can simply substitute the Intel compiler's ones (GCC uses the Microsoft ones) by linking against libimf. You can also make use of their vectorized variants from GCC by specifying -mveclibabi=svml and linking against libimf (the GCC autovectorizer will then use the routines from the Intel compiler math library). That makes a huge difference for code using functions from math.h. Richard. -- Marc Glisse

Thank you both for the tips. Are you certain that, with the flags I used, Intel doesn't completely in-line the math2.h functions at the compile stage? gcc? I take it that to use libimf.a (legally) I would have to purchase the Intel compiler?

In-line math functions, beyond what gcc does automatically (sqrt...), are possible only with x87 code; those aren't vectorizable nor remarkably fast, although quality can be made good (with care). As Richard said, the icc svml library is the one supporting the fast vector math functions. There is also an arch-consistency version of svml (different internal function names) which is not as fast but may give more accurate results or avoid platform-dependent bugs.

Yes, the Intel library license places restrictions on usage: http://software.intel.com/en-us/articles/faq-intel-parallel-composer-redistributable-package/?wapkw=%28redistributable+license%29 You might use it for personal purposes under the terms of this Linux license: http://software.intel.com/en-us/articles/Non-Commercial-license/?wapkw=%28non-commercial+license%29 It isn't supported in the gcc context. Needless to say, I don't speak for my employer. -- Tim Prince
Re: C Compiler benchmark: gcc 4.6.3 vs. Intel v11 and others
On 1/19/2012 9:24 PM, willus.com wrote: On 1/18/2012 10:37 PM, Marc Glisse wrote: On Wed, 18 Jan 2012, willus.com wrote: For those who might be interested, I've recently benchmarked gcc 4.6.3 (and 3.4.2) vs. Intel v11 and Microsoft (in Windows 7) here: http://willus.com/ccomp_benchmark2.shtml http://en.wikipedia.org/wiki/Microsoft_Windows_SDK#64-bit_development For the math functions, this is normally more a libc feature, so you might get very different results on different OSes. Then again, by using -ffast-math, you allow the math functions to return any random value, so I can think of ways to make it even faster ;-)

I use -ffast-math all the time and have always gotten virtually identical results to when I turn it off. The speed difference is important for me.

The default for the Intel compiler is more aggressive than gcc -ffast-math -fno-cx-limited-range, as long as you don't use one of the old buggy mathinline.h header files. For a fair comparison, you need detailed attention to comparable options. If you don't set gcc -ffast-math, you will want icc -fp-model source. It's good to have in mind what you want from the more aggressive options, e.g. auto-vectorization of sum reduction. If you do want gcc -fcx-limited-range, icc spells it -complex-limited-range. -- Tim Prince
Re: weird optimization in sin+cos, x86 backend
On 02/05/2012 11:08 AM, James Courtier-Dutton wrote: Hi, I looked at this a bit closer. sin(1.0e22) is outside the +-2^63 range, so FPREM1 is used to bring it inside the range. So, I looked at FPREM1 a bit closer.

  #include <stdio.h>
  #include <math.h>

  int main (void)
  {
    long double x, r, m;
    x = 1.0e22;
    // x = 5.26300791462049950360708478127784; <- This is what the answer
    //     should be, give or take 2PI.
    m = M_PIl * 2.0;
    r = remainderl(x, m);   // Utilizes FPREM1
    printf ("x = %.17Lf\n", x);
    printf ("m = %.17Lf\n", m);
    printf ("r = %.17Lf\n", r);
    return 1;
  }

This outputs:

  x = 100.0
  m = 6.28318530717958648
  r = 2.66065232182161996

But r should be 5.26300791462049950360708478127784... or -1.020177392559086973318201985281... according to Wolfram Alpha and most arbitrary-precision maths libs I tried. I need to do a bit more digging, but this might point to a bug in the CPU instruction FPREM1. Kind regards, James

As I recall, the remaindering instruction was documented as using a 66-bit rounded approximation of PI, in case that is what you refer to. -- Tim Prince
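The surprising r above is consistent with the remainder being exact with respect to its stored second operand: m is the machine approximation of 2*pi, not 2*pi itself, and at x = 1e22 the quotient is about 1.6e21, so m's tiny representation error is amplified by enough full turns to make the result look arbitrary without any defect in the instruction. The same effect can be checked in double precision with exact rational arithmetic. An illustrative Python sketch (mine, not from the thread; it assumes CPython's math.remainder wraps the C library's correctly rounded remainder()):

```python
import math
from fractions import Fraction

x = 10**22                 # exactly representable as a double
m = 2.0 * math.pi          # the double nearest 2*pi, NOT 2*pi itself

# IEEE remainder is exact for its actual operands: recompute it with
# unlimited-precision rationals and the round-half-to-even quotient
# that the operation specifies.
q = round(Fraction(x) / Fraction(m))   # round() on Fraction is ties-to-even
r_exact = Fraction(x) - q * Fraction(m)

print(math.remainder(x, m))
print(float(r_exact))
print(math.remainder(x, m) == float(r_exact))  # True: no rounding error at all
# So the "wrong-looking" value comes entirely from m != 2*pi: the quotient
# magnitude (~1.6e21) multiplies m's representation error (~a few parts in 10^16).
```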
Re: Failure building current 4.5 snapshot on Cygwin
Eric Niebler wrote: Angelo Graziosi wrote: Eric Niebler wrote: I am running into the same problem (cannot build the latest snapshot on cygwin). I have built and installed the latest binutils from head (see attached config.log for details). But still the build fails. Any help?

This is strange! Recent snapshots (4.3, 4.4, 4.5) build OK both on Cygwin-1.5 and 1.7. In 1.5 I have built the same binutils as in 1.7.

I've attached objdir/intl/config.log.

It says you have triggered cross-compilation mode, without a complete setup. Also, it says you are building in a directory below your source code directory, which I always used to do myself, but stopped on account of the number of times I've seen this criticized. The only new build-blocking problem I've run into in the last month is the unsupported autoconf test, which has a #FIXME comment. I had to comment it out.
Re: [4.4] Strange performance regression?
Joern Rennecke wrote: Quoting Mark Tall: Joern Rennecke wrote: But at any rate, the subject does not agree with the content of the original post.

When we talk about a 'regression' in a particular gcc version, we generally mean that this version is in some way worse than a previous version of gcc. Didn't the original poster indicate that gcc 4.3 was faster than 4.4? In my book that is a regression.

He also said that it was a different machine: a Core 2 Q6600 vs. some kind of Xeon Core 2 system with a total of eight cores. As different memory subsystems are likely to affect the code, it is not an established regression until he can reproduce a performance drop going from an older to a current compiler on the same or sufficiently similar machines, under comparable load conditions - which generally means that the machine must be idle apart from the benchmark.

Ian's judgment in diverting this to gcc-help was borne out when it developed that -funroll-loops was wanted. This appeared to confirm his suggestion that it might have had to do with loop alignments. As long as everyone is editorializing, I'll venture to say this case raises the suspicion that gcc might benefit from better default loop alignments, at least for that particular CPU. However, I've played a lot of games on Core i7 with varying unrolling etc., and I find the behavior of current gcc entirely satisfactory, aside from the verbosity of the options required.
Re: Whole program optimization and functions-only-called-once.
Toon Moene wrote: Richard Guenther wrote: On Sun, Nov 15, 2009 at 8:07 AM, Toon Moene wrote: Steven Bosscher wrote: At least CPROP, LCM-PRE, and HOIST (i.e. all passes in gcse.c), and variable tracking. Are they covered by a --param ? At least that way I could teach them to go on indefinitely ... I think most of them are. Maybe we should diagnose the cases where we hit these limits. That would be a good idea. One other compiler I work with frequently (the Intel Fortran compiler) does just that. However, either it doesn't have or their marketing department doesn't want you to know about knobs to tweak these decisions :-) Both gfortran and ifort have a much longer list of adjustable limits on in-lining than most customers are willing to study or test.
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Toon Moene wrote: H.J. Lu wrote: On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene wrote: L.S., Due to the discussion on register allocation, I went back to a hobby of mine: studying the assembly output of the compiler. For this Fortran subroutine (note: unless otherwise told, the Fortran front end takes reals to be 32-bit floating-point numbers):

  subroutine sum(a, b, c, n)
  integer i, n
  real a(n), b(n), c(n)
  do i = 1, n
     c(i) = a(i) + b(i)
  enddo
  end

with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

  xorps   %xmm2, %xmm2
  .L6:
  movaps  %xmm2, %xmm0
  movaps  %xmm2, %xmm1
  movlps  (%r9,%rax), %xmm0
  movlps  (%r8,%rax), %xmm1
  movhps  8(%r9,%rax), %xmm0
  movhps  8(%r8,%rax), %xmm1
  incl    %ecx
  addps   %xmm1, %xmm0
  movaps  %xmm0, 0(%rbp,%rax)
  addq    $16, %rax
  cmpl    %ebx, %ecx
  jb      .L6

I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1} have to be zeroed (%xmm2 is set to zero by xor'ing it with itself) before they are completely filled with the mov{l,h}ps instructions?

I think it is used to avoid a partial SSE register stall.

You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for %xmm1) instruction (to copy 4*32 bits to the register)?

If you want those, you must request them with -mtune=barcelona.
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Richard Guenther wrote: On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince wrote: Toon Moene wrote: H.J. Lu wrote: On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene wrote: L.S., Due to the discussion on register allocation, I went back to a hobby of mine: Studying the assembly output of the compiler. For this Fortran subroutine (note: unless otherwise told to the Fortran front end, reals are 32 bit floating point numbers): subroutine sum(a, b, c, n) integer i, n real a(n), b(n), c(n) do i = 1, n c(i) = a(i) + b(i) enddo end with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop: xorps %xmm2, %xmm2 .L6: movaps %xmm2, %xmm0 movaps %xmm2, %xmm1 movlps (%r9,%rax), %xmm0 movlps (%r8,%rax), %xmm1 movhps 8(%r9,%rax), %xmm0 movhps 8(%r8,%rax), %xmm1 incl %ecx addps %xmm1, %xmm0 movaps %xmm0, 0(%rbp,%rax) addq $16, %rax cmpl %ebx, %ecx jb .L6 I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before they are completely filled with the mov{l,h}ps instructions? I think it is used to avoid partial SSE register stall. You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for %xmm1) instruction (to copy 4*32 bits to the register)? If you want those, you must request them with -mtune=barcelona. Which would then get you movups (%r9,%rax), %xmm0 (unaligned move). Generic tuning prefers the split moves; AMD Fam10 and above handle unaligned moves just fine. Correct, the movaps would have been used if alignment were recognized. The newer CPUs achieve full performance with movups. Do you consider Core i7/Nehalem as included in "AMD Fam10 and above"?
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Toon Moene wrote: Toon Moene wrote: Tim Prince wrote: > If you want those, you must request them with -mtune=barcelona. OK, so it is an alignment issue (with -mtune=barcelona): .L6: movups 0(%rbp,%rax), %xmm0 movups (%rbx,%rax), %xmm1 incl %ecx addps %xmm1, %xmm0 movaps %xmm0, (%r8,%rax) addq $16, %rax cmpl %r10d, %ecx jb .L6 Once this problem is solved (well, determined how it could be solved), we go on to the next, the extraneous induction variable %ecx. There are two ways to deal with it: 1. Eliminate it with respect to the other induction variable that counts in the same direction (upwards, with steps 16) and remember that induction variable's (%rax) limit. or: 2. Count %ecx down from %r10d to zero (which eliminates %r10d as a loop-carried register). g77 avoided this by coding counted do loops with a separate loop counter counting down to zero - not so with gfortran (quoting): /* Translate the simple DO construct. This is where the loop variable has integer type and step +-1. We can't use this in the general case because integer overflow and floating point errors could give incorrect results. We translate a do loop from: DO dovar = from, to, step body END DO to: [Evaluate loop bounds and step] dovar = from; if ((step > 0) ? (dovar <= to) : (dovar >= to)) { for (;;) { body; cycle_label: cond = (dovar == to); dovar += step; if (cond) goto end_label; } } end_label: This helps the optimizers by avoiding the extra induction variable used in the general case. */ So either we teach the Fortran front end this trick, or we teach the loop optimization the trick of flipping the sense of an (otherwise unused) induction variable. This would have paid off more frequently in i386 mode, where there is a possibility of integer register pressure in loops small enough for such an optimization to succeed. This seems to be among the types of optimizations envisioned for run-time binary interpretation systems.
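The two loop shapes being compared can be sketched in C++ (an illustrative sketch, not gfortran's actual translation): the second version replaces the upward counter that must be compared against a live limit register with a trip counter counting down to zero, whose termination test needs no extra register.

```cpp
// Sketch of the induction-variable trick discussed above. Both functions
// compute the same thing; they differ only in the shape of the loop control.
void sum_up(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)                // i is compared against n each trip
        c[i] = a[i] + b[i];
}

void sum_down(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (int count = n; count > 0; --count) {  // down-counter: test against zero
        c[i] = a[i] + b[i];
        ++i;
    }
}
```

The down-counting form is what the post attributes to g77's code generation for counted do loops.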
Re: Graphite and Loop fusion.
Toon Moene wrote: REAL, ALLOCATABLE :: A(:,:), B(:,:), C(:,:), D(:,:), E(:,:), F(:,:) ! ... READ IN EXTENT OF ARRAYS ... READ*,N ! ... ALLOCATE ARRAYS ALLOCATE(A(N,N),B(N,N),C(N,N),D(N,N),E(N,N),F(N,N)) ! ... READ IN ARRAYS READ*,A,B C = A + B D = A * C E = B * EXP(D) F = C * LOG(E) where the four assignments all have the structure of loops like: DO I = 1, N DO J = 1, N X(J,I) = OP(A(J,I), B(J,I)) ENDDO ENDDO Obviously, this could benefit from loop fusion, by combining the four assignments in one loop. Provided that it were still possible to vectorize suitable portions, or N is known to be so large that cache locality outweighs vectorization. This raises the question of progress on vector math functions, as well as the one about relative alignments (or ignoring them in view of recent CPU designs).
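A hedged sketch, in C++ rather than Fortran, of the fusion being asked about: the four whole-array assignments each imply a full sweep over the data, and fusing them into one loop touches each element only once while it is still in cache.

```cpp
#include <cmath>
#include <vector>

// Unfused: four sweeps over the arrays, as the four Fortran assignments imply.
void unfused(const std::vector<double>& a, const std::vector<double>& b,
             std::vector<double>& c, std::vector<double>& d,
             std::vector<double>& e, std::vector<double>& f) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
    for (std::size_t i = 0; i < n; ++i) d[i] = a[i] * c[i];
    for (std::size_t i = 0; i < n; ++i) e[i] = b[i] * std::exp(d[i]);
    for (std::size_t i = 0; i < n; ++i) f[i] = c[i] * std::log(e[i]);
}

// Fused: one sweep, so each element stays in cache across all four operations.
void fused(const std::vector<double>& a, const std::vector<double>& b,
           std::vector<double>& c, std::vector<double>& d,
           std::vector<double>& e, std::vector<double>& f) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
        d[i] = a[i] * c[i];
        e[i] = b[i] * std::exp(d[i]);
        f[i] = c[i] * std::log(e[i]);
    }
}
```

Whether the fused form wins depends on exactly the trade-off named above: it still vectorizes element-wise, but the exp and log calls need vector math library support to keep the loop vectorized.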
Re: Need an assembler consult!
FX wrote: Hi all, I have picked up what seems to be a simple patch from PR 36399, but I don't know enough assembler to tell whether it's fixing it completely or not. The following function: #include <emmintrin.h> __m128i r(__m128 d1, __m128 d2, __m128 d3, __m128i r, int t, __m128i s) {return r+s;} is compiled by Apple's GCC into: pushl %ebp movl %esp, %ebp subl $72, %esp movaps %xmm0, -24(%ebp) movaps %xmm1, -40(%ebp) movaps %xmm2, -56(%ebp) movdqa %xmm3, -72(%ebp) # movdqa 24(%ebp), %xmm0 # paddq -72(%ebp), %xmm0 # leave ret Instead of the lines marked with #, FSF's GCC gives: movdqa 40(%ebp), %xmm1 movdqa 8(%ebp), %xmm0 paddq %xmm1, %xmm0 By fixing SSE_REGPARM_MAX in config/i386/i386.h (following Apple's compiler value), GCC now generates: movdqa %xmm3, -72(%ebp) movdqa 24(%ebp), %xmm0 movdqa -72(%ebp), %xmm1 paddq %xmm1, %xmm0 The first two lines are identical to Apple's, but the last two aren't. They seem OK to me, but I don't know enough assembler to be really sure. Could someone confirm the two are equivalent? Apparently the same as far as what is returned in xmm0.
Re: The "right way" to handle alignment of pointer targets in the compiler?
Benjamin Redelings I wrote: Hi, I have been playing with the GCC vectorizer and examining assembly code that is produced for dot products that are not for a fixed number of elements. (This comes up surprisingly often in scientific codes.) So far, the generated code is not faster than non-vectorized code, and I think that it is because I can't find a way to tell the compiler that the target of a double* is 16-byte aligned. From PR 27827 - http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827 : "I just quickly glanced at the code, and I see that it never uses "movapd" from memory, which is a key to getting decent performance." How many people would take advantage of special machinery for some old CPU, if that's your goal? Simplifying your example to: double f3(const double* p_, const double* q_, int n) { double sum = 0; for (int i = 0; i < n; ++i) sum += p_[i] * q_[i]; return sum; } On CPUs introduced in the last 2 years, movupd should be as fast as movapd, and -mtune=barcelona should work well in general, not only in this example. The bigger difference in performance, for longer loops, would come with further batching of sums, favoring loop lengths of multiples of 4 (or 8, with unrolling). That alignment already favors a fairly long loop. As you're using C++, it seems you could have used inner_product() rather than writing out a function. My Core i7 showed matrix multiply 25x25 times 25x100 producing 17 Gflops with gfortran in-line code. g++ produces about 80% of that.
Re: The "right way" to handle alignment of pointer targets in the compiler?
Benjamin Redelings I wrote: Thanks for the information! Here are several reasons (there are more) why gcc uses 64-bit loads by default: 1) For a single dot product, the rate of 64-bit data loads roughly balances the latency of adds to the same register. Parallel dot products (using 2 accumulators) would take advantage of faster 128-bit loads. 2) run-time checks to adjust alignment, if possible, don't pay off for loop counts < about 40. 3) several obsolete CPU architectures implemented 128-bit loads by pairs of 64-bit loads. 4) 64-bit loads were generally more efficient than movupd, prior to barcelona. In the case you quote, with parallel dot products, 128-bit loads would be required so as to show much performance gain over x87.
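The "parallel dot products (using 2 accumulators)" idea in point 1 can be sketched as follows (an illustrative rewrite, not gcc output): two independent summation chains hide the add latency and give the compiler pairs of adjacent elements that a single 128-bit load can fetch.

```cpp
// Two-accumulator dot product: s0 and s1 form independent dependence chains,
// so consecutive adds need not wait on each other.
double dot2(const double* p, const double* q, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += p[i] * q[i];
        s1 += p[i + 1] * q[i + 1];
    }
    if (i < n)                      // odd remainder element
        s0 += p[i] * q[i];
    return s0 + s1;
}
```

Note that this reassociates the summation, which is why gcc will only make such a transformation itself under -ffast-math (or -fassociative-math).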
Re: adding -fnoalias ... would a patch be accepted ?
torbenh wrote: can you please explain why you reject the idea of -fnoalias? msvc has __declspec(noalias); icc has -fnoalias. msvc needs it because it doesn't implement restrict and supports violation of typed aliasing rules as a default. ICL needs it for msvc compatibility, but has better alternatives. gcc can't copy the worst features of msvc.
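For reference, the finer-grained standard alternative alluded to here: gcc's C++ front end spells restrict as __restrict__, giving per-pointer no-alias guarantees instead of a whole-compilation-unit switch. A minimal sketch:

```cpp
// With __restrict__, the compiler may assume x and y never overlap, which
// permits vectorization of this loop without runtime overlap checks.
void axpy(double* __restrict__ y, const double* __restrict__ x,
          double a, int n) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```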
Re: speed of double-precision divide
Steve White wrote: I was under the misconception that each of these SSE operations was meant to be accomplished in a single clock cycle (although I knew there are various other issues.) Current CPU architectures permit an SSE scalar or parallel multiply and add instruction to be issued on each clock cycle. Completion takes at least 4 cycles for add, significantly more for multiply. The instruction timing tables quote throughput (how many cycles between issue) and latency (number of cycles to complete an individual operation). An even more common misconception than yours is that the extra time taken to complete multiply, compared with the time of add, would disappear with fused multiply-add instructions. SSE divide, as has been explained, is not pipelined. The best way to speed up a loop with divide is with vectorization, barring situations such as the one you brought up where divide may not actually be a necessary part of the algorithm.
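One common way to get divides out of a loop, in the spirit of the last sentence (a generic sketch, not tied to the original poster's code): when the divisor is loop-invariant, one hoisted divide plus a multiply per element replaces n non-pipelined divides. The result can differ in the last bit, which is why gcc only performs this substitution itself under -freciprocal-math (implied by -ffast-math).

```cpp
void scale_div(double* x, int n, double d) {
    for (int i = 0; i < n; ++i)
        x[i] /= d;                  // one high-latency, non-pipelined divide per iteration
}

void scale_recip(double* x, int n, double d) {
    const double r = 1.0 / d;       // single divide, hoisted out of the loop
    for (int i = 0; i < n; ++i)
        x[i] *= r;                  // multiplies, which pipeline well
}
```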
Re: Support for export keyword to use with C++ templates ?
On 2/2/10 7:19 PM, Richard Kenner wrote: I see that what I need is an assignment for all future changes. If my employer is not involved with any contributions of mine, the employer disclaimer is not needed, right ? It's safest to have it. The best way to prove that your employer is not involved with any contributions of yours is with such a disclaimer. Some employers have had a formal process for approving assignment of own-time contributions, as well as assignments as part of their business, and lack of either form of assignment indicates the employer has forbidden them. -- Tim Prince
Re: Starting an OpenMP parallel section is extremely slow on a hyper-threaded Nehalem
On 2/11/2010 2:00 AM, Edwin Bennink wrote: Dear gcc list, I noticed that starting an OpenMP parallel section takes a significant amount of time on Nehalem cpu's with hyper-threading enabled. If you think a question might be related to gcc, but don't know which forum to use, gcc-help is more appropriate. As your question is whether there is a way to avoid anomalous behaviors when an old Ubuntu is run on a CPU released after that version of Ubuntu, an Ubuntu forum might be more appropriate. A usual way is to shut off HyperThreading in the BIOS when running on a distro which has trouble with it. I do find your observation interesting. As far as I know, the oldest distro which works well on Core I7 is RHEL5.2 x86_64, which I run, with updated gcc and binutils, and HT disabled, as I never run applications which could benefit from HT. -- Tim Prince
Re: Change x86 default arch for 4.5?
On 2/18/2010 4:54 PM, Joe Buck wrote: But maybe I didn't ask the right question: can any x86 experts comment on recently made x86 CPUs that would not function correctly with code produced by --with-arch=i486? Are there any? All CPUs still in production are at least SSE3 capable, unless someone can come up with one of which I'm not aware. Intel compilers made the switch last year to requiring SSE2 capability for the host, as well as in the default target options, even for 32-bit. All x86_64 or X64 CPUs for which any compiler was produced had SSE2 capability, so it is required for those 64-bit targets. -- Tim Prince
Re: [RFH] A simple way to figure out the number of bits used by a long double
On 2/26/2010 5:44 AM, Ed Smith-Rowland wrote: Huh. I would have *sworn* that sizeof(long double) was 10 not 16 even though we know it was 80 bits. As you indicated before, sizeof gives the amount of memory displaced by the object, including padding. In my experience with gcc, sizeof(long double) is likely to be 12 on 32-bit platforms, and 16 on 64-bit platforms. These choices are made to preserve alignment for 32-bit and 128-bit objects respectively, and to improve performance in the 64-bit case, for hardware which doesn't like to straddle cache lines. It seems the topic would have been more appropriate for gcc-help, if related to gcc, or maybe comp.lang.c, if a question about implementation in accordance with standard C. -- Tim Prince
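The distinction drawn above can be checked directly: sizeof reports the padded storage, while <cfloat> reports the precision actually carried (64 mantissa bits for x87 extended). A small check, with the exact sizeof value left ABI-dependent:

```cpp
#include <cfloat>
#include <cstdio>

// Storage vs. precision: sizeof(long double) includes alignment padding
// (commonly 12 bytes on 32-bit x86, 16 on x86-64, as described above),
// while LDBL_MANT_DIG gives the real mantissa width (64 for 80-bit x87).
void report_long_double() {
    std::printf("sizeof(long double) = %zu bytes, mantissa = %d bits\n",
                sizeof(long double), LDBL_MANT_DIG);
}
```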
Re: legitimate parallel make check?
On 3/9/2010 4:28 AM, IainS wrote: It would be nice to allow the apparently independent targets [e.g. gcc-c,fortran,c++ etc.] to be (explicitly) make-checked in parallel. On certain targets, it has been necessary to do this explicitly for a long time, submitting make check-gcc, make check-fortran, make check-g++ separately. Perhaps a script could be made which would detect when the build is complete, then submit the separate make check serial jobs together. -- Tim Prince
Re: GCC vs ICC
On 3/22/2010 7:46 PM, Rayne wrote: Hi all, I'm interested in knowing how GCC differs from Intel's ICC in terms of the optimization levels and catering to specific processor architecture. I'm using GCC 4.1.2 20070626 and ICC v11.1 for Linux. How does ICC's optimization levels (O1 to O3) differ from GCC, if they differ at all? The ICC is able to cater specifically to different architectures (IA-32, intel64 and IA-64). I've read that GCC has the -march compiler option which I think is similar, but I can't find a list of the options to use. I'm using Intel Xeon X5570, which is 64-bit. Are there any other GCC compiler options I could use that would cater my applications for 64-bit Intel CPUs? Some of that seems more topical on the Intel software forum for icc, and the following more topical on either that forum or gcc-help, where you should go for follow-up. If you are using gcc on Xeon 5570, gcc -mtune=barcelona -ffast-math -O3 -msse4.2 might be a comparable level of optimization to icc -xSSE4.2 For gcc 4.1, you would have to set also -ftree-vectorize, but you would be better off with a current version. But, if you are optimizing for early Intel 64-bit Xeon, -mtune=barcelona would not be consistently good, and you could not use -msse4 or -xSSE4.2. For optimization which observes standards and also disables vectorized sum reduction, you would omit -ffast-math for gcc, and set icc -fp-model source. -- Tim Prince
Re: Compiler option for SSE4
On 3/23/2010 11:02 PM, Rayne wrote: I'm using GCC 4.1.2 20070626 on a server with Intel Xeon X5570. How do I turn on the compiler option for SSE4? I've tried -msse4, -msse4.1 and -msse4.2, but they all returned the error message cc1: error: unrecognized command line option "-msse4.1" (for whichever option I tried). You would need a gcc version which supports sse4. As you said yourself, your version is approaching 3 years old. Actually, the more important option for Xeon 55xx, if you are vectorizing, is the -mtune=barcelona, which has been supported for about 2 years. Whether vectorizing or not, on an 8 core CPU, the OpenMP introduced in gcc 4.2 would be useful. This looks like a gcc-help mail list question, which is where you should submit any follow-up. -- Tim Prince
Re: Optimizing floating point *(2^c) and /(2^c)
On 3/29/2010 10:51 AM, Geert Bosch wrote: On Mar 29, 2010, at 13:19, Jeroen Van Der Bossche wrote: I've recently written a program where taking the average of 2 floating point numbers was a real bottleneck. I've looked into the assembly generated by gcc -O3 and apparently gcc treats multiplication and division by a hard-coded 2 like any other multiplication with a constant. I think, however, that *(2^c) and /(2^c) for floating points, where the c is known at compile-time, should be able to be optimized with the following pseudo-code: e = exponent bits of the number if (e > c && e < (0b111...11)-c) { e += c or e -= c } else { do regular multiplication } Even further optimizations may be possible, such as bitshifting the significand when e=0. However, that would require checking for a lot of special cases and require so many conditional jumps that it's most likely not going to be any faster. I'm not skilled enough with assembly to write this myself and test if this actually performs faster than how it's implemented now. Its performance will most likely also depend on the processor architecture, and I could only test this code on one machine. Therefore I ask those who are familiar with gcc's optimization routines to give this 2 seconds of thought, as this is probably rather easy to implement and many programs could benefit from this. For any optimization suggestions, you should start with showing some real, compilable, code with a performance problem that you think the compiler could address. Please include details about compilation options, GCC versions and target hardware, as well as observed performance numbers. How do you see that averaging two floating point numbers is a bottleneck? This should only be a single addition and multiplication, and will execute in a nanosecond or so on a moderately modern system. Your particular suggestion is flawed. Floating-point multiplication is very fast on most targets.
It is hard to see how on any target with floating-point hardware, manual mucking with the representation can be a win. In particular, your sketch doesn't at all address underflow and overflow. Likely a complete implementation would be many times slower than a floating-point multiply. -Geert gcc used to have the ability to replace division by a power of 2 by an fscale instruction, for appropriate targets (maybe still does). Such targets have nearly disappeared from everyday usage. What remains is the possibility of replacing the division by a constant power of 2 by multiplication, but it's generally considered the programmer should have done that in the beginning. icc has such a facility, but it's subject to -fp-model=fast (equivalent to gcc -ffast-math -fno-cx-limited-range), even though it's a totally safe conversion. As Geert indicated, it's almost inconceivable that a correct implementation which takes care of exceptions could match the floating point hardware performance, even for a case which starts with operands in memory (but you mention the case following an addition). -- Tim Prince
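For code that genuinely wants exponent-adjustment scaling, the portable route is std::ldexp, which scales by a power of two and correctly handles the overflow, underflow, and subnormal cases that a hand-rolled exponent hack would have to special-case. A sketch:

```cpp
#include <cmath>

// std::ldexp(x, k) computes x * 2^k, typically by exponent manipulation,
// while still handling overflow, underflow, and subnormal inputs correctly.
double halve(double x)   { return std::ldexp(x, -1); }  // x / 2
double times16(double x) { return std::ldexp(x, 4);  }  // x * 16
```

Scaling by a power of two is exact for any finite value that stays in range, so these return bit-identical results to the plain multiply or divide.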
Re: GCC primary/secondary platforms?
On 4/7/2010 9:17 AM, Gary Funck wrote: On 04/07/10 11:11:05, Diego Novillo wrote: Additionally, make sure that the branch bootstraps and tests on all primary/secondary platforms with all languages enabled. Diego, thanks for your prompt reply and suggestions. Regarding the primary/secondary platforms. Are those listed here? http://gcc.gnu.org/gcc-4.5/criteria.html Will there be a notification if and when C++ run-time will be ready to test on secondary platforms, or will platforms like cygwin be struck from the secondary list? I'm 26 hours into testsuite for 4.5 RC for cygwin gcc/gfortran, didn't know of any other supported languages worth testing. My ia64 box died a few months ago, but suse-linux surely was at least as popular as unknown-linux in recent years. -- Tim Prince
Re: GCC primary/secondary platforms?
On 4/8/2010 2:40 PM, Dave Korn wrote: On 07/04/2010 19:47, Tim Prince wrote: Will there be a notification if and when C++ run-time will be ready to test on secondary platforms, or will platforms like cygwin be struck from the secondary list? What exactly are you talking about? Libstdc++-v3 builds just fine on Cygwin. Our release criteria for the secondary platforms is: * The compiler bootstraps successfully, and the C++ runtime library builds. * The DejaGNU testsuite has been run, and a substantial majority of the tests pass. We pass both those criteria with flying colours. What are you worrying about? cheers, DaveK No one answered questions about why libstdc++ configure started complaining about mis-match in style of wchar support a month ago. Nor did I see anyone give any changes in configure procedure. Giving it another try at a new download today. -- Tim Prince
Re: GCC primary/secondary platforms?
On 4/8/2010 6:24 PM, Dave Korn wrote: Nor did I see anyone give any changes in configure procedure. Giving it another try at a new download today. Well, nothing has changed, but then again I haven't seen anyone else complaining about this, so there's probably some problem in your build environment; let's see what happens with your fresh build. (I've built the 4.5.0-RC1 candidate without any complications and am running the tests right now.) Built OK this time around, no changes here either, except for cygwin1 update. testsuite results in a couple of days. Thanks. -- Tim Prince
Re: Why not contribute? (to GCC)
On 4/23/2010 1:05 PM, HyperQuantum wrote: On Fri, Apr 23, 2010 at 9:58 PM, HyperQuantum wrote: On Fri, Apr 23, 2010 at 8:39 PM, Manuel López-Ibáñez wrote: What reasons keep you from contributing to GCC? The lack of time, for the most part. I submitted a feature request once. It's now four years old, still open, and the last message it received was two years ago. (PR26061) The average time for acceptance of a PR with a patch submission from an outsider such as ourselves is over 2 years, and by then the patch no longer fits, has to be reworked, and is about to become moot. I still have the FSF paperwork in force, as far as I know, from over a decade ago, prior to my current employment. Does it become valid again upon termination of employment? My current employer has no problem with the FSF paperwork for employees whose primary job is maintenance of gnu software (with committee approval), but this does not extend to those of us for whom it is a secondary role. There once was a survey requesting responses on how our FSF submissions compared before and after current employment began, but no summary of the results was ever published. -- Tim Prince
Re: Autovectorizing does not work with classes
Georg Martius wrote: > Dear gcc developers, > > I am new to this list. > I tried to use the auto-vectorization (4.2.1 (SUSE Linux)) but unfortunately > with limited success. > My code is basically a matrix library in C++. The vectorizer does not like > the member variables. Consider this code compiled with > gcc -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5 > -funsafe-math-optimizations > that gives basically "not vectorized: unhandled data-ref" > > class P{ > public: > P() : m(5),n(3) { > double *d = data; > for (int i=0; i<m*n; i++) > d[i] = i/10.2; > } > void test(const double& sum); > private: > int m; > int n; > double data[15]; > }; > > void P::test(const double& sum) { > double *d = this->data; > for(int i=0; i<m*n; i++) { > d[i]+=sum; > } > } > > whereas the more or less equivalent C version works just fine: > > int m=5; > int n=3; > double data[15]; > > void test(const double& sum) { > int mn = m*n; > for(int i=0; i<mn; i++) { > data[i]+=sum; > } > } > > > Is there a fundamental problem in using the vectorizer in C++? > I don't see any C code above. As another reply indicated, the most likely C idiom would be to pass sum by value. Alternatively, you could use a local copy of sum, in cases where that is a problem. The only fundamental vectorization problem I can think of which is specific to C++ is the lack of a standard restrict keyword. In g++, __restrict__ is available. A local copy (or value parameter) of sum avoids a need for the compiler to recognize const or restrict as an assurance of no value modification. The loop has to have known fixed bounds at entry, in order to vectorize. If your C++ style doesn't support that, e.g. by calculating the end value outside the loop, as you show in your latter version, then you do have a problem with vectorization.
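Putting the two suggestions together (pass sum by value, hoist the loop bound into a local), a version of the quoted class that gives the vectorizer a fixed trip count and no possible aliasing of sum with the data, offered as a sketch rather than a verified vectorization result:

```cpp
class P {
public:
    P() : m(5), n(3) {
        const int mn = m * n;       // bound fixed before the loop
        for (int i = 0; i < mn; ++i)
            data[i] = i / 10.2;
    }
    void test(double sum) {         // by value: cannot alias data or change mid-loop
        double* d = data;
        const int mn = m * n;       // hoisted, as in the working C version
        for (int i = 0; i < mn; ++i)
            d[i] += sum;
    }
    double at(int i) const { return data[i]; }
private:
    int m;
    int n;
    double data[15];
};
```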
Re: question. type long long
Александр Струняшев wrote: > Good afternoon. > I need some help. As from what versions your compiler understand that > "long long" is 64 bits ? > > Best regards, Alexander > > P.S. Sorry for my mistakes, I know English bad. No need to be sorry about English, but the topic is OK for gcc-help, not gcc development. gcc was among the first compilers to support long long (always as 64-bit), the only problem being that it was a gnu extension for g++. In that form, the usage may not have settled down until g++ 4.1. The warnings for attempting long long constants in 32-bit mode, without the LL suffix, have been a subject of discussion: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13358 The warning doesn't mean that long long could be less than 64 bits; it means the constant without the LL suffix is less than 64 bits.
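The distinction in the last sentence fits in two lines: the type is 64 bits regardless, but an unsuffixed constant gets its own type before any assignment happens, so a constant wider than long needs the LL suffix on 32-bit targets. A minimal example:

```cpp
// long long has been 64 bits in gcc from the start; the LL suffix is about
// the constant's own type, not the variable's.
static_assert(sizeof(long long) * 8 >= 64, "long long is at least 64 bits");

const long long big = 0x100000000LL;   // 2^32: needs LL on ILP32 targets
```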
Re: Backward Compatibility of RHEL Advanced Server and GCC
Steven Bosscher wrote: On Wed, Oct 29, 2008 at 6:19 AM, S. Suhasini <[EMAIL PROTECTED]> wrote: We would like to know whether the new version of the software (compiled with the new GCC) can be deployed and run on the older setup with RHEL AS 3 and GCC 2.96. We need not compile again on the older setup. Will there be any run-time libraries dependency? Would be very grateful if we get a response for this query. It seems to me that this kind of question is best asked on a RedHat support list, not on a list where compiler development is discussed. FWIW, there is no "official" GCC 2.96, see http://gcc.gnu.org/gcc-2.96.html. This might be partially topical on the gcc-help list. If dynamic libraries are in use, there will be trouble.
Re: Cygwin support
Brian Dessent wrote: > Cygwin has been a secondary target for a number of years. MinGW has > been a secondary target since 4.3. This generally means that they > should be in fairly good shape, more or less. To quote the docs: > >> Our release criteria for the secondary platforms is: >> >> * The compiler bootstraps successfully, and the C++ runtime library >> builds. >> * The DejaGNU testsuite has been run, and a substantial majority of the >> tests pass. > > > More recently I've seen Danny Smith report that the IRA merge broke > MinGW (and presumably Cygwin, since they share most of the same code) > bootstrap. I haven't tested this myself recently so I don't know if > it's still broken or not. > I've run the bootstrap and testsuite twice in the last month. The bootstrap failures are due to a broken #ifdef specific to cygwin in the headers provided with cygwin, the requirement for a specific version of autoconf (not available in setup), and the need to remove the -Werror in libstdc++ build (because of minor discrepancies in cygwin headers). All of those are easy to rectify, but fixes seem unlikely to be considered by the decision makers. However, the C++ testsuite results are unacceptable, with many internal errors. For some time now, gfortran has been broken for practical purposes, even when it passes testsuite, as it seems to have a memory leak. This shows up in the public wiki binaries. So, there are clear points for investigation of cygwin problems, and submission of PRs, should you be interested. > Running the dejagnu testsuite on Cygwin is > excruciatingly slow due to the penalty incurred from emulating fork. It runs over a weekend on a Pentium D which I brought back to life by replacing the CPU cooler system.
I have no problem with running this if I am in the office when the snapshot is released, but I think there is little interest in fixing the problems which are specific to g++ on cygwin, yet working gcc and gfortran aren't sufficient for gcc upgrades to be accepted. Support for 64-bit native looks like it will be limited to mingw, so I no longer see a future for gcc on cygwin.
Re: Purpose of GCC Stack Padding?
Andrew Tomazos wrote: I've been studying the x86 compiled form of the following function: void function() { char buffer[X]; } where X = 0, 1, 2 .. 100 Naively, I would expect to see: pushl %ebp movl %esp, %ebp subl $X, %esp leave ret Instead, the stack appears to be padded: For a buffer size of 0 the stack size is 0 For a buffer size of 1 to 7 the stack size is 16 For a buffer size of 8 to 12 the stack size is 24 For a buffer size of 13 to 28 the stack size is 40 For a buffer size of 29 to 44 the stack size is 56 For a buffer size of 45 to 60 the stack size is 72 For a buffer size of 61 to 76 the stack size is 88 For a buffer size of 77 to 92 the stack size is 104 For a buffer size of 93 to 100 the stack size is 120 When X >= 8 gcc adds a stack corruption check (__stack_chk_fail), which accounts for an extra 4 bytes of stack space in these cases. This does not explain the rest of the padding. Can anyone explain the purpose of the rest of the padding? This looks like more of a gcc-help question, trying to move the thread there. Unless you override defaults with -mpreferred-stack-boundary (or -Os, which probably implies a change in stack boundary), or ask for a change on the basis of making a leaf function, you are generating alignment compatible with the use of SSE parallel instructions. The stack, then, must be 16-byte aligned before entry and at exit, and also a buffer of 16 bytes or more must be 16-byte aligned. I believe there is a move afoot to standardize the treatment for the most common x86 32-bit targets; that was done at the beginning for 64-bit. Don't know if you are using x86 to imply 32-bit, in accordance with Windows terminology.
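The 16-byte requirement behind those numbers can be made explicit in source (a hypothetical illustration; gcc's stack padding provides this implicitly for buffers of 16 bytes or more): SSE movaps faults on misaligned addresses, so anything it may touch must sit on a 16-byte boundary.

```cpp
#include <cstdint>

// A 16-byte-aligned local buffer, like the padded stack slots discussed
// above; alignas makes the alignment guarantee explicit and portable.
bool stack_buffer_is_aligned() {
    alignas(16) char buffer[16] = {};
    return reinterpret_cast<std::uintptr_t>(buffer) % 16 == 0;
}
```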
Re: Upgrade to GCC.4.3.2
Philipp Thomas wrote: > On Sun, 28 Dec 2008 14:24:22 -0500, you wrote: > >> I have SLES9 and Linux-2.6.5-7.97 kernel install on i586 intel 32 bit >> machine. The compiler is gcc-c++3.3.3-43.24. I want to upgrade to >> GCC4.3.2. My question are: Would this upgrade work with >> SLES9? > > This is the wrong list for such questions. You should try a SUSE > specific list like opens...@opensuse.org or > opensuse-programm...@opensuse.org gcc-help is a reasonable choice as well.
Re: gcc binary download
Tobias Burnus wrote: > > Otherwise, you could consider building GCC yourself, cf. > http://gcc.gnu.org/install/. (Furthermore, some gfortran developers > offer regular GCC builds, which are linked at > http://gcc.gnu.org/wiki/GFortranBinaries; those are all unofficial > builds, come without any warranty/support, and due to, e.g., library > issues they may not work on your system.) > I believe the wiki builds include C and Fortran, but not C++, in view of the additional limitations in supporting a new g++ on a reasonable range of targets. Even so, there may be minimum requirements on glibc and binutils versions.
Re: Binary Autovectorization
Rodrigo Dominguez wrote: > I am looking at binary auto-vectorization or taking a binary and rewriting > it to use SIMD instructions (either statically or dynamically). That's a tall order, considering how much source level dependency information is needed. I don't know whether proprietary binary translation projects currently under way promise to add vectorization, or just to translate SIMD vector code to new ISA.
Re: -mfpmath=sse,387 is experimental ?
Zuxy Meng wrote: > Hi, > > "Timothy Madden" wrote in message: >> I am sure having twice the number of registers (sse+387) would make a >> big difference. You're not counting the rename registers, you're talking about 32-bit mode only, and you're discounting the different mode of accessing the registers. >> >> How would I know if my AMD Sempron 2200+ has separate execution units >> for SSE and >> FPU instructions, with independent registers ? > > Most CPUs use the same FP unit for both x87 and SIMD operations so it > wouldn't give you double the performance. The only exception I know of > is K6-2/3, whose x87 and 3DNow! units are separate. > -march=pentium-m observed the preference of those CPUs for mixing the types of code. This was due more to the limited issue rate for SSE instructions than to the expanded number of registers in use. You are welcome to test it on your CPU; however, AMD CPUs were designed to perform well with SSE alone, particularly in 64-bit mode.
Re: GCC 4.4.0 Status Report (2009-03-13)
Chris Lattner wrote: > > On Mar 23, 2009, at 8:02 PM, Jeff Law wrote: > >> Chris Lattner wrote: > These companies really don't care about FOSS in the same way GCC developers do. I'd be highly confident that this would still be a serious issue for the majority of the companies I've interacted with through the years. >>> >>> Hi Jeff, >>> >>> Can you please explain the differences you see between how GCC >>> developers and other people think about FOSS? I'm curious about your >>> perception here, and what basis it is grounded on. >>> >> I'd divide customers into two broad camps. Both camps are extremely >> pragmatic, but they're focused on two totally different goals. > > Thanks Jeff, I completely agree with you. Those camps are very common > in my experience as well. Do you consider GCC developers to fall into > one of these two categories, or do you see them as having a third > perspective? I know that many people have their own motivations and > personal agenda (and it is hard to generalize) but I'm curious what you > meant above. > > Thanks! > > -Chris > >> >> >> The first camp sees FOSS toolkits as a means to help them sell more >> widgets, typically processors & embedded development kits. Their >> belief is that a FOSS toolkit helps build a developer eco-system >> around their widget, which in turn spurs development of consumable >> devices which drive processor & embedded kit sales. The key for >> these guys is free, as in beer, widely available tools. The fact that >> the compiler & assorted utilities are open-source is largely irrelevant. >> >> The second broad camp I run into regularly are software developers >> themselves building applications, most often for internal use, but >> occasionally they're building software that is then licensed to their >> customers. They'd probably describe the compiler & associated >> utilities as a set of hammers, screwdrivers and the like -- they're >> just as happy using GCC as any other compiler so long as it works. 
>> The fact that the GNU tools are open source is completely irrelevant >> to these guys. They want to see standards compliance, abi >> interoperability, and interoperability with other tools (such as >> debuggers, profilers, guis, etc). They're more than willing to swap >> out one set of tools for another if it gives them some advantage. >> Note that an advantage isn't necessarily compile-time or runtime >> performance -- it might be ease of use, which they believe allows >> their junior level engineers to be more effective (this has come up >> consistently over the last few years). >> >> Note that in neither case do they really care about the open-source >> aspects of their toolchain (or for the most part the OS either). >> They may (and often do) like the commoditization of software that FOSS >> tends to drive, but don't mistake that for caring about the open >> source ideals -- it's merely cost-cutting. >> >> Jeff >> >> > Software developers I deal with use gcc because it's a guaranteed included part of the customer platforms they are targeting. They're generally looking for a 20% gain in performance plus support before adopting commercial alternatives. The GUIs they use don't live up to the advertisements about ease of use. This doesn't necessarily put them in either of Jeff's camps. Tim
Re: Minimum GMP/MPFR version bumps for GCC-4.5
Kaveh R. Ghazi wrote: > What versions of GMP/MPFR do you get on > your typical development box and how old are your distros? > OpenSuSE 10.3 (originally released Oct. 07): gmp-devel-4.2.1-58 gmp-devel-32bit-4.2.1-58 mpfr-2.2.1-45
Re: heise.de comment on 4.4.0 release
Tobias Burnus wrote: > Toon Moene wrote: Can somebody with access to SPEC sources confirm / deny and file a bug report, if appropriate? I just started working on SPEC CPU2006 issues this week. > Seemingly yes. To a certain extent this was by accident as "-msse3" was > used, but it is on i586 only effective with -mfpmath=sse (that is not > completely obvious). By the way, my tests using the Polyhedron benchmark > show that for 32bit, x87 and SSE are similarly fast, depending a lot on > the test case, thus it does not slow down the benchmark too much. Certain AMD CPUs had shorter latencies for scalar single precision sse, but generally the advantage of sse comes from vectorization. > > If I understood correctly, the 32bit mode was used since the 64bit mode > needs more than the available 2GB memory. Certain commercial compilers make an effort to switch to 32-bit mode automatically on several CPU2006 benchmarks, as they are too small to run as fast in 64-bit mode. > > Similarly, the option -funroll-loops was avoided as they expect that > unrolling badly interacts with the small cache Atom processors have. > (That CPU2006 runs that long does not make testing different options > that easy.) I'm surprised that SPEC 2006 is considered relevant to Atom. The entire thing (base only) has been running under 10 hours on a dual quad core system. I've heard several times the sentiment that there ought to be an "official" harness to run a single test, trying various options. > I would have liked that the options were reported. For instance, > -ffast-math was not used out of fear that it results in too imprecise > results, causing SPEC to abort. (Admittedly, I'm also careful with that > option, though I assume that -ffast-math works for SPEC.) On the other > hand, certain flags implied by -ffast-math are already applied with -O1 > in some commercial compilers. SPEC probably has been the biggest driver for inclusion of insane options at default in commercial compilers. 
It's certainly not an example of acceptable practice in writing portable code. I have yet to find a compiler which didn't fail at least one SPEC test, and I don't blame the compilers. There are dependencies on unusual C++ extensions, which somehow weren't noticed before, examples of using "f77" as an excuse for hiding one's intentions, and expectations of optimizations which have little relevance for serious applications. > > David Korn wrote: >> They accused us of a too-hasty release. My irony meter exploded! Anyway, a fault in support for a not-so-open benchmark application seems even less relevant in an open source effort than it is to compilers which depend on ranking for sales success.
Re: Bootstrap broken by ppl/cloog config problem: finds non-system/non-standard "/include" dir
Dave Korn wrote: > > Heh, I was just about to post that, only I was looking at $clooginc rather > than $pplinc! The same problem exists for both; I'm pretty sure we should > fall back on $prefix if the --with option is empty. > When I bootstrapped gcc 4.5 on cygwin yesterday, configure recognized the newly installed ppl, but not the cloog. The bootstrap completed successfully, and I'm not looking a gift horse in the mouth.
Re: Bootstrap broken by ppl/cloog config problem: finds non-system/non-standard "/include" dir
Dave Korn wrote: > Tim Prince wrote: >> Dave Korn wrote: >> >>> Heh, I was just about to post that, only I was looking at $clooginc rather >>> than $pplinc! The same problem exists for both; I'm pretty sure we should >>> fall back on $prefix if the --with option is empty. >>> >> When I bootstrapped gcc 4.5 on cygwin yesterday, configure recognized the >> newly installed ppl, but not the cloog. The bootstrap completed >> successfully, and I'm not looking a gift horse in the mouth. > > You don't have a bogus /include dir, but I bet you'll find -I/include in > PPLINC. > > It would be interesting to know why it didn't spot cloog. What's in your > top-level $objdir/config.log? > #include no such file -I/include was set by configure. As you say, there is something bogus here. setup menu shows cloog installed in development category, but I can't find any such include file. Does this mean the cygwin distribution of cloog is broken?
Re: Bootstrap broken by ppl/cloog config problem: finds non-system/non-standard "/include" dir
Dave Korn wrote: Tim Prince wrote: #include no such file -I/include was set by configure. As you say, there is something bogus here. setup menu shows cloog installed in development category, but I can't find any such include file. Does this mean the cygwin distribution of cloog is broken? Did you make sure to get the -devel packages as well as the libs? That's the usual cause of this kind of problem. I highly recommend the new version of setup.exe that has a package-list search box :-) cheers, DaveK OK, I see there is a libcloog-devel in addition to the cloog Dev selection, guess that will fix it for cygwin. I tried to build cloog for IA64 linux as well, gave up on include file parsing errors.
Re: [Fwd: Failure in bootstrapping gfortran-4.5 on Cygwin]
Ian Lance Taylor wrote: Angelo Graziosi writes: The current snapshot 4.5-20090507 fails to bootstrap on Cygwin: It did bootstrap effortlessly for me, once I logged off to clear hung processes, with the usual disabling of strict warnings. I'll let testsuite run over the weekend.
Re: Failure building current 4.5 snapshot on Cygwin
Angelo Graziosi wrote: > I want to flag the following failure I have seen on Cygwin 1.5 trying to > build current 4.5-20090625 gcc snapshot: > checking whether the C compiler works... configure: error: in > `/tmp/build/intl': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details. I met the same failure on Cygwin 1.7 with yesterday's and last week's snapshots. I didn't notice that it refers to intl/config.log, so will go back and look, as you didn't show what happened there. On a slightly related subject, I have shown that the libgfortran.dll.a and libgomp.dll.a are broken on cygwin builds, including those released for cygwin, as shown by the test case I submitted on the cygwin list earlier this week. The --enable-shared option has never been satisfactory for gfortran on cygwin.
Re: Failure building current 4.5 snapshot on Cygwin
Dave Korn wrote: Angelo Graziosi wrote: I want to flag the following failure I have seen on Cygwin 1.5 trying to build current 4.5-20090625 gcc snapshot: So what's in config.log? And what binutils are you using? cheers, DaveK In my case, it says no permission to execute a.exe. However, I can run the intl configure and make from the command line. When I do that, and attempt to restart stage 2, it stops in libiberty, and again I have to execute the steps from the command line.
Re: Failure building current 4.5 snapshot on Cygwin
Kai Tietz wrote: 2009/6/26 Seiji Kachi : Angelo Graziosi wrote: Dave Korn ha scritto: Angelo Graziosi wrote: I want to flag the following failure I have seen on Cygwin 1.5 trying to build current 4.5-20090625 gcc snapshot: So what's in config.log? And what binutils are you using? The config logs are attached, while binutils is the current in Cygwin-1.5, i.e. 20080624-2. Cheers, Angelo. I have also seen a similar failure, and the reason in my environment is as follows. (1) In my case, the gcc build completes successfully. But a.exe, which is compiled by the new compiler, fails. Error message is $ ./a.exe bash: ./a.exe: Permission denied Source code of a.exe is quite simple: main() { printf("Hello\n"); } (2) This failure occurs from gcc trunk r148408. r148407 is OK. (3) r148408 removed "#ifdef DEBUG_PUBTYPES_SECTION". r148407 does not generate a debug_pubtypes section, but r148408 and later versions generate a debug_pubtypes section in the object when we set the debug option. (4) The gcc build sequence usually uses the debug option. (5) My cygwin environment seems not to accept the debug_pubtypes section, and pops up a "Permission denied" error. When I reverted "#ifdef DEBUG_PUBTYPES_SECTION" in dwarf2out.c, the failure disappeared. Does this failure occur only on cygwin? Regards, Seiji Kachi No, this bug appeared on all windows pe-coff targets. A fix for this was already checked in yesterday on binutils. Could you try it with the current binutils head version? Cheers, Kai Is this supposed to be sufficient information for us to find that binutils? I may be able to find an insider colleague, otherwise I would have no chance.
Re: Failure building current 4.5 snapshot on Cygwin
Kai Tietz wrote: 2009/6/26 Tim Prince : [snip: quoted text repeated from the previous message] Is this supposed to be sufficient information for us to find that binutils? I may be able to find an insider colleague, otherwise I would have no chance. Hello, you can find the binutils project as usual under http://sources.redhat.com/binutils/ . You can find on this page how you are able to get the current cvs version of binutils. This project contains the gnu tools, like dlltool, as, objcopy, ld, etc. 
The issue you are running into is caused by a failure in binutils to set correct section flags for debugging sections. It was exposed by the last change in gcc, the output of the .debug_pubtypes section. There is a patch already applied to the binutils repository head, which should solve the issue described here in this thread. We at mingw-w64 already ran into this issue and have taken care of it. Cheers, Kai My colleague suggested building and installing last week's binutils release. I did so, but it didn't remove the need to run each stage 2 configure individually from the command line. Thanks, Tim
Re: random numbers
ecrosbie wrote: how do I generate random numbers in a f77 program? Ed Crosbie
Re: random numbers
ecrosbie wrote: how do I generate random numbers in a f77 program? Ed Crosbie This subject isn't topical on the gcc development forum. If you wish to use a gnu Fortran random number generator, please consider gfortran, which implements the language standard random number facility. http://gcc.gnu.org/onlinedocs/gcc-4.4.0/gfortran/ questions might be asked on the gfortran list (follow-up set) or comp.lang.fortran In addition, you will find plenty of other advice by using your web browser.
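For reference, the standard facility gfortran implements is the RANDOM_NUMBER intrinsic (with RANDOM_SEED to initialize it); a minimal sketch in modern free-form source rather than f77:

```fortran
program rng_demo
  implicit none
  real :: r(5)
  call random_seed()       ! initialize the generator (implementation-defined default)
  call random_number(r)    ! fill r with uniform deviates in [0,1)
  print *, r
end program rng_demo
```

Both intrinsics are part of the Fortran standard, so this is portable across standard-conforming compilers, not just gfortran.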
Re: optimizing a DSO
On 5/28/2010 11:14 AM, Ian Lance Taylor wrote: Quentin Neill writes: A little off topic, but by what facility does the compiler know the linker (or assembler for that matter) is gnu? When you run configure, you can specify --with-gnu-as and/or --with-gnu-ld. If you do, the compiler will assume the GNU assembler or linker. If you do not, the compiler will assume that you are not using the GNU assembler or linker. In this case the compiler will normally use the common subset of command line options supported by the native assembler and the GNU assembler. In general that only affects the compiler behaviour on platforms which support multiple assemblers and/or linkers. E.g., on GNU/Linux, we always assume the GNU assembler and linker. There is an exception. If you use --with-ld, the compiler will run the linker with the -v option and grep for GNU in the output. If it finds it, it will assume it is the GNU linker. The reason for this exception is that --with-ld gives a linker which will always be used. The assumption when no specific linker is specified is that you might wind up using any linker available on the system, depending on the value of PATH when running the compiler. Ian Is it reasonable to assume when the configure test reports using GNU linker, it has taken that "exception," even without a --with-ld specification? -- Tim Prince
Re: gcc command line exceeds 8191 when building in XP
On 7/19/2010 4:13 PM, IceColdBeer wrote: Hi, I'm building a project using GNU gcc, but the command line used to build each source file sometimes exceeds 8191 characters, which is the maximum supported command line length under Win XP. Even worse under Win 2000, where the maximum command line length is limited to 2047 characters. Can the GNU gcc read the build options from a file instead? I have searched, but cannot find an option in the documentation. Thanks in advance, ICB Redirecting to gcc-help. The gcc builds for Windows themselves use a scheme for splitting the link into multiple steps in order to deal with command line length limits. I would suggest adapting that. Can't study it myself now while travelling. -- Tim Prince
Re: x86 assembler syntax
On 8/8/2010 10:21 PM, Rick C. Hodgin wrote: All, Is there an Intel-syntax compatible option for GCC or G++? And if not, why not? It's so much cleaner than AT&T's. - Rick C. Hodgin I don't know how you get along without a search engine. What about http://tldp.org/HOWTO/Assembly-HOWTO/gas.html ? -- Tim Prince
Re: food for optimizer developers
On 8/10/2010 9:21 PM, Ralf W. Grosse-Kunstleve wrote: Most of the time is spent in this function... void dlasr( str_cref side, str_cref pivot, str_cref direct, int const& m, int const& n, arr_cref c, arr_cref s, arr_ref a, int const& lda) in this loop: FEM_DOSTEP(j, n - 1, 1, -1) { ctemp = c(j); stemp = s(j); if ((ctemp != one) || (stemp != zero)) { FEM_DO(i, 1, m) { temp = a(i, j + 1); a(i, j + 1) = ctemp * temp - stemp * a(i, j); a(i, j) = stemp * temp + ctemp * a(i, j); } } } a(i, j) is implemented as T* elems_; // member T const& operator()( ssize_t i1, ssize_t i2) const { return elems_[dims_.index_1d(i1, i2)]; } with ssize_t all[Ndims]; // member ssize_t origin[Ndims]; // member size_t index_1d( ssize_t i1, ssize_t i2) const { return (i2 - origin[1]) * all[0] + (i1 - origin[0]); } The array pointer is buried as elems_ member in the arr_ref<> class template. How can I apply __restrict in this case? Do you mean you are adding an additional level of functions and hoping for efficient in-lining? Your programming style is elusive, and your insistence on top posting will make this thread difficult to deal with. The conditional inside the loop likely is even more difficult for C++ to optimize than Fortran. As already discussed, if you don't optimize otherwise, you will need __restrict to overcome aliasing concerns among a,c, and s. If you want efficient C++, you will need a lot of hand optimization, and verification of the effect of each level of obscurity which you add. How is this topic appropriate to gcc mail list? -- Tim Prince
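To illustrate the __restrict point: once the element access is expressed through raw pointers rather than the poster's arr_ref<> template, C99 restrict (or GNU __restrict in C++) can promise the compiler that the three arrays do not alias, which is the main obstacle to vectorizing the quoted dlasr loop. This is a sketch with hypothetical names and an assumed column-major, 0-based layout, not the poster's actual class:

```c
#include <stddef.h>

/* Apply the rotation loop quoted above to an m x n column-major matrix a,
   with rotation coefficients c[0..n-2], s[0..n-2].  restrict asserts that
   a, c and s do not overlap, removing the aliasing concern. */
static void apply_rotations(int m, int n,
                            const double *restrict c,
                            const double *restrict s,
                            double *restrict a)
{
    for (int j = n - 2; j >= 0; --j) {      /* mirrors FEM_DOSTEP(j, n-1, 1, -1) */
        double ctemp = c[j], stemp = s[j];
        if (ctemp != 1.0 || stemp != 0.0) {
            for (int i = 0; i < m; ++i) {
                double temp = a[i + (j + 1) * m];
                a[i + (j + 1) * m] = ctemp * temp - stemp * a[i + j * m];
                a[i + j * m]       = stemp * temp + ctemp * a[i + j * m];
            }
        }
    }
}
```

With the template-wrapped accessor, the equivalent is harder: the restrict qualification would have to reach the elems_ member itself, which is one reason hand optimization is suggested in the reply.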
Re: End of GCC 4.6 Stage 1: October 27, 2010
On 9/6/2010 9:21 AM, Richard Guenther wrote: On Mon, Sep 6, 2010 at 6:19 PM, NightStrike wrote: On Mon, Sep 6, 2010 at 5:21 AM, Richard Guenther wrote: On Mon, 6 Sep 2010, Tobias Burnus wrote: Gerald Pfeifer wrote: Do you have a pointer to testresults you'd like us to use for reference? From our release criteria, for secondary platforms we have: • The compiler bootstraps successfully, and the C++ runtime library builds. • The DejaGNU testsuite has been run, and a substantial majority of the tests pass. See for instance: http://gcc.gnu.org/ml/gcc-testresults/2010-09/msg00295.html There are no libstdc++ results in that. Richard. This is true. I always run make check-gcc. What should I be doing instead? make -k check make check-c++ runs both g++ and libstdc++-v3 testsuites. -- Tim Prince
Re: Turn on -funroll-loops at -O3?
On 1/21/2011 10:43 AM, H.J. Lu wrote: Hi, Since -O3 turns on the vectorizer, should it also turn on -funroll-loops? Only if a conservative default value for max-unroll-times is set, e.g. 2 <= value <= 4. -- Tim Prince
Re: Why doesn't the vectorizer skip loop peeling/versioning for targets that support hardware misaligned access?
On 1/24/2011 5:21 AM, Bingfeng Mei wrote: Hello, Some of our target processors support complete hardware misaligned memory access. I implemented movmisalignm patterns, and found TARGET_SUPPORT_VECTOR_MISALIGNMENT (TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT On 4.6) hook is based on checking these patterns. Somehow this hook doesn't seem to be used. vect_enhance_data_refs_alignment is called regardless whether the target has HW misaligned support or not. Shouldn't using HW misaligned memory access be better than generating extra code for loop peeling/versioning? Or at least if for some architectures it is not the case, we should have a compiler hook to choose between them. BTW, I mainly work on 4.5, maybe 4.6 has changed. Thanks, Bingfeng Mei Peeling for alignment still presents a performance advantage on longer loops for the most common current CPUs. Skipping the peeling is likely to be advantageous for short loops. I've noticed that 4.6 can vectorize loops with multiple assignments, presumably taking advantage of misalignment support. There's even a better performing choice of instructions for -march=corei7 misaligned access than is taken by other compilers, but that could be an accident. At this point, I'd like to congratulate the developers for the progress already evident in 4.6. -- Tim Prince
Re: Are 8-byte ints guaranteed?
Thomas Koenig wrote: Hello world, are there any platforms where gcc doesn't support 8-byte ints? Can a front end depend on this? This would make life easier for Fortran, for example, because we could use INTEGER(KIND=8) for a lot of interfaces without having to bother with checks for the presence of KIND=8 integers. No doubt, there are such platforms, although I doubt there is sufficient interest in running gfortran on them. Support for 64-bit integers on common 32-bit platforms is rather inefficient, when it is done by pairs of 32-bit integers.
Re: g77 problem for octave
[EMAIL PROTECTED] wrote: Dear Sir/Madame, I have switched my OS to SuSE Linux 10.1 and for a while trying to install "Octave" to my computer. Unfortunately, the error message below is the only thing that i got. Installing octave-2.1.64-3.i586[Local packages] There are no installable providers of gcc-g77 for octave-2.1.64-3.i586[Local packages] On my computer, the installed version of gcc is 4.1.0-25 and i could not find any compatible version of g77 to install. For the installation of octave, i need exactly gcc-g77 not gcc-fortran. Can you please help me to deal with this problem? If you are so interested in using g77 rather than gfortran, it should be easy enough to grab gcc-3.4.x sources and build g77. One would wonder why you dislike gfortran so much.
Re: Modifying the LABEL for functions emitted by the GCC Compiler
Rohit Arul Raj wrote: The gcc-coldfire compiler spits out the labels as they are in the assembly file (main, printf etc), whereas the IDE compiler spits out the labels prefixed with a '_' (_main, _printf etc). Is there any way I can make the gcc-coldfire compiler emit the labels prefixed with an underscore ('_')? Can anyone help me OUT of this mess!!! How about reconciling the -fleading-underscore options?
Re: BFD Error a regression?
Jerry DeLisle wrote: BFD: BFD 2.16.91.0.6 20060212 internal error, aborting at ../../bfd/elfcode.h line 190 in bfd_elf32_swap_symbol_in BFD: Please report this bug. make[1]: *** [complex16] Error 1 make[1]: *** Waiting for unfinished jobs BFD: BFD 2.16.91.0.6 20060212 internal error, aborting at ../../bfd/elfcode.h line 190 in bfd_elf32_swap_symbol_in BFD: Please report this bug. BFD is acknowledging that it may be buggy. Does this occur with current binutils, e.g. from ftp.kernel.org? Are you able to build g++ and libstdc++ without hitting this or similar bug? Buggy binutils is a chronic problem with RHEL, and is generally not fixed without 6 months effort by an OEM with more influence than my employer. If you hit it with a small test case, surely it will be hit with real applications sooner or later.
Re: Calculating cosinus/sinus
On 05/11/2013 11:25 AM, Robert Dewar wrote: On 5/11/2013 11:20 AM, jacob navia wrote: OK I did a similar thing. I just compiled sin(argc) in main. The results prove that you were right. The single fsin instruction takes longer than several HUNDRED instructions (calls, jumps, table lookup, what have you) Gone are the times when an fsin would take 30 cycles or so. Intel has destroyed the FPU. That's an unwarranted claim, but indeed the algorithm used within the FPU is inferior to the one in the library. Not so surprising: the one in the chip is old, and we have made good advances in learning how to calculate things accurately. Also, the library is using the fast new 64-bit arithmetic. So none of this is (or should be) surprising. In the benchmark code all that code/data is in the L1 cache. In real life code you use the sin routine sometimes, and the probability of it not being in the L1 cache is much higher; I would say almost one if you do not do sin/cos VERY often. But of course you don't really care about performance so much unless you *are* using it very often. I would be surprised if there are any real programs in which using the FPU instruction is faster. Possible, if long double precision is needed, within the range where fsin can deliver it. I take it the use of a vector sin library is excluded (not available for long double). And as noted earlier in the thread, the library algorithm is more accurate than the Intel algorithm, which is also not at all surprising. Reduction for a range well outside the basic 4 quadrants should be better in the library (note that fsin gives up for |x| > 2^64), but a double library function can hardly be claimed to be generally more accurate than a long double built-in. For the time being I will go on generating the fsin code. I will try to optimize Moshier's SIN function later on. Well, I will be surprised if you can find significant optimizations to that very clever routine. Certainly you have to be a floating-point expert to even touch it! 
Robert Dewar -- Tim Prince
Re: Calculating cosinus/sinus
On 5/12/2013 9:53 AM, Ondřej Bílka wrote: On Sun, May 12, 2013 at 02:14:31PM +0200, David Brown wrote: On 11/05/13 17:20, jacob navia wrote: Le 11/05/13 16:01, Ondřej Bílka a écrit : As 1) the only way is to measure that. Compile the following and we will see who is right. cat " #include <math.h> int main(){ int i; double x=0; double ret=0; double f; for(i=0;i<1000;i++){ ret+=sin(x); x+=0.3; } return ret; } " > sin.c OK I did a similar thing. I just compiled sin(argc) in main. The results prove that you were right. The single fsin instruction takes longer than several HUNDRED instructions (calls, jumps, table lookup, what have you) Gone are the times when an fsin would take 30 cycles or so. Intel has destroyed the FPU. What makes you so sure that it takes more than 30 cycles to execute hundreds of instructions in the library? Modern cpus often do several instructions per cycle (I am not considering multiple cores here). They can issue several instructions per cycle, and predicted jumps can often be eliminated entirely in the decode stages. To clarify the numbers here: a 30 cycle library call is unrealistic; just the latency caused by the call and the saving/restoring of xmm registers is often more than 30 cycles. A sin takes around 150 cycles for normal inputs. An fsin is slower for several reasons. One is that performance depends on input. From http://www.agner.org/optimize/instruction_tables.pdf (an interesting historical reference) fsin takes about 20-100 cycles. Those tables show up to 210 cycles for some highly reputed CPU models of various brands. This doesn't count the next issue: the second problem is that the xmm->memory->fpu->memory->xmm roundtrip is expensive. There is a performance penalty when switching between fpu and xmm instructions. Which would be a reason for fsin appearing in mathinline.h for i386 but not for the x86_64 implementations of glibc. Yes, it's popular to malign gcc developers or Intel even where it is out of their hands. 
The moral here is that /you/ need to benchmark /your/ code on /your/ processor - don't jump to conclusions, or accept other benchmarks as giving the complete picture. Agreed. -- Tim Prince
Re: RFC: SIMD pragma independent of Cilk Plus / OpenMPv4
On 9/9/2013 9:37 AM, Tobias Burnus wrote: Dear all, sometimes it can be useful to annotate loops for better vectorization, which is rather independent of parallelization. For vectorization, GCC has [0]: a) Cilk Plus's #pragma simd [1] b) OpenMP 4.0's #pragma omp simd [2] Those require -fcilkplus and -fopenmp, respectively, and activate much more. The question is whether it makes sense to provide a means to ask the compiler for SIMD vectorization without enabling all the other things of Cilk Plus/OpenMP. What's your opinion? [If one provides it, the question is whether it is always on or not, which syntax/semantics it uses [e.g. just the one of Cilk or OpenMP] and what to do with conflicting pragmas which can occur in this case.] Side remark: For vectorization, the widely supported #pragma ivdep, vector, novector can be also useful, even if they are less formally defined. "ivdep" seems to be one of the more useful ones, whose semantics one can map to a safelen of infinity in OpenMP's semantics [i.e. loop->safelen = INT_MAX]. Tobias [0] In the trunk there is currently only some initial middle-end support. OpenMP's omp simd is in the gomp-4_0-branch; Cilk Plus's simd has been submitted for the trunk at http://gcc.gnu.org/ml/gcc-patches/2013-08/msg01626.html [1] http://www.cilkplus.org/download#open-specification [2] http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf ifort/icc have a separate option -openmp-simd for the purpose of activating omp simd directives without invoking OpenMP. In the previous release, in order to activate both OpenMP parallel and omp simd, both options were required (-openmp -openmp-simd). In the new "SP1" release last week, -openmp implies -openmp-simd. Last time I checked, turning off the options did not cause the compiler to accept but ignore all omp simd directives, as I personally thought would be desirable. A few cases are active regardless of compile line option, but many will be rejected without matching options. 
Current Intel implementations of safelen will fail to vectorize and give notice if the value is set unnecessarily large. It's been agreed that increasing the safelen value beyond the optimum level should not turn off vectorization. safelen(32) is optimum for several float/single precision cases in the Intel(r) Xeon Phi(tm) cross compiler; needless to say, safelen(8) is sufficient for 128-bit SSE2. I pulled down an update of gcc gomp-4_0-branch yesterday and see in the not-yet-working additions to gcc testsuite there appears to be a move toward adding more cilkplus clauses to omp simd, such as firstprivate lastprivate (which are accepted but apparently ignored in the Intel omp simd implementation). I'll be discussing in a meeting later today my effort to publish material including discussion of OpenMP 4.0 implementations. -- Tim Prince
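For concreteness, the OpenMP 4.0 loop-level annotation with safelen discussed above looks like this (a sketch; the function name and safelen(8) bound are illustrative, not from the thread):

```c
#include <stddef.h>

/* An OpenMP 4.0 simd loop.  Build with -fopenmp (gcc) to honor the
   pragma; without such an option the unknown pragma is simply ignored.
   safelen(8) asserts that iterations at distance < 8 are independent. */
void saxpy_simd(size_t n, float a, const float *x, float *y)
{
    #pragma omp simd safelen(8)
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

As the message notes, the optimum safelen depends on the vector width: 8 floats fills an AVX-256 register, while safelen(32) was found optimum for some single-precision cases on wider hardware.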
Re: Vectorization: Loop peeling with misaligned support.
On 11/15/2013 2:26 PM, Ondřej Bílka wrote: On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote: Also keep in mind that usually costs go up significantly if misalignment causes cache line splits (processor will fetch 2 lines). There are non-linear costs of filling up the store queue in modern out-of-order processors (x86). Bottom line is that it's much better to peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache line boundaries otherwise. The solution is to either actually always peel for alignment, or insert an additional check for cache line boundaries (for high trip count loops). That is quite bold claim do you have a benchmark to support that? Since nehalem there is no overhead of unaligned sse loads except of fetching cache lines. As haswell avx2 loads behave in similar way. Where gcc or gfortran choose to split sse2 or sse4 loads, I found a marked advantage in that choice on my Westmere (which I seldom power on nowadays). You are correct that this finding is in disagreement with Intel documentation, and it has the effect that Intel option -xHost is not the optimum one. I suspect the Westmere was less well performing than Nehalem on unaligned loads. Another poorly documented feature of Nehalem and Westmere was a preference for 32-byte aligned data, more so than Sandy Bridge. Intel documentation encourages use of unaligned AVX-256 loads on Ivy Bridge and Haswell, but Intel compilers don't implement them (except for intrinsics) until AVX2. Still, on my own Haswell tests, the splitting of unaligned loads by use of AVX compile option comes out ahead. Supposedly, the preference of Windows intrinsics programmers for the relative simplicity of unaligned moves was taken into account in the more recent hardware designs, as it was disastrous for Sandy Bridge. I have only remote access to Haswell although I plan to buy a laptop soon. I'm skeptical about whether useful findings on these points may be obtained on a Windows laptop. 
In case you didn't notice it, Intel compilers introduced #pragma vector unaligned as a means to specify handling of unaligned access without peeling. I guess it is expected to be useful on Ivy Bridge or Haswell for cases where the loop count is moderate but expected to match unrolled AVX-256, or if the case where peeling can improve alignment is rare. In addition, Intel compilers learned from gcc the trick of using AVX-128 for situations where frequent unaligned accesses are expected and peeling is clearly undesirable. The new facility for vectorizing OpenMP parallel loops (e.g. #pragma omp parallel for simd) uses AVX-128, consistent with the fact that OpenMP chunks are more frequently unaligned. In fact, parallel for simd seems to perform nearly the same with gcc-4.9 as with icc. Many decisions on compiler defaults still are based on an unscientific choice of benchmarks, with gcc evidently more responsive to input from the community. -- Tim Prince
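On the gcc side, one documented way to hand the vectorizer the alignment knowledge that makes a peeling prologue unnecessary is the __builtin_assume_aligned builtin. A sketch (function name is mine; the 32-byte figure assumes AVX-sized vectors):

```c
#include <stddef.h>

/* Promise gcc that p is 32-byte aligned, so the vectorizer can emit
   aligned vector loads/stores without peeling or a runtime check.
   Passing a pointer that is not actually aligned is undefined behavior. */
void scale32(double *p, size_t n, double f)
{
    double *ap = __builtin_assume_aligned(p, 32);
    for (size_t i = 0; i < n; ++i)
        ap[i] *= f;
}
```

This is roughly the gcc counterpart of the Intel #pragma vector aligned family: the programmer supplies the alignment guarantee and the compiler skips the peel/versioning machinery.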
Re: How to generate AVX512 instructions now (just to look at them).
On 1/3/2014 11:04 AM, Toon Moene wrote: I am trying to figure out how the top-consuming routines in our weather models will be compiled when using AVX512 instructions (and their 32 512-bit registers). I thought an up-to-date trunk version of gcc, using the command line:

<...>/gfortran -Ofast -S -mavx2 -mavx512f

would do that. Unfortunately, I do not see any use of the new zmm registers, which might mean that AVX512 isn't used yet. This is how the nightly build job builds the trunk gfortran compiler:

configure --prefix=/home/toon/compilers/install --with-gnu-as --with-gnu-ld --enable-languages=fortran<,other-language> --disable-multilib --disable-nls --with-arch=core-avx2 --with-tune=core-avx2

gfortran -O3 -funroll-loops --param max-unroll-times=2 -ffast-math -mavx512f -fopenmp -S

is giving me extremely limited zmm register usage in my build of gfortran trunk. It appears to be using zmm only to enable use of vpternlogd instructions. Immediately following the first such usage, it fails to vectorize a dot_product with stride-1 operands. There are still AVX2 scalar instructions and AVX-256 vectorized loops, but none with reduction or fma. For gcc, I have to add -march=native in order for it to accept fma intrinsics (even though that one is expanded to AVX without fma). Sorry, my only AVX2 CPU is a Windows 8.1 installation (!).

Target: x86_64-unknown-cygwin
Configured with: ../configure --prefix=/usr/local/gcc4.9/ --enable-languages='c c++ fortran' --enable-libgomp --enable-threads=posix --disable-libmudflap --disable-__cxa_atexit --with-dwarf2 --without-libiconv-prefix --without-libintl-prefix --with-system-zlib
-- Tim Prince
Re: How to generate AVX512 instructions now (just to look at them).
On 1/3/2014 2:58 PM, Toon Moene wrote: On 01/03/2014 07:04 PM, Jakub Jelinek wrote: On Fri, Jan 03, 2014 at 05:04:55PM +0100, Toon Moene wrote: I am trying to figure out how the top-consuming routines in our weather models will be compiled when using AVX512 instructions (and their 32 512 bit registers).

what I'm interested in, is (cat verintlin.f):

      SUBROUTINE VERINT (
     I   KLON  , KLAT  , KLEV  , KINT , KHALO
     I , KLON1 , KLON2 , KLAT1 , KLAT2
     I , KP    , KQ    , KR
     R , PARG  , PRES
     R , PALFH , PBETH
     R , PALFA , PBETA , PGAMA )
C
C***
C
C     VERINT - THREE DIMENSIONAL INTERPOLATION
C
C     PURPOSE:
C
C     THREE DIMENSIONAL INTERPOLATION
C
C     INPUT PARAMETERS:
C
C     KLON    NUMBER OF GRIDPOINTS IN X-DIRECTION
C     KLAT    NUMBER OF GRIDPOINTS IN Y-DIRECTION
C     KLEV    NUMBER OF VERTICAL LEVELS
C     KINT    TYPE OF INTERPOLATION
C             = 1 - LINEAR
C             = 2 - QUADRATIC
C             = 3 - CUBIC
C             = 4 - MIXED CUBIC/LINEAR
C     KLON1   FIRST GRIDPOINT IN X-DIRECTION
C     KLON2   LAST GRIDPOINT IN X-DIRECTION
C     KLAT1   FIRST GRIDPOINT IN Y-DIRECTION
C     KLAT2   LAST GRIDPOINT IN Y-DIRECTION
C     KP      ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C     KQ      ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C     KR      ARRAY OF INDEXES FOR VERTICAL DISPLACEMENTS
C     PARG    ARRAY OF ARGUMENTS
C     PALFH   ALFA HAT
C     PBETH   BETA HAT
C     PALFA   ARRAY OF WEIGHTS IN X-DIRECTION
C     PBETA   ARRAY OF WEIGHTS IN Y-DIRECTION
C     PGAMA   ARRAY OF WEIGHTS IN VERTICAL DIRECTION
C
C     OUTPUT PARAMETERS:
C
C     PRES    INTERPOLATED FIELD
C
C     HISTORY:
C
C     J.E. HAUGEN 1 1992
C
C***
C
      IMPLICIT NONE
C
      INTEGER KLON  , KLAT  , KLEV  , KINT , KHALO,
     I        KLON1 , KLON2 , KLAT1 , KLAT2
C
      INTEGER KP(KLON,KLAT), KQ(KLON,KLAT), KR(KLON,KLAT)
      REAL PARG(2-KHALO:KLON+KHALO-1,2-KHALO:KLAT+KHALO-1,KLEV) ,
     R     PRES(KLON,KLAT)   ,
     R     PALFH(KLON,KLAT)  , PBETH(KLON,KLAT) ,
     R     PALFA(KLON,KLAT,4), PBETA(KLON,KLAT,4),
     R     PGAMA(KLON,KLAT,4)
C
      INTEGER JX, JY, IDX, IDY, ILEV
      REAL Z1MAH, Z1MBH
C
C     LINEAR INTERPOLATION
C
      DO JY = KLAT1,KLAT2
      DO JX = KLON1,KLON2
         IDX  = KP(JX,JY)
         IDY  = KQ(JX,JY)
         ILEV = KR(JX,JY)
C
         PRES(JX,JY) = PGAMA(JX,JY,1)*(
C
     +     PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1)
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV-1) )
     +   + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV-1)
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV-1) ) )
C+   +   + PGAMA(JX,JY,2)*(
C+   +     PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV )
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV ) )
     +   + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV )
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV ) ) )
      ENDDO
      ENDDO
C
      RETURN
      END

i.e., real Fortran code, not just intrinsics :-)

Right out of the AVX512 architect's dream. It appears to need 24 AVX-512 registers in the ifort compilation (/arch:MIC-AVX512) to avoid those spills and repeated memory operands in the gfortran avx2 compilation. How small a ratio of floating point to total instructions can you call "real Fortran"? -- Tim Prince
Re: -O3 and -ftree-vectorize
On 2/6/2014 1:51 PM, Uros Bizjak wrote: Hello! 4.9 does not enable -ftree-vectorize for -O3 (and Ofast) anymore. Is this intentional? $/ssd/uros/gcc-build/gcc/xgcc -B /ssd/uros/gcc-build/gcc -O3 -Q --help=optimizers ... -ftree-vectorize [disabled] ... I'm seeing vectorization but no output from -ftree-vectorizer-verbose, and no dot product vectorization inside omp parallel regions, with gcc g++ or gfortran 4.9. Primary targets are cygwin64 and linux x86_64. I've been unable to use -O3 vectorization with gcc, although it works with gfortran and g++, so use gcc -O2 -ftree-vectorize together with additional optimization flags which don't break. I've made source code changes to take advantage of the new vectorization with merge() and ? operators; while it's useful for -march=core-avx2, it's sometimes a loss for -msse4.1. gcc vectorization with #pragma omp parallel for simd is reasonably effective in my tests only on 12 or more cores. #pragma omp simd reduction(max: ) is giving correct results but poor performance in my tests. You've probably seen my gcc testresults posts. The one major recent improvement is the ability to skip cilkplus tests on targets where it's totally unsupported. Without cilk_for et al. even on "supported" targets cilkplus seems useless. There are still lots of failing stabs tests on targets where those apparently aren't supported. So there are some mysteries about what the developers intend. I suppose this was posted on gcc list on account of such questions being ignored on gcc-help. -- Tim Prince
Re: -O3 and -ftree-vectorize
On 02/07/2014 10:22 AM, Jakub Jelinek wrote: On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote: I'm seeing vectorization but no output from -ftree-vectorizer-verbose, and no dot product vectorization inside omp parallel regions, with gcc g++ or gfortran 4.9. Primary targets are cygwin64 and linux x86_64. I've been unable to use -O3 vectorization with gcc, although it works with gfortran and g++, so use gcc -O2 -ftree-vectorize together with additional optimization flags which don't break. Can you file a GCC bugzilla PR with minimal testcases for this (or point us at already filed bugreports)? The question of problems with gcc -O3 (called from gfortran) have eluded me as to finding a minimal test case. When I run under debug, it appears that somewhere prior to the crash some gfortran code is over-written with data by the gcc code, overwhelming my debugging skill. I can get full performance with -O2 plus a bunch of intermediate flags. As to non-vectorization of dot product in omp parallel region, -fopt-info (which I didn't know about) is reporting vectorization, but there are no parallel simd instructions in the generated code for the omp_fn. I'll file a PR on that if it's still reproduced in a minimal case. I've made source code changes to take advantage of the new vectorization with merge() and ? operators; while it's useful for -march=core-avx2, it's sometimes a loss for -msse4.1. gcc vectorization with #pragma omp parallel for simd is reasonably effective in my tests only on 12 or more cores. Likewise. Those are cases of 2 levels of loops from netlib "vector" benchmark where only one level is vectorizable and parallelizable. By putting the vectorizable loop on the outside the parallelization scales to a large number of cores. I don't expect it to out-perform single thread optimized avx vectorization until 8 or more cores are in use, but it needs more than expected number of threads even relative to SSE vectorization. 
#pragma omp simd reduction(max: ) is giving correct results but poor performance in my tests. Likewise. I'll file a PR on this, didn't know if there might be interest. I have an Intel compiler issue "closed, will not be fixed" so the simd reduction(max: ) isn't viable for icc in the near term. Thanks,
Re: -O3 and -ftree-vectorize
On 2/7/2014 11:09 AM, Tim Prince wrote: On 02/07/2014 10:22 AM, Jakub Jelinek wrote: On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote: I'm seeing vectorization but no output from -ftree-vectorizer-verbose, and no dot product vectorization inside omp parallel regions, with gcc g++ or gfortran 4.9. Primary targets are cygwin64 and linux x86_64. I've been unable to use -O3 vectorization with gcc, although it works with gfortran and g++, so use gcc -O2 -ftree-vectorize together with additional optimization flags which don't break. Can you file a GCC bugzilla PR with minimal testcases for this (or point us at already filed bugreports)? The question of problems with gcc -O3 (called from gfortran) have eluded me as to finding a minimal test case. When I run under debug, it appears that somewhere prior to the crash some gfortran code is over-written with data by the gcc code, overwhelming my debugging skill. I can get full performance with -O2 plus a bunch of intermediate flags. As to non-vectorization of dot product in omp parallel region, -fopt-info (which I didn't know about) is reporting vectorization, but there are no parallel simd instructions in the generated code for the omp_fn. I'll file a PR on that if it's still reproduced in a minimal case. I've made source code changes to take advantage of the new vectorization with merge() and ? operators; while it's useful for -march=core-avx2, it's sometimes a loss for -msse4.1. gcc vectorization with #pragma omp parallel for simd is reasonably effective in my tests only on 12 or more cores. Likewise. Those are cases of 2 levels of loops from netlib "vector" benchmark where only one level is vectorizable and parallelizable. By putting the vectorizable loop on the outside the parallelization scales to a large number of cores. 
I don't expect it to out-perform single thread optimized avx vectorization until 8 or more cores are in use, but it needs more than expected number of threads even relative to SSE vectorization. #pragma omp simd reduction(max: ) is giving correct results but poor performance in my tests. Likewise. I'll file a PR on this, didn't know if there might be interest. I have an Intel compiler issue "closed, will not be fixed" so the simd reduction(max: ) isn't viable for icc in the near term. Thanks, With further investigation, my case with reverse_copy outside and inner_product inside an omp parallel region is working very well with -O3 -ffast-math for double data type. There seems a possible performance problem with reverse_copy for float data type, so much so that gfortran does better with the loop reversal pushed down into the parallel dot_products. I have seen at least 2 cases where the new gcc vectorization of stride -1 with vpermd is superior to other compilers, even for float data type. For the cases where omp parallel for simd is set in expectation of gaining outer loop parallel simd, gcc is ignoring the simd clause. So it is understandable that a large number of cores is needed to overcome the lack of parallel simd (other than by simd intrinsics coding). I'll choose an example of omp simd reduction(max: ) for a PR. Thanks. -- Tim Prince
Re: Vectorizer Pragmas
On 2/15/2014 3:36 PM, Renato Golin wrote: On 15 February 2014 19:26, Jakub Jelinek wrote: GCC supports #pragma GCC ivdep/#pragma simd/#pragma omp simd, the last one can be used without rest of OpenMP by using -fopenmp-simd switch. Does the simd/omp have control over the tree vectorizer? Or are they just flags for the omp implementation? I don't see why we would need more ways to do the same thing. Me neither! That's what I'm trying to avoid. Do you guys use those pragmas for everything related to the vectorizer? I found that the Intel pragmas (not just simd and omp) are pretty good fit to most of our needed functionality. Does GCC use Intel pragmas to control the vectorizer? Would be good to know how you guys did it, so that we can follow the same pattern. Can GCC vectorize lexical blocks as well? Or just loops? IF those pragmas can't be used in lexical blocks, would it be desired to extend that in GCC? The Intel guys are pretty happy implementing simd, omp, etc. in LLVM, and I think if the lexical block problem is common, they may even be open to extending the semantics? cheers, --renato gcc ignores the Intel pragmas, other than the OpenMP 4.0 ones. I think Jakub may have his hands full trying to implement the OpenMP 4 pragmas, plus GCC ivdep, and gfortran equivalents. It's tough enough distinguishing between Intel's partial implementation of OpenMP 4 and the way it ought to be done. In my experience, the (somewhat complicated) gcc --param options work sufficiently well for specification of unrolling. In the same vein, I haven't seen any cases where gcc 4.9 is excessively aggressive in vectorization, so that a #pragma novector plus scalar unroll is needed, as it is with Intel compilers. I'm assuming that Intel involvement with llvm is aimed toward making it look like Intel's own compilers; before I retired, I heard a comment which indicated a realization that the idea of pushing llvm over gnu had been over-emphasized. 
My experience with this is limited; my Intel Android phone broke before I got too involved with their llvm Android compiler, which had some bad effects on both gcc and Intel software usage for normal Windows purposes. I've never seen a compiler where pragmas could be used to turn on auto-vectorization when compile options were set to disable it. The closest to that is the Intel(r) Cilk(tm) Plus where CEAN notation implies turning on many aggressive optimizations, such that full performance can be achieved without problematical -O3. If your idea is to obtain selective effective auto-vectorization in source code which is sufficiently broken that -O2 -ftree-vectorize can't be considered or -fno-strict-aliasing has to be set, I'm not about to second such a motion. -- Tim Prince
Re: Vectorizer Pragmas
On 2/16/2014 2:05 PM, Renato Golin wrote: On 16 February 2014 17:23, Tobias Burnus wrote: Compiler vendors (and users) have different ideas whether the SIMD pragmas should give the compiler only a hint or completely override the compiler's heuristics. In case of the Intel compiler, the user rules; in case of GCC, it only influences the heuristics unless one passes explicitly -fsimd-cost-model=unlimited (cf. also -Wopenmp-simd). Yes, Intel's idea for simd directives is to vectorize without applying either cost models or concern about exceptions. I tried -fsimd-cost-model-unlimited on my tests; it made no difference. As a user, I found Intel's pragmas interesting, but at the end regarded OpenMP's SIMD directives/pragmas as sufficient. That was the kind of user experience that I was looking for, thanks! The alignment options for OpenMP 4 are limited, but OpenMP 4 also seems to prevent loop fusion, where alignment assertions may be more critical. In addition, Intel uses the older directives, which some marketer decided should be called Cilk(tm) Plus even when used in Fortran, to control whether streaming stores may be chosen in some situations. I think gcc supports those only by explicit intrinsics. I don't think many people want to use both OpenMP 4 and older Intel directives together. Several of these directives are still in an embryonic stage in both Intel and gnu compilers. -- Tim Prince
Re: Vectorizer Pragmas
On 2/17/2014 4:42 AM, Renato Golin wrote: On 16 February 2014 23:44, Tim Prince wrote: I don't think many people want to use both OpenMP 4 and older Intel directives together. I'm having less and less incentives to use anything other than omp4, cilk and whatever. I think we should be able to map all our internal needs to those pragmas. On the other hand, if you guys have any cross discussion with Intel folks about it, I'd love to hear. Since our support for those directives are a bit behind, would be good not to duplicate the efforts in the long run. I'm continuing discussions with former Intel colleagues. If you are asking for insight into how Intel priorities vary over time, I don't expect much, unless the next beta compiler provides some inferences. They have talked about implementing all of OpenMP 4.0 except user defined reduction this year. That would imply more activity in that area than on cilkplus, although some fixes have come in the latter. On the other hand I had an issue on omp simd reduction(max: ) closed with the decision "will not be fixed." I have an icc problem report in on fixing omp simd safelen so it is more like the standard and less like the obsolete pragma simd vectorlength. Also, I have some problem reports active attempting to get clarification of their omp target implementation. You may have noticed that omp parallel for simd in current Intel compilers can be used for combined thread and simd parallelism, including the case where the outer loop is parallelizable and vectorizable but the inner one is not. -- Tim Prince
Re: Shouldn't unsafe-math-optimizations (re-)enable fp-contract=fast?
On 3/6/2014 1:01 PM, Joseph S. Myers wrote: On Thu, 6 Mar 2014, Ian Bolton wrote: Hi there, I see in common.opt that fp-contract=fast is the default for GCC. But then it gets disabled in c-family/c-opts.c if you are using ISO C (e.g. with -std=c99). But surely if you have also specified -funsafe-math-optimizations then it should flip it back onto fast?

That seems reasonable.

I do see an improvement in several benchmarks by use of fma when I append -ffp-contract=fast after -std=c99. Thanks. -- Tim Prince
Re: weird optimization in sin+cos, x86 backend
On 2/9/2012 5:55 AM, Richard Guenther wrote: On Thu, Feb 9, 2012 at 11:35 AM, Andrew Haley wrote: On 02/09/2012 10:20 AM, James Courtier-Dutton wrote: From what I can see, on x86_64, the hardware fsin(x) is more accurate than the hardware fsincos(x). As you gradually increase the size of X from 0 to 10e22, fsincos(x) diverges from the correct accurate value quicker than fsin(x) does. So, from this I would say that using fsincos instead of fsin is not a good idea, at least on x86_64 platforms. That's true iff you're using the hardware builtins, which we're not on GNU/Linux unless you're using -ffast-math. If you're using -ffast-math, the fsincos optimization is appropriate anyway because you want fast. If you're not using -ffast-math it's still appropriate, because we're using an accurate libm. The point of course is that glibc happily uses fsin/fsincos (which isn't even fast compared to a decent implementation using SSE math). x87 built-ins should be a fair compromise between speed, code size, and accuracy, for long double, on most CPUs. As Richard says, it's certainly possible to do better in the context of SSE, but gcc doesn't know anything about the quality of math libraries present; it doesn't even take into account whether it's glibc or something else. -- Tim Prince
Re: weird optimization in sin+cos, x86 backend
On 02/14/2012 04:51 AM, Andrew Haley wrote: On 02/13/2012 08:00 PM, Geert Bosch wrote: GNU Linux is quite good, but has issues with the "pow" function for large exponents, even in current versions Really? Even on 64-bit? I know this is a problem for the 32-bit legacy architecture, but I thought the 64-bit pow() was OK. Andrew. No problems seen under elefunt with glibc 2.12 x86_64. -- Tim Prince
Re: weird optimization in sin+cos, x86 backend
On 02/14/2012 08:26 AM, Vincent Lefevre wrote: On 2012-02-14 09:51:28 +, Andrew Haley wrote: On 02/13/2012 08:00 PM, Geert Bosch wrote: GNU Linux is quite good, but has issues with the "pow" function for large exponents, even in current versions

Really? Even on 64-bit? I know this is a problem for the 32-bit legacy architecture, but I thought the 64-bit pow() was OK.

According to http://sourceware.org/bugzilla/show_bug.cgi?id=706 the 32-bit pow() can be completely wrong, and the 64-bit pow() is just very inaccurate.

That bugzilla brings up paranoia, but with gfortran 4.7 on glibc 2.12 I get:

TESTING X**((X+1)/(X-1)) VS. EXP(2) = 7.3890561 AS X -> 1.
ACCURACY SEEMS ADEQUATE.
TESTING POWERS Z**Q AT FOUR NEARLY EXTREME VALUES:
NO DISCREPANCIES FOUND.
NO FAILURES, DEFECTS NOR FLAWS HAVE BEEN DISCOVERED.
ROUNDING APPEARS TO CONFORM TO THE PROPOSED IEEE STANDARD P754
THE ARITHMETIC DIAGNOSED APPEARS TO BE EXCELLENT!

Historically, glibc for i386 used the raw x87 built-ins without any of the recommended precautions. Paranoia still shows, as it always did:

TESTING X**((X+1)/(X-1)) VS. EXP(2) = 7.3890561 AS X -> 1.
DEFECT: Calculated (1-0.11102230E-15)**(-0.18014399E+17) differs from correct value by -0.34413050E-08
This much error may spoil calculations such as compounded interest.

-- Tim Prince
Re: GCC: OpenMP posix pthread
On 2/19/2012 9:42 AM, erotavlas_tu...@libero.it wrote: I'm starting to use Helgrind a tool of Valgrind. I read on the manual the following statement: Runtime support library for GNU OpenMP (part of GCC), at least for GCC versions 4.2 and 4.3. The GNU OpenMP runtime library (libgomp.so) constructs its own synchronisation primitives using combinations of atomic memory instructions and the futex syscall, which causes total chaos since in Helgrind since it cannot "see" those. In the latest version of GCC, is this still true or now the OpenMP uses the standard POSIX pthread? Do you have a specific OS family in mind? -- Tim Prince
Re: Vectorizer question
On 5/16/2012 4:01 PM, Iyer, Balaji V wrote: Hello Everyone, I have a question regarding the vectorizer. In the following code below...

int func (int x, int y)
{
    if (x == y)
        return (x + y);
    else
        return (x - y);
}

If we force the x and y to be vectors of vectorlength 4, then will the if-statement get a vector of booleans or does it get 1 boolean that compares 2 very large values? I guess another way to ask is that, will it logically break it up into 4 if-statements or just 1? Any help is greatly appreciated! Thanks, Balaji V. Iyer. PS. Please CC me in response so that I can get to it quickly.

Is this about vector extensions to C, or about other languages such as C++ or Fortran? In Fortran, it's definitely an array of logical, in the case where the compiler can't optimize it away. This would more likely be written return x==y ? x+y : x-y; -- Tim Prince
Re: GCC optimization report
On 7/17/2012 7:23 AM, Richard Guenther wrote: On Tue, Jul 17, 2012 at 12:43 PM, wrote: Hi all, I would like to know if GCC provides an option to get a detailed report on the optimization actually performed by the compiler. For example with the Intel C compiler it is possible using the -opt-report. I don't want to look at the assembly file and figure out the optimization. There is only -ftree-vectorizer-verbose=N currently and the various dump-files the individual passes produce (-fdump-tree-all[-subflags] -fdump-rtl-all[-subflags]). -ftree-vectorizer-verbose is analogous to the icc -vec-report option (included in -opt-report). Among the questions not answered by -opt-report are those associated with application (or not) of -complex-limited-range (gcc -fcx-limited-range). -opt-report3 I believe turns on reporting of software prefetch application, which is important but difficult to follow. It's nearly impossible to compare icc and gcc optimization other than by examining assembly and using a profiler which shows paths taken. -- Tim Prince
Re: gfortran error: Statement order error: declaration after DATA
On 9/11/2012 5:46 PM, David N. Bradley wrote: I am trying to compile the cactuscode package and can not get past the error: Statement order error: declaration after DATA. Can you point me in the direction of a fix? I included the offending file as an attachment. Dave kb9qhd, Amateur Radio Service Technician class Licence, Grid EN43

Surely someone has pointed out that you should only need to reorder the file, placing the dimension statement ahead of the data statement, if you don't wish to adopt more modern syntax. -- Tim Prince
Re: calculation of pi
On 11/3/2012 3:32 AM, Mischa Baars wrote: /usr/include/gnu/stubs.h:7:27: fatal error: gnu/stubs-32.h: No such file or directory which also prevents me from compiling the compiler under Fedora 17. This means that I am both not able to compile programs in 32-bit mode and help you with the compiler. Normally, this means you didn't install the optional (32-bit) glibc-devel i386. -- Tim Prince
Re: RFC: [ARM] Disable peeling
On 12/11/2012 5:14 AM, Richard Earnshaw wrote: On 11/12/12 09:56, Richard Biener wrote: On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw wrote: On 11/12/12 09:45, Richard Biener wrote: On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen wrote: Jan Hubicka writes: Note that I think Core has similar characteristics - at least for string operations it fares well with unalignes accesses. Nehalem and later has very fast unaligned vector loads. There's still some penalty when they cross cache lines however. iirc the rule of thumb is to do unaligned for 128 bit vectors, but avoid it for 256bit vectors because the cache line cross penalty is larger on Sandy Bridge and more likely with the larger vectors. Yes, I think the rule was that using the unaligned instruction variants carries no penalty when the actual access is aligned but that aligned accesses are still faster than unaligned accesses. Thus peeling for alignment _is_ a win. I also seem to remember that the story for unaligned stores vs. unaligned loads is usually different. Yes, it's generally the case that unaligned loads are slightly more expensive than unaligned stores, since the stores can often merge in a store buffer with little or no penalty. It was the other way around on AMD CPUs AFAIK - unaligned stores forced flushes of the store buffers. Which is why the vectorizer first and foremost tries to align stores. In which case, which to align should be a question that the ME asks the BE. R. I see that this thread is no longer about ARM. Yes, when peeling for alignment, aligned stores should take precedence over aligned loads. "ivy bridge" corei7-3 is supposed to have corrected the situation on "sandy bridge" corei7-2 where unaligned 256-bit load is more expensive than explicitly split (128-bit) loads. There aren't yet any production multi-socket corei7-3 platforms. 
It seems difficult to make the best decision between 128-bit unaligned access without peeling and 256-bit access with peeling for alignment (unless the loop count is known to be too small for the latter to come up to speed). Facilities afforded by various compilers to allow the programmer to guide this choice are rather strange and probably not to be counted on. In my experience, "westmere" unaligned 128-bit loads are more expensive than explicitly split (64-bit) loads, but the architecture manuals disagree with this finding. gcc already does a good job for corei7[-1] in such situations. -- Tim Prince
Re: not-a-number's
On 1/16/2013 5:00 AM, Andrew Haley wrote: On 01/16/2013 11:54 AM, Mischa Baars wrote: Here's what Standard C, F.8.3 Relational operators, says:

x != x → false    The statement x != x is true if x is a NaN.
x == x → true     The statement x == x is false if x is a NaN.

And indeed apparently the answer then is '2'. However, I don't think this is correct. If that means that there is an error in the C specification, then there probably is an error in the specification.

Right. So we are agreed that GCC does what the specification of the C programming language says it must do. Any argument that you have must, therefore, be with the technical committee of ISO C, not with us. Andrew.

There exist compilers which have options to ignore the possibility of NaN and replace x == x by 1 and x != x by 0 at compile time. gcc is undoubtedly correct in not making such replacements by default, since they violate the C specification. -- Tim Prince
Re: Floating Point subnormal numbers under C99 with GCC 4.7
On 1/27/2013 6:02 PM, Argentinator Rincón Matemático wrote: Hi, dear friends. I am testing floating-point macros in C language, under the standard C99. My compiler is GCC 4.6.1 (with 4.7.1, I have the same result). I have two computers: my system (1) is Windows XP SP2 32bit, on an "Intel (R) Celeron (R) 420" @ 1.60 GHz; my system (2) is Windows 7 Ultimate SP1 64bit, on an "AMD Turion II X2 dual-core mobile M520 ( 2,3 ghz 1MB L2 Cache )". (The result was the same in both systems.) I am interested in testing subnormal numbers for the types float, double and long double. I've tried the following line:

printf(" Float: %x\n Double: %x\n Long Double: %x\n", fpclassify(FLT_MIN / 4.F), fpclassify(DBL_MIN / 4.), fpclassify(LDBL_MIN / 4.L));

I've compiled with the options -std=c99 and -pedantic (also without -pedantic). Compilation goes well; however, the program shows me this:

Float: 400
Double: 400
Long Double: 4400

(0x400 == FP_NORMAL, 0x4400 == FP_SUBNORMAL)

I think that the right result must be 0x4400 in all cases. When I tested the constant sizes, I found they are of the right type. For example, I obtained:

sizeof(float) == 4
sizeof(double) == 8
sizeof(long double) == 12

Also:

sizeof(FLT_MIN / 4.F) == 4
sizeof(DBL_MIN / 4.) == 8
sizeof(LDBL_MIN / 4.L) == 12

This means that FLT_MIN / 4.F can only be a float, and so on. Moreover, FLT_MIN / 4.F must be a subnormal float number. However, it seems like the fpclassify() macro behaves as if any argument were a long double number. Just in case, I recompiled the program with the constants written by hand:

printf(" Float: %x\n", fpclassify(0x1p-128F));

The result was the same. Am I misunderstanding the C99 rules? Or does the fpclassify() macro have a bug in the GCC compiler? (In the same way, the isnormal() macro "returns" 1 for float and double, but 0 for long double.)
I quote the C99 standard paragraph that explains the behaviour of the fpclassify macro: "First, an argument represented in a format wider than its semantic type is converted to its semantic type. Then classification is based on the type of the argument." Thanks. Sincerely, yours. Argentinator

This looks more like a topic for gcc-help. Even if you had quoted gcc -v, it would not reveal conclusively where your fpclassify() came from, although it would give some important clues. There are at least 3 different implementations of gcc for Windows (not counting 32- vs. 64-bit), although not all are commonly available for gcc-4.6 or 4.7. The specific version of gcc would make less difference than which implementation it is. I guess, from your finding that sizeof(long double) == 12, you are running a 32-bit compiler even on the 64-bit Windows. The 32-bit gcc I have installed on my 64-bit Windows evaluates expressions in long double unless -mfpmath=sse is set (as one would normally do). This may affect the results returned by fpclassify. 64-bit gcc defaults to -mfpmath=sse. -- Tim Prince