Re: Build failure in dwarf2out
Paul Thomas wrote: I am being hit by this:

  rf2out.c -o dwarf2out.o
  ../../trunk/gcc/dwarf2out.c: In function `file_name_acquire':
  ../../trunk/gcc/dwarf2out.c:7672: error: `files' undeclared (first use in this function)
  ../../trunk/gcc/dwarf2out.c:7672: error: (Each undeclared identifier is reported only once
  ../../trunk/gcc/dwarf2out.c:7672: error: for each function it appears in.)
  ../../trunk/gcc/dwarf2out.c:7672: error: `i' undeclared (first use in this function)

My guess is that the #define activating that region of code is erroneously triggered. I am running the 2-day (on cygwin with a substandard BIOS) testsuite now.
Re: Call to arms: testsuite failures on various targets
FX Coudert wrote: Hi all, I reviewed this afternoon the postings from the gcc-testresults mailing list for the past month, and we have a couple of gfortran testsuite failures showing up on various targets. Could people with access to said targets (possibly maintainers) please file PRs in bugzilla for each testcase, reporting the error message and/or backtrace? (I'd be happy to be added to the Cc list of these.)

* ia64-suse-linux-gnu: gfortran.dg/vect/vect-4.f90
  FAIL: gfortran.dg/vect/vect-4.f90 -O scan-tree-dump-times Alignment of access forced using peeling 1
  FAIL: gfortran.dg/vect/vect-4.f90 -O scan-tree-dump-times Vectorizing an unaligned access 1

This happens on all reported ia64 targets, including mine. What is expected here? There is no vectorization on ia64, no reason for peeling. The compilation has no problem, and there is no report generated. As far as I know, the vectorization options are ignored. Without unrolling, of course, gfortran doesn't optimize the loop at all, but I assume that's a different question.
Re: Call to arms: testsuite failures on various targets
Dorit Nuzman wrote: FX Coudert wrote: Hi all, I reviewed this afternoon the postings from the gcc-testresults mailing list for the past month, and we have a couple of gfortran testsuite failures showing up on various targets. Could people with access to said targets (possibly maintainers) please file PRs in bugzilla for each testcase, reporting the error message and/or backtrace? (I'd be happy to be added to the Cc list of these.)

* ia64-suse-linux-gnu: gfortran.dg/vect/vect-4.f90
  FAIL: gfortran.dg/vect/vect-4.f90 -O scan-tree-dump-times Alignment of access forced using peeling 1
  FAIL: gfortran.dg/vect/vect-4.f90 -O scan-tree-dump-times Vectorizing an unaligned access 1

These tests should xfail on "vect_no_align" targets. On targets that support misaligned accesses we use peeling to align two datarefs, and generate a misaligned memory access for a third dataref. But on targets that do not support misaligned accesses I expect we just use versioning with a runtime alignment test. Does the following pass for you (I just added "{ xfail vect_no_align }" to the two failing tests)?

  Index: vect-4.f90
  ===
  --- vect-4.f90  (revision 123409)
  +++ vect-4.f90  (working copy)
  @@ -10,7 +10,7 @@
   END
   ! { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } }
  -! { dg-final { scan-tree-dump-times "Alignment of access forced using peeling" 1 "vect" } }
  -! { dg-final { scan-tree-dump-times "Vectorizing an unaligned access" 1 "vect" } }
  +! { dg-final { scan-tree-dump-times "Alignment of access forced using peeling" 1 "vect" { xfail vect_no_align } } }
  +! { dg-final { scan-tree-dump-times "Vectorizing an unaligned access" 1 "vect" { xfail vect_no_align } } }
   ! { dg-final { scan-tree-dump-times "accesses have the same alignment." 1 "vect" } }
   ! { dg-final { cleanup-tree-dump "vect" } }

This patch does change those reports to XFAIL (testsuite report attached).
I suppose any attempts to optimize for ia64, such as load-pair versioning, would be in the dataflow branch, whose location I don't know.

LAST_UPDATED: Obtained from SVN: trunk revision 123799
Native configuration is ia64-unknown-linux-gnu

  === gcc tests ===

  Running target unix
  FAIL: gcc.c-torture/execute/mayalias-2.c compilation, -O3 -g (internal compiler error)
  UNRESOLVED: gcc.c-torture/execute/mayalias-2.c execution, -O3 -g
  FAIL: gcc.c-torture/execute/mayalias-3.c compilation, -O3 -g (internal compiler error)
  UNRESOLVED: gcc.c-torture/execute/mayalias-3.c execution, -O3 -g
  FAIL: gcc.c-torture/execute/va-arg-24.c execution, -O3 -fomit-frame-pointer -funroll-loops
  FAIL: gcc.c-torture/execute/va-arg-24.c execution, -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions
  FAIL: gcc.dg/builtin-apply4.c execution test
  FAIL: gcc.dg/pr30643.c scan-assembler-not undefined
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -O0 compilation failed to produce executable
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -O1 compilation failed to produce executable
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -O2 compilation failed to produce executable
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -O3 -fomit-frame-pointer compilation failed to produce executable
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -O3 -g compilation failed to produce executable
  WARNING: gcc.dg/torture/fp-int-convert-float128-timode.c -Os compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -O0 (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -O0 compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -O1 (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -O1 compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -O2 (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -O2 compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -O3 -fomit-frame-pointer (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -O3 -fomit-frame-pointer compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -O3 -g (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -O3 -g compilation failed to produce executable
  FAIL: gcc.dg/torture/fp-int-convert-float128.c -Os (test for excess errors)
  WARNING: gcc.dg/torture/fp-int-convert-float128.c -Os compilation failed to produce executable
  XPASS: gcc.dg/tree-ssa/loop-1.c scan-assembler-times foo 5
  XPASS: gcc.dg/tree-ssa/update-threading.c scan-tree-dump-times Invalid sum 0
  FAIL: gcc.dg/vect/pr30771.c scan-tree-dump-times vectorized 1 loops 1
  FAIL: gcc.dg/vect/vect-iv-4.c scan-tree-dump-times vectorized 1 loops 1
  FAIL: gcc.dg/vect/vect-iv-9.c scan-tree-dump-times vectorize
Re: Where is gstdint.h
[EMAIL PROTECTED] wrote: Where is gstdint.h? Does it actually exist? libdecnumber seems to use it. The decimal32|64|128.h headers include decNumber.h, which includes decContext.h, which includes gstdint.h.

When you configure libdecnumber (e.g. by running the top-level gcc configure), gstdint.h should be created, by modifying <stdint.h>. Since you said nothing about the conditions under which you had the problem, you can't expect anyone to fix it for you. If you do want it fixed, you should at least file a complete PR. As this is more likely to happen with a poorly supported target, you may have to look into it in more detail than that. When this happened to me, I simply made a copy of <stdint.h> to get over the hump.
Re: Where is gstdint.h
[EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: Where is gstdint.h? Does it actually exist? libdecnumber seems to use it. The decimal32|64|128.h headers include decNumber.h, which includes decContext.h, which includes gstdint.h. When you configure libdecnumber (e.g. by running the top-level gcc configure), gstdint.h should be created, by modifying <stdint.h>. Since you said nothing about the conditions under which you had the problem, you can't expect anyone to fix it for you. If you do want it fixed, you should at least file a complete PR. As this is more likely to happen with a poorly supported target, you may have to look into it in more detail than that. When this happened to me, I simply made a copy of <stdint.h> to get over the hump.

Thanks for the prompt reply. I am doing a 386 build. I could not find it in my build directory, but it is there after all. Sorry, not used to finding files in Linux. Aaron

You can't expect people to guess which 386 build you are doing. Certain 386 builds clearly are not in the "poorly supported" category; others may be.
Re: Where is gstdint.h
[EMAIL PROTECTED] wrote: Tim Prince wrote: [EMAIL PROTECTED] wrote: Where is gstdint.h? Does it actually exist? libdecnumber seems to use it. The decimal32|64|128.h headers include decNumber.h, which includes decContext.h, which includes gstdint.h. When you configure libdecnumber (e.g. by running the top-level gcc configure), gstdint.h should be created, by modifying <stdint.h>. Since you said nothing about the conditions under which you had the problem, you can't expect anyone to fix it for you. If you do want it fixed, you should at least file a complete PR. As this is more likely to happen with a poorly supported target, you may have to look into it in more detail than that. When this happened to me, I simply made a copy of <stdint.h> to get over the hump.

This might happen when you run the top-level gcc configure in its own directory. You may want to try to make a new directory elsewhere and run configure there:

  pwd
  .../my-gcc-source-tree
  mkdir ../build
  cd ../build
  ../my-gcc-source-tree/configure
  make

If you're suggesting trying to build in the top-level directory to see if the same problem occurs, I would expect other problems to arise. If it would help diagnose the problem, and the problem persists for a few weeks, I'd be willing to try it.
Re: Effects of newly introduced -mpcX 80387 precision flag
[EMAIL PROTECTED] wrote: I just (re-)discovered these tables giving maximum known errors in some libm functions when extended precision is enabled: http://people.inf.ethz.ch/gonnet/FPAccuracy/linux/summary.html and when the precision of the mantissa is set to 53 bits (double precision): http://people.inf.ethz.ch/gonnet/FPAccuracy/linux64/summary.html This is from 2002, and indeed, some of the errors in double-precision results are hundreds or thousands of times bigger when the precision is set to 53 bits.

This isn't very helpful. I can't find an indication of whose libm is being tested; it appears to be an unspecified non-standard version of gcc, and a lot of digging would be needed to find out what the tests are. It makes no sense at all for sqrt() to break down with a change in precision mode. Extended precision typically gives a significant improvement in the accuracy of complex math functions, as shown in the Celefunt suite from TOMS. The functions shown, if properly coded for SSE2, should be capable of giving good results independent of x87 precision mode. I understand there is continuing academic research. Arguments have been going on for some time about whether to accept approximate SSE2 math libraries. I personally would not like to see new libraries without some requirement for readable C source and testing. I agree that it would be bad to set 53-bit mode blindly for a library which expects 64-bit mode, but it seems a serious weakness if such a library doesn't take care of precision mode itself. The whole precision-mode issue seems somewhat moot now that years have passed since the last CPUs were made which do not support SSE2, or the equivalent in other CPU families.
Re: Effects of newly introduced -mpcX 80387 precision flag
[EMAIL PROTECTED] wrote: On Apr 29, 2007, at 1:01 PM, Tim Prince wrote: It makes no sense at all for sqrt() to break down with change in precision mode.

If you do an extended-precision (80-bit) sqrt and then round the result again to a double (64-bit), those two roundings will increase the error, sometimes to > 1/2 ulp. To give current results on a machine I have access to, I ran the tests there on:

  vendor_id  : AuthenticAMD
  cpu family : 15
  model      : 33
  model name : Dual Core AMD Opteron(tm) Processor 875

using:

  euler-59% gcc -v
  Using built-in specs.
  Target: x86_64-unknown-linux-gnu
  Configured with: ../configure --prefix=/pkgs/gcc-4.1.2
  Thread model: posix
  gcc version 4.1.2

on an up-to-date RHEL 4.0 server (so whatever libm is offered there), and, indeed, the only differences it found were in 1/x, sqrt(x), and Pi*x, because of double rounding. In other words, the code that went through libm gave identical answers whether running on SSE, x87 (extended precision), or x87 (double precision). I don't know whether there are still math libraries for which Gonnet's 2002 results prevail.

Double rounding ought to be avoided by -mfpmath=sse and permitting builtin_sqrt to do its thing, or by setting 53-bit precision. The latter disables long double. The original URL showed total failure of sqrt(); double rounding only brings an error of .5 ulp, as usually assessed. I don't think the 64-/53-bit double rounding of sqrt can be detected, but of course such double rounding of * can be measured. With Pi, you have various possibilities, according to the precision of the Pi value (including the possibility of the one supplied by the x87 instruction) as well as the two choices of arithmetic precision mode.
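The double-rounding effect discussed above (an 80-bit extended result rounded a second time when stored as a 64-bit double) can be reproduced exactly with rational arithmetic. Here is an illustrative sketch in Python, not from the thread: a round-to-nearest-even at a chosen significand width, applied to a value that sits just above the 53-bit midpoint but exactly on the 64-bit midpoint, so rounding through extended precision loses the last bit.

```python
from fractions import Fraction

def round_to_precision(x: Fraction, p: int) -> Fraction:
    """Round positive x to the nearest value with a p-bit significand, ties to even."""
    # Find e with 2**e <= x < 2**(e+1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    if Fraction(2) ** (e + 1) <= x:
        e += 1
    ulp = Fraction(2) ** (e - p + 1)   # spacing of p-bit values near x
    return round(x / ulp) * ulp        # round() on a Fraction is ties-to-even

# Just above the 53-bit rounding midpoint, exactly on the 64-bit midpoint:
x = 1 + Fraction(1, 2**53) + Fraction(1, 2**64)

once  = round_to_precision(x, 53)                          # one correct rounding
twice = round_to_precision(round_to_precision(x, 64), 53)  # x87 64-bit, then store

print(once == 1 + Fraction(1, 2**52))  # True: the correctly rounded double
print(twice == 1)                      # True: double rounding lost the last bit
```

The same mechanism explains why results routed through an x87 in extended-precision mode can differ from SSE results by an extra fraction of an ulp.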
Re: Successful Build of gcc on Cygwin WinXP SP2
[EMAIL PROTECTED] wrote: Cygcheck version 1.90, compiled on Jan 31 2007. How do I get a later version of Cygwin?

1.90 is the current release version. It seems unlikely that later trial versions have a patch for the stdio.h conflict with C99, or change headers to avoid warnings which by default are fatal. If you want a newer cygwin.dll, read the cygwin mailing list archive for hints, but it doesn't appear to be relevant.
Re: Successful Build of gcc on Cygwin WinXP SP2
[EMAIL PROTECTED] wrote: James, On 5/1/07, Aaron Gray <[EMAIL PROTECTED]> wrote: Hi James, > Successfully built latest gcc on Win XP SP2 with cvs-built cygwin. I was wondering whether you could help to get me to the same point please.

You will need to use Dave Korn's patch for newlib: http://sourceware.org/ml/newlib/2007/msg00292.html

I am getting the following:

  $ patch newlib/libc/include/stdio.h fix-gcc-bootstrap-on-cygwin-patch.diff
  patching file newlib/libc/include/stdio.h
  Hunk #1 succeeded at 475 (offset 78 lines).
  Hunk #2 FAILED at 501.
  Hunk #3 FAILED at 521.
  2 out of 3 hunks FAILED -- saving rejects to file newlib/libc/include/stdio.h.rej

I had to apply the relevant changes manually to the cygwin stdio.h. It doesn't appear to match the version for which Dave made the patch.
Re: What happened to bootstrap-lean?
Gabriel Dos Reis wrote: Andrew Pinski <[EMAIL PROTECTED]> writes:

| > On Fri, 16 Dec 2005, Paolo Bonzini wrote:
| > > Yes. "make bubblestrap" is now called simply "make".
| >
| > Okay, how is "make bootstrap-lean" called these days? ;-)
| >
| > In fact, bootstrap-lean is still documented in install.texi and
| > makefile.texi, but it no longer seems to be present in the Makefile
| > machinery. Could we get this back?
|
| bootstrap-lean is done by doing the following (which I feel is the wrong way):
| Configure with --enable-bootstrap=lean
| and then do a "make bootstrap"

Hmm, does that mean that I would have to reconfigure GCC if I wanted to do "make bootstrap-lean" after a previous configuration and build? I think the answer must be "no", but I'm not sure. -- Gaby

I've not been able to find another way to rebuild (on SuSE 9.2, for example) after applying the weekly patch file. I'm hoping that suggestion works.
Re: Fwd: Windows support dropped from gcc trunk
On 10/14/2015 11:36 AM, Steve Kargl wrote: > On Wed, Oct 14, 2015 at 11:32:52AM -0400, Tim Prince wrote: >> Sorry if someone sees this multiple times; I think it may have been >> stopped by ISP or text-mode filtering: >> >> Since Sept. 26, the partial support for Windows 64-bit has been dropped >> from gcc trunk: >> winnt.c apparently has problems with seh, which prevent bootstrapping, >> and prevent the new gcc from building libraries. >> libgfortran build throws a fatal error on account of lack of support for >> __float128, even if a working gcc is used. >> I didn't see any notification about this; maybe it wasn't a consensus >> decision? >> There are satisfactory pre-built gfortran 5.2 compilers (including >> libgomp, although that is off by default and the testsuite wants acc as >> well as OpenMP) available in cygwin64 (test version) and (apparently) >> mingw-64. >> > The last commit to winnt.c is > > 2015-10-02  Kai Tietz > >     PR target/51726 >     * config/i386/winnt.c (ix86_handle_selectany_attribute): Handle >     selectany within this function without need to keep attribute. >     (i386_pe_encode_section_info): Remove selectany-code. > > Perhaps, contact Kai. > > I added gcc@gcc.gnu.org as this technically isn't a Fortran issue.

The test suite reports hundreds of new ICE instances, all referring to this seh_unwind_emit function:

  /cygdrive/c/users/tim/tim/tim/src/gnu/gcc1/gcc/testsuite/gcc.c-torture/compile/2127-1.c: In function 'foo':
  /cygdrive/c/users/tim/tim/tim/src/gnu/gcc1/gcc/testsuite/gcc.c-torture/compile/2127-1.c:7:1: internal compiler error: in i386_pe_seh_unwind_emit, at config/i386/winnt.c:1137
  Please submit a full bug report,

I will file a bugzilla PR if that is what is wanted, but I wanted to know whether a new configure option is required. As far as I know there were always problems with long double for Windows targets, but the refusal of libgfortran to build on account of it is new. Thanks, Tim
Re: question about -ffast-math implementation
On 6/2/2014 3:00 AM, Andrew Pinski wrote: On Sun, Jun 1, 2014 at 11:09 PM, Janne Blomqvist wrote: On Sun, Jun 1, 2014 at 9:52 AM, Mike Izbicki wrote: I'm trying to copy gcc's behavior with the -ffast-math compiler flag into haskell's ghc compiler. The only documentation I can find about it is at: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html I understand how floating point operations work and have come up with a reasonable list of optimizations to perform. But I doubt it is exhaustive. My question is: where can I find all the gory details about what gcc will do with this flag? I'm perfectly willing to look at source code if that's what it takes. In addition to the official documentation, a nice overview is at https://gcc.gnu.org/wiki/FloatingPointMath Useful, thanks for the pointer Though for the gory details and authoritative answers I suppose you'd have to look into the source code. Also, are there any optimizations that you wish -ffast-math could perform, but for various architectural reasons they don't fit into gcc? There are of course a (nearly endless?) list of optimizations that could be done but aren't (lack of manpower, impractical, whatnot). I'm not sure there are any interesting optimizations that would be dependent on loosening -ffast-math further? I find it difficult to remember how to reconcile differing treatments by gcc and gfortran under -ffast-math; in particular, with respect to -fprotect-parens and -freciprocal-math. The latter appears to comply with Fortran standard. (One thing I wish wouldn't be included in -ffast-math is -fcx-limited-range; the naive complex division algorithm can easily lead to comically poor results.) Which is kinda interesting because the Google folks have been trying to turn on -fcx-limited-range for C++ a few times now. Intel tried to add -complex-limited-range as a default under -fp-model fast=1 but that was shown to be unsatisfactory. 
Now, with the introduction of omp simd directives and pragmas, we have disagreement among various compilers on the relative roles of the directives and the fast-math options. I've submitted PR60117 hoping to get some insight on whether omp simd should disable optimizations otherwise performed by -ffast-math. Intel made the directives override the command-line fast (or "no-fast") settings locally, so that complex-limited-range might be in effect inside the scope of the directive (whether or not you want it). They made changes in the current beta compiler, so it's no longer practical to set standard-compliant options but discard them by pragma in individual for loops. -- Tim Prince
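The complaint above about -fcx-limited-range is easy to demonstrate. An illustrative sketch in Python (not the thread's code): the textbook complex-division formula, which is roughly what the limited-range option licenses a compiler to emit, overflows in its denominator for large operands, while a scaled (Smith-style) division, which CPython's own complex `/` effectively performs, returns the representable quotient.

```python
import math

def naive_cdiv(a: complex, b: complex) -> complex:
    # Textbook formula: the denominator br*br + bi*bi overflows to inf for
    # |b| around 1e154 and up, even though the true quotient is ordinary.
    d = b.real * b.real + b.imag * b.imag
    return complex((a.real * b.real + a.imag * b.imag) / d,
                   (a.imag * b.real - a.real * b.imag) / d)

a = complex(1e200, 1e200)
b = complex(1e200, 1e200)

bad  = naive_cdiv(a, b)   # inf/inf and (inf-inf)/inf -> NaNs
good = a / b              # scaled division survives: (1+0j)

print(bad)
print(good)
```

This is the "comically poor result" hazard: the answer is exactly 1, yet the fast formula produces NaN, which is why enabling -fcx-limited-range by default is contentious.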
Re: Vector permutation only deals with # of vector elements same as mask?
On 2/11/2011 7:30 AM, Bingfeng Mei wrote: Thanks. Another question: is there any plan to vectorize loops like the following one?

  for (i = 127; i >= 0; i--) {
    x[i] = y[i] + z[i];
  }

When I last tried, the Sun compilers could vectorize such loops efficiently (for fairly short loops), with appropriate data definitions. The Sun compilers didn't peel for alignment, to improve performance on longer loops, as gcc and others do. For a case with no data overlaps (float * __restrict__ x, y, z, or Fortran), loop reversal can do the job. gcc has some loop reversal machinery, but I haven't seen it used for vectorization. In a simple case like this, some might argue there's no reason to write a backward loop when it could easily be reversed in source code, and compilers have been seen to make mistakes in reversal. -- Tim Prince
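Why reversal needs the no-overlap guarantee (__restrict__ in C, or Fortran argument rules) can be sketched in a few lines. This Python illustration is mine, not gcc's machinery: with disjoint arrays the backward and forward loops give identical results, but for an aliased recurrence the direction changes the answer, so a compiler may only reverse when it can prove no overlap.

```python
def backward(x, y, z):
    # x[i] = y[i] + z[i] walking from the last element down, as in the post
    for i in range(len(x) - 1, -1, -1):
        x[i] = y[i] + z[i]

def forward(x, y, z):
    for i in range(len(x)):
        x[i] = y[i] + z[i]

# Disjoint arrays: direction is irrelevant, so reversal is legal.
y, z = [1, 2, 3, 4], [10, 20, 30, 40]
xb, xf = [0] * 4, [0] * 4
backward(xb, y, z)
forward(xf, y, z)
print(xb == xf)  # True

# Overlapping storage: the recurrence a[i+1] = a[i] + 1 reads what the
# other direction has already written, so reversal changes the result.
def recur_backward(a):
    for i in range(len(a) - 2, -1, -1):
        a[i + 1] = a[i] + 1

def recur_forward(a):
    for i in range(len(a) - 1):
        a[i + 1] = a[i] + 1

ab, af = [0, 0, 0, 0], [0, 0, 0, 0]
recur_backward(ab)   # [0, 1, 1, 1]
recur_forward(af)    # [0, 1, 2, 3]
print(ab != af)      # True: with aliasing, direction matters
```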
Re: numerical results differ after irrelevant code change
On 5/8/2011 8:25 AM, Michael D. Berger wrote:

> -----Original Message-----
> From: Robert Dewar [mailto:de...@adacore.com]
> Sent: Sunday, May 08, 2011 11:13
> To: Michael D. Berger
> Cc: gcc@gcc.gnu.org
> Subject: Re: numerical results differ after irrelevant code change
> [...]
> This kind of result is quite expected on an x86 using the old-style
> (default) floating point (because of extra precision in intermediate
> results).

How does the extra precision lead to the variable result? Also, is there a way to prevent it? It is a pain in regression testing.

If you don't need to support CPUs over 10 years old, consider -march=pentium4 -mfpmath=sse, or use the 64-bit OS and gcc. Note the resemblance of your quoted differences to DBL_EPSILON from <float.h>. That's 1 ulp relative to 1.0. I have a hard time imagining the nature of real applications which don't need to tolerate differences of 1 ulp. -- Tim Prince
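For the regression-testing pain mentioned above, the usual fix is to compare within a ulp budget rather than for exact equality. A sketch in Python 3.9+ (math.ulp and math.nextafter; this helper is mine, not from the thread):

```python
import math

def within_ulps(a: float, b: float, k: int = 1) -> bool:
    """True if a and b differ by at most k units in the last place."""
    if a == b:
        return True
    tol = k * math.ulp(max(abs(a), abs(b)))
    return abs(a - b) <= tol

# Two results that differ by exactly DBL_EPSILON relative to 1.0, the
# size of the x87-vs-SSE discrepancies described above:
r1 = 1.0
r2 = math.nextafter(1.0, 2.0)        # 1.0 + 2**-52
print(within_ulps(r1, r2))           # True: a 1-ulp wobble is tolerated
print(within_ulps(1.0, 1.0 + 1e-12)) # False: a genuine discrepancy still fails
```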
Re: Profiling gcc itself
On 11/20/2011 11:10 AM, Basile Starynkevitch wrote: On Sun, 20 Nov 2011 03:43:20 -0800 Jeff Evarts wrote: I posted this question at irc://irc.oftc.net/#gcc and they suggested that I pose it here instead. I do some "large-ish" builds (linux, gcc itself, etc.) on a too-regular basis, and I was wondering what could be done to speed things up. A little printf-style checking hints to me that I might be spending the majority of my time in CPP rather than g++, gasm, ld, etc. Has anyone (ever, regularly, or recently) built gcc (g++, gcpp) with profiling turned on? Is it hard? Did you get good results?

I'm not sure the question belongs on gcc@gcc.gnu.org; perhaps gcc-h...@gcc.gnu.org might be a better place.

If you choose to follow such advice, explaining whether other facilities already in gcc, e.g. http://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html, apply to your situation may be useful. -- Tim Prince
Re: C Compiler benchmark: gcc 4.6.3 vs. Intel v11 and others
On 1/19/2012 9:27 AM, willus.com wrote: On 1/19/2012 2:59 AM, Richard Guenther wrote: On Thu, Jan 19, 2012 at 7:37 AM, Marc Glisse wrote: On Wed, 18 Jan 2012, willus.com wrote: For those who might be interested, I've recently benchmarked gcc 4.6.3 (and 3.4.2) vs. Intel v11 and Microsoft (in Windows 7) here: http://willus.com/ccomp_benchmark2.shtml http://en.wikipedia.org/wiki/Microsoft_Windows_SDK#64-bit_development

For the math functions, this is normally more a libc feature, so you might get very different results on different OSes. Then again, by using -ffast-math, you allow the math functions to return any random value, so I can think of ways to make it even faster ;-)

Also for math functions you can simply substitute the Intel compiler's ones (GCC uses the Microsoft ones) by linking against libimf. You can also make use of their vectorized variants from GCC by specifying -mveclibabi=svml and linking against libimf (the GCC autovectorizer will then use the routines from the Intel compiler math library). That makes a huge difference for code using functions from math.h. Richard. -- Marc Glisse

Thank you both for the tips. Are you certain that, with the flags I used, Intel doesn't completely in-line the math2.h functions at the compile stage? gcc? I take it that to use libimf.a (legally) I would have to purchase the Intel compiler?

In-line math functions, beyond what gcc does automatically (sqrt...), are possible only with x87 code; those aren't vectorizable nor remarkably fast, although quality can be made good (with care). As Richard said, the icc svml library is the one supporting the fast vector math functions. There is also an arch-consistency version of svml (different internal function names) which is not as fast but may give more accurate results or avoid platform-dependent bugs.

Yes, the Intel library license places restrictions on usage: http://software.intel.com/en-us/articles/faq-intel-parallel-composer-redistributable-package/?wapkw=%28redistributable+license%29 You might use it for personal purposes under the terms of this Linux license: http://software.intel.com/en-us/articles/Non-Commercial-license/?wapkw=%28non-commercial+license%29 It isn't supported in the gcc context. Needless to say, I don't speak for my employer. -- Tim Prince
Re: C Compiler benchmark: gcc 4.6.3 vs. Intel v11 and others
On 1/19/2012 9:24 PM, willus.com wrote: On 1/18/2012 10:37 PM, Marc Glisse wrote: On Wed, 18 Jan 2012, willus.com wrote: For those who might be interested, I've recently benchmarked gcc 4.6.3 (and 3.4.2) vs. Intel v11 and Microsoft (in Windows 7) here: http://willus.com/ccomp_benchmark2.shtml http://en.wikipedia.org/wiki/Microsoft_Windows_SDK#64-bit_development For the math functions, this is normally more a libc feature, so you might get very different results on different OSes. Then again, by using -ffast-math, you allow the math functions to return any random value, so I can think of ways to make it even faster ;-)

I use -ffast-math all the time and have always gotten virtually identical results to when I turn it off. The speed difference is important for me.

The default for the Intel compiler is more aggressive than gcc -ffast-math -fno-cx-limited-range, as long as you don't use one of the old buggy mathinline.h header files. For a fair comparison, you need detailed attention to comparable options. If you don't set gcc -ffast-math, you will want icc -fp-model source. It's good to have in mind what you want from the more aggressive options, e.g. auto-vectorization of sum reduction. If you do want gcc -fcx-limited-range, icc spells it -complex-limited-range. -- Tim Prince
Re: weird optimization in sin+cos, x86 backend
On 02/05/2012 11:08 AM, James Courtier-Dutton wrote: Hi, I looked at this a bit closer. sin(1.0e22) is outside the +-2^63 range, so FPREM1 is used to bring it inside the range. So, I looked at FPREM1 a bit closer.

  #include <stdio.h>
  #include <math.h>

  int main (void)
  {
    long double x, r, m;
    x = 1.0e22;
    // x = 5.26300791462049950360708478127784; <- This is what the answer
    //     should be, give or take 2PI.
    m = M_PIl * 2.0;
    r = remainderl(x, m);   // Utilizes FPREM1
    printf ("x = %.17Lf\n", x);
    printf ("m = %.17Lf\n", m);
    printf ("r = %.17Lf\n", r);
    return 1;
  }

This outputs:

  x = 100.0
  m = 6.28318530717958648
  r = 2.66065232182161996

But r should be 5.26300791462049950360708478127784... or -1.020177392559086973318201985281... according to Wolfram Alpha and most arbitrary-precision maths libs I tried. I need to do a bit more digging, but this might point to a bug in the CPU instruction FPREM1. Kind regards, James

As I recall, the remaindering instruction was documented as using a 66-bit rounded approximation of PI, in case that is what you refer to. -- Tim Prince
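The surprising r above is consistent with the remainder being exact with respect to its stored second operand: m is the machine approximation of 2*pi, not 2*pi itself, and at x = 1e22 the quotient is about 1.6e21, so m's tiny representation error is amplified by enough full turns to make the result look arbitrary without any defect in the instruction. The same effect can be checked in double precision with exact rational arithmetic. An illustrative Python sketch (mine, not from the thread; it assumes CPython's math.remainder wraps the C library's correctly rounded remainder()):

```python
import math
from fractions import Fraction

x = 10**22                 # exactly representable as a double
m = 2.0 * math.pi          # the double nearest 2*pi, NOT 2*pi itself

# IEEE remainder is exact for its actual operands: recompute it with
# unlimited-precision rationals and the round-half-to-even quotient
# that the operation specifies.
q = round(Fraction(x) / Fraction(m))   # round() on Fraction is ties-to-even
r_exact = Fraction(x) - q * Fraction(m)

print(math.remainder(x, m))
print(float(r_exact))
print(math.remainder(x, m) == float(r_exact))  # True: no rounding error at all
# So the "wrong-looking" value comes entirely from m != 2*pi: the quotient
# magnitude (~1.6e21) multiplies m's representation error (~a few parts in 10^16).
```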
Re: Failure building current 4.5 snapshot on Cygwin
Eric Niebler wrote: Angelo Graziosi wrote: Eric Niebler wrote: I am running into the same problem (cannot build the latest snapshot on cygwin). I have built and installed the latest binutils from head (see attached config.log for details). But still the build fails. Any help?

This is strange! Recent snapshots (4.3, 4.4, 4.5) build OK both on Cygwin-1.5 and 1.7. In 1.5 I have built the same binutils as in 1.7.

I've attached objdir/intl/config.log.

It says you have triggered cross-compilation mode, without a complete setup. Also, it says you are building in a directory below your source code directory, which I always used to do myself, but stopped on account of the number of times I've seen this criticized. The only new build-blocking problem I've run into in the last month is the unsupported autoconf test, which has a #FIXME comment. I had to comment it out.
Re: [4.4] Strange performance regression?
Joern Rennecke wrote: Quoting Mark Tall: Joern Rennecke wrote: But at any rate, the subject does not agree with the content of the original post.

When we talk about a 'regression' in a particular gcc version, we generally mean that this version is in some way worse than a previous version of gcc. Didn't the original poster indicate that gcc 4.3 was faster than 4.4? In my book that is a regression.

He also said that it was a different machine: a Core 2 Q6600 vs. some kind of Xeon Core 2 system with a total of eight cores. As different memory subsystems are likely to affect the code, it is not an established regression until he can reproduce a performance drop going from an older to a current compiler on the same or sufficiently similar machines, under comparable load conditions - which generally means that the machine must be idle apart from the benchmark.

Ian's judgment in diverting this to gcc-help was borne out when it developed that -funroll-loops was wanted. This appeared to confirm his suggestion that it might have had to do with loop alignments. As long as everyone is editorializing, I'll venture to say this case raises the suspicion that gcc might benefit from better default loop alignments, at least for that particular CPU. However, I've played a lot of games on Core i7 with varying unrolling etc., and I find the behavior of current gcc entirely satisfactory, aside from the verbosity of the options required.
Re: Whole program optimization and functions-only-called-once.
Toon Moene wrote: Richard Guenther wrote: On Sun, Nov 15, 2009 at 8:07 AM, Toon Moene wrote: Steven Bosscher wrote: At least CPROP, LCM-PRE, and HOIST (i.e. all passes in gcse.c), and variable tracking. Are they covered by a --param ? At least that way I could teach them to go on indefinitely ... I think most of them are. Maybe we should diagnose the cases where we hit these limits. That would be a good idea. One other compiler I work with frequently (the Intel Fortran compiler) does just that. However, either it doesn't have or their marketing department doesn't want you to know about knobs to tweak these decisions :-) Both gfortran and ifort have a much longer list of adjustable limits on in-lining than most customers are willing to study or test.
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Toon Moene wrote: H.J. Lu wrote: On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene wrote: L.S., Due to the discussion on register allocation, I went back to a hobby of mine: studying the assembly output of the compiler. For this Fortran subroutine (note: unless otherwise told, the Fortran front end takes reals to be 32-bit floating-point numbers):

  subroutine sum(a, b, c, n)
  integer i, n
  real a(n), b(n), c(n)
  do i = 1, n
     c(i) = a(i) + b(i)
  enddo
  end

with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

  xorps   %xmm2, %xmm2
  .L6:
  movaps  %xmm2, %xmm0
  movaps  %xmm2, %xmm1
  movlps  (%r9,%rax), %xmm0
  movlps  (%r8,%rax), %xmm1
  movhps  8(%r9,%rax), %xmm0
  movhps  8(%r8,%rax), %xmm1
  incl    %ecx
  addps   %xmm1, %xmm0
  movaps  %xmm0, 0(%rbp,%rax)
  addq    $16, %rax
  cmpl    %ebx, %ecx
  jb      .L6

I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1} have to be zeroed (%xmm2 is set to zero by xor'ing it with itself) before they are completely filled with the mov{l,h}ps instructions?

I think it is used to avoid a partial SSE register stall.

You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for %xmm1) instruction (to copy 4*32 bits to the register)?

If you want those, you must request them with -mtune=barcelona.
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Richard Guenther wrote: On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince wrote: Toon Moene wrote: H.J. Lu wrote: On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene wrote: L.S., Due to the discussion on register allocation, I went back to a hobby of mine: Studying the assembly output of the compiler. For this Fortran subroutine (note: unless otherwise told to the Fortran front end, reals are 32 bit floating point numbers): subroutine sum(a, b, c, n) integer i, n real a(n), b(n), c(n) do i = 1, n c(i) = a(i) + b(i) enddo end with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop: xorps %xmm2, %xmm2 .L6: movaps %xmm2, %xmm0 movaps %xmm2, %xmm1 movlps (%r9,%rax), %xmm0 movlps (%r8,%rax), %xmm1 movhps 8(%r9,%rax), %xmm0 movhps 8(%r8,%rax), %xmm1 incl %ecx addps %xmm1, %xmm0 movaps %xmm0, 0(%rbp,%rax) addq $16, %rax cmpl %ebx, %ecx jb .L6 I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before they are completely filled with the mov{l,h}ps instructions? I think it is used to avoid partial SSE register stall. You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for %xmm1) instruction (to copy 4*32 bits to the register)? If you want those, you must request them with -mtune=barcelona. Which would then get you movups (%r9,%rax), %xmm0 (unaligned move). Generic tuning prefers the split moves; AMD Fam10 and above handle unaligned moves just fine. Correct, the movaps would have been used if alignment were recognized. The newer CPUs achieve full performance with movups. Do you consider Core i7/Nehalem as included in "AMD Fam10 and above"?
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Toon Moene wrote: Toon Moene wrote: Tim Prince wrote: > If you want those, you must request them with -mtune=barcelona. OK, so it is an alignment issue (with -mtune=barcelona): .L6: movups 0(%rbp,%rax), %xmm0 movups (%rbx,%rax), %xmm1 incl %ecx addps %xmm1, %xmm0 movaps %xmm0, (%r8,%rax) addq $16, %rax cmpl %r10d, %ecx jb .L6 Once this problem is solved (well, determined how it could be solved), we go on to the next, the extraneous induction variable %ecx. There are two ways to deal with it: 1. Eliminate it with respect to the other induction variable that counts in the same direction (upwards, with steps 16) and remember that induction variable's (%rax) limit. or: 2. Count %ecx down from %r10d to zero (which eliminates %r10d as a loop-carried register). g77 avoided this by coding counted do loops with a separate loop counter counting down to zero - not so with gfortran (quoting): /* Translate the simple DO construct. This is where the loop variable has integer type and step +-1. We can't use this in the general case because integer overflow and floating point errors could give incorrect results. We translate a do loop from: DO dovar = from, to, step body END DO to: [Evaluate loop bounds and step] dovar = from; if ((step > 0) ? (dovar <= to) : (dovar >= to)) { for (;;) { body; cycle_label: cond = (dovar == to); dovar += step; if (cond) goto end_label; } } end_label: This helps the optimizers by avoiding the extra induction variable used in the general case. */ So either we teach the Fortran front end this trick, or we teach the loop optimization the trick of flipping the sense of an (otherwise unused) induction variable. This would have paid off more frequently in i386 mode, where there is a possibility of integer register pressure in loops small enough for such an optimization to succeed. This seems to be among the types of optimizations envisioned for run-time binary interpretation systems.
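The two loop shapes being compared can be sketched in C++ (an illustrative sketch, not gfortran's actual translation): the second version replaces the upward counter that must be compared against a live limit register with a trip counter counting down to zero, whose termination test needs no extra register.

```cpp
// Sketch of the induction-variable trick discussed above. Both functions
// compute the same thing; they differ only in the shape of the loop control.
void sum_up(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)                // i is compared against n each trip
        c[i] = a[i] + b[i];
}

void sum_down(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (int count = n; count > 0; --count) {  // down-counter: test against zero
        c[i] = a[i] + b[i];
        ++i;
    }
}
```

The down-counting form is what the post attributes to g77's code generation for counted do loops.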
Re: Graphite and Loop fusion.
Toon Moene wrote: REAL, ALLOCATABLE :: A(:,:), B(:,:), C(:,:), D(:,:), E(:,:), F(:,:) ! ... READ IN EXTENT OF ARRAYS ... READ*,N ! ... ALLOCATE ARRAYS ALLOCATE(A(N,N),B(N,N),C(N,N),D(N,N),E(N,N),F(N,N)) ! ... READ IN ARRAYS READ*,A,B C = A + B D = A * C E = B * EXP(D) F = C * LOG(E) where the four assignments all have the structure of loops like: DO I = 1, N DO J = 1, N X(J,I) = OP(A(J,I), B(J,I)) ENDDO ENDDO Obviously, this could benefit from loop fusion, by combining the four assignments in one loop. Provided that it were still possible to vectorize suitable portions, or N is known to be so large that cache locality outweighs vectorization. This raises the question of progress on vector math functions, as well as the one about relative alignments (or ignoring them in view of recent CPU designs).
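A hedged sketch, in C++ rather than Fortran, of the fusion being asked about: the four whole-array assignments each imply a full sweep over the data, and fusing them into one loop touches each element only once while it is still in cache.

```cpp
#include <cmath>
#include <vector>

// Unfused: four sweeps over the arrays, as the four Fortran assignments imply.
void unfused(const std::vector<double>& a, const std::vector<double>& b,
             std::vector<double>& c, std::vector<double>& d,
             std::vector<double>& e, std::vector<double>& f) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
    for (std::size_t i = 0; i < n; ++i) d[i] = a[i] * c[i];
    for (std::size_t i = 0; i < n; ++i) e[i] = b[i] * std::exp(d[i]);
    for (std::size_t i = 0; i < n; ++i) f[i] = c[i] * std::log(e[i]);
}

// Fused: one sweep, so each element stays in cache across all four operations.
void fused(const std::vector<double>& a, const std::vector<double>& b,
           std::vector<double>& c, std::vector<double>& d,
           std::vector<double>& e, std::vector<double>& f) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
        d[i] = a[i] * c[i];
        e[i] = b[i] * std::exp(d[i]);
        f[i] = c[i] * std::log(e[i]);
    }
}
```

Whether the fused form wins depends on exactly the trade-off named above: it still vectorizes element-wise, but the exp and log calls need vector math library support to keep the loop vectorized.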
Re: Need an assembler consult!
FX wrote: Hi all, I have picked up what seems to be a simple patch from PR 36399, but I don't know enough assembler to tell whether it's fixing it completely or not. The following function: #include <emmintrin.h> __m128i r(__m128 d1, __m128 d2, __m128 d3, __m128i r, int t, __m128i s) {return r+s;} is compiled by Apple's GCC into: pushl %ebp movl %esp, %ebp subl $72, %esp movaps %xmm0, -24(%ebp) movaps %xmm1, -40(%ebp) movaps %xmm2, -56(%ebp) movdqa %xmm3, -72(%ebp) # movdqa 24(%ebp), %xmm0 # paddq -72(%ebp), %xmm0 # leave ret Instead of the lines marked with #, FSF's GCC gives: movdqa 40(%ebp), %xmm1 movdqa 8(%ebp), %xmm0 paddq %xmm1, %xmm0 By fixing SSE_REGPARM_MAX in config/i386/i386.h (following Apple's compiler value), GCC now generates: movdqa %xmm3, -72(%ebp) movdqa 24(%ebp), %xmm0 movdqa -72(%ebp), %xmm1 paddq %xmm1, %xmm0 The first two lines are identical to Apple's, but the last two aren't. They seem OK to me, but I don't know enough assembler to be really sure. Could someone confirm the two are equivalent? Apparently the same as far as what is returned in xmm0.
Re: The "right way" to handle alignment of pointer targets in the compiler?
Benjamin Redelings I wrote: Hi, I have been playing with the GCC vectorizer and examining assembly code that is produced for dot products that are not for a fixed number of elements. (This comes up surprisingly often in scientific codes.) So far, the generated code is not faster than non-vectorized code, and I think that it is because I can't find a way to tell the compiler that the target of a double* is 16-byte aligned. From PR 27827 - http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827 : "I just quickly glanced at the code, and I see that it never uses "movapd" from memory, which is a key to getting decent performance." How many people would take advantage of special machinery for some old CPU, if that's your goal? Simplifying your example to: double f3(const double* p_, const double* q_, int n) { double sum = 0; for (int i = 0; i < n; ++i) sum += p_[i] * q_[i]; return sum; } On CPUs introduced in the last 2 years, movupd should be as fast as movapd, and -mtune=barcelona should work well in general, not only in this example. The bigger difference in performance, for longer loops, would come with further batching of sums, favoring loop lengths of multiples of 4 (or 8, with unrolling). That alignment already favors a fairly long loop. As you're using C++, it seems you could have used inner_product() rather than writing out a function. My Core i7 showed matrix multiply 25x25 times 25x100 producing 17 Gflops with gfortran in-line code. g++ produces about 80% of that.
Re: The "right way" to handle alignment of pointer targets in the compiler?
Benjamin Redelings I wrote: Thanks for the information! Here are several reasons (there are more) why gcc uses 64-bit loads by default: 1) For a single dot product, the rate of 64-bit data loads roughly balances the latency of adds to the same register. Parallel dot products (using 2 accumulators) would take advantage of faster 128-bit loads. 2) run-time checks to adjust alignment, if possible, don't pay off for loop counts < about 40. 3) several obsolete CPU architectures implemented 128-bit loads by pairs of 64-bit loads. 4) 64-bit loads were generally more efficient than movupd, prior to barcelona. In the case you quote, with parallel dot products, 128-bit loads would be required so as to show much performance gain over x87.
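The "parallel dot products (using 2 accumulators)" idea in point 1 can be sketched as follows (an illustrative rewrite, not gcc output): two independent summation chains hide the add latency and give the compiler pairs of adjacent elements that a single 128-bit load can fetch.

```cpp
// Two-accumulator dot product: s0 and s1 form independent dependence chains,
// so consecutive adds need not wait on each other.
double dot2(const double* p, const double* q, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += p[i] * q[i];
        s1 += p[i + 1] * q[i + 1];
    }
    if (i < n)                      // odd remainder element
        s0 += p[i] * q[i];
    return s0 + s1;
}
```

Note that this reassociates the summation, which is why gcc will only make such a transformation itself under -ffast-math (or -fassociative-math).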
Re: adding -fnoalias ... would a patch be accepted ?
torbenh wrote: can you please explain why you reject the idea of -fnoalias? msvc has __declspec(noalias); icc has -fnoalias. msvc needs it because it doesn't implement restrict and supports violation of typed aliasing rules as a default. ICL needs it for msvc compatibility, but has better alternatives. gcc can't copy the worst features of msvc.
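For reference, the finer-grained standard alternative alluded to here: gcc's C++ front end spells restrict as __restrict__, giving per-pointer no-alias guarantees instead of a whole-compilation-unit switch. A minimal sketch:

```cpp
// With __restrict__, the compiler may assume x and y never overlap, which
// permits vectorization of this loop without runtime overlap checks.
void axpy(double* __restrict__ y, const double* __restrict__ x,
          double a, int n) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```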
Re: speed of double-precision divide
Steve White wrote: I was under the misconception that each of these SSE operations was meant to be accomplished in a single clock cycle (although I knew there are various other issues.) Current CPU architectures permit an SSE scalar or parallel multiply and add instruction to be issued on each clock cycle. Completion takes at least 4 cycles for add, significantly more for multiply. The instruction timing tables quote throughput (how many cycles between issue) and latency (number of cycles to complete an individual operation). An even more common misconception than yours is that the extra time taken to complete multiply, compared with the time of add, would disappear with fused multiply-add instructions. SSE divide, as has been explained, is not pipelined. The best way to speed up a loop with divide is with vectorization, barring situations such as the one you brought up where divide may not actually be a necessary part of the algorithm.
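One common way to get divides out of a loop, in the spirit of the last sentence (a generic sketch, not tied to the original poster's code): when the divisor is loop-invariant, one hoisted divide plus a multiply per element replaces n non-pipelined divides. The result can differ in the last bit, which is why gcc only performs this substitution itself under -freciprocal-math (implied by -ffast-math).

```cpp
void scale_div(double* x, int n, double d) {
    for (int i = 0; i < n; ++i)
        x[i] /= d;                  // one high-latency, non-pipelined divide per iteration
}

void scale_recip(double* x, int n, double d) {
    const double r = 1.0 / d;       // single divide, hoisted out of the loop
    for (int i = 0; i < n; ++i)
        x[i] *= r;                  // multiplies, which pipeline well
}
```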
Re: Support for export keyword to use with C++ templates ?
On 2/2/10 7:19 PM, Richard Kenner wrote: I see that what I need is an assignment for all future changes. If my employer is not involved with any contributions of mine, the employer disclaimer is not needed, right ? It's safest to have it. The best way to prove that your employer is not involved with any contributions of yours is with such a disclaimer. Some employers have had a formal process for approving assignment of own-time contributions, as well as assignments as part of their business, and lack of either form of assignment indicates the employer has forbidden them. -- Tim Prince
Re: Starting an OpenMP parallel section is extremely slow on a hyper-threaded Nehalem
On 2/11/2010 2:00 AM, Edwin Bennink wrote: Dear gcc list, I noticed that starting an OpenMP parallel section takes a significant amount of time on Nehalem cpu's with hyper-threading enabled. If you think a question might be related to gcc, but don't know which forum to use, gcc-help is more appropriate. As your question is whether there is a way to avoid anomalous behaviors when an old Ubuntu is run on a CPU released after that version of Ubuntu, an Ubuntu forum might be more appropriate. A usual way is to shut off HyperThreading in the BIOS when running on a distro which has trouble with it. I do find your observation interesting. As far as I know, the oldest distro which works well on Core I7 is RHEL5.2 x86_64, which I run, with updated gcc and binutils, and HT disabled, as I never run applications which could benefit from HT. -- Tim Prince
Re: Change x86 default arch for 4.5?
On 2/18/2010 4:54 PM, Joe Buck wrote: But maybe I didn't ask the right question: can any x86 experts comment on recently made x86 CPUs that would not function correctly with code produced by --with-arch=i486? Are there any? All CPUs still in production are at least SSE3 capable, unless someone can come up with one of which I'm not aware. Intel compilers made the switch last year to requiring SSE2 capability for the host, as well as in the default target options, even for 32-bit. All x86_64 or X64 CPUs for which any compiler was produced had SSE2 capability, so it is required for those 64-bit targets. -- Tim Prince
Re: [RFH] A simple way to figure out the number of bits used by a long double
On 2/26/2010 5:44 AM, Ed Smith-Rowland wrote: Huh. I would have *sworn* that sizeof(long double) was 10 not 16 even though we know it was 80 bits. As you indicated before, sizeof gives the amount of memory displaced by the object, including padding. In my experience with gcc, sizeof(long double) is likely to be 12 on 32-bit platforms, and 16 on 64-bit platforms. These choices are made to preserve alignment for 32-bit and 128-bit objects respectively, and to improve performance in the 64-bit case, for hardware which doesn't like to straddle cache lines. It seems the topic would have been more appropriate for gcc-help, if related to gcc, or maybe comp.lang.c, if a question about implementation in accordance with standard C. -- Tim Prince
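The distinction drawn above can be checked directly: sizeof reports the padded storage, while <cfloat> reports the precision actually carried (64 mantissa bits for x87 extended). A small check, with the exact sizeof value left ABI-dependent:

```cpp
#include <cfloat>
#include <cstdio>

// Storage vs. precision: sizeof(long double) includes alignment padding
// (commonly 12 bytes on 32-bit x86, 16 on x86-64, as described above),
// while LDBL_MANT_DIG gives the real mantissa width (64 for 80-bit x87).
void report_long_double() {
    std::printf("sizeof(long double) = %zu bytes, mantissa = %d bits\n",
                sizeof(long double), LDBL_MANT_DIG);
}
```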
Re: legitimate parallel make check?
On 3/9/2010 4:28 AM, IainS wrote: It would be nice to allow the apparently independent targets [e.g. gcc-c,fortran,c++ etc.] to be (explicitly) make-checked in parallel. On certain targets, it has been necessary to do this explicitly for a long time, submitting make check-gcc, make check-fortran, make check-g++ separately. Perhaps a script could be made which would detect when the build is complete, then submit the separate make check serial jobs together. -- Tim Prince
Re: GCC vs ICC
On 3/22/2010 7:46 PM, Rayne wrote: Hi all, I'm interested in knowing how GCC differs from Intel's ICC in terms of the optimization levels and catering to specific processor architecture. I'm using GCC 4.1.2 20070626 and ICC v11.1 for Linux. How does ICC's optimization levels (O1 to O3) differ from GCC, if they differ at all? The ICC is able to cater specifically to different architectures (IA-32, intel64 and IA-64). I've read that GCC has the -march compiler option which I think is similar, but I can't find a list of the options to use. I'm using Intel Xeon X5570, which is 64-bit. Are there any other GCC compiler options I could use that would cater my applications for 64-bit Intel CPUs? Some of that seems more topical on the Intel software forum for icc, and the following more topical on either that forum or gcc-help, where you should go for follow-up. If you are using gcc on Xeon 5570, gcc -mtune=barcelona -ffast-math -O3 -msse4.2 might be a comparable level of optimization to icc -xSSE4.2 For gcc 4.1, you would have to set also -ftree-vectorize, but you would be better off with a current version. But, if you are optimizing for early Intel 64-bit Xeon, -mtune=barcelona would not be consistently good, and you could not use -msse4 or -xSSE4.2. For optimization which observes standards and also disables vectorized sum reduction, you would omit -ffast-math for gcc, and set icc -fp-model source. -- Tim Prince
Re: Compiler option for SSE4
On 3/23/2010 11:02 PM, Rayne wrote: I'm using GCC 4.1.2 20070626 on a server with Intel Xeon X5570. How do I turn on the compiler option for SSE4? I've tried -msse4, -msse4.1 and -msse4.2, but they all returned the error message cc1: error: unrecognized command line option "-msse4.1" (for whichever option I tried). You would need a gcc version which supports sse4. As you said yourself, your version is approaching 3 years old. Actually, the more important option for Xeon 55xx, if you are vectorizing, is the -mtune=barcelona, which has been supported for about 2 years. Whether vectorizing or not, on an 8 core CPU, the OpenMP introduced in gcc 4.2 would be useful. This looks like a gcc-help mail list question, which is where you should submit any follow-up. -- Tim Prince
Re: Optimizing floating point *(2^c) and /(2^c)
On 3/29/2010 10:51 AM, Geert Bosch wrote: On Mar 29, 2010, at 13:19, Jeroen Van Der Bossche wrote: I've recently written a program where taking the average of 2 floating point numbers was a real bottleneck. I've looked into the assembly generated by gcc -O3 and apparently gcc treats multiplication and division by a hard-coded 2 like any other multiplication with a constant. I think, however, that *(2^c) and /(2^c) for floating points, where the c is known at compile-time, should be able to be optimized with the following pseudo-code: e = exponent bits of the number if (e > c && e < (0b111...11)-c) { e += c or e -= c } else { do regular multiplication } Even further optimizations may be possible, such as bitshifting the significand when e=0. However, that would require checking for a lot of special cases and require so many conditional jumps that it's most likely not going to be any faster. I'm not skilled enough with assembly to write this myself and test if this actually performs faster than how it's implemented now. Its performance will most likely also depend on the processor architecture, and I could only test this code on one machine. Therefore I ask those who are familiar with gcc's optimization routines to give this 2 seconds of thought, as this is probably rather easy to implement and many programs could benefit from this. For any optimization suggestions, you should start with showing some real, compilable, code with a performance problem that you think the compiler could address. Please include details about compilation options, GCC versions and target hardware, as well as observed performance numbers. How do you see that averaging two floating point numbers is a bottleneck? This should only be a single addition and multiplication, and will execute in a nanosecond or so on a moderately modern system. Your particular suggestion is flawed. Floating-point multiplication is very fast on most targets.
It is hard to see how on any target with floating-point hardware, manual mucking with the representation can be a win. In particular, your sketch doesn't at all address underflow and overflow. Likely a complete implementation would be many times slower than a floating-point multiply. -Geert gcc used to have the ability to replace division by a power of 2 by an fscale instruction, for appropriate targets (maybe still does). Such targets have nearly disappeared from everyday usage. What remains is the possibility of replacing the division by a constant power of 2 by multiplication, but it's generally considered the programmer should have done that in the beginning. icc has such a facility, but it's subject to -fp-model=fast (equivalent to gcc -ffast-math -fno-cx-limited-range), even though it's a totally safe conversion. As Geert indicated, it's almost inconceivable that a correct implementation which takes care of exceptions could match the floating point hardware performance, even for a case which starts with operands in memory (but you mention the case following an addition). -- Tim Prince
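For code that genuinely wants exponent-adjustment scaling, the portable route is std::ldexp, which scales by a power of two and correctly handles the overflow, underflow, and subnormal cases that a hand-rolled exponent hack would have to special-case. A sketch:

```cpp
#include <cmath>

// std::ldexp(x, k) computes x * 2^k, typically by exponent manipulation,
// while still handling overflow, underflow, and subnormal inputs correctly.
double halve(double x)   { return std::ldexp(x, -1); }  // x / 2
double times16(double x) { return std::ldexp(x, 4);  }  // x * 16
```

Scaling by a power of two is exact for any finite value that stays in range, so these return bit-identical results to the plain multiply or divide.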
Re: GCC primary/secondary platforms?
On 4/7/2010 9:17 AM, Gary Funck wrote: On 04/07/10 11:11:05, Diego Novillo wrote: Additionally, make sure that the branch bootstraps and tests on all primary/secondary platforms with all languages enabled. Diego, thanks for your prompt reply and suggestions. Regarding the primary/secondary platforms. Are those listed here? http://gcc.gnu.org/gcc-4.5/criteria.html Will there be a notification if and when C++ run-time will be ready to test on secondary platforms, or will platforms like cygwin be struck from the secondary list? I'm 26 hours into testsuite for 4.5 RC for cygwin gcc/gfortran, didn't know of any other supported languages worth testing. My ia64 box died a few months ago, but suse-linux surely was at least as popular as unknown-linux in recent years. -- Tim Prince
Re: GCC primary/secondary platforms?
On 4/8/2010 2:40 PM, Dave Korn wrote: On 07/04/2010 19:47, Tim Prince wrote: Will there be a notification if and when C++ run-time will be ready to test on secondary platforms, or will platforms like cygwin be struck from the secondary list? What exactly are you talking about? Libstdc++-v3 builds just fine on Cygwin. Our release criteria for the secondary platforms is: * The compiler bootstraps successfully, and the C++ runtime library builds. * The DejaGNU testsuite has been run, and a substantial majority of the tests pass. We pass both those criteria with flying colours. What are you worrying about? cheers, DaveK No one answered questions about why libstdc++ configure started complaining about mis-match in style of wchar support a month ago. Nor did I see anyone give any changes in configure procedure. Giving it another try at a new download today. -- Tim Prince
Re: GCC primary/secondary platforms?
On 4/8/2010 6:24 PM, Dave Korn wrote: Nor did I see anyone give any changes in configure procedure. Giving it another try at a new download today. Well, nothing has changed, but then again I haven't seen anyone else complaining about this, so there's probably some problem in your build environment; let's see what happens with your fresh build. (I've built the 4.5.0-RC1 candidate without any complications and am running the tests right now.) Built OK this time around, no changes here either, except for cygwin1 update. testsuite results in a couple of days. Thanks. -- Tim Prince
Re: Why not contribute? (to GCC)
On 4/23/2010 1:05 PM, HyperQuantum wrote: On Fri, Apr 23, 2010 at 9:58 PM, HyperQuantum wrote: On Fri, Apr 23, 2010 at 8:39 PM, Manuel López-Ibáñez wrote: What reasons keep you from contributing to GCC? The lack of time, for the most part. I submitted a feature request once. It's now four years old, still open, and the last message it received was two years ago. (PR26061) The average time for acceptance of a PR with a patch submission from an outsider such as ourselves is over 2 years, and by then the patch no longer fits, has to be reworked, and is about to become moot. I still have the FSF paperwork in force, as far as I know, from over a decade ago, prior to my current employment. Does it become valid again upon termination of employment? My current employer has no problem with the FSF paperwork for employees whose primary job is maintenance of gnu software (with committee approval), but this does not extend to those of us for whom it is a secondary role. There once was a survey requesting responses on how our FSF submissions compared before and after current employment began, but no summary of the results was ever published. -- Tim Prince
Re: Autovectorizing does not work with classes
Georg Martius wrote: > Dear gcc developers, > > I am new to this list. > I tried to use the auto-vectorization (4.2.1 (SUSE Linux)) but unfortunately > with limited success. > My code is basically a matrix library in C++. The vectorizer does not like > the member variables. Consider this code compiled with > gcc -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5 > -funsafe-math-optimizations > that gives basically "not vectorized: unhandled data-ref" > > class P{ > public: > P() : m(5),n(3) { > double *d = data; > for (int i=0; i<m*n; i++) > d[i] = i/10.2; > } > void test(const double& sum); > private: > int m; > int n; > double data[15]; > }; > > void P::test(const double& sum) { > double *d = this->data; > for(int i=0; i<m*n; i++) { > d[i]+=sum; > } > } > > whereas the more or less equivalent C version works just fine: > > int m=5; > int n=3; > double data[15]; > > void test(const double& sum) { > int mn = m*n; > for(int i=0; i<mn; i++) { > data[i]+=sum; > } > } > > > Is there a fundamental problem in using the vectorizer in C++? > I don't see any C code above. As another reply indicated, the most likely C idiom would be to pass sum by value. Alternatively, you could use a local copy of sum, in cases where that is a problem. The only fundamental vectorization problem I can think of which is specific to C++ is the lack of a standard restrict keyword. In g++, __restrict__ is available. A local copy (or value parameter) of sum avoids a need for the compiler to recognize const or restrict as an assurance of no value modification. The loop has to have known fixed bounds at entry, in order to vectorize. If your C++ style doesn't support that, e.g. by calculating the end value outside the loop, as you show in your latter version, then you do have a problem with vectorization.
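Putting the two suggestions together (pass sum by value, hoist the loop bound into a local), a version of the quoted class that gives the vectorizer a fixed trip count and no possible aliasing of sum with the data, offered as a sketch rather than a verified vectorization result:

```cpp
class P {
public:
    P() : m(5), n(3) {
        const int mn = m * n;       // bound fixed before the loop
        for (int i = 0; i < mn; ++i)
            data[i] = i / 10.2;
    }
    void test(double sum) {         // by value: cannot alias data or change mid-loop
        double* d = data;
        const int mn = m * n;       // hoisted, as in the working C version
        for (int i = 0; i < mn; ++i)
            d[i] += sum;
    }
    double at(int i) const { return data[i]; }
private:
    int m;
    int n;
    double data[15];
};
```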
Re: question. type long long
Александр Струняшев wrote: > Good afternoon. > I need some help. As from what versions your compiler understand that > "long long" is 64 bits ? > > Best regards, Alexander > > P.S. Sorry for my mistakes, I know English bad. No need to be sorry about English, but the topic is OK for gcc-help, not gcc development. gcc was among the first compilers to support long long (always as 64-bit), the only problem being that it was a gnu extension for g++. In that form, the usage may not have settled down until g++ 4.1. The warnings for attempting long long constants in 32-bit mode, without the LL suffix, have been a subject of discussion: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13358 The warning doesn't mean that long long could be less than 64 bits; it means the constant without the LL suffix is less than 64 bits.
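The distinction in the last sentence fits in two lines: the type is 64 bits regardless, but an unsuffixed constant gets its own type before any assignment happens, so a constant wider than long needs the LL suffix on 32-bit targets. A minimal example:

```cpp
// long long has been 64 bits in gcc from the start; the LL suffix is about
// the constant's own type, not the variable's.
static_assert(sizeof(long long) * 8 >= 64, "long long is at least 64 bits");

const long long big = 0x100000000LL;   // 2^32: needs LL on ILP32 targets
```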
Re: Backward Compatibility of RHEL Advanced Server and GCC
Steven Bosscher wrote: On Wed, Oct 29, 2008 at 6:19 AM, S. Suhasini <[EMAIL PROTECTED]> wrote: We would like to know whether the new version of the software (compiled with the new GCC) can be deployed and run on the older setup with RHEL AS 3 and GCC 2.96. We need not compile again on the older setup. Will there be any run-time libraries dependency? Would be very grateful if we get a response for this query. It seems to me that this kind of question is best asked on a RedHat support list, not on a list where compiler development is discussed. FWIW, there is no "official" GCC 2.96, see http://gcc.gnu.org/gcc-2.96.html. This might be partially topical on the gcc-help list. If dynamic libraries are in use, there will be trouble.
Re: Cygwin support
Brian Dessent wrote: > Cygwin has been a secondary target for a number of years. MinGW has > been a secondary target since 4.3. This generally means that they > should be in fairly good shape, more or less. To quote the docs: > >> Our release criteria for the secondary platforms is: >> >> * The compiler bootstraps successfully, and the C++ runtime library >> builds. >> * The DejaGNU testsuite has been run, and a substantial majority of the >> tests pass. > > > More recently I've seen Danny Smith report that the IRA merge broke > MinGW (and presumably Cygwin, since they share most of the same code) > bootstrap. I haven't tested this myself recently so I don't know if > it's still broken or not. > I've run the bootstrap and testsuite twice in the last month. The bootstrap failures are due to a broken #ifdef specific to cygwin in the headers provided with cygwin, the requirement for a specific version of autoconf (not available in setup), and the need to remove the -Werror in libstdc++ build (because of minor discrepancies in cygwin headers). All of those are easy to rectify, but fixes seem unlikely to be considered by the decision makers. However, the C++ testsuite results are unacceptable, with many internal errors. For some time now, gfortran has been broken for practical purposes, even when it passes testsuite, as it seems to have a memory leak. This shows up in the public wiki binaries. So, there are clear points for investigation of cygwin problems, and submission of PRs, should you be interested. > Running the dejagnu testsuite on Cygwin is > excruciatingly slow due to the penalty incurred from emulating fork. It runs over a weekend on a Pentium D which I brought back to life by replacing the CPU cooler system.
I have no problem with running this if I am in the office when the snapshot is released, but I think there is little interest in fixing the problems which are specific to g++ on cygwin, yet working gcc and gfortran aren't sufficient for gcc upgrades to be accepted. Support for 64-bit native looks like it will be limited to mingw, so I no longer see a future for gcc on cygwin.
Re: Purpose of GCC Stack Padding?
Andrew Tomazos wrote: I've been studying the x86 compiled form of the following function: void function() { char buffer[X]; } where X = 0, 1, 2 .. 100 Naively, I would expect to see: pushl %ebp movl %esp, %ebp subl $X, %esp leave ret Instead, the stack appears to be padded: For a buffer size of 0 the stack size is 0 For a buffer size of 1 to 7 the stack size is 16 For a buffer size of 8 to 12 the stack size is 24 For a buffer size of 13 to 28 the stack size is 40 For a buffer size of 29 to 44 the stack size is 56 For a buffer size of 45 to 60 the stack size is 72 For a buffer size of 61 to 76 the stack size is 88 For a buffer size of 77 to 92 the stack size is 104 For a buffer size of 93 to 100 the stack size is 120 When X >= 8 gcc adds a stack corruption check (__stack_chk_fail), which accounts for an extra 4 bytes of stack space in these cases. This does not explain the rest of the padding. Can anyone explain the purpose of the rest of the padding? This looks like more of a gcc-help question, trying to move the thread there. Unless you override defaults with -mpreferred-stack-boundary (or -Os, which probably implies a change in stack boundary), or ask for a change on the basis of making a leaf function, you are generating alignment compatible with the use of SSE parallel instructions. The stack, then, must be 16-byte aligned before entry and at exit, and also a buffer of 16 bytes or more must be 16-byte aligned. I believe there is a move afoot to standardize the treatment for the most common x86 32-bit targets; that was done at the beginning for 64-bit. Don't know if you are using x86 to imply 32-bit, in accordance with Windows terminology.
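The 16-byte requirement behind those numbers can be made explicit in source (a hypothetical illustration; gcc's stack padding provides this implicitly for buffers of 16 bytes or more): SSE movaps faults on misaligned addresses, so anything it may touch must sit on a 16-byte boundary.

```cpp
#include <cstdint>

// A 16-byte-aligned local buffer, like the padded stack slots discussed
// above; alignas makes the alignment guarantee explicit and portable.
bool stack_buffer_is_aligned() {
    alignas(16) char buffer[16] = {};
    return reinterpret_cast<std::uintptr_t>(buffer) % 16 == 0;
}
```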
Re: Upgrade to GCC.4.3.2
Philipp Thomas wrote: > On Sun, 28 Dec 2008 14:24:22 -0500, you wrote: > >> I have SLES9 and Linux-2.6.5-7.97 kernel install on i586 intel 32 bit >> machine. The compiler is gcc-c++3.3.3-43.24. I want to upgrade to >> GCC4.3.2. My question are: Would this upgrade work with >> SLES9? > > This is the wrong list for such questions. You should try a SUSE > specific list like opens...@opensuse.org or > opensuse-programm...@opensuse.org gcc-help is a reasonable choice as well.
Re: gcc binary download
Tobias Burnus wrote: > > Otherwise, you could consider building GCC yourself, cf. > http://gcc.gnu.org/install/. (Furthermore, some gfortran developers > offer regular GCC builds, which are linked at > http://gcc.gnu.org/wiki/GFortranBinaries; those are all unofficial > builds, come without any warranty/support, and due to, e.g., library > issues they may not work on your system.) > I believe the wiki builds include C and Fortran, but not C++, in view of the additional limitations in supporting a new g++ on a reasonable range of targets. Even so, there may be minimum requirements on glibc and binutils versions.
Re: Binary Autovectorization
Rodrigo Dominguez wrote: > I am looking at binary auto-vectorization or taking a binary and rewriting > it to use SIMD instructions (either statically or dynamically). That's a tall order, considering how much source level dependency information is needed. I don't know whether proprietary binary translation projects currently under way promise to add vectorization, or just to translate SIMD vector code to new ISA.
Re: -mfpmath=sse,387 is experimental ?
Zuxy Meng wrote: > Hi, > > "Timothy Madden" wrote in message: >> I am sure having twice the number of registers (sse+387) would make a >> big difference. You're not counting the rename registers, you're talking about 32-bit mode only, and you're discounting the different mode of accessing the registers. >> >> How would I know if my AMD Sempron 2200+ has separate execution units >> for SSE and >> FPU instructions, with independent registers ? > > Most CPUs use the same FP unit for both x87 and SIMD operations so it > wouldn't give you double the performance. The only exception I know of > is K6-2/3, whose x87 and 3DNow! units are separate. > -march=pentium-m observed the preference of those CPUs for mixing the types of code. This was due more to the limited issue rate for SSE instructions than to the expanded number of registers in use. You are welcome to test it on your CPU; however, AMD CPUs were designed to perform well with SSE alone, particularly in 64-bit mode.
Re: GCC 4.4.0 Status Report (2009-03-13)
Chris Lattner wrote: > > On Mar 23, 2009, at 8:02 PM, Jeff Law wrote: > >> Chris Lattner wrote: > These companies really don't care about FOSS in the same way GCC developers do. I'd be highly confident that this would still be a serious issue for the majority of the companies I've interacted with through the years. >>> >>> Hi Jeff, >>> >>> Can you please explain the differences you see between how GCC >>> developers and other people think about FOSS? I'm curious about your >>> perception here, and what basis it is grounded on. >>> >> I'd divide customers into two broad camps. Both camps are extremely >> pragmatic, but they're focused on two totally different goals. > > Thanks Jeff, I completely agree with you. Those camps are very common > in my experience as well. Do you consider GCC developers to fall into > one of these two categories, or do you see them as having a third > perspective? I know that many people have their own motivations and > personal agenda (and it is hard to generalize) but I'm curious what you > meant above. > > Thanks! > > -Chris > >> >> >> The first camp sees FOSS toolkits as a means to help them sell more >> widgets, typically processors & embedded development kits. Their >> belief is that a FOSS toolkit helps build a developer eco-system >> around their widget, which in turn spurs development of consumable >> devices which drive processor & embedded kit sales. The key for >> these guys is free, as in beer, widely available tools. The fact that >> the compiler & assorted utilities are open-source is largely irrelevant. >> >> The second broad camp I run into regularly are software developers >> themselves building applications, most often for internal use, but >> occasionally they're building software that is then licensed to their >> customers. They'd probably describe the compiler & associated >> utilities as a set of hammers, screwdrivers and the like -- they're >> just as happy using GCC as any other compiler so long as it works. 
>> The fact that the GNU tools are open source is completely irrelevant >> to these guys. They want to see standards compliance, abi >> interoperability, and interoperability with other tools (such as >> debuggers, profilers, guis, etc). They're more than willing to swap >> out one set of tools for another if it gives them some advantage. >> Note that an advantage isn't necessarily compile-time or runtime >> performance -- it might be ease of use, which they believe allows >> their junior level engineers to be more effective (this has come up >> consistently over the last few years). >> >> Note that in neither case do they really care about the open-source >> aspects of their toolchain (or for the most part the OS either). >> They may (and often do) like the commoditization of software that FOSS >> tends to drive, but don't mistake that for caring about the open >> source ideals -- it's merely cost-cutting. >> >> Jeff >> >> > Software developers I deal with use gcc because it's a guaranteed included part of the customer platforms they are targeting. They're generally looking for a 20% gain in performance plus support before adopting commercial alternatives. The GUIs they use don't live up to the advertisements about ease of use. This doesn't necessarily put them in either of Jeff's camps. Tim
Re: Minimum GMP/MPFR version bumps for GCC-4.5
Kaveh R. Ghazi wrote: > What versions of GMP/MPFR do you get on > your typical development box and how old are your distros? > OpenSuSE 10.3 (originally released Oct. 07): gmp-devel-4.2.1-58 gmp-devel-32bit-4.2.1-58 mpfr-2.2.1-45
Re: heise.de comment on 4.4.0 release
Tobias Burnus wrote: > Toon Moene wrote: Can somebody with access to SPEC sources confirm / deny and file a bug report, if appropriate? I just started working on SPEC CPU2006 issues this week. > Seemingly yes. To a certain extent this was by accident as "-msse3" was > used, but it is on i586 only effective with -mfpmath=sse (that is not > completely obvious). By the way, my tests using the Polyhedron benchmark > show that for 32bit, x87 and SSE are similarly fast, depending a lot on > the test case, thus it does not slow down the benchmark too much. Certain AMD CPUs had shorter latencies for scalar single precision sse, but generally the advantage of sse comes from vectorization. > > If I understood correctly, the 32bit mode was used since the 64bit mode > needs more than the available 2GB memory. Certain commercial compilers make an effort to switch to 32-bit mode automatically on several CPU2006 benchmarks, as they are too small to run as fast in 64-bit mode. > > Similarly, the option -funroll-loops was avoided as they expect that > unrolling badly interacts with the small cache Atom processors have. > (That CPU2006 runs that long does not make testing different options > that easy.) I'm surprised that SPEC 2006 is considered relevant to Atom. The entire thing (base only) has been running under 10 hours on a dual quad core system. I've heard several times the sentiment that there ought to be an "official" harness to run a single test, trying various options. > I would have liked that the options were reported. For instance, > -ffast-math was not used out of fear that it results in too imprecise > results, causing SPEC to abort. (Admittedly, I'm also careful with that > option, though I assume that -ffast-math works for SPEC.) On the other > hand, certain flags implied by -ffast-math are already applied with -O1 > in some commercial compilers. SPEC probably has been the biggest driver for inclusion of insane options at default in commercial compilers. 
It's certainly not an example of acceptable practice in writing portable code. I have yet to find a compiler which didn't fail at least one SPEC test, and I don't blame the compilers. There are dependencies on unusual C++ extensions, which somehow weren't noticed before, examples of using "f77" as an excuse for hiding one's intentions, and expectations of optimizations which have little relevance for serious applications. > > David Korn wrote: >> They accused us of a too-hasty release. My irony meter exploded! Anyway, a fault in support for a not-so-open benchmark application seems even less relevant in an open source effort than it is to compilers which depend on ranking for sales success.
Re: Bootstrap broken by ppl/cloog config problem: finds non-system/non-standard "/include" dir
Dave Korn wrote: > > Heh, I was just about to post that, only I was looking at $clooginc rather > than $pplinc! The same problem exists for both; I'm pretty sure we should > fall back on $prefix if the --with option is empty. > When I bootstrapped gcc 4.5 on cygwin yesterday, configure recognized the newly installed ppl, but not the cloog. The bootstrap completed successfully, and I'm not looking a gift horse in the mouth.
Re: Bootstrap broken by ppl/cloog config problem: finds non-system/non-standard "/include" dir
Dave Korn wrote: > Tim Prince wrote: >> Dave Korn wrote: >> >>> Heh, I was just about to post that, only I was looking at $clooginc rather >>> than $pplinc! The same problem exists for both; I'm pretty sure we should >>> fall back on $prefix if the --with option is empty. >>> >> When I bootstrapped gcc 4.5 on cygwin yesterday, configure recognized the >> newly installed ppl, but not the cloog. The bootstrap completed >> successfully, and I'm not looking a gift horse in the mouth. > > You don't have a bogus /include dir, but I bet you'll find -I/include in > PPLINC. > > It would be interesting to know why it didn't spot cloog. What's in your > top-level $objdir/config.log? > #include no such file -I/include was set by configure. As you say, there is something bogus here. setup menu shows cloog installed in development category, but I can't find any such include file. Does this mean the cygwin distribution of cloog is broken?
Re: Bootstrap broken by ppl/cloog config problem: finds non-system/non-standard "/include" dir
Dave Korn wrote: Tim Prince wrote: #include no such file -I/include was set by configure. As you say, there is something bogus here. setup menu shows cloog installed in development category, but I can't find any such include file. Does this mean the cygwin distribution of cloog is broken? Did you make sure to get the -devel packages as well as the libs? That's the usual cause of this kind of problem. I highly recommend the new version of setup.exe that has a package-list search box :-) cheers, DaveK OK, I see there is a libcloog-devel in addition to the cloog Dev selection, guess that will fix it for cygwin. I tried to build cloog for IA64 linux as well, gave up on include file parsing errors.
Re: [Fwd: Failure in bootstrapping gfortran-4.5 on Cygwin]
Ian Lance Taylor wrote: Angelo Graziosi writes: The current snapshot 4.5-20090507 fails to bootstrap on Cygwin: It did bootstrap effortlessly for me, once I logged off to clear hung processes, with the usual disabling of strict warnings. I'll let testsuite run over the weekend.
Re: Failure building current 4.5 snapshot on Cygwin
Angelo Graziosi wrote: > I want to flag the following failure I have seen on Cygwin 1.5 trying to > build current 4.5-20090625 gcc snapshot: > checking whether the C compiler works... configure: error: in > `/tmp/build/intl': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details. I met the same failure on Cygwin 1.7 with yesterday's and last week's snapshots. I didn't notice that it refers to intl/config.log, so will go back and look, as you didn't show what happened there. On a slightly related subject, I have shown that the libgfortran.dll.a and libgomp.dll.a are broken on cygwin builds, including those released for cygwin, as shown by the test case I submitted on the cygwin list earlier this week. The --enable-shared option has never been satisfactory for gfortran on cygwin.
Re: Failure building current 4.5 snapshot on Cygwin
Dave Korn wrote: Angelo Graziosi wrote: I want to flag the following failure I have seen on Cygwin 1.5 trying to build current 4.5-20090625 gcc snapshot: So what's in config.log? And what binutils are you using? cheers, DaveK In my case, it says no permission to execute a.exe. However, I can run the intl configure and make from the command line. When I do that, and attempt to restart stage 2, it stops in libiberty, and again I have to execute the steps from the command line.
Re: Failure building current 4.5 snapshot on Cygwin
Kai Tietz wrote: 2009/6/26 Seiji Kachi : Angelo Graziosi wrote: Dave Korn ha scritto: Angelo Graziosi wrote: I want to flag the following failure I have seen on Cygwin 1.5 trying to build current 4.5-20090625 gcc snapshot: So what's in config.log? And what binutils are you using? The config logs are attached, while binutils is the current in Cygwin-1.5, i.e. 20080624-2. Cheers, Angelo. I have also seen a similar failure, and the reason in my environment is as follows. (1) In my case, the gcc build completes successfully. But a.exe, which is compiled by the new compiler, fails. Error message is $ ./a.exe bash: ./a.exe: Permission denied Source code of a.exe is quite simple: main() { printf("Hello\n"); } (2) This failure occurs from gcc trunk r148408. r148407 is OK. (3) r148408 removed "#ifdef DEBUG_PUBTYPES_SECTION". r148407 does not generate a debug_pubtypes section, but r148408 and later versions generate a debug_pubtypes section in the object when we set the debug option. (4) The gcc build sequence usually uses the debug option. (5) My cygwin environment seems not to accept the debug_pubtypes section, and pops up a "Permission denied" error. When I reverted "#ifdef DEBUG_PUBTYPES_SECTION" in dwarf2out.c, the failure disappeared. Does this failure occur only on cygwin? Regards, Seiji Kachi No, this bug appeared on all windows pe-coff targets. A fix for this was already checked in yesterday on binutils. Could you try it with the current binutils head version? Cheers, Kai Is this supposed to be sufficient information for us to find that binutils? I may be able to find an insider colleague, otherwise I would have no chance.
Re: Failure building current 4.5 snapshot on Cygwin
Kai Tietz wrote: 2009/6/26 Tim Prince : [snip: quoted text repeated from the previous message] Is this supposed to be sufficient information for us to find that binutils? I may be able to find an insider colleague, otherwise I would have no chance. Hello, you can find the binutils project as usual under http://sources.redhat.com/binutils/ . You can find on this page how you are able to get the current cvs version of binutils. This project contains the gnu tools, like dlltool, as, objcopy, ld, etc. 
The issue you are running into is caused by a failure in binutils to set correct section flags for debugging sections. It was exposed by the last change in gcc, the output of the .debug_pubtypes section. There is a patch already applied to the binutils repository head, which should solve the issue described here in this thread. We at mingw-w64 already ran into this issue and have taken care of it. Cheers, Kai My colleague suggested building and installing last week's binutils release. I did so, but it didn't remove the need to run each stage 2 configure individually from the command line. Thanks, Tim
Re: random numbers
ecrosbie wrote: how do I generate random numbers in a f77 program? Ed Crosbie
Re: random numbers
ecrosbie wrote: how do I generate random numbers in a f77 program? Ed Crosbie This subject isn't topical on the gcc development forum. If you wish to use a gnu Fortran random number generator, please consider gfortran, which implements the language standard random number facility. http://gcc.gnu.org/onlinedocs/gcc-4.4.0/gfortran/ questions might be asked on the gfortran list (follow-up set) or comp.lang.fortran In addition, you will find plenty of other advice by using your web browser.
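For reference, the standard facility gfortran implements is the RANDOM_NUMBER intrinsic (with RANDOM_SEED to initialize it); a minimal sketch in modern free-form source rather than f77:

```fortran
program rng_demo
  implicit none
  real :: r(5)
  call random_seed()       ! initialize the generator (implementation-defined default)
  call random_number(r)    ! fill r with uniform deviates in [0,1)
  print *, r
end program rng_demo
```

Both intrinsics are part of the Fortran standard, so this is portable across standard-conforming compilers, not just gfortran.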
Re: optimizing a DSO
On 5/28/2010 11:14 AM, Ian Lance Taylor wrote: Quentin Neill writes: A little off topic, but by what facility does the compiler know the linker (or assembler for that matter) is gnu? When you run configure, you can specify --with-gnu-as and/or --with-gnu-ld. If you do, the compiler will assume the GNU assembler or linker. If you do not, the compiler will assume that you are not using the GNU assembler or linker. In this case the compiler will normally use the common subset of command line options supported by the native assembler and the GNU assembler. In general that only affects the compiler behaviour on platforms which support multiple assemblers and/or linkers. E.g., on GNU/Linux, we always assume the GNU assembler and linker. There is an exception. If you use --with-ld, the compiler will run the linker with the -v option and grep for GNU in the output. If it finds it, it will assume it is the GNU linker. The reason for this exception is that --with-ld gives a linker which will always be used. The assumption when no specific linker is specified is that you might wind up using any linker available on the system, depending on the value of PATH when running the compiler. Ian Is it reasonable to assume when the configure test reports using GNU linker, it has taken that "exception," even without a --with-ld specification? -- Tim Prince
Re: gcc command line exceeds 8191 when building in XP
On 7/19/2010 4:13 PM, IceColdBeer wrote: Hi, I'm building a project using GNU gcc, but the command line used to build each source file sometimes exceeds 8191 characters, which is the maximum supported command line length under Win XP. Even worse under Win 2000, where the maximum command line length is limited to 2047 characters. Can the GNU gcc read the build options from a file instead? I have searched, but cannot find an option in the documentation. Thanks in advance, ICB Redirecting to gcc-help. The gcc builds for Windows themselves use a scheme for splitting the link into multiple steps in order to deal with command line length limits. I would suggest adapting that. Can't study it myself now while travelling. -- Tim Prince
Re: x86 assembler syntax
On 8/8/2010 10:21 PM, Rick C. Hodgin wrote: All, Is there an Intel-syntax compatible option for GCC or G++? And if not, why not? It's so much cleaner than AT&T's. - Rick C. Hodgin I don't know how you get along without a search engine. What about http://tldp.org/HOWTO/Assembly-HOWTO/gas.html ? -- Tim Prince
Re: food for optimizer developers
On 8/10/2010 9:21 PM, Ralf W. Grosse-Kunstleve wrote: Most of the time is spent in this function... void dlasr( str_cref side, str_cref pivot, str_cref direct, int const& m, int const& n, arr_cref c, arr_cref s, arr_ref a, int const& lda) in this loop: FEM_DOSTEP(j, n - 1, 1, -1) { ctemp = c(j); stemp = s(j); if ((ctemp != one) || (stemp != zero)) { FEM_DO(i, 1, m) { temp = a(i, j + 1); a(i, j + 1) = ctemp * temp - stemp * a(i, j); a(i, j) = stemp * temp + ctemp * a(i, j); } } } a(i, j) is implemented as T* elems_; // member T const& operator()( ssize_t i1, ssize_t i2) const { return elems_[dims_.index_1d(i1, i2)]; } with ssize_t all[Ndims]; // member ssize_t origin[Ndims]; // member size_t index_1d( ssize_t i1, ssize_t i2) const { return (i2 - origin[1]) * all[0] + (i1 - origin[0]); } The array pointer is buried as elems_ member in the arr_ref<> class template. How can I apply __restrict in this case? Do you mean you are adding an additional level of functions and hoping for efficient in-lining? Your programming style is elusive, and your insistence on top posting will make this thread difficult to deal with. The conditional inside the loop likely is even more difficult for C++ to optimize than Fortran. As already discussed, if you don't optimize otherwise, you will need __restrict to overcome aliasing concerns among a,c, and s. If you want efficient C++, you will need a lot of hand optimization, and verification of the effect of each level of obscurity which you add. How is this topic appropriate to gcc mail list? -- Tim Prince
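To illustrate the __restrict point: once the element access is expressed through raw pointers rather than the poster's arr_ref<> template, C99 restrict (or GNU __restrict in C++) can promise the compiler that the three arrays do not alias, which is the main obstacle to vectorizing the quoted dlasr loop. This is a sketch with hypothetical names and an assumed column-major, 0-based layout, not the poster's actual class:

```c
#include <stddef.h>

/* Apply the rotation loop quoted above to an m x n column-major matrix a,
   with rotation coefficients c[0..n-2], s[0..n-2].  restrict asserts that
   a, c and s do not overlap, removing the aliasing concern. */
static void apply_rotations(int m, int n,
                            const double *restrict c,
                            const double *restrict s,
                            double *restrict a)
{
    for (int j = n - 2; j >= 0; --j) {      /* mirrors FEM_DOSTEP(j, n-1, 1, -1) */
        double ctemp = c[j], stemp = s[j];
        if (ctemp != 1.0 || stemp != 0.0) {
            for (int i = 0; i < m; ++i) {
                double temp = a[i + (j + 1) * m];
                a[i + (j + 1) * m] = ctemp * temp - stemp * a[i + j * m];
                a[i + j * m]       = stemp * temp + ctemp * a[i + j * m];
            }
        }
    }
}
```

With the template-wrapped accessor, the equivalent is harder: the restrict qualification would have to reach the elems_ member itself, which is one reason hand optimization is suggested in the reply.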
Re: End of GCC 4.6 Stage 1: October 27, 2010
On 9/6/2010 9:21 AM, Richard Guenther wrote: On Mon, Sep 6, 2010 at 6:19 PM, NightStrike wrote: On Mon, Sep 6, 2010 at 5:21 AM, Richard Guenther wrote: On Mon, 6 Sep 2010, Tobias Burnus wrote: Gerald Pfeifer wrote: Do you have a pointer to testresults you'd like us to use for reference? From our release criteria, for secondary platforms we have: • The compiler bootstraps successfully, and the C++ runtime library builds. • The DejaGNU testsuite has been run, and a substantial majority of the tests pass. See for instance: http://gcc.gnu.org/ml/gcc-testresults/2010-09/msg00295.html There are no libstdc++ results in that. Richard. This is true. I always run make check-gcc. What should I be doing instead? make -k check make check-c++ runs both g++ and libstdc++-v3 testsuites. -- Tim Prince
Re: Turn on -funroll-loops at -O3?
On 1/21/2011 10:43 AM, H.J. Lu wrote: Hi, Since -O3 turns on the vectorizer, should it also turn on -funroll-loops? Only if a conservative default value for max-unroll-times is set, e.g. 2 <= value <= 4. -- Tim Prince
Re: Why doesn't the vectorizer skip loop peeling/versioning for targets that support hardware misaligned access?
On 1/24/2011 5:21 AM, Bingfeng Mei wrote: Hello, Some of our target processors support complete hardware misaligned memory access. I implemented movmisalignm patterns, and found TARGET_SUPPORT_VECTOR_MISALIGNMENT (TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT On 4.6) hook is based on checking these patterns. Somehow this hook doesn't seem to be used. vect_enhance_data_refs_alignment is called regardless whether the target has HW misaligned support or not. Shouldn't using HW misaligned memory access be better than generating extra code for loop peeling/versioning? Or at least if for some architectures it is not the case, we should have a compiler hook to choose between them. BTW, I mainly work on 4.5, maybe 4.6 has changed. Thanks, Bingfeng Mei Peeling for alignment still presents a performance advantage on longer loops for the most common current CPUs. Skipping the peeling is likely to be advantageous for short loops. I've noticed that 4.6 can vectorize loops with multiple assignments, presumably taking advantage of misalignment support. There's even a better performing choice of instructions for -march=corei7 misaligned access than is taken by other compilers, but that could be an accident. At this point, I'd like to congratulate the developers for the progress already evident in 4.6. -- Tim Prince
Re: Are 8-byte ints guaranteed?
Thomas Koenig wrote: Hello world, are there any platforms where gcc doesn't support 8-byte ints? Can a front end depend on this? This would make life easier for Fortran, for example, because we could use INTEGER(KIND=8) for a lot of interfaces without having to bother with checks for the presence of KIND=8 integers. No doubt, there are such platforms, although I doubt there is sufficient interest in running gfortran on them. Support for 64-bit integers on common 32-bit platforms is rather inefficient, when it is done by pairs of 32-bit integers.
Re: g77 problem for octave
[EMAIL PROTECTED] wrote: Dear Sir/Madame, I have switched my OS to SuSE Linux 10.1 and for a while trying to install "Octave" to my computer. Unfortunately, the error message below is the only thing that i got. Installing octave-2.1.64-3.i586[Local packages] There are no installable providers of gcc-g77 for octave-2.1.64-3.i586[Local packages] On my computer, the installed version of gcc is 4.1.0-25 and i could not find any compatible version of g77 to install. For the installation of octave, i need exactly gcc-g77 not gcc-fortran. Can you please help me to deal with this problem? If you are so interested in using g77 rather than gfortran, it should be easy enough to grab gcc-3.4.x sources and build g77. One would wonder why you dislike gfortran so much.
Re: Modifying the LABEL for functions emitted by the GCC Compiler
Rohit Arul Raj wrote: The gcc-coldfire compiler spits out the labels as they are in the assembly file (main, printf etc), whereas the IDE compiler spits out the labels prefixed with a '_' (_main, _printf etc). Is there any way I can make the gcc-coldfire compiler emit the labels prefixed with an underscore ('_')? Can anyone help me OUT of this mess!!! How about reconciling the -fleading-underscore options?
Re: BFD Error a regression?
Jerry DeLisle wrote: BFD: BFD 2.16.91.0.6 20060212 internal error, aborting at ../../bfd/elfcode.h line 190 in bfd_elf32_swap_symbol_in BFD: Please report this bug. make[1]: *** [complex16] Error 1 make[1]: *** Waiting for unfinished jobs BFD: BFD 2.16.91.0.6 20060212 internal error, aborting at ../../bfd/elfcode.h line 190 in bfd_elf32_swap_symbol_in BFD: Please report this bug. BFD is acknowledging that it may be buggy. Does this occur with current binutils, e.g. from ftp.kernel.org? Are you able to build g++ and libstdc++ without hitting this or similar bug? Buggy binutils is a chronic problem with RHEL, and is generally not fixed without 6 months effort by an OEM with more influence than my employer. If you hit it with a small test case, surely it will be hit with real applications sooner or later.
Re: Calculating cosinus/sinus
On 05/11/2013 11:25 AM, Robert Dewar wrote: On 5/11/2013 11:20 AM, jacob navia wrote: OK I did a similar thing. I just compiled sin(argc) in main. The results prove that you were right. The single fsin instruction takes longer than several HUNDRED instructions (calls, jumps, table lookup, what have you) Gone are the times when an fsin would take 30 cycles or so. Intel has destroyed the FPU. That's an unwarranted claim, but indeed the algorithm used within the FPU is inferior to the one in the library. Not so surprising: the one in the chip is old, and we have made good advances in learning how to calculate things accurately. Also, the library is using the fast new 64-bit arithmetic. So none of this is (or should be) surprising. In the benchmark code all that code/data is in the L1 cache. In real life code you use the sin routine sometimes, and the probability of it not being in the L1 cache is much higher; I would say almost one if you do not do sin/cos VERY often. But of course you don't really care about performance so much unless you *are* using it very often. I would be surprised if there are any real programs in which using the FPU instruction is faster. Possible, if long double precision is needed, within the range where fsin can deliver it. I take it the use of a vector sin library is excluded (not available for long double). And as noted earlier in the thread, the library algorithm is more accurate than the Intel algorithm, which is also not at all surprising. Reduction for a range well outside the basic 4 quadrants should be better in the library (note that fsin gives up for |x| > 2^64), but a double library function can hardly be claimed to be generally more accurate than a long double built-in. For the time being I will go on generating the fsin code. I will try to optimize Moshier's SIN function later on. Well, I will be surprised if you can find significant optimizations to that very clever routine. Certainly you have to be a floating-point expert to even touch it! 
Robert Dewar -- Tim Prince
Re: Calculating cosinus/sinus
On 5/12/2013 9:53 AM, Ondřej Bílka wrote: On Sun, May 12, 2013 at 02:14:31PM +0200, David Brown wrote: On 11/05/13 17:20, jacob navia wrote: Le 11/05/13 16:01, Ondřej Bílka a écrit : As 1) the only way is to measure that. Compile the following and we will see who is right. cat " #include <math.h> int main(){ int i; double x=0; double ret=0; double f; for(i=0;i<1000;i++){ ret+=sin(x); x+=0.3; } return ret; } " > sin.c OK I did a similar thing. I just compiled sin(argc) in main. The results prove that you were right. The single fsin instruction takes longer than several HUNDRED instructions (calls, jumps, table lookup, what have you) Gone are the times when an fsin would take 30 cycles or so. Intel has destroyed the FPU. What makes you so sure that it takes more than 30 cycles to execute hundreds of instructions in the library? Modern cpus often do several instructions per cycle (I am not considering multiple cores here). They can issue several instructions per cycle, and predicted jumps can often be eliminated entirely in the decode stages. To clarify the numbers here: a 30 cycle library call is unrealistic; just the latency caused by the call and the saving/restoring of xmm registers is often more than 30 cycles. A sin takes around 150 cycles for normal inputs. An fsin is slower for several reasons. One is that performance depends on input. From http://www.agner.org/optimize/instruction_tables.pdf (an interesting historical reference) fsin takes about 20-100 cycles. Those tables show up to 210 cycles for some highly reputed CPU models of various brands. This doesn't count the next issue: the second problem is that the xmm->memory->fpu->memory->xmm roundtrip is expensive. There is a performance penalty when switching between fpu and xmm instructions. Which would be a reason for fsin appearing in mathinline.h for i386 but not for the x86_64 implementations of glibc. Yes, it's popular to malign gcc developers or Intel even where it is out of their hands. 
The moral here is that /you/ need to benchmark /your/ code on /your/ processor - don't jump to conclusions, or accept other benchmarks as giving the complete picture. Agreed. -- Tim Prince
Re: RFC: SIMD pragma independent of Cilk Plus / OpenMPv4
On 9/9/2013 9:37 AM, Tobias Burnus wrote: Dear all, sometimes it can be useful to annotate loops for better vectorization, which is rather independent of parallelization. For vectorization, GCC has [0]: a) Cilk Plus's #pragma simd [1] b) OpenMP 4.0's #pragma omp simd [2] Those require -fcilkplus and -fopenmp, respectively, and activate much more. The question is whether it makes sense to provide a means to ask the compiler for SIMD vectorization without enabling all the other things of Cilk Plus/OpenMP. What's your opinion? [If one provides it, the question is whether it is always on or not, which syntax/semantics it uses [e.g. just the one of Cilk or OpenMP] and what to do with conflicting pragmas which can occur in this case.] Side remark: For vectorization, the widely supported #pragma ivdep, vector, novector can be also useful, even if they are less formally defined. "ivdep" seems to be one of the more useful ones, whose semantics one can map to a safelen of infinity in OpenMP's semantics [i.e. loop->safelen = INT_MAX]. Tobias [0] In the trunk there is currently only some initial middle-end support. OpenMP's omp simd is in the gomp-4_0-branch; Cilk Plus's simd has been submitted for the trunk at http://gcc.gnu.org/ml/gcc-patches/2013-08/msg01626.html [1] http://www.cilkplus.org/download#open-specification [2] http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf ifort/icc have a separate option -openmp-simd for the purpose of activating omp simd directives without invoking OpenMP. In the previous release, in order to activate both OpenMP parallel and omp simd, both options were required (-openmp -openmp-simd). In the new "SP1" release last week, -openmp implies -openmp-simd. Last time I checked, turning off the options did not cause the compiler to accept but ignore all omp simd directives, as I personally thought would be desirable. A few cases are active regardless of compile line option, but many will be rejected without matching options. 
Current Intel implementations of safelen will fail to vectorize and give notice if the value is set unnecessarily large. It's been agreed that increasing the safelen value beyond the optimum level should not turn off vectorization. safelen(32) is optimum for several float/single precision cases in the Intel(r) Xeon Phi(tm) cross compiler; needless to say, safelen(8) is sufficient for 128-bit SSE2. I pulled down an update of gcc gomp-4_0-branch yesterday and see in the not-yet-working additions to gcc testsuite there appears to be a move toward adding more cilkplus clauses to omp simd, such as firstprivate lastprivate (which are accepted but apparently ignored in the Intel omp simd implementation). I'll be discussing in a meeting later today my effort to publish material including discussion of OpenMP 4.0 implementations. -- Tim Prince
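For concreteness, the OpenMP 4.0 loop-level annotation with safelen discussed above looks like this (a sketch; the function name and safelen(8) bound are illustrative, not from the thread):

```c
#include <stddef.h>

/* An OpenMP 4.0 simd loop.  Build with -fopenmp (gcc) to honor the
   pragma; without such an option the unknown pragma is simply ignored.
   safelen(8) asserts that iterations at distance < 8 are independent. */
void saxpy_simd(size_t n, float a, const float *x, float *y)
{
    #pragma omp simd safelen(8)
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

As the message notes, the optimum safelen depends on the vector width: 8 floats fills an AVX-256 register, while safelen(32) was found optimum for some single-precision cases on wider hardware.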
Re: Vectorization: Loop peeling with misaligned support.
On 11/15/2013 2:26 PM, Ondřej Bílka wrote: On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote: Also keep in mind that usually costs go up significantly if misalignment causes cache line splits (processor will fetch 2 lines). There are non-linear costs of filling up the store queue in modern out-of-order processors (x86). Bottom line is that it's much better to peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache line boundaries otherwise. The solution is to either actually always peel for alignment, or insert an additional check for cache line boundaries (for high trip count loops). That is quite bold claim do you have a benchmark to support that? Since nehalem there is no overhead of unaligned sse loads except of fetching cache lines. As haswell avx2 loads behave in similar way. Where gcc or gfortran choose to split sse2 or sse4 loads, I found a marked advantage in that choice on my Westmere (which I seldom power on nowadays). You are correct that this finding is in disagreement with Intel documentation, and it has the effect that Intel option -xHost is not the optimum one. I suspect the Westmere was less well performing than Nehalem on unaligned loads. Another poorly documented feature of Nehalem and Westmere was a preference for 32-byte aligned data, more so than Sandy Bridge. Intel documentation encourages use of unaligned AVX-256 loads on Ivy Bridge and Haswell, but Intel compilers don't implement them (except for intrinsics) until AVX2. Still, on my own Haswell tests, the splitting of unaligned loads by use of AVX compile option comes out ahead. Supposedly, the preference of Windows intrinsics programmers for the relative simplicity of unaligned moves was taken into account in the more recent hardware designs, as it was disastrous for Sandy Bridge. I have only remote access to Haswell although I plan to buy a laptop soon. I'm skeptical about whether useful findings on these points may be obtained on a Windows laptop. 
In case you didn't notice it, Intel compilers introduced #pragma vector unaligned as a means to specify handling of unaligned access without peeling. I guess it is expected to be useful on Ivy Bridge or Haswell for cases where the loop count is moderate but expected to match unrolled AVX-256, or if the case where peeling can improve alignment is rare. In addition, Intel compilers learned from gcc the trick of using AVX-128 for situations where frequent unaligned accesses are expected and peeling is clearly undesirable. The new facility for vectorizing OpenMP parallel loops (e.g. #pragma omp parallel for simd) uses AVX-128, consistent with the fact that OpenMP chunks are more frequently unaligned. In fact, parallel for simd seems to perform nearly the same with gcc-4.9 as with icc. Many decisions on compiler defaults still are based on an unscientific choice of benchmarks, with gcc evidently more responsive to input from the community. -- Tim Prince
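On the gcc side, one documented way to hand the vectorizer the alignment knowledge that makes a peeling prologue unnecessary is the __builtin_assume_aligned builtin. A sketch (function name is mine; the 32-byte figure assumes AVX-sized vectors):

```c
#include <stddef.h>

/* Promise gcc that p is 32-byte aligned, so the vectorizer can emit
   aligned vector loads/stores without peeling or a runtime check.
   Passing a pointer that is not actually aligned is undefined behavior. */
void scale32(double *p, size_t n, double f)
{
    double *ap = __builtin_assume_aligned(p, 32);
    for (size_t i = 0; i < n; ++i)
        ap[i] *= f;
}
```

This is roughly the gcc counterpart of the Intel #pragma vector aligned family: the programmer supplies the alignment guarantee and the compiler skips the peel/versioning machinery.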
Re: How to generate AVX512 instructions now (just to look at them).
On 1/3/2014 11:04 AM, Toon Moene wrote: I am trying to figure out how the top-consuming routines in our weather models will be compiled when using AVX512 instructions (and their 32 512-bit registers). I thought an up-to-date trunk version of gcc, using the command line:

<...>/gfortran -Ofast -S -mavx2 -mavx512f

would do that. Unfortunately, I do not see any use of the new zmm registers, which might mean that AVX512 isn't used yet. This is how the nightly build job builds the trunk gfortran compiler:

configure --prefix=/home/toon/compilers/install --with-gnu-as --with-gnu-ld --enable-languages=fortran<,other-language> --disable-multilib --disable-nls --with-arch=core-avx2 --with-tune=core-avx2

gfortran -O3 -funroll-loops --param max-unroll-times=2 -ffast-math -mavx512f -fopenmp -S

is giving me extremely limited zmm register usage in my build of gfortran trunk. It appears to be using zmm only to enable use of vpternlogd instructions. Immediately following the first such usage, it fails to vectorize a dot_product with stride-1 operands. There are still AVX2 scalar instructions and AVX-256 vectorized loops, but none with reduction or fma. For gcc, I have to add -march=native in order for it to accept fma intrinsics (even though that one is expanded to AVX without fma). Sorry, my only AVX2 CPU is a Windows 8.1 installation (!).

Target: x86_64-unknown-cygwin
Configured with: ../configure --prefix=/usr/local/gcc4.9/ --enable-languages='c c++ fortran' --enable-libgomp --enable-threads=posix --disable-libmudflap --disable-__cxa_atexit --with-dwarf2 --without-libiconv-prefix --without-libintl-prefix --with-system-zlib
-- Tim Prince
Re: How to generate AVX512 instructions now (just to look at them).
On 1/3/2014 2:58 PM, Toon Moene wrote: On 01/03/2014 07:04 PM, Jakub Jelinek wrote: On Fri, Jan 03, 2014 at 05:04:55PM +0100, Toon Moene wrote: I am trying to figure out how the top-consuming routines in our weather models will be compiled when using AVX512 instructions (and their 32 512 bit registers).

what I'm interested in, is (cat verintlin.f):

      SUBROUTINE VERINT (
     I   KLON  , KLAT  , KLEV  , KINT , KHALO
     I , KLON1 , KLON2 , KLAT1 , KLAT2
     I , KP    , KQ    , KR
     R , PARG  , PRES
     R , PALFH , PBETH
     R , PALFA , PBETA , PGAMA )
C
C***
C
C     VERINT - THREE DIMENSIONAL INTERPOLATION
C
C     PURPOSE:
C
C     THREE DIMENSIONAL INTERPOLATION
C
C     INPUT PARAMETERS:
C
C     KLON    NUMBER OF GRIDPOINTS IN X-DIRECTION
C     KLAT    NUMBER OF GRIDPOINTS IN Y-DIRECTION
C     KLEV    NUMBER OF VERTICAL LEVELS
C     KINT    TYPE OF INTERPOLATION
C             = 1 - LINEAR
C             = 2 - QUADRATIC
C             = 3 - CUBIC
C             = 4 - MIXED CUBIC/LINEAR
C     KLON1   FIRST GRIDPOINT IN X-DIRECTION
C     KLON2   LAST GRIDPOINT IN X-DIRECTION
C     KLAT1   FIRST GRIDPOINT IN Y-DIRECTION
C     KLAT2   LAST GRIDPOINT IN Y-DIRECTION
C     KP      ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C     KQ      ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C     KR      ARRAY OF INDEXES FOR VERTICAL DISPLACEMENTS
C     PARG    ARRAY OF ARGUMENTS
C     PALFH   ALFA HAT
C     PBETH   BETA HAT
C     PALFA   ARRAY OF WEIGHTS IN X-DIRECTION
C     PBETA   ARRAY OF WEIGHTS IN Y-DIRECTION
C     PGAMA   ARRAY OF WEIGHTS IN VERTICAL DIRECTION
C
C     OUTPUT PARAMETERS:
C
C     PRES    INTERPOLATED FIELD
C
C     HISTORY:
C
C     J.E. HAUGEN 1 1992
C
C***
C
      IMPLICIT NONE
C
      INTEGER KLON  , KLAT  , KLEV  , KINT , KHALO,
     I        KLON1 , KLON2 , KLAT1 , KLAT2
C
      INTEGER KP(KLON,KLAT), KQ(KLON,KLAT), KR(KLON,KLAT)
      REAL PARG(2-KHALO:KLON+KHALO-1,2-KHALO:KLAT+KHALO-1,KLEV) ,
     R     PRES(KLON,KLAT)   ,
     R     PALFH(KLON,KLAT)  , PBETH(KLON,KLAT) ,
     R     PALFA(KLON,KLAT,4), PBETA(KLON,KLAT,4),
     R     PGAMA(KLON,KLAT,4)
C
      INTEGER JX, JY, IDX, IDY, ILEV
      REAL Z1MAH, Z1MBH
C
C     LINEAR INTERPOLATION
C
      DO JY = KLAT1,KLAT2
      DO JX = KLON1,KLON2
         IDX  = KP(JX,JY)
         IDY  = KQ(JX,JY)
         ILEV = KR(JX,JY)
C
         PRES(JX,JY) = PGAMA(JX,JY,1)*(
C
     +     PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1)
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV-1) )
     +   + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV-1)
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV-1) ) )
C+   +   + PGAMA(JX,JY,2)*(
C+   +     PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV )
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV ) )
     +   + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV )
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV ) ) )
      ENDDO
      ENDDO
C
      RETURN
      END

i.e., real Fortran code, not just intrinsics :-)

Right out of the AVX512 architect's dream. It appears to need 24 AVX-512 registers in the ifort compilation (/arch:MIC-AVX512) to avoid those spills and repeated memory operands in the gfortran avx2 compilation. How small a ratio of floating point to total instructions can you call "real Fortran"? -- Tim Prince
Re: -O3 and -ftree-vectorize
On 2/6/2014 1:51 PM, Uros Bizjak wrote: Hello! 4.9 does not enable -ftree-vectorize for -O3 (and Ofast) anymore. Is this intentional? $/ssd/uros/gcc-build/gcc/xgcc -B /ssd/uros/gcc-build/gcc -O3 -Q --help=optimizers ... -ftree-vectorize [disabled] ... I'm seeing vectorization but no output from -ftree-vectorizer-verbose, and no dot product vectorization inside omp parallel regions, with gcc g++ or gfortran 4.9. Primary targets are cygwin64 and linux x86_64. I've been unable to use -O3 vectorization with gcc, although it works with gfortran and g++, so use gcc -O2 -ftree-vectorize together with additional optimization flags which don't break. I've made source code changes to take advantage of the new vectorization with merge() and ? operators; while it's useful for -march=core-avx2, it's sometimes a loss for -msse4.1. gcc vectorization with #pragma omp parallel for simd is reasonably effective in my tests only on 12 or more cores. #pragma omp simd reduction(max: ) is giving correct results but poor performance in my tests. You've probably seen my gcc testresults posts. The one major recent improvement is the ability to skip cilkplus tests on targets where it's totally unsupported. Without cilk_for et al. even on "supported" targets cilkplus seems useless. There are still lots of failing stabs tests on targets where those apparently aren't supported. So there are some mysteries about what the developers intend. I suppose this was posted on gcc list on account of such questions being ignored on gcc-help. -- Tim Prince
Re: -O3 and -ftree-vectorize
On 02/07/2014 10:22 AM, Jakub Jelinek wrote: On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote: I'm seeing vectorization but no output from -ftree-vectorizer-verbose, and no dot product vectorization inside omp parallel regions, with gcc g++ or gfortran 4.9. Primary targets are cygwin64 and linux x86_64. I've been unable to use -O3 vectorization with gcc, although it works with gfortran and g++, so use gcc -O2 -ftree-vectorize together with additional optimization flags which don't break. Can you file a GCC bugzilla PR with minimal testcases for this (or point us at already filed bugreports)? The question of problems with gcc -O3 (called from gfortran) have eluded me as to finding a minimal test case. When I run under debug, it appears that somewhere prior to the crash some gfortran code is over-written with data by the gcc code, overwhelming my debugging skill. I can get full performance with -O2 plus a bunch of intermediate flags. As to non-vectorization of dot product in omp parallel region, -fopt-info (which I didn't know about) is reporting vectorization, but there are no parallel simd instructions in the generated code for the omp_fn. I'll file a PR on that if it's still reproduced in a minimal case. I've made source code changes to take advantage of the new vectorization with merge() and ? operators; while it's useful for -march=core-avx2, it's sometimes a loss for -msse4.1. gcc vectorization with #pragma omp parallel for simd is reasonably effective in my tests only on 12 or more cores. Likewise. Those are cases of 2 levels of loops from netlib "vector" benchmark where only one level is vectorizable and parallelizable. By putting the vectorizable loop on the outside the parallelization scales to a large number of cores. I don't expect it to out-perform single thread optimized avx vectorization until 8 or more cores are in use, but it needs more than expected number of threads even relative to SSE vectorization. 
#pragma omp simd reduction(max: ) is giving correct results but poor performance in my tests. Likewise. I'll file a PR on this, didn't know if there might be interest. I have an Intel compiler issue "closed, will not be fixed" so the simd reduction(max: ) isn't viable for icc in the near term. Thanks,
Re: -O3 and -ftree-vectorize
On 2/7/2014 11:09 AM, Tim Prince wrote: On 02/07/2014 10:22 AM, Jakub Jelinek wrote: On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote: I'm seeing vectorization but no output from -ftree-vectorizer-verbose, and no dot product vectorization inside omp parallel regions, with gcc g++ or gfortran 4.9. Primary targets are cygwin64 and linux x86_64. I've been unable to use -O3 vectorization with gcc, although it works with gfortran and g++, so use gcc -O2 -ftree-vectorize together with additional optimization flags which don't break. Can you file a GCC bugzilla PR with minimal testcases for this (or point us at already filed bugreports)? The question of problems with gcc -O3 (called from gfortran) have eluded me as to finding a minimal test case. When I run under debug, it appears that somewhere prior to the crash some gfortran code is over-written with data by the gcc code, overwhelming my debugging skill. I can get full performance with -O2 plus a bunch of intermediate flags. As to non-vectorization of dot product in omp parallel region, -fopt-info (which I didn't know about) is reporting vectorization, but there are no parallel simd instructions in the generated code for the omp_fn. I'll file a PR on that if it's still reproduced in a minimal case. I've made source code changes to take advantage of the new vectorization with merge() and ? operators; while it's useful for -march=core-avx2, it's sometimes a loss for -msse4.1. gcc vectorization with #pragma omp parallel for simd is reasonably effective in my tests only on 12 or more cores. Likewise. Those are cases of 2 levels of loops from netlib "vector" benchmark where only one level is vectorizable and parallelizable. By putting the vectorizable loop on the outside the parallelization scales to a large number of cores. 
I don't expect it to out-perform single thread optimized avx vectorization until 8 or more cores are in use, but it needs more than expected number of threads even relative to SSE vectorization. #pragma omp simd reduction(max: ) is giving correct results but poor performance in my tests. Likewise. I'll file a PR on this, didn't know if there might be interest. I have an Intel compiler issue "closed, will not be fixed" so the simd reduction(max: ) isn't viable for icc in the near term. Thanks, With further investigation, my case with reverse_copy outside and inner_product inside an omp parallel region is working very well with -O3 -ffast-math for double data type. There seems a possible performance problem with reverse_copy for float data type, so much so that gfortran does better with the loop reversal pushed down into the parallel dot_products. I have seen at least 2 cases where the new gcc vectorization of stride -1 with vpermd is superior to other compilers, even for float data type. For the cases where omp parallel for simd is set in expectation of gaining outer loop parallel simd, gcc is ignoring the simd clause. So it is understandable that a large number of cores is needed to overcome the lack of parallel simd (other than by simd intrinsics coding). I'll choose an example of omp simd reduction(max: ) for a PR. Thanks. -- Tim Prince
Re: Vectorizer Pragmas
On 2/15/2014 3:36 PM, Renato Golin wrote: On 15 February 2014 19:26, Jakub Jelinek wrote: GCC supports #pragma GCC ivdep/#pragma simd/#pragma omp simd, the last one can be used without rest of OpenMP by using -fopenmp-simd switch. Does the simd/omp have control over the tree vectorizer? Or are they just flags for the omp implementation? I don't see why we would need more ways to do the same thing. Me neither! That's what I'm trying to avoid. Do you guys use those pragmas for everything related to the vectorizer? I found that the Intel pragmas (not just simd and omp) are pretty good fit to most of our needed functionality. Does GCC use Intel pragmas to control the vectorizer? Would be good to know how you guys did it, so that we can follow the same pattern. Can GCC vectorize lexical blocks as well? Or just loops? IF those pragmas can't be used in lexical blocks, would it be desired to extend that in GCC? The Intel guys are pretty happy implementing simd, omp, etc. in LLVM, and I think if the lexical block problem is common, they may even be open to extending the semantics? cheers, --renato gcc ignores the Intel pragmas, other than the OpenMP 4.0 ones. I think Jakub may have his hands full trying to implement the OpenMP 4 pragmas, plus GCC ivdep, and gfortran equivalents. It's tough enough distinguishing between Intel's partial implementation of OpenMP 4 and the way it ought to be done. In my experience, the (somewhat complicated) gcc --param options work sufficiently well for specification of unrolling. In the same vein, I haven't seen any cases where gcc 4.9 is excessively aggressive in vectorization, so that a #pragma novector plus scalar unroll is needed, as it is with Intel compilers. I'm assuming that Intel involvement with llvm is aimed toward making it look like Intel's own compilers; before I retired, I heard a comment which indicated a realization that the idea of pushing llvm over gnu had been over-emphasized. 
My experience with this is limited; my Intel Android phone broke before I got too involved with their llvm Android compiler, which had some bad effects on both gcc and Intel software usage for normal Windows purposes. I've never seen a compiler where pragmas could be used to turn on auto-vectorization when compile options were set to disable it. The closest to that is the Intel(r) Cilk(tm) Plus where CEAN notation implies turning on many aggressive optimizations, such that full performance can be achieved without problematical -O3. If your idea is to obtain selective effective auto-vectorization in source code which is sufficiently broken that -O2 -ftree-vectorize can't be considered or -fno-strict-aliasing has to be set, I'm not about to second such a motion. -- Tim Prince
Re: Vectorizer Pragmas
On 2/16/2014 2:05 PM, Renato Golin wrote: On 16 February 2014 17:23, Tobias Burnus wrote: Compiler vendors (and users) have different ideas whether the SIMD pragmas should give the compiler only a hint or completely override the compiler's heuristics. In case of the Intel compiler, the user rules; in case of GCC, it only influences the heuristics unless one passes explicitly -fsimd-cost-model=unlimited (cf. also -Wopenmp-simd). Yes, Intel's idea for simd directives is to vectorize without applying either cost models or concern about exceptions. I tried -fsimd-cost-model-unlimited on my tests; it made no difference. As a user, I found Intel's pragmas interesting, but at the end regarded OpenMP's SIMD directives/pragmas as sufficient. That was the kind of user experience that I was looking for, thanks! The alignment options for OpenMP 4 are limited, but OpenMP 4 also seems to prevent loop fusion, where alignment assertions may be more critical. In addition, Intel uses the older directives, which some marketer decided should be called Cilk(tm) Plus even when used in Fortran, to control whether streaming stores may be chosen in some situations. I think gcc supports those only by explicit intrinsics. I don't think many people want to use both OpenMP 4 and older Intel directives together. Several of these directives are still in an embryonic stage in both Intel and gnu compilers. -- Tim Prince
Re: Vectorizer Pragmas
On 2/17/2014 4:42 AM, Renato Golin wrote: On 16 February 2014 23:44, Tim Prince wrote: I don't think many people want to use both OpenMP 4 and older Intel directives together. I'm having less and less incentives to use anything other than omp4, cilk and whatever. I think we should be able to map all our internal needs to those pragmas. On the other hand, if you guys have any cross discussion with Intel folks about it, I'd love to hear. Since our support for those directives are a bit behind, would be good not to duplicate the efforts in the long run. I'm continuing discussions with former Intel colleagues. If you are asking for insight into how Intel priorities vary over time, I don't expect much, unless the next beta compiler provides some inferences. They have talked about implementing all of OpenMP 4.0 except user defined reduction this year. That would imply more activity in that area than on cilkplus, although some fixes have come in the latter. On the other hand I had an issue on omp simd reduction(max: ) closed with the decision "will not be fixed." I have an icc problem report in on fixing omp simd safelen so it is more like the standard and less like the obsolete pragma simd vectorlength. Also, I have some problem reports active attempting to get clarification of their omp target implementation. You may have noticed that omp parallel for simd in current Intel compilers can be used for combined thread and simd parallelism, including the case where the outer loop is parallelizable and vectorizable but the inner one is not. -- Tim Prince
Re: Shouldn't unsafe-math-optimizations (re-)enable fp-contract=fast?
On 3/6/2014 1:01 PM, Joseph S. Myers wrote: On Thu, 6 Mar 2014, Ian Bolton wrote: Hi there, I see in common.opt that fp-contract=fast is the default for GCC. But then it gets disabled in c-family/c-opts.c if you are using ISO C (e.g. with -std=c99). But surely if you have also specified -funsafe-math-optimizations then it should flip it back onto fast?

That seems reasonable.

I do see an improvement in several benchmarks by use of fma when I append -ffp-contract=fast after -std=c99. Thanks. -- Tim Prince
Re: weird optimization in sin+cos, x86 backend
On 2/9/2012 5:55 AM, Richard Guenther wrote: On Thu, Feb 9, 2012 at 11:35 AM, Andrew Haley wrote: On 02/09/2012 10:20 AM, James Courtier-Dutton wrote: From what I can see, on x86_64, the hardware fsin(x) is more accurate than the hardware fsincos(x). As you gradually increase the size of X from 0 to 10e22, fsincos(x) diverges from the correct accurate value quicker than fsin(x) does. So, from this I would say that using fsincos instead of fsin is not a good idea, at least on x86_64 platforms. That's true iff you're using the hardware builtins, which we're not on GNU/Linux unless you're using -ffast-math. If you're using -ffast-math, the fsincos optimization is appropriate anyway because you want fast. If you're not using -ffast-math it's still appropriate, because we're using an accurate libm. The point of course is that glibc happily uses fsin/fsincos (which isn't even fast compared to a decent implementation using SSE math). x87 built-ins should be a fair compromise between speed, code size, and accuracy, for long double, on most CPUs. As Richard says, it's certainly possible to do better in the context of SSE, but gcc doesn't know anything about the quality of math libraries present; it doesn't even take into account whether it's glibc or something else. -- Tim Prince
Re: weird optimization in sin+cos, x86 backend
On 02/14/2012 04:51 AM, Andrew Haley wrote: On 02/13/2012 08:00 PM, Geert Bosch wrote: GNU Linux is quite good, but has issues with the "pow" function for large exponents, even in current versions Really? Even on 64-bit? I know this is a problem for the 32-bit legacy architecture, but I thought the 64-bit pow() was OK. Andrew. No problems seen under elefunt with glibc 2.12 x86_64. -- Tim Prince
Re: weird optimization in sin+cos, x86 backend
On 02/14/2012 08:26 AM, Vincent Lefevre wrote: On 2012-02-14 09:51:28 +, Andrew Haley wrote: On 02/13/2012 08:00 PM, Geert Bosch wrote: GNU Linux is quite good, but has issues with the "pow" function for large exponents, even in current versions

Really? Even on 64-bit? I know this is a problem for the 32-bit legacy architecture, but I thought the 64-bit pow() was OK.

According to http://sourceware.org/bugzilla/show_bug.cgi?id=706 the 32-bit pow() can be completely wrong, and the 64-bit pow() is just very inaccurate.

That bugzilla brings up paranoia, but with gfortran 4.7 on glibc 2.12 I get:

TESTING X**((X+1)/(X-1)) VS. EXP(2) = 7.3890561 AS X -> 1.
ACCURACY SEEMS ADEQUATE.
TESTING POWERS Z**Q AT FOUR NEARLY EXTREME VALUES:
NO DISCREPANCIES FOUND.
NO FAILURES, DEFECTS NOR FLAWS HAVE BEEN DISCOVERED.
ROUNDING APPEARS TO CONFORM TO THE PROPOSED IEEE STANDARD P754
THE ARITHMETIC DIAGNOSED APPEARS TO BE EXCELLENT!

Historically, glibc for i386 used the raw x87 built-ins without any of the recommended precautions. Paranoia still shows, as it always did:

TESTING X**((X+1)/(X-1)) VS. EXP(2) = 7.3890561 AS X -> 1.
DEFECT: Calculated (1-0.11102230E-15)**(-0.18014399E+17) differs from correct value by -0.34413050E-08
This much error may spoil calculations such as compounded interest.

-- Tim Prince
Re: GCC: OpenMP posix pthread
On 2/19/2012 9:42 AM, erotavlas_tu...@libero.it wrote: I'm starting to use Helgrind a tool of Valgrind. I read on the manual the following statement: Runtime support library for GNU OpenMP (part of GCC), at least for GCC versions 4.2 and 4.3. The GNU OpenMP runtime library (libgomp.so) constructs its own synchronisation primitives using combinations of atomic memory instructions and the futex syscall, which causes total chaos since in Helgrind since it cannot "see" those. In the latest version of GCC, is this still true or now the OpenMP uses the standard POSIX pthread? Do you have a specific OS family in mind? -- Tim Prince
Re: Vectorizer question
On 5/16/2012 4:01 PM, Iyer, Balaji V wrote: Hello Everyone, I have a question regarding the vectorizer. In the following code below...

int func (int x, int y)
{
    if (x == y)
        return (x + y);
    else
        return (x - y);
}

If we force the x and y to be vectors of vectorlength 4, then will the if-statement get a vector of booleans or does it get 1 boolean that compares 2 very large values? I guess another way to ask is that, will it logically break it up into 4 if-statements or just 1? Any help is greatly appreciated! Thanks, Balaji V. Iyer. PS. Please CC me in response so that I can get to it quickly.

Is this about vector extensions to C, or about other languages such as C++ or Fortran? In Fortran, it's definitely an array of logical, in the case where the compiler can't optimize it away. This would more likely be written return x==y ? x+y : x-y; -- Tim Prince
Re: GCC optimization report
On 7/17/2012 7:23 AM, Richard Guenther wrote: On Tue, Jul 17, 2012 at 12:43 PM, wrote: Hi all, I would like to know if GCC provides an option to get a detailed report on the optimization actually performed by the compiler. For example with the Intel C compiler it is possible using the -opt-report. I don't want to look at the assembly file and figure out the optimization. There is only -ftree-vectorizer-verbose=N currently and the various dump-files the individual passes produce (-fdump-tree-all[-subflags] -fdump-rtl-all[-subflags]). -ftree-vectorizer-verbose is analogous to the icc -vec-report option (included in -opt-report). Among the questions not answered by -opt-report are those associated with application (or not) of -complex-limited-range (gcc -fcx-limited-range). -opt-report3 I believe turns on reporting of software prefetch application, which is important but difficult to follow. It's nearly impossible to compare icc and gcc optimization other than by examining assembly and using a profiler which shows paths taken. -- Tim Prince
Re: gfortran error: Statement order error: declaration after DATA
On 9/11/2012 5:46 PM, David N. Bradley wrote: I am trying to compile the cactuscode package and can not get past the error: Statement order error: declaration after DATA. Can you point me in the direction of a fix? I included the offending file as an attachment. Dave kb9qhd, Amateur Radio Service Technician class Licence, Grid EN43

Surely someone has pointed out that you should only need to reorder the file, placing the dimension statement ahead of the data statement, if you don't wish to adopt more modern syntax. -- Tim Prince
Re: calculation of pi
On 11/3/2012 3:32 AM, Mischa Baars wrote: /usr/include/gnu/stubs.h:7:27: fatal error: gnu/stubs-32.h: No such file or directory which also prevents me from compiling the compiler under Fedora 17. This means that I am both not able to compile programs in 32-bit mode and help you with the compiler. Normally, this means you didn't install the optional (32-bit) glibc-devel i386. -- Tim Prince
Re: RFC: [ARM] Disable peeling
On 12/11/2012 5:14 AM, Richard Earnshaw wrote: On 11/12/12 09:56, Richard Biener wrote: On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw wrote: On 11/12/12 09:45, Richard Biener wrote: On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen wrote: Jan Hubicka writes: Note that I think Core has similar characteristics - at least for string operations it fares well with unalignes accesses. Nehalem and later has very fast unaligned vector loads. There's still some penalty when they cross cache lines however. iirc the rule of thumb is to do unaligned for 128 bit vectors, but avoid it for 256bit vectors because the cache line cross penalty is larger on Sandy Bridge and more likely with the larger vectors. Yes, I think the rule was that using the unaligned instruction variants carries no penalty when the actual access is aligned but that aligned accesses are still faster than unaligned accesses. Thus peeling for alignment _is_ a win. I also seem to remember that the story for unaligned stores vs. unaligned loads is usually different. Yes, it's generally the case that unaligned loads are slightly more expensive than unaligned stores, since the stores can often merge in a store buffer with little or no penalty. It was the other way around on AMD CPUs AFAIK - unaligned stores forced flushes of the store buffers. Which is why the vectorizer first and foremost tries to align stores. In which case, which to align should be a question that the ME asks the BE. R. I see that this thread is no longer about ARM. Yes, when peeling for alignment, aligned stores should take precedence over aligned loads. "ivy bridge" corei7-3 is supposed to have corrected the situation on "sandy bridge" corei7-2 where unaligned 256-bit load is more expensive than explicitly split (128-bit) loads. There aren't yet any production multi-socket corei7-3 platforms. 
It seems difficult to make the best decision between 128-bit unaligned access without peeling and 256-bit access with peeling for alignment (unless the loop count is known to be too small for the latter to come up to speed). Facilities afforded by various compilers to allow the programmer to guide this choice are rather strange and probably not to be counted on. In my experience, "westmere" unaligned 128-bit loads are more expensive than explicitly split (64-bit) loads, but the architecture manuals disagree with this finding. gcc already does a good job for corei7[-1] in such situations. -- Tim Prince
Re: not-a-number's
On 1/16/2013 5:00 AM, Andrew Haley wrote: On 01/16/2013 11:54 AM, Mischa Baars wrote: Here's what Standard C, F.8.3 Relational operators, says:

x != x → false    The statement x != x is true if x is a NaN.
x == x → true     The statement x == x is false if x is a NaN.

And indeed apparently the answer then is '2'. However, I don't think this is correct. If that means that there is an error in the C specification, then there probably is an error in the specification.

Right. So we are agreed that GCC does what the specification of the C programming language says it must do. Any argument that you have must, therefore, be with the technical committee of ISO C, not with us. Andrew.

There exist compilers which have options to ignore the possibility of NaN and replace x == x by 1 and x != x by 0 at compile time. gcc is undoubtedly correct in not making such replacements by default, since they violate the C specification. -- Tim Prince
Re: Floating Point subnormal numbers under C99 with GCC 4.7
On 1/27/2013 6:02 PM, Argentinator Rincón Matemático wrote: Hi, dear friends. I am testing floating-point macros in C language, under the standard C99. My compiler is GCC 4.6.1 (with 4.7.1, I have the same result). I have two computers: my system (1) is Windows XP SP2 32bit, on an "Intel (R) Celeron (R) 420" @ 1.60 GHz; my system (2) is Windows 7 Ultimate SP1 64bit, on an "AMD Turion II X2 dual-core mobile M520 ( 2,3 ghz 1MB L2 Cache )". (The result was the same in both systems.) I am interested in testing subnormal numbers for the types float, double and long double. I've tried the following line:

printf(" Float: %x\n Double: %x\n Long Double: %x\n", fpclassify(FLT_MIN / 4.F), fpclassify(DBL_MIN / 4.), fpclassify(LDBL_MIN / 4.L));

I've compiled with the options -std=c99 and -pedantic (also without -pedantic). Compilation goes well; however, the program shows me this:

Float: 400
Double: 400
Long Double: 4400

(0x400 == FP_NORMAL, 0x4400 == FP_SUBNORMAL)

I think that the right result must be 0x4400 in all cases. When I tested the constant sizes, I found they are of the right type. For example, I obtained:

sizeof(float) == 4
sizeof(double) == 8
sizeof(long double) == 12

Also:

sizeof(FLT_MIN / 4.F) == 4
sizeof(DBL_MIN / 4.) == 8
sizeof(LDBL_MIN / 4.L) == 12

This means that FLT_MIN / 4.F can only be a float, and so on. Moreover, FLT_MIN / 4.F must be a subnormal float number. However, it seems like the fpclassify() macro behaves as if any argument were a long double number. Just in case, I recompiled the program with the constants written by hand:

printf(" Float: %x\n", fpclassify(0x1p-128F));

The result was the same. Am I misunderstanding the C99 rules? Or does the fpclassify() macro have a bug in the GCC compiler? (In the same way, the isnormal() macro "returns" 1 for float and double, but 0 for long double.)
I quote the C99 standard paragraph that explains the behaviour of the fpclassify macro: "First, an argument represented in a format wider than its semantic type is converted to its semantic type. Then classification is based on the type of the argument." Thanks. Sincerely, yours. Argentinator

This looks more like a topic for gcc-help. Even if you had quoted gcc -v, it would not reveal conclusively where your fpclassify() came from, although it would give some important clues. There are at least 3 different implementations of gcc for Windows (not counting 32- vs. 64-bit), although not all are commonly available for gcc-4.6 or 4.7. The specific version of gcc would make less difference than which implementation it is. I guess, from your finding that sizeof(long double) == 12, you are running a 32-bit compiler even on the 64-bit Windows. The 32-bit gcc I have installed on my 64-bit Windows evaluates expressions in long double unless -mfpmath=sse is set (as one would normally do). This may affect the results returned by fpclassify. 64-bit gcc defaults to -mfpmath=sse. -- Tim Prince