--- Comment #25 from whaley at cs dot utsa dot edu 2009-07-24 17:05 ---
Richard,
>GCC does not assume the stack is aligned to 16 bytes if it cannot prove that
>it is.
If this is true now, it is a change from previous behavior. When I reported
this problem, gcc *assumed* 1
--- Comment #19 from whaley at cs dot utsa dot edu 2008-12-15 23:39 ---
>There is the problem, LSB did the incorrect thing of thinking the written
>standard applied to what really was being done when the LSB was doing its
>work. Standards are made to be amended. Witness
--- Comment #17 from whaley at cs dot utsa dot edu 2008-12-15 22:01 ---
>LSB was written years after we had already did this back in gcc 3.0.
>Please check the history before saying gcc followed a written standard
>when none existed when this change was done.
LSB was m
--- Comment #15 from whaley at cs dot utsa dot edu 2008-12-15 21:32 ---
>GCC chose to change the *unwritten* standard for the ABI in use for IA32
>GNU/Linux.
This is not true. Prior to this change, gcc followed the *written* standard
provided by the LSB. You chose to viola
--- Comment #13 from whaley at cs dot utsa dot edu 2008-12-15 14:52 ---
>No; "The nice thing about standards is that there are so many to choose from"
>is a well-known saying.
And also one without application here. I am aware of no other standard for
Linux ABI other
--- Comment #11 from whaley at cs dot utsa dot edu 2008-12-12 01:48 ---
>LSB may be a starting point for plausible hypotheses about the ABIs, but
>you need to evaluate it critically to see whether each statement is
>actually an accurate description of fact.
I.e., you are s
--- Comment #8 from whaley at cs dot utsa dot edu 2008-12-12 00:51 ---
>I suppose that by "32-bit ABI for the x86" you mean a document with
>1990-1996 SCO copyrights.
I was going by the linux standards base, which still links to:
http://www.caldera.com/developers/d
--- Comment #6 from whaley at cs dot utsa dot edu 2008-12-11 23:42 ---
>GCC can and will realign the loop in 4.4 and above if the function needs a
>bigger alignment than the required 4 byte. So again I don't see any issues
>here really.
Is this the response to another
--- Comment #4 from whaley at cs dot utsa dot edu 2008-12-11 23:25 ---
>aligning the stack to 16 bytes is complaint
It might be complaint, but it certainly isn't compliant. The ABI says that you
can assume 4-byte alignment, and not all 4-byte alignments are 16-byte aligned
(o
--- Comment #1 from whaley at cs dot utsa dot edu 2008-12-11 23:01 ---
Created an attachment (id=16893)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=16893&action=view)
tarfile demonstrating problem
tarfile containing Makefile, align.f, and printvp.c that can show this
x8632 ABI
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: whaley at cs dot utsa dot edu
http
--- Comment #8 from whaley at cs dot utsa dot edu 2007-06-28 14:18 ---
I've been doing further testing on the g5 (the only machine where I have local
and root access), and this problem does not occur with stock gcc 4.1.1 either.
Therefore, whatever problem is avoided by throwing
--- Comment #7 from whaley at cs dot utsa dot edu 2007-06-28 05:25 ---
This problem affects the g5/970 as well:
Darwin. uname -a
Darwin etl-g52.cs.utsa.edu 8.10.0 Darwin Kernel Version 8.10.0: Wed May 23
16:50:59 PDT 2007; root:xnu-792.21.3~1/RELEASE_PPC Power Macintosh powerpc
Darwin
--- Comment #2 from whaley at cs dot utsa dot edu 2007-06-28 05:23 ---
Fixed, thanks.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32524
--- Comment #6 from whaley at cs dot utsa dot edu 2007-06-27 19:09 ---
Andrew,
OK, I installed stock gnu gcc 3.4.6:
78n04 TEST/MMBENCH_PPC> ~/local/gcc-3.4.6/bin/gcc -v
Reading specs from
/u/noibm122/local/gcc-3.4.6/lib/gcc/powerpc64-unknown-linux-gnu/3.4.6/specs
Configured w
--- Comment #4 from whaley at cs dot utsa dot edu 2007-06-27 17:00 ---
Andrew,
>PowerPC970FX is not a direct descendent of Power5
Sorry, completely misremembered this. Since Power4 didn't suffer as bad
as Power5 (I think it lost maybe 10% rather than 50), maybe the 970 will
uild 4.2 on OS X G5
Product: gcc
Version: 4.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: whaley at cs dot utsa dot edu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32524
--- Comment #1 from whaley at cs dot utsa dot edu 2007-06-27 16:21 ---
Created an attachment (id=13794)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13794&action=view)
Makefile and source demonstrating problem
Creates directory MMBENCH_PPC. Edit the Makefile and set G
you need.
Cheers,
Clint
--
Summary: disastrous scheduling for POWER5
Product: gcc
Version: 4.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
--- Comment #6 from whaley at cs dot utsa dot edu 2007-06-20 23:17 ---
Anybody have enough __asm__ foo to write an inline assembly macro taking a long
double operand and returning one, which I can use to call fsqrt directly in
inline assembly? I'm scoping the docs, but have never
--- Comment #5 from whaley at cs dot utsa dot edu 2007-06-20 22:17 ---
It may be C99, but since it doesn't work on 90% of the machines in the world,
it is a bit of a stretch to call it portable. My point is no standard mandates
you round down a long double (where you don't ro
--- Comment #3 from whaley at cs dot utsa dot edu 2007-06-20 21:52 ---
Turns out the proposed solution of using sqrtl is not portable. In particular,
all code using it fails to link on Windows using cygwin. Any idea how to make
this work portably? I still don't understand why
--- Comment #92 from whaley at cs dot utsa dot edu 2007-03-09 20:22 ---
I'd like to welcome the newest members of the bug 323 community, where all x87
floating point errors in gcc come to die! All floating point errors that use
the x87 are welcome, despite the fact that many of
--- Comment #1 from whaley at cs dot utsa dot edu 2007-01-26 16:21 ---
Created an attachment (id=12963)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12963&action=view)
Can be compiled to .s as described in report to duplicate error
--
http://gcc.gnu.org/bugzilla/show_
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: whaley at cs dot utsa dot edu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30599
--- Comment #10 from whaley at cs dot utsa dot edu 2006-12-19 17:18 ---
Guys,
In the interests of full disclosure, I did some quick timings on the Core2Duo,
and as I kind of suspected, scalar SSE crushed x87 there. I was pretty sure
scalar SSE could achieve 2 flop/cycle, while Intel
--- Comment #9 from whaley at cs dot utsa dot edu 2006-12-19 16:04 ---
Ian,
Thanks for the info. I see I failed to consider the cross-register moves you
mentioned. However, can't those be moved through memory, where something
destined for a 64-bit register is first written fro
--- Comment #7 from whaley at cs dot utsa dot edu 2006-12-19 00:31 ---
>Depends on what you mean by fixable by the programmer because most people don't
know anything about precusion issues.
Most people don't know programming at all, so I guess you are suggesting that
er
--- Comment #5 from whaley at cs dot utsa dot edu 2006-12-18 22:14 ---
I cannot, of course, force you to admit it, but 323 is a bug fixable by the
programmer, and this one is not. The other requires a lot of work in the
compiler, and this does not. So, viewing them as the same can be
--- Comment #3 from whaley at cs dot utsa dot edu 2006-12-18 21:16 ---
BTW, in case it isn't obvious, here's the fix that I typically use for problems
like bug 323 that I cannot use when it is gcc itself that is unpredictably spilling
the computation:
void test(double x
--- Comment #2 from whaley at cs dot utsa dot edu 2006-12-18 20:43 ---
Hi,
While it may be decided not to fix this problem, this is not a duplicate of bug
323, and so it should be closed for another reason if you want to ignore it.
323 has a problem because of the function call, where
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: whaley at cs dot utsa dot edu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30255
--- Comment #2 from whaley at cs dot utsa dot edu 2006-08-17 14:17 ---
Richard,
Thanks for confirmation. There's no chance of this happening soon, I guess?
I'm working on a release of ATLAS (fast linear algebra), and I can't enable gcc
vectorization until its nece
--- Comment #67 from whaley at cs dot utsa dot edu 2006-08-11 15:22 ---
Uros,
>Slightly offtopic, but to put some numbers to comment #8 and comment #11,
>equivalent SSE code now reaches only 50% of x87 single performance and 60% of
>x87 double performance on AMD x86_64
FYI,
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: whaley at cs dot utsa dot edu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28684
--- Comment #62 from whaley at cs dot utsa dot edu 2006-08-10 15:15 ---
Paolo,
>The IEEE standard mandates particular rules for performing operations on
>infinities, NaNs, signed zeros, denormals, ... The C standard, by
>mandating no reassociation, ensures that you don
--- Comment #60 from whaley at cs dot utsa dot edu 2006-08-10 14:08 ---
Paolo,
Thanks for the explanation of what -funsafe is presently doing.
>You are also confusing -funsafe-math-optimizations with -ffast-math.
No, what I'm doing is reading the man page (the closest th
--- Comment #58 from whaley at cs dot utsa dot edu 2006-08-09 23:01 ---
Andrew,
>Except for the fact IEEE compliant fp does not allow for reordering at all
>except
>in some small cases. For an example is (a + b) + (-a) is not the same as (a +
>(-a)) + b,
>so reorderi
--- Comment #56 from whaley at cs dot utsa dot edu 2006-08-09 21:33 ---
Dorit,
>This flag is needed in order to allow vectorization of reduction (summation
>in your case) of floating-point data.
OK, but this is a bad flag to require. From the computational scientist'
--- Comment #54 from whaley at cs dot utsa dot edu 2006-08-09 16:08 ---
Dorit,
OK, I've posted a new tarfile with a safe kernel code where the loop is not
unrolled, so that the vectorizer has a chance. With this kernel, I can make it
vectorize code, but only if I throw the -fu
--- Comment #53 from whaley at cs dot utsa dot edu 2006-08-09 15:52 ---
Created an attachment (id=12047)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12047&action=view)
benchmark wt vectorizable kernel
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
--- Comment #52 from whaley at cs dot utsa dot edu 2006-08-09 14:33 ---
Paolo,
>In some sense, this is the peephole I would rather *not* do. But the answer
>is yes. :-)
Ahh, got it :)
>So, do you now agree that the bug would be fixed if the patch that is in GCC
>4.2 wa
--- Comment #50 from whaley at cs dot utsa dot edu 2006-08-08 18:36 ---
Guys,
I've been scoping this a little closer on the Athlon64X2. I have found that
the patched gcc can achieve as much as 93% of theoretical peak (5218Mflop on a
2800MHz Athlon64X2!) for in-cache matmul whe
--- Comment #49 from whaley at cs dot utsa dot edu 2006-08-08 16:43 ---
Paolo,
>Yes, so far so good and this part has already been committed. But does
>a *single* load-and-execute instruction execute faster than the two
>instructions in a load+execute sequence?
As I said, i
--- Comment #45 from whaley at cs dot utsa dot edu 2006-08-08 02:59 ---
Guys,
OK, with Dorit's -fdump-tree-vect-details, I made a little progress on
vectorization. In order to get vectorization to work, I had to add the flag
'-funsafe-math-optimizations'. I will
--- Comment #44 from whaley at cs dot utsa dot edu 2006-08-07 21:56 ---
Guys,
OK, the mystery of why my hand-patched gcc didn't work is now cleared up. My
first clue was that neither did the SVN-build gcc! Turns out, your peephole
opt is only done if I throw the flag -O3 rather
--- Comment #41 from whaley at cs dot utsa dot edu 2006-08-07 17:19 ---
Paolo,
>Actually, the peephole phase may not change the register usage, but it
>could peruse a scratch register if available. But it would be much more
>controversial (even if backed by your hard numbers
--- Comment #39 from whaley at cs dot utsa dot edu 2006-08-07 16:47 ---
Paolo,
OK, never mind about all the questions on assembly/patches/SVN/gcc3 perf: I
checked out the main branch, and vi'd the patched file, and I see that your
patch is there. I am presently building the SVN g
--- Comment #38 from whaley at cs dot utsa dot edu 2006-08-07 15:32 ---
Paolo,
Thanks for all the help. I'm not sure I understand everything perfectly
though, so there's some questions below . . .
>I don't see how the last fmul[sl] can be removed without increasing
--- Comment #36 from whaley at cs dot utsa dot edu 2006-08-06 15:03 ---
Paolo,
Thanks for working on this. We are making progress, but I have some mixed
results. I timed the assemblies you provided directly. I added a target
"asgexe" that builds the same benchmark, assumin
--- Comment #35 from whaley at cs dot utsa dot edu 2006-08-05 18:26 ---
Created an attachment (id=12020)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12020&action=view)
new Makefile targets
OK, this is same benchmark again, now creating MMBENCHS directory. In addition
--- Comment #33 from whaley at cs dot utsa dot edu 2006-08-05 14:24 ---
Paolo,
Can you post the assembly and the patch as attachments? If necessary, I can
hack the benchmark to call the assembly routines on a couple of platforms.
Also, did you see what I did wrong in applying the
--- Comment #31 from whaley at cs dot utsa dot edu 2006-08-04 16:24 ---
Paolo,
Thanks for the update. I attempted to apply this patch, but apparently I
failed, as it made absolutely no difference. I mean, not only did it not
change performance, but if you diff the assembly, you get
Status: UNCONFIRMED
Severity: trivial
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: whaley at cs dot utsa dot edu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28519
--- Comment #29 from whaley at cs dot utsa dot edu 2006-07-04 13:15 ---
Guys,
The integer and fp differences do not appear to be strongly related. In
particular, on my P4e, gcc 4's integer code is actually faster than gcc 3's.
Further, if you look at the assemblies of t
--- Comment #28 from whaley at cs dot utsa dot edu 2006-06-29 04:17 ---
Guys,
If you are looking for the reason that the new code might be slower, my feeling
from the benchmark data is that involves hiding the cost of the loads. Notice
that, except for the cases where the double
--- Comment #26 from whaley at cs dot utsa dot edu 2006-06-28 19:57 ---
Created an attachment (id=11773)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=11773&action=view)
raw runs table is generated from
As promised, here is the raw data I built the table out of, includin
--- Comment #24 from whaley at cs dot utsa dot edu 2006-06-27 16:44 ---
Guys,
OK, here is a table summarizing the performance you can see using the
mmbench4s.tar.gz. I believe this covers a strong majority of the x86
architectures in use today (there are some specialty processors such
--- Comment #23 from whaley at cs dot utsa dot edu 2006-06-27 14:20 ---
Uros,
OK, I made the stupid assumption that the P4 would behave like the P4e,
should've known better :)
I got access to a Pentium 4 (family=15, model=2), and indeed I can repeat the
several surprising thing
--- Comment #21 from whaley at cs dot utsa dot edu 2006-06-26 15:03 ---
Uros,
Thanks for the reply; I think some confusion has set in (see below) :)
>And the results are a bit suprising (this is the exact output of your test):
Note that you are running the opposite of my test c
--- Comment #19 from whaley at cs dot utsa dot edu 2006-06-26 00:55 ---
Thanks for the info. I'm sorry to hear that no performance regression tests
are done, but I guess it kind of explains why these problems reoccur :)
As to not unrolling, the fully unrolled case is almost a
--- Comment #17 from whaley at cs dot utsa dot edu 2006-06-25 13:17 ---
OK, thanks for the reply. I will assume gcc 4 won't be fixed in the near
future. My guess is this will make icc an easier compiler for users, which I
kind of hate, which is why I worked as much as I did on
--- Comment #15 from whaley at cs dot utsa dot edu 2006-06-24 18:10 ---
Hi,
Can someone tell me if anyone is looking into this problem with the hopes of
fixing it? I just noticed that despite the posted code demonstrating the
problem, and verification on: Pentium Pro, Pentium III
--- Comment #14 from whaley at cs dot utsa dot edu 2006-06-14 02:40 ---
OK, I got access to some older machines, and it appears that Core is the only
architecture that likes gcc 4's code. More precisely, I have confirmed that
the following architectures run significantly slower
--- Comment #13 from whaley at cs dot utsa dot edu 2006-06-07 22:28 ---
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3
Guys,
Just got access to a CoreDuo machine, and tested things there. I had to
do some hand-translation of assemblies, as I didn't
--- Comment #12 from whaley at cs dot utsa dot edu 2006-06-01 18:43 ---
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3
Uros,
>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>
--- Comment #11 from whaley at cs dot utsa dot edu 2006-06-01 16:26 ---
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3
Uros,
OK, I originally replied a couple of hours ago, but that is not appearing on
bugzilla for some reason, so I'll try again, this
--- Comment #10 from whaley at cs dot utsa dot edu 2006-06-01 16:02 ---
Created an attachment (id=11571)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=11571&action=view)
Same benchmark, but with single precision timing included
Here's the same benchmark, but can ti
--- Comment #8 from whaley at cs dot utsa dot edu 2006-05-31 14:12 ---
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3
Uros,
>IMO the fact that gcc 3.x beats 4.x on this code could be attributed to pure
>luck.
As far as understanding from first prin
--- Comment #6 from whaley at cs dot utsa dot edu 2006-05-31 01:09 ---
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3
Yes, I agree it is an x86/x86_64 issue. I have not yet scoped the performance
of any of the other architectures with gcc 4 vs. 3: since 90% of