Re: Big regression showing up on darwin
Andrew Pinski wrote:
> On Fri, Jan 1, 2010 at 7:07 AM, FX wrote:
>> I know something is going on with section names, so I thought I'd mention
>> that there is a big regression on darwin (most "-flto -fwhopr -O2" tests
>> fail) at rev. 155544. An example is:
>
> Really LTO should be disabled when targeting Darwin. See PR 41529.

Also, this particular error is caused by the asterisks-in-DECL_ASSEMBLER_NAME problem (PR42531), which should be fixed at r.15.

cheers,
  DaveK
Re: Please update GNU GCC mirror list
On Wed, 16 Dec 2009, JohnT wrote:
> Some of the sites listed on the mirror list
> http://gcc.gnu.org/mirrors.html aren't up to date and some aren't
> accessible. LaffeyComputer.com doesn't allow access, and used to
> require a password for access. This isn't the way a GNU mirror site
> ought to operate. There should be free public access.

Thanks for the report, John. I regularly run a link checker that also covers our mirror sites. That is not as easy as it may seem at first, since some mirrors only provide local access within their geography, and removing them based on my testing from one place on the planet would be premature.

That said, mirrors.laffeycomputer.com may indeed not be active anymore. Let me include mirrormas...@laffey.biz, the documented contact for that mirror.

mirrormas...@laffey.biz, would you mind letting us know about the status of your GCC mirror site? Indeed, I am not able to access it from any machine I try, always running into a timeout.

Gerald
Re: WTF?
On Wed, 25 Nov 2009, Dave Korn wrote:
> But does it, though? From http://gcc.gnu.org/svnwrite.html:
> [...]
> So, where are whitespace changes to non-comment parts of .c and .h
> source files covered? I think that there may be a bit of a common
> assumption that "obvious" extends somewhat further than the wording of
> the documentation actually implies - not just in the context of this
> incident, but the question has occurred to me in other cases too, and
> maybe now would be a good time to clear it up.

So...

On Wed, 25 Nov 2009, Kaveh R. Ghazi wrote:
> I agree the wording could be better.

...does one of you have a suggestion on how to improve the wording? The svnwrite.html page was never meant to be "the law"; it is more a record of best practices and some rules of thumb. But of course improvements will be welcome.

Gerald
Re: The "right way" to handle alignment of pointer targets in the compiler?
Thanks for the information!

> How many people would take advantage of special machinery for some old
> CPU, if that's your goal?

Some, but I suppose the old machinery will be gone eventually. But, yes, I am most interested in current processors.

> On CPUs introduced in the last 2 years, movupd should be as fast as
> movapd,

OK, I didn't know this. Thanks for the information!

> and -mtune=barcelona should work well in general, not only in this
> example. The bigger difference in performance, for longer loops, would
> come with further batching of sums, favoring loop lengths of multiples
> of 4 (or 8, with unrolling). That alignment already favors a fairly
> long loop. As you're using C++, it seems you could have used
> inner_product() rather than writing out a function.

That was a reduced test case. The code that I'm modifying is doing two simultaneous inner products with the same number of iterations:

    for (int j = 0; j < kStateCount; j++) {
        sum1 += matrices1w[j] * partials1v[j];
        sum2 += matrices2w[j] * partials2v[j];
    }

I tried using two separate calls to inner_product, and it turns out to be slightly slower. GCC does not fuse the loops.

> My Core i7 showed matrix multiply 25x25 times 25x100 producing 17
> Gflops with gfortran in-line code. g++ produces about 80% of that.

So, one reason that I incorrectly assumed that movapd is necessary for good performance is that the SSE code is actually being matched in performance by non-SSE code, on a Core 2 processor and the x86_64 ABI. I expected the SSE code to be two times faster, if vectorization was working, since I am using double precision. But perhaps SSE should not be expected to give (much of) a performance advantage here?
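For anyone following along, the two variants being compared can be sketched as stand-alone functions (a minimal illustration; the function names and signatures are invented here, with the array roles following the snippet above):

```cpp
#include <numeric>

// Fused variant: one loop accumulating both dot products, as in the
// original code. The compiler sees both reductions in a single loop body,
// so the arrays are traversed once.
static double fused(const double* w1, const double* p1,
                    const double* w2, const double* p2, int n,
                    double* out2)
{
    double sum1 = 0.0, sum2 = 0.0;
    for (int j = 0; j < n; j++) {
        sum1 += w1[j] * p1[j];
        sum2 += w2[j] * p2[j];
    }
    *out2 = sum2;
    return sum1;
}

// Split variant: two separate inner_product() calls. GCC does not fuse
// these loops back together, so the iteration space is walked twice.
static double split(const double* w1, const double* p1,
                    const double* w2, const double* p2, int n,
                    double* out2)
{
    *out2 = std::inner_product(w2, w2 + n, p2, 0.0);
    return std::inner_product(w1, w1 + n, p1, 0.0);
}
```

Both compute the same two sums; the observation in the thread is that the fused form is slightly faster precisely because the split form's loops are not fused.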
For a recent gcc 4.5 with CXXFLAGS="-O3 -ffast-math -fno-tree-vectorize -march=native -mno-sse2 -mno-sse3 -mno-sse4" I got this code for the inner loop:

    be00:  dd 04 07                 fldl   (%rdi,%rax,1)
    be03:  dc 0c 01                 fmull  (%rcx,%rax,1)
    be06:  de c1                    faddp  %st,%st(1)
    be08:  dd 04 06                 fldl   (%rsi,%rax,1)
    be0b:  dc 0c 02                 fmull  (%rdx,%rax,1)
    be0e:  48 83 c0 08              add    $0x8,%rax
    be12:  de c2                    faddp  %st,%st(2)
    be14:  4c 39 c0                 cmp    %r8,%rax
    be17:  75 e7                    jne    be00

Using alternative CXXFLAGS="-O3 -march=native -g -ffast-math -mtune=generic" I get:

    1f1:   66 0f 57 c9              xorpd  %xmm1,%xmm1
    1f5:   31 c0                    xor    %eax,%eax
    1f7:   31 d2                    xor    %edx,%edx
    1f9:   66 0f 28 d1              movapd %xmm1,%xmm2
    1fd:   0f 1f 00                 nopl   (%rax)
    200:   f2 42 0f 10 1c 10        movsd  (%rax,%r10,1),%xmm3
    206:   83 c2 01                 add    $0x1,%edx
    209:   f2 42 0f 10 24 00        movsd  (%rax,%r8,1),%xmm4
    20f:   66 41 0f 16 5c 02 08     movhpd 0x8(%r10,%rax,1),%xmm3
    216:   66 42 0f 16 64 00 08     movhpd 0x8(%rax,%r8,1),%xmm4
    21d:   66 0f 28 c3              movapd %xmm3,%xmm0
    221:   f2 41 0f 10 1c 03        movsd  (%r11,%rax,1),%xmm3
    227:   66 0f 59 c4              mulpd  %xmm4,%xmm0
    22b:   66 41 0f 16 5c 03 08     movhpd 0x8(%r11,%rax,1),%xmm3
    232:   f2 42 0f 10 24 08        movsd  (%rax,%r9,1),%xmm4
    238:   66 42 0f 16 64 08 08     movhpd 0x8(%rax,%r9,1),%xmm4
    23f:   48 83 c0 10              add    $0x10,%rax
    243:   39 ea                    cmp    %ebp,%edx
    245:   66 0f 58 d0              addpd  %xmm0,%xmm2
    249:   66 0f 28 c3              movapd %xmm3,%xmm0
    24d:   66 0f 59 c4              mulpd  %xmm4,%xmm0
    251:   66 0f 58 c8              addpd  %xmm0,%xmm1
    255:   72 a9                    jb     200
    257:   44 39 f3                 cmp    %r14d,%ebx
    25a:   66 0f 7c c9              haddpd %xmm1,%xmm1
    25e:   44 89 f0                 mov    %r14d,%eax
    261:   66 0f 7c d2              haddpd %xmm2,%xmm2

(Note the presence of movsd / movhpd instead of movupd.)

So... should I expect the SSE code to be any faster? If not, could you possibly say why not? Are there other operations (besides inner products) where SSE code would actually be expected to be faster?

-BenRI
Re: The "right way" to handle alignment of pointer targets in the compiler?
Benjamin Redelings I wrote:
> Thanks for the information!

Here are several reasons (there are more) why gcc uses 64-bit loads by default:

1) For a single dot product, the rate of 64-bit data loads roughly balances the latency of adds to the same register. Parallel dot products (using 2 accumulators) would take advantage of faster 128-bit loads.

2) Run-time checks to adjust alignment, if possible, don't pay off for loop counts < about 40.

3) Several obsolete CPU architectures implemented 128-bit loads as pairs of 64-bit loads.

4) 64-bit loads were generally more efficient than movupd, prior to Barcelona.

In the case you quote, with parallel dot products, 128-bit loads would be required so as to show much performance gain over x87.
GCC aliasing rules: more aggressive than C99?
The aliasing policies that GCC implements seem to be more strict than what is in the C99 standard. I am wondering if this is true or whether I am mistaken (I am not an expert on the standard, so the latter is definitely possible). The relevant text is:

  An object shall have its stored value accessed only by an lvalue
  expression that has one of the following types:
    * a type compatible with the effective type of the object,
    [...]
    * an aggregate or union type that includes one of the aforementioned
      types among its members (including, recursively, a member of a
      subaggregate or contained union), or

To me this allows the following:

    int i;
    union u { int x; } *pu = (union u*)&i;
    printf("%d\n", pu->x);

In this example, the object "i", which is of type "int", is having its stored value accessed by an lvalue expression of type "union u", which includes the type "int" among its members. I have seen other articles that interpret the standard in this way. See section "Casting through a union (2)" from this article, which claims that casts of this sort are legal and that GCC's warnings against them are false positives:

  http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html

However, this appears to be contrary to GCC's documentation.
From the manpage:

  Similarly, access by taking the address, casting the resulting pointer
  and dereferencing the result has undefined behavior, even if the cast
  uses a union type, e.g.:

    int f() {
      double d = 3.0;
      return ((union a_union *) &d)->i;
    }

I have also been able to experimentally verify that GCC will mis-compile this fragment if we expect the behavior the standard specifies:

    int g;
    struct A { int x; };

    int foo(struct A *a) {
      if (g)
        a->x = 5;
      return g;
    }

With GCC 4.3.3 -O3 on x86-64 (Ubuntu), g is only loaded once:

    0:  8b 05 00 00 00 00   mov   eax,DWORD PTR [rip+0x0]   # 6
    6:  85 c0               test  eax,eax
    8:  74 06               je    10
    a:  c7 07 05 00 00 00   mov   DWORD PTR [rdi],0x5
    10: f3 c3               repz ret

But this is incorrect if foo() was called as:

    foo((struct A*)&g);

Here is another example:

    struct A { int x; };
    struct B { int x; };

    int foo(struct A *a, struct B *b) {
      if (a->x)
        b->x = 5;
      return a->x;
    }

When I compile this, a->x is only loaded once, even though foo() could have been called like this:

    int i;
    foo((struct A*)&i, (struct B*)&i);

From this I conclude that GCC diverges from the standard, in that it does not allow casts of this sort. In one sense this is good (because the policy GCC implements is more aggressive, and yet still reasonable), but on the other hand it means (if I am not mistaken) that GCC will incorrectly optimize strictly conforming programs.

Clarifications are most welcome!

Josh
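For contrast, the form of type punning that GCC's documentation does sanction is an access that goes through the union object itself, rather than through a cast pointer. A hedged sketch (the names `pun`, `bits_of`, and `bits_memcpy` are invented here; this assumes 64-bit `double` and `long long`):

```cpp
#include <cstring>

// Punning through a union object: GCC documents that reading a member
// other than the one last written is permitted when the access is a
// member access on the union itself (a documented GCC extension).
union pun {
    double    d;
    long long i;
};

long long bits_of(double x)
{
    union pun u;
    u.d = x;
    return u.i;   // access goes through the union type, not a cast pointer
}

// A fully portable alternative: memcpy the representation. Modern GCC
// compiles this to the same code with no actual copy.
long long bits_memcpy(double x)
{
    long long r;
    static_assert(sizeof r == sizeof x, "assumes 64-bit double/long long");
    std::memcpy(&r, &x, sizeof r);
    return r;
}
```

Either form avoids the "cast the address and dereference" pattern that the manpage quoted above calls out as undefined.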