------- Comment #38 from whaley at cs dot utsa dot edu  2006-08-07 15:32 -------
Paolo,

Thanks for all the help.  I'm not sure I understand everything perfectly
though, so there are some questions below . . .

>I don't see how the last fmul[sl] can be removed without increasing code size.

Since the flags are asking for performance, not size optimization, this should
only be an argument if the fmul[sl]'s are performance-neutral.  A lot of
performance optimizations increase code size, after all . . .  Obviously, code
with no fmul[sl] is possible, since gcc 3 achieves it.  However, I can see that
the peephole phase might not be able to change the register usage.

>Can you please try re-running the tests?  It takes skill^W^W

Yes, I found the results confusing as well, which is why I reran them 50 times
before posting.  I also posted the tarfile (with Makefile and assemblies) that
built them, so that my mistakes could be caught by someone with more skill. 
Just as a check, maybe you can confirm the .s you posted is the right one?  I
can't find the loads of the matrix C anywhere in its assembly, and I can find
them in the double version  . . .  Anyway, I like your suggestion (below) of
getting the compiler so we won't have to worry about assemblies, so that's
probably the way to go.  On this front, is there some reason you cannot post
the patch(es) as attachments, just to rule out copy problems, as I've asked in
the last several messages?  Note there's no need if I can grab your stuff from
SVN,
as below . . .

>because my tests were run on a similar Prescott (P4e)

You didn't post the gcc 3 performance numbers.  What were those like?  If
you beat/tied gcc 3, then the remaining fmul[sl] are probably not a big
deal.  If gcc 3 is still winning, on the other hand . . .

>It also would be interesting to re-run your code generator on a compiler built 
>from svn trunk.

Are your changes on a branch I could check out?  If so, give me the commands to
get that branch, as we are scoping assemblies only because of the patching
problem.  Having a full compiler would indeed enable more detailed
investigations, including turning the full code generator loose on the improved
compiler.

>Also, I strongly believe that you should implement vectorization,

ATLAS implements vectorization, by writing the entire GEMM kernel in assembly
and directly using SSE.  However, there are cases where generated C code must
be called, and that's where gcc comes in . . .

>or at least find out *why* GCC does not vectorize your code. It may be simply 
>that it does not have any guarantee on the alignment.

I'm all for this.  info gcc says that w/o a guarantee of alignment, loops are
duplicated, with an if selecting between the vector and scalar loops.  Is this
not accurate?  I spent a day trying to get gcc to vectorize any of the
generator's loops, and did not succeed (can you make it vectorize the provided
benchmark code?).  I also tried various unrollings of the inner loop,
particularly no unrolling and unroll=2 (vector length).  I was unable to truly
decipher the warning messages explaining the lack of vectorization, and I would
truly welcome some help in fixing this.
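
To make sure we mean the same thing by "duped loops", here is a hand-written
sketch of the shape I understand that description to imply.  It is not gcc
output and not ATLAS's generated code: dot_versioned and is_aligned16 are
hypothetical names, the dot product is only a stand-in for a real inner loop,
and the 16-byte test assumes SSE-style aligned loads.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when p sits on a 16-byte boundary
   (assumed here to be what an aligned SSE load would need). */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical stand-in for a generated inner loop: a dot product
   with a runtime "if" choosing between two copies of the loop. */
double dot_versioned(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i = 0;

    if (is_aligned16(a) && is_aligned16(b)) {
        /* Copy 1: two doubles per step (vector length 2), with
           independent accumulators, as a vectorized loop would do. */
        double s0 = 0.0, s1 = 0.0;
        for (; i + 1 < n; i += 2) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
        }
        sum = s0 + s1;
    }

    /* Copy 2: plain scalar loop.  It runs the whole range when the
       pointers are unaligned, and only the leftover element otherwise. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

If that is roughly what the vectorizer is supposed to emit, then missing
alignment guarantees alone should cost an extra branch rather than prevent
vectorization, which is why the lack of any vectorized loop puzzles me.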

This is a separate issue from the x87 code, and this tracker item is already
fairly complex :) I'm assuming if I attempted to open a bug report titled "gcc
will not vectorize ATLAS's generated code" it would be closed pretty quickly. 
Maybe you can recommend how to approach this, or open another report that we
can exchange info on?  I would truly appreciate the opportunity to get some
feedback from gcc authors to help guide me to solving this problem.

Thanks for all the info,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
