[Bug target/14552] compiled trivial vector intrinsic code is inefficient

michaelni at gmx dot at Fri, 21 Mar 2008 19:40:00 -0700


------- Comment #37 from michaelni at gmx dot at  2008-03-22 02:39 -------
Subject: Re: compiled trivial vector intrinsic code is inefficient


On Fri, Mar 21, 2008 at 10:34:00AM -0000, ubizjak at gmail dot com wrote:
> 
> 
> ------- Comment #36 from ubizjak at gmail dot com  2008-03-21 10:33 -------
> (In reply to comment #35)
> 
> > Also, ffmpeg uses asm() almost entirely instead of intrinsics, so this
> > alone is not so much a problem for ffmpeg as it is for others who followed
> > the recommendation that "intrinsics are better than asm".
> > 
> > About trolling: well, I made no attempt to reply politely and
> > diplomatically, no. But "solving" a "problem" in some use case by dropping
> > support for that use case is kind of extreme.
> > 
> > The way I see it is that
> > * It's non-trivial to place emms optimally and automatically
> > * there needs to be an emms between mmx code and fpu code
> > 
> > The solutions to this would be any one of
> > A. let the programmer place emms, as has been done in the past
> > B. don't support mmx at all
> > C. don't support the x87 fpu at all
> > D. place emms after every bunch of mmx instructions
> > E. solve a quite non-trivial problem and place emms optimally
> > 
> > The solution which has apparently been selected is B. Why was that chosen
> > instead of, let's say, A.?
> > 
> > If I write SIMD code then I know that I need an emms on x86. It's
> > trivial for the programmer to place it optimally.
> 
> I don't know where you get the idea that MMX support was dropped in any way. I

Maybe because the SIMD code in this PR, compiled with -mmmx, does not use mmx
instructions but significantly less efficient integer instructions. And you
added a test to gcc which ensures that this case does not use mmx instructions.

This is pretty much the definition of dropping mmx support (for this specific
case).
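
For illustration only, option A from the quoted list (the programmer places
emms by hand) might look roughly like the C sketch below. This is not the code
from this PR. The helper names add4_i16 and sum_then_scale are hypothetical,
and the sketch assumes an x86 target where the compiler defines __MMX__; on
any other target it falls back to a scalar loop so that it still builds.

```c
#include <stdint.h>
#include <string.h>
#if defined(__MMX__)
#include <mmintrin.h>   /* MMX intrinsics: __m64, _mm_add_pi16, _mm_empty */
#endif

/* Hypothetical helper: add four 16-bit lanes with wraparound. */
static void add4_i16(int16_t dst[4], const int16_t a[4], const int16_t b[4])
{
#if defined(__MMX__)
    __m64 va, vb, vsum;
    memcpy(&va, a, sizeof va);        /* load 8 bytes into an MMX value */
    memcpy(&vb, b, sizeof vb);
    vsum = _mm_add_pi16(va, vb);      /* corresponds to paddw */
    memcpy(dst, &vsum, sizeof vsum);
#else
    /* Scalar fallback (assumption: only used on targets without MMX). */
    for (int i = 0; i < 4; i++)
        dst[i] = (int16_t)(a[i] + b[i]);
#endif
}

/* Hypothetical caller that mixes the SIMD helper with floating point. */
static double sum_then_scale(const int16_t a[4], const int16_t b[4],
                             double scale)
{
    int16_t r[4];

    add4_i16(r, a, b);
#if defined(__MMX__)
    /* The programmer places emms once, after the last MMX operation and
       before any floating-point code. */
    _mm_empty();
#endif
    return scale * (r[0] + r[1] + r[2] + r[3]);
}
```

The point of the sketch is only where _mm_empty() sits: once, between the
last MMX operation and the first floating-point operation, which is a
decision the programmer can make without any help from the compiler.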


> won't engage in a discussion about autovectorisation, intrinsics, builtins,
> generic vectorisation, etc, etc with you,

And somehow I am glad about that.


> but please look at PR 21395 for how a
> performance PR should be filed. 

> The MMX code in that PR is _far_ from trivial,

Well, that is something I would disagree with.


> but since it is well written using intrinsic instructions, it enables
> jaw-dropping performance increase that is simply not possible when ASM blocks
> are used.
> 
> Now, I'm sure that you have your numbers ready to back up your claims from
> Comment #33 about performance of generated code, and I challenge you to beat
> performance of gcc-4.4 generated code by hand-crafted assembly using the
> example of PR 21395.

Done.
jaw-dropping intrinsics need
2.034s

stinky hand-written asm needs
1.312s

But you can read the details in PR 21395.

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552
