Re: AVX generic mode tuning discussion.

Richard Guenther Wed, 02 Nov 2011 13:37:05 -0700

On Wed, Nov 2, 2011 at 5:57 PM, Jagasia, Harsha <harsha.jaga...@amd.com> wrote:
>> >> > > We would like to propose changing AVX generic mode tuning to
>> >> > > generate 128-bit AVX instead of 256-bit AVX.
>> >> >
>> >> > You indicate a 3% reduction on bulldozer with avx256.
>> >> > How does avx128 compare to -mno-avx -msse4.2?
>> >>
>> >> We see these % differences going from SSE42 to AVX128 to AVX256 on
>> >> Bulldozer with "-mtune=generic -Ofast".
>> >> (Positive is improvement, negative is degradation)
>> >>
>> >> Bulldozer:
>> >>                       AVX128/SSE42    AVX256/AVX-128
>> >> 410.bwaves            -1.4%                   -1.4%
>> >> 416.gamess            -1.1%                   0.0%
>> >> 433.milc              0.5%                    -2.4%
>> >> 434.zeusmp            9.7%                    -2.1%
>> >> 435.gromacs           5.1%                    0.5%
>> >> 436.cactusADM         8.2%                    -23.8%
>> >> 437.leslie3d          8.1%                    0.4%
>> >> 444.namd              3.6%                    0.0%
>> >> 447.dealII            -1.4%                   -0.4%
>> >> 450.soplex            -0.4%                   -0.4%
>> >> 453.povray            0.0%                    -1.5%
>> >> 454.calculix          15.7%                   -8.3%
>> >> 459.GemsFDTD          4.9%                    1.4%
>> >> 465.tonto             1.3%                    -0.6%
>> >> 470.lbm               0.9%                    0.3%
>> >> 481.wrf               7.3%                    -3.6%
>> >> 482.sphinx3           5.0%                    -9.8%
>> >> SPECFP                3.8%                    -3.2%
>> >>
>> >> > Will the next AMD generation have a usable avx256?
>> >> > I'm not keen on the idea of generic mode being tuned
>> >> > for a single processor revision that maybe shouldn't
>> >> > actually be using avx at all.
>> >>
>> >> We see a substantial gain in several SPECFP benchmarks going from
>> >> SSE42 to AVX128 on Bulldozer.
>> >> IMHO, accomplishing even a 5% gain in an individual benchmark takes
>> >> a hardware company several man months.
>> >> The loss with AVX256 for Bulldozer is much more significant than the
>> >> gain for SandyBridge.
>> >> While the general trend in the industry is a move toward AVX256, for
>> >> now we would be disadvantaging Bulldozer with this choice.
>> >>
>> >> We have several customers who use -mtune=generic and it is default,
>> >> unless a user explicitly overrides it with -mtune=native. They are
>> >> the ones who want to experiment with latest ISA using gcc, but want
>> >> to keep their ISA selection and tuning agnostic on x86/64. IMHO, it
>> >> is with these customers in mind that generic was introduced in the
>> >> first place.
>> >
>> > Since stage 1 closure is around the corner, just wanted to ping to
>> > see if the maintainers have made up their mind on this one.
>> > AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes
>> > out pretty much all of that gain in generic mode.
>> > Until there is a convergence on AVX-256 for x86/64, we would like to
>> > propose having generic generate avx-128 by default and have a user
>> > override to avx-256 manually when known to benefit performance.
>>
>> Did somebody spend the time analyzing why CactusADM shows so much of a
>> difference?
>> With the recent improvements in vectorizing for AVX, did you re-do the
>> measurements with a recent trunk?
>>
>> I don't think disabling avx-256 by default is a good idea until we
>> understand why these numbers happen and are convinced we cannot fix
>> this by proper cost modeling.
>
> We have observed cases where 256-bit AVX code is slower than 128-bit AVX code
> on Bulldozer. This is because internally the front end, data paths, etc. for
> Bulldozer are designed to be optimal for 128-bit AVX. Throwing densely packed
> 256-bit code at the pipeline can congest the front end causing stalls and 
> hence slowdowns. We expect the behavior of cactus, calculix and sphinx, which 
> are the 3 benchmarks with the biggest avx-256 gaps, to be in the same vein. 
> In general, the hardware design engineers recommend running AVX 128-bit code 
> on Bulldozer. Given the underlying hardware design, software tuning can't 
> really change the results here. Any further analysis of cactus would be a 
> cycle sink at our end and we may not even be able to discuss the details on a 
> public mailing list. x86/64 has not yet converged on avx-256 and generic mode 
> should reflect that.


Well, generic hasn't converged on AVX at all.  Cost modeling can deal
with code density just fine - are there any differences between code
density issues of, say, loads vs. stores vs. arithmetic?  I specifically
ask about analysis because AVX-256 has instruction set issues for
certain patterns the vectorizer generates, and the cost model currently
does not reflect these at all.

Richard.
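
For anyone who wants to look at the 128-bit vs. 256-bit code generation
question on a small case, here is a minimal sketch.  The kernel is
hypothetical and not taken from any of the SPEC benchmarks above; the name
`saxpy` and the file name below are assumptions for illustration only.

```c
#include <stddef.h>

/* Hypothetical kernel (not from SPEC): y[i] += a * x[i].
 * `restrict` promises the compiler that x and y do not overlap, which
 * makes the loop a straightforward auto-vectorization candidate. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Assuming a GCC that provides `-mprefer-avx128` (newer releases spell this
`-mprefer-vector-width=128`), compiling the file once with
`gcc -std=c99 -O3 -mavx -S saxpy.c` and once with
`gcc -std=c99 -O3 -mavx -mprefer-avx128 -S saxpy.c` and comparing the use of
%ymm vs. %xmm registers in the loop body should show what the vectorizer
picks in each mode.  This only shows code generation; it says nothing about
which variant is faster on a given machine.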

> Posting the re-measurements on trunk for cactus, calculix and sphinx on 
> Bulldozer:
>                 AVX128/SSE42    AVX256/AVX-128
> 436.cactusADM   10%             -30%
> 454.calculix    14.7%           -6%
> 482.sphinx3     7%              -9%
>
> All positive % above are improvements, all negative % are degradations.
>
> I will post re-measurements for all of Spec with latest trunk as soon as I 
> have them.
>
> Thoughts?
>
> Thanks,
> Harsha
>
>
>
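
The two columns in the tables above are relative to different baselines, so
the net AVX256-vs-SSE42 change is not the sum of the two percentages.  A
small sketch of the arithmetic, assuming each percentage is a relative
change in the same higher-is-better score (the helper name
`compose_percent` is made up for illustration):

```c
/* Compose two relative changes given in percent.
 * Going A -> B by p1 % and B -> C by p2 % gives A -> C of
 * ((1 + p1/100) * (1 + p2/100) - 1) * 100 percent. */
double compose_percent(double p1, double p2)
{
    return ((1.0 + p1 / 100.0) * (1.0 + p2 / 100.0) - 1.0) * 100.0;
}
```

Under that assumption, the SPECFP row (3.8% then -3.2%) composes to roughly
+0.5% for AVX256 over SSE42, which matches the statement that AVX-256 wipes
out pretty much all of the AVX-128 gain.  The re-measured cactusADM row
(10% then -30%) composes to -23%.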
