> >> > > We would like to propose changing AVX generic mode tuning to generate
> >> > > 128-bit AVX instead of 256-bit AVX.
> >> >
> >> > You indicate a 3% reduction on bulldozer with avx256.
> >> > How does avx128 compare to -mno-avx -msse4.2?
> >>
> >> We see these % differences going from SSE42 to AVX128 to AVX256 on
> >> Bulldozer with "-mtune=generic -Ofast".
> >> (Positive is improvement, negative is degradation)
> >>
> >> Bulldozer:
> >>                       AVX128/SSE42    AVX256/AVX-128
> >> 410.bwaves            -1.4%                   -1.4%
> >> 416.gamess            -1.1%                   0.0%
> >> 433.milc              0.5%                    -2.4%
> >> 434.zeusmp            9.7%                    -2.1%
> >> 435.gromacs           5.1%                    0.5%
> >> 436.cactusADM         8.2%                    -23.8%
> >> 437.leslie3d          8.1%                    0.4%
> >> 444.namd              3.6%                    0.0%
> >> 447.dealII            -1.4%                   -0.4%
> >> 450.soplex            -0.4%                   -0.4%
> >> 453.povray            0.0%                    -1.5%
> >> 454.calculix          15.7%                   -8.3%
> >> 459.GemsFDTD          4.9%                    1.4%
> >> 465.tonto             1.3%                    -0.6%
> >> 470.lbm               0.9%                    0.3%
> >> 481.wrf               7.3%                    -3.6%
> >> 482.sphinx3           5.0%                    -9.8%
> >> SPECFP                3.8%                    -3.2%
> >>
> >> > Will the next AMD generation have a useable avx256?
> >> > I'm not keen on the idea of generic mode being tune
> >> > for a single processor revision that maybe shouldn't
> >> > actually be using avx at all.
> >>
> >> We see a substantial gain in several SPECFP benchmarks going from SSE42
> >> to AVX128 on Bulldozer.
> >> IMHO, accomplishing even a 5% gain in an individual benchmark takes a
> >> hardware company several man months.
> >> The loss with AVX256 for Bulldozer is much more significant than the
> >> gain for SandyBridge.
> >> While the general trend in the industry is a move toward AVX256, for
> >> now we would be disadvantaging Bulldozer with this choice.
> >>
> >> We have several customers who use -mtune=generic and it is default,
> >> unless a user explicitly overrides it with -mtune=native. They are the
> >> ones who want to experiment with latest ISA using gcc, but want to keep
> >> their ISA selection and tuning agnostic on x86/64. IMHO, it is with
> >> these customers in mind that generic was introduced in the first place.
> >
> > Since stage 1 closure is around the corner, just wanted to ping to see if
> > the maintainers have made up their mind on this one.
> > AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes out
> > pretty much all of that gain in generic mode.
> > Until there is a convergence on AVX-256 for x86/64, we would like to propose
> > having generic generate avx-128 by default and have a user override to
> > avx-256 manually when known to benefit performance.
> 
> Did somebody spend the time analyzing why CactusADM shows so much of a
> difference?
> With the recent improvements in vectorizing for AVX, did you re-do the
> measurements with a recent trunk?
>
> I don't think disabling avx-256 by default is a good idea until we
> understand why these numbers happen and are convinced we cannot fix this by
> proper cost modeling.

We have observed cases where 256-bit AVX code is slower than 128-bit AVX code
on Bulldozer. Internally, the front end, data paths, etc. of Bulldozer are
designed for optimal 128-bit AVX. Throwing densely packed 256-bit code at the
pipeline can congest the front end, causing stalls and hence slowdowns. We
expect cactusADM, calculix and sphinx3, the three benchmarks with the biggest
AVX-256 gaps, to be showing the same effect. In general, the hardware design
engineers recommend running 128-bit AVX code on Bulldozer. Given the underlying
hardware design, software tuning can't really change the results here. Any
further analysis of cactusADM would be a cycle sink at our end, and we may not
even be able to discuss the details on a public mailing list. x86/64 has not
yet converged on AVX-256, and generic mode should reflect that.
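
For anyone who wants to look at the 128-bit versus 256-bit code generation on
a small kernel of their own, here is a minimal sketch. The function name
scale_add is purely illustrative and is not taken from SPEC. The build lines
in the comment are assumptions about the local toolchain: -mprefer-avx128 is
accepted by some GCC releases and newer ones spell it
-mprefer-vector-width=128, so please check the manual for the version in use.

```c
#include <stddef.h>

/*
 * Hypothetical vectorizable kernel: y[i] = a * x[i] + y[i].
 *
 * Possible build variants to compare (flag availability depends on the
 * GCC version, treat these as assumptions):
 *   gcc -O3 -msse4.2 -mno-avx        -c kernel.c    SSE4.2 only
 *   gcc -O3 -mavx -mprefer-avx128    -c kernel.c    128-bit AVX
 *   gcc -O3 -mavx                    -c kernel.c    256-bit AVX allowed
 */
void scale_add(size_t n, double a,
               const double *restrict x, double *restrict y)
{
    /* restrict tells the compiler x and y do not alias, which helps the
       auto-vectorizer treat the iterations as independent. */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Comparing the assembly from the last two variants (for example with -S) shows
whether the vectorizer picked xmm or ymm registers for the loop body.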

Here are the re-measurements on trunk for cactusADM, calculix and sphinx3 on
Bulldozer:

                AVX128/SSE42    AVX256/AVX128
436.cactusADM   10%             -30%
454.calculix    14.7%           -6%
482.sphinx3     7%              -9%

All positive % above are improvements, all negative % are degradations.
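
To read the two columns together, the changes can be chained to get the net
effect of going from SSE42 straight to AVX256. The sketch below assumes both
columns are relative changes in benchmark score, so they compose
multiplicatively. The helper names compound_pct and print_net are made up for
illustration.

```c
#include <stdio.h>

/* Combine two successive percentage changes into one overall change.
   Inputs and result are in percent; positive means improvement. */
double compound_pct(double first_pct, double second_pct)
{
    return ((1.0 + first_pct / 100.0) * (1.0 + second_pct / 100.0) - 1.0)
           * 100.0;
}

/* Print the net AVX256-versus-SSE42 change for one benchmark row. */
void print_net(const char *name, double avx128_vs_sse42,
               double avx256_vs_avx128)
{
    printf("%-14s AVX256/SSE42 = %+.1f%%\n", name,
           compound_pct(avx128_vs_sse42, avx256_vs_avx128));
}
```

For example, print_net("436.cactusADM", 10.0, -30.0) gives about -23%, since
1.10 * 0.70 = 0.77. The inputs are rounded table entries, so the result is
only approximate.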

I will post re-measurements for all of SPEC with the latest trunk as soon as I
have them.

Thoughts?

Thanks,
Harsha

