On Wed, Nov 2, 2011 at 5:57 PM, Jagasia, Harsha <harsha.jaga...@amd.com> wrote:
>> >> > > We would like to propose changing AVX generic mode tuning to
>> >> > > generate 128-bit AVX instead of 256-bit AVX.
>> >> >
>> >> > You indicate a 3% reduction on bulldozer with avx256.
>> >> > How does avx128 compare to -mno-avx -msse4.2?
>> >>
>> >> We see these % differences going from SSE42 to AVX128 to AVX256 on
>> >> Bulldozer with "-mtune=generic -Ofast".
>> >> (Positive is improvement, negative is degradation)
>> >>
>> >> Bulldozer:
>> >>                AVX128/SSE42  AVX256/AVX-128
>> >> 410.bwaves            -1.4%           -1.4%
>> >> 416.gamess            -1.1%            0.0%
>> >> 433.milc               0.5%           -2.4%
>> >> 434.zeusmp             9.7%           -2.1%
>> >> 435.gromacs            5.1%            0.5%
>> >> 436.cactusADM          8.2%          -23.8%
>> >> 437.leslie3d           8.1%            0.4%
>> >> 444.namd               3.6%            0.0%
>> >> 447.dealII            -1.4%           -0.4%
>> >> 450.soplex            -0.4%           -0.4%
>> >> 453.povray             0.0%           -1.5%
>> >> 454.calculix          15.7%           -8.3%
>> >> 459.GemsFDTD           4.9%            1.4%
>> >> 465.tonto              1.3%           -0.6%
>> >> 470.lbm                0.9%            0.3%
>> >> 481.wrf                7.3%           -3.6%
>> >> 482.sphinx3            5.0%           -9.8%
>> >> SPECFP                 3.8%           -3.2%
>> >>
>> >> > Will the next AMD generation have a useable avx256?
>> >> > I'm not keen on the idea of generic mode being tune
>> >> > for a single processor revision that maybe shouldn't
>> >> > actually be using avx at all.
>> >>
>> >> We see a substantial gain in several SPECFP benchmarks going from
>> >> SSE42 to AVX128 on Bulldozer.
>> >> IMHO, accomplishing even a 5% gain in an individual benchmark takes
>> >> a hardware company several man months.
>> >> The loss with AVX256 for Bulldozer is much more significant than the
>> >> gain for SandyBridge.
>> >> While the general trend in the industry is a move toward AVX256, for
>> >> now we would be disadvantaging Bulldozer with this choice.
>> >>
>> >> We have several customers who use -mtune=generic and it is default,
>> >> unless a user explicitly overrides it with -mtune=native.
>> >> They are the ones who want to experiment with latest ISA using gcc,
>> >> but want to keep their ISA selection and tuning agnostic on x86/64.
>> >> IMHO, it is with these customers in mind that generic was introduced
>> >> in the first place.
>> >
>> > Since stage 1 closure is around the corner, just wanted to ping to
>> > see if the maintainers have made up their mind on this one.
>> > AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes
>> > out pretty much all of that gain in generic mode.
>> > Until there is a convergence on AVX-256 for x86/64, we would like to
>> > propose having generic generate avx-128 by default and have a user
>> > override to avx-256 manually when known to benefit performance.
>>
>> Did somebody spend the time analyzing why CactusADM shows so much of a
>> difference?
>> With the recent improvements in vectorizing for AVX, did you
>> re-do the measurements with a recent trunk?
>>
>> I don't think disabling avx-256 by default is a good idea until we
>> understand why these numbers happen and are convinced we cannot fix
>> this by proper cost modeling.
>
> We have observed cases where AVX-256 bit code is slower than AVX-128 bit code
> on Bulldozer. This is because internally the front end, data paths etc for
> Bulldozer are designed for optimal AVX 128-bit. Throwing densely packed
> 256-bit code at the pipeline can congest the front end causing stalls and
> hence slowdowns. We expect the behavior of cactus, calculix and sphinx, which
> are the 3 benchmarks with the biggest avx-256 gaps, to be in the same vein.
> In general, the hardware design engineers recommend running AVX 128-bit code
> on Bulldozer. Given the underlying hardware design, software tuning can't
> really change the results here. Any further analysis of cactus would be a
> cycle sink at our end and we may not even be able to discuss the details on a
> public mailing list. x86/64 has not yet converged on avx-256 and generic mode
> should reflect that.

Well, generic hasn't converged on AVX at all. Cost modeling can deal
with code density just fine - are there any differences between code
density issues of say, loads vs. stores vs. arithmetic?

I specifically ask about analysis because AVX-256 has instruction set
issues for certain patterns the vectorizer generates and the cost
model currently does not reflect these at all.

Richard.

> Posting the re-measurements on trunk for cactus, calculix and sphinx on
> Bulldozer:
>                AVX128/SSE42  AVX256/AVX-128
> 436.cactusADM           10%            -30%
> 454.calculix          14.7%             -6%
> 482.sphinx3              7%             -9%
>
> All positive % above are improvements, all negative % are degradations.
>
> I will post re-measurements for all of Spec with latest trunk as soon as I
> have them.
>
> Thoughts?
>
> Thanks,
> Harsha