We would like to propose changing AVX generic mode tuning to generate 128-bit AVX instead of 256-bit AVX. Per H.J.'s suggestion, we have reviewed the various tuning choices made for generic mode with respect to AMD's upcoming Bulldozer processor. At the moment, this is the most significant change we have to propose. While we are willing to re-engineer generic mode, this change needs immediate discussion because its performance impact on Bulldozer is significant.
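To make the proposal concrete, here is a minimal sketch of the kind of loop whose code generation this tuning affects. It is not taken from CPU2006; the function name saxpy and the file name saxpy.c are hypothetical. The compile lines in the comment assume a GCC that accepts -mprefer-avx128, and whether the loop is vectorized at all depends on the compiler version.

```c
#include <stddef.h>

/*
 * Hypothetical kernel for illustration only: y[i] += a * x[i].
 *
 * Assumed compile lines for comparing the two code shapes:
 *   gcc -Ofast -mtune=generic -mavx -S saxpy.c
 *       (the vectorizer may use 256-bit ymm registers)
 *   gcc -Ofast -mtune=generic -mavx -mprefer-avx128 -S saxpy.c
 *       (the vectorizer is asked to stay with 128-bit xmm registers)
 */
void saxpy(size_t n, float a, const float *x, float *y)
{
    size_t i;

    /* Simple unit-stride loop, a typical auto-vectorization candidate. */
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Inspecting the generated assembly for ymm versus xmm register use is one way to confirm which vector width was chosen.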

Here is the relative CPU2006 performance data we have gathered using gcc on AMD Bulldozer (BD) and Intel Sandybridge (SB) machines with "-Ofast -mtune=generic -mavx".

%gain/loss, avx256 vs avx128
(negative % indicates a loss, positive % indicates a gain)

                  AMD BD  Intel SB
410.bwaves         -2.34     -1.52
416.gamess         -1.11     -0.30
433.milc            0.47     -1.75
434.zeusmp         -3.61      0.68
435.gromacs        -0.54     -0.38
436.cactusADM     -23.56     21.49
437.leslie3d       -0.44      1.56
444.namd            0.00      0.00
447.dealII         -0.36     -0.23
450.soplex         -0.43     -0.29
453.povray          0.50      3.63
454.calculix       -8.29      1.38
459.GemsFDTD        2.37     -1.54
465.tonto           0.00      0.00
470.lbm             0.00      0.21
481.wrf            -4.80      0.00
482.sphinx3       -10.20     -3.65
SpecFP             -3.29      1.01

400.perlbench       0.93      1.47
401.bzip2           0.60      0.00
403.gcc             0.00      0.00
429.mcf             0.00     -0.36
445.gobmk          -1.03      0.37
456.hmmer          -0.64      0.38
458.sjeng           1.74      0.00
462.libquantum      0.31      0.00
464.h264ref         0.00      0.00
471.omnetpp        -1.27      0.00
473.astar           0.00      0.46
483.xalancbmk       0.51      0.00
SpecINT             0.09      0.19

According to this data, the roughly 1% SpecFP gain on Intel Sandybridge is outweighed by a roughly 3% SpecFP degradation on AMD Bulldozer.

For the data above, generic mode splits both 256-bit misaligned loads and stores, as is currently the case in trunk. Even if we disable 256-bit misaligned load splitting, AVX 256-bit performance on SpecFP improves by only ~1.4% on AMD Bulldozer, while it drops by 0.12% on Intel Sandybridge. With 256-bit load splitting disabled, comparing AVX 256 to AVX 128 therefore gives a net 0.9% gain on Intel Sandybridge against a 1.9% loss on AMD Bulldozer, so AVX 256 is still not a fair choice for generic mode.

Please share your thoughts. It would be great if H.J. could verify the Intel Sandybridge data.

Thanks,
Harsha