On Mon, Jan 7, 2013 at 7:21 PM, Jagasia, Harsha <harsha.jaga...@amd.com> wrote:
>>>> We would like to propose changing AVX generic mode tuning to generate
>>>> 128-bit AVX instead of 256-bit AVX.
>>>
>>> You indicate a 3% reduction on bulldozer with avx256.
>>> How does avx128 compare to -mno-avx -msse4.2?
>> Will the next AMD generation have a useable avx256?
>>>
>>> I'm not keen on the idea of generic mode being tuned for a single
>>> processor revision that maybe shouldn't actually be using avx at all.
>
>> Btw, it looks like the data is massively skewed by 436.cactusADM. What are
>> the overall numbers if you disregard cactus? It's also for sure the case
>> that the vectorizer cost model has not been touched for avx256 vs. avx128
>> vs. sse, so a more sensible approach would be to look at differentiating
>> things there to improve the cactus numbers.
>
>> Harsha, did you investigate why avx256 is such a loss for cactus or why it
>> is so much of a win for SB?
>
> I know this thread has not been closed from our end for a while now, but we
> (AMD) would really like to re-open this discussion. So here goes.
>
> We did investigate why cactus is slower in avx-256 mode than in avx-128
> mode on AMD processors.
>
> Using "-Ofast" (with the appropriate flags to generate avx-128 or avx-256
> code) and running with the reference data set, we observe the following
> runtimes on Bulldozer:
>
>                                    Runtime
>   AVX-128                         616s
>   AVX-256 with store splitting    853s
>   %Diff AVX-256 versus AVX-128:   38%
>
> Scheduling and predictive commoning are turned off in the compiler in both
> cases, so that the code generated for the avx-128 and avx-256 cases is
> mostly equivalent, i.e. only avx-128 instructions on one side are replaced
> by avx-256 instructions on the other.
>
> Looking at the cactus source and oprofile reports, the hottest loop nest is
> a triple-nested loop. The innermost loop of this nest has ~400 lines of
> Fortran code and accounts for 99% of the run time of the benchmark.
>
> Gcc vectorizes the innermost loop in both the 128-bit and 256-bit cases. To
> vectorize the innermost loop, gcc generates a scalar prologue loop to align
> the relevant vectors, followed by a packed SIMD avx loop, followed by a
> scalar epilogue loop to handle what is left after a whole multiple of the
> vectorization factor is taken care of (a C sketch of this three-loop shape
> follows below).
>
> Here are the oprofile samples for the innermost Fortran loop's three
> components in the avx-128 and avx-256 cases:
>
>   Oprofile samples           AVX-128   AVX-256-ss   Gap      Gap as % of
>                                                     samples  total runtime
>   Total                      153408    214448       61040    38%
>   SIMD vector loop           135653    183074       47421    30%
>   SIMD scalar prolog loop      3817     10434        6617     4%
>   SIMD scalar epilog loop      3471     10072        6601     4%
>
> The avx-256 code spends 30% more time in the SIMD vector loop than the
> avx-128 code. The generated code for this vector loop appears to be
> equivalent in the 128-bit and 256-bit cases, i.e. only avx-128 instructions
> on one side are replaced by avx-256 instructions on the other. The
> instruction mix and scheduling are the same, except for the spilling and
> reloading of one variable.
>
> We know this gap exists because fewer physical registers are available for
> renaming in the avx-256 code, since our processor loses the upper halves of
> the FP registers for renaming.
> Our entire SIMD pipeline in the processor is 128 bits wide and we have no
> true native 256-bit datapath, even for foreseeable future generations,
> unlike Sandybridge/Ivybridge.
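To make the three-loop shape concrete, here is a minimal, self-contained C
sketch of the peeling scheme described in the quoted text. The function name
and the scalar loop body are illustrative stand-ins (gcc emits this structure
directly from the Fortran loop); the vec_bytes parameter would be 16 for
avx-128 and 32 for avx-256.

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of the vectorizer's peeling scheme: a scalar prologue peels
       iterations until 'a' is aligned to the vector size, the vector body
       processes whole vectors, and a scalar epilogue handles the rest.  */
    void
    add_arrays (double *a, const double *b, size_t n, size_t vec_bytes)
    {
      size_t vf = vec_bytes / sizeof (double);  /* vectorization factor */
      size_t i = 0;

      /* Scalar prologue: peel until 'a + i' hits a vec_bytes boundary.  */
      while (i < n && ((uintptr_t) (a + i) % vec_bytes) != 0)
        {
          a[i] += b[i];
          i++;
        }

      /* Vector body: in real gcc output this is one avx load/add/store
         per iteration; written as a scalar inner loop here for clarity.  */
      for (; i + vf <= n; i += vf)
        for (size_t j = 0; j < vf; j++)
          a[i + j] += b[i + j];

      /* Scalar epilogue: the leftover n mod vf elements.  */
      for (; i < n; i++)
        a[i] += b[i];
    }

With a larger vec_bytes the prologue may have to peel more elements before
reaching an aligned address, which is exactly the effect quantified next.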
> The avx-256 code spends 8% more time in the SIMD scalar prologue and
> epilogue than the avx-128 code. The generated code is exactly the same for
> these scalar loops in the 128-bit and 256-bit cases, i.e. exactly the same
> instruction mix and scheduling. The reason for the gap is the number of
> iterations that gcc executes in these loops in the two cases.
>
> This is because gcc follows Sandybridge's recommendation and aligns avx-256
> vectors to a 32-byte boundary instead of a 16-byte boundary, even on
> Bulldozer.
> The Sandybridge Software Optimization Guide states that the optimal memory
> alignment for an AVX 256-bit vector stored in memory is 32 bytes.
> The Bulldozer Software Optimization Guide says "Align all packed
> floating-point data on 16-byte boundaries".
>
> In the case of cactus, the relevant double-precision vector has 118
> elements that are stepped through in unit stride, and the first element
> handled in the Fortran loop is aligned at a boundary akin to 0x8.
> In avx-128 mode, gcc generates a scalar prologue loop that processes one
> element at location 0x8, then a vector loop that processes the next 116
> elements starting at location 0x10, i.e. a 16-byte-aligned location, then a
> scalar epilogue loop that processes the one element left.
> In avx-256 mode, gcc generates a scalar prologue loop that processes the
> first 3 elements at locations 0x8, 0x10 and 0x18, then a vector loop that
> processes the next 112 elements starting at location 0x20, i.e. a
> 32-byte-aligned location, then a scalar epilogue loop that processes the
> last three elements.
>
> Since this Fortran loop is nested inside another doubly-nested loop, the
> overall impact of doing more work in the scalar loops and less in the
> vector loop is a reduction in the overall vectorization benefit on
> Bulldozer.
>
> Enabling avx-256 by default and choosing alignments and vectorization
> factors that are optimal for Intel Sandybridge/Ivybridge but sub-optimal
> for AMD processors is truly against the spirit of generic mode.
>
> PS: I am not actively working on gcc right now, but other AMD gcc team
> members will pitch in if more is needed.
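The iteration counts quoted above follow directly from the alignment
arithmetic. A small standalone C example (illustrative, not gcc code)
reproduces the 1/116/1 versus 3/112/3 split for a 118-element double vector
whose first element sits at byte offset 0x8:

    #include <stdio.h>

    /* Scalar prologue iterations needed to bring a double access at byte
       offset 'off' up to an 'align'-byte boundary.  */
    static unsigned
    peel_count (unsigned off, unsigned align)
    {
      return ((align - off % align) % align) / sizeof (double);
    }

    int
    main (void)
    {
      unsigned n = 118;                     /* elements in the cactus vector */
      unsigned p16 = peel_count (0x8, 16);  /* 1 element  */
      unsigned p32 = peel_count (0x8, 32);  /* 3 elements */

      /* vf = 2 doubles for avx-128, 4 doubles for avx-256.  */
      printf ("avx-128: prologue %u, vector %u, epilogue %u\n",
              p16, (n - p16) / 2 * 2, (n - p16) % 2);
      printf ("avx-256: prologue %u, vector %u, epilogue %u\n",
              p32, (n - p32) / 4 * 4, (n - p32) % 4);
      return 0;
    }

This prints "avx-128: prologue 1, vector 116, epilogue 1" and "avx-256:
prologue 3, vector 112, epilogue 3", matching the counts described above.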
As for the alignment issue, we probably want a target hook the vectorizer
can use to query the desired runtime alignment for a given vector type. At
the moment things are hard-wired to vector size / alignment, and it is
probably more involved to disentangle things here to support 128-bit dynamic
alignment for 256-bit vectors. With masked load/store support we could also
implement the whole prologue loop as a single vector iteration (if there is
no reduction involved).

The register pressure issue can be resolved using the new vectorizer cost
model infrastructure, where the target has the chance to look at the whole
vectorized set of instructions.

As for generally disabling 256-bit vector support in the vectorizer for
explicit -mavx and generic tuning, you know my opinion (even -mprefer-avx128
is a kludge). Instead of

  /* If AVX is enabled then try vectorizing with both 256bit and 128bit
     vectors.  */
  static unsigned int
  ix86_autovectorize_vector_sizes (void)
  {
    return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
  }

it would be better to do what the option name suggests - only change the
preferred (first tried) vector size to 128 bits, but do not disallow 256-bit
vectorization when 128-bit vectorization is not possible / profitable
(ix86_preferred_simd_mode already seems to be wired that way). Disabling
256-bit vectorization should be done with a different option (-mavx128, to
enable only the 128-bit subset?). A small standalone model of this follows
at the end of this message.

Richard.

> Thanks
> Harsha
>
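To illustrate the semantics suggested above, here is a small standalone model
(not gcc source, and not a tested patch): -mprefer-avx128 only reorders which
vector size the vectorizer tries first, while 256-bit vectors stay in the
candidate set as a fallback.

    #include <stdbool.h>
    #include <stdio.h>

    /* Candidate vector sizes in bytes, in the order the vectorizer should
       try them.  Returns the number of candidates written to 'out'.  */
    static int
    candidate_vector_sizes (bool have_avx, bool prefer_avx128, int out[2])
    {
      if (!have_avx)
        return 0;
      if (prefer_avx128)
        {
          out[0] = 16;  /* preferred: try 128-bit first ...    */
          out[1] = 32;  /* ... but keep 256-bit as a fallback. */
        }
      else
        {
          out[0] = 32;
          out[1] = 16;
        }
      return 2;
    }

    int
    main (void)
    {
      int sizes[2];
      int n = candidate_vector_sizes (true, true, sizes);
      for (int i = 0; i < n; i++)
        printf ("try %d-bit vectors\n", sizes[i] * 8);
      return 0;
    }

Under this model, disabling 256-bit vectorization outright would be a
separate switch (the hypothetical -mavx128 mentioned above) rather than a
side effect of -mprefer-avx128.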