>>> We would like to propose changing AVX generic mode tuning to
>>> generate 128-bit AVX instead of 256-bit AVX.
>>
>> You indicate a 3% reduction on bulldozer with avx256.
>> How does avx128 compare to -mno-avx -msse4.2?
>> Will the next AMD generation have a useable avx256?
>>
>> I'm not keen on the idea of generic mode being tuned for a single
>> processor revision that maybe shouldn't actually be using avx at all.
> Btw, it looks like the data is massively skewed by 436.cactusADM. What are
> the overall numbers if you disregard cactus? It's also for sure the case
> that the vectorizer cost model has not been touched for avx256 vs. avx128
> vs. sse, so a more sensible approach would be to look at differentiating
> things there to improve the cactus numbers.
>
> Harsha, did you investigate why avx256 is such a loss for cactus or why it
> is so much of a win for SB?

I know this thread went quiet from our end for a while, but we (AMD) would
really like to re-open this discussion. So here goes.

We did investigate why cactus is slower in avx-256 mode than in avx-128 mode
on AMD processors. Using "-Ofast" (with the appropriate flags to generate
avx-128 or avx-256 code) and running with the reference data set, we observe
the following runtimes on Bulldozer:

                                Runtime   %Diff AVX-256 versus AVX-128
  AVX128                        616s
  AVX256 with store splitting   853s      38%

Scheduling and predictive commoning are turned off in the compiler for both
cases, so that the code generated for avx-128 and avx-256 is mostly
equivalent, i.e. avx-128 instructions on one side are simply replaced by
avx-256 instructions on the other.

Looking at the cactus source and the oprofile reports, the hottest loop nest
is a triply nested loop. The innermost loop of this nest has ~400 lines of
Fortran code and takes up 99% of the run time of the benchmark. Gcc
vectorizes the innermost loop in both the 128-bit and 256-bit cases. To
vectorize it, gcc generates a scalar prologue loop to align the relevant
vectors, followed by a packed SIMD avx loop, followed by a scalar epilogue
loop to handle what is left after a whole multiple of the vector factor has
been taken care of.

Here are the oprofile samples for the three components of the innermost
Fortran loop in the AVX-128 and AVX-256 cases:

                            AVX-128   AVX-256-ss   Gap in    Gap as % of
                            samples   samples      samples   total runtime
  Total                     153408    214448       61040     38%
  SIMD Vector loop          135653    183074       47421     30%
  SIMD Scalar Prolog loop     3817     10434        6617      4%
  SIMD Scalar Epilog loop     3471     10072        6601      4%

The avx-256 code spends 30% more time in the SIMD vector loop than the
avx-128 code. The generated code appears to be equivalent for this vector
loop in the 128-bit and 256-bit cases, i.e. avx-128 instructions on one side
are simply replaced by avx-256 instructions on the other. The instruction
mix and scheduling are the same, except for the spilling and reloading of
one variable. We know this gap exists because fewer physical registers are
available for renaming in the avx-256 code: our processor loses the upper
halves of the FP registers for renaming. The entire SIMD pipeline in the
processor is 128 bits wide, and we have no true native 256-bit datapath,
even in foreseeable future generations, unlike Sandybridge/Ivybridge.

The avx-256 code spends 8% more time in the SIMD scalar prologue and
epilogue than the avx-128 code. The generated code for these scalar loops is
exactly the same in the 128-bit and 256-bit cases, i.e. the exact same
instruction mix and scheduling. The gap actually comes from the number of
iterations gcc executes in these loops in the two cases.
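For reference, here is a minimal C sketch of the prologue/vector/epilogue
structure gcc generates here. It is illustrative only: the kernel and names
are made up (the real cactus loop body is ~400 lines of Fortran), and the
avx-128 case with a vector factor of 2 is assumed:

    #include <stdint.h>

    void add128(double *a, const double *b, long n)
    {
        long i = 0;

        /* Scalar prologue: peel iterations one at a time until a[i]
           reaches a 16-byte boundary. */
        for (; i < n && ((uintptr_t)&a[i] & 15) != 0; i++)
            a[i] += b[i];

        /* Packed SIMD loop: gcc emits each pair of these lanes as one
           128-bit vaddpd on xmm registers. */
        for (; i + 2 <= n; i += 2) {
            a[i]     += b[i];
            a[i + 1] += b[i + 1];
        }

        /* Scalar epilogue: whatever remains after a whole multiple of
           the vector factor. */
        for (; i < n; i++)
            a[i] += b[i];
    }

In avx-256 mode the shape is identical, but the alignment mask becomes 31
and the step becomes 4, so a misaligned start can cost up to 3 peeled
iterations at each end instead of 1.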
The iteration counts differ because gcc follows Sandybridge's recommendation
and aligns avx-256 vectors to a 32-byte boundary instead of a 16-byte
boundary, even on Bulldozer. The Sandybridge Software Optimization Guide
mentions that the optimal memory alignment of an AVX 256-bit vector stored
in memory is 32 bytes. The Bulldozer Software Optimization Guide says "Align
all packed floating-point data on 16-byte boundaries".

In the case of cactus, the relevant double-precision vector has 118 elements
that are stepped through in unit stride, and the first element handled in
the Fortran loop sits at an offset of 0x8 from a 32-byte boundary. In
avx-128 mode, gcc generates a scalar prologue loop that processes one
element at offset 0x8, then a vector loop that processes the next 116
elements starting at offset 0x10, i.e. a 16-byte aligned location, then a
scalar epilogue loop that processes the one element left over. In avx-256
mode, gcc generates a scalar prologue loop that processes the first 3
elements at offsets 0x8, 0x10 and 0x18, then a vector loop that processes
the next 112 elements starting at offset 0x20, i.e. a 32-byte aligned
location, then a scalar epilogue loop that processes the last 3 elements.
Since this Fortran loop is nested inside another doubly nested loop, the
overall impact of doing more work in the scalar loops and less in the
vector loop is a reduction in overall vectorization on Bulldozer.

Enabling avx-256 by default, and choosing alignments and vector factors that
are optimal for Intel Sandybridge/Ivybridge and sub-optimal for AMD
processors, is truly against the spirit of generic mode.
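To make the peeling arithmetic concrete, here is a small standalone sketch
that reproduces the iteration counts above. The 118-element count and the
0x8 starting offset are taken from the cactus numbers; the function itself
is purely illustrative:

    #include <stdio.h>

    /* Prologue/vector/epilogue element counts for a unit-stride loop over
       doubles, given the first element's byte offset from the target
       alignment, the element count, the vector factor (doubles per SIMD
       op) and the target alignment in bytes. */
    static void peel(const char *mode, long off, long n, long vf, long align)
    {
        long pro = ((align - off % align) % align) / 8;  /* 8 bytes/double */
        long vec = ((n - pro) / vf) * vf;
        long epi = n - pro - vec;
        printf("%s prologue %ld, vector %ld, epilogue %ld\n",
               mode, pro, vec, epi);
    }

    int main(void)
    {
        peel("avx-128:", 0x8, 118, 2, 16);   /* -> 1 + 116 + 1 */
        peel("avx-256:", 0x8, 118, 4, 32);   /* -> 3 + 112 + 3 */
        return 0;
    }

PS: I am not actively working on gcc right now, but other AMD gcc team
members will pitch in if more is needed.

Thanks,
Harsha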