>>> We would like to propose changing AVX generic mode tuning to 
>>> generate 128-bit AVX instead of 256-bit AVX.
>>
>> You indicate a 3% reduction on bulldozer with avx256.
>> How does avx128 compare to -mno-avx -msse4.2?
> Will the next AMD generation have a useable avx256?
>>
>> I'm not keen on the idea of generic mode being tuned for a single 
>> processor revision that maybe shouldn't actually be using avx at all.

>Btw, it looks like the data is massively skewed by 436.cactusADM.  What are 
>the overall numbers if you disregard cactus?  It's also for sure the case that 
>the vectorizer cost model has not been touched for avx256 vs. avx128 vs. sse, 
>so a more sensible approach would be to look at differentiating things there 
>to improve the cactus numbers. 

>Harsha, did you investigate why avx256 is such a loss for cactus or why it is 
>so much of a win for SB?

I know this thread has been dormant from our end for a while now, but we 
(AMD) would really like to re-open this discussion. So here goes.

We did investigate why cactus is slower in avx-256 mode than avx-128 mode on 
AMD processors.

Using the "-Ofast" flag (with appropriate additional flags to generate 
avx-128 or avx-256 code) and running with the reference data set, we observe 
the following runtimes on Bulldozer:
                                    Runtime
    AVX-128                            616s
    AVX-256 with store splitting       853s   (38% slower than AVX-128)
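For reference, a plausible pair of invocations for the two configurations. The 
thread does not spell out the exact command lines, so the -march choice and 
the source/output names below are assumptions; the flags themselves existed in 
gcc of that era:

```shell
# Hypothetical command lines -- the exact ones are not given in this thread.
# Scheduling and predictive commoning are disabled as described below.
gfortran -Ofast -march=bdver1 -fno-predictive-commoning \
         -fno-schedule-insns -fno-schedule-insns2 \
         -mprefer-avx128 cactusADM.f90 -o cactus_avx128

gfortran -Ofast -march=bdver1 -fno-predictive-commoning \
         -fno-schedule-insns -fno-schedule-insns2 \
         -mno-prefer-avx128 cactusADM.f90 -o cactus_avx256
```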

Scheduling and predictive commoning are turned off in the compiler for both 
cases, so that the code generated for the avx-128 and avx-256 cases is mostly 
equivalent, i.e. only avx-128 instructions on one side are replaced by avx-256 
instructions on the other.

Looking at the cactus source and oprofile reports, the hottest loop nest is a 
triple nested loop. The innermost loop of this nest has ~400 lines of Fortran 
code and takes up 99% of the run time of the benchmark. 

Gcc vectorizes the innermost loop in both the 128-bit and 256-bit cases. In 
order to vectorize it, gcc generates a SIMD scalar prologue loop to align the 
relevant vectors, followed by a SIMD packed avx loop, followed by a SIMD 
scalar epilogue loop to handle what is left after a whole multiple of the 
vector factor has been taken care of. 
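The three-part structure can be sketched in C. This is a simplified model, not 
gcc's actual output: the scalar prologue peels elements until the pointer 
reaches the target alignment, the packed body then covers whole vectors (each 
"vector" modeled here as a plain inner loop over VF elements), and the scalar 
epilogue mops up the remainder:

```c
#include <stddef.h>
#include <stdint.h>

#define VF    2   /* vector factor: 2 doubles per 128-bit vector */
#define ALIGN 16  /* target alignment in bytes */

/* Scale n doubles in place, mimicking the prologue/vector/epilogue
   shape gcc emits when vectorizing an unaligned unit-stride loop. */
static void scale(double *a, size_t n, double s)
{
    size_t i = 0;
    /* Scalar prologue: peel until the address is ALIGN-byte aligned. */
    while (i < n && ((uintptr_t)(a + i) % ALIGN) != 0) {
        a[i] *= s;
        i++;
    }
    /* Packed SIMD loop (each iteration stands for one vector op). */
    for (; i + VF <= n; i += VF)
        for (size_t j = 0; j < VF; j++)
            a[i + j] *= s;
    /* Scalar epilogue: elements left after the last full vector. */
    for (; i < n; i++)
        a[i] *= s;
}
```

With ALIGN raised to 32 and VF to 4, the same skeleton models the avx-256 
case; only the peel and remainder counts change.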

Here are the oprofile samples seen in the AVX-128 and AVX-256 case for the 
innermost Fortran loop's 3 components. 
Oprofile samples
                              AVX-128   AVX-256-ss   Gap in samples   Gap as % of
                                                                      total runtime
Total                          153408       214448            61040            38%
SIMD vector loop               135653       183074            47421            30%
SIMD scalar prologue loop        3817        10434             6617             4%
SIMD scalar epilogue loop        3471        10072             6601             4%

The avx-256 code spends 30% more time in the SIMD vector loop than the 
avx-128 code. The generated code appears to be equivalent for this vector 
loop in the 128-bit and 256-bit cases, i.e. only avx-128 instructions on one 
side are replaced by avx-256 instructions on the other. The instruction mix 
and scheduling are the same, except for the spilling and reloading of one 
variable.

We know this gap exists because fewer physical registers are available for 
renaming to the avx-256 code, since our processor loses the upper halves of 
the FP registers for renaming.
The processor's entire SIMD pipeline is 128 bits wide; we have no native 
256-bit datapath, even in foreseeable future generations, unlike 
Sandybridge/Ivybridge.

The avx-256 code spends 8% more time in the SIMD scalar prologue and epilogue 
than the avx-128 code. The generated code is exactly the same for these 
scalar loops in the 128-bit and 256-bit cases, i.e. the exact same 
instruction mix and scheduling. The gap actually comes from the number of 
iterations gcc executes in these loops in the two cases.  

This is because gcc follows Sandybridge's recommendation and aligns avx-256 
vectors to a 32-byte boundary instead of a 16-byte boundary, even on 
Bulldozer. 
The Sandybridge Software Optimization Guide states that the optimal memory 
alignment of an AVX 256-bit vector stored in memory is 32 bytes. 
The Bulldozer Software Optimization Guide says "Align all packed 
floating-point data on 16-byte boundaries".

In the case of cactus, the relevant double vector has 118 elements that are 
stepped through in unit stride, and the first element handled in the Fortran 
loop sits 8 bytes past a 16-byte boundary (an address of the form 0x...8). 
In avx-128 mode, gcc generates a scalar prologue loop that processes the one 
element at offset 0x8, then a vector loop that processes the next 116 
elements starting at offset 0x10, i.e. a 16-byte-aligned location, then a 
scalar epilogue loop that processes the one element left.
In avx-256 mode, gcc generates a scalar prologue loop that processes the 
first 3 elements at offsets 0x8, 0x10 and 0x18, then a vector loop that 
processes the next 112 elements starting at offset 0x20, i.e. a 32-byte 
aligned location, then a scalar epilogue loop that processes the last 3 
elements.
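The 1/116/1 versus 3/112/3 split follows from simple arithmetic on the 
element offset and the alignment target. A small sketch (the function name 
and struct are mine, not gcc's; the inputs are taken from the cactus example 
above):

```c
#include <stddef.h>

/* How a vectorized loop over n doubles splits into scalar prologue,
   packed vector, and scalar epilogue iterations, given the byte offset
   `off` of the first element from an aligned base, the target alignment
   `align` in bytes, and the vector factor `vf` in elements. */
typedef struct { size_t prologue, vector, epilogue; } split_t;

static split_t split_iters(size_t n, size_t off, size_t align, size_t vf)
{
    split_t s;
    size_t elt = sizeof(double);                    /* 8 bytes */
    /* Elements to peel before the pointer reaches `align`. */
    s.prologue = ((align - off % align) % align) / elt;
    if (s.prologue > n)
        s.prologue = n;
    size_t rest = n - s.prologue;
    s.vector   = rest - rest % vf;                  /* whole vectors only */
    s.epilogue = rest % vf;
    return s;
}
```

For n = 118 and off = 0x8, a 16-byte target with vf = 2 (avx-128) gives 
1/116/1, while a 32-byte target with vf = 4 (avx-256) gives 3/112/3, so 
six elements instead of two run in scalar code per trip through the loop.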

Since this Fortran loop is nested inside two outer loops, the overall impact 
of doing more work in the scalar loops and less in the vector loop is a 
reduction in overall vectorization benefit on Bulldozer.

Enabling avx-256 and choosing alignments and vector factors that are optimal 
for Intel Sandybridge/Ivybridge but sub-optimal for AMD processors as the 
default truly goes against the spirit of generic mode here.

PS: I am not actively working on gcc right now, but other AMD gcc team members 
will pitch in if more is needed.

Thanks
Harsha
