(I had sent this mail to gcc-help a week ago. I'm not sure all GCC developers are 
subscribed to gcc-help, so I am re-sending it to the GCC development mailing list.)
Hi,
This looks like a missed vectorization opportunity for one of the Fortran hot 
loops in cactusADM (a CPU2006 benchmark) when compiled with "-mcpu=cortex-a57 
-Ofast".
Interestingly, the 'generic' model (plain "-Ofast" or "-O3", without the -mcpu 
option) does vectorize this hot loop, and a good runtime performance improvement 
is seen on a native AArch64 platform.

I don't have a small reproducible testcase, hence I am quoting the cactusADM 
benchmark here.
The hot loop is in Bench_StaggeredLeapfrog2() in the StaggeredLeapfrog2.F file.
For cortex-a57, the vectorization report clearly states that scalar_cost < 
vector_cost / vectorization_factor, hence the loop was not vectorized.
For the generic case, due to the un-tuned vector cost model, scalar_cost > 
vector_cost / vectorization_factor (since scalar_cost = vector_cost), so the 
loop did get vectorized.
   << Output of the generic vectorized case >>
StaggeredLeapfrog2.fppized.f.130t.vect:StaggeredLeapfrog2.fppized.f:362:0: 
note: LOOP VECTORIZED
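
To make the comparison concrete, below is a minimal sketch of the per-iteration 
profitability check the loop vectorizer effectively performs. This is not the 
actual GCC code (the real logic is in vect_estimate_min_profitable_iters() and 
accounts for prologue/epilogue and outside-loop costs); the cost numbers are 
purely illustrative.

/* Simplified sketch of the vectorizer's profitability decision;
   costs and VF below are made-up, not the real cortex-a57 numbers.  */
#include <stdio.h>

int
main (void)
{
  int scalar_iter_cost = 6;   /* cost of one scalar iteration (illustrative) */
  int vector_iter_cost = 14;  /* cost of one vector iteration (illustrative) */
  int vf = 2;                 /* vectorization factor */

  /* Vectorization is considered profitable only when one vector iteration
     is cheaper than the VF scalar iterations it replaces, i.e. when
     vector_iter_cost / vf < scalar_iter_cost.  */
  if (vector_iter_cost < scalar_iter_cost * vf)
    printf ("vectorize (vector %d < scalar %d * VF %d)\n",
            vector_iter_cost, scalar_iter_cost, vf);
  else
    printf ("keep scalar (vector %d >= scalar %d * VF %d)\n",
            vector_iter_cost, scalar_iter_cost, vf);
  return 0;
}

With these made-up numbers the check fails (14 >= 6 * 2), which mirrors the 
cortex-a57 case above; for the generic model, where scalar and vector statement 
costs are equal, the same check succeeds.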
I have also played around with the cortexa57_vector_cost table (esp. 
scalar_stmt_cost, vector_stmt_cost, vec_unaligned_cost, etc.), which 
influences the vectorization decision in this case.
The cortexa57_vector_cost table directly maps to the costs given in the 
"Cortex(R)-A57 Software Optimisation Guide".
But it looks like there is further scope for tuning the cortex-a57 vector 
costs to vectorize such cases.
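
For readers not familiar with the backend, the per-CPU table is roughly of the 
following shape (the real definition is cortexa57_vector_cost in 
gcc/config/aarch64/aarch64.c); the struct, field set and values below are an 
illustrative sketch, not the actual GCC definitions.

/* Rough sketch of a per-CPU vector cost table; names and values are
   illustrative only -- see cortexa57_vector_cost in the AArch64 backend
   for the real thing.  */
struct example_vector_cost
{
  int scalar_stmt_cost;      /* cost of a scalar statement */
  int vec_stmt_cost;         /* cost of a vector statement */
  int vec_align_load_cost;   /* aligned vector load */
  int vec_unalign_load_cost; /* unaligned vector load */
  int vec_store_cost;        /* vector store */
};

static const struct example_vector_cost cortexa57_like_cost =
{
  1, /* scalar_stmt_cost */
  3, /* vec_stmt_cost */
  5, /* vec_align_load_cost */
  5, /* vec_unalign_load_cost */
  2  /* vec_store_cost */
};

Raising the scalar costs or lowering the vector costs in such a table tips the 
profitability check above towards vectorizing this loop, which is what my 
experiments did.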
Any comments on this missed opportunity?
Regards,
Saravanan
PS: I am not pasting the hot loop here, as there could be a license issue with 
quoting SPEC CPU2006 sources.






