While tuning the cost model for ThunderX 1, I noticed that I had set the cost of unaligned loads/stores too high. This patch reduces those costs; with it, the loops in linpack are vectorized and perform best.
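A minimal sketch of the kind of loop affected, assuming the linpack hot loop is the usual daxpy-style kernel. The function name daxpy_sketch is hypothetical and is not taken from the benchmark source. The alignment of dx and dy is unknown at compile time, so the vectorizer has to cost the loop with unaligned vector loads and stores, which is where vec_unalign_load_cost and vec_unalign_store_cost come in.

```c
#include <stddef.h>

/* daxpy-style kernel: dy[i] = dy[i] + da * dx[i].
   Hypothetical stand-in for the linpack inner loop.  Nothing here
   tells the compiler how dx and dy are aligned, so a vectorized
   version needs unaligned vector accesses (or peeling/versioning).  */
void
daxpy_sketch (size_t n, double da, const double *dx, double *dy)
{
  for (size_t i = 0; i < n; i++)
    dy[i] += da * dx[i];
}
```

One way to check whether such a loop gets vectorized is to compile it with something like "gcc -O3 -mcpu=thunderx -fopt-info-vec" and look at the vectorizer report; the exact outcome depends on the compiler revision.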
Also tested on SPEC CPU 2006 to make sure we don't regress the vectorizer there.
Committed as obvious after a bootstrap/test on aarch64-linux-gnu with no regressions.

Thanks,
Andrew Pinski

	* config/aarch64/aarch64.c (thunderx_vector_cost): Decrease cost of
	vec_unalign_load_cost and vec_unalign_store_cost.

Index: config/aarch64/aarch64.c
===================================================================
--- config/aarch64/aarch64.c	(revision 250592)
+++ config/aarch64/aarch64.c	(working copy)
@@ -363,8 +363,8 @@ static const struct cpu_vector_cost thun
   2, /* vec_to_scalar_cost */
   2, /* scalar_to_vec_cost */
   3, /* vec_align_load_cost */
-  10, /* vec_unalign_load_cost */
-  10, /* vec_unalign_store_cost */
+  5, /* vec_unalign_load_cost */
+  5, /* vec_unalign_store_cost */
   1, /* vec_store_cost */
   3, /* cond_taken_branch_cost */
   3 /* cond_not_taken_branch_cost */