On Tue, Dec 15, 2015 at 11:35:45AM +0000, Wilco Dijkstra wrote:
>
> Add support for vector permute cost since various permutes can expand into a
> complex sequence of instructions. This fixes major performance regressions due
> to recent changes in the SLP vectorizer (which now vectorizes more aggressively
> and emits many complex permutes).
>
> Set the cost to > 1 for all microarchitectures so that the number of permutes
> is usually zero and regressions disappear. An example of the kind of code that
> might be emitted for VEC_PERM_EXPR {0, 3} where registers happen to be in the
> wrong order:
>
> adrp x4, .LC16
> ldr q5, [x4, #:lo12:.LC16]
> eor v1.16b, v1.16b, v0.16b
> eor v0.16b, v1.16b, v0.16b
> eor v1.16b, v1.16b, v0.16b
> tbl v0.16b, {v0.16b - v1.16b}, v5.16b
>
> Regress passes. This fixes regressions that were introduced recently, so OK
> for commit?
>
>
> ChangeLog:
> 2015-12-15 Wilco Dijkstra <[email protected]>
>
> * gcc/config/aarch64/aarch64.c (generic_vector_cost):
> Set vec_permute_cost.
> (cortexa57_vector_cost): Likewise.
> (exynosm1_vector_cost): Likewise.
> (xgene1_vector_cost): Likewise.
> (aarch64_builtin_vectorization_cost): Use vec_permute_cost.
> * gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
> Add vec_permute_cost entry.
>
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 10754c88c0973d8ef3c847195b727f02b193bbd8..2584f16d345b3d015d577dd28c08a73ee3e0b0fb 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -314,6 +314,7 @@ static const struct cpu_vector_cost generic_vector_cost =
> 1, /* scalar_load_cost */
> 1, /* scalar_store_cost */
> 1, /* vec_stmt_cost */
> + 2, /* vec_permute_cost */
> 1, /* vec_to_scalar_cost */
> 1, /* scalar_to_vec_cost */
> 1, /* vec_align_load_cost */

Is there any reasoning behind making this 2? Do we now miss vectorization
for some of the cheaper permutes? Looking across the cost models and pipeline
descriptions that have been contributed to GCC, I think this is a sensible
change to the generic costs, but I just want to check that there was some
reasoning/experimentation behind the number you picked.

As permutes can have such wildly different costs, this all seems like a good
candidate for a future, much more involved hook from the vectorizer to the
back-end, specifying the candidate permute operation and requesting a cost
(part of the bigger gimple costs framework?).

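
To make the hook idea a little more concrete, here is a rough sketch in C of
what a per-permute cost query might look like. None of this is existing GCC
code: the struct, the enum and the functions (all prefixed sketch_) are
hypothetical, and the classification rule is an assumption for illustration
only. It treats an identity selector as free, assumes a run of consecutive
indices can be done in one instruction (in the spirit of EXT), and charges
everything else the flat vec_permute_cost.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost table, loosely modelled on the cpu_vector_cost fields
   touched by the patch.  This is not the real GCC structure.  */
struct sketch_vector_cost
{
  int vec_stmt_cost;     /* Cost of an ordinary vector statement.  */
  int vec_permute_cost;  /* Cost charged for a general permute.  */
};

/* Hypothetical classification of a two-input permute selector.  */
enum sketch_permute_kind
{
  SKETCH_PERM_IDENTITY,     /* Selects the first input unchanged.  */
  SKETCH_PERM_CONSECUTIVE,  /* Indices form one consecutive run.  */
  SKETCH_PERM_GENERAL       /* Anything else.  */
};

/* Classify SEL, a selector of NELT indices into the 2 * NELT elements of
   the two concatenated input vectors.  */
static enum sketch_permute_kind
sketch_classify_permute (const unsigned *sel, size_t nelt)
{
  size_t i;
  int identity = 1;
  int consecutive = 1;

  assert (sel != NULL && nelt > 0);

  for (i = 0; i < nelt; i++)
    {
      if (sel[i] != i)
        identity = 0;
      /* A consecutive run must also stay inside the two inputs.  */
      if (sel[i] != sel[0] + i || sel[i] >= 2 * nelt)
        consecutive = 0;
    }

  if (identity)
    return SKETCH_PERM_IDENTITY;
  if (consecutive)
    return SKETCH_PERM_CONSECUTIVE;
  return SKETCH_PERM_GENERAL;
}

/* Return an assumed cost for the permute described by SEL.  A real hook
   would ask the back-end which instruction sequence it would emit.  */
int
sketch_permute_cost (const struct sketch_vector_cost *costs,
                     const unsigned *sel, size_t nelt)
{
  switch (sketch_classify_permute (sel, nelt))
    {
    case SKETCH_PERM_IDENTITY:
      return 0;
    case SKETCH_PERM_CONSECUTIVE:
      return costs->vec_stmt_cost;
    default:
      return costs->vec_permute_cost;
    }
}
```

With vec_stmt_cost = 1 and vec_permute_cost = 2, this sketch would charge 2
for the {0, 3} selector from the example above, 1 for {1, 2}, and nothing for
{0, 1}. Whether those distinctions match any real core is exactly the kind of
thing the back-end would need to answer.
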
Thanks,
James