Jennifer Schmitz <[email protected]> writes:
> [...]
> Looking at the diff of the vect dumps (below is a section of the diff for
> strided_store_2.c), it seemed odd that vec_to_scalar operations cost 0 now,
> instead of the previous cost of 2:
>
> +strided_store_1.c:38:151: note: === vectorizable_operation ===
> +strided_store_1.c:38:151: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = _7;
> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 + 1.0e+0, type of def: internal
> +strided_store_1.c:38:151: note: Vectorizing an unaligned access.
> +Applying pattern match.pd:236, generic-match-9.cc:4128
> +Applying pattern match.pd:5285, generic-match-10.cc:4234
> +strided_store_1.c:38:151: note: vect_model_store_cost: inside_cost = 12, prologue_cost = 0 .
> *_2 1 times unaligned_load (misalign -1) costs 1 in body
> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
> -_7 1 times vec_to_scalar costs 2 in body
> +<unknown> 1 times vector_load costs 1 in prologue
> +_7 1 times vec_to_scalar costs 0 in body
> _7 1 times scalar_store costs 1 in body
> -_7 1 times vec_to_scalar costs 2 in body
> +_7 1 times vec_to_scalar costs 0 in body
> _7 1 times scalar_store costs 1 in body
> -_7 1 times vec_to_scalar costs 2 in body
> +_7 1 times vec_to_scalar costs 0 in body
> _7 1 times scalar_store costs 1 in body
> -_7 1 times vec_to_scalar costs 2 in body
> +_7 1 times vec_to_scalar costs 0 in body
> _7 1 times scalar_store costs 1 in body
>
> Although the aarch64_use_new_vector_costs_p flag was used in multiple places
> in aarch64.cc, the location that causes this behavior is this one:
> unsigned
> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>                                      stmt_vec_info stmt_info, slp_tree,
>                                      tree vectype, int misalign,
>                                      vect_cost_model_location where)
> {
>   [...]
>   /* Try to get a more accurate cost by looking at STMT_INFO instead
>      of just looking at KIND.  */
> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> +  if (stmt_info)
>     {
>       /* If we scalarize a strided store, the vectorizer costs one
>          vec_to_scalar for each element.  However, we can store the first
>          element using an FP store without a separate extract step.  */
>       if (vect_is_store_elt_extraction (kind, stmt_info))
>         count -= 1;
>
>       stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>                                                       stmt_info, stmt_cost);
>
>       if (vectype && m_vec_flags)
>         stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>                                                         stmt_info, vectype,
>                                                         where, stmt_cost);
>     }
>   [...]
>   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
> }
>
> Previously, for mtune=generic, this function returned a cost of 2 for a
> vec_to_scalar operation in the vect body. Now "if (stmt_info)" is entered and
> "if (vect_is_store_elt_extraction (kind, stmt_info))" evaluates to true,
> which sets the count to 0 and leads to a return value of 0.
At the time the code was written, a scalarised store would be costed
using one vec_to_scalar call into the backend, with the count parameter
set to the number of elements being stored. The "count -= 1" was
supposed to lop off the leading element extraction, since we can store
lane 0 as a normal FP store.
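To make the arithmetic concrete, here is a toy model of that old flow
(not GCC code; nstores = 4 and the per-extract cost of 2 are just
illustrative values, picked to line up with the dump above):

/* Toy model of the old flow (illustrative only, not GCC code).  */
#include <cstdio>

int main ()
{
  const int nstores = 4;       /* elements in the scalarised store group */
  const int extract_cost = 2;  /* illustrative per-extract cost */

  int count = nstores;         /* one vec_to_scalar call for the whole group */
  count -= 1;                  /* lane 0 is stored with a plain FP store */
  printf ("vec_to_scalar cost for the group: %d\n",
          count * extract_cost);  /* 3 * 2 = 6 */
  return 0;
}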
The target-independent costing was later reworked so that it costs
each operation individually:
  for (i = 0; i < nstores; i++)
    {
      if (costing_p)
        {
          /* Only need vector extracting when there are more
             than one stores.  */
          if (nstores > 1)
            inside_cost
              += record_stmt_cost (cost_vec, 1, vec_to_scalar,
                                   stmt_info, 0, vect_body);
          /* Take a single lane vector type store as scalar
             store to avoid ICE like 110776.  */
          if (VECTOR_TYPE_P (ltype)
              && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
            n_adjacent_stores++;
          else
            inside_cost
              += record_stmt_cost (cost_vec, 1, scalar_store,
                                   stmt_info, 0, vect_body);
          continue;
        }
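With this loop, each vec_to_scalar reaches add_stmt_cost as a separate
call with count == 1, so the unconditional "count -= 1" zeroes every
extraction rather than just the first one, which is the 2 -> 0 change
visible in the dump.  Extending the toy model above (again illustrative
only, not GCC code):

/* Toy model of the reworked flow (illustrative only, not GCC code).  */
#include <cstdio>

int main ()
{
  const int nstores = 4;       /* elements in the scalarised store group */
  const int extract_cost = 2;  /* illustrative per-extract cost */

  int total = 0;
  for (int i = 0; i < nstores; ++i)
    {
      int count = 1;           /* one vec_to_scalar call per element */
      count -= 1;              /* adjustment meant only for the first lane */
      total += count * extract_cost;
    }
  printf ("vec_to_scalar cost for the group: %d\n", total);  /* 0 instead of 6 */
  return 0;
}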
Unfortunately, there's no easy way of telling whether a particular call
is part of a group, and if so, which member of the group it is.
I suppose we could give up on the attempt to be (somewhat) accurate
and just disable the optimisation. Or we could restrict it to count > 1,
since it might still be useful for gathers and scatters.
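The count > 1 version would just be a guard on the existing adjustment,
something like this (untested sketch against the hunk quoted above):

-      if (vect_is_store_elt_extraction (kind, stmt_info))
+      if (count > 1 && vect_is_store_elt_extraction (kind, stmt_info))
         count -= 1;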
cc:ing Richard in case he has any thoughts.
Thanks,
Richard