Jennifer Schmitz <jschm...@nvidia.com> writes:
> [...]
> Looking at the diff of the vect dumps (below is a section of the diff for
> strided_store_2.c), it seemed odd that vec_to_scalar operations cost 0 now,
> instead of the previous cost of 2:
>
> +strided_store_1.c:38:151: note: === vectorizable_operation ===
> +strided_store_1.c:38:151: note: vect_model_simple_cost: inside_cost = 1,
> prologue_cost = 0 .
> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = _7;
> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 + 1.0e+0,
> type of def: internal
> +strided_store_1.c:38:151: note: Vectorizing an unaligned access.
> +Applying pattern match.pd:236, generic-match-9.cc:4128
> +Applying pattern match.pd:5285, generic-match-10.cc:4234
> +strided_store_1.c:38:151: note: vect_model_store_cost: inside_cost = 12,
> prologue_cost = 0 .
> *_2 1 times unaligned_load (misalign -1) costs 1 in body
> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
> -_7 1 times vec_to_scalar costs 2 in body
> +<unknown> 1 times vector_load costs 1 in prologue
> +_7 1 times vec_to_scalar costs 0 in body
> _7 1 times scalar_store costs 1 in body
> -_7 1 times vec_to_scalar costs 2 in body
> +_7 1 times vec_to_scalar costs 0 in body
> _7 1 times scalar_store costs 1 in body
> -_7 1 times vec_to_scalar costs 2 in body
> +_7 1 times vec_to_scalar costs 0 in body
> _7 1 times scalar_store costs 1 in body
> -_7 1 times vec_to_scalar costs 2 in body
> +_7 1 times vec_to_scalar costs 0 in body
> _7 1 times scalar_store costs 1 in body
>
> Although the aarch64_use_new_vector_costs_p flag was used in multiple places
> in aarch64.cc, the location that causes this behavior is this one:
>
> unsigned
> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>                                      stmt_vec_info stmt_info, slp_tree,
>                                      tree vectype, int misalign,
>                                      vect_cost_model_location where)
> {
>   [...]
>   /* Try to get a more accurate cost by looking at STMT_INFO instead
>      of just looking at KIND.  */
> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> +  if (stmt_info)
>     {
>       /* If we scalarize a strided store, the vectorizer costs one
>          vec_to_scalar for each element.  However, we can store the first
>          element using an FP store without a separate extract step.  */
>       if (vect_is_store_elt_extraction (kind, stmt_info))
>         count -= 1;
>
>       stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>                                                       stmt_info, stmt_cost);
>
>       if (vectype && m_vec_flags)
>         stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>                                                         stmt_info, vectype,
>                                                         where, stmt_cost);
>     }
>   [...]
>   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
> }
>
> Previously, for mtune=generic, this function returned a cost of 2 for a
> vec_to_scalar operation in the vect body.  Now "if (stmt_info)" is entered
> and "if (vect_is_store_elt_extraction (kind, stmt_info))" evaluates to true,
> which sets the count to 0 and leads to a return value of 0.
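For context, vect_is_store_elt_extraction is (going from memory, so the
exact body in aarch64.cc may differ slightly) just a small predicate that
treats any vec_to_scalar on a write data reference as the extraction of an
element in preparation for a store, roughly:

static bool
vect_is_store_elt_extraction (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
{
  /* A vec_to_scalar whose scalar statement writes to memory is extracting
     a vector element so that the element can be stored.  */
  return (kind == vec_to_scalar
          && STMT_VINFO_DATA_REF (stmt_info)
          && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
}

So it fires for every per-element vec_to_scalar in the dump above, and with
count == 1 for each call, the "count -= 1" zeroes the cost.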
At the time the code was written, a scalarised store would be costed using
one vec_to_scalar call into the backend, with the count parameter set to the
number of elements being stored.  The "count -= 1" was supposed to lop off
the leading element extraction, since we can store lane 0 as a normal FP
store.

The target-independent costing was later reworked so that it costs each
operation individually:

      for (i = 0; i < nstores; i++)
        {
          if (costing_p)
            {
              /* Only need vector extracting when there are more
                 than one stores.  */
              if (nstores > 1)
                inside_cost
                  += record_stmt_cost (cost_vec, 1, vec_to_scalar,
                                       stmt_info, 0, vect_body);
              /* Take a single lane vector type store as scalar
                 store to avoid ICE like 110776.  */
              if (VECTOR_TYPE_P (ltype)
                  && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
                n_adjacent_stores++;
              else
                inside_cost
                  += record_stmt_cost (cost_vec, 1, scalar_store,
                                       stmt_info, 0, vect_body);
              continue;
            }

Unfortunately, there's no easy way of telling whether a particular call is
part of a group, and if so, which member of the group it is.

I suppose we could give up on the attempt to be (somewhat) accurate and just
disable the optimisation.  Or we could restrict it to count > 1, since it
might still be useful for gathers and scatters.

cc:ing Richard in case he has any thoughts.

Thanks,
Richard
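P.S. For concreteness, the count > 1 restriction would just be a guard on the
existing check -- an untested sketch against the hunk quoted above, nothing
more:

      /* If we scalarize a strided store, the vectorizer costs one
         vec_to_scalar for each element.  However, we can store the first
         element using an FP store without a separate extract step.  Only
         apply this when a single call covers a whole group of elements
         (count > 1), e.g. for gathers and scatters; the reworked store
         costing now makes one call per element with count == 1.  */
      if (count > 1 && vect_is_store_elt_extraction (kind, stmt_info))
        count -= 1;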