"yangyang (ET)" <[email protected]> writes:
> Hi,
>
> I have revised the patch based on your suggestions, and the following are
> some points that I think need to be mentioned in the mail.
>
>> > @@ -502,17 +504,18 @@ simd_clone_adjust_return_type (struct
>> cgraph_node *node)
>> > veclen = node->simdclone->vecsize_int;
>> > else
>> > veclen = node->simdclone->vecsize_float;
>> > - veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (t));
>> > - if (veclen > node->simdclone->simdlen)
>> > + veclen = exact_div (veclen, GET_MODE_BITSIZE (SCALAR_TYPE_MODE (t)));
>> > + if (known_gt (veclen, node->simdclone->simdlen))
>> > veclen = node->simdclone->simdlen;
>>
>> Although again it probably doesn't make a difference in practice, the
>> known/maybe situation is similar here. When comparing:
>>
>> - an SVE vector of 2 + 2 * (VQ - 1) doubles and
>> - an Advanced SIMD vector of 2 doubles
>>
>> the Advanced SIMD version is conceptually ordered <= the SVE one, in the
>> sense that the SVE vector always contains a whole number of Advanced SIMD
>> vectors whereas the Advanced SIMD vector might not contain a whole number
>> of SVE vectors.
>>
>> In other words, the number of lanes in the Advanced SIMD vector is known_le
>> the number of lanes in the SVE vector, and the number of lanes in the SVE
>> vector is known_ge and maybe_gt (but not known_gt) the number of lanes in
>> the Advanced SIMD vector. So for these kinds of comparison, known_gt can
>> look a bit unexpected, even if (as here) it's probably fine in practice.
>>
>> There's currently a hard-coded assumption in this code and in the vectoriser
>> that both constant-length software vectors and constant-length hardware
>> vectors are a power of 2 in size. This means that the > above is effectively
>> testing whether veclen contains a whole number of
>> node->simdclone->simdlens, not just whether the veclen is bigger.
>>
>> So when adding the initial autovec support for SVE, we generally rewrote
>> these kinds of test to use multiple_p instead of comparisons. I think
>> that makes the intent more obvious (but others might disagree :-)).
>>
>> That was a long-winded way of saying that I suggest we use:
>>
>> if (multiple_p (veclen, node->simdclone->simdlen))
>> veclen = node->simdclone->simdlen;
>>
>> instead.
>
> As far as I understand, multiple_p (a, b) is equal to known_ge (a, b) due to
> the hard-coded assumption, and !multiple_p (b, a) is equal to known_gt (a, b).
Yes to the first part, but !multiple_p (b, a) is equal to known_ge (a, b)
(== known_le (b, a)) rather than known_gt (a, b). E.g.:
multiple_p (2, 2 + 2X)
is false (for X>0) but:
known_gt (2 + 2X, 2)
is also false (for X==0). In both cases, the conditions have an
implicit “for all X”.
> So it seems better to use !multiple_p (node->simdclone->simdlen, veclen)
> here, while using multiple_p (veclen, node->simdclone->simdlen) is OK since
> doing the assignment for the eq situation won't change anything. Anyway, I
> used multiple_p (veclen, node->simdclone->simdlen) in the revised patch.
!multiple_p is fine with me too FWIW, if you prefer that.
>> > }
>> > TREE_TYPE (TREE_TYPE (fndecl)) = t;
>> > if (!node->definition)
>> > @@ -663,8 +669,9 @@ simd_clone_adjust_argument_types (struct
>> cgraph_node *node)
>> > veclen = sc->vecsize_int;
>> > else
>> > veclen = sc->vecsize_float;
>> > - veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type));
>> > - if (veclen > sc->simdlen)
>> > + veclen = exact_div (veclen,
>> > + GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type)));
>> > + if (known_gt (veclen, sc->simdlen))
>> > veclen = sc->simdlen;
>>
>> Similarly here.
>
> Similarly here, I used multiple_p (veclen, sc->simdlen) in the revised patch
> as well.
Here too !multiple_p sounds OK if you prefer that.
>> > @@ -981,8 +992,11 @@ ipa_simd_modify_function_body (struct
>> cgraph_node *node,
>> > iter, NULL_TREE, NULL_TREE);
>> > adjustments->register_replacement
>> > (&(*adjustments->m_adj_params)[j], r);
>> >
>> > - if (simd_clone_subparts (vectype) < node->simdclone->simdlen)
>> > - j += node->simdclone->simdlen / simd_clone_subparts (vectype) - 1;
>> > + if (known_lt (simd_clone_subparts (vectype),
>> > + node->simdclone->simdlen))
>> > + {
>> > + j += vector_unroll_factor (node->simdclone->simdlen,
>> > + simd_clone_subparts (vectype)) - 1;
>> > + }
>>
>> And similarly here:
>>
>> if (multiple_p (node->simdclone->simdlen,
>>                 simd_clone_subparts (vectype)))
>>
>
> Similarly here, I used multiple_p (node->simdclone->simdlen,
> simd_clone_subparts (vectype)) in the revised patch as well.
To be clear, I think multiple_p is still natural here though.
>> > /* Wrapper for poly_int arguments to target macros, so that if a target
>> > doesn't need polynomial-sized modes, its header file can continue to
>> > treat the argument as a normal constant. This should go away once
>> > @@ -3902,12 +3902,12 @@ vectorizable_simd_clone_call (vec_info *vinfo,
>> stmt_vec_info stmt_info,
>> > n = n->simdclone->next_clone)
>> > {
>> > unsigned int this_badness = 0;
>> > - if (n->simdclone->simdlen > vf
>> > + if (known_gt (n->simdclone->simdlen, vf)
>>
>> I think !constant_multiple_p rather than known_gt here. Until we add
>> support for vectorising simd clones with partial vectors, a whole number of
>> simdlens must fit within VF. So maybe:
>>
>> unsigned int num_calls;
>> if (!constant_multiple_p (vf, n->simdclone->simdlen, &num_calls)
>>
>> and then…
>>
>> > || n->simdclone->nargs != nargs)
>> > continue;
>> > - if (n->simdclone->simdlen < vf)
>> > - this_badness += (exact_log2 (vf)
>> > - - exact_log2 (n->simdclone->simdlen)) * 1024;
>> > + if (known_lt (n->simdclone->simdlen, vf))
>> > + this_badness += exact_log2
>> > + (vector_unroll_factor (vf, n->simdclone->simdlen)) * 1024;
>>
>> …use num_calls instead of vector_unroll_factor here.
>
> The revised patch writes this code as:
>
> unsigned int num_calls;
> if (!constant_multiple_p (vf, n->simdclone->simdlen, &num_calls)
> || n->simdclone->nargs != nargs)
> continue;
> if (!multiple_p (n->simdclone->simdlen, vf))
> this_badness += exact_log2 (num_calls) * 1024;
I think this should be:
if (num_calls != 1)
this_badness += exact_log2 (num_calls) * 1024;
(Or we could do the addition unconditionally, both work.)
>> > if (n->simdclone->inbranch)
>> > this_badness += 2048;
>> > int target_badness = targetm.simd_clone.usable (n);
>> > @@ -3988,19 +3988,19 @@ vectorizable_simd_clone_call (vec_info *vinfo,
>> stmt_vec_info stmt_info,
>> > arginfo[i].vectype = get_vectype_for_scalar_type (vinfo, arg_type,
>> > slp_node);
>> > if (arginfo[i].vectype == NULL
>> > - || (simd_clone_subparts (arginfo[i].vectype)
>> > - > bestn->simdclone->simdlen))
>> > + || (known_gt (simd_clone_subparts (arginfo[i].vectype),
>> > + bestn->simdclone->simdlen)))
>>
>> Here too I think we want constant_multiple_p:
>>
>> || !constant_multiple_p (bestn->simdclone->simdlen,
>> simd_clone_subparts (arginfo[i].vectype))
>>
>
> Use multiple_p here since the multiple is not needed.
True, but in the case of vectorisation, we need to generate a constant
number of copies at compile time. If we don't enforce a constant
multiple, we might end up trying to use an Advanced SIMD routine
when vectorising for SVE.
The reason we don't have a two-argument version of constant_multiple_p
is that so far nothing has needed it (at least AFAIK). There's no
conceptual problem with adding it though. I'm happy to do that if
it would help.
>> > if (ratype)
>> > new_temp = create_tmp_var (ratype);
>> > else if (useless_type_conversion_p (vectype, rtype))
>> > @@ -4264,12 +4266,13 @@ vectorizable_simd_clone_call (vec_info *vinfo,
>> > stmt_vec_info stmt_info,
>> >
>> > if (vec_dest)
>> > {
>> > - if (simd_clone_subparts (vectype) < nunits)
>> > + if (known_lt (simd_clone_subparts (vectype), nunits))
>>
>> multiple_p (nunits, …) here too. It would also be possible to do:
>>
>> if (constant_multiple_p (nunits, simd_clone_subparts (vectype), &k)
>>
>> and avoid the separate vector_unroll_factor.
>
> Use !multiple_p (simd_clone_subparts (vectype), nunits) here, since it seems
> that the following code should not be executed in the eq situation.
Ah, yes, sorry for missing that.
>>
>> > {
>> > unsigned int k, l;
>> > poly_uint64 prec = GET_MODE_BITSIZE (TYPE_MODE (vectype));
>> > poly_uint64 bytes = GET_MODE_SIZE (TYPE_MODE (vectype));
>> > - k = nunits / simd_clone_subparts (vectype);
>> > + k = vector_unroll_factor (nunits,
>> > + simd_clone_subparts (vectype));
>> > gcc_assert ((k & (k - 1)) == 0);
>> > for (l = 0; l < k; l++)
>> > {
>> > @@ -4295,16 +4298,18 @@ vectorizable_simd_clone_call (vec_info *vinfo,
>> stmt_vec_info stmt_info,
>> > vect_clobber_variable (vinfo, stmt_info, gsi, new_temp);
>> > continue;
>> > }
>> > - else if (simd_clone_subparts (vectype) > nunits)
>> > + else if (known_gt (simd_clone_subparts (vectype), nunits))
>>
>> multiple_p here too.
>>
>
> Similarly, use !multiple_p (nunits, simd_clone_subparts (vectype)) here.
OK.
> @@ -614,8 +618,9 @@ simd_clone_adjust_argument_types (struct cgraph_node
> *node)
> veclen = sc->vecsize_int;
> else
> veclen = sc->vecsize_float;
> - veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (parm_type));
> - if (veclen > sc->simdlen)
> + veclen = exact_div (veclen,
> + GET_MODE_BITSIZE (SCALAR_TYPE_MODE (parm_type)));
> + if (known_gt (veclen, sc->simdlen))
> veclen = sc->simdlen;
> adj.op = IPA_PARAM_OP_NEW;
> adj.param_prefix_index = IPA_PARAM_PREFIX_SIMD;
I think this should use multiple_p (or !multiple_p) too.
> @@ -981,8 +992,11 @@ ipa_simd_modify_function_body (struct cgraph_node *node,
> iter, NULL_TREE, NULL_TREE);
> adjustments->register_replacement (&(*adjustments->m_adj_params)[j],
> r);
> - if (simd_clone_subparts (vectype) < node->simdclone->simdlen)
> - j += node->simdclone->simdlen / simd_clone_subparts (vectype) - 1;
> + if (multiple_p (node->simdclone->simdlen,
> + simd_clone_subparts (vectype)))
> + {
> + j += vector_unroll_factor (node->simdclone->simdlen,
> + simd_clone_subparts (vectype)) - 1;
> + }
> }
>
> tree name;
Minor nit, but: the original formatting without braces is preferred
for single statements, even if the statement spans multiple lines.
Otherwise the patch looks good, thanks.
Richard