Hi, I have revised the patch based on your suggestions. The following are some points that I think need to be mentioned in the mail.
> > @@ -502,17 +504,18 @@ simd_clone_adjust_return_type (struct cgraph_node *node)
> > veclen = node->simdclone->vecsize_int;
> > else
> > veclen = node->simdclone->vecsize_float;
> > - veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (t));
> > - if (veclen > node->simdclone->simdlen)
> > + veclen = exact_div (veclen, GET_MODE_BITSIZE (SCALAR_TYPE_MODE (t)));
> > + if (known_gt (veclen, node->simdclone->simdlen))
> > veclen = node->simdclone->simdlen;
>
> Although again it probably doesn't make a difference in practice, the
> known/maybe situation is similar here. When comparing:
>
> - an SVE vector of 2 + 2 * (VQ - 1) doubles and
> - an Advanced SIMD vector of 2 doubles
>
> the Advanced SIMD version is conceptually ordered <= the SVE one, in the
> sense that the SVE vector always contains a whole number of Advanced SIMD
> vectors whereas the Advanced SIMD vector might not contain a whole number
> of SVE vectors.
>
> In other words, the number of lanes in the Advanced SIMD vector is known_le
> the number of lanes in the SVE vector, and the number of lanes in the SVE
> vector is known_ge and maybe_gt (but not known_gt) the number of lanes in
> the Advanced SIMD vector. So for these kinds of comparison, known_gt can
> look a bit unexpected, even if (as here) it's probably fine in practice.
>
> There's currently a hard-coded assumption in this code and in the vectoriser
> that both constant-length software vectors and constant-length hardware
> vectors are a power of 2 in size. This means that the > above is effectively
> testing whether veclen contains a whole number of
> node->simdclone->simdlens, not just whether the veclen is bigger.
>
> So when adding the initial autovec support for SVE, we generally rewrote these
> kinds of test to use multiple_p instead of comparisons. I think that makes
> the
> intent more obvious (but others might disagree :-)).
>
> That was a long-winded way of saying that I suggest we use:
>
> if (multiple_p (veclen, node->simdclone->simdlen))
> veclen = node->simdclone->simdlen;
>
> instead.
As far as I understand, multiple_p (a, b) is equivalent to known_ge (a, b) given the hard-coded power-of-2 assumption, and !multiple_p (b, a) is equivalent to known_gt (a, b). So it might be slightly better to use !multiple_p (node->simdclone->simdlen, veclen) here, although multiple_p (veclen, node->simdclone->simdlen) is also fine, since doing the assignment in the equal case changes nothing. Anyway, I used multiple_p (veclen, node->simdclone->simdlen) in the revised patch.
> > }
> > TREE_TYPE (TREE_TYPE (fndecl)) = t;
> > if (!node->definition)
> > @@ -663,8 +669,9 @@ simd_clone_adjust_argument_types (struct cgraph_node *node)
> > veclen = sc->vecsize_int;
> > else
> > veclen = sc->vecsize_float;
> > - veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type));
> > - if (veclen > sc->simdlen)
> > + veclen = exact_div (veclen,
> > + GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type)));
> > + if (known_gt (veclen, sc->simdlen))
> > veclen = sc->simdlen;
>
> Similarly here.
Similarly, the revised patch uses multiple_p (veclen, sc->simdlen) here.
> > @@ -981,8 +992,11 @@ ipa_simd_modify_function_body (struct cgraph_node *node,
> > iter, NULL_TREE, NULL_TREE);
> > adjustments->register_replacement
> > (&(*adjustments->m_adj_params)[j], r);
> >
> > - if (simd_clone_subparts (vectype) < node->simdclone->simdlen)
> > - j += node->simdclone->simdlen / simd_clone_subparts (vectype) - 1;
> > + if (known_lt (simd_clone_subparts (vectype),
> > + node->simdclone->simdlen))
> > + {
> > + j += vector_unroll_factor (node->simdclone->simdlen,
> > + simd_clone_subparts (vectype)) - 1;
> > + }
>
> And similarly here:
>
> if (multiple_p (node->simdclone->simdlen, simd_clone_subparts (vectype)))
>
Similarly, the revised patch uses multiple_p (node->simdclone->simdlen, simd_clone_subparts (vectype)) here.
> > /* Wrapper for poly_int arguments to target macros, so that if a target
> > doesn't need polynomial-sized modes, its header file can continue to
> > treat the argument as a normal constant. This should go away once
> > @@ -3902,12 +3902,12 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
> > n = n->simdclone->next_clone)
> > {
> > unsigned int this_badness = 0;
> > - if (n->simdclone->simdlen > vf
> > + if (known_gt (n->simdclone->simdlen, vf)
>
> I think !constant_multiple_p rather than known_gt here. Until we add
> support for vectorising simd clones with partial vectors, a whole number of
> simdlens must fit within VF. So maybe:
>
> unsigned int num_calls;
> if (!constant_multiple_p (n->simdclone->simdlen, vf, &num_calls)
>
> and then…
>
> > || n->simdclone->nargs != nargs)
> > continue;
> > - if (n->simdclone->simdlen < vf)
> > - this_badness += (exact_log2 (vf)
> > - - exact_log2 (n->simdclone->simdlen)) * 1024;
> > + if (known_lt (n->simdclone->simdlen, vf))
> > + this_badness += exact_log2
> > + (vector_unroll_factor (vf, n->simdclone->simdlen)) * 1024;
>
> …use num_calls instead of vector_unroll_factor here.
The revised patch rewrites this code as:

	      unsigned int num_calls;
	      if (!constant_multiple_p (vf, n->simdclone->simdlen, &num_calls)
		  || n->simdclone->nargs != nargs)
		continue;
	      if (!multiple_p (n->simdclone->simdlen, vf))
		this_badness += exact_log2 (num_calls) * 1024;
>
> > if (n->simdclone->inbranch)
> > this_badness += 2048;
> > int target_badness = targetm.simd_clone.usable (n);
> > @@ -3988,19 +3988,19 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
> > arginfo[i].vectype = get_vectype_for_scalar_type (vinfo, arg_type,
> > slp_node);
> > if (arginfo[i].vectype == NULL
> > - || (simd_clone_subparts (arginfo[i].vectype)
> > - > bestn->simdclone->simdlen))
> > + || (known_gt (simd_clone_subparts (arginfo[i].vectype),
> > + bestn->simdclone->simdlen)))
>
> Here too I think we want constant_multiple_p:
>
> || !constant_multiple_p (bestn->simdclone->simdlen,
> simd_clone_subparts (arginfo[i].vectype))
>
I used multiple_p here instead, since the multiple itself is not needed.
> > if (ratype)
> > new_temp = create_tmp_var (ratype);
> > else if (useless_type_conversion_p (vectype, rtype))
> > @@ -4264,12 +4266,13 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
> >
> > if (vec_dest)
> > {
> > - if (simd_clone_subparts (vectype) < nunits)
> > + if (known_lt (simd_clone_subparts (vectype), nunits))
>
> multiple_p (nunits, …) here too. It would also be possible to do:
>
> if (constant_multiple_p (nunits, simd_clone_subparts (vectype), &k)
>
> and avoid the separate vector_unroll_factor.
I used !multiple_p (simd_clone_subparts (vectype), nunits) here, since the code that follows should not be executed in the equal case.
>
> > {
> > unsigned int k, l;
> > poly_uint64 prec = GET_MODE_BITSIZE (TYPE_MODE (vectype));
> > poly_uint64 bytes = GET_MODE_SIZE (TYPE_MODE (vectype));
> > - k = nunits / simd_clone_subparts (vectype);
> > + k = vector_unroll_factor (nunits,
> > + simd_clone_subparts (vectype));
> > gcc_assert ((k & (k - 1)) == 0);
> > for (l = 0; l < k; l++)
> > {
> > @@ -4295,16 +4298,18 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
> > vect_clobber_variable (vinfo, stmt_info, gsi, new_temp);
> > continue;
> > }
> > - else if (simd_clone_subparts (vectype) > nunits)
> > + else if (known_gt (simd_clone_subparts (vectype), nunits))
>
> multiple_p here too.
>
Similarly, I used !multiple_p (nunits, simd_clone_subparts (vectype)) here.
Bootstrapped and tested on both aarch64 and x86 Linux platforms; no new regressions were observed.
Any suggestions?
Thanks,
Yang Yang
PR96342-part1-v2.patch