Hi, I have revised the patch based on your suggestions. Below are some points that I think are worth mentioning in this mail.
> > @@ -502,17 +504,18 @@ simd_clone_adjust_return_type (struct cgraph_node *node)
> >        veclen = node->simdclone->vecsize_int;
> >      else
> >        veclen = node->simdclone->vecsize_float;
> > -  veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (t));
> > -  if (veclen > node->simdclone->simdlen)
> > +  veclen = exact_div (veclen, GET_MODE_BITSIZE (SCALAR_TYPE_MODE (t)));
> > +  if (known_gt (veclen, node->simdclone->simdlen))
> >      veclen = node->simdclone->simdlen;
>
> Although again it probably doesn't make a difference in practice, the
> known/maybe situation is similar here.  When comparing:
>
> - an SVE vector of 2 + 2 * (VQ - 1) doubles and
> - an Advanced SIMD vector of 2 doubles
>
> the Advanced SIMD version is conceptually ordered <= the SVE one, in the
> sense that the SVE vector always contains a whole number of Advanced SIMD
> vectors whereas the Advanced SIMD vector might not contain a whole number
> of SVE vectors.
>
> In other words, the number of lanes in the Advanced SIMD vector is known_le
> the number of lanes in the SVE vector, and the number of lanes in the SVE
> vector is known_ge and maybe_gt (but not known_gt) the number of lanes in
> the Advanced SIMD vector.  So for these kinds of comparison, known_gt can
> look a bit unexpected, even if (as here) it's probably fine in practice.
>
> There's currently a hard-coded assumption in this code and in the
> vectoriser that both constant-length software vectors and constant-length
> hardware vectors are a power of 2 in size.  This means that the above is
> effectively testing whether veclen contains a whole number of
> node->simdclone->simdlens, not just whether the veclen is bigger.
>
> So when adding the initial autovec support for SVE, we generally rewrote
> these kinds of test to use multiple_p instead of comparisons.  I think
> that makes the intent more obvious (but others might disagree :-)).
> That was a long-winded way of saying that I suggest we use:
>
>   if (multiple_p (veclen, node->simdclone->simdlen))
>     veclen = node->simdclone->simdlen;
>
> instead.

As far as I understand, multiple_p (a, b) is equal to known_ge (a, b) due to
the hard-coded assumption, and !multiple_p (b, a) is equal to known_gt (a, b).
So it seems better to use !multiple_p (node->simdclone->simdlen, veclen) here,
although using multiple_p (veclen, node->simdclone->simdlen) is also OK, since
doing the assignment in the equal case doesn't change anything.  Anyway, I
used multiple_p (veclen, node->simdclone->simdlen) in the revised patch.

> >    }
> >    TREE_TYPE (TREE_TYPE (fndecl)) = t;
> >    if (!node->definition)
> > @@ -663,8 +669,9 @@ simd_clone_adjust_argument_types (struct cgraph_node *node)
> >        veclen = sc->vecsize_int;
> >      else
> >        veclen = sc->vecsize_float;
> > -  veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type));
> > -  if (veclen > sc->simdlen)
> > +  veclen = exact_div (veclen,
> > +              GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type)));
> > +  if (known_gt (veclen, sc->simdlen))
> >      veclen = sc->simdlen;
>
> Similarly here.

Similarly here, I used multiple_p (veclen, sc->simdlen) in the revised patch.

> > @@ -981,8 +992,11 @@ ipa_simd_modify_function_body (struct cgraph_node *node,
> >                     iter, NULL_TREE, NULL_TREE);
> >        adjustments->register_replacement
> >          (&(*adjustments->m_adj_params)[j], r);
> >
> > -      if (simd_clone_subparts (vectype) < node->simdclone->simdlen)
> > -        j += node->simdclone->simdlen / simd_clone_subparts (vectype) - 1;
> > +      if (known_lt (simd_clone_subparts (vectype), node->simdclone->simdlen))
> > +        {
> > +          j += vector_unroll_factor (node->simdclone->simdlen,
> > +                                     simd_clone_subparts (vectype)) - 1;
> > +        }
>
> And similarly here:
>
>   if (multiple_p (node->simdclone->simdlen, simd_clone_subparts (vectype)))

Similarly here, I used multiple_p (node->simdclone->simdlen,
simd_clone_subparts (vectype)) in the revised patch.
> >  /* Wrapper for poly_int arguments to target macros, so that if a target
> >     doesn't need polynomial-sized modes, its header file can continue to
> >     treat the argument as a normal constant.  This should go away once
> > @@ -3902,12 +3902,12 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
> >         n = n->simdclone->next_clone)
> >      {
> >        unsigned int this_badness = 0;
> > -      if (n->simdclone->simdlen > vf
> > +      if (known_gt (n->simdclone->simdlen, vf)
>
> I think !constant_multiple_p rather than known_gt here.  Until we add
> support for vectorising simd clones with partial vectors, a whole number
> of simdlens must fit within VF.  So maybe:
>
>   unsigned int num_calls;
>   if (!constant_multiple_p (n->simdclone->simdlen, vf, &num_calls)
>
> and then…
>
> >           || n->simdclone->nargs != nargs)
> >         continue;
> > -      if (n->simdclone->simdlen < vf)
> > -        this_badness += (exact_log2 (vf)
> > -                         - exact_log2 (n->simdclone->simdlen)) * 1024;
> > +      if (known_lt (n->simdclone->simdlen, vf))
> > +        this_badness += exact_log2
> > +          (vector_unroll_factor (vf, n->simdclone->simdlen)) * 1024;
>
> …use num_calls instead of vector_unroll_factor here.
The revised patch writes this code as:

  unsigned int num_calls;
  if (!constant_multiple_p (vf, n->simdclone->simdlen, &num_calls)
      || n->simdclone->nargs != nargs)
    continue;
  if (!multiple_p (n->simdclone->simdlen, vf))
    this_badness += exact_log2 (num_calls) * 1024;

> >        if (n->simdclone->inbranch)
> >          this_badness += 2048;
> >        int target_badness = targetm.simd_clone.usable (n);
> > @@ -3988,19 +3988,19 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
> >            arginfo[i].vectype = get_vectype_for_scalar_type (vinfo, arg_type,
> >                                                              slp_node);
> >            if (arginfo[i].vectype == NULL
> > -              || (simd_clone_subparts (arginfo[i].vectype)
> > -                  > bestn->simdclone->simdlen))
> > +              || (known_gt (simd_clone_subparts (arginfo[i].vectype),
> > +                            bestn->simdclone->simdlen)))
>
> Here too I think we want constant_multiple_p:
>
>   || !constant_multiple_p (bestn->simdclone->simdlen,
>                            simd_clone_subparts (arginfo[i].vectype))

I used multiple_p here, since the multiple is not needed.

> >        if (ratype)
> >          new_temp = create_tmp_var (ratype);
> >        else if (useless_type_conversion_p (vectype, rtype))
> > @@ -4264,12 +4266,13 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
> >
> >        if (vec_dest)
> >          {
> > -          if (simd_clone_subparts (vectype) < nunits)
> > +          if (known_lt (simd_clone_subparts (vectype), nunits))
>
> multiple_p (nunits, …) here too.  It would also be possible to do:
>
>   if (constant_multiple_p (nunits, simd_clone_subparts (vectype), &k)
>
> and avoid the separate vector_unroll_factor.

I used !multiple_p (simd_clone_subparts (vectype), nunits) here, since it
seems that the following code should not be executed in the equal case.
> >            {
> >              unsigned int k, l;
> >              poly_uint64 prec = GET_MODE_BITSIZE (TYPE_MODE (vectype));
> >              poly_uint64 bytes = GET_MODE_SIZE (TYPE_MODE (vectype));
> > -            k = nunits / simd_clone_subparts (vectype);
> > +            k = vector_unroll_factor (nunits,
> > +                                      simd_clone_subparts (vectype));
> >              gcc_assert ((k & (k - 1)) == 0);
> >              for (l = 0; l < k; l++)
> >                {
> > @@ -4295,16 +4298,18 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
> >            vect_clobber_variable (vinfo, stmt_info, gsi, new_temp);
> >            continue;
> >          }
> > -      else if (simd_clone_subparts (vectype) > nunits)
> > +      else if (known_gt (simd_clone_subparts (vectype), nunits))
>
> multiple_p here too.

Similarly, I used !multiple_p (nunits, simd_clone_subparts (vectype)) here.

Bootstrapped and tested on both aarch64 and x86 Linux platforms, with no new
regressions witnessed.  Any suggestions?

Thanks,
Yang Yang
PR96342-part1-v2.patch