Hi!

On Thu, Oct 31, 2013 at 10:04:45PM -0500, Aldy Hernandez wrote:
> Hello gentlemen.  I'm CCing all of you, because each of you can
> provide valuable feedback to various parts of the compiler which I
> touch.  I have sprinkled love notes with your names throughout the
> post :).

Thanks for working on this.

>       * Makefile.in (omp-low.o): Depend on PRETTY_PRINT_H and IPA_PROP_H.

You aren't changing Makefile.in anymore ;).

> +/* Given a NODE, return a compatible SIMD clone returning `vectype'.
> +   If none found, NULL is returned.  */
> +
> +struct cgraph_node *
> +get_simd_clone (struct cgraph_node *node, tree vectype)
> +{
> +  if (!node->has_simd_clones)
> +    return NULL;
> +
> +  /* FIXME: What to do with linear/uniform arguments.  */
> +
> +  /* FIXME: Nasty kludge until we figure out where to put the clone
> +     list-- perhaps, next_sibling_clone/prev_sibling_clone in
> +     cgraph_node ??.  */
> +  struct cgraph_node *t;
> +  FOR_EACH_FUNCTION (t)
> +    if (t->simdclone_of == node
> +     /* No inbranch vectorization for now.  */
> +     && !t->simdclone->inbranch
> +     && types_compatible_p (TREE_TYPE (TREE_TYPE (t->symbol.decl)),
> +                            vectype))
> +      break;
> +  return t;
> +}

You definitely need some quick way to find the simd clones, and you really
can't do this here anyway, because you have to check all arguments, return
type might be missing etc., so it needs to be done by vectorizable_call
itself.

> +  /* If this is a SIMD clone, this points to the SIMD specific
> +     information for it.  */
> +  struct simd_clone *simdclone;
> +
> +  /* If this is a SIMD clone, this points to the original scalar
> +     function.  */
> +  struct cgraph_node *simdclone_of;

Can't you put this into the simd_clone structure, in order not to waste
memory for functions which don't have simd clones?  So, you'd use
t->simdclone && t->simdclone->clone_of == node or similar (if you need it at
all, I guess better is to add a struct cgraph_node *simd_clones;
and put the prev/next pointers in struct simd_clone).

Let me start with two testcases:

test1.c:
int array[1000];

#pragma omp declare simd simdlen(4) notinbranch
#pragma omp declare simd simdlen(4) notinbranch uniform(b)
#pragma omp declare simd simdlen(8) notinbranch
#pragma omp declare simd simdlen(8) notinbranch uniform(b)
__attribute__((noinline)) int
foo (int a, int b)
{
  if (a == b)
    return 5;
  else
    return 6;
}

void
bar ()
{
  int i;
  for (i = 0; i < 1000; ++i)
    array[i] = foo (i, 123);
  for (i = 0; i < 1000; ++i)
    array[i] = foo (i, array[i]);
}

test2.c:
int array[1000];

#pragma omp declare simd simdlen(4) notinbranch aligned(a:16) uniform(a) 
linear(b)
#pragma omp declare simd simdlen(4) notinbranch aligned(a:32) uniform(a) 
linear(b)
#pragma omp declare simd simdlen(8) notinbranch aligned(a:16) uniform(a) 
linear(b)
#pragma omp declare simd simdlen(8) notinbranch aligned(a:32) uniform(a) 
linear(b)
__attribute__((noinline)) void
foo (int *a, int b, int c)
{
  a[b] = c;
}

void
bar ()
{
  int i;
  for (i = 0; i < 1000; ++i)
    foo (array, i, i * array[i]);
}

On test1.c -O3 -fopenmp {,-mavx,-mavx2}, you can see:
test1.c: In function ‘foo.simdclone.0’:
test1.c:8:1: note: The ABI for passing parameters with 32-byte alignment has 
changed in GCC 4.6
 foo (int a, int b)
 ^
test1.c:8:1: warning: AVX vector argument without AVX enabled changes the ABI 
[enabled by default]
and the manglings are without -mavx{,2}
_ZGVxN8vu_foo
_ZGVxN8vv_foo
_ZGVxN4vu_foo
_ZGVxN4vv_foo
while with it _ZGVy* (surprisingly not Y).  As discussed earlier, we don't
want to decide which clones to create based on compiler options, we probably
want to create (unless told by Cilk+ processor clauses otherwise) entry
points for all the ABIs, just try to create the ones not matching compiler
options as small as possible, and use target attribute for those too
and say for _ZGVxN8v?_foo we need to pass the vector arguments in two
vector(4) int parameters rather than one vector(8) as it is done now (that
is why the above warnings and notes are printed).  But you know this
already... ;).

The second testcase currently ICEs I guess during simd cloning, just wanted
to make it clear that while simd clones without any arguments probably don't
make any sense (other than const, but those really should be hoisted out of
the loop much earlier), simd clones with no return value make sense.

> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1688,6 +1688,16 @@ tree
>  vectorizable_function (gimple call, tree vectype_out, tree vectype_in)
>  {
>    tree fndecl = gimple_call_fndecl (call);
> +  struct cgraph_node *node = cgraph_get_node (fndecl);
> +
> +  if (node->has_simd_clones)
> +    {
> +      struct cgraph_node *clone = get_simd_clone (node, vectype_out);
> +      if (clone)
> +     return clone->symbol.decl;
> +      /* Fall through in case we ever add support for
> +      non-built-ins.  */
> +    }

I think it is a bad idea to do this in vectorizable_function, as I said
earlier keying this on the result type won't work for functions returning
void, and more importantly, you really need access to detailed info about
all the arguments for finding out if you have a suitable clone, and
as test1.c shows, also for selection of the best of the clones if more than
one is suitable.  In test1.c, in the first loop the uniform variants are
better over the ones without uniform second argument, though if the uniform
ones would be missing, then you could use even the ones with vv arguments,
because you can just pass a vector constant (or broadcast scalar element
into the vector).  Similarly, in test2.c, you want to check
get_pointer_alignment of the pointer, and if it is >= 32, you can use
the clones with aligned(:32) (and with (:16), but (:32) is supposedly
better), if it is >= 16, you can only use the clones with (:16), if it is <
16, you can't use anything.

> @@ -1758,10 +1768,12 @@ vectorizable_call (gimple stmt, gimple_stmt_iterator 
> *gsi, gimple *vec_stmt,
>    vectype_in = NULL_TREE;
>    nargs = gimple_call_num_args (stmt);
>  
> -  /* Bail out if the function has more than three arguments, we do not have
> -     interesting builtin functions to vectorize with more than two arguments
> -     except for fma.  No arguments is also not good.  */
> -  if (nargs == 0 || nargs > 3)
> +  /* Bail out if the function has more than three arguments.  We do
> +     not have interesting builtin functions to vectorize with more
> +     than two arguments except for fma (unless we have SIMD clones).
> +     No arguments is also not good.  */
> +  struct cgraph_node *node = cgraph_get_node (gimple_call_fndecl (stmt));
> +  if (nargs == 0 || (!node->has_simd_clones && nargs > 3))
>      return false;

In the end, I think for the vectorization of elemental function calls
it might be best to write a new function, vectorizable_simd_clone_call
or similar, because if we add all the support for into vectorizable_call,
it might be unmaintainable.  Normal vectorizable_calls rely on all the input
arguments being of the same type, the return type must be not be void,
but the return type doesn't have to be the same as argument types (which
means NARROW, NONE or WIDEN kind of expansion).
The simd clones can have arbitrary argument types, so the concept of
narrowing/widening etc. doesn't work well in that case.  So I'd probably go
for starting with copy of vectorizable_call, call it at the same spots as
vectorizable_call (after it), then remove the same argument restrictions and
start updating it to do the analysis of suitable clones and some priority
mechanism on which simd clones are best (give bonus points for uniform
arguments if the argument is uniform, linear if it is linear, for highest
alignment, notinbranch/inbranch, simdlen, etc.).

So, e.g.
      /* We can only handle calls with arguments of the same type.  */
      if (rhs_type
          && !types_compatible_p (rhs_type, TREE_TYPE (op)))
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "argument types differ.\n");
          return false;
        }
shouldn't be done for the simd clones, vectype_in should go, vectype_out
should be renamed to vectype and set to first argument's? type if result is
void or unused.

      if (!vect_is_simple_use_1 (op, stmt, loop_vinfo, bb_vinfo,
                                 &def_stmt, &def, &dt[i], &opvectype))
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "use not simple.\n");
          return false;
        }

(note, right now in the patch this is buffer overflow for nargs > 3, for
simd clones you want to probably dynamically allocate the dt array,
supposedly also remember opvectype for each argument).
If opvectype above is NULL, then that argument can be passed to
uniform arguments (or of course broadcasted into vector to vector
arguments).  To find out if for opvectype != NULL you can pass it to
linear argument, supposedly you could call simple_iv, and compare the
iv.step computed by it if it returned true with the linear step.
To check alignment, supposedly if it is uniform (opvectype == NULL),
you'd just call get_pointer_alignment if it is a pointer, otherwise
maybe give up for now?  I mean, if it is e.g. linear, it would be harder,
you'd need to know if peeling for alignment will be needed and only if
known not to be needed you could simple_iv it and see if base has right
get_pointer_alignment and step multiplied by vectorization factor
keeps the alignment right.  Though, if you need to insert more than one
call of the simdclone, you'd need to verify that it is right for all the
calls.  Right now we have no way to express conditional calls in the
ifconverted IL, so right now we'll punt on all conditional calls, but
at least the vectorizer can fall back (with much lower priority) to
inbranch elementals if notinbranch doesn't exist (just pass all ones mask).

        Jakub

Reply via email to