I've been spending this week playing around with various representations
of the v{ld,st}{1,2,3,4}{,_lane} operations.  I agree with Ira that the
best representation would be to use built-in functions.

One concern in the original discussion was that the optimisers might
move the original MEM_REFs away from the call.  I don't think that's
a problem though.  For loads, we can simply treat the whole of the
accessed memory as an array, and pass the array by value.  If we do that,
then the call would just look like:

   __builtin_load_lanes (MEM_REF[(elem[N] *)ADDR])

(where, despite the C notation, the MEM_REF accesses the whole of elem[N]).
It is of course possible in principle for the tree optimisers to replace
this MEM_REF with another, equivalent, one, but that's OK semantically.
It isn't possible for the optimisers to replace it with something like
an SSA name, because arrays can't be stored in gimple registers.
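
For concreteness: in the vld3 case, loading three 2-element uint32 vectors
accesses 6 uint32_ts, so the call would be something like:

   __builtin_load_lanes (MEM_REF[(uint32_t[6] *)ADDR])

(The element type and count are just an example; they instantiate the
elem[N] above.)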

__builtin_load_lanes would then be used like this:

   combined_vectors = __builtin_load_lanes (...);
   vector1 = ...extract first vector from combined_vectors...
   vector2 = ...extract second vector from combined_vectors...
   ....

So combined_vectors only exists for load and extract operations.
The question then is: what type should it have?  (At this point I'm
just talking about types, not modes.)  The main possibilities seemed to be:

1. an integer type

     Pros
       * Gimple registers can store integers.

     Cons
       * As Julian points out, GCC doesn't really support integer types
         that are wider than 2 HOST_WIDE_INTs.  It would be good to
         remove that restriction, but it might be a lot of work, and it
         isn't something we'd want to take on as part of this project.

       * We're not really using the type as an integer.

       * The combination of the integer type and the __builtin_load_lanes
         array argument wouldn't be enough to determine the correct
         load operation.  __builtin_load_lanes would need something
         like a vector count (N => vldN) argument as well.

2. a combined vector type

     Pros
       * Gimple registers can store vectors.

     Cons
       * For vld3, this would mean creating vector types with a
         non-power-of-two number of elements.  GCC doesn't support
         those yet, and you get
         ICEs as soon as you try to use them.  (Remember that this is
         all about types, not modes.)

         It _might_ be interesting to implement this support, but as
         above, it would be a lot of work.  It also raises some semantic
         questions, such as: what is the alignment of the new vectors?
         Which leads to...

       * The alignment of the type would be strange.  E.g. suppose
         we're loading N*2 uint32_ts into N vectors of 2 elements each.
         The types and alignments would be:

           N=2 uint32x4_t, alignment 16
           N=3 uint32x6_t, alignment 8 (if we follow the convention for modes)
           N=4 uint32x8_t, alignment 32

         We don't need alignments greater than 8 in our intended use;
         16 and 32 are overkill.

       * We're not really using the type as a single vector,
         but as a collection of vectors.

       * The combination of the vector type and the __builtin_load_lanes
         array argument wouldn't be enough to determine the correct
         load operation.  __builtin_load_lanes would need something
         like a vector count (N => vldN) argument as well.

3. an array of vectors type

     Pros
       * No support for new GCC features (large integers or non-power-of-two
         vectors) is needed.

       * The alignment of the type would be taken from the alignment of the
         individual vectors, which is correct.

       * It accurately reflects how the loaded value is going to be used.

       * The type uniquely identifies the correct load operation,
         without need for additional arguments.  (This is minor.)

     Cons
       * Gimple registers can't store array values.

So I think the only disadvantage of using an array of vectors is that the
result can never be a gimple register.  But that isn't much of a disadvantage
really; the things we care about are the individual vectors, which can
of course be treated as gimple registers.  I think our tracking of memory
values is good enough for combined_vectors to be treated as such
(even though, with the back-end changes we talked about earlier,
they will actually be stored in RTL registers).
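
As an aside, building the array-of-vectors type needs nothing new in the
tree machinery.  A minimal sketch (not part of the patch below) of how the
return type for a vld3 of 2-element uint32 vectors might be built:

    /* "array 3 of vector 2 of uint32_t"; illustrative only.  */
    tree vec_type = build_vector_type (unsigned_intSI_type_node, 2);
    tree lanes_type = build_array_type (vec_type,
                                        build_index_type (size_int (2)));

build_array_type should take its alignment from the element type, matching
the alignment behaviour described under option 3 above.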

So how about the following functions?  (Forgive the pascally syntax.)

    __builtin_load_lanes (REF : array N*M of X)
      returns array N of vector M of X
      maps to vldN
      in practice, the result would be used in assignments of the form:
        vectorX = ARRAY_REF <result, X>

    __builtin_store_lanes (VECTORS : array N of vector M of X)
      returns array N*M of X
      maps to vstN
      in practice, the argument would be populated by assignments of the form:
        ARRAY_REF <VECTORS, X> = vectorX

    __builtin_load_lane (REF : array N of X,
                         VECTORS : array N of vector M of X,
                         LANE : integer)
      returns array N of vector M of X
      maps to vldN_lane

    __builtin_store_lane (VECTORS : array N of vector M of X,
                          LANE : integer)
      returns array N of X
      maps to vstN_lane

Note that each operation can be expanded independently.  The expansion
doesn't rely on preceding or following statements.
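
To make that concrete, a vst2 of two uint32x2_t vectors might end up as
something like this (the exact spelling of the accesses is illustrative):

    ARRAY_REF <combined_vectors, 0> = vector1;
    ARRAY_REF <combined_vectors, 1> = vector2;
    MEM_REF[(uint32_t[4] *)ADDR] = __builtin_store_lanes (combined_vectors);

and the whole of that sequence can be expanded without looking at any
other statement.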

I've hacked up the prototype below as a proof of concept.  It includes
changes to the C parser to allow these functions to be created in the
original source code.  This is throw-away code though; it would never
be submitted.

I've also included a simple test case and the output I get from it.
The output looks pretty good; there's not even the stray VMOV that
I saw with the intrinsics earlier in the week.

(Note that if you'd like to try this yourself, you'll need the patch
I posted on Monday as well.)

What do you think?  Obviously this discussion needs to move to gcc@ at
some point, but I wanted to make sure this was vaguely sane first.

Richard

Index: gcc/gcc/builtins.c
===================================================================
--- gcc.orig/gcc/builtins.c
+++ gcc/gcc/builtins.c
@@ -5775,6 +5775,113 @@ expand_builtin_lock_release (enum machin
   emit_move_insn (mem, val);
 }
 
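+/* Throw-away helpers: HACK_LANES works out the per-vector mode and the
+   number of vectors from an array type and uses LANE_CHECK to dispatch
+   to the matching gen_neon_v{ld,st}{2,3,4}* pattern.  */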
+#define LANE_CHECK(MODE, DIR, SUFFIX, ARGS)				\
+  if (vmode == MODE && nelems == 4)					\
+    emit_insn (gen_neon_v##DIR##4##SUFFIX ARGS);			\
+  else if (vmode == MODE && nelems == 3)				\
+    emit_insn (gen_neon_v##DIR##3##SUFFIX ARGS);			\
+  else if (vmode == MODE && nelems == 2)				\
+    emit_insn (gen_neon_v##DIR##2##SUFFIX ARGS)
+
+#define HACK_LANES(DIR, SUFFIX, TYPE, ARGS)				\
+  do									\
+    {									\
+      tree type;							\
+      enum machine_mode vmode;						\
+      unsigned HOST_WIDE_INT nelems;					\
+									\
+      type = (TYPE);							\
+      gcc_assert (TREE_CODE (type) == ARRAY_TYPE);			\
+      vmode = TYPE_MODE (TREE_TYPE (type));				\
+      nelems = int_size_in_bytes (type) / GET_MODE_SIZE (vmode);	\
+      LANE_CHECK (V4HImode, DIR, SUFFIX##4hi, ARGS);			\
+      else LANE_CHECK (V8HImode, DIR, SUFFIX##8hi, ARGS);		\
+      else LANE_CHECK (V2SImode, DIR, SUFFIX##2si, ARGS);		\
+      else LANE_CHECK (V4SImode, DIR, SUFFIX##4si, ARGS);		\
+      else gcc_unreachable ();						\
+    }									\
+  while (0)
+
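+/* Expand a call EXP to __builtin_load_lanes: force the address of the
+   array argument into a register and emit the vldN instruction selected
+   by the array-of-vectors return type, putting the result in TARGET.  */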
+static rtx
+expand_builtin_load_lanes (tree exp, rtx target)
+{
+  tree exp0;
+  rtx mem, addr;
+
+  exp0 = CALL_EXPR_ARG (exp, 0);
+  mem = expand_normal (exp0);
+  gcc_assert (MEM_P (mem));
+  addr = force_reg (Pmode, XEXP (mem, 0));
+  if (target == 0 || !REG_P (target))
+    target = gen_reg_rtx (TYPE_MODE (TREE_TYPE (exp)));
+  HACK_LANES (ld, v, TREE_TYPE (exp), (target, addr));
+  return target;
+}
+
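+/* Expand a call EXP to __builtin_load_lane: as above, but the second
+   argument supplies the current vector contents and the third the lane
+   number; emit the matching vldN_lane instruction.  */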
+static rtx
+expand_builtin_load_lane (tree exp, rtx target)
+{
+  tree exp0, exp1, exp2;
+  rtx mem, addr, curval, lane;
+
+  exp0 = CALL_EXPR_ARG (exp, 0);
+  mem = expand_normal (exp0);
+  gcc_assert (MEM_P (mem));
+  addr = force_reg (Pmode, XEXP (mem, 0));
+
+  exp1 = CALL_EXPR_ARG (exp, 1);
+  curval = expand_normal (exp1);
+  curval = force_reg (TYPE_MODE (TREE_TYPE (exp1)), curval);
+
+  exp2 = CALL_EXPR_ARG (exp, 2);
+  lane = GEN_INT (tree_low_cst (exp2, 1));
+
+  if (target == 0 || !REG_P (target))
+    target = gen_reg_rtx (TYPE_MODE (TREE_TYPE (exp)));
+
+  HACK_LANES (ld, _lanev, TREE_TYPE (exp), (target, addr, curval, lane));
+  return target;
+}
+
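+/* Expand a call EXP to __builtin_store_lanes: the argument is the
+   array-of-vectors value and TARGET is the destination memory; emit the
+   matching vstN instruction.  */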
+static rtx
+expand_builtin_store_lanes (tree exp, rtx target)
+{
+  tree exp0;
+  rtx addr, rhs;
+
+  exp0 = CALL_EXPR_ARG (exp, 0);
+  rhs = expand_normal (exp0);
+  rhs = force_reg (GET_MODE (rhs), rhs);
+
+  if (target == 0 || !MEM_P (target))
+    gcc_unreachable ();
+  addr = force_reg (Pmode, XEXP (target, 0));
+
+  HACK_LANES (st, v, TREE_TYPE (exp0), (addr, rhs));
+  return target;
+}
+
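+/* Expand a call EXP to __builtin_store_lane: as above, but with an
+   additional lane-number argument; emit the matching vstN_lane
+   instruction.  */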
+static rtx
+expand_builtin_store_lane (tree exp, rtx target)
+{
+  tree exp0, exp1;
+  rtx addr, rhs, lane;
+
+  exp0 = CALL_EXPR_ARG (exp, 0);
+  rhs = expand_normal (exp0);
+  rhs = force_reg (GET_MODE (rhs), rhs);
+
+  exp1 = CALL_EXPR_ARG (exp, 1);
+  lane = GEN_INT (tree_low_cst (exp1, 1));
+
+  if (target == 0 || !MEM_P (target))
+    gcc_unreachable ();
+  addr = force_reg (Pmode, XEXP (target, 0));
+
+  HACK_LANES (st, _lanev, TREE_TYPE (exp0), (addr, rhs, lane));
+  return target;
+}
+
 /* Expand an expression EXP that calls a built-in function,
    with result going to TARGET if that's convenient
    (and in mode MODE if that's convenient).
@@ -6583,6 +6690,18 @@ expand_builtin (tree exp, rtx target, rt
       maybe_emit_free_warning (exp);
       break;
 
+    case BUILT_IN_LOAD_LANES:
+      return expand_builtin_load_lanes (exp, target);
+
+    case BUILT_IN_LOAD_LANE:
+      return expand_builtin_load_lane (exp, target);
+
+    case BUILT_IN_STORE_LANES:
+      return expand_builtin_store_lanes (exp, target);
+
+    case BUILT_IN_STORE_LANE:
+      return expand_builtin_store_lane (exp, target);
+
     default:	/* just do library call, if unknown builtin */
       break;
     }
Index: gcc/gcc/builtins.def
===================================================================
--- gcc.orig/gcc/builtins.def
+++ gcc/gcc/builtins.def
@@ -767,6 +767,11 @@ DEF_BUILTIN_STUB (BUILT_IN_EH_POINTER, "
 DEF_BUILTIN_STUB (BUILT_IN_EH_FILTER, "__builtin_eh_filter")
 DEF_BUILTIN_STUB (BUILT_IN_EH_COPY_VALUES, "__builtin_eh_copy_values")
 
+DEF_BUILTIN_STUB (BUILT_IN_LOAD_LANES, "__builtin_load_lanes")
+DEF_BUILTIN_STUB (BUILT_IN_STORE_LANES, "__builtin_store_lanes")
+DEF_BUILTIN_STUB (BUILT_IN_LOAD_LANE, "__builtin_load_lane")
+DEF_BUILTIN_STUB (BUILT_IN_STORE_LANE, "__builtin_store_lane")
+
 /* Synchronization Primitives.  */
 #include "sync-builtins.def"
 
Index: gcc/gcc/c-family/c-common.c
===================================================================
--- gcc.orig/gcc/c-family/c-common.c
+++ gcc/gcc/c-family/c-common.c
@@ -431,6 +431,11 @@ const struct c_common_resword c_common_r
   { "__decltype",       RID_DECLTYPE,   D_CXXONLY },
   { "__extension__",	RID_EXTENSION,	0 },
   { "__func__",		RID_C99_FUNCTION_NAME, 0 },
+  { "__load_lanes",	RID_LOAD_LANES,	0 },
+  { "__store_lanes",	RID_STORE_LANES, 0 },
+  { "__load_lane",	RID_LOAD_LANE,	0 },
+  { "__store_lane",	RID_STORE_LANE,	0 },
+  { "__array_ref",	RID_ARRAY_REF,	0 },
   { "__has_nothrow_assign", RID_HAS_NOTHROW_ASSIGN, D_CXXONLY },
   { "__has_nothrow_constructor", RID_HAS_NOTHROW_CONSTRUCTOR, D_CXXONLY },
   { "__has_nothrow_copy", RID_HAS_NOTHROW_COPY, D_CXXONLY },
Index: gcc/gcc/c-family/c-common.h
===================================================================
--- gcc.orig/gcc/c-family/c-common.h
+++ gcc/gcc/c-family/c-common.h
@@ -105,6 +105,8 @@ enum rid
   RID_TYPES_COMPATIBLE_P,
   RID_DFLOAT32, RID_DFLOAT64, RID_DFLOAT128,
   RID_FRACT, RID_ACCUM,
+  RID_LOAD_LANES, RID_STORE_LANES, RID_LOAD_LANE, RID_STORE_LANE,
+  RID_ARRAY_REF,
 
   /* This means to warn that this is a C++ keyword, and then treat it
      as a normal identifier.  */
Index: gcc/gcc/c-parser.c
===================================================================
--- gcc.orig/gcc/c-parser.c
+++ gcc/gcc/c-parser.c
@@ -5955,6 +5955,33 @@ c_parser_alignof_expression (c_parser *p
     }
 }
 
+struct hack_call {
+  struct hack_call *next;
+  tree decl;
+};
+static struct hack_call *load_lanes, *store_lanes;
+
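+/* Return a built-in FUNCTION_DECL with the given NAME, function CODE and
+   function type TYPE, reusing a previously created decl from the list at
+   *PTR if one with a compatible type exists.  */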
+static tree
+get_lane_function (struct hack_call **ptr, tree type, int code,
+		   const char *name)
+{
+  struct hack_call *call;
+
+  while ((call = *ptr))
+    {
+      if (comptypes (TREE_TYPE (call->decl), type) == 1)
+	return call->decl;
+      ptr = &call->next;
+    }
+  call = XNEW (struct hack_call);
+  call->decl = build_decl (BUILTINS_LOCATION, FUNCTION_DECL,
+			   get_identifier (name), type);
+  DECL_BUILT_IN_CLASS (call->decl) = BUILT_IN_NORMAL;
+  DECL_FUNCTION_CODE (call->decl) = (enum built_in_function) code;
+  TREE_READONLY (call->decl) = 1;
+  return call->decl;
+}
+
 /* Parse a postfix expression (C90 6.3.1-6.3.2, C99 6.5.1-6.5.2).
 
    postfix-expression:
@@ -6014,6 +6041,7 @@ c_parser_postfix_expression (c_parser *p
 {
   struct c_expr expr, e1, e2, e3;
   struct c_type_name *t1, *t2;
+  int x;
   location_t loc = c_parser_peek_token (parser)->location;;
   expr.original_code = ERROR_MARK;
   expr.original_type = NULL;
@@ -6435,6 +6463,204 @@ c_parser_postfix_expression (c_parser *p
 	    expr.value = objc_build_encode_expr (type);
 	  }
 	  break;
+	case RID_LOAD_LANES:
+	  x = 0;
+	  goto lanes;
+	case RID_STORE_LANES:
+	  x = 1;
+	lanes:
+	  c_parser_consume_token (parser);
+	  if (!c_parser_require (parser, CPP_OPEN_PAREN, "expected %<(%>"))
+	    {
+	      expr.value = error_mark_node;
+	      break;
+	    }
+	  e1 = c_parser_expr_no_commas (parser, NULL);
+	  mark_exp_read (e1.value);
+	  e1.value = c_fully_fold (e1.value, false, NULL);
+	  if (!c_parser_require (parser, CPP_COMMA, "expected %<,%>"))
+	    {
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      expr.value = error_mark_node;
+	      break;
+	    }
+	  loc = c_parser_peek_token (parser)->location;
+	  t1 = c_parser_type_name (parser);
+	  if (t1 == NULL)
+	    {
+	      expr.value = error_mark_node;
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      break;
+	    }
+	  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN,
+				     "expected %<)%>");
+	  {
+	    tree decl, rtype, ftype, ctype;
+
+	    rtype = groktypename (t1, NULL, NULL);
+	    ftype = build_function_type_list (rtype, TREE_TYPE (e1.value),
+					      NULL_TREE);
+	    if (x == 0)
+	      decl = get_lane_function (&load_lanes, ftype,
+					BUILT_IN_LOAD_LANES,
+					"__builtin_load_lanes");
+	    else
+	      decl = get_lane_function (&store_lanes, ftype,
+					BUILT_IN_STORE_LANES,
+					"__builtin_store_lanes");
+	    ctype = build_pointer_type (TREE_TYPE (decl));
+	    expr.value = build1 (ADDR_EXPR, ctype, decl);
+	    expr.value = build_call_nary (rtype, expr.value, 1, e1.value);
+	    break;
+	  }
+	case RID_LOAD_LANE:
+	  c_parser_consume_token (parser);
+	  if (!c_parser_require (parser, CPP_OPEN_PAREN, "expected %<(%>"))
+	    {
+	      expr.value = error_mark_node;
+	      break;
+	    }
+	  e1 = c_parser_expr_no_commas (parser, NULL);
+	  mark_exp_read (e1.value);
+	  e1.value = c_fully_fold (e1.value, false, NULL);
+	  if (!c_parser_require (parser, CPP_COMMA, "expected %<,%>"))
+	    {
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      expr.value = error_mark_node;
+	      break;
+	    }
+	  e2 = c_parser_expr_no_commas (parser, NULL);
+	  mark_exp_read (e2.value);
+	  e2.value = c_fully_fold (e2.value, false, NULL);
+	  if (!c_parser_require (parser, CPP_COMMA, "expected %<,%>"))
+	    {
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      expr.value = error_mark_node;
+	      break;
+	    }
+	  e3 = c_parser_expr_no_commas (parser, NULL);
+	  mark_exp_read (e3.value);
+	  e3.value = c_fully_fold (e3.value, false, NULL);
+	  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN,
+				     "expected %<)%>");
+	  {
+	    tree decl, rtype, ftype, ctype;
+
+	    rtype = TREE_TYPE (e2.value);
+	    ftype = build_function_type_list (rtype, TREE_TYPE (e1.value),
+					      TREE_TYPE (e2.value),
+					      TREE_TYPE (e3.value),
+					      NULL_TREE);
+	    decl = get_lane_function (&load_lanes, ftype,
+				      BUILT_IN_LOAD_LANE,
+				      "__builtin_load_lane");
+	    ctype = build_pointer_type (TREE_TYPE (decl));
+	    expr.value = build1 (ADDR_EXPR, ctype, decl);
+	    expr.value = build_call_nary (rtype, expr.value, 3, e1.value,
+					  e2.value, e3.value);
+	    break;
+	  }
+	case RID_STORE_LANE:
+	  c_parser_consume_token (parser);
+	  if (!c_parser_require (parser, CPP_OPEN_PAREN, "expected %<(%>"))
+	    {
+	      expr.value = error_mark_node;
+	      break;
+	    }
+	  e1 = c_parser_expr_no_commas (parser, NULL);
+	  mark_exp_read (e1.value);
+	  e1.value = c_fully_fold (e1.value, false, NULL);
+	  if (!c_parser_require (parser, CPP_COMMA, "expected %<,%>"))
+	    {
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      expr.value = error_mark_node;
+	      break;
+	    }
+	  e2 = c_parser_expr_no_commas (parser, NULL);
+	  mark_exp_read (e2.value);
+	  e2.value = c_fully_fold (e2.value, false, NULL);
+	  if (!c_parser_require (parser, CPP_COMMA, "expected %<,%>"))
+	    {
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      expr.value = error_mark_node;
+	      break;
+	    }
+	  loc = c_parser_peek_token (parser)->location;
+	  t1 = c_parser_type_name (parser);
+	  if (t1 == NULL)
+	    {
+	      expr.value = error_mark_node;
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      break;
+	    }
+	  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN,
+				     "expected %<)%>");
+	  {
+	    tree decl, rtype, ftype, ctype;
+
+	    rtype = groktypename (t1, NULL, NULL);
+	    ftype = build_function_type_list (rtype, TREE_TYPE (e1.value),
+					      TREE_TYPE (e2.value),
+					      NULL_TREE);
+	    decl = get_lane_function (&load_lanes, ftype,
+				      BUILT_IN_STORE_LANE,
+				      "__builtin_store_lane");
+	    ctype = build_pointer_type (TREE_TYPE (decl));
+	    expr.value = build1 (ADDR_EXPR, ctype, decl);
+	    expr.value = build_call_nary (rtype, expr.value, 2,
+					  e1.value, e2.value);
+	    break;
+	  }
+	case RID_ARRAY_REF:
+	  c_parser_consume_token (parser);
+	  if (!c_parser_require (parser, CPP_OPEN_PAREN, "expected %<(%>"))
+	    {
+	      expr.value = error_mark_node;
+	      break;
+	    }
+	  e1 = c_parser_expr_no_commas (parser, NULL);
+	  mark_exp_read (e1.value);
+	  e1.value = c_fully_fold (e1.value, false, NULL);
+	  if (!c_parser_require (parser, CPP_COMMA, "expected %<,%>"))
+	    {
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      expr.value = error_mark_node;
+	      break;
+	    }
+	  e2 = c_parser_expr_no_commas (parser, NULL);
+	  mark_exp_read (e2.value);
+	  e2.value = c_fully_fold (e2.value, false, NULL);
+	  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN,
+				     "expected %<)%>");
+	  {
+	    tree ltype, upper;
+	    unsigned int nelems;
+
+	    ltype = TREE_TYPE (e1.value);
+	    if (TREE_CODE (ltype) != POINTER_TYPE)
+	      {
+		error ("first argument to %<__array_ref%> must"
+		       " be a pointer");
+		expr.value = error_mark_node;
+		break;
+	      }
+
+	    if (!host_integerp (e2.value, 1))
+	      {
+		error ("second argument to %<__array_ref%> must"
+		       " be a constant integer");
+		expr.value = error_mark_node;
+		break;
+	      }
+
+	    nelems = tree_low_cst (e2.value, 1);
+	    upper = build_int_cst (size_type_node, nelems - 1);
+	    ltype = build_array_type (TREE_TYPE (ltype),
+				      build_index_type (upper));
+	    expr.value = convert (build_pointer_type (ltype), e1.value);
+	    expr.value = build1 (INDIRECT_REF, ltype, expr.value);
+	    break;
+	  }
 	default:
 	  c_parser_error (parser, "expected expression");
 	  expr.value = error_mark_node;
#include "arm_neon.h"

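/* Throw-away wrappers that express vldN/vstN-style operations in terms of
   the new __load_lanes/__store_lanes/__load_lane/__store_lane and
   __array_ref keywords.  */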
#define LOAD(DEST, TYPE, X, Y, SRC) \
  (DEST).val = __load_lanes (__array_ref (SRC, X * Y), TYPE##x##Y##_t[X])

#define STORE(DEST, TYPE, X, Y, SRC) \
  __array_ref (DEST, X * Y) = __store_lanes ((SRC).val, TYPE##_t [X * Y])

#define LOAD_LANE(DEST, TYPE, X, Y, SRC, LANE) \
  (DEST).val = __load_lane (__array_ref (SRC, X * Y), (DEST).val, LANE)

#define STORE_LANE(DEST, TYPE, X, Y, SRC, LANE) \
  __array_ref (DEST, X) = __store_lane ((SRC).val, LANE, TYPE##_t[X])

void
foo (uint32_t *a)
{
  uint32x4x2_t x, y;

  LOAD (x, uint32, 2, 4, a);
  LOAD (y, uint32, 2, 4, a + 12);
  x.val[0] = vaddq_u32 (x.val[0], y.val[0]);
  x.val[1] = vaddq_u32 (x.val[1], y.val[1]);
  STORE (a, uint32, 2, 4, x);
}

void
bar (uint32_t *a)
{
  uint32x4x2_t x, y;

  LOAD (x, uint32, 2, 4, a);
  LOAD (y, uint32, 2, 4, a);
  x.val[0] = vaddq_u32 (x.val[0], y.val[0]);
  x.val[1] = vaddq_u32 (x.val[1], y.val[1]);
  STORE (a, uint32, 2, 4, x);
}

void
frob (uint32_t *a)
{
  uint32x4x2_t x, y;

  LOAD (x, uint32, 2, 4, a);
  LOAD (y, uint32, 2, 4, a + 12);
  LOAD_LANE (x, uint32, 2, 4, a + 32, 1);
  LOAD_LANE (y, uint32, 2, 4, a + 36, 2);
  x.val[0] = vaddq_u32 (x.val[0], y.val[0]);
  x.val[1] = vaddq_u32 (x.val[1], y.val[1]);
  STORE_LANE (a, uint32, 2, 4, x, 3);
  STORE_LANE (a + 4, uint32, 2, 4, x, 1);
  STORE_LANE (a + 8, uint32, 2, 4, x, 0);
  STORE_LANE (a + 12, uint32, 2, 4, x, 2);
}
        .cpu arm10tdmi
        .eabi_attribute 27, 3
        .fpu neon
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 2
        .eabi_attribute 18, 4
        .file   "test.c"
        .text
        .align  2
        .global foo
        .type   foo, %function
foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        add     r3, r0, #48
        vld2.32 {d24-d27}, [r0]
        vld2.32 {d16-d19}, [r3]
        vadd.i32        q12, q12, q8
        vadd.i32        q13, q13, q9
        vst2.32 {d24-d27}, [r0]
        bx      lr
        .size   foo, .-foo
        .align  2
        .global bar
        .type   bar, %function
bar:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vld2.32 {d16-d19}, [r0]
        vadd.i32        q10, q8, q8
        vadd.i32        q11, q9, q9
        vst2.32 {d20-d23}, [r0]
        bx      lr
        .size   bar, .-bar
        .align  2
        .global frob
        .type   frob, %function
frob:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vld2.32 {d20-d23}, [r0]
        add     r3, r0, #128
        vld2.32 {d20[1], d22[1]}, [r3]
        add     r3, r0, #48
        vld2.32 {d24-d27}, [r3]
        add     r2, r0, #144
        vld2.32 {d25[0], d27[0]}, [r2]
        vadd.i32        q10, q10, q12
        vadd.i32        q11, q11, q13
        add     r1, r0, #16
        add     r2, r0, #32
        vst2.32 {d21[1], d23[1]}, [r0]
        vst2.32 {d20[1], d22[1]}, [r1]
        vst2.32 {d20[0], d22[0]}, [r2]
        vst2.32 {d21[0], d23[0]}, [r3]
        bx      lr
        .size   frob, .-frob
        .ident  "GCC: (GNU) 4.6.0 20110210 (experimental)"
        .section        .note.GNU-stack,"",%progbits