On Thu, 2011-04-21 at 10:50 +0100, Richard Sandiford wrote:
> To get back to this...
> 
> Richard Sandiford <richard.sandif...@linaro.org> writes:
> > Richard Guenther <richard.guent...@gmail.com> writes:
> >> On Thu, Mar 31, 2011 at 3:32 PM, Richard Sandiford
> >> <richard.sandif...@linaro.org> wrote:
> >>> This patch adds an array_mode_supported_p hook, which says whether
> >>> MAX_FIXED_MODE_SIZE should be ignored for a given type of array.
> >>> It follows on from the discussion here:
> >>>
> >>>    http://gcc.gnu.org/ml/gcc/2011-03/msg00342.html
> >>>
> >>> The intended use of the hook is to allow small arrays of vectors
> >>> to have a non-BLK mode, and hence to be stored in rtl registers.
> >>> These arrays are used both in the ARM arm_neon.h API and in the
> >>> optabs proposed in:
> >>>
> >>>    http://gcc.gnu.org/ml/gcc/2011-03/msg00322.html
> >>>
> >>> The tail end of the thread was about the definition of TYPE_MODE:
> >>>
> >>> #define TYPE_MODE(NODE) \
> >>>  (TREE_CODE (TYPE_CHECK (NODE)) == VECTOR_TYPE \
> >>>   ? vector_type_mode (NODE) : (NODE)->type.mode)
> >>>
> >>> with this outcome:
> >>>
> >>>    http://gcc.gnu.org/ml/gcc/2011-03/msg00470.html
> >>>
> >>> To summarise my take on it:
> >>>
> >>> - The current definition of TYPE_MODE isn't sufficient even for vector
> >>>  modes and vector_mode_supported_p, because non-vector types can have
> >>>  vector modes.
> >>>
> >>> - We should no longer treat types as having one mode everywhere.
> >>>  We should instead replace TYPE_MODE with a function that takes
> >>>  a context.  Tests of things like vector_mode_supported_p would
> >>>  move from layout_type to this new function.
> >>>
> >>> I think this patch fits within that scheme.  array_mode_supported_p
> >>> would be treated in the same way as vector_mode_supported_p.
> >>>
> >>> I realise the ideal would be to get rid of TYPE_MODE first.
> >>> But that's going to be a longer-term thing.  Now that there's
> >>> at least a plan, I'd like to press ahead with the array stuff
> >>> on the basis that
> >>>
> >>> (a) although the new hook won't work with the "target" attribute,
> >>>    our current mode handling doesn't work in just the same way.
> >>>
> >>> (b) the new hook doesn't interfere with the plan.
> >>>
> >>> (c) getting good code from the intrinsics (and support for these
> >>>    instructions in the vectoriser) is going to be much more important
> >>>    to most ARM users than the ability to turn Neon on and off for
> >>>    individual functions in a TU.
> >>>
> >>> To give an example of the difference, the Neon code posted here:
> >>>
> >>>    http://hilbert-space.de/?p=22
> >>>
> >>> produces this inner loop before the patch (but with
> >>> http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01996.html applied):
> >>>
> >>> .L3:
> >>>        vld3.8  {d16-d18}, [r1]!
> >>>        vstmia  ip, {d16-d18}
> >>>        fldd    d19, [sp, #24]
> >>>        adr     r5, .L6
> >>>        ldmia   r5, {r4-r5}
> >>>        fldd    d16, [sp, #32]
> >>>        vmov    d18, r4, r5  @ v8qi
> >>>        vmull.u8        q9, d19, d18
> >>>        adr     r5, .L6+8
> >>>        ldmia   r5, {r4-r5}
> >>>        vmov    d17, r4, r5  @ v8qi
> >>>        vstmia  sp, {d18-d19}
> >>>        vmlal.u8        q9, d16, d17
> >>>        fldd    d16, [sp, #40]
> >>>        adr     r5, .L6+16
> >>>        ldmia   r5, {r4-r5}
> >>>        vmov    d17, r4, r5  @ v8qi
> >>>        vmlal.u8        q9, d16, d17
> >>>        add     r3, r3, #1
> >>>        vshrn.i16       d16, q9, #8
> >>>        cmp     r3, r2
> >>>        vst1.8  {d16}, [r0]!
> >>>        bne     .L3
> >>>
> >>> With both patches applied, the inner loop is:
> >>>
> >>> .L3:
> >>>        vld3.8  {d18-d20}, [r1]!
> >>>        vmull.u8        q8, d18, d21
> >>>        vmlal.u8        q8, d19, d22
> >>>        vmlal.u8        q8, d20, d23
> >>>        add     r3, r3, #1
> >>>        vshrn.i16       d16, q8, #8
> >>>        cmp     r3, r2
> >>>        vst1.8  {d16}, [r0]!
> >>>        bne     .L3
> >>>
> >>> Tested on arm-linux-gnueabi.  OK to install?
> >>
> >> It looks reasonable given the past discussion, but - can you move forward
> >> with the Neon stuff a bit to see if it really fits?  Or is this all
> >> that is needed
> >> for the load/store lane support as well (apart from vectorizer changes of
> >> course).
> >
> > Yeah, I have a prototype that hacks up some C support for generating the
> > (otherwise internal-only) load/store built-in functions that the vectoriser
> > is suppsoed to generate.  This patch is all that seems to be needed for the
> > types and optabs generation to work in the natural way.
> >
> > I'm happy to leave it until the vectoriser stuff is in a more
> > submittable state though.
> 
> The vectorisation stuff has now been approved and uses this hook to
> detect whether interleaved loads & stores are supported.  Also...
> 
> > Especially given:
> >
> >> Can you check the code generated by for example
> >>
> >> float foo(char *p)
> >> {
> >>   float a[2];
> >>   int i;
> >>   ((char *)a)[0] = p[0];
> >>   ((char *)a)[1] = p[1];
> >>   ((char *)a)[2] = p[2];
> >>   ((char *)a)[3] = p[3];
> >>   ((char *)a)[4] = p[4];
> >>   ((char *)a)[5] = p[5];
> >>   ((char *)a)[6] = p[6];
> >>   ((char *)a)[7] = p[7];
> >>   return a[0] + a[1];
> >> }
> >>
> >> for an array a that would get such a larger mode?  Thus, check what
> >> happens with partial defs of different types (just to avoid ICEs like the
> >> ones Jakub was fixing yesterday).
> >
> > OK, I tried:
> >
> > #include "arm_neon.h"
> >
> > uint32x2_t foo(char *p)
> > {
> >   uint32x2_t a[2];
> >   int i;
> >   ((char *)a)[0] = p[0];
> >   ((char *)a)[1] = p[1];
> >   ((char *)a)[2] = p[2];
> >   ((char *)a)[3] = p[3];
> >   ((char *)a)[4] = p[4];
> >   ((char *)a)[5] = p[5];
> >   ((char *)a)[6] = p[6];
> >   ((char *)a)[7] = p[7];
> >   ((char *)a)[8] = p[8];
> >   ((char *)a)[9] = p[9];
> >   ((char *)a)[10] = p[10];
> >   ((char *)a)[11] = p[11];
> >   ((char *)a)[12] = p[12];
> >   ((char *)a)[13] = p[13];
> >   ((char *)a)[14] = p[14];
> >   ((char *)a)[15] = p[15];
> >   return vadd_u32 (a[0], a[1]);
> > }
> >
> > uint32x4_t bar(char *p, uint32x4_t *b)
> > {
> >   uint32x4_t a[2];
> >   int i;
> >   ((char *)a)[0] = p[0];
> >   ((char *)a)[1] = p[1];
> >   ((char *)a)[2] = p[2];
> >   ((char *)a)[3] = p[3];
> >   ((char *)a)[4] = p[4];
> >   ((char *)a)[5] = p[5];
> >   ((char *)a)[6] = p[6];
> >   ((char *)a)[7] = p[7];
> >   ((char *)a)[8] = p[8];
> >   ((char *)a)[9] = p[9];
> >   ((char *)a)[10] = p[10];
> >   ((char *)a)[11] = p[11];
> >   ((char *)a)[12] = p[12];
> >   ((char *)a)[13] = p[13];
> >   ((char *)a)[14] = p[14];
> >   ((char *)a)[15] = p[15];
> >   ((char *)a)[16 + 0] = p[16 + 0];
> >   ((char *)a)[16 + 1] = p[16 + 1];
> >   ((char *)a)[16 + 2] = p[16 + 2];
> >   ((char *)a)[16 + 3] = p[16 + 3];
> >   ((char *)a)[16 + 4] = p[16 + 4];
> >   ((char *)a)[16 + 5] = p[16 + 5];
> >   ((char *)a)[16 + 6] = p[16 + 6];
> >   ((char *)a)[16 + 7] = p[16 + 7];
> >   ((char *)a)[16 + 8] = p[16 + 8];
> >   ((char *)a)[16 + 9] = p[16 + 9];
> >   ((char *)a)[16 + 10] = p[16 + 10];
> >   ((char *)a)[16 + 11] = p[16 + 11];
> >   ((char *)a)[16 + 12] = p[16 + 12];
> >   ((char *)a)[16 + 13] = p[16 + 13];
> >   ((char *)a)[16 + 14] = p[16 + 14];
> >   ((char *)a)[16 + 15] = p[16 + 15];
> >   return vaddq_u32 (a[0], a[1]);
> > }
> >
> > It seemed to avoid the problem Jakub was seeing, but the second function
> > hit the known const_int reload failure for these modes:
> >
> >     http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46329
> 
> ...I've just committed the fix for this PR.  Thanks to everyone for
> all the reviews.
> 
> Tested on x86_64-linux-gnu and arm-linux-gnueabi.  Do the
> target-independent bits look OK?  How about the ARM bits?
> 
> Thanks,
> Richard
> 
> 
> gcc/
>       * hooks.h (hook_bool_mode_uhwi_false): Declare.
>       * hooks.c (hook_bool_mode_uhwi_false): New function.
>       * target.def (array_mode_supported_p): New hook.
>       * doc/tm.texi.in (TARGET_ARRAY_MODE_SUPPORTED_P): Add @hook.
>       * doc/tm.texi: Regenerate.
>       * stor-layout.c (mode_for_array): New function.
>       (layout_type): Use it.
>       * config/arm/arm.c (arm_array_mode_supported_p): New function.
>       (TARGET_ARRAY_MODE_SUPPORTED_P): Define.

> Index: gcc/config/arm/arm.c
> ===================================================================
> --- gcc/config/arm/arm.c      2011-04-21 10:47:30.000000000 +0100
> +++ gcc/config/arm/arm.c      2011-04-21 10:47:48.000000000 +0100
> @@ -243,6 +243,8 @@ static rtx arm_pic_static_addr (rtx orig
>  static bool cortex_a9_sched_adjust_cost (rtx, rtx, rtx, int *);
>  static bool xscale_sched_adjust_cost (rtx, rtx, rtx, int *);
>  static bool fa726te_sched_adjust_cost (rtx, rtx, rtx, int *);
> +static bool arm_array_mode_supported_p (enum machine_mode,
> +                                     unsigned HOST_WIDE_INT);
>  static enum machine_mode arm_preferred_simd_mode (enum machine_mode);
>  static bool arm_class_likely_spilled_p (reg_class_t);
>  static bool arm_vector_alignment_reachable (const_tree type, bool is_packed);
> @@ -399,6 +401,8 @@ #define TARGET_ADDRESS_COST arm_address_
>  #define TARGET_SHIFT_TRUNCATION_MASK arm_shift_truncation_mask
>  #undef TARGET_VECTOR_MODE_SUPPORTED_P
>  #define TARGET_VECTOR_MODE_SUPPORTED_P arm_vector_mode_supported_p
> +#undef TARGET_ARRAY_MODE_SUPPORTED_P
> +#define TARGET_ARRAY_MODE_SUPPORTED_P arm_array_mode_supported_p
>  #undef TARGET_VECTORIZE_PREFERRED_SIMD_MODE
>  #define TARGET_VECTORIZE_PREFERRED_SIMD_MODE arm_preferred_simd_mode
>  #undef TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES
> @@ -22514,6 +22518,20 @@ arm_vector_mode_supported_p (enum machin
>    return false;
>  }
>  
> +/* Implements target hook array_mode_supported_p.  */
> +
> +static bool
> +arm_array_mode_supported_p (enum machine_mode mode,
> +                         unsigned HOST_WIDE_INT nelems)
> +{
> +  if (TARGET_NEON
> +      && (VALID_NEON_DREG_MODE (mode) || VALID_NEON_QREG_MODE (mode))
> +      && (nelems >= 2 && nelems <= 4))
> +    return true;
> +
> +  return false;
> +}

I'm not sure I understand why this is limited to 4 or fewer elements. A
Q reg of chars would surely be 16 elements.

R.


Reply via email to