On Thu, 2011-04-21 at 10:50 +0100, Richard Sandiford wrote:
> To get back to this...
>
> Richard Sandiford <[email protected]> writes:
> > Richard Guenther <[email protected]> writes:
> >> On Thu, Mar 31, 2011 at 3:32 PM, Richard Sandiford
> >> <[email protected]> wrote:
> >>> This patch adds an array_mode_supported_p hook, which says whether
> >>> MAX_FIXED_MODE_SIZE should be ignored for a given type of array.
> >>> It follows on from the discussion here:
> >>>
> >>> http://gcc.gnu.org/ml/gcc/2011-03/msg00342.html
> >>>
> >>> The intended use of the hook is to allow small arrays of vectors
> >>> to have a non-BLK mode, and hence to be stored in rtl registers.
> >>> These arrays are used both in the ARM arm_neon.h API and in the
> >>> optabs proposed in:
> >>>
> >>> http://gcc.gnu.org/ml/gcc/2011-03/msg00322.html
> >>>
> >>> The tail end of the thread was about the definition of TYPE_MODE:
> >>>
> >>> #define TYPE_MODE(NODE) \
> >>> (TREE_CODE (TYPE_CHECK (NODE)) == VECTOR_TYPE \
> >>> ? vector_type_mode (NODE) : (NODE)->type.mode)
> >>>
> >>> with this outcome:
> >>>
> >>> http://gcc.gnu.org/ml/gcc/2011-03/msg00470.html
> >>>
> >>> To summarise my take on it:
> >>>
> >>> - The current definition of TYPE_MODE isn't sufficient even for vector
> >>> modes and vector_mode_supported_p, because non-vector types can have
> >>> vector modes.
> >>>
> >>> - We should no longer treat types as having one mode everywhere.
> >>> We should instead replace TYPE_MODE with a function that takes
> >>> a context. Tests of things like vector_mode_supported_p would
> >>> move from layout_type to this new function.
> >>>
> >>> I think this patch fits within that scheme. array_mode_supported_p
> >>> would be treated in the same way as vector_mode_supported_p.
> >>>
> >>> I realise the ideal would be to get rid of TYPE_MODE first.
> >>> But that's going to be a longer-term thing. Now that there's
> >>> at least a plan, I'd like to press ahead with the array stuff
> >>> on the basis that
> >>>
> >>> (a) although the new hook won't work with the "target" attribute,
> >>> our current mode handling doesn't work in just the same way.
> >>>
> >>> (b) the new hook doesn't interfere with the plan.
> >>>
> >>> (c) getting good code from the intrinsics (and support for these
> >>> instructions in the vectoriser) is going to be much more important
> >>> to most ARM users than the ability to turn Neon on and off for
> >>> individual functions in a TU.
> >>>
> >>> To give an example of the difference, the Neon code posted here:
> >>>
> >>> http://hilbert-space.de/?p=22
> >>>
> >>> produces this inner loop before the patch (but with
> >>> http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01996.html applied):
> >>>
> >>> .L3:
> >>> vld3.8 {d16-d18}, [r1]!
> >>> vstmia ip, {d16-d18}
> >>> fldd d19, [sp, #24]
> >>> adr r5, .L6
> >>> ldmia r5, {r4-r5}
> >>> fldd d16, [sp, #32]
> >>> vmov d18, r4, r5 @ v8qi
> >>> vmull.u8 q9, d19, d18
> >>> adr r5, .L6+8
> >>> ldmia r5, {r4-r5}
> >>> vmov d17, r4, r5 @ v8qi
> >>> vstmia sp, {d18-d19}
> >>> vmlal.u8 q9, d16, d17
> >>> fldd d16, [sp, #40]
> >>> adr r5, .L6+16
> >>> ldmia r5, {r4-r5}
> >>> vmov d17, r4, r5 @ v8qi
> >>> vmlal.u8 q9, d16, d17
> >>> add r3, r3, #1
> >>> vshrn.i16 d16, q9, #8
> >>> cmp r3, r2
> >>> vst1.8 {d16}, [r0]!
> >>> bne .L3
> >>>
> >>> With both patches applied, the inner loop is:
> >>>
> >>> .L3:
> >>> vld3.8 {d18-d20}, [r1]!
> >>> vmull.u8 q8, d18, d21
> >>> vmlal.u8 q8, d19, d22
> >>> vmlal.u8 q8, d20, d23
> >>> add r3, r3, #1
> >>> vshrn.i16 d16, q8, #8
> >>> cmp r3, r2
> >>> vst1.8 {d16}, [r0]!
> >>> bne .L3
> >>>
> >>> Tested on arm-linux-gnueabi. OK to install?
> >>
> >> It looks reasonable given the past discussion, but - can you move forward
> >> with the Neon stuff a bit to see if it really fits? Or is this all
> >> that is needed
> >> for the load/store lane support as well (apart from vectorizer changes of
> >> course).
> >
> > Yeah, I have a prototype that hacks up some C support for generating the
> > (otherwise internal-only) load/store built-in functions that the vectoriser
> > is suppsoed to generate. This patch is all that seems to be needed for the
> > types and optabs generation to work in the natural way.
> >
> > I'm happy to leave it until the vectoriser stuff is in a more
> > submittable state though.
>
> The vectorisation stuff has now been approved and uses this hook to
> detect whether interleaved loads & stores are supported. Also...
>
> > Especially given:
> >
> >> Can you check the code generated by for example
> >>
> >> float foo(char *p)
> >> {
> >> float a[2];
> >> int i;
> >> ((char *)a)[0] = p[0];
> >> ((char *)a)[1] = p[1];
> >> ((char *)a)[2] = p[2];
> >> ((char *)a)[3] = p[3];
> >> ((char *)a)[4] = p[4];
> >> ((char *)a)[5] = p[5];
> >> ((char *)a)[6] = p[6];
> >> ((char *)a)[7] = p[7];
> >> return a[0] + a[1];
> >> }
> >>
> >> for an array a that would get such a larger mode? Thus, check what
> >> happens with partial defs of different types (just to avoid ICEs like the
> >> ones Jakub was fixing yesterday).
> >
> > OK, I tried:
> >
> > #include "arm_neon.h"
> >
> > uint32x2_t foo(char *p)
> > {
> > uint32x2_t a[2];
> > int i;
> > ((char *)a)[0] = p[0];
> > ((char *)a)[1] = p[1];
> > ((char *)a)[2] = p[2];
> > ((char *)a)[3] = p[3];
> > ((char *)a)[4] = p[4];
> > ((char *)a)[5] = p[5];
> > ((char *)a)[6] = p[6];
> > ((char *)a)[7] = p[7];
> > ((char *)a)[8] = p[8];
> > ((char *)a)[9] = p[9];
> > ((char *)a)[10] = p[10];
> > ((char *)a)[11] = p[11];
> > ((char *)a)[12] = p[12];
> > ((char *)a)[13] = p[13];
> > ((char *)a)[14] = p[14];
> > ((char *)a)[15] = p[15];
> > return vadd_u32 (a[0], a[1]);
> > }
> >
> > uint32x4_t bar(char *p, uint32x4_t *b)
> > {
> > uint32x4_t a[2];
> > int i;
> > ((char *)a)[0] = p[0];
> > ((char *)a)[1] = p[1];
> > ((char *)a)[2] = p[2];
> > ((char *)a)[3] = p[3];
> > ((char *)a)[4] = p[4];
> > ((char *)a)[5] = p[5];
> > ((char *)a)[6] = p[6];
> > ((char *)a)[7] = p[7];
> > ((char *)a)[8] = p[8];
> > ((char *)a)[9] = p[9];
> > ((char *)a)[10] = p[10];
> > ((char *)a)[11] = p[11];
> > ((char *)a)[12] = p[12];
> > ((char *)a)[13] = p[13];
> > ((char *)a)[14] = p[14];
> > ((char *)a)[15] = p[15];
> > ((char *)a)[16 + 0] = p[16 + 0];
> > ((char *)a)[16 + 1] = p[16 + 1];
> > ((char *)a)[16 + 2] = p[16 + 2];
> > ((char *)a)[16 + 3] = p[16 + 3];
> > ((char *)a)[16 + 4] = p[16 + 4];
> > ((char *)a)[16 + 5] = p[16 + 5];
> > ((char *)a)[16 + 6] = p[16 + 6];
> > ((char *)a)[16 + 7] = p[16 + 7];
> > ((char *)a)[16 + 8] = p[16 + 8];
> > ((char *)a)[16 + 9] = p[16 + 9];
> > ((char *)a)[16 + 10] = p[16 + 10];
> > ((char *)a)[16 + 11] = p[16 + 11];
> > ((char *)a)[16 + 12] = p[16 + 12];
> > ((char *)a)[16 + 13] = p[16 + 13];
> > ((char *)a)[16 + 14] = p[16 + 14];
> > ((char *)a)[16 + 15] = p[16 + 15];
> > return vaddq_u32 (a[0], a[1]);
> > }
> >
> > It seemed to avoid the problem Jakub was seeing, but the second function
> > hit the known const_int reload failure for these modes:
> >
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46329
>
> ...I've just committed the fix for this PR. Thanks to everyone for
> all the reviews.
>
> Tested on x86_64-linux-gnu and arm-linux-gnueabi. Do the
> target-independent bits look OK? How about the ARM bits?
>
> Thanks,
> Richard
>
>
> gcc/
> * hooks.h (hook_bool_mode_uhwi_false): Declare.
> * hooks.c (hook_bool_mode_uhwi_false): New function.
> * target.def (array_mode_supported_p): New hook.
> * doc/tm.texi.in (TARGET_ARRAY_MODE_SUPPORTED_P): Add @hook.
> * doc/tm.texi: Regenerate.
> * stor-layout.c (mode_for_array): New function.
> (layout_type): Use it.
> * config/arm/arm.c (arm_array_mode_supported_p): New function.
> (TARGET_ARRAY_MODE_SUPPORTED_P): Define.
> Index: gcc/config/arm/arm.c
> ===================================================================
> --- gcc/config/arm/arm.c 2011-04-21 10:47:30.000000000 +0100
> +++ gcc/config/arm/arm.c 2011-04-21 10:47:48.000000000 +0100
> @@ -243,6 +243,8 @@ static rtx arm_pic_static_addr (rtx orig
> static bool cortex_a9_sched_adjust_cost (rtx, rtx, rtx, int *);
> static bool xscale_sched_adjust_cost (rtx, rtx, rtx, int *);
> static bool fa726te_sched_adjust_cost (rtx, rtx, rtx, int *);
> +static bool arm_array_mode_supported_p (enum machine_mode,
> + unsigned HOST_WIDE_INT);
> static enum machine_mode arm_preferred_simd_mode (enum machine_mode);
> static bool arm_class_likely_spilled_p (reg_class_t);
> static bool arm_vector_alignment_reachable (const_tree type, bool is_packed);
> @@ -399,6 +401,8 @@ #define TARGET_ADDRESS_COST arm_address_
> #define TARGET_SHIFT_TRUNCATION_MASK arm_shift_truncation_mask
> #undef TARGET_VECTOR_MODE_SUPPORTED_P
> #define TARGET_VECTOR_MODE_SUPPORTED_P arm_vector_mode_supported_p
> +#undef TARGET_ARRAY_MODE_SUPPORTED_P
> +#define TARGET_ARRAY_MODE_SUPPORTED_P arm_array_mode_supported_p
> #undef TARGET_VECTORIZE_PREFERRED_SIMD_MODE
> #define TARGET_VECTORIZE_PREFERRED_SIMD_MODE arm_preferred_simd_mode
> #undef TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES
> @@ -22514,6 +22518,20 @@ arm_vector_mode_supported_p (enum machin
> return false;
> }
>
> +/* Implements target hook array_mode_supported_p. */
> +
> +static bool
> +arm_array_mode_supported_p (enum machine_mode mode,
> + unsigned HOST_WIDE_INT nelems)
> +{
> + if (TARGET_NEON
> + && (VALID_NEON_DREG_MODE (mode) || VALID_NEON_QREG_MODE (mode))
> + && (nelems >= 2 && nelems <= 4))
> + return true;
> +
> + return false;
> +}
I'm not sure I understand why this is limited to 4 or fewer elements. A
Q reg of chars would surely be 16 elements.
R.