Improving the code generated for vld and vst intrinsics

Richard Sandiford Mon, 21 Feb 2011 09:21:09 -0800

One of the vectorisation discussions from last year was about the poor
code GCC generates for vld{2,3,4}_*() and vst{2,3,4}_*().  It forces the
result of the loads onto the stack, then loads the individual pieces from
there.  It does the same thing in reverse for stores.


I think there are two major problems here:

1. The result of the vld*() is a record type such as:

    typedef struct int16x4x3_t
    {
      int16x4_t val[3];
    } int16x4x3_t;

   Ideally, we'd like one of these structures to be stored in a pseudo
   register.  However, the ARM port currently limits in-register
   record types to 64 bits, so something this big is always given
   BLKmode and stored on the stack.

   A simple "fix" for this is to increase MAX_FIXED_MODE_SIZE.
   That would do the right thing for the structures in arm_neon.h,
   but wouldn't be safe in general.

2. The vld*() returns values as a single integer (such as EI mode),
   while uses of the value will typically be in a vector mode such
   as V4SI.  CANNOT_CHANGE_MODE_CLASS doesn't allow direct
   "mode-punning" between the two in VFP_REGS, so this again
   forces the punning to be done on the stack.

   The code in question is:

    /* FPA registers can't do subreg as all values are reformatted to internal
       precision.  VFP registers may only be accessed in the mode they
       were set.  */
    #define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS)   \
      (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO)               \
       ? reg_classes_intersect_p (FPA_REGS, (CLASS))    \
         || reg_classes_intersect_p (VFP_REGS, (CLASS)) \

   However, the VFP restriction appears to be specific to VFPv1 --
   thanks to Peter for the archaeology -- and isn't a problem for v6+.
   In that case, removing this restriction is an important optimisation.

I tried the patch below on the following simple testcase:

    #include "arm_neon.h"

    void
    foo (uint16_t *a)
    {
      uint16x4x3_t x, y;

      x = vld3_u16 (a);
      y = vld3_u16 (a + 12);
      x.val[0] = vadd_u16 (x.val[0], y.val[0]);
      x.val[1] = vadd_u16 (x.val[1], y.val[1]);
      x.val[2] = vadd_u16 (x.val[2], y.val[2]);
      vst3_u16 (a, x);
    }

(not necessarily sensible!).  Before the patch, -O2 produced:

        sub     sp, sp, #48
        add     r3, r0, #24
        vld3.16 {d16-d18}, [r3]
        vld3.16 {d20-d22}, [r0]
        add     r3, sp, #24
        vstmia  sp, {d20-d22}
        vstmia  r3, {d16-d18}
        fldd    d19, [sp, #8]
        fldd    d16, [sp, #0]
        fldd    d17, [sp, #24]
        fldd    d20, [sp, #32]
        vadd.i16        d18, d16, d17
        vadd.i16        d17, d19, d20
        fldd    d19, [sp, #16]
        fldd    d20, [sp, #40]
        vadd.i16        d16, d19, d20
        fstd    d18, [sp, #0]
        fstd    d17, [sp, #8]
        fstd    d16, [sp, #16]
        vldmia  sp, {d16-d18}
        vst3.16 {d16-d18}, [r0]
        add     sp, sp, #48
        bx      lr

After the patch we get:

        vld3.16 {d24-d26}, [r0]
        add     r3, r0, #24
        vld3.16 {d20-d22}, [r3]
        vmov    q8, q12  @ ti
        vadd.i16        d17, d17, d21
        vadd.i16        d16, d24, d20
        vadd.i16        d18, d26, d22
        vst3.16 {d16-d18}, [r0]
        bx      lr

The VMOV is a bit disappointing, and needs further investigation.

The first hunk fixes (2), and I think is correct.  The second hunk
hacks (1), and isn't suitable in itself.  I'll next try to make
arm_neon.h use built-in record types that are explicitly EImode,
which should remove the need to change MAX_FIXED_MODE_SIZE.

Richard


Index: gcc/gcc/config/arm/arm.h
===================================================================
--- gcc.orig/gcc/config/arm/arm.h
+++ gcc/gcc/config/arm/arm.h
@@ -1171,10 +1171,12 @@ enum reg_class
 /* FPA registers can't do subreg as all values are reformatted to internal
    precision.  VFP registers may only be accessed in the mode they
    were set.  */
-#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS)      \
-  (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO)          \
-   ? reg_classes_intersect_p (FPA_REGS, (CLASS))       \
-     || reg_classes_intersect_p (VFP_REGS, (CLASS))    \
2+#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS)             \
+  (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO)                  \
+   ? (reg_classes_intersect_p (FPA_REGS, (CLASS))              \
+      || (TARGET_VFP                                           \
+         && reg_classes_intersect_p (VFP_REGS, (CLASS))        \
+         && arm_fpu_desc->rev == 1))                           \
    : 0)
 
 /* The class value for index registers, and the one for base regs.  */
@@ -2458,4 +2460,6 @@ enum arm_builtins
    instruction.  */
 #define MAX_LDM_STM_OPS 4
 
+#define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (XImode)
+
 #endif /* ! GCC_ARM_H */

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain

Improving the code generated for vld and vst intrinsics

Reply via email to