On Tue, Apr 19, 2016 at 10:27 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote: > On Tue, 2016-04-19 at 10:09 +0200, Richard Biener wrote: >> On Tue, Apr 19, 2016 at 12:05 AM, Bill Schmidt >> <wschm...@linux.vnet.ibm.com> wrote: >> > Hi, >> > >> > Expanding built-ins in the usual way (leaving them as calls until >> > expanding into RTL) restricts the amount of optimization that can be >> > performed on the code represented by the built-ins. This has been >> > observed to be particularly bad for the vec_ld and vec_st built-ins on >> > PowerPC, which represent the lvx and stvx instructions. Currently these >> > are expanded into UNSPECs that are left untouched by the optimizers, so >> > no redundant load or store elimination can take place. For certain >> > idiomatic usages, this leads to very bad performance. >> > >> > Initially I planned to just change the UNSPEC representation to RTL that >> > directly expresses the address masking implicit in lvx and stvx. This >> > turns out to be only partially successful in improving performance. >> > Among other things, by the time we reach RTL we have lost track of the >> > __restrict__ attribute, leading to more appearances of may-alias >> > relationships than should really be present. Instead, this patch >> > expands the built-ins during parsing so that they are exposed to all >> > GIMPLE optimizations as well. >> > >> > This works well for vec_ld and vec_st. It is also possible for >> > programmers to instead use __builtin_altivec_lvx_<mode> and >> > __builtin_altivec_stvx_<mode>. These are not so easy to catch during >> > parsing, since they are not processed by the overloaded built-in >> > function table. For these, I am currently falling back to expansion >> > during RTL while still exposing the address-masking semantics, which >> > seems ok for these somewhat obscure built-ins. At some future time we >> > may decide to handle them similarly to vec_ld and vec_st. >> > >> > For POWER8 little-endian only, the loads and stores during expand time >> > require some special handling, since the POWER8 expanders want to >> > convert these to lxvd2x/xxswapd and xxswapd/stxvd2x. To deal with this, >> > I've added an extra pre-pass to the swap optimization phase that >> > recognizes the lvx and stvx patterns and canonicalizes them so they'll >> > be properly recognized. This isn't an issue for earlier or later >> > processors, or for big-endian POWER8, so doing this as part of swap >> > optimization is appropriate. >> > >> > We have a lot of existing test cases for this code, which proved very >> > useful in discovering bugs, so I haven't seen a reason to add any new >> > tests. >> > >> > The patch is fairly large, but it isn't feasible to break it up into >> > smaller units without leaving something in a broken state. So I will >> > have to just apologize for the size and leave it at that. Sorry! :) >> > >> > Bootstrapped and tested successfully on powerpc64le-unknown-linux-gnu, >> > and on powerpc64-unknown-linux-gnu (-m32 and -m64) with no regressions. >> > Is this ok for trunk after GCC 6 releases? >> >> Just took a very quick look but it seems you are using integer arithmetic >> for the pointer adjustment and bit-and. You could use POINTER_PLUS_EXPR >> for the addition and BIT_AND_EXPR is also valid on pointer types. Which >> means you don't need conversions to/from sizetype. > > I just verified that I run into trouble with both these changes. 
The > build_binary_op interface doesn't accept POINTER_PLUS_EXPR as a valid
> code (we hit a gcc_unreachable in the main switch statement), but does
> produce pointer additions from a PLUS_EXPR.  Also, apparently
> BIT_AND_EXPR is not valid on at least these pointer types:
>
> ld.c: In function 'test':
> ld.c:68:9: error: invalid operands to binary & (have '__vector(16) unsigned char *' and '__vector(16) unsigned char *')
>      vuc = vec_ld (0, (vector unsigned char *)svuc);
>      ^
>
> That's what happens if I try:
>
>   tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
>                                   build_int_cst (TREE_TYPE (arg1),
>                                                  -16), 0);
>
> If I try with building the -16 as a sizetype, I get the same error
> message except that the second argument listed is 'sizetype'.  Is there
> something else I should be trying instead?
Ah, it might be that the FE interfaces (build_binary_op and friends) do not accept this. If you'd simply used fold_build2 it should work. For the BIT_AND_EXPR the constant has to be of the same type as 'addr'. Richard. > Thanks, > Bill > > >> >> x86 nowadays has intrinsics implemented as inlines - they come from >> header files. It seems for ppc the intrinsics are somehow magically >> there, w/o a header file? >> >> Richard. >> >> > Thanks, >> > Bill >> > >> > >> > 2016-04-18 Bill Schmidt <wschm...@linux.vnet.ibm.com> >> > >> > * config/rs6000/altivec.md (altivec_lvx_<mode>): Remove. >> > (altivec_lvx_<mode>_internal): Document. >> > (altivec_lvx_<mode>_2op): New define_insn. >> > (altivec_lvx_<mode>_1op): Likewise. >> > (altivec_lvx_<mode>_2op_si): Likewise. >> > (altivec_lvx_<mode>_1op_si): Likewise. >> > (altivec_stvx_<mode>): Remove. >> > (altivec_stvx_<mode>_internal): Document. >> > (altivec_stvx_<mode>_2op): New define_insn. >> > (altivec_stvx_<mode>_1op): Likewise. >> > (altivec_stvx_<mode>_2op_si): Likewise. >> > (altivec_stvx_<mode>_1op_si): Likewise. >> > * config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin): >> > Expand vec_ld and vec_st during parsing. >> > * config/rs6000/rs6000.c (altivec_expand_lvx_be): Commentary >> > changes. >> > (altivec_expand_stvx_be): Likewise. >> > (altivec_expand_lv_builtin): Expand lvx built-ins to expose the >> > address-masking behavior in RTL. >> > (altivec_expand_stv_builtin): Expand stvx built-ins to expose the >> > address-masking behavior in RTL. >> > (altivec_expand_builtin): Change builtin code arguments for calls >> > to altivec_expand_stv_builtin and altivec_expand_lv_builtin. >> > (insn_is_swappable_p): Avoid incorrect swap optimization in the >> > presence of lvx/stvx patterns. >> > (alignment_with_canonical_addr): New function. >> > (alignment_mask): Likewise. >> > (find_alignment_op): Likewise. >> > (combine_lvx_pattern): Likewise. >> > (combine_stvx_pattern): Likewise. >> > (combine_lvx_stvx_patterns): Likewise. >> > (rs6000_analyze_swaps): Perform a pre-pass to recognize lvx and >> > stvx patterns from expand. >> > * config/rs6000/vector.md (vector_altivec_load_<mode>): Use new >> > expansions. >> > (vector_altivec_store_<mode>): Likewise. >> > >> > >> > Index: gcc/config/rs6000/altivec.md >> > =================================================================== >> > --- gcc/config/rs6000/altivec.md (revision 235090) >> > +++ gcc/config/rs6000/altivec.md (working copy) >> > @@ -2514,20 +2514,9 @@ >> > "lvxl %0,%y1" >> > [(set_attr "type" "vecload")]) >> > >> > -(define_expand "altivec_lvx_<mode>" >> > - [(parallel >> > - [(set (match_operand:VM2 0 "register_operand" "=v") >> > - (match_operand:VM2 1 "memory_operand" "Z")) >> > - (unspec [(const_int 0)] UNSPEC_LVX)])] >> > - "TARGET_ALTIVEC" >> > -{ >> > - if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG) >> > - { >> > - altivec_expand_lvx_be (operands[0], operands[1], <MODE>mode, >> > UNSPEC_LVX); >> > - DONE; >> > - } >> > -}) >> > - >> > +; This version of lvx is used only in cases where we need to force an lvx >> > +; over any other load, and we don't care about losing CSE opportunities. >> > +; Its primary use is for prologue register saves. 
>> > (define_insn "altivec_lvx_<mode>_internal" >> > [(parallel >> > [(set (match_operand:VM2 0 "register_operand" "=v") >> > @@ -2537,20 +2526,45 @@ >> > "lvx %0,%y1" >> > [(set_attr "type" "vecload")]) >> > >> > -(define_expand "altivec_stvx_<mode>" >> > - [(parallel >> > - [(set (match_operand:VM2 0 "memory_operand" "=Z") >> > - (match_operand:VM2 1 "register_operand" "v")) >> > - (unspec [(const_int 0)] UNSPEC_STVX)])] >> > - "TARGET_ALTIVEC" >> > -{ >> > - if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG) >> > - { >> > - altivec_expand_stvx_be (operands[0], operands[1], <MODE>mode, >> > UNSPEC_STVX); >> > - DONE; >> > - } >> > -}) >> > +; The next two patterns embody what lvx should usually look like. >> > +(define_insn "altivec_lvx_<mode>_2op" >> > + [(set (match_operand:VM2 0 "register_operand" "=v") >> > + (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" >> > "b") >> > + (match_operand:DI 2 "register_operand" >> > "r")) >> > + (const_int -16))))] >> > + "TARGET_ALTIVEC && TARGET_64BIT" >> > + "lvx %0,%1,%2" >> > + [(set_attr "type" "vecload")]) >> > >> > +(define_insn "altivec_lvx_<mode>_1op" >> > + [(set (match_operand:VM2 0 "register_operand" "=v") >> > + (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r") >> > + (const_int -16))))] >> > + "TARGET_ALTIVEC && TARGET_64BIT" >> > + "lvx %0,0,%1" >> > + [(set_attr "type" "vecload")]) >> > + >> > +; 32-bit versions of the above. >> > +(define_insn "altivec_lvx_<mode>_2op_si" >> > + [(set (match_operand:VM2 0 "register_operand" "=v") >> > + (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" >> > "b") >> > + (match_operand:SI 2 "register_operand" >> > "r")) >> > + (const_int -16))))] >> > + "TARGET_ALTIVEC && TARGET_32BIT" >> > + "lvx %0,%1,%2" >> > + [(set_attr "type" "vecload")]) >> > + >> > +(define_insn "altivec_lvx_<mode>_1op_si" >> > + [(set (match_operand:VM2 0 "register_operand" "=v") >> > + (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r") >> > + (const_int -16))))] >> > + "TARGET_ALTIVEC && TARGET_32BIT" >> > + "lvx %0,0,%1" >> > + [(set_attr "type" "vecload")]) >> > + >> > +; This version of stvx is used only in cases where we need to force an >> > stvx >> > +; over any other store, and we don't care about losing CSE opportunities. >> > +; Its primary use is for epilogue register restores. >> > (define_insn "altivec_stvx_<mode>_internal" >> > [(parallel >> > [(set (match_operand:VM2 0 "memory_operand" "=Z") >> > @@ -2560,6 +2574,42 @@ >> > "stvx %1,%y0" >> > [(set_attr "type" "vecstore")]) >> > >> > +; The next two patterns embody what stvx should usually look like. >> > +(define_insn "altivec_stvx_<mode>_2op" >> > + [(set (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" >> > "b") >> > + (match_operand:DI 2 "register_operand" >> > "r")) >> > + (const_int -16))) >> > + (match_operand:VM2 0 "register_operand" "v"))] >> > + "TARGET_ALTIVEC && TARGET_64BIT" >> > + "stvx %0,%1,%2" >> > + [(set_attr "type" "vecstore")]) >> > + >> > +(define_insn "altivec_stvx_<mode>_1op" >> > + [(set (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r") >> > + (const_int -16))) >> > + (match_operand:VM2 0 "register_operand" "v"))] >> > + "TARGET_ALTIVEC && TARGET_64BIT" >> > + "stvx %0,0,%1" >> > + [(set_attr "type" "vecstore")]) >> > + >> > +; 32-bit versions of the above. 
>> > +(define_insn "altivec_stvx_<mode>_2op_si" >> > + [(set (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" >> > "b") >> > + (match_operand:SI 2 "register_operand" >> > "r")) >> > + (const_int -16))) >> > + (match_operand:VM2 0 "register_operand" "v"))] >> > + "TARGET_ALTIVEC && TARGET_32BIT" >> > + "stvx %0,%1,%2" >> > + [(set_attr "type" "vecstore")]) >> > + >> > +(define_insn "altivec_stvx_<mode>_1op_si" >> > + [(set (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r") >> > + (const_int -16))) >> > + (match_operand:VM2 0 "register_operand" "v"))] >> > + "TARGET_ALTIVEC && TARGET_32BIT" >> > + "stvx %0,0,%1" >> > + [(set_attr "type" "vecstore")]) >> > + >> > (define_expand "altivec_stvxl_<mode>" >> > [(parallel >> > [(set (match_operand:VM2 0 "memory_operand" "=Z") >> > Index: gcc/config/rs6000/rs6000-c.c >> > =================================================================== >> > --- gcc/config/rs6000/rs6000-c.c (revision 235090) >> > +++ gcc/config/rs6000/rs6000-c.c (working copy) >> > @@ -4800,6 +4800,164 @@ assignment for unaligned loads and stores"); >> > return stmt; >> > } >> > >> > + /* Expand vec_ld into an expression that masks the address and >> > + performs the load. We need to expand this early to allow >> > + the best aliasing, as by the time we get into RTL we no longer >> > + are able to honor __restrict__, for example. We may want to >> > + consider this for all memory access built-ins. >> > + >> > + When -maltivec=be is specified, simply punt to existing >> > + built-in processing. */ >> > + if (fcode == ALTIVEC_BUILTIN_VEC_LD >> > + && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG)) >> > + { >> > + tree arg0 = (*arglist)[0]; >> > + tree arg1 = (*arglist)[1]; >> > + >> > + /* Strip qualifiers like "const" from the pointer arg. */ >> > + tree arg1_type = TREE_TYPE (arg1); >> > + tree inner_type = TREE_TYPE (arg1_type); >> > + if (TYPE_QUALS (TREE_TYPE (arg1_type)) != 0) >> > + { >> > + arg1_type = build_pointer_type (build_qualified_type (inner_type, >> > + 0)); >> > + arg1 = fold_convert (arg1_type, arg1); >> > + } >> > + >> > + /* Construct the masked address. We have to jump through some hoops >> > + here. If the first argument to a PLUS_EXPR is a pointer, >> > + build_binary_op will multiply the offset by the size of the >> > + inner type of the pointer (C semantics). With vec_ld and vec_st, >> > + the offset must be left alone. However, if we convert to a >> > + sizetype to do the arithmetic, we get a PLUS_EXPR instead of a >> > + POINTER_PLUS_EXPR, which interferes with aliasing (causing us, >> > + for example, to lose "restrict" information). Thus where legal, >> > + we pre-adjust the offset knowing that a multiply by size is >> > + coming. When the offset isn't a multiple of the size, we are >> > + forced to do the arithmetic in size_type for correctness, at the >> > + cost of losing aliasing information. This, however, should be >> > + quite rare with these operations. */ >> > + arg0 = fold (arg0); >> > + >> > + /* Let existing error handling take over if we don't have a constant >> > + offset. 
*/ >> > + if (TREE_CODE (arg0) == INTEGER_CST) >> > + { >> > + HOST_WIDE_INT off = TREE_INT_CST_LOW (arg0); >> > + HOST_WIDE_INT size = int_size_in_bytes (inner_type); >> > + tree addr; >> > + >> > + if (off % size == 0) >> > + { >> > + tree adjoff = build_int_cst (TREE_TYPE (arg0), off / size); >> > + addr = build_binary_op (loc, PLUS_EXPR, arg1, adjoff, 0); >> > + addr = build1 (NOP_EXPR, sizetype, addr); >> > + } >> > + else >> > + { >> > + tree hack_arg1 = build1 (NOP_EXPR, sizetype, arg1); >> > + addr = build_binary_op (loc, PLUS_EXPR, hack_arg1, arg0, 0); >> > + } >> > + tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr, >> > + build_int_cst (sizetype, -16), >> > 0); >> > + >> > + /* Find the built-in to get the return type so we can convert >> > + the result properly (or fall back to default handling if the >> > + arguments aren't compatible). */ >> > + for (desc = altivec_overloaded_builtins; >> > + desc->code && desc->code != fcode; desc++) >> > + continue; >> > + >> > + for (; desc->code == fcode; desc++) >> > + if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), >> > desc->op1) >> > + && (rs6000_builtin_type_compatible (TREE_TYPE (arg1), >> > + desc->op2))) >> > + { >> > + tree ret_type = rs6000_builtin_type (desc->ret_type); >> > + if (TYPE_MODE (ret_type) == V2DImode) >> > + /* Type-based aliasing analysis thinks vector long >> > + and vector long long are different and will put them >> > + in distinct alias classes. Force our return type >> > + to be a may-alias type to avoid this. */ >> > + ret_type >> > + = build_pointer_type_for_mode (ret_type, Pmode, >> > + true/*can_alias_all*/); >> > + else >> > + ret_type = build_pointer_type (ret_type); >> > + aligned = build1 (NOP_EXPR, ret_type, aligned); >> > + tree ret_val = build_indirect_ref (loc, aligned, RO_NULL); >> > + return ret_val; >> > + } >> > + } >> > + } >> > + >> > + /* Similarly for stvx. */ >> > + if (fcode == ALTIVEC_BUILTIN_VEC_ST >> > + && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG)) >> > + { >> > + tree arg0 = (*arglist)[0]; >> > + tree arg1 = (*arglist)[1]; >> > + tree arg2 = (*arglist)[2]; >> > + >> > + /* Construct the masked address. See handling for >> > ALTIVEC_BUILTIN_VEC_LD >> > + for an explanation of address arithmetic concerns. */ >> > + arg1 = fold (arg1); >> > + >> > + /* Let existing error handling take over if we don't have a constant >> > + offset. */ >> > + if (TREE_CODE (arg1) == INTEGER_CST) >> > + { >> > + HOST_WIDE_INT off = TREE_INT_CST_LOW (arg1); >> > + tree inner_type = TREE_TYPE (TREE_TYPE (arg2)); >> > + HOST_WIDE_INT size = int_size_in_bytes (inner_type); >> > + tree addr; >> > + >> > + if (off % size == 0) >> > + { >> > + tree adjoff = build_int_cst (TREE_TYPE (arg1), off / size); >> > + addr = build_binary_op (loc, PLUS_EXPR, arg2, adjoff, 0); >> > + addr = build1 (NOP_EXPR, sizetype, addr); >> > + } >> > + else >> > + { >> > + tree hack_arg2 = build1 (NOP_EXPR, sizetype, arg2); >> > + addr = build_binary_op (loc, PLUS_EXPR, hack_arg2, arg1, 0); >> > + } >> > + tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr, >> > + build_int_cst (sizetype, -16), >> > 0); >> > + >> > + /* Find the built-in to make sure a compatible one exists; if not >> > + we fall back to default handling to get the error message. 
*/ >> > + for (desc = altivec_overloaded_builtins; >> > + desc->code && desc->code != fcode; desc++) >> > + continue; >> > + >> > + for (; desc->code == fcode; desc++) >> > + if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), >> > desc->op1) >> > + && rs6000_builtin_type_compatible (TREE_TYPE (arg1), >> > desc->op2) >> > + && rs6000_builtin_type_compatible (TREE_TYPE (arg2), >> > + desc->op3)) >> > + { >> > + tree arg0_type = TREE_TYPE (arg0); >> > + if (TYPE_MODE (arg0_type) == V2DImode) >> > + /* Type-based aliasing analysis thinks vector long >> > + and vector long long are different and will put them >> > + in distinct alias classes. Force our address type >> > + to be a may-alias type to avoid this. */ >> > + arg0_type >> > + = build_pointer_type_for_mode (arg0_type, Pmode, >> > + true/*can_alias_all*/); >> > + else >> > + arg0_type = build_pointer_type (arg0_type); >> > + aligned = build1 (NOP_EXPR, arg0_type, aligned); >> > + tree stg = build_indirect_ref (loc, aligned, RO_NULL); >> > + tree retval = build2 (MODIFY_EXPR, TREE_TYPE (stg), stg, >> > + convert (TREE_TYPE (stg), arg0)); >> > + return retval; >> > + } >> > + } >> > + } >> > + >> > for (n = 0; >> > !VOID_TYPE_P (TREE_VALUE (fnargs)) && n < nargs; >> > fnargs = TREE_CHAIN (fnargs), n++) >> > Index: gcc/config/rs6000/rs6000.c >> > =================================================================== >> > --- gcc/config/rs6000/rs6000.c (revision 235090) >> > +++ gcc/config/rs6000/rs6000.c (working copy) >> > @@ -13025,9 +13025,9 @@ swap_selector_for_mode (machine_mode mode) >> > return force_reg (V16QImode, gen_rtx_CONST_VECTOR (V16QImode, >> > gen_rtvec_v (16, perm))); >> > } >> > >> > -/* Generate code for an "lvx", "lvxl", or "lve*x" built-in for a little >> > endian target >> > - with -maltivec=be specified. Issue the load followed by an >> > element-reversing >> > - permute. */ >> > +/* Generate code for an "lvxl", or "lve*x" built-in for a little endian >> > target >> > + with -maltivec=be specified. Issue the load followed by an element- >> > + reversing permute. */ >> > void >> > altivec_expand_lvx_be (rtx op0, rtx op1, machine_mode mode, unsigned >> > unspec) >> > { >> > @@ -13043,8 +13043,8 @@ altivec_expand_lvx_be (rtx op0, rtx op1, machine_m >> > emit_insn (gen_rtx_SET (op0, vperm)); >> > } >> > >> > -/* Generate code for a "stvx" or "stvxl" built-in for a little endian >> > target >> > - with -maltivec=be specified. Issue the store preceded by an >> > element-reversing >> > +/* Generate code for a "stvxl" built-in for a little endian target with >> > + -maltivec=be specified. Issue the store preceded by an >> > element-reversing >> > permute. */ >> > void >> > altivec_expand_stvx_be (rtx op0, rtx op1, machine_mode mode, unsigned >> > unspec) >> > @@ -13106,22 +13106,65 @@ altivec_expand_lv_builtin (enum insn_code icode, >> > t >> > >> > op1 = copy_to_mode_reg (mode1, op1); >> > >> > - if (op0 == const0_rtx) >> > + /* For LVX, express the RTL accurately by ANDing the address with -16. >> > + LVXL and LVE*X expand to use UNSPECs to hide their special behavior, >> > + so the raw address is fine. */ >> > + switch (icode) >> > { >> > - addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1); >> > - } >> > - else >> > - { >> > - op0 = copy_to_mode_reg (mode0, op0); >> > - addr = gen_rtx_MEM (blk ? 
BLKmode : tmode, gen_rtx_PLUS (Pmode, >> > op0, op1)); >> > - } >> > + case CODE_FOR_altivec_lvx_v2df_2op: >> > + case CODE_FOR_altivec_lvx_v2di_2op: >> > + case CODE_FOR_altivec_lvx_v4sf_2op: >> > + case CODE_FOR_altivec_lvx_v4si_2op: >> > + case CODE_FOR_altivec_lvx_v8hi_2op: >> > + case CODE_FOR_altivec_lvx_v16qi_2op: >> > + { >> > + rtx rawaddr; >> > + if (op0 == const0_rtx) >> > + rawaddr = op1; >> > + else >> > + { >> > + op0 = copy_to_mode_reg (mode0, op0); >> > + rawaddr = gen_rtx_PLUS (Pmode, op1, op0); >> > + } >> > + addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, >> > -16)); >> > + addr = gen_rtx_MEM (blk ? BLKmode : tmode, addr); >> > >> > - pat = GEN_FCN (icode) (target, addr); >> > + /* For -maltivec=be, emit the load and follow it up with a >> > + permute to swap the elements. */ >> > + if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG) >> > + { >> > + rtx temp = gen_reg_rtx (tmode); >> > + emit_insn (gen_rtx_SET (temp, addr)); >> > >> > - if (! pat) >> > - return 0; >> > - emit_insn (pat); >> > + rtx sel = swap_selector_for_mode (tmode); >> > + rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, temp, temp, >> > sel), >> > + UNSPEC_VPERM); >> > + emit_insn (gen_rtx_SET (target, vperm)); >> > + } >> > + else >> > + emit_insn (gen_rtx_SET (target, addr)); >> > >> > + break; >> > + } >> > + >> > + default: >> > + if (op0 == const0_rtx) >> > + addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1); >> > + else >> > + { >> > + op0 = copy_to_mode_reg (mode0, op0); >> > + addr = gen_rtx_MEM (blk ? BLKmode : tmode, >> > + gen_rtx_PLUS (Pmode, op1, op0)); >> > + } >> > + >> > + pat = GEN_FCN (icode) (target, addr); >> > + if (! pat) >> > + return 0; >> > + emit_insn (pat); >> > + >> > + break; >> > + } >> > + >> > return target; >> > } >> > >> > @@ -13208,7 +13251,7 @@ altivec_expand_stv_builtin (enum insn_code icode, >> > rtx op0 = expand_normal (arg0); >> > rtx op1 = expand_normal (arg1); >> > rtx op2 = expand_normal (arg2); >> > - rtx pat, addr; >> > + rtx pat, addr, rawaddr; >> > machine_mode tmode = insn_data[icode].operand[0].mode; >> > machine_mode smode = insn_data[icode].operand[1].mode; >> > machine_mode mode1 = Pmode; >> > @@ -13220,24 +13263,69 @@ altivec_expand_stv_builtin (enum insn_code icode, >> > || arg2 == error_mark_node) >> > return const0_rtx; >> > >> > - if (! (*insn_data[icode].operand[1].predicate) (op0, smode)) >> > - op0 = copy_to_mode_reg (smode, op0); >> > - >> > op2 = copy_to_mode_reg (mode2, op2); >> > >> > - if (op1 == const0_rtx) >> > + /* For STVX, express the RTL accurately by ANDing the address with -16. >> > + STVXL and STVE*X expand to use UNSPECs to hide their special >> > behavior, >> > + so the raw address is fine. 
*/ >> > + switch (icode) >> > { >> > - addr = gen_rtx_MEM (tmode, op2); >> > - } >> > - else >> > - { >> > - op1 = copy_to_mode_reg (mode1, op1); >> > - addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op1, op2)); >> > - } >> > + case CODE_FOR_altivec_stvx_v2df_2op: >> > + case CODE_FOR_altivec_stvx_v2di_2op: >> > + case CODE_FOR_altivec_stvx_v4sf_2op: >> > + case CODE_FOR_altivec_stvx_v4si_2op: >> > + case CODE_FOR_altivec_stvx_v8hi_2op: >> > + case CODE_FOR_altivec_stvx_v16qi_2op: >> > + { >> > + if (op1 == const0_rtx) >> > + rawaddr = op2; >> > + else >> > + { >> > + op1 = copy_to_mode_reg (mode1, op1); >> > + rawaddr = gen_rtx_PLUS (Pmode, op2, op1); >> > + } >> > >> > - pat = GEN_FCN (icode) (addr, op0); >> > - if (pat) >> > - emit_insn (pat); >> > + addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, >> > -16)); >> > + addr = gen_rtx_MEM (tmode, addr); >> > + >> > + op0 = copy_to_mode_reg (tmode, op0); >> > + >> > + /* For -maltivec=be, emit a permute to swap the elements, followed >> > + by the store. */ >> > + if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG) >> > + { >> > + rtx temp = gen_reg_rtx (tmode); >> > + rtx sel = swap_selector_for_mode (tmode); >> > + rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, op0, op0, >> > sel), >> > + UNSPEC_VPERM); >> > + emit_insn (gen_rtx_SET (temp, vperm)); >> > + emit_insn (gen_rtx_SET (addr, temp)); >> > + } >> > + else >> > + emit_insn (gen_rtx_SET (addr, op0)); >> > + >> > + break; >> > + } >> > + >> > + default: >> > + { >> > + if (! (*insn_data[icode].operand[1].predicate) (op0, smode)) >> > + op0 = copy_to_mode_reg (smode, op0); >> > + >> > + if (op1 == const0_rtx) >> > + addr = gen_rtx_MEM (tmode, op2); >> > + else >> > + { >> > + op1 = copy_to_mode_reg (mode1, op1); >> > + addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op2, op1)); >> > + } >> > + >> > + pat = GEN_FCN (icode) (addr, op0); >> > + if (pat) >> > + emit_insn (pat); >> > + } >> > + } >> > + >> > return NULL_RTX; >> > } >> > >> > @@ -14073,18 +14161,18 @@ altivec_expand_builtin (tree exp, rtx target, >> > bool >> > switch (fcode) >> > { >> > case ALTIVEC_BUILTIN_STVX_V2DF: >> > - return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df, exp); >> > + return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df_2op, >> > exp); >> > case ALTIVEC_BUILTIN_STVX_V2DI: >> > - return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di, exp); >> > + return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di_2op, >> > exp); >> > case ALTIVEC_BUILTIN_STVX_V4SF: >> > - return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf, exp); >> > + return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf_2op, >> > exp); >> > case ALTIVEC_BUILTIN_STVX: >> > case ALTIVEC_BUILTIN_STVX_V4SI: >> > - return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si, exp); >> > + return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si_2op, >> > exp); >> > case ALTIVEC_BUILTIN_STVX_V8HI: >> > - return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi, exp); >> > + return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi_2op, >> > exp); >> > case ALTIVEC_BUILTIN_STVX_V16QI: >> > - return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi, >> > exp); >> > + return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi_2op, >> > exp); >> > case ALTIVEC_BUILTIN_STVEBX: >> > return altivec_expand_stv_builtin (CODE_FOR_altivec_stvebx, exp); >> > case ALTIVEC_BUILTIN_STVEHX: >> > @@ -14272,23 +14360,23 @@ altivec_expand_builtin (tree exp, rtx target, >> > bool >> > 
return altivec_expand_lv_builtin (CODE_FOR_altivec_lvxl_v16qi, >> > exp, target, false); >> > case ALTIVEC_BUILTIN_LVX_V2DF: >> > - return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df, >> > + return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df_2op, >> > exp, target, false); >> > case ALTIVEC_BUILTIN_LVX_V2DI: >> > - return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di, >> > + return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di_2op, >> > exp, target, false); >> > case ALTIVEC_BUILTIN_LVX_V4SF: >> > - return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf, >> > + return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf_2op, >> > exp, target, false); >> > case ALTIVEC_BUILTIN_LVX: >> > case ALTIVEC_BUILTIN_LVX_V4SI: >> > - return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si, >> > + return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si_2op, >> > exp, target, false); >> > case ALTIVEC_BUILTIN_LVX_V8HI: >> > - return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi, >> > + return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi_2op, >> > exp, target, false); >> > case ALTIVEC_BUILTIN_LVX_V16QI: >> > - return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi, >> > + return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi_2op, >> > exp, target, false); >> > case ALTIVEC_BUILTIN_LVLX: >> > return altivec_expand_lv_builtin (CODE_FOR_altivec_lvlx, >> > @@ -37139,7 +37227,9 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r >> > fix them up by converting them to permuting ones. Exceptions: >> > UNSPEC_LVE, UNSPEC_LVX, and UNSPEC_STVX, which have a PARALLEL >> > body instead of a SET; and UNSPEC_STVE, which has an UNSPEC >> > - for the SET source. */ >> > + for the SET source. Also we must now make an exception for lvx >> > + and stvx when they are not in the UNSPEC_LVX/STVX form (with the >> > + explicit "& -16") since this leads to unrecognizable insns. */ >> > rtx body = PATTERN (insn); >> > int i = INSN_UID (insn); >> > >> > @@ -37147,6 +37237,11 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r >> > { >> > if (GET_CODE (body) == SET) >> > { >> > + rtx rhs = SET_SRC (body); >> > + gcc_assert (GET_CODE (rhs) == MEM); >> > + if (GET_CODE (XEXP (rhs, 0)) == AND) >> > + return 0; >> > + >> > *special = SH_NOSWAP_LD; >> > return 1; >> > } >> > @@ -37156,8 +37251,14 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r >> > >> > if (insn_entry[i].is_store) >> > { >> > - if (GET_CODE (body) == SET && GET_CODE (SET_SRC (body)) != UNSPEC) >> > + if (GET_CODE (body) == SET >> > + && GET_CODE (SET_SRC (body)) != UNSPEC) >> > { >> > + rtx lhs = SET_DEST (body); >> > + gcc_assert (GET_CODE (lhs) == MEM); >> > + if (GET_CODE (XEXP (lhs, 0)) == AND) >> > + return 0; >> > + >> > *special = SH_NOSWAP_ST; >> > return 1; >> > } >> > @@ -37827,6 +37928,267 @@ dump_swap_insn_table (swap_web_entry *insn_entry) >> > fputs ("\n", dump_file); >> > } >> > >> > +/* Return RTX with its address canonicalized to (reg) or (+ reg reg). >> > + Here RTX is an (& addr (const_int -16)). Always return a new copy >> > + to avoid problems with combine. 
*/ >> > +static rtx >> > +alignment_with_canonical_addr (rtx align) >> > +{ >> > + rtx canon; >> > + rtx addr = XEXP (align, 0); >> > + >> > + if (REG_P (addr)) >> > + canon = addr; >> > + >> > + else if (GET_CODE (addr) == PLUS) >> > + { >> > + rtx addrop0 = XEXP (addr, 0); >> > + rtx addrop1 = XEXP (addr, 1); >> > + >> > + if (!REG_P (addrop0)) >> > + addrop0 = force_reg (GET_MODE (addrop0), addrop0); >> > + >> > + if (!REG_P (addrop1)) >> > + addrop1 = force_reg (GET_MODE (addrop1), addrop1); >> > + >> > + canon = gen_rtx_PLUS (GET_MODE (addr), addrop0, addrop1); >> > + } >> > + >> > + else >> > + canon = force_reg (GET_MODE (addr), addr); >> > + >> > + return gen_rtx_AND (GET_MODE (align), canon, GEN_INT (-16)); >> > +} >> > + >> > +/* Check whether an rtx is an alignment mask, and if so, return >> > + a fully-expanded rtx for the masking operation. */ >> > +static rtx >> > +alignment_mask (rtx_insn *insn) >> > +{ >> > + rtx body = PATTERN (insn); >> > + >> > + if (GET_CODE (body) != SET >> > + || GET_CODE (SET_SRC (body)) != AND >> > + || !REG_P (XEXP (SET_SRC (body), 0))) >> > + return 0; >> > + >> > + rtx mask = XEXP (SET_SRC (body), 1); >> > + >> > + if (GET_CODE (mask) == CONST_INT) >> > + { >> > + if (INTVAL (mask) == -16) >> > + return alignment_with_canonical_addr (SET_SRC (body)); >> > + else >> > + return 0; >> > + } >> > + >> > + if (!REG_P (mask)) >> > + return 0; >> > + >> > + struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn); >> > + df_ref use; >> > + rtx real_mask = 0; >> > + >> > + FOR_EACH_INSN_INFO_USE (use, insn_info) >> > + { >> > + if (!rtx_equal_p (DF_REF_REG (use), mask)) >> > + continue; >> > + >> > + struct df_link *def_link = DF_REF_CHAIN (use); >> > + if (!def_link || def_link->next) >> > + return 0; >> > + >> > + rtx_insn *const_insn = DF_REF_INSN (def_link->ref); >> > + rtx const_body = PATTERN (const_insn); >> > + if (GET_CODE (const_body) != SET) >> > + return 0; >> > + >> > + real_mask = SET_SRC (const_body); >> > + >> > + if (GET_CODE (real_mask) != CONST_INT >> > + || INTVAL (real_mask) != -16) >> > + return 0; >> > + } >> > + >> > + if (real_mask == 0) >> > + return 0; >> > + >> > + return alignment_with_canonical_addr (SET_SRC (body)); >> > +} >> > + >> > +/* Given INSN that's a load or store based at BASE_REG, look for a >> > + feeding computation that aligns its address on a 16-byte boundary. */ >> > +static rtx >> > +find_alignment_op (rtx_insn *insn, rtx base_reg) >> > +{ >> > + df_ref base_use; >> > + struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn); >> > + rtx and_operation = 0; >> > + >> > + FOR_EACH_INSN_INFO_USE (base_use, insn_info) >> > + { >> > + if (!rtx_equal_p (DF_REF_REG (base_use), base_reg)) >> > + continue; >> > + >> > + struct df_link *base_def_link = DF_REF_CHAIN (base_use); >> > + if (!base_def_link || base_def_link->next) >> > + break; >> > + >> > + rtx_insn *and_insn = DF_REF_INSN (base_def_link->ref); >> > + and_operation = alignment_mask (and_insn); >> > + if (and_operation != 0) >> > + break; >> > + } >> > + >> > + return and_operation; >> > +} >> > + >> > +struct del_info { bool replace; rtx_insn *replace_insn; }; >> > + >> > +/* If INSN is the load for an lvx pattern, put it in canonical form. 
*/ >> > +static void >> > +combine_lvx_pattern (rtx_insn *insn, del_info *to_delete) >> > +{ >> > + rtx body = PATTERN (insn); >> > + gcc_assert (GET_CODE (body) == SET >> > + && GET_CODE (SET_SRC (body)) == VEC_SELECT >> > + && GET_CODE (XEXP (SET_SRC (body), 0)) == MEM); >> > + >> > + rtx mem = XEXP (SET_SRC (body), 0); >> > + rtx base_reg = XEXP (mem, 0); >> > + >> > + rtx and_operation = find_alignment_op (insn, base_reg); >> > + >> > + if (and_operation != 0) >> > + { >> > + df_ref def; >> > + struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn); >> > + FOR_EACH_INSN_INFO_DEF (def, insn_info) >> > + { >> > + struct df_link *link = DF_REF_CHAIN (def); >> > + if (!link || link->next) >> > + break; >> > + >> > + rtx_insn *swap_insn = DF_REF_INSN (link->ref); >> > + if (!insn_is_swap_p (swap_insn) >> > + || insn_is_load_p (swap_insn) >> > + || insn_is_store_p (swap_insn)) >> > + break; >> > + >> > + /* Expected lvx pattern found. Change the swap to >> > + a copy, and propagate the AND operation into the >> > + load. */ >> > + to_delete[INSN_UID (swap_insn)].replace = true; >> > + to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn; >> > + >> > + XEXP (mem, 0) = and_operation; >> > + SET_SRC (body) = mem; >> > + INSN_CODE (insn) = -1; /* Force re-recognition. */ >> > + df_insn_rescan (insn); >> > + >> > + if (dump_file) >> > + fprintf (dump_file, "lvx opportunity found at %d\n", >> > + INSN_UID (insn)); >> > + } >> > + } >> > +} >> > + >> > +/* If INSN is the store for an stvx pattern, put it in canonical form. */ >> > +static void >> > +combine_stvx_pattern (rtx_insn *insn, del_info *to_delete) >> > +{ >> > + rtx body = PATTERN (insn); >> > + gcc_assert (GET_CODE (body) == SET >> > + && GET_CODE (SET_DEST (body)) == MEM >> > + && GET_CODE (SET_SRC (body)) == VEC_SELECT); >> > + rtx mem = SET_DEST (body); >> > + rtx base_reg = XEXP (mem, 0); >> > + >> > + rtx and_operation = find_alignment_op (insn, base_reg); >> > + >> > + if (and_operation != 0) >> > + { >> > + rtx src_reg = XEXP (SET_SRC (body), 0); >> > + df_ref src_use; >> > + struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn); >> > + FOR_EACH_INSN_INFO_USE (src_use, insn_info) >> > + { >> > + if (!rtx_equal_p (DF_REF_REG (src_use), src_reg)) >> > + continue; >> > + >> > + struct df_link *link = DF_REF_CHAIN (src_use); >> > + if (!link || link->next) >> > + break; >> > + >> > + rtx_insn *swap_insn = DF_REF_INSN (link->ref); >> > + if (!insn_is_swap_p (swap_insn) >> > + || insn_is_load_p (swap_insn) >> > + || insn_is_store_p (swap_insn)) >> > + break; >> > + >> > + /* Expected stvx pattern found. Change the swap to >> > + a copy, and propagate the AND operation into the >> > + store. */ >> > + to_delete[INSN_UID (swap_insn)].replace = true; >> > + to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn; >> > + >> > + XEXP (mem, 0) = and_operation; >> > + SET_SRC (body) = src_reg; >> > + INSN_CODE (insn) = -1; /* Force re-recognition. */ >> > + df_insn_rescan (insn); >> > + >> > + if (dump_file) >> > + fprintf (dump_file, "stvx opportunity found at %d\n", >> > + INSN_UID (insn)); >> > + } >> > + } >> > +} >> > + >> > +/* Look for patterns created from builtin lvx and stvx calls, and >> > + canonicalize them to be properly recognized as such. 
*/ >> > +static void >> > +combine_lvx_stvx_patterns (function *fun) >> > +{ >> > + int i; >> > + basic_block bb; >> > + rtx_insn *insn; >> > + >> > + int num_insns = get_max_uid (); >> > + del_info *to_delete = XCNEWVEC (del_info, num_insns); >> > + >> > + FOR_ALL_BB_FN (bb, fun) >> > + FOR_BB_INSNS (bb, insn) >> > + { >> > + if (!NONDEBUG_INSN_P (insn)) >> > + continue; >> > + >> > + if (insn_is_load_p (insn) && insn_is_swap_p (insn)) >> > + combine_lvx_pattern (insn, to_delete); >> > + else if (insn_is_store_p (insn) && insn_is_swap_p (insn)) >> > + combine_stvx_pattern (insn, to_delete); >> > + } >> > + >> > + /* Turning swaps into copies is delayed until now, to avoid problems >> > + with deleting instructions during the insn walk. */ >> > + for (i = 0; i < num_insns; i++) >> > + if (to_delete[i].replace) >> > + { >> > + rtx swap_body = PATTERN (to_delete[i].replace_insn); >> > + rtx src_reg = XEXP (SET_SRC (swap_body), 0); >> > + rtx copy = gen_rtx_SET (SET_DEST (swap_body), src_reg); >> > + rtx_insn *new_insn = emit_insn_before (copy, >> > + to_delete[i].replace_insn); >> > + set_block_for_insn (new_insn, >> > + BLOCK_FOR_INSN (to_delete[i].replace_insn)); >> > + df_insn_rescan (new_insn); >> > + df_insn_delete (to_delete[i].replace_insn); >> > + remove_insn (to_delete[i].replace_insn); >> > + to_delete[i].replace_insn->set_deleted (); >> > + } >> > + >> > + free (to_delete); >> > +} >> > + >> > /* Main entry point for this pass. */ >> > unsigned int >> > rs6000_analyze_swaps (function *fun) >> > @@ -37833,7 +38195,7 @@ rs6000_analyze_swaps (function *fun) >> > { >> > swap_web_entry *insn_entry; >> > basic_block bb; >> > - rtx_insn *insn; >> > + rtx_insn *insn, *curr_insn = 0; >> > >> > /* Dataflow analysis for use-def chains. */ >> > df_set_flags (DF_RD_PRUNE_DEAD_DEFS); >> > @@ -37841,12 +38203,15 @@ rs6000_analyze_swaps (function *fun) >> > df_analyze (); >> > df_set_flags (DF_DEFER_INSN_RESCAN); >> > >> > + /* Pre-pass to combine lvx and stvx patterns so we don't lose info. */ >> > + combine_lvx_stvx_patterns (fun); >> > + >> > /* Allocate structure to represent webs of insns. */ >> > insn_entry = XCNEWVEC (swap_web_entry, get_max_uid ()); >> > >> > /* Walk the insns to gather basic data. 
*/ >> > FOR_ALL_BB_FN (bb, fun) >> > - FOR_BB_INSNS (bb, insn) >> > + FOR_BB_INSNS_SAFE (bb, insn, curr_insn) >> > { >> > unsigned int uid = INSN_UID (insn); >> > if (NONDEBUG_INSN_P (insn)) >> > Index: gcc/config/rs6000/vector.md >> > =================================================================== >> > --- gcc/config/rs6000/vector.md (revision 235090) >> > +++ gcc/config/rs6000/vector.md (working copy) >> > @@ -167,7 +167,14 @@ >> > if (VECTOR_MEM_VSX_P (<MODE>mode)) >> > { >> > operands[1] = rs6000_address_for_altivec (operands[1]); >> > - emit_insn (gen_altivec_lvx_<mode> (operands[0], operands[1])); >> > + rtx and_op = XEXP (operands[1], 0); >> > + gcc_assert (GET_CODE (and_op) == AND); >> > + rtx addr = XEXP (and_op, 0); >> > + if (GET_CODE (addr) == PLUS) >> > + emit_insn (gen_altivec_lvx_<mode>_2op (operands[0], XEXP (addr, >> > 0), >> > + XEXP (addr, 1))); >> > + else >> > + emit_insn (gen_altivec_lvx_<mode>_1op (operands[0], operands[1])); >> > DONE; >> > } >> > }") >> > @@ -183,7 +190,14 @@ >> > if (VECTOR_MEM_VSX_P (<MODE>mode)) >> > { >> > operands[0] = rs6000_address_for_altivec (operands[0]); >> > - emit_insn (gen_altivec_stvx_<mode> (operands[0], operands[1])); >> > + rtx and_op = XEXP (operands[0], 0); >> > + gcc_assert (GET_CODE (and_op) == AND); >> > + rtx addr = XEXP (and_op, 0); >> > + if (GET_CODE (addr) == PLUS) >> > + emit_insn (gen_altivec_stvx_<mode>_2op (operands[1], XEXP (addr, >> > 0), >> > + XEXP (addr, 1))); >> > + else >> > + emit_insn (gen_altivec_stvx_<mode>_1op (operands[1], >> > operands[0])); >> > DONE; >> > } >> > }") >> > >> > >> >> > >
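
[For reference, a minimal sketch of the fold_build2 approach Richard suggests above, applied to the ALTIVEC_BUILTIN_VEC_LD handling in the patch: build the masked address with POINTER_PLUS_EXPR and a pointer-typed BIT_AND_EXPR so no round trip through sizetype is needed.  The names loc, arg0 (byte offset) and arg1 (pointer argument) follow the patch context; the snippet is untested and only illustrates the API shapes involved, not a drop-in replacement.]

    /* Sketch only: masked address for vec_ld, assuming arg0 is an
       INTEGER_CST byte offset and arg1 the pointer operand.  */
    tree ptr_type = TREE_TYPE (arg1);

    /* POINTER_PLUS_EXPR keeps the result in pointer type (so restrict/
       alias information survives) and does not scale the offset by the
       pointed-to size, matching the vec_ld/vec_st byte-offset semantics
       described in the patch comment.  Its second operand must be
       sizetype.  */
    tree addr = fold_build2_loc (loc, POINTER_PLUS_EXPR, ptr_type,
                                 arg1, fold_convert (sizetype, arg0));

    /* BIT_AND_EXPR applied to the pointer type itself; per Richard's
       note, the -16 mask must be built in the same type as ADDR.  This
       expresses the 16-byte truncation lvx/stvx perform:
       (arg1 + arg0) & -16.  */
    tree aligned = fold_build2_loc (loc, BIT_AND_EXPR, ptr_type, addr,
                                    build_int_cst (ptr_type, -16));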