On Wed, Nov 14, 2018 at 5:43 PM Aaron Sawdey <acsaw...@linux.ibm.com> wrote: > > This patch generalizes some of the functions added earlier to do vsx expansion > of strncmp > so that they can also generate the code needed for memcmp. I reorganized > expand_block_compare() a little to be able to make use of this there. The vsx > code is more > compact so I've changed the default block compare inline limit to 63 bytes. > The vsx > code is only used if there are at least 16 bytes to compare, as this means we > don't have to > do complex code to compare less than one chunk. If vsx is not available, the > limit is cut > in half. The performance is good: vsx memcmp is considerably faster than the > gpr inline code > if the strings are equal, and is comparable if the strings have a 10% chance > of being > equal (spread across the string).
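To make sure I'm reading the expansion strategy correctly, here is a minimal C model of the shape of the generated code (my own sketch, not code from the patch; vec_cmp16 is a hypothetical stand-in for the 16-byte vector compare plus cleanup). The point of the >= 16 byte requirement shows up at the tail: any remainder is handled by one overlapping 16-byte compare that ends exactly at the last byte, so no sub-chunk code is needed.

  #include <stddef.h>

  /* Hypothetical stand-in for the 16-byte vector compare and the
     cleanup code that turns a mismatch into a byte difference.  */
  static int
  vec_cmp16 (const unsigned char *a, const unsigned char *b)
  {
    for (int i = 0; i < 16; i++)
      if (a[i] != b[i])
        return a[i] - b[i];
    return 0;
  }

  /* Model of the inline vsx memcmp expansion, valid for n >= 16.  */
  static int
  memcmp_model (const unsigned char *s1, const unsigned char *s2, size_t n)
  {
    size_t off = 0;
    while (n - off >= 16)
      {
        int d = vec_cmp16 (s1 + off, s2 + off);
        if (d != 0)
          return d;
        off += 16;
      }
    if (off < n)
      /* Overlapping tail: back up so the final compare ends at
         byte n.  Safe because n >= 16.  */
      return vec_cmp16 (s1 + n - 16, s2 + n - 16);
    return 0;
  }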
How is performance affected if there are close earlier char-size stores to one of the string/memory? Can power still do store forwarding in this case? > Currently regtesting, ok for trunk if tests pass? > > Thanks! > Aaron > > 2018-11-14 Aaron Sawdey <acsaw...@linux.ibm.com> > > * config/rs6000/rs6000-string.c (emit_vsx_zero_reg): New function. > (expand_cmp_vec_sequence): Rename and modify > expand_strncmp_vec_sequence. > (emit_final_compare_vec): Rename and modify > emit_final_str_compare_vec. > (generate_6432_conversion): New function. > (expand_block_compare): Add support for vsx. > (expand_block_compare_gpr): New function. > * config/rs6000/rs6000.opt (rs6000_block_compare_inline_limit): > Increase > default limit to 63 because of more compact vsx code. > > > > > Index: gcc/config/rs6000/rs6000-string.c > =================================================================== > --- gcc/config/rs6000/rs6000-string.c (revision 266034) > +++ gcc/config/rs6000/rs6000-string.c (working copy) > @@ -615,6 +615,283 @@ > } > } > > +static rtx > +emit_vsx_zero_reg() > +{ > + unsigned int i; > + rtx zr[16]; > + for (i = 0; i < 16; i++) > + zr[i] = GEN_INT (0); > + rtvec zv = gen_rtvec_v (16, zr); > + rtx zero_reg = gen_reg_rtx (V16QImode); > + rs6000_expand_vector_init (zero_reg, gen_rtx_PARALLEL (V16QImode, zv)); > + return zero_reg; > +} > + > +/* Generate the sequence of compares for strcmp/strncmp using vec/vsx > + instructions. > + > + BYTES_TO_COMPARE is the number of bytes to be compared. > + ORIG_SRC1 is the unmodified rtx for the first string. > + ORIG_SRC2 is the unmodified rtx for the second string. > + S1ADDR is the register to use for the base address of the first string. > + S2ADDR is the register to use for the base address of the second string. > + OFF_REG is the register to use for the string offset for loads. > + S1DATA is the register for loading the first string. > + S2DATA is the register for loading the second string. > + VEC_RESULT is the rtx for the vector result indicating the byte > difference. > + EQUALITY_COMPARE_REST is a flag to indicate we need to make a cleanup call > + to strcmp/strncmp if we have equality at the end of the inline comparison. > + P_CLEANUP_LABEL is a pointer to rtx for a label we generate if we need > code > + to clean up and generate the final comparison result. > + FINAL_MOVE_LABEL is rtx for a label we can branch to when we can just > + set the final result. > + CHECKZERO indicates whether the sequence should check for zero bytes > + for use doing strncmp, or not (for use doing memcmp). 
*/ > +static void > +expand_cmp_vec_sequence (unsigned HOST_WIDE_INT bytes_to_compare, > + rtx orig_src1, rtx orig_src2, > + rtx s1addr, rtx s2addr, rtx off_reg, > + rtx s1data, rtx s2data, rtx vec_result, > + bool equality_compare_rest, rtx *p_cleanup_label, > + rtx final_move_label, bool checkzero) > +{ > + machine_mode load_mode; > + unsigned int load_mode_size; > + unsigned HOST_WIDE_INT cmp_bytes = 0; > + unsigned HOST_WIDE_INT offset = 0; > + rtx zero_reg = NULL; > + > + gcc_assert (p_cleanup_label != NULL); > + rtx cleanup_label = *p_cleanup_label; > + > + emit_move_insn (s1addr, force_reg (Pmode, XEXP (orig_src1, 0))); > + emit_move_insn (s2addr, force_reg (Pmode, XEXP (orig_src2, 0))); > + > + if (checkzero && !TARGET_P9_VECTOR) > + zero_reg = emit_vsx_zero_reg(); > + > + while (bytes_to_compare > 0) > + { > + /* VEC/VSX compare sequence for P8: > + check each 16B with: > + lxvd2x 32,28,8 > + lxvd2x 33,29,8 > + vcmpequb 2,0,1 # compare strings > + vcmpequb 4,0,3 # compare w/ 0 > + xxlorc 37,36,34 # first FF byte is either mismatch or end of > string > + vcmpequb. 7,5,3 # reg 7 contains 0 > + bnl 6,.Lmismatch > + > + For the P8 LE case, we use lxvd2x and compare full 16 bytes > + but then use vgbbd and a shift to get two bytes with the > + information we need in the correct order. > + > + VEC/VSX compare sequence if TARGET_P9_VECTOR: > + lxvb16x/lxvb16x # load 16B of each string > + vcmpnezb. # produces difference location or zero byte > location > + bne 6,.Lmismatch > + > + Use the overlapping compare trick for the last block if it is > + less than 16 bytes. > + */ > + > + load_mode = V16QImode; > + load_mode_size = GET_MODE_SIZE (load_mode); > + > + if (bytes_to_compare >= load_mode_size) > + cmp_bytes = load_mode_size; > + else > + { > + /* Move this load back so it doesn't go past the end. P8/P9 > + can do this efficiently. This is never called with less > + than 16 bytes so we should always be able to do this. */ > + unsigned int extra_bytes = load_mode_size - bytes_to_compare; > + cmp_bytes = bytes_to_compare; > + gcc_assert (offset > extra_bytes); > + offset -= extra_bytes; > + cmp_bytes = load_mode_size; > + bytes_to_compare = cmp_bytes; > + } > + > + /* The offset currently used is always kept in off_reg so that the > + cleanup code on P8 can use it to extract the differing byte. */ > + emit_move_insn (off_reg, GEN_INT (offset)); > + > + rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg); > + do_load_for_compare_from_addr (load_mode, s1data, addr1, orig_src1); > + rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg); > + do_load_for_compare_from_addr (load_mode, s2data, addr2, orig_src2); > + > + /* Cases to handle. A and B are chunks of the two strings. > + 1: Not end of comparison: > + A != B: branch to cleanup code to compute result. > + A == B: next block > + 2: End of the inline comparison: > + A != B: branch to cleanup code to compute result. > + A == B: call strcmp/strncmp > + 3: compared requested N bytes: > + A == B: branch to result 0. > + A != B: cleanup code to compute result. */ > + > + unsigned HOST_WIDE_INT remain = bytes_to_compare - cmp_bytes; > + > + if (checkzero) > + { > + if (TARGET_P9_VECTOR) > + emit_insn (gen_vcmpnezb_p (vec_result, s1data, s2data)); > + else > + { > + /* Emit instructions to do comparison and zero check.
*/ > + rtx cmp_res = gen_reg_rtx (load_mode); > + rtx cmp_zero = gen_reg_rtx (load_mode); > + rtx cmp_combined = gen_reg_rtx (load_mode); > + emit_insn (gen_altivec_eqv16qi (cmp_res, s1data, s2data)); > + emit_insn (gen_altivec_eqv16qi (cmp_zero, s1data, zero_reg)); > + emit_insn (gen_orcv16qi3 (vec_result, cmp_zero, cmp_res)); > + emit_insn (gen_altivec_vcmpequb_p (cmp_combined, vec_result, > zero_reg)); > + } > + } > + else > + emit_insn (gen_altivec_vcmpequb_p (vec_result, s1data, s2data)); > + > + bool branch_to_cleanup = (remain > 0 || equality_compare_rest); > + rtx cr6 = gen_rtx_REG (CCmode, CR6_REGNO); > + rtx dst_label; > + rtx cmp_rtx; > + if (branch_to_cleanup) > + { > + /* Branch to cleanup code, otherwise fall through to do more > + compares. P8 and P9 use different CR bits because on P8 > + we are looking at the result of a comparison vs a > + register of zeroes so the all-true condition means no > + difference or zero was found. On P9, vcmpnezb sets a byte > + to 0xff if there is a mismatch or zero, so the all-false > + condition indicates we found no difference or zero. */ > + if (!cleanup_label) > + cleanup_label = gen_label_rtx (); > + dst_label = cleanup_label; > + if (TARGET_P9_VECTOR && checkzero) > + cmp_rtx = gen_rtx_NE (VOIDmode, cr6, const0_rtx); > + else > + cmp_rtx = gen_rtx_GE (VOIDmode, cr6, const0_rtx); > + } > + else > + { > + /* Branch to final return or fall through to cleanup, > + result is already set to 0. */ > + dst_label = final_move_label; > + if (TARGET_P9_VECTOR && checkzero) > + cmp_rtx = gen_rtx_EQ (VOIDmode, cr6, const0_rtx); > + else > + cmp_rtx = gen_rtx_LT (VOIDmode, cr6, const0_rtx); > + } > + > + rtx lab_ref = gen_rtx_LABEL_REF (VOIDmode, dst_label); > + rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, cmp_rtx, > + lab_ref, pc_rtx); > + rtx j2 = emit_jump_insn (gen_rtx_SET (pc_rtx, ifelse)); > + JUMP_LABEL (j2) = dst_label; > + LABEL_NUSES (dst_label) += 1; > + > + offset += cmp_bytes; > + bytes_to_compare -= cmp_bytes; > + } > + *p_cleanup_label = cleanup_label; > + return; > +} > + > +/* Generate the final sequence that identifies the differing > + byte and generates the final result, taking into account > + zero bytes: > + > + P8: > + vgbbd 0,0 > + vsldoi 0,0,0,9 > + mfvsrd 9,32 > + addi 10,9,-1 # count trailing zero bits > + andc 9,10,9 > + popcntd 9,9 > + lbzx 10,28,9 # use that offset to load differing byte > + lbzx 3,29,9 > + subf 3,3,10 # subtract for final result > + > + P9: > + vclzlsbb # counts trailing bytes with lsb=0 > + vextublx # extract differing byte > + > + STR1 is the reg rtx for data from string 1. > + STR2 is the reg rtx for data from string 2. > + RESULT is the reg rtx for the comparison result. > + S1ADDR is the register to use for the base address of the first string. > + S2ADDR is the register to use for the base address of the second string. > + ORIG_SRC1 is the unmodified rtx for the first string. > + ORIG_SRC2 is the unmodified rtx for the second string. > + OFF_REG is the register to use for the string offset for loads. > + VEC_RESULT is the rtx for the vector result indicating the byte > difference.
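The P8 cleanup sequence quoted above is clever; to record my understanding, here is a little-endian C model of how the vgbbd output turns into a byte index (my own sketch and naming, not the actual rtl). Each byte of vec_result is 00 or ff, vgbbd/vsldoi/mfvsrd reduce that to one bit per byte, and the addi/andc/popcntd trio is a trailing-zero count:

  #include <stdint.h>

  /* LE model: gather one bit per 00/ff byte of vec_result, then
     count trailing zeros via the (x - 1) & ~x mask as in the
     quoted addi/andc/popcntd sequence.  Assumes at least one ff
     byte, which holds on the cleanup path.  */
  static int
  first_special_byte_le (const uint8_t vec_result[16])
  {
    uint64_t mask = 0;
    for (int i = 0; i < 16; i++)
      if (vec_result[i])                /* each byte is 00 or ff */
        mask |= UINT64_C (1) << i;      /* vgbbd-style bit gather */
    return __builtin_popcountll ((mask - 1) & ~mask);  /* ctz */
  }

The resulting index is then added to off_reg, as in the function below, to reload the differing byte from each string.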
*/ > + > +static void > +emit_final_compare_vec (rtx str1, rtx str2, rtx result, > + rtx s1addr, rtx s2addr, > + rtx orig_src1, rtx orig_src2, > + rtx off_reg, rtx vec_result) > +{ > + > + if (TARGET_P9_VECTOR) > + { > + rtx diffix = gen_reg_rtx (SImode); > + rtx chr1 = gen_reg_rtx (SImode); > + rtx chr2 = gen_reg_rtx (SImode); > + rtx chr1_di = simplify_gen_subreg (DImode, chr1, SImode, 0); > + rtx chr2_di = simplify_gen_subreg (DImode, chr2, SImode, 0); > + emit_insn (gen_vclzlsbb_v16qi (diffix, vec_result)); > + emit_insn (gen_vextublx (chr1, diffix, str1)); > + emit_insn (gen_vextublx (chr2, diffix, str2)); > + do_sub3 (result, chr1_di, chr2_di); > + } > + else > + { > + gcc_assert (TARGET_P8_VECTOR); > + rtx diffix = gen_reg_rtx (DImode); > + rtx result_gbbd = gen_reg_rtx (V16QImode); > + /* Since each byte of the input is either 00 or FF, the bytes in > + dw0 and dw1 after vgbbd are all identical to each other. */ > + emit_insn (gen_p8v_vgbbd (result_gbbd, vec_result)); > + /* For LE, we shift by 9 and get BA in the low two bytes then CTZ. > + For BE, we shift by 7 and get AB in the high two bytes then CLZ. */ > + rtx result_shifted = gen_reg_rtx (V16QImode); > + int shift_amt = (BYTES_BIG_ENDIAN) ? 7 : 9; > + emit_insn (gen_altivec_vsldoi_v16qi > (result_shifted,result_gbbd,result_gbbd, GEN_INT (shift_amt))); > + > + rtx diffix_df = simplify_gen_subreg (DFmode, diffix, DImode, 0); > + emit_insn (gen_p8_mfvsrd_3_v16qi (diffix_df, result_shifted)); > + rtx count = gen_reg_rtx (DImode); > + > + if (BYTES_BIG_ENDIAN) > + emit_insn (gen_clzdi2 (count, diffix)); > + else > + emit_insn (gen_ctzdi2 (count, diffix)); > + > + /* P8 doesn't have a good solution for extracting one byte from > + a vsx reg like vextublx on P9 so we just compute the offset > + of the differing byte and load it from each string. */ > + do_add3 (off_reg, off_reg, count); > + > + rtx chr1 = gen_reg_rtx (QImode); > + rtx chr2 = gen_reg_rtx (QImode); > + rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg); > + do_load_for_compare_from_addr (QImode, chr1, addr1, orig_src1); > + rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg); > + do_load_for_compare_from_addr (QImode, chr2, addr2, orig_src2); > + machine_mode rmode = GET_MODE (result); > + rtx chr1_rm = simplify_gen_subreg (rmode, chr1, QImode, 0); > + rtx chr2_rm = simplify_gen_subreg (rmode, chr2, QImode, 0); > + do_sub3 (result, chr1_rm, chr2_rm); > + } > + > + return; > +} > + > /* Expand a block compare operation using loop code, and return true > if successful. Return false if we should let the compiler generate > normal code, probably a memcmp call. > @@ -1343,106 +1620,80 @@ > return true; > } > > -/* Expand a block compare operation, and return true if successful. > - Return false if we should let the compiler generate normal code, > - probably a memcmp call. > +/* Generate code to convert a DImode-plus-carry subtract result into > + a SImode result that has the same <0 / ==0 / >0 properties to > + produce the final result from memcmp. > > - OPERANDS[0] is the target (result). > - OPERANDS[1] is the first source. > - OPERANDS[2] is the second source. > - OPERANDS[3] is the length. > - OPERANDS[4] is the alignment. */ > -bool > -expand_block_compare (rtx operands[]) > + TARGET is the rtx for the register to receive the memcmp result. > + SUB_RESULT is the rtx for the register containing the subtract result.
*/ > + > +void > +generate_6432_conversion(rtx target, rtx sub_result) > { > - rtx target = operands[0]; > - rtx orig_src1 = operands[1]; > - rtx orig_src2 = operands[2]; > - rtx bytes_rtx = operands[3]; > - rtx align_rtx = operands[4]; > - HOST_WIDE_INT cmp_bytes = 0; > - rtx src1 = orig_src1; > - rtx src2 = orig_src2; > + /* We need to produce DI result from sub, then convert to target SI > + while maintaining <0 / ==0 / >0 properties. This sequence works: > + subfc L,A,B > + subfe H,H,H > + popcntd L,L > + rldimi L,H,6,0 > > - /* This case is complicated to handle because the subtract > - with carry instructions do not generate the 64-bit > - carry and so we must emit code to calculate it ourselves. > - We choose not to implement this yet. */ > - if (TARGET_32BIT && TARGET_POWERPC64) > - return false; > + This is an alternate one Segher cooked up if somebody > + wants to expand this for something that doesn't have popcntd: > + subfc L,a,b > + subfe H,x,x > + addic t,L,-1 > + subfe v,t,L > + or z,v,H > > - bool isP7 = (rs6000_tune == PROCESSOR_POWER7); > + And finally, p9 can just do this: > + cmpld A,B > + setb r */ > > - /* Allow this param to shut off all expansion. */ > - if (rs6000_block_compare_inline_limit == 0) > - return false; > - > - /* targetm.slow_unaligned_access -- don't do unaligned stuff. > - However slow_unaligned_access returns true on P7 even though the > - performance of this code is good there. */ > - if (!isP7 > - && (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1)) > - || targetm.slow_unaligned_access (word_mode, MEM_ALIGN > (orig_src2)))) > - return false; > - > - /* Unaligned l*brx traps on P7 so don't do this. However this should > - not affect much because LE isn't really supported on P7 anyway. */ > - if (isP7 && !BYTES_BIG_ENDIAN) > - return false; > - > - /* If this is not a fixed size compare, try generating loop code and > - if that fails just call memcmp. */ > - if (!CONST_INT_P (bytes_rtx)) > - return expand_compare_loop (operands); > - > - /* This must be a fixed size alignment. */ > - if (!CONST_INT_P (align_rtx)) > - return false; > - > - unsigned int base_align = UINTVAL (align_rtx) / BITS_PER_UNIT; > - > - gcc_assert (GET_MODE (target) == SImode); > - > - /* Anything to move? */ > - unsigned HOST_WIDE_INT bytes = UINTVAL (bytes_rtx); > - if (bytes == 0) > - return true; > - > - rtx tmp_reg_src1 = gen_reg_rtx (word_mode); > - rtx tmp_reg_src2 = gen_reg_rtx (word_mode); > - /* P7/P8 code uses cond for subfc. but P9 uses > - it for cmpld which needs CCUNSmode. */ > - rtx cond; > - if (TARGET_P9_MISC) > - cond = gen_reg_rtx (CCUNSmode); > + if (TARGET_64BIT) > + { > + rtx tmp_reg_ca = gen_reg_rtx (DImode); > + emit_insn (gen_subfdi3_carry_in_xx (tmp_reg_ca)); > + rtx popcnt = gen_reg_rtx (DImode); > + emit_insn (gen_popcntddi2 (popcnt, sub_result)); > + rtx tmp2 = gen_reg_rtx (DImode); > + emit_insn (gen_iordi3 (tmp2, popcnt, tmp_reg_ca)); > + emit_insn (gen_movsi (target, gen_lowpart (SImode, tmp2))); > + } > else > - cond = gen_reg_rtx (CCmode); > + { > + rtx tmp_reg_ca = gen_reg_rtx (SImode); > + emit_insn (gen_subfsi3_carry_in_xx (tmp_reg_ca)); > + rtx popcnt = gen_reg_rtx (SImode); > + emit_insn (gen_popcntdsi2 (popcnt, sub_result)); > + emit_insn (gen_iorsi3 (target, popcnt, tmp_reg_ca)); > + } > +} > > - /* Strategy phase. How many ops will this take and should we expand it? > */ > +/* Generate memcmp expansion using in-line non-loop GPR instructions. 
> + The bool return indicates whether code for a 64->32 conversion > + should be generated. > > - unsigned HOST_WIDE_INT offset = 0; > - machine_mode load_mode = > - select_block_compare_mode (offset, bytes, base_align); > - unsigned int load_mode_size = GET_MODE_SIZE (load_mode); > + BYTES is the number of bytes to be compared. > + BASE_ALIGN is the minimum alignment for both blocks to compare. > + ORIG_SRC1 is the original pointer to the first block to compare. > + ORIG_SRC2 is the original pointer to the second block to compare. > + SUB_RESULT is the reg rtx for the result from the final subtract. > + COND is rtx for a condition register that will be used for the final > + compare on power9 or better. > + FINAL_RESULT is the reg rtx for the final memcmp result. > + P_CONVERT_LABEL is a pointer to rtx that will be used to store the > + label generated for a branch to the 64->32 code, if such a branch > + is needed. > + P_FINAL_LABEL is a pointer to rtx that will be used to store the label > + for the end of the memcmp if a branch there is needed. > +*/ > > - /* We don't want to generate too much code. The loop code can take > - over for lengths greater than 31 bytes. */ > - unsigned HOST_WIDE_INT max_bytes = rs6000_block_compare_inline_limit; > - if (!IN_RANGE (bytes, 1, max_bytes)) > - return expand_compare_loop (operands); > - > - /* The code generated for p7 and older is not faster than glibc > - memcmp if alignment is small and length is not short, so bail > - out to avoid those conditions. */ > - if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED > - && ((base_align == 1 && bytes > 16) > - || (base_align == 2 && bytes > 32))) > - return false; > - > - bool generate_6432_conversion = false; > - rtx convert_label = NULL; > - rtx final_label = NULL; > - > +bool > +expand_block_compare_gpr(unsigned HOST_WIDE_INT bytes, unsigned int > base_align, > + rtx orig_src1, rtx orig_src2, > + rtx sub_result, rtx cond, rtx final_result, > + rtx *p_convert_label, rtx *p_final_label) > +{ > /* Example of generated code for 18 bytes aligned 1 byte. > Compiled with -fno-reorder-blocks for clarity. > ldbrx 10,31,8 > @@ -1473,6 +1724,18 @@ > if the difference is found there, then a final block of HImode that > skips > the DI->SI conversion. */ > > + unsigned HOST_WIDE_INT offset = 0; > + unsigned int load_mode_size; > + HOST_WIDE_INT cmp_bytes = 0; > + rtx src1 = orig_src1; > + rtx src2 = orig_src2; > + rtx tmp_reg_src1 = gen_reg_rtx (word_mode); > + rtx tmp_reg_src2 = gen_reg_rtx (word_mode); > + bool need_6432_conv = false; > + rtx convert_label = NULL; > + rtx final_label = NULL; > + machine_mode load_mode; > + > while (bytes > 0) > { > unsigned int align = compute_current_alignment (base_align, offset); > @@ -1536,15 +1799,15 @@ > } > > int remain = bytes - cmp_bytes; > - if (GET_MODE_SIZE (GET_MODE (target)) > GET_MODE_SIZE (load_mode)) > + if (GET_MODE_SIZE (GET_MODE (final_result)) > GET_MODE_SIZE > (load_mode)) > { > - /* Target is larger than load size so we don't need to > + /* Final_result is larger than load size so we don't need to > reduce result size. */ > > /* We previously did a block that need 64->32 conversion but > the current block does not, so a label is needed to jump > to the end. 
*/ > - if (generate_6432_conversion && !final_label) > + if (need_6432_conv && !final_label) > final_label = gen_label_rtx (); > > if (remain > 0) > @@ -1557,7 +1820,7 @@ > rtx tmp = gen_rtx_MINUS (word_mode, tmp_reg_src1, tmp_reg_src2); > rtx cr = gen_reg_rtx (CCmode); > rs6000_emit_dot_insn (tmp_reg_src2, tmp, 2, cr); > - emit_insn (gen_movsi (target, > + emit_insn (gen_movsi (final_result, > gen_lowpart (SImode, tmp_reg_src2))); > rtx ne_rtx = gen_rtx_NE (VOIDmode, cr, const0_rtx); > rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, ne_rtx, > @@ -1572,11 +1835,11 @@ > { > emit_insn (gen_subdi3 (tmp_reg_src2, tmp_reg_src1, > tmp_reg_src2)); > - emit_insn (gen_movsi (target, > + emit_insn (gen_movsi (final_result, > gen_lowpart (SImode, tmp_reg_src2))); > } > else > - emit_insn (gen_subsi3 (target, tmp_reg_src1, tmp_reg_src2)); > + emit_insn (gen_subsi3 (final_result, tmp_reg_src1, > tmp_reg_src2)); > > if (final_label) > { > @@ -1591,9 +1854,9 @@ > else > { > /* Do we need a 64->32 conversion block? We need the 64->32 > - conversion even if target size == load_mode size because > + conversion even if final_result size == load_mode size because > the subtract generates one extra bit. */ > - generate_6432_conversion = true; > + need_6432_conv = true; > > if (remain > 0) > { > @@ -1604,20 +1867,27 @@ > rtx cvt_ref = gen_rtx_LABEL_REF (VOIDmode, convert_label); > if (TARGET_P9_MISC) > { > - /* Generate a compare, and convert with a setb later. */ > + /* Generate a compare, and convert with a setb later. > + Use cond that is passed in because the caller needs > + to use it for the 64->32 conversion later. */ > rtx cmp = gen_rtx_COMPARE (CCUNSmode, tmp_reg_src1, > tmp_reg_src2); > emit_insn (gen_rtx_SET (cond, cmp)); > } > else > - /* Generate a subfc. and use the longer > - sequence for conversion. */ > - if (TARGET_64BIT) > - emit_insn (gen_subfdi3_carry_dot2 (tmp_reg_src2, > tmp_reg_src2, > - tmp_reg_src1, cond)); > - else > - emit_insn (gen_subfsi3_carry_dot2 (tmp_reg_src2, > tmp_reg_src2, > - tmp_reg_src1, cond)); > + { > + /* Generate a subfc. and use the longer sequence for > + conversion. Cond is not used outside this > + function in this case. */ > + cond = gen_reg_rtx (CCmode); > + if (TARGET_64BIT) > + emit_insn (gen_subfdi3_carry_dot2 (sub_result, > tmp_reg_src2, > + tmp_reg_src1, cond)); > + else > + emit_insn (gen_subfsi3_carry_dot2 (sub_result, > tmp_reg_src2, > + tmp_reg_src1, cond)); > + } > + > rtx ne_rtx = gen_rtx_NE (VOIDmode, cond, const0_rtx); > rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, ne_rtx, > cvt_ref, pc_rtx); > @@ -1637,10 +1907,10 @@ > } > else > if (TARGET_64BIT) > - emit_insn (gen_subfdi3_carry (tmp_reg_src2, tmp_reg_src2, > + emit_insn (gen_subfdi3_carry (sub_result, tmp_reg_src2, > tmp_reg_src1)); > else > - emit_insn (gen_subfsi3_carry (tmp_reg_src2, tmp_reg_src2, > + emit_insn (gen_subfsi3_carry (sub_result, tmp_reg_src2, > tmp_reg_src1)); > } > } > @@ -1649,51 +1919,162 @@ > bytes -= cmp_bytes; > } > > - if (generate_6432_conversion) > + if (convert_label) > + *p_convert_label = convert_label; > + if (final_label) > + *p_final_label = final_label; > + return need_6432_conv; > +} > + > +/* Expand a block compare operation, and return true if successful. > + Return false if we should let the compiler generate normal code, > + probably a memcmp call. > + > + OPERANDS[0] is the target (result). > + OPERANDS[1] is the first source. > + OPERANDS[2] is the second source. > + OPERANDS[3] is the length. > + OPERANDS[4] is the alignment. 
*/ > +bool > +expand_block_compare (rtx operands[]) > +{ > + rtx target = operands[0]; > + rtx orig_src1 = operands[1]; > + rtx orig_src2 = operands[2]; > + rtx bytes_rtx = operands[3]; > + rtx align_rtx = operands[4]; > + > + /* This case is complicated to handle because the subtract > + with carry instructions do not generate the 64-bit > + carry and so we must emit code to calculate it ourselves. > + We choose not to implement this yet. */ > + if (TARGET_32BIT && TARGET_POWERPC64) > + return false; > + > + bool isP7 = (rs6000_tune == PROCESSOR_POWER7); > + > + /* Allow this param to shut off all expansion. */ > + if (rs6000_block_compare_inline_limit == 0) > + return false; > + > + /* targetm.slow_unaligned_access -- don't do unaligned stuff. > + However slow_unaligned_access returns true on P7 even though the > + performance of this code is good there. */ > + if (!isP7 > + && (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1)) > + || targetm.slow_unaligned_access (word_mode, MEM_ALIGN > (orig_src2)))) > + return false; > + > + /* Unaligned l*brx traps on P7 so don't do this. However this should > + not affect much because LE isn't really supported on P7 anyway. */ > + if (isP7 && !BYTES_BIG_ENDIAN) > + return false; > + > + /* If this is not a fixed size compare, try generating loop code and > + if that fails just call memcmp. */ > + if (!CONST_INT_P (bytes_rtx)) > + return expand_compare_loop (operands); > + > + /* This must be a fixed size alignment. */ > + if (!CONST_INT_P (align_rtx)) > + return false; > + > + unsigned int base_align = UINTVAL (align_rtx) / BITS_PER_UNIT; > + > + gcc_assert (GET_MODE (target) == SImode); > + > + /* Anything to move? */ > + unsigned HOST_WIDE_INT bytes = UINTVAL (bytes_rtx); > + if (bytes == 0) > + return true; > + > + /* P7/P8 code uses cond for subfc. but P9 uses > + it for cmpld which needs CCUNSmode. */ > + rtx cond = NULL; > + if (TARGET_P9_MISC) > + cond = gen_reg_rtx (CCUNSmode); > + > + /* Is it OK to use vec/vsx for this. TARGET_VSX means we have at > + least POWER7 but we use TARGET_EFFICIENT_UNALIGNED_VSX which is > + at least POWER8. That way we can rely on overlapping compares to > + do the final comparison of less than 16 bytes. Also I do not > + want to deal with making this work for 32 bits. In addition, we > + have to make sure that we have at least P8_VECTOR (we don't allow > + P9_VECTOR without P8_VECTOR). */ > + int use_vec = (bytes >= 16 && !TARGET_32BIT > + && TARGET_EFFICIENT_UNALIGNED_VSX && TARGET_P8_VECTOR); > + > + /* We don't want to generate too much code. The loop code can take > + over for lengths greater than 31 bytes. */ > + unsigned HOST_WIDE_INT max_bytes = rs6000_block_compare_inline_limit; > + > + /* Don't generate too much code if vsx was disabled. */ > + if (!use_vec && max_bytes > 1) > + max_bytes = ((max_bytes + 1) / 2) - 1; > + > + if (!IN_RANGE (bytes, 1, max_bytes)) > + return expand_compare_loop (operands); > + > + /* The code generated for p7 and older is not faster than glibc > + memcmp if alignment is small and length is not short, so bail > + out to avoid those conditions. 
*/ > + if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED > + && ((base_align == 1 && bytes > 16) > + || (base_align == 2 && bytes > 32))) > + return false; > + > + rtx final_label = NULL; > + > + if (use_vec) > { > - if (convert_label) > - emit_label (convert_label); > + rtx final_move_label = gen_label_rtx (); > + rtx s1addr = gen_reg_rtx (Pmode); > + rtx s2addr = gen_reg_rtx (Pmode); > + rtx off_reg = gen_reg_rtx (Pmode); > + rtx cleanup_label = NULL; > + rtx vec_result = gen_reg_rtx (V16QImode); > + rtx s1data = gen_reg_rtx (V16QImode); > + rtx s2data = gen_reg_rtx (V16QImode); > + rtx result_reg = gen_reg_rtx (word_mode); > + emit_move_insn (result_reg, GEN_INT (0)); > > - /* We need to produce DI result from sub, then convert to target SI > - while maintaining <0 / ==0 / >0 properties. This sequence works: > - subfc L,A,B > - subfe H,H,H > - popcntd L,L > - rldimi L,H,6,0 > + expand_cmp_vec_sequence (bytes, orig_src1, orig_src2, > + s1addr, s2addr, off_reg, s1data, s2data, > + vec_result, false, > + &cleanup_label, final_move_label, false); > > - This is an alternate one Segher cooked up if somebody > - wants to expand this for something that doesn't have popcntd: > - subfc L,a,b > - subfe H,x,x > - addic t,L,-1 > - subfe v,t,L > - or z,v,H > + if (cleanup_label) > + emit_label (cleanup_label); > > - And finally, p9 can just do this: > - cmpld A,B > - setb r */ > + emit_insn (gen_one_cmplv16qi2 (vec_result, vec_result)); > > - if (TARGET_P9_MISC) > + emit_final_compare_vec (s1data, s2data, result_reg, > + s1addr, s2addr, orig_src1, orig_src2, > + off_reg, vec_result); > + > + emit_label (final_move_label); > + emit_insn (gen_movsi (target, > + gen_lowpart (SImode, result_reg))); > + } > + else > + { /* generate GPR code */ > + > + rtx convert_label = NULL; > + rtx sub_result = gen_reg_rtx (word_mode); > + bool need_6432_conversion = > + expand_block_compare_gpr(bytes, base_align, > + orig_src1, orig_src2, > + sub_result, cond, target, > + &convert_label, &final_label); > + > + if (need_6432_conversion) > { > - emit_insn (gen_setb_unsigned (target, cond)); > - } > - else > - { > - if (TARGET_64BIT) > - { > - rtx tmp_reg_ca = gen_reg_rtx (DImode); > - emit_insn (gen_subfdi3_carry_in_xx (tmp_reg_ca)); > - emit_insn (gen_popcntddi2 (tmp_reg_src2, tmp_reg_src2)); > - emit_insn (gen_iordi3 (tmp_reg_src2, tmp_reg_src2, tmp_reg_ca)); > - emit_insn (gen_movsi (target, gen_lowpart (SImode, > tmp_reg_src2))); > - } > + if (convert_label) > + emit_label (convert_label); > + if (TARGET_P9_MISC) > + emit_insn (gen_setb_unsigned (target, cond)); > else > - { > - rtx tmp_reg_ca = gen_reg_rtx (SImode); > - emit_insn (gen_subfsi3_carry_in_xx (tmp_reg_ca)); > - emit_insn (gen_popcntdsi2 (tmp_reg_src2, tmp_reg_src2)); > - emit_insn (gen_iorsi3 (target, tmp_reg_src2, tmp_reg_ca)); > - } > + generate_6432_conversion(target, sub_result); > } > } > > @@ -1700,7 +2081,6 @@ > if (final_label) > emit_label (final_label); > > - gcc_assert (bytes == 0); > return true; > } > > @@ -1808,7 +2188,7 @@ > } > rtx addr1 = gen_rtx_PLUS (Pmode, src1_addr, offset_rtx); > rtx addr2 = gen_rtx_PLUS (Pmode, src2_addr, offset_rtx); > - > + > do_load_for_compare_from_addr (load_mode, tmp_reg_src1, addr1, > orig_src1); > do_load_for_compare_from_addr (load_mode, tmp_reg_src2, addr2, > orig_src2); > > @@ -1966,176 +2346,6 @@ > return; > } > > -/* Generate the sequence of compares for strcmp/strncmp using vec/vsx > - instructions. > - > - BYTES_TO_COMPARE is the number of bytes to be compared. 
> - ORIG_SRC1 is the unmodified rtx for the first string. > - ORIG_SRC2 is the unmodified rtx for the second string. > - S1ADDR is the register to use for the base address of the first string. > - S2ADDR is the register to use for the base address of the second string. > - OFF_REG is the register to use for the string offset for loads. > - S1DATA is the register for loading the first string. > - S2DATA is the register for loading the second string. > - VEC_RESULT is the rtx for the vector result indicating the byte > difference. > - EQUALITY_COMPARE_REST is a flag to indicate we need to make a cleanup call > - to strcmp/strncmp if we have equality at the end of the inline comparison. > - P_CLEANUP_LABEL is a pointer to rtx for a label we generate if we need > code to clean up > - and generate the final comparison result. > - FINAL_MOVE_LABEL is rtx for a label we can branch to when we can just > - set the final result. */ > -static void > -expand_strncmp_vec_sequence (unsigned HOST_WIDE_INT bytes_to_compare, > - rtx orig_src1, rtx orig_src2, > - rtx s1addr, rtx s2addr, rtx off_reg, > - rtx s1data, rtx s2data, > - rtx vec_result, bool equality_compare_rest, > - rtx *p_cleanup_label, rtx final_move_label) > -{ > - machine_mode load_mode; > - unsigned int load_mode_size; > - unsigned HOST_WIDE_INT cmp_bytes = 0; > - unsigned HOST_WIDE_INT offset = 0; > - > - gcc_assert (p_cleanup_label != NULL); > - rtx cleanup_label = *p_cleanup_label; > - > - emit_move_insn (s1addr, force_reg (Pmode, XEXP (orig_src1, 0))); > - emit_move_insn (s2addr, force_reg (Pmode, XEXP (orig_src2, 0))); > - > - unsigned int i; > - rtx zr[16]; > - for (i = 0; i < 16; i++) > - zr[i] = GEN_INT (0); > - rtvec zv = gen_rtvec_v (16, zr); > - rtx zero_reg = gen_reg_rtx (V16QImode); > - rs6000_expand_vector_init (zero_reg, gen_rtx_PARALLEL (V16QImode, zv)); > - > - while (bytes_to_compare > 0) > - { > - /* VEC/VSX compare sequence for P8: > - check each 16B with: > - lxvd2x 32,28,8 > - lxvd2x 33,29,8 > - vcmpequb 2,0,1 # compare strings > - vcmpequb 4,0,3 # compare w/ 0 > - xxlorc 37,36,34 # first FF byte is either mismatch or end of > string > - vcmpequb. 7,5,3 # reg 7 contains 0 > - bnl 6,.Lmismatch > - > - For the P8 LE case, we use lxvd2x and compare full 16 bytes > - but then use use vgbbd and a shift to get two bytes with the > - information we need in the correct order. > - > - VEC/VSX compare sequence if TARGET_P9_VECTOR: > - lxvb16x/lxvb16x # load 16B of each string > - vcmpnezb. # produces difference location or zero byte > location > - bne 6,.Lmismatch > - > - Use the overlapping compare trick for the last block if it is > - less than 16 bytes. > - */ > - > - load_mode = V16QImode; > - load_mode_size = GET_MODE_SIZE (load_mode); > - > - if (bytes_to_compare >= load_mode_size) > - cmp_bytes = load_mode_size; > - else > - { > - /* Move this load back so it doesn't go past the end. P8/P9 > - can do this efficiently. This is never called with less > - than 16 bytes so we should always be able to do this. */ > - unsigned int extra_bytes = load_mode_size - bytes_to_compare; > - cmp_bytes = bytes_to_compare; > - gcc_assert (offset > extra_bytes); > - offset -= extra_bytes; > - cmp_bytes = load_mode_size; > - bytes_to_compare = cmp_bytes; > - } > - > - /* The offset currently used is always kept in off_reg so that the > - cleanup code on P8 can use it to extract the differing byte. 
*/ > - emit_move_insn (off_reg, GEN_INT (offset)); > - > - rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg); > - do_load_for_compare_from_addr (load_mode, s1data, addr1, orig_src1); > - rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg); > - do_load_for_compare_from_addr (load_mode, s2data, addr2, orig_src2); > - > - /* Cases to handle. A and B are chunks of the two strings. > - 1: Not end of comparison: > - A != B: branch to cleanup code to compute result. > - A == B: next block > - 2: End of the inline comparison: > - A != B: branch to cleanup code to compute result. > - A == B: call strcmp/strncmp > - 3: compared requested N bytes: > - A == B: branch to result 0. > - A != B: cleanup code to compute result. */ > - > - unsigned HOST_WIDE_INT remain = bytes_to_compare - cmp_bytes; > - > - if (TARGET_P9_VECTOR) > - emit_insn (gen_vcmpnezb_p (vec_result, s1data, s2data)); > - else > - { > - /* Emit instructions to do comparison and zero check. */ > - rtx cmp_res = gen_reg_rtx (load_mode); > - rtx cmp_zero = gen_reg_rtx (load_mode); > - rtx cmp_combined = gen_reg_rtx (load_mode); > - emit_insn (gen_altivec_eqv16qi (cmp_res, s1data, s2data)); > - emit_insn (gen_altivec_eqv16qi (cmp_zero, s1data, zero_reg)); > - emit_insn (gen_orcv16qi3 (vec_result, cmp_zero, cmp_res)); > - emit_insn (gen_altivec_vcmpequb_p (cmp_combined, vec_result, > zero_reg)); > - } > - > - bool branch_to_cleanup = (remain > 0 || equality_compare_rest); > - rtx cr6 = gen_rtx_REG (CCmode, CR6_REGNO); > - rtx dst_label; > - rtx cmp_rtx; > - if (branch_to_cleanup) > - { > - /* Branch to cleanup code, otherwise fall through to do more > - compares. P8 and P9 use different CR bits because on P8 > - we are looking at the result of a comparsion vs a > - register of zeroes so the all-true condition means no > - difference or zero was found. On P9, vcmpnezb sets a byte > - to 0xff if there is a mismatch or zero, so the all-false > - condition indicates we found no difference or zero. */ > - if (!cleanup_label) > - cleanup_label = gen_label_rtx (); > - dst_label = cleanup_label; > - if (TARGET_P9_VECTOR) > - cmp_rtx = gen_rtx_NE (VOIDmode, cr6, const0_rtx); > - else > - cmp_rtx = gen_rtx_GE (VOIDmode, cr6, const0_rtx); > - } > - else > - { > - /* Branch to final return or fall through to cleanup, > - result is already set to 0. 
*/ > - dst_label = final_move_label; > - if (TARGET_P9_VECTOR) > - cmp_rtx = gen_rtx_EQ (VOIDmode, cr6, const0_rtx); > - else > - cmp_rtx = gen_rtx_LT (VOIDmode, cr6, const0_rtx); > - } > - > - rtx lab_ref = gen_rtx_LABEL_REF (VOIDmode, dst_label); > - rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, cmp_rtx, > - lab_ref, pc_rtx); > - rtx j2 = emit_jump_insn (gen_rtx_SET (pc_rtx, ifelse)); > - JUMP_LABEL (j2) = dst_label; > - LABEL_NUSES (dst_label) += 1; > - > - offset += cmp_bytes; > - bytes_to_compare -= cmp_bytes; > - } > - *p_cleanup_label = cleanup_label; > - return; > -} > - > /* Generate the final sequence that identifies the differing > byte and generates the final result, taking into account > zero bytes: > @@ -2190,97 +2400,6 @@ > return; > } > > -/* Generate the final sequence that identifies the differing > - byte and generates the final result, taking into account > - zero bytes: > - > - P8: > - vgbbd 0,0 > - vsldoi 0,0,0,9 > - mfvsrd 9,32 > - addi 10,9,-1 # count trailing zero bits > - andc 9,10,9 > - popcntd 9,9 > - lbzx 10,28,9 # use that offset to load differing byte > - lbzx 3,29,9 > - subf 3,3,10 # subtract for final result > - > - P9: > - vclzlsbb # counts trailing bytes with lsb=0 > - vextublx # extract differing byte > - > - STR1 is the reg rtx for data from string 1. > - STR2 is the reg rtx for data from string 2. > - RESULT is the reg rtx for the comparison result. > - S1ADDR is the register to use for the base address of the first string. > - S2ADDR is the register to use for the base address of the second string. > - ORIG_SRC1 is the unmodified rtx for the first string. > - ORIG_SRC2 is the unmodified rtx for the second string. > - OFF_REG is the register to use for the string offset for loads. > - VEC_RESULT is the rtx for the vector result indicating the byte > difference. > - */ > - > -static void > -emit_final_str_compare_vec (rtx str1, rtx str2, rtx result, > - rtx s1addr, rtx s2addr, > - rtx orig_src1, rtx orig_src2, > - rtx off_reg, rtx vec_result) > -{ > - if (TARGET_P9_VECTOR) > - { > - rtx diffix = gen_reg_rtx (SImode); > - rtx chr1 = gen_reg_rtx (SImode); > - rtx chr2 = gen_reg_rtx (SImode); > - rtx chr1_di = simplify_gen_subreg (DImode, chr1, SImode, 0); > - rtx chr2_di = simplify_gen_subreg (DImode, chr2, SImode, 0); > - emit_insn (gen_vclzlsbb_v16qi (diffix, vec_result)); > - emit_insn (gen_vextublx (chr1, diffix, str1)); > - emit_insn (gen_vextublx (chr2, diffix, str2)); > - do_sub3 (result, chr1_di, chr2_di); > - } > - else > - { > - gcc_assert (TARGET_P8_VECTOR); > - rtx diffix = gen_reg_rtx (DImode); > - rtx result_gbbd = gen_reg_rtx (V16QImode); > - /* Since each byte of the input is either 00 or FF, the bytes in > - dw0 and dw1 after vgbbd are all identical to each other. */ > - emit_insn (gen_p8v_vgbbd (result_gbbd, vec_result)); > - /* For LE, we shift by 9 and get BA in the low two bytes then CTZ. > - For BE, we shift by 7 and get AB in the high two bytes then CLZ. */ > - rtx result_shifted = gen_reg_rtx (V16QImode); > - int shift_amt = (BYTES_BIG_ENDIAN) ? 
7 : 9; > - emit_insn (gen_altivec_vsldoi_v16qi > (result_shifted,result_gbbd,result_gbbd, GEN_INT (shift_amt))); > - > - rtx diffix_df = simplify_gen_subreg (DFmode, diffix, DImode, 0); > - emit_insn (gen_p8_mfvsrd_3_v16qi (diffix_df, result_shifted)); > - rtx count = gen_reg_rtx (DImode); > - > - if (BYTES_BIG_ENDIAN) > - emit_insn (gen_clzdi2 (count, diffix)); > - else > - emit_insn (gen_ctzdi2 (count, diffix)); > - > - /* P8 doesn't have a good solution for extracting one byte from > - a vsx reg like vextublx on P9 so we just compute the offset > - of the differing byte and load it from each string. */ > - do_add3 (off_reg, off_reg, count); > - > - rtx chr1 = gen_reg_rtx (QImode); > - rtx chr2 = gen_reg_rtx (QImode); > - rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg); > - do_load_for_compare_from_addr (QImode, chr1, addr1, orig_src1); > - rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg); > - do_load_for_compare_from_addr (QImode, chr2, addr2, orig_src2); > - machine_mode rmode = GET_MODE (result); > - rtx chr1_rm = simplify_gen_subreg (rmode, chr1, QImode, 0); > - rtx chr2_rm = simplify_gen_subreg (rmode, chr2, QImode, 0); > - do_sub3 (result, chr1_rm, chr2_rm); > - } > - > - return; > -} > - > /* Expand a string compare operation with length, and return > true if successful. Return false if we should let the > compiler generate normal code, probably a strncmp call. > @@ -2490,13 +2609,13 @@ > off_reg = gen_reg_rtx (Pmode); > vec_result = gen_reg_rtx (load_mode); > emit_move_insn (result_reg, GEN_INT (0)); > - expand_strncmp_vec_sequence (compare_length, > - orig_src1, orig_src2, > - s1addr, s2addr, off_reg, > - tmp_reg_src1, tmp_reg_src2, > - vec_result, > - equality_compare_rest, > - &cleanup_label, final_move_label); > + expand_cmp_vec_sequence (compare_length, > + orig_src1, orig_src2, > + s1addr, s2addr, off_reg, > + tmp_reg_src1, tmp_reg_src2, > + vec_result, > + equality_compare_rest, > + &cleanup_label, final_move_label, true); > } > else > expand_strncmp_gpr_sequence (compare_length, base_align, > @@ -2545,9 +2664,9 @@ > emit_label (cleanup_label); > > if (use_vec) > - emit_final_str_compare_vec (tmp_reg_src1, tmp_reg_src2, result_reg, > - s1addr, s2addr, orig_src1, orig_src2, > - off_reg, vec_result); > + emit_final_compare_vec (tmp_reg_src1, tmp_reg_src2, result_reg, > + s1addr, s2addr, orig_src1, orig_src2, > + off_reg, vec_result); > else > emit_final_str_compare_gpr (tmp_reg_src1, tmp_reg_src2, result_reg); > > Index: gcc/config/rs6000/rs6000.opt > =================================================================== > --- gcc/config/rs6000/rs6000.opt (revision 266034) > +++ gcc/config/rs6000/rs6000.opt (working copy) > @@ -326,7 +326,7 @@ > Max number of bytes to move inline. > > mblock-compare-inline-limit= > -Target Report Var(rs6000_block_compare_inline_limit) Init(31) RejectNegative > Joined UInteger Save > +Target Report Var(rs6000_block_compare_inline_limit) Init(63) RejectNegative > Joined UInteger Save > Max number of bytes to compare without loops. > > mblock-compare-inline-loop-limit= > > > -- > Aaron Sawdey, Ph.D. acsaw...@linux.vnet.ibm.com > 050-2/C113 (507) 253-7520 home: 507/263-0782 > IBM Linux Technology Center - PPC Toolchain >
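One more note for the archives on the 64->32 conversion, since the comment block is easy to skim past: in C terms, the emitted subfc/subfe/popcntd/or sequence computes something like the model below (my own sketch, not code from the patch; "borrow" models what subfe H,H,H leaves behind). popcntd is nonzero exactly when the chunks differ, and the borrow mask forces the result negative when the first chunk compares lower, so the <0 / ==0 / >0 properties survive the truncation to SImode.

  #include <stdint.h>

  /* Model of the DImode path of generate_6432_conversion.  */
  static int32_t
  convert_6432_model (uint64_t a, uint64_t b)
  {
    uint64_t sub = a - b;                           /* subfc */
    uint64_t borrow = (a < b) ? ~UINT64_C (0) : 0;  /* subfe H,H,H */
    uint64_t pop = __builtin_popcountll (sub);      /* popcntd */
    return (int32_t) (pop | borrow);                /* or; keep low 32 bits */
  }

If a == b this yields 0; if a < b the all-ones borrow mask makes it -1; otherwise the popcount of the nonzero difference gives a small positive value.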