I found another bug in the current implementation. A patch for it doesn't
cure the i686-linux bootstrap, but it fixes failures on some tests (see
attached).

The problem was that we tried to add runtime tests for alignment even
when both SRC and DST had unknown alignment. In that case it can be
impossible to make them both aligned simultaneously, so I think it's
easier not to try to use aligned SSE moves at all. Prologues with runtime
alignment tests are correct only if at least one alignment is known;
otherwise the transformation is wrong. For now, generation of such
prologues could probably be removed from MEMMOV entirely.
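
To make the failure mode concrete, here is a minimal sketch (my own
hypothetical reduction, not the attached testcase). The two regions are
65 bytes apart, i.e. 1 modulo 16, so a prologue that peels bytes until
DST is 16-byte aligned necessarily leaves SRC misaligned, and vice
versa:

#include <string.h>

/* Hypothetical illustration: for a 64-byte copy the regions do not
   overlap, but since 65 % 16 == 1, DST and SRC can never be 16-byte
   aligned at the same time, so aligned SSE moves (movdqa) can serve at
   most one side of the copy.  */
void
copy_relatively_misaligned (char *buf)
{
  memcpy (buf, buf + 65, 64);
}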

However, even with this fix the i686 bootstrap still fails. Configure
line for reproducing the bootstrap failure:
CC="gcc -m32" CXX="g++ -m32" ../configure --with-arch=core2
--with-cpu=atom --prefix=`pwd` i686-linux --with-fpmath=sse
--enable-languages=c,c++,fortran
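
As a quick way to check which code the expander emits for an
unknown-alignment copy, one can compile a small reduction with flags
mirroring the configure above and inspect the assembly for movdqa
(aligned SSE moves) versus movdqu (unaligned ones). The flags and the
function below are my suggestion, not the attached testcase:

/* Compile with, e.g.:
     gcc -m32 -march=core2 -mtune=atom -mfpmath=sse -O2 -S t.c
   The size is a known constant but both alignments are unknown, so any
   movdqa in t.s would have to rely on an alignment prologue.  */
void
copy200 (char *dst, const char *src)
{
  __builtin_memcpy (dst, src, 200);
}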

On 18 November 2011 06:23, Jan Hubicka <hubi...@ucw.cz> wrote:
>> > >
>> > > The current x86 memset/memcpy expansion is broken. It miscompiles
>> > > many programs, including GCC itself.  Should it be reverted for now?
>> >
>> > There was a problem in the new code doing loopy epilogues.
>> > I am currently testing the following patch that should fix the problem.
>> > We could either revert now and I will apply a combined patch, or I hope
>> > to fix that tonight.
>>
>> To expand a little bit: I was looking into the code for most of the day
>> today, and the patch combines several fixes.
>>    1) The new loopy epilogue code was quite broken. It did not work for
>>       memset at all because the promoted value was not always initialized,
>>       which I fixed in the version of the patch that is in mainline now.
>>       It was, however, also missing a bounds check in some cases. This is
>>       fixed by the expand_set_or_movmem_via_loop_with_iter change.
>>    2) I mis-updated the Atom description so 32bit memset was not expanded
>>       inline; this is fixed by the memset changes.
>>    3) decide_alg was broken in two ways - it gave complex algorithms for
>>       -O0, and it chose the wrong variant when sse_loop is used.
>>    4) The epilogue loop was output even when it is not needed - i.e. when
>>       the unrolled loop handles 16 bytes at once and the block size is 39.
>>       This is the ix86_movmem and ix86_setmem change.
>>    5) The implementations of ix86_movmem/ix86_setmem diverged for no
>>       reason, so I got them back in sync. For some reason SSE code in
>>       movmem was not output for 64bit unaligned memcpy; that is fixed too.
>>    6) It seems that both bdver and core are good enough at handling
>>       misaligned blocks that the alignment prologues can be omitted. This
>>       greatly improves and reduces the size of the inline sequence. I
>>       will, however, break this out into an independent patch.
>>
>> Life would be easier if the changes were made in multiple incremental
>> steps. Stringops expansion is relatively tricky business and relatively
>> easy to get wrong in some cases, since there are so many variants
>> depending on knowledge of size/alignment and target architecture.
>
> Hi,
> this is the patch I committed after bootstrapping/regtesting x86_64-linux
> and --with-arch=core2 --with-cpu=atom.  The
> gfortran.fortran-torture/execute/arrayarg.f90 failure stays.  As I've
> explained in the PR log, I believe it is a previously latent problem
> elsewhere that is now triggered by the inline memset expansion that is
> later unrolled.  I would welcome help from someone who understands the
> testcase on whether it is aliasing-safe or not.
>
> Honza
>
>        PR bootstrap/51134
>        * i386.c (atom_cost): Fix 32bit memset description.
>        (expand_set_or_movmem_via_loop_with_iter): Output proper bounds
>        check for epilogue loops.
>        (expand_movmem_epilogue): Handle epilogues up to size 15 w/o
>        producing byte loop.
>        (decide_alg): sse_loop is not usable when SSE2 is disabled; when
>        not optimizing always use rep movsb or libcall; do not produce
>        word sized loops when optimizing memset for size (to avoid need
>        for large constants).
>        (ix86_expand_movmem): Get into sync with ix86_expand_setmem;
>        choose unroll factors better; always do 128bit moves when
>        producing SSE loops; do not produce loopy epilogue when size is
>        too small.
>        (promote_duplicated_reg_to_size): Do not look into desired
>        alignments when doing vector expansion.
>        (ix86_expand_setmem): Track better when promoted value is
>        available; choose unroll factors more sanely; output loopy
>        epilogue only when needed.
> Index: config/i386/i386.c
> ===================================================================
> *** config/i386/i386.c  (revision 181407)
> --- config/i386/i386.c  (working copy)
> *************** struct processor_costs atom_cost = {
> *** 1785,1791 ****
>       if that fails.  */
>    {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
>      {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
> !    {{libcall, {{-1, libcall}}},                              /* Unknown alignment.  */
>      {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
>               {-1, libcall}}}}},
>
> --- 1785,1791 ----
>       if that fails.  */
>    {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
>      {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
> !    {{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment.  */
>      {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
>               {-1, libcall}}}}},
>
> *************** expand_set_or_movmem_via_loop_with_iter
> *** 21149,21168 ****
>
>    top_label = gen_label_rtx ();
>    out_label = gen_label_rtx ();
> -   if (!reuse_iter)
> -     iter = gen_reg_rtx (iter_mode);
> -
>    size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
> !                             NULL, 1, OPTAB_DIRECT);
> !   /* Those two should combine.  */
> !   if (piece_size == const1_rtx)
>      {
> !       emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode,
>                               true, out_label);
> -       predict_jump (REG_BR_PROB_BASE * 10 / 100);
>      }
> -   if (!reuse_iter)
> -     emit_move_insn (iter, const0_rtx);
>
>    emit_label (top_label);
>
> --- 21149,21173 ----
>
>    top_label = gen_label_rtx ();
>    out_label = gen_label_rtx ();
>    size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
> !                              NULL, 1, OPTAB_DIRECT);
> !   if (!reuse_iter)
>      {
> !       iter = gen_reg_rtx (iter_mode);
> !       /* Those two should combine.  */
> !       if (piece_size == const1_rtx)
> !       {
> !         emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode,
> !                                  true, out_label);
> !         predict_jump (REG_BR_PROB_BASE * 10 / 100);
> !       }
> !       emit_move_insn (iter, const0_rtx);
> !     }
> !   else
> !     {
> !       emit_cmp_and_jump_insns (iter, size, GE, NULL_RTX, iter_mode,
>                               true, out_label);
>      }
>
>    emit_label (top_label);
>
> *************** expand_movmem_epilogue (rtx destmem, rtx
> *** 21460,21466 ****
>        gcc_assert (remainder_size == 0);
>        return;
>      }
> !   if (max_size > 8)
>      {
>        count = expand_simple_binop (GET_MODE (count), AND, count, GEN_INT (max_size - 1),
>                                    count, 1, OPTAB_DIRECT);
> --- 21465,21471 ----
>        gcc_assert (remainder_size == 0);
>        return;
>      }
> !   if (max_size > 16)
>      {
>        count = expand_simple_binop (GET_MODE (count), AND, count, GEN_INT (max_size - 1),
>                                    count, 1, OPTAB_DIRECT);
> *************** expand_movmem_epilogue (rtx destmem, rtx
> *** 21475,21480 ****
> --- 21480,21504 ----
>     */
>    if (TARGET_SINGLE_STRINGOP)
>      {
> +       if (max_size > 8)
> +       {
> +         rtx label = ix86_expand_aligntest (count, 8, true);
> +         if (TARGET_64BIT)
> +           {
> +             src = change_address (srcmem, DImode, srcptr);
> +             dest = change_address (destmem, DImode, destptr);
> +             emit_insn (gen_strmov (destptr, dest, srcptr, src));
> +           }
> +         else
> +           {
> +             src = change_address (srcmem, SImode, srcptr);
> +             dest = change_address (destmem, SImode, destptr);
> +             emit_insn (gen_strmov (destptr, dest, srcptr, src));
> +             emit_insn (gen_strmov (destptr, dest, srcptr, src));
> +           }
> +         emit_label (label);
> +         LABEL_NUSES (label) = 1;
> +       }
>        if (max_size > 4)
>        {
>          rtx label = ix86_expand_aligntest (count, 4, true);
> *************** expand_movmem_epilogue (rtx destmem, rtx
> *** 21508,21513 ****
> --- 21532,21566 ----
>        rtx offset = force_reg (Pmode, const0_rtx);
>        rtx tmp;
>
> +       if (max_size > 8)
> +       {
> +         rtx label = ix86_expand_aligntest (count, 8, true);
> +         if (TARGET_64BIT)
> +           {
> +             src = change_address (srcmem, DImode, srcptr);
> +             dest = change_address (destmem, DImode, destptr);
> +             emit_move_insn (dest, src);
> +             tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (8), NULL,
> +                                        true, OPTAB_LIB_WIDEN);
> +           }
> +         else
> +           {
> +             src = change_address (srcmem, SImode, srcptr);
> +             dest = change_address (destmem, SImode, destptr);
> +             emit_move_insn (dest, src);
> +             tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (4), NULL,
> +                                        true, OPTAB_LIB_WIDEN);
> +             if (tmp != offset)
> +                emit_move_insn (offset, tmp);
> +             tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (4), NULL,
> +                                        true, OPTAB_LIB_WIDEN);
> +             emit_move_insn (dest, src);
> +           }
> +         if (tmp != offset)
> +           emit_move_insn (offset, tmp);
> +         emit_label (label);
> +         LABEL_NUSES (label) = 1;
> +       }
>        if (max_size > 4)
>        {
>          rtx label = ix86_expand_aligntest (count, 4, true);
> *************** expand_setmem_epilogue (rtx destmem, rtx
> *** 21588,21604 ****
>         Remaining part we'll move using Pmode and narrower modes.  */
>
>        if (promoted_to_vector_value)
> !       while (remainder_size >= 16)
> !         {
> !           if (GET_MODE (destmem) != move_mode)
> !             destmem = adjust_automodify_address_nv (destmem, move_mode,
> !                                                     destptr, offset);
> !           emit_strset (destmem, promoted_to_vector_value, destptr,
> !                        move_mode, offset);
>
> !           offset += 16;
> !           remainder_size -= 16;
> !         }
>
>        /* Move the remaining part of epilogue - its size might be
>         a size of the widest mode.  */
> --- 21641,21668 ----
>         Remaining part we'll move using Pmode and narrower modes.  */
>
>        if (promoted_to_vector_value)
> !       {
> !         if (promoted_to_vector_value)
> !           {
> !             if (max_size >= GET_MODE_SIZE (V4SImode))
> !               move_mode = V4SImode;
> !             else if (max_size >= GET_MODE_SIZE (DImode))
> !               move_mode = DImode;
> !           }
> !         while (remainder_size >= GET_MODE_SIZE (move_mode))
> !           {
> !             if (GET_MODE (destmem) != move_mode)
> !               destmem = adjust_automodify_address_nv (destmem, move_mode,
> !                                                       destptr, offset);
> !             emit_strset (destmem,
> !                          promoted_to_vector_value,
> !                          destptr,
> !                          move_mode, offset);
>
> !             offset += GET_MODE_SIZE (move_mode);
> !             remainder_size -= GET_MODE_SIZE (move_mode);
> !           }
> !       }
>
>        /* Move the remaining part of epilogue - its size might be
>         a size of the widest mode.  */
> *************** decide_alg (HOST_WIDE_INT count, HOST_WI
> *** 22022,22031 ****
>                             || (memset
>                                 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
>
> ! #define ALG_USABLE_P(alg) (rep_prefix_usable                  \
> !                          || (alg != rep_prefix_1_byte         \
> !                              && alg != rep_prefix_4_byte      \
> !                              && alg != rep_prefix_8_byte))
>    const struct processor_costs *cost;
>
>    /* Even if the string operation call is cold, we still might spend a lot
> --- 22086,22096 ----
>                             || (memset
>                                 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
>
> ! #define ALG_USABLE_P(alg) ((rep_prefix_usable                 \
> !                           || (alg != rep_prefix_1_byte        \
> !                               && alg != rep_prefix_4_byte      \
> !                               && alg != rep_prefix_8_byte))    \
> !                          && (TARGET_SSE2 || alg != sse_loop))
>    const struct processor_costs *cost;
>
>    /* Even if the string operation call is cold, we still might spend a lot
> *************** decide_alg (HOST_WIDE_INT count, HOST_WI
> *** 22037,22042 ****
> --- 22102,22110 ----
>    else
>      optimize_for_speed = true;
>
> +   if (!optimize)
> +     return (rep_prefix_usable ? rep_prefix_1_byte : libcall);
> +
>    cost = optimize_for_speed ? ix86_cost : &ix86_size_cost;
>
>    *dynamic_check = -1;
> *************** decide_alg (HOST_WIDE_INT count, HOST_WI
> *** 22049,22058 ****
>    /* rep; movq or rep; movl is the smallest variant.  */
>    else if (!optimize_for_speed)
>      {
> !       if (!count || (count & 3))
> !       return rep_prefix_usable ? rep_prefix_1_byte : loop_1_byte;
>        else
> !       return rep_prefix_usable ? rep_prefix_4_byte : loop;
>      }
>    /* Very tiny blocks are best handled via the loop, REP is expensive to setup.
>     */
> --- 22117,22126 ----
>    /* rep; movq or rep; movl is the smallest variant.  */
>    else if (!optimize_for_speed)
>      {
> !       if (!count || (count & 3) || memset)
> !       return rep_prefix_usable ? rep_prefix_1_byte : libcall;
>        else
> !       return rep_prefix_usable ? rep_prefix_4_byte : libcall;
>      }
>    /* Very tiny blocks are best handled via the loop, REP is expensive to setup.
>     */
> *************** decide_alg (HOST_WIDE_INT count, HOST_WI
> *** 22106,22118 ****
>        int max = -1;
>        enum stringop_alg alg;
>        int i;
> -       bool any_alg_usable_p = true;
>        bool only_libcall_fits = true;
>
>        for (i = 0; i < MAX_STRINGOP_ALGS; i++)
>        {
>          enum stringop_alg candidate = algs->size[i].alg;
> -         any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
>
>          if (candidate != libcall && candidate
>              && ALG_USABLE_P (candidate))
> --- 22174,22184 ----
> *************** decide_alg (HOST_WIDE_INT count, HOST_WI
> *** 22124,22130 ****
>        /* If there aren't any usable algorithms, then recursing on
>         smaller sizes isn't going to find anything.  Just return the
>         simple byte-at-a-time copy loop.  */
> !       if (!any_alg_usable_p || only_libcall_fits)
>        {
>          /* Pick something reasonable.  */
>          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
> --- 22190,22196 ----
>        /* If there aren't any usable algorithms, then recursing on
>         smaller sizes isn't going to find anything.  Just return the
>         simple byte-at-a-time copy loop.  */
> !       if (only_libcall_fits)
>        {
>          /* Pick something reasonable.  */
>          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
> *************** ix86_expand_movmem (rtx dst, rtx src, rt
> *** 22253,22259 ****
>    int dynamic_check;
>    bool need_zero_guard = false;
>    bool align_unknown;
> !   int unroll_factor;
>    enum machine_mode move_mode;
>    rtx loop_iter = NULL_RTX;
>    int dst_offset, src_offset;
> --- 22319,22325 ----
>    int dynamic_check;
>    bool need_zero_guard = false;
>    bool align_unknown;
> !   unsigned int unroll_factor;
>    enum machine_mode move_mode;
>    rtx loop_iter = NULL_RTX;
>    int dst_offset, src_offset;
> *************** ix86_expand_movmem (rtx dst, rtx src, rt
> *** 22316,22329 ****
>      case unrolled_loop:
>        need_zero_guard = true;
>        move_mode = Pmode;
> !       unroll_factor = TARGET_64BIT ? 4 : 2;
>        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>        break;
>      case sse_loop:
>        need_zero_guard = true;
>        /* Use SSE instructions, if possible.  */
> !       move_mode = align_unknown ? DImode : V4SImode;
> !       unroll_factor = TARGET_64BIT ? 4 : 2;
>        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>        break;
>      case rep_prefix_8_byte:
> --- 22382,22408 ----
>      case unrolled_loop:
>        need_zero_guard = true;
>        move_mode = Pmode;
> !       unroll_factor = 1;
> !       /* Select maximal available 1,2 or 4 unroll factor.
> !        In 32bit we can not afford to use 4 registers inside the loop.  */
> !       if (!count)
> !       unroll_factor = TARGET_64BIT ? 4 : 2;
> !       else
> !       while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
> !              && unroll_factor < (TARGET_64BIT ? 4 :2))
> !         unroll_factor *= 2;
>        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>        break;
>      case sse_loop:
>        need_zero_guard = true;
>        /* Use SSE instructions, if possible.  */
> !       move_mode = V4SImode;
> !       /* Select maximal available 1,2 or 4 unroll factor.  */
> !       if (!count)
> !       unroll_factor = 4;
> !       else
> !       while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
> !              && unroll_factor < 4)
> !         unroll_factor *= 2;
>        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>        break;
>      case rep_prefix_8_byte:
> *************** ix86_expand_movmem (rtx dst, rtx src, rt
> *** 22568,22574 ****
>    if (alg == sse_loop || alg == unrolled_loop)
>      {
>        rtx tmp;
> !       if (align_unknown && unroll_factor > 1)
>        {
>          /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
>             do this, we can have very big epilogue - when alignment is statically
> --- 22647,22659 ----
>    if (alg == sse_loop || alg == unrolled_loop)
>      {
>        rtx tmp;
> !       int remainder_size = epilogue_size_needed;
> !
> !       /* We may not need the epilogue loop at all when the count is known
> !        and alignment is not adjusted.  */
> !       if (count && desired_align <= align)
> !       remainder_size = count % epilogue_size_needed;
> !       if (remainder_size > 31)
>        {
>          /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
>             do this, we can have very big epilogue - when alignment is statically
> *************** promote_duplicated_reg_to_size (rtx val,
> *** 22710,22716 ****
>  {
>    rtx promoted_val = NULL_RTX;
>
> !   if (size_needed > 8 || (desired_align > align && desired_align > 8))
>      {
>        /* We want to promote to vector register, so we expect that at least SSE
>         is available.  */
> --- 22795,22801 ----
>  {
>    rtx promoted_val = NULL_RTX;
>
> !   if (size_needed > 8)
>      {
>        /* We want to promote to vector register, so we expect that at least SSE
>         is available.  */
> *************** promote_duplicated_reg_to_size (rtx val,
> *** 22724,22730 ****
>        else
>        promoted_val = promote_duplicated_reg (V4SImode, val);
>      }
> !   else if (size_needed > 4 || (desired_align > align && desired_align > 4))
>      {
>        gcc_assert (TARGET_64BIT);
>        promoted_val = promote_duplicated_reg (DImode, val);
> --- 22809,22815 ----
>        else
>        promoted_val = promote_duplicated_reg (V4SImode, val);
>      }
> !   else if (size_needed > 4)
>      {
>        gcc_assert (TARGET_64BIT);
>        promoted_val = promote_duplicated_reg (DImode, val);
> *************** ix86_expand_setmem (rtx dst, rtx count_e
> *** 22764,22769 ****
> --- 22849,22855 ----
>    unsigned int unroll_factor;
>    enum machine_mode move_mode;
>    rtx loop_iter = NULL_RTX;
> +   bool early_jump = false;
>
>    if (CONST_INT_P (align_exp))
>      align = INTVAL (align_exp);
> *************** ix86_expand_setmem (rtx dst, rtx count_e
> *** 22783,22789 ****
>    /* Step 0: Decide on preferred algorithm, desired alignment and
>       size of chunks to be copied by main loop.  */
>
> !   align_unknown = CONST_INT_P (align_exp) && INTVAL (align_exp) > 0;
>    alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
>    desired_align = decide_alignment (align, alg, expected_size);
>    unroll_factor = 1;
> --- 22869,22875 ----
>    /* Step 0: Decide on preferred algorithm, desired alignment and
>       size of chunks to be copied by main loop.  */
>
> !   align_unknown = !(CONST_INT_P (align_exp) && INTVAL (align_exp) > 0);
>    alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
>    desired_align = decide_alignment (align, alg, expected_size);
>    unroll_factor = 1;
> *************** ix86_expand_setmem (rtx dst, rtx count_e
> *** 22813,22821 ****
>        move_mode = Pmode;
>        unroll_factor = 1;
>        /* Select maximal available 1,2 or 4 unroll factor.  */
> !       while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
> !            && unroll_factor < 4)
> !       unroll_factor *= 2;
>        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>        break;
>      case sse_loop:
> --- 22899,22910 ----
>        move_mode = Pmode;
>        unroll_factor = 1;
>        /* Select maximal available 1,2 or 4 unroll factor.  */
> !       if (!count)
> !       unroll_factor = 4;
> !       else
> !       while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
> !              && unroll_factor < 4)
> !         unroll_factor *= 2;
>        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>        break;
>      case sse_loop:
> *************** ix86_expand_setmem (rtx dst, rtx count_e
> *** 22823,22831 ****
>        move_mode = TARGET_64BIT ? V2DImode : V4SImode;
>        unroll_factor = 1;
>        /* Select maximal available 1,2 or 4 unroll factor.  */
> !       while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
> !            && unroll_factor < 4)
> !       unroll_factor *= 2;
>        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>        break;
>      case rep_prefix_8_byte:
> --- 22912,22923 ----
>        move_mode = TARGET_64BIT ? V2DImode : V4SImode;
>        unroll_factor = 1;
>        /* Select maximal available 1,2 or 4 unroll factor.  */
> !       if (!count)
> !       unroll_factor = 4;
> !       else
> !       while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
> !              && unroll_factor < 4)
> !         unroll_factor *= 2;
>        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>        break;
>      case rep_prefix_8_byte:
> *************** ix86_expand_setmem (rtx dst, rtx count_e
> *** 22904,22909 ****
> --- 22996,23002 ----
>                emit_move_insn (loop_iter, const0_rtx);
>            }
>          label = gen_label_rtx ();
> +         early_jump = true;
>          emit_cmp_and_jump_insns (count_exp,
>                                   GEN_INT (epilogue_size_needed),
>                                   LTU, 0, counter_mode (count_exp), 1, label);
> *************** ix86_expand_setmem (rtx dst, rtx count_e
> *** 23016,23022 ****
>        vec_promoted_val =
>        promote_duplicated_reg_to_size (gpr_promoted_val,
>                                        GET_MODE_SIZE (move_mode),
> !                                       desired_align, align);
>        loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
>                                     NULL, vec_promoted_val, count_exp,
>                                     loop_iter, move_mode, unroll_factor,
> --- 23109,23115 ----
>        vec_promoted_val =
>        promote_duplicated_reg_to_size (gpr_promoted_val,
>                                        GET_MODE_SIZE (move_mode),
> !                                       GET_MODE_SIZE (move_mode), align);
>        loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
>                                     NULL, vec_promoted_val, count_exp,
>                                     loop_iter, move_mode, unroll_factor,
> *************** ix86_expand_setmem (rtx dst, rtx count_e
> *** 23065,23085 ****
>        LABEL_NUSES (label) = 1;
>        /* We cannot rely on the fact that the promoted value is known.  */
>        vec_promoted_val = 0;
> !       gpr_promoted_val = 0;
>      }
>   epilogue:
>    if (alg == unrolled_loop || alg == sse_loop)
>      {
>        rtx tmp;
> !       if (align_unknown && unroll_factor > 1
> !         && epilogue_size_needed >= GET_MODE_SIZE (move_mode)
> !         && vec_promoted_val)
>        {
>          /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
>             do this, we can have very big epilogue - when alignment is statically
>             unknown we'll have the epilogue byte by byte which may be very slow.  */
>          loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
> !             NULL, vec_promoted_val, count_exp,
>              loop_iter, move_mode, 1,
>              expected_size, false);
>          dst = change_address (dst, BLKmode, destreg);
> --- 23158,23183 ----
>        LABEL_NUSES (label) = 1;
>        /* We cannot rely on the fact that the promoted value is known.  */
>        vec_promoted_val = 0;
> !       if (early_jump)
> !         gpr_promoted_val = 0;
>      }
>   epilogue:
>    if (alg == unrolled_loop || alg == sse_loop)
>      {
>        rtx tmp;
> !       int remainder_size = epilogue_size_needed;
> !       if (count && desired_align <= align)
> !       remainder_size = count % epilogue_size_needed;
> !       /* We may not need the epilogue loop at all when the count is known
> !        and alignment is not adjusted.  */
> !       if (remainder_size > 31
> !         && (alg == sse_loop ? vec_promoted_val : gpr_promoted_val))
>        {
>          /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
>             do this, we can have very big epilogue - when alignment is statically
>             unknown we'll have the epilogue byte by byte which may be very slow.  */
>          loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
> !             NULL, (alg == sse_loop ? vec_promoted_val : gpr_promoted_val), count_exp,
>              loop_iter, move_mode, 1,
>              expected_size, false);
>          dst = change_address (dst, BLKmode, destreg);
> *************** ix86_expand_setmem (rtx dst, rtx count_e
> *** 23090,23106 ****
>        if (tmp != destreg)
>        emit_move_insn (destreg, tmp);
>      }
> !   if (count_exp == const0_rtx)
>      ;
> !   else if (!gpr_promoted_val && epilogue_size_needed > 1)
>      expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
>                                     epilogue_size_needed);
>    else
> !     {
> !       if (epilogue_size_needed > 1)
> !       expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
> !                               val_exp, count_exp, epilogue_size_needed);
> !     }
>    if (jump_around_label)
>      emit_label (jump_around_label);
>    return true;
> --- 23188,23201 ----
>        if (tmp != destreg)
>        emit_move_insn (destreg, tmp);
>      }
> !   if (count_exp == const0_rtx || epilogue_size_needed <= 1)
>      ;
> !   else if (!gpr_promoted_val)
>      expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
>                                     epilogue_size_needed);
>    else
> !     expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
> !                           val_exp, count_exp, epilogue_size_needed);
>    if (jump_around_label)
>      emit_label (jump_around_label);
>    return true;
>

-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

Attachment: memcpy_unknown_alignment.patch
Description: Binary data
