Hi Jan!
Thanks for the review; you can find my answers to some of your
remarks below. I'll send a corrected patch soon, along with answers
to the rest of your remarks.
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am bit concerned about explossion of variants, but adding aligned variants
> probably makes
> sense. I guess we should specify what alignment needs to be known. I.e. is
> alignment of 2 enough
> or shall the alignment be matching the size of loads/stores produced?
Yes, the alignment should match the size of the loads/stores, and the
offset from the alignment boundary should be known as well. Otherwise,
the strategies for unknown alignment would be chosen.
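To make the condition concrete, here is a plain-C sketch of when an
aligned variant would apply (the names are hypothetical, not from the
patch):

  /* Sketch only: an aligned strategy applies when both the alignment
     and the offset from the alignment boundary are known at expansion
     time, and the alignment covers the width of the moves to emit.  */
  static int
  use_aligned_variant (int align_known, int offset_known,
                       unsigned align_bytes, unsigned move_size)
  {
    return align_known && offset_known && align_bytes >= move_size;
  }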
> This hunk seems dangerous in a way that by emitting the explicit loadq/storeq
> pairs (and similar) will
> prevent use of integer registers for 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores.
> With gimple optimizations
> we already know pretty well that the moves will stay as they are. That might
> be enough for you?
Yes, in theory it could harm 64/128-bit arithmetic, but what else could
we do if we have a DImode mem-to-mem move and our target is 32-bit?
Ideally, the RA should be able to decide how to perform such moves, but
currently it doesn't generate SSE moves - once it is able to do so, I
think we could remove this part and rely on the RA.
One more point: this is quite a special case - here we want to perform
the move via half of a vector register. That is the main reason why
these particular cases are handled in a special, rather than common, way.
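Just to make the intent concrete, here is a rough plain-C equivalent of
what the hunk is after (only a sketch assuming SSE2 is available; the
patch itself emits the corresponding RTL through gen_sse2_loadq and
gen_sse_storeq, and the function name below is made up):

  #include <emmintrin.h>

  /* Illustration only: an 8-byte mem-to-mem copy on a 32-bit target,
     done through the low half of an XMM register (a movq load plus a
     movq store) instead of two 32-bit GPR load/store pairs.  */
  static void
  copy_di_via_sse (void *dst, const void *src)
  {
    __m128i tmp = _mm_loadl_epi64 ((const __m128i *) src);
    _mm_storel_epi64 ((__m128i *) dst, tmp);
  }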
> I wrote the original function, but it is not really clear for me what the
> function
> does now. I.e. what is code for updating addresses and what means reusing
> iter.
> I guess reusing iter means that we won't start the loop from 0. Could you
> expand comments a bit more?
>
> I know I did not documented them originally, but all the parameters ought to
> be
> explicitely documented in a function comment.
Yes, you're right - reusing ITER just means we don't start the loop
from 0. I'll send a version with expanded comments soon.
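If it helps, here is what I mean in plain C (just an illustration; the
names are made up and this is not the code the expander emits):

  #include <stddef.h>
  #include <string.h>

  /* The epilogue loop reuses the counter left over from the main loop
     instead of starting a new one from 0, so no extra address-update
     code has to be generated for it.  */
  static void
  copy_with_reused_iter (char *dst, const char *src, size_t n)
  {
    size_t i = 0;
    for (; i + 16 <= n; i += 16)   /* main loop: 16-byte chunks */
      memcpy (dst + i, src + i, 16);
    for (; i < n; i++)             /* epilogue continues from the same i */
      dst[i] = src[i];
  }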
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to
> DEST. */
> +/* Emit strset instuction. If RHS is constant, and vector mode will be used,
> + then move this consatnt to a vector register before emitting strset. */
> +static void
> +emit_strset (rtx destmem, rtx value,
> + rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
I don't think it matters much here, but to make emit_strset look
similar to emit_strmov, most of the emit_strset body really could be
moved into gen_strset.
> if (max_size > 16)
> {
> rtx label = ix86_expand_aligntest (count, 16, true);
> if (TARGET_64BIT)
> {
> - dest = change_address (destmem, DImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, DImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
>
> No use for 128bit moves here?
> }
> else
> {
> - dest = change_address (destmem, SImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, SImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
>
> And here?
For memset prologues/epilogues I avoid using vector moves, since they
could require expensive initialization (we would need to create a
vector filled with the given value). For the other cases, I'll re-check
whether vector moves can be used.
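To show the initialization cost I mean, here is a plain-C sketch
(assuming SSE2; names are illustrative only):

  #include <emmintrin.h>
  #include <stdint.h>

  /* A 16-byte vector store first has to replicate the byte across the
     whole XMM register, which takes several instructions without a
     broadcast; a 4-byte GPR store only needs an integer multiply.  */
  static void
  set16_via_sse (void *dst, uint8_t val)
  {
    __m128i v = _mm_set1_epi8 ((char) val);   /* replication paid here */
    _mm_storeu_si128 ((__m128i *) dst, v);
  }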
> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp,
> rtx destreg, rtx srcreg,
> off = 4;
> emit_insn (gen_strmov (destreg, dst, srcreg, src));
> }
> + if (align_bytes & 8)
> + {
> + if (TARGET_64BIT || TARGET_SSE)
> + {
> + dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> + src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + }
> + else
> + {
> + dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> + src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?
Actually, vector moves are used here when they are available: if we are
in 32-bit mode (so the 8-byte move can't go through a GPR) but SSE is
available, an SSE move is generated.
> +/* Target hook. Returns rtx of mode MODE with promoted value VAL, that is
> + supposed to represent one byte. MODE could be a vector mode.
> + Example:
> + 1) VAL = const_int (0xAB), mode = SImode,
> + the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> + 2) if VAL isn't const, then the result will be the result of
> MUL-instruction
> + of VAL and const_int (0x01010101) (for SImode). */
>
> This would probably go better as named expansion pattern, like we do for other
> machine description interfaces.
I don't think it can be done in a machine-independent way - e.g. if AVX
is available we can use broadcast instructions, and if not we need to
use multiply instructions; on other architectures there are probably
other, more efficient ways to duplicate a byte value across an entire
vector register. So IMO this is a good place for a hook.
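For reference, the non-constant SImode case from the hook's comment is
just the multiply trick below (plain C; the function name is only
illustrative):

  #include <stdint.h>

  /* 0xAB * 0x01010101 == 0xABABABAB: replicates one byte across a
     32-bit word.  A vector mode would instead use a broadcast (AVX)
     or a multiply-plus-shuffle sequence.  */
  static inline uint32_t
  promote_byte_to_si (uint8_t val)
  {
    return (uint32_t) val * 0x01010101u;
  }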
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by
> @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
> The default is zero which means to not iterate over other vector sizes.
> @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum
> machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine indpendent part of the patch.
>
These changes are present in the middle-end part too (of course, in the
full patch they are not duplicated). I left them in this patch as well
to avoid possible build problems.
--
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.
On 27 October 2011 19:09, Jan Hubicka <[email protected]> wrote:
> Hi,
> sorry for delay with the review. This is my first pass through the backend
> part, hopefully
> someone else will do the middle end bits.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 2c53423..d7c4330 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for
> tuning for size */
> COSTS_N_BYTES (2), /* cost of FABS instruction. */
> COSTS_N_BYTES (2), /* cost of FCHS instruction. */
> COSTS_N_BYTES (2), /* cost of FSQRT instruction. */
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am bit concerned about explossion of variants, but adding aligned variants
> probably makes
> sense. I guess we should specify what alignment needs to be known. I.e. is
> alignment of 2 enough
> or shall the alignment be matching the size of loads/stores produced?
>
> @@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx
> operands[])
> }
> else
> {
> + if (mode == DImode
> + && !TARGET_64BIT
> + && TARGET_SSE2
> + && MEM_P (op0)
> + && MEM_P (op1)
> + && !push_operand (op0, mode)
> + && can_create_pseudo_p ())
> + {
> + rtx temp = gen_reg_rtx (V2DImode);
> + emit_insn (gen_sse2_loadq (temp, op1));
> + emit_insn (gen_sse_storeq (op0, temp));
> + return;
> + }
> + if (mode == DImode
> + && !TARGET_64BIT
> + && TARGET_SSE
> + && !MEM_P (op1)
> + && GET_MODE (op1) == V2DImode)
> + {
> + emit_insn (gen_sse_storeq (op0, op1));
> + return;
> + }
> + if (mode == TImode
> + && TARGET_AVX2
> + && MEM_P (op0)
> + && !MEM_P (op1)
> + && GET_MODE (op1) == V4DImode)
> + {
> + op0 = convert_to_mode (V2DImode, op0, 1);
> + emit_insn (gen_vec_extract_lo_v4di (op0, op1));
> + return;
> + }
>
> This hunk seems dangerous in a way that by emitting the explicit loadq/storeq
> pairs (and similar) will
> prevent use of integer registers for 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores.
> With gimple optimizations
> we already know pretty well that the moves will stay as they are. That might
> be enough for you?
>
> If you go this way, please make separate patch so it can be benchmarked.
> While the moves are faster
> there are always problem with size mismatches in load/store buffers.
>
> I think for string operations we should output the SSE/AVX instruction
> variants
> by hand + we could try to instruct IRA to actually preffer use of SSE/AVX when
> feasible? This has been traditionally problem with old RA because it did not
> see that because pseudo is eventually used for arithmetics, it can not go into
> SSE register. So it was not possible to make MMX/SSE/AVX to be preferred
> variants for 64bit/128bit manipulations w/o hurting performance of code that
> does arithmetic on long long and __int128. Perhaps IRA can solve this now.
> Vladimir, do you have any ida?
>
> +/* Helper function for expand_set_or_movmem_via_loop.
> + This function can reuse iter rtx from another loop and don't generate
> + code for updating the addresses. */
> +static rtx
> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
> + rtx destptr, rtx srcptr, rtx value,
> + rtx count, rtx iter,
> + enum machine_mode mode, int unroll,
> + int expected_size, bool change_ptrs)
>
> I wrote the original function, but it is not really clear for me what the
> function
> does now. I.e. what is code for updating addresses and what means reusing
> iter.
> I guess reusing iter means that we won't start the loop from 0. Could you
> expand comments a bit more?
>
> I know I did not documented them originally, but all the parameters ought to
> be
> explicitely documented in a function comment.
> @@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
> emit_insn (gen_strmov (destptr, dest, srcptr, src));
> }
>
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to
> DEST. */
> +/* Emit strset instuction. If RHS is constant, and vector mode will be used,
> + then move this consatnt to a vector register before emitting strset. */
> +static void
> +emit_strset (rtx destmem, rtx value,
> + rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
> {
> - if (TARGET_64BIT)
> - {
> - dest = adjust_automodify_address_nv (destmem, DImode, destptr,
> offset);
> - emit_insn (gen_strset (destptr, dest, value));
> - }
> - else
> - {
> - dest = adjust_automodify_address_nv (destmem, SImode, destptr,
> offset);
> - emit_insn (gen_strset (destptr, dest, value));
> - dest = adjust_automodify_address_nv (destmem, SImode, destptr,
> offset + 4);
> - emit_insn (gen_strset (destptr, dest, value));
> - }
> - offset += 8;
> + if (GET_MODE (destmem) != move_mode)
> + destmem = change_address (destmem, move_mode, destptr);
> AFAIK change_address is not equilvalent into adjust_automodify_address_nv in
> a way
> it copies memory aliasing attributes and it is needed to zap them here since
> stringops
> behaves funily WRT aliaseing.
> if (max_size > 16)
> {
> rtx label = ix86_expand_aligntest (count, 16, true);
> if (TARGET_64BIT)
> {
> - dest = change_address (destmem, DImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, DImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
>
> No use for 128bit moves here?
> }
> else
> {
> - dest = change_address (destmem, SImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, SImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
>
> And here?
> @@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
> if (align <= 1 && desired_alignment > 1)
> {
> rtx label = ix86_expand_aligntest (destptr, 1, false);
> - srcmem = change_address (srcmem, QImode, srcptr);
> - destmem = change_address (destmem, QImode, destptr);
> + srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
> + destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
>
> You want to always use adjust_automodify_address or
> adjust_automodify_address_nv,
> adjust_automodify_address_1 is not intended for general use.
> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp,
> rtx destreg, rtx srcreg,
> off = 4;
> emit_insn (gen_strmov (destreg, dst, srcreg, src));
> }
> + if (align_bytes & 8)
> + {
> + if (TARGET_64BIT || TARGET_SSE)
> + {
> + dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> + src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + }
> + else
> + {
> + dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> + src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?
> +/* Target hook. Returns rtx of mode MODE with promoted value VAL, that is
> + supposed to represent one byte. MODE could be a vector mode.
> + Example:
> + 1) VAL = const_int (0xAB), mode = SImode,
> + the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> + 2) if VAL isn't const, then the result will be the result of
> MUL-instruction
> + of VAL and const_int (0x01010101) (for SImode). */
>
> This would probably go better as named expansion pattern, like we do for other
> machine description interfaces.
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by
> @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
> The default is zero which means to not iterate over other vector sizes.
> @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum
> machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine indpendent part of the patch.
>
> Honza
>