Hi Jan!
Thanks for the review; you can find my answers to some of your
remarks below. I'll send a corrected patch soon, along with answers
to the rest of your remarks.
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am bit concerned about explossion of variants, but adding aligned variants
> probably makes
> sense. I guess we should specify what alignment needs to be known. I.e. is
> alignment of 2 enough
> or shall the alignment be matching the size of loads/stores produced?
Yes, the alignment should match the size of the loads/stores, and the
offset from the alignment boundary should be known as well. Otherwise,
the strategies for unknown alignment would be chosen.
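To make the condition concrete, here is a plain-C sketch of when an
aligned variant would apply (the names are hypothetical, not from the
patch):

  /* Sketch only: an aligned strategy applies when both the alignment
     and the offset from the alignment boundary are known at expansion
     time, and the alignment covers the width of the moves to emit.  */
  static int
  use_aligned_variant (int align_known, int offset_known,
                       unsigned align_bytes, unsigned move_size)
  {
    return align_known && offset_known && align_bytes >= move_size;
  }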
> This hunk seems dangerous in a way that by emitting the explicit loadq/storeq
> pairs (and similar) will
> prevent use of integer registers for 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores.
> With gimple optimizations
> we already know pretty well that the moves will stay as they are. That might
> be enough for you?
Yes, in theory it could harm 64/128-bit arithmetic, but what else could
we do if we have a DImode mem-to-mem move and our target is 32-bit?
Ideally, the RA should be able to decide how to perform such moves, but
currently it doesn't generate SSE moves - once it is able to do so, I
think we could remove this part and rely on the RA.
One more point: this is quite a special case - here we want to perform
the move via half of a vector register. That is the main reason why
these particular cases are handled in a special, rather than common, way.
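Just to make the intent concrete, here is a rough plain-C equivalent of
what the hunk is after (only a sketch assuming SSE2 is available; the
patch itself emits the corresponding RTL through gen_sse2_loadq and
gen_sse_storeq, and the function name below is made up):

  #include <emmintrin.h>

  /* Illustration only: an 8-byte mem-to-mem copy on a 32-bit target,
     done through the low half of an XMM register (a movq load plus a
     movq store) instead of two 32-bit GPR load/store pairs.  */
  static void
  copy_di_via_sse (void *dst, const void *src)
  {
    __m128i tmp = _mm_loadl_epi64 ((const __m128i *) src);
    _mm_storel_epi64 ((__m128i *) dst, tmp);
  }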
> I wrote the original function, but it is not really clear for me what the
> function
> does now. I.e. what is code for updating addresses and what means reusing
> iter.
> I guess reusing iter means that we won't start the loop from 0. Could you
> expand comments a bit more?
>
> I know I did not documented them originally, but all the parameters ought to
> be
> explicitely documented in a function comment.
Yes, you're right - reusing ITER just means we don't start the loop
from 0. I'll send a version with expanded comments soon.
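If it helps, here is what I mean in plain C (just an illustration; the
names are made up and this is not the code the expander emits):

  #include <stddef.h>
  #include <string.h>

  /* The epilogue loop reuses the counter left over from the main loop
     instead of starting a new one from 0, so no extra address-update
     code has to be generated for it.  */
  static void
  copy_with_reused_iter (char *dst, const char *src, size_t n)
  {
    size_t i = 0;
    for (; i + 16 <= n; i += 16)   /* main loop: 16-byte chunks */
      memcpy (dst + i, src + i, 16);
    for (; i < n; i++)             /* epilogue continues from the same i */
      dst[i] = src[i];
  }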
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to
> DEST. */
> +/* Emit strset instuction. If RHS is constant, and vector mode will be used,
> + then move this consatnt to a vector register before emitting strset. */
> +static void
> +emit_strset (rtx destmem, rtx value,
> + rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
I don't think it matters much here, but to make emit_strset look
similar to emit_strmov, most of the emit_strset body really could be
moved into gen_strset.
> if (max_size > 16)
> {
> rtx label = ix86_expand_aligntest (count, 16, true);
> if (TARGET_64BIT)
> {
> - dest = change_address (destmem, DImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, DImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
>
> No use for 128bit moves here?
> }
> else
> {
> - dest = change_address (destmem, SImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, SImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
>
> And here?
For memset prologues/epilogues I avoid using vector moves, since they
could require expensive initialization (we would need to create a
vector filled with the given value). For the other cases, I'll re-check
whether vector moves can be used.
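To show the initialization cost I mean, here is a plain-C sketch
(assuming SSE2; names are illustrative only):

  #include <emmintrin.h>
  #include <stdint.h>

  /* A 16-byte vector store first has to replicate the byte across the
     whole XMM register, which takes several instructions without a
     broadcast; a 4-byte GPR store only needs an integer multiply.  */
  static void
  set16_via_sse (void *dst, uint8_t val)
  {
    __m128i v = _mm_set1_epi8 ((char) val);   /* replication paid here */
    _mm_storeu_si128 ((__m128i *) dst, v);
  }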
> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp,
> rtx destreg, rtx srcreg,
> off = 4;
> emit_insn (gen_strmov (destreg, dst, srcreg, src));
> }
> + if (align_bytes & 8)
> + {
> + if (TARGET_64BIT || TARGET_SSE)
> + {
> + dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> + src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + }
> + else
> + {
> + dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> + src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?
Actually, vector moves are used here when they are available: if we are
in 32-bit mode (so the 8-byte move can't go through a GPR) but SSE is
available, an SSE move is generated.
> +/* Target hook. Returns rtx of mode MODE with promoted value VAL, that is
> + supposed to represent one byte. MODE could be a vector mode.
> + Example:
> + 1) VAL = const_int (0xAB), mode = SImode,
> + the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> + 2) if VAL isn't const, then the result will be the result of
> MUL-instruction
> + of VAL and const_int (0x01010101) (for SImode). */
>
> This would probably go better as named expansion pattern, like we do for other
> machine description interfaces.
I don't think it can be done in a machine-independent way - e.g. if AVX
is available we can use broadcast instructions, and if not we need to
use multiply instructions; on other architectures there are probably
other, more efficient ways to duplicate a byte value across an entire
vector register. So IMO this is a good place for a hook.
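For reference, the non-constant SImode case from the hook's comment is
just the multiply trick below (plain C; the function name is only
illustrative):

  #include <stdint.h>

  /* 0xAB * 0x01010101 == 0xABABABAB: replicates one byte across a
     32-bit word.  A vector mode would instead use a broadcast (AVX)
     or a multiply-plus-shuffle sequence.  */
  static inline uint32_t
  promote_byte_to_si (uint8_t val)
  {
    return (uint32_t) val * 0x01010101u;
  }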
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by
> @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
> The default is zero which means to not iterate over other vector sizes.
> @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum
> machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine indpendent part of the patch.
>
These changes are present in the middle-end part too (of course, in the
full patch they are not duplicated). I left them in this patch as well
to avoid possible build problems.
--
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.
On 27 October 2011 19:09, Jan Hubicka <[email protected]> wrote:
> Hi,
> sorry for delay with the review. This is my first pass through the backend
> part, hopefully
> someone else will do the middle end bits.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 2c53423..d7c4330 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for
> tuning for size */
> COSTS_N_BYTES (2), /* cost of FABS instruction. */
> COSTS_N_BYTES (2), /* cost of FCHS instruction. */
> COSTS_N_BYTES (2), /* cost of FSQRT instruction. */
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am bit concerned about explossion of variants, but adding aligned variants
> probably makes
> sense. I guess we should specify what alignment needs to be known. I.e. is
> alignment of 2 enough
> or shall the alignment be matching the size of loads/stores produced?
>
> @@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx
> operands[])
> }
> else
> {
> + if (mode == DImode
> + && !TARGET_64BIT
> + && TARGET_SSE2
> + && MEM_P (op0)
> + && MEM_P (op1)
> + && !push_operand (op0, mode)
> + && can_create_pseudo_p ())
> + {
> + rtx temp = gen_reg_rtx (V2DImode);
> + emit_insn (gen_sse2_loadq (temp, op1));
> + emit_insn (gen_sse_storeq (op0, temp));
> + return;
> + }
> + if (mode == DImode
> + && !TARGET_64BIT
> + && TARGET_SSE
> + && !MEM_P (op1)
> + && GET_MODE (op1) == V2DImode)
> + {
> + emit_insn (gen_sse_storeq (op0, op1));
> + return;
> + }
> + if (mode == TImode
> + && TARGET_AVX2
> + && MEM_P (op0)
> + && !MEM_P (op1)
> + && GET_MODE (op1) == V4DImode)
> + {
> + op0 = convert_to_mode (V2DImode, op0, 1);
> + emit_insn (gen_vec_extract_lo_v4di (op0, op1));
> + return;
> + }
>
> This hunk seems dangerous in a way that by emitting the explicit loadq/storeq
> pairs (and similar) will
> prevent use of integer registers for 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores.
> With gimple optimizations
> we already know pretty well that the moves will stay as they are. That might
> be enough for you?
>
> If you go this way, please make separate patch so it can be benchmarked.
> While the moves are faster
> there are always problem with size mismatches in load/store buffers.
>
> I think for string operations we should output the SSE/AVX instruction
> variants
> by hand + we could try to instruct IRA to actually preffer use of SSE/AVX when
> feasible? This has been traditionally problem with old RA because it did not
> see that because pseudo is eventually used for arithmetics, it can not go into
> SSE register. So it was not possible to make MMX/SSE/AVX to be preferred
> variants for 64bit/128bit manipulations w/o hurting performance of code that
> does arithmetic on long long and __int128. Perhaps IRA can solve this now.
> Vladimir, do you have any ida?
>
> +/* Helper function for expand_set_or_movmem_via_loop.
> + This function can reuse iter rtx from another loop and don't generate
> + code for updating the addresses. */
> +static rtx
> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
> + rtx destptr, rtx srcptr, rtx value,
> + rtx count, rtx iter,
> + enum machine_mode mode, int unroll,
> + int expected_size, bool change_ptrs)
>
> I wrote the original function, but it is not really clear for me what the
> function
> does now. I.e. what is code for updating addresses and what means reusing
> iter.
> I guess reusing iter means that we won't start the loop from 0. Could you
> expand comments a bit more?
>
> I know I did not documented them originally, but all the parameters ought to
> be
> explicitely documented in a function comment.
> @@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
> emit_insn (gen_strmov (destptr, dest, srcptr, src));
> }
>
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to
> DEST. */
> +/* Emit strset instuction. If RHS is constant, and vector mode will be used,
> + then move this consatnt to a vector register before emitting strset. */
> +static void
> +emit_strset (rtx destmem, rtx value,
> + rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
> {
> - if (TARGET_64BIT)
> - {
> - dest = adjust_automodify_address_nv (destmem, DImode, destptr,
> offset);
> - emit_insn (gen_strset (destptr, dest, value));
> - }
> - else
> - {
> - dest = adjust_automodify_address_nv (destmem, SImode, destptr,
> offset);
> - emit_insn (gen_strset (destptr, dest, value));
> - dest = adjust_automodify_address_nv (destmem, SImode, destptr,
> offset + 4);
> - emit_insn (gen_strset (destptr, dest, value));
> - }
> - offset += 8;
> + if (GET_MODE (destmem) != move_mode)
> + destmem = change_address (destmem, move_mode, destptr);
> AFAIK change_address is not equilvalent into adjust_automodify_address_nv in
> a way
> it copies memory aliasing attributes and it is needed to zap them here since
> stringops
> behaves funily WRT aliaseing.
> if (max_size > 16)
> {
> rtx label = ix86_expand_aligntest (count, 16, true);
> if (TARGET_64BIT)
> {
> - dest = change_address (destmem, DImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, DImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
>
> No use for 128bit moves here?
> }
> else
> {
> - dest = change_address (destmem, SImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, SImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
>
> And here?
> @@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
> if (align <= 1 && desired_alignment > 1)
> {
> rtx label = ix86_expand_aligntest (destptr, 1, false);
> - srcmem = change_address (srcmem, QImode, srcptr);
> - destmem = change_address (destmem, QImode, destptr);
> + srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
> + destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
>
> You want to always use adjust_automodify_address or
> adjust_automodify_address_nv,
> adjust_automodify_address_1 is not intended for general use.
> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp,
> rtx destreg, rtx srcreg,
> off = 4;
> emit_insn (gen_strmov (destreg, dst, srcreg, src));
> }
> + if (align_bytes & 8)
> + {
> + if (TARGET_64BIT || TARGET_SSE)
> + {
> + dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> + src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + }
> + else
> + {
> + dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> + src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?
> +/* Target hook. Returns rtx of mode MODE with promoted value VAL, that is
> + supposed to represent one byte. MODE could be a vector mode.
> + Example:
> + 1) VAL = const_int (0xAB), mode = SImode,
> + the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> + 2) if VAL isn't const, then the result will be the result of
> MUL-instruction
> + of VAL and const_int (0x01010101) (for SImode). */
>
> This would probably go better as named expansion pattern, like we do for other
> machine description interfaces.
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by
> @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
> The default is zero which means to not iterate over other vector sizes.
> @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum
> machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine indpendent part of the patch.
>
> Honza
>