On Fri, Aug 1, 2014 at 2:21 AM,  <pins...@gmail.com> wrote:
>
>> On Jun 6, 2014, at 1:50 AM, James Greenhalgh <james.greenha...@arm.com> wrote:
>>
>> Hi,
>>
>> The move_by_pieces infrastructure performs a copy by repeatedly trying
>> the largest safe copy it can make.  So for a 15-byte copy we might see:
>>
>>   offset   amount   bytes copied
>>   0        8        0-7
>>   8        4        8-11
>>   12       2        12-13
>>   14       1        14
>>
>> However, we can implement a 15-byte copy like so:
>>
>>   offset   amount   bytes copied
>>   0        8        0-7
>>   7        8        7-14
>>
>> which can prove more efficient for both space and speed.
>>
>> In this patch we set MOVE_RATIO low to avoid using move_by_pieces, and
>> implement the movmem pattern name to expand small block copy cases.  Note,
>> this optimization does not apply for -mstrict-align targets, which must
>> continue copying byte-by-byte.
>
> Why not change move_by_pieces instead of adding target-specific code?  That
> seems like a better option.  You can check the SLOW_UNALIGNED_ACCESS target
> macro to see whether you want to do this optimization there too.  As I
> mentioned in the other email, make sure you check whether the from and to
> are volatile before doing this optimization.
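To illustrate the overlapping-tail idea in plain C (just a sketch, not the
patch itself; copy_15_overlapping is a made-up helper, and memcpy of fixed
8-byte chunks stands in for the unaligned loads and stores the expander
would emit on a target where those are cheap):

    #include <stdint.h>
    #include <string.h>

    /* Copy exactly 15 bytes as two overlapping 8-byte accesses:
       bytes 0-7 first, then bytes 7-14.  Byte 7 is written twice with
       the same value, which is harmless for non-volatile,
       non-overlapping source and destination.  */
    static void
    copy_15_overlapping (void *dst, const void *src)
    {
      uint64_t lo, hi;
      memcpy (&lo, src, 8);                      /* load bytes 0-7    */
      memcpy (&hi, (const char *) src + 7, 8);   /* load bytes 7-14   */
      memcpy (dst, &lo, 8);                      /* store bytes 0-7   */
      memcpy ((char *) dst + 7, &hi, 8);         /* store bytes 7-14  */
    }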
Attached is the patch which does what I mentioned.  I also changed
store_by_pieces to implement a similar optimization there (for memset
and strcpy).  Also, since I used SLOW_UNALIGNED_ACCESS, this is a
generic optimization.

I tested an earlier version on x86_64-linux-gnu and I am in the middle
of bootstrapping/testing this one.

Thanks,
Andrew Pinski

        * expr.c (move_by_pieces): Take the min of max_size and len to
        speed things up and to take advantage of the mode in
        move_by_pieces_1.
        (move_by_pieces_1): Read/write the leftovers using an
        overlapping memory location to reduce the number of
        reads/writes.
        (store_by_pieces_1): Take the min of max_size and len to speed
        things up and to take advantage of the mode in
        store_by_pieces_2.
        (store_by_pieces_2): Write the leftovers using an overlapping
        memory location to reduce the number of writes.

>
> Thanks,
> Andrew
>
>>
>> Setting MOVE_RATIO low in this way causes a few tests to begin failing;
>> both of these are documented in the test-case as expected to fail for
>> low MOVE_RATIO targets, which do not allow certain tree-level
>> optimizations.
>>
>> Bootstrapped on aarch64-unknown-linux-gnu with no issues.
>>
>> OK for trunk?
>>
>> Thanks,
>> James
>>
>> ---
>> gcc/
>>
>> 2014-06-06  James Greenhalgh  <james.greenha...@arm.com>
>>
>>        * config/aarch64/aarch64-protos.h (aarch64_expand_movmem): New.
>>        * config/aarch64/aarch64.c (aarch64_move_pointer): New.
>>        (aarch64_progress_pointer): Likewise.
>>        (aarch64_copy_one_part_and_move_pointers): Likewise.
>>        (aarch64_expand_movmem): Likewise.
>>        * config/aarch64/aarch64.h (MOVE_RATIO): Set low.
>>        * config/aarch64/aarch64.md (movmem<mode>): New.
>>
>> gcc/testsuite/
>>
>> 2014-06-06  James Greenhalgh  <james.greenha...@arm.com>
>>
>>        * gcc.dg/tree-ssa/pr42585.c: Skip for AArch64.
>>        * gcc.dg/tree-ssa/sra-12.c: Likewise.
>> <0001-AArch64-Implement-movmem-for-the-benefit-of-inline-m.patch>
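To make the leftover handling in the patch below concrete (my arithmetic,
not taken from the patch): for a 15-byte copy expanded in an 8-byte mode on
a target where unaligned accesses are not slow, the main loop emits one
8-byte move and leaves data->offset == 8 and data->len == 7.  Since
ceil_log2 (7) == exact_log2 (8) == 3, the new code computes
offset = 8 - (8 - 7) = 7 and emits a second, overlapping 8-byte move
covering bytes 7-14, instead of falling back to 4-, 2- and 1-byte moves.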
Index: expr.c
===================================================================
--- expr.c      (revision 213306)
+++ expr.c      (working copy)
@@ -876,6 +876,9 @@ move_by_pieces (rtx to, rtx from, unsign
   if (data.reverse)
     data.offset = len;
   data.len = len;
+  /* Use the MIN of the length and the max size we can use.  */
+  max_size = max_size > (len + 1) ? (len + 1) : max_size;
+
   /* If copying requires more than two move insns,
      copy addresses to registers (to make displacements shorter)
      and use post-increment if available.  */
@@ -1073,6 +1076,32 @@ move_by_pieces_1 (insn_gen_fn genfun, ma
 
       data->len -= size;
     }
+
+  /* If we have some data left and unaligned accesses
+     are not slow, back up slightly and emit the move.  */
+  if (data->len > 0
+      && !STRICT_ALIGNMENT
+      && !SLOW_UNALIGNED_ACCESS (mode, 1)
+      /* Not a stack push.  */
+      && data->to
+      /* Neither side is volatile memory.  */
+      && !MEM_VOLATILE_P (data->to)
+      && !MEM_VOLATILE_P (data->from)
+      && ceil_log2 (data->len) == exact_log2 (size)
+      /* No incrementing of the to or from.  */
+      && data->explicit_inc_to == 0
+      && data->explicit_inc_from == 0
+      /* No auto-incrementing of the to or from.  */
+      && !data->autinc_to
+      && !data->autinc_from
+      && !data->reverse)
+    {
+      unsigned offset = data->offset - (size - data->len);
+      to1 = adjust_address (data->to, mode, offset);
+      from1 = adjust_address (data->from, mode, offset);
+      emit_insn ((*genfun) (to1, from1));
+      data->len = 0;
+    }
 }
 
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -2636,6 +2665,9 @@ store_by_pieces_1 (struct store_by_piece
   if (data->reverse)
     data->offset = data->len;
 
+  /* Use the MIN of the length and the max size we can use.  */
+  max_size = max_size > (data->len + 1) ? (data->len + 1) : max_size;
+
   /* If storing requires more than two move insns,
      copy addresses to registers (to make displacements shorter)
      and use post-increment if available.  */
@@ -2733,6 +2765,24 @@ store_by_pieces_2 (insn_gen_fn genfun, m
 
       data->len -= size;
     }
+
+  /* If we have some data left and unaligned accesses
+     are not slow, back up slightly and emit that constant.  */
+  if (data->len > 0
+      && !STRICT_ALIGNMENT
+      && !SLOW_UNALIGNED_ACCESS (mode, 1)
+      && !MEM_VOLATILE_P (data->to)
+      && ceil_log2 (data->len) == exact_log2 (size)
+      && data->explicit_inc_to == 0
+      && !data->autinc_to
+      && !data->reverse)
+    {
+      unsigned offset = data->offset - (size - data->len);
+      to1 = adjust_address (data->to, mode, offset);
+      cst = (*data->constfun) (data->constfundata, offset, mode);
+      emit_insn ((*genfun) (to1, cst));
+      data->len = 0;
+    }
 }
 
 /* Write zeros through the storage of OBJECT.  If OBJECT has BLKmode, SIZE is
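For what it's worth, a small hypothetical test one could use to eyeball the
effect (not part of the patch): on a target where STRICT_ALIGNMENT and
SLOW_UNALIGNED_ACCESS are false, I would expect each of the 15-byte
operations below to become two overlapping 8-byte accesses instead of an
8/4/2/1 sequence, provided they go through the by-pieces paths at all
(which depends on the target's MOVE_RATIO and related costs).

    /* Hypothetical test; compile with -O2 and inspect the assembly.  */
    #include <string.h>

    struct s15 { char c[15]; };

    void
    copy15 (struct s15 *dst, const struct s15 *src)
    {
      *dst = *src;                    /* small block copy candidate  */
    }

    void
    clear15 (struct s15 *dst)
    {
      memset (dst, 0, sizeof *dst);   /* small block clear candidate */
    }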