On Fri, Aug 1, 2014 at 2:21 AM,  <pins...@gmail.com> wrote:
>
>
>> On Jun 6, 2014, at 1:50 AM, James Greenhalgh <james.greenha...@arm.com> 
>> wrote:
>>
>>
>> Hi,
>>
>> The move_by_pieces infrastructure performs a copy by repeatedly trying
>> the largest safe copy it can make. So for a 15-byte copy we might see:
>>
>> offset   amount  bytes copied
>> 0        8       0-7
>> 8        4       8-11
>> 12       2       12-13
>> 14       1       14
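
(For reference, a rough stand-alone sketch of that greedy power-of-two split; split_copy is a made-up helper for illustration, not the actual expr.c code.)

#include <stdio.h>

/* Greedily split LEN bytes into the largest power-of-two chunks,
   reproducing the offset/amount table above for LEN == 15.  */
static void
split_copy (unsigned len)
{
  unsigned offset = 0;
  unsigned size;

  for (size = 8; size >= 1; size /= 2)
    while (len >= size)
      {
        printf ("offset %u  amount %u\n", offset, size);
        offset += size;
        len -= size;
      }
}

int
main (void)
{
  split_copy (15);
  return 0;
}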
>>
>> However, we can implement a 15-byte copy as so:
>>
>> offset   amount  bytes copied
>> 0        8       0-7
>> 7        8       7-14
>>
>> Which can prove more efficient for both space and speed.
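
(The overlapping variant, sketched in plain C under the assumption that a fixed-size 8-byte memcpy is lowered to a single unaligned load/store on targets where that is cheap; copy15 is a made-up name, not GCC code.)

#include <string.h>

/* Copy exactly 15 bytes as two 8-byte moves that overlap at byte 7,
   instead of the 8+4+2+1 sequence.  */
static void
copy15 (void *dst, const void *src)
{
  memcpy (dst, src, 8);                                  /* bytes 0-7  */
  memcpy ((char *) dst + 7, (const char *) src + 7, 8);  /* bytes 7-14 */
}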
>>
>> In this patch we set MOVE_RATIO low to avoid using move_by_pieces, and
>> implement the movmem pattern name to expand small block copy cases.  Note
>> that this optimization does not apply to -mstrict-align targets, which must
>> continue copying byte-by-byte.
>
> Why not change move_by_pieces instead of adding target-specific code?  That
> seems like a better option.  You can check the slow-unaligned-access target
> macro to decide whether you want to do this optimization too.  As I mentioned
> in the other email, make sure you check the volatility of the from and to
> operands before doing this optimization.

Attached is a patch which does what I mentioned; I also changed
store_by_pieces to implement a similar optimization there (for memset
and strcpy).  Since I used SLOW_UNALIGNED_ACCESS, this is a generic
optimization rather than a target-specific one.
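
(For the store_by_pieces side, the effect is analogous: a 15-byte memset can be emitted as two overlapping 8-byte stores rather than 8+4+2+1 stores.  A rough sketch, not the expr.c code; clear15 is a made-up name.)

#include <string.h>

/* Clear exactly 15 bytes with two 8-byte stores that overlap at byte 7.  */
static void
clear15 (void *dst)
{
  memset (dst, 0, 8);               /* bytes 0-7  */
  memset ((char *) dst + 7, 0, 8);  /* bytes 7-14 */
}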

I tested an earlier version on x86_64-linux-gnu and am in the middle
of bootstrapping/testing this one.

Thanks,
Andrew Pinski

* expr.c (move_by_pieces): Take the min of max_size and len to speed
things up and to take advantage of the mode in move_by_pieces_1.
(move_by_pieces_1): Read/write the leftovers using overlapping
memory locations to reduce the number of reads/writes.
(store_by_pieces_1): Take the min of max_size and len to speed
things up and to take advantage of the mode in store_by_pieces_2.
(store_by_pieces_2): Write the leftovers using overlapping
memory locations to reduce the number of writes.


>
> Thanks,
> Andrew
>
>
>>
>> Setting MOVE_RATIO low in this way causes a few tests to begin failing;
>> both are documented in the test cases as expected to fail for low
>> MOVE_RATIO targets, which do not allow certain tree-level optimizations.
>>
>> Bootstrapped on aarch64-unknown-linux-gnu with no issues.
>>
>> OK for trunk?
>>
>> Thanks,
>> James
>>
>> ---
>> gcc/
>>
>> 2014-06-06  James Greenhalgh  <james.greenha...@arm.com>
>>
>>    * config/aarch64/aarch64-protos.h (aarch64_expand_movmem): New.
>>    * config/aarch64/aarch64.c (aarch64_move_pointer): New.
>>    (aarch64_progress_pointer): Likewise.
>>    (aarch64_copy_one_part_and_move_pointers): Likewise.
>>    (aarch64_expand_movmem): Likewise.
>>    * config/aarch64/aarch64.h (MOVE_RATIO): Set low.
>>    * config/aarch64/aarch64.md (movmem<mode>): New.
>>
>> gcc/testsuite/
>>
>> 2014-06-06  James Greenhalgh  <james.greenha...@arm.com>
>>
>>    * gcc.dg/tree-ssa/pr42585.c: Skip for AArch64.
>>    * gcc.dg/tree-ssa/sra-12.c: Likewise.
>> <0001-AArch64-Implement-movmem-for-the-benefit-of-inline-m.patch>
Index: expr.c
===================================================================
--- expr.c      (revision 213306)
+++ expr.c      (working copy)
@@ -876,6 +876,9 @@ move_by_pieces (rtx to, rtx from, unsign
   if (data.reverse) data.offset = len;
   data.len = len;
 
+  /* Use the MIN of the length and the max size we can use.  */
+  max_size = max_size > (len + 1) ? (len + 1) : max_size;
+
   /* If copying requires more than two move insns,
      copy addresses to registers (to make displacements shorter)
      and use post-increment if available.  */
@@ -1073,6 +1076,32 @@ move_by_pieces_1 (insn_gen_fn genfun, ma
 
       data->len -= size;
     }
+
+  /* If we have some data left and unaligned accesses
+     are not slow, back up slightly and emit the move.  */
+  if (data->len > 0
+      && !STRICT_ALIGNMENT
+      && !SLOW_UNALIGNED_ACCESS (mode, 1)
+      /* Not a stack push.  */
+      && data->to
+      /* Neither side is volatile memory. */
+      && !MEM_VOLATILE_P (data->to)
+      && !MEM_VOLATILE_P (data->from)
+      && ceil_log2 (data->len) == exact_log2 (size)
+      /* No incrementing of the to or from. */
+      && data->explicit_inc_to == 0
+      && data->explicit_inc_from == 0
+      /* No auto-incrementing of the to or from. */
+      && !data->autinc_to
+      && !data->autinc_from
+      && !data->reverse)
+    {
+      unsigned offset = data->offset - (size - data->len);
+      to1 = adjust_address (data->to, mode, offset);
+      from1 = adjust_address (data->from, mode, offset);
+      emit_insn ((*genfun) (to1, from1));
+      data->len = 0;
+    }
 }
 
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -2636,6 +2665,9 @@ store_by_pieces_1 (struct store_by_piece
   if (data->reverse)
     data->offset = data->len;
 
+  /* Use the MIN of the length and the max size we can use.  */
+  max_size = max_size > (data->len + 1) ? (data->len + 1) : max_size;
+
   /* If storing requires more than two move insns,
      copy addresses to registers (to make displacements shorter)
      and use post-increment if available.  */
@@ -2733,6 +2765,24 @@ store_by_pieces_2 (insn_gen_fn genfun, m
 
       data->len -= size;
     }
+
+  /* If we have some data left and unaligned accesses
+     are not slow, back up slightly and emit that constant.  */
+  if (data->len > 0
+      && !STRICT_ALIGNMENT
+      && !SLOW_UNALIGNED_ACCESS (mode, 1)
+      && !MEM_VOLATILE_P (data->to)
+      && ceil_log2 (data->len) == exact_log2 (size)
+      && data->explicit_inc_to == 0
+      && !data->autinc_to
+      && !data->reverse)
+    {
+      unsigned offset = data->offset - (size - data->len);
+      to1 = adjust_address (data->to, mode, offset);
+      cst = (*data->constfun) (data->constfundata, offset, mode);
+      emit_insn ((*genfun) (to1, cst));
+      data->len = 0;
+    }
 }
 
 /* Write zeros through the storage of OBJECT.  If OBJECT has BLKmode, SIZE is

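To make the new leftover path concrete: for a 15-byte copy with an 8-byte
widest mode, the main loop emits one 8-byte move and stops with
data->len == 7 and data->offset == 8.  Since ceil_log2 (7) == exact_log2 (8) == 3,
the tail is handled by one more 8-byte move at offset 8 - (8 - 7) == 7,
overlapping the first move by a single byte.  A stand-alone check of that
arithmetic (ceil_log2/exact_log2 are reimplemented here purely for the
example; this is not GCC code):

#include <assert.h>

/* Smallest K such that 2**K >= X, for X > 0.  */
static int
my_ceil_log2 (unsigned x)
{
  int k = 0;
  while ((1u << k) < x)
    k++;
  return k;
}

/* K if X == 2**K, otherwise -1.  */
static int
my_exact_log2 (unsigned x)
{
  return (x != 0 && (x & (x - 1)) == 0) ? my_ceil_log2 (x) : -1;
}

int
main (void)
{
  /* State after the first 8-byte move of a 15-byte copy.  */
  unsigned len = 7, offset = 8, size = 8;

  assert (my_ceil_log2 (len) == my_exact_log2 (size));
  assert (offset - (size - len) == 7);  /* The overlapping tail starts at byte 7.  */
  return 0;
}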