The effects of closed loop SSA and Scalar Evolution Const Prop.

Pranav Bhandarkar Tue, 11 Mar 2008 05:39:12 -0700

Hi,

I am writing about a problem I noticed with the code generated for
memcpy by arm-none-eabi-gcc.


Now, memcpy has three distinct loops - one that copies (4 *sizeof
(long) ) bytes per iteration, one that copies sizeof (long) bytes per
iteration and the last one that copies one byte per iteration. The
registers used for the src and destination pointers should, IMHO, be
the same across all the three loops. However, what I noticed is that
after the first loop the src and dest registers arent reused but the
address of the next byte to be copied is recalculated as (original src
address + (number of iterations of 1st loop * 16)) . Similarly for the
destination address too. See the assembly snippet below.

.L3:
       <Code here for the loop that copies 16 bytes per iteration. The
src pointer is ip and the dest pointer is r1. The len is in r0>
        bhi     .L3     @,
        sub     r2, r4, #16     @ D.1312, len0,
        mov     r3, r2, lsr #4  @ D.1314, D.1312,
        sub     r1, r2, r3, asl #4      @ len, D.1312, D.1314,
        add     r3, r3, #1      @ tmp225, D.1314,
        mov     r3, r3, asl #4  @ D.1320, tmp225,
        cmp     r1, #3  @ len,
        add     r4, r5, r3      @ aligned_src.60, src0, D.1320
<----Recalcuation of the src pointer for the second loop.
        add     r0, r6, r3      @ aligned_dst, dst0, D.1320
<--- Recalculation of the dest pointer for the second loop.
        bls     .L4     @,
        mov     ip, #0  @ ivtmp.31,
.L5:
        ldr     r3, [r4, ip]    @ tmp226,* ivtmp.31
        str     r3, [r0, ip]    @ tmp226,* ivtmp.31
        add     ip, ip, #4      @ ivtmp.31, ivtmp.31,
        rsb     r3, ip, r1      @ tmp227, ivtmp.31, len
        cmp     r3, #3  @ tmp227,
        bhi     .L5     @,

What seems to happen is that closed loop SSA demands that no variable
be used outside the loop it is defined in . And before the loop
optimization initialization pass, the second loop finds the following
PHI nodes in its first basic block.

  # len_74 = PHI <len_38(17), len_15(15)>
  # aligned_src_73 = PHI <aligned_src_35(17), aligned_src_22(15)>
  # aligned_dst_61 = PHI <aligned_dst_34(17), aligned_dst_21(15)>

All of  len_38, aligned_src_35 and aligned_dst_34 are defined in the
first loop and used here again the second loop. Therefore the loop
initialization pass puts the following PHI nodes in the exit block of
the 1st loop.
  # len_59 = PHI <len_38(5)>
  # aligned_src_62 = PHI <aligned_src_35(5)>
  # aligned_dst_16 = PHI <aligned_dst_34(5)>

and the PHI nodes in the loop 2 are changed into

  # len_74 = PHI <len_59(17), len_15(15)>
  # aligned_src_73 = PHI <aligned_src_62(17), aligned_src_22(15)>
  # aligned_dst_61 = PHI <aligned_dst_16(17), aligned_dst_21(15)>


Now tree scalar evolution goes over PHI nodes and realises that
aligned_src_35 has a scalar evolution {aligned_src_22 + 16, +, 16}_1)
where aligned_src_22 is
(const long int *) src0_12(D) i.e the original src pointer.  Therefore
to calculate aligned_src_62 before the second loop computations are
introduced based on aligned_src_22.

My question is, shouldnt scalar evolution ignore PHI nodes with one
argument (implying a copy) or If not atleast pay heed to the cost of
additional computations introduced.

cheers!
Pranav

The effects of closed loop SSA and Scalar Evolution Const Prop.

Reply via email to