Hi, I am writing about a problem I noticed with the code generated for memcpy by arm-none-eabi-gcc.
Now, memcpy has three distinct loops - one that copies (4 *sizeof (long) ) bytes per iteration, one that copies sizeof (long) bytes per iteration and the last one that copies one byte per iteration. The registers used for the src and destination pointers should, IMHO, be the same across all the three loops. However, what I noticed is that after the first loop the src and dest registers arent reused but the address of the next byte to be copied is recalculated as (original src address + (number of iterations of 1st loop * 16)) . Similarly for the destination address too. See the assembly snippet below. .L3: <Code here for the loop that copies 16 bytes per iteration. The src pointer is ip and the dest pointer is r1. The len is in r0> bhi .L3 @, sub r2, r4, #16 @ D.1312, len0, mov r3, r2, lsr #4 @ D.1314, D.1312, sub r1, r2, r3, asl #4 @ len, D.1312, D.1314, add r3, r3, #1 @ tmp225, D.1314, mov r3, r3, asl #4 @ D.1320, tmp225, cmp r1, #3 @ len, add r4, r5, r3 @ aligned_src.60, src0, D.1320 <----Recalcuation of the src pointer for the second loop. add r0, r6, r3 @ aligned_dst, dst0, D.1320 <--- Recalculation of the dest pointer for the second loop. bls .L4 @, mov ip, #0 @ ivtmp.31, .L5: ldr r3, [r4, ip] @ tmp226,* ivtmp.31 str r3, [r0, ip] @ tmp226,* ivtmp.31 add ip, ip, #4 @ ivtmp.31, ivtmp.31, rsb r3, ip, r1 @ tmp227, ivtmp.31, len cmp r3, #3 @ tmp227, bhi .L5 @, What seems to happen is that closed loop SSA demands that no variable be used outside the loop it is defined in . And before the loop optimization initialization pass, the second loop finds the following PHI nodes in its first basic block. # len_74 = PHI <len_38(17), len_15(15)> # aligned_src_73 = PHI <aligned_src_35(17), aligned_src_22(15)> # aligned_dst_61 = PHI <aligned_dst_34(17), aligned_dst_21(15)> All of len_38, aligned_src_35 and aligned_dst_34 are defined in the first loop and used here again the second loop. Therefore the loop initialization pass puts the following PHI nodes in the exit block of the 1st loop. # len_59 = PHI <len_38(5)> # aligned_src_62 = PHI <aligned_src_35(5)> # aligned_dst_16 = PHI <aligned_dst_34(5)> and the PHI nodes in the loop 2 are changed into # len_74 = PHI <len_59(17), len_15(15)> # aligned_src_73 = PHI <aligned_src_62(17), aligned_src_22(15)> # aligned_dst_61 = PHI <aligned_dst_16(17), aligned_dst_21(15)> Now tree scalar evolution goes over PHI nodes and realises that aligned_src_35 has a scalar evolution {aligned_src_22 + 16, +, 16}_1) where aligned_src_22 is (const long int *) src0_12(D) i.e the original src pointer. Therefore to calculate aligned_src_62 before the second loop computations are introduced based on aligned_src_22. My question is, shouldnt scalar evolution ignore PHI nodes with one argument (implying a copy) or If not atleast pay heed to the cost of additional computations introduced. cheers! Pranav