On Fri, Sep 2, 2011 at 4:51 AM, David Gilbert <david.gilb...@linaro.org> wrote:
> Hi Michael,
>  I've just committed a pair of memcpy's into src/linaro-a9 - memcpy.S
> that is armv7
> and memcpy-hybrid.S that is a Neon hybrid which uses neon for non-aligned 
> cases
> and for large (128K or larger) copies.   I've also (accidentally)
> wired the memcpy-hybrid
> one into the Makefile.am (I wasn't sure what the right way to do this
> was - the neon_sources
> seemed a good place for it, but there is nothing currently in there
> that turns off the non-neon
> version).
>
>  I'd be interested in seeing the results for both; I've got a bit of
> a soft spot for the hybrid
> solution.
>
>  On the memset, yes the 'and' that you added is fine - but I started
> having a play and have
> some performance results (on -t 128) that I don't really understand:
>
>
> 1) and r1,#0xff
>    orr  r1,r1,r1,lsl#8
>    orr  r1,r1,r1,lsl#16
>
>   That's your solution - and fastest at somewhere around 2270MB/s for
> me - by the TRM I reckon
> that should be 3 cycles.
>
> 2) lsls r1,#24
>    orr  r1,r1,r1,lsr#8
>    orr  r1,r1,r1,lsr#16
>
>    lsl isn't explicitly listed in the TRM, so I assumed that was the
> same as a move with a constant
> shift, which my reading is that it's a single cycle; and the lsls is 2
> bytes - so you would think
> that should be as fast as yours but 2 bytes smaller - except it's
> reliably down at 2228MB/s - so
> it is slower.
>
> 3) Thinking it was an alignment issue I tried adding a mov r5,r1 to
> the front of that, and got 2248MB/s -
> so being faster with an extra instruction it probably was an alignment issue?
>
> 4) I also tried a pair of bfi's:
>   bfi r1,r1, #8, #8
>   bfi r1,r1, #16, #16
>
>   That came out at 2228MB/s - and is 4 cycles by the book.

Unfortunately you can't tell the performance from the latency.
Attached is a micro benchmark that has the three different versions
(and, ubx, lsl).  After compensating for the loop time, I got:

 * lsl: 1.006 s
 * ubx: 0.876 s
 * and: 0.918 s

even though ubx has a latency of two cycles.
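
The loop-time compensation can be sketched in portable C along these lines. This is only an illustration, not the attached benchmark: the names spread_lsl, spread_and, spread_nop, time_calls and compensated_seconds are hypothetical, the plain-C bodies stand in for whatever instructions the compiler actually emits, and the calls go through a function pointer rather than being made directly.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins for the variants being compared. */
__attribute__((noinline)) uint32_t spread_lsl(uint32_t v)
{
    v <<= 24;
    v |= v >> 8;
    v |= v >> 16;
    return v;
}

__attribute__((noinline)) uint32_t spread_and(uint32_t v)
{
    v &= 0xFF;
    v |= v << 8;
    v |= v << 16;
    return v;
}

/* Baseline: same call overhead, no work. */
__attribute__((noinline)) uint32_t spread_nop(uint32_t v)
{
    return v;
}

/* Volatile sink so the results are not discarded by the optimiser. */
static volatile uint32_t sink;

/* CPU time in seconds spent making `calls` calls to fn. */
double time_calls(uint32_t (*fn)(uint32_t), long calls)
{
    struct timespec start, end;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (long i = 0; i < calls; i++)
        sink = fn((uint32_t)i);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) * 1e-9;
}

/* Time for fn minus the time for the empty baseline. */
double compensated_seconds(uint32_t (*fn)(uint32_t), long calls)
{
    return time_calls(fn, calls) - time_calls(spread_nop, calls);
}
```

Calling compensated_seconds(spread_and, 400000000L) would roughly match the call count of the attached loop. The difference is noisy for small counts and can even come out slightly negative.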

I then took the AND version and moved it to the start of the file.
This small change in alignment pushed it up to 1.048 s, which is 14%
slower.
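
One way to keep alignment from confounding such comparisons is to pin each function under test to a fixed boundary. A minimal sketch, assuming GCC or Clang (the aligned function attribute is a compiler extension) and an arbitrarily chosen 64-byte boundary; BENCH_ALIGN, spread_and_aligned and entry_offset are illustrative names, not part of the attached benchmark:

```c
#include <stdint.h>

/* Assumed boundary; pick whatever suits the core's fetch/cache geometry. */
#define BENCH_ALIGN 64

/* The function entry is placed on a BENCH_ALIGN-byte boundary, so moving
   it around in the source file no longer changes its alignment. */
__attribute__((noinline, aligned(BENCH_ALIGN)))
uint32_t spread_and_aligned(uint32_t v)
{
    v &= 0xFF;
    v |= v << 8;
    v |= v << 16;
    return v;
}

/* Offset of a function's entry point within a BENCH_ALIGN-byte block.
   Bit 0 is masked off because a Thumb function pointer has it set. */
unsigned entry_offset(uint32_t (*fn)(uint32_t))
{
    uintptr_t addr = (uintptr_t)fn & ~(uintptr_t)1;
    return (unsigned)(addr % BENCH_ALIGN);
}
```

Printing entry_offset() for each variant before timing it shows whether two runs are being compared at the same alignment. GCC's -falign-functions=N option applies the same idea to every function in a file.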

-- Michael

#include <stdint.h>
#include <stdio.h>
#include <time.h>

uint32_t spread(uint32_t v) __attribute__((noinline));
uint32_t spread2(uint32_t v) __attribute__((noinline));
uint32_t spread3(uint32_t v) __attribute__((noinline));
uint32_t spread4(uint32_t v) __attribute__((noinline));

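uint32_t spread4(uint32_t v)
{
   /* v is read and written, so it is a "+r" operand; volatile keeps the
      calls from being dropped when the result is unused. */
   asm volatile("and %0, %0, #255\n\t"
       "orr %0, %0, %0, lsl #8\n\t"
       "orr %0, %0, %0, lsl #16\n\t"
       : "+r" (v)
       );
   return v;
}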

uint32_t spread(uint32_t v)
{
   asm("");
   v <<= 24;
   v |= (v >> 8);
   v |= (v >> 16);
   
   return v;
}

uint32_t spread2(uint32_t v)
{
   asm("");
   v &= 0xFF;
   v |= (v << 8);
   v |= (v << 16);
   
   return v;
}

uint32_t spread3(uint32_t v)
{
   asm("");
   return v;
}

int main()
{
   struct timespec start, end;
   clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
   
   for (int i = 0; i < 10000; i++)
     {
	for (int j = 0; j < 10000; j++)
	  {
	     spread4(123);
	     spread4(56);
	     spread4(45);
	     spread4(99);
	  }
     }

   clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
   double elapsed = end.tv_sec - start.tv_sec + (end.tv_nsec - start.tv_nsec)*1e-9;
   printf("%.6f\n", elapsed);
   return 0;
}
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
