On Fri, Sep 2, 2011 at 4:51 AM, David Gilbert <david.gilb...@linaro.org> wrote:
> Hi Michael,
>   I've just committed a pair of memcpy's into src/linaro-a9 - memcpy.S
> that is armv7 and memcpy-hybrid.S that is a Neon hybrid which uses neon
> for non-aligned cases and for large (128K or larger) copies.  I've also
> (accidentally) wired the memcpy-hybrid one into the Makefile.am (I wasn't
> sure what the right way to do this was - the neon_sources seemed a good
> place for it, but there is nothing currently in there that turns off the
> non-neon version).
>
> I'd be interested in seeing the results for both; I've got a bit of
> a soft spot for the hybrid solution.
>
> On the memset, yes the 'and' that you added is fine - but I started
> having a play and have some performance results (on -t 128) that I
> don't really understand:
>
> 1)  and r1,#0xff
>     orr r1,r1,r1,lsl#8
>     orr r1,r1,r1,lsl#16
>
> That's your solution - and fastest at somewhere around 2270MB/s for
> me - by the TRM I reckon that should be 3 cycles.
>
> 2)  lsls r1,#24
>     orr r1,r1,r1,lsr#8
>     orr r1,r1,r1,lsr#16
>
> lsl isn't explicitly listed in the TRM, so I assumed that was the
> same as a move with a constant shift, which my reading is that it's a
> single cycle; and the lsls is 2 bytes - so you would think that should
> be as fast as yours but 2 bytes smaller - except it's reliably down at
> 2228MB/s - so it is slower.
>
> 3) Thinking it was an alignment issue I tried adding a mov r5,r1 to
> the front of that, and got 2248MB/s - so being faster with an extra
> instruction it probably was an alignment issue?
>
> 4) I also tried a pair of bfi's:
>     bfi r1,r1, #8, #8
>     bfi r1,r1, #16, #16
>
> That came out at 2228MB/s - and is 4 cycles by the book.

Unfortunately you can't tell the performance from the latency.
Attached is a micro benchmark that has the three different versions
(and, ubx, lsl).  After compensating for the loop time, I got:

 * lsl: 1.006 s
 * ubx: 0.876 s
 * and: 0.918 s

even though ubx has a latency of two cycles.

I then took the AND version and shifted it to the start of the file.
This small change in alignment pushed it up to 1.048 s, which is 14 %
slower.

-- Michael

#include <stdint.h>
#include <stdio.h>
#include <time.h>

uint32_t spread(uint32_t v) __attribute__((noinline));
uint32_t spread2(uint32_t v) __attribute__((noinline));
uint32_t spread3(uint32_t v) __attribute__((noinline));
uint32_t spread4(uint32_t v) __attribute__((noinline));

/* AND version in inline assembly (ARM only). */
uint32_t spread4(uint32_t v)
{
  /* "+r" marks v as both input and output.  volatile keeps the block,
     and so the calls in main, from being optimised away. */
  asm volatile("and %0, %0, #255\n\t"
               "orr %0, %0, %0, lsl #8\n\t"
               "orr %0, %0, %0, lsl #16\n\t"
               : "+r" (v));

  return v;
}

/* LSL version. */
uint32_t spread(uint32_t v)
{
  asm("");  /* stops the compiler from dropping calls whose result is unused */
  v <<= 24;
  v |= (v >> 8);
  v |= (v >> 16);

  return v;
}

/* AND version in C. */
uint32_t spread2(uint32_t v)
{
  asm("");
  v &= 0xFF;
  v |= (v << 8);
  v |= (v << 16);

  return v;
}

/* No-op; timing it gives the loop and call overhead. */
uint32_t spread3(uint32_t v)
{
  asm("");
  return v;
}

int main(void)
{
  struct timespec start, end;

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);

  /* Change spread4 to one of the other functions to time that version. */
  for (int i = 0; i < 10000; i++)
    {
      for (int j = 0; j < 10000; j++)
        {
          spread4(123);
          spread4(56);
          spread4(45);
          spread4(99);
        }
    }

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

  double elapsed = end.tv_sec - start.tv_sec
    + (end.tv_nsec - start.tv_nsec) * 1e-9;

  printf("%.6f\n", elapsed);

  return 0;
}

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain