On 15 June 2012 00:33, Michael Hope <michael.h...@linaro.org> wrote:
> On 11 June 2012 21:53, Mans Rullgard <mans.rullg...@linaro.org> wrote:
>> On 11 June 2012 02:14, Michael Hope <michael.h...@linaro.org> wrote:
>>> We talked at Connect about finishing up the cortex-strings work by
>>> upstreaming them into Bionic, Newlib, and GLIBC. I've written up one
>>> of our standard 'Output' pages:
>>>
>>> https://wiki.linaro.org/WorkingGroups/ToolChain/Outputs/CortexStrings
>>>
>>> with a summary of what we did, what else exists, benchmark results,
>>> and next steps. This can be used to justify the routines to the
>>> different upstreams.
>>>
>>> The Android guys are going to upstream these to Bionic. I need a
>>> volunteer to do Newlib and GLIBC.
>>>
>>> One surprise was that the Newlib plain C routines are very good on
>>> strings - probably due to a good end of string detector.
>>
>> Those graphs end at 4k, which is well within even L1 cache.
>
> Yip, that's deliberate. Larger sizes are only relevant for memcpy()
> and memset() and past results show little change once you go outside
> the L1.

That's not what I have observed.

>> How do these functions compare for sizes that hit L2 or external memory?
>> I would expect functions doing some prefetching to perform better
>> there.
>
> The routines don't use explicit preload as the memory access is
> obvious and better left to the hardware. Having said that, these were
> run on an OMAP4460 which has the auto preload turned off by default.

And unfortunately it can't be enabled on GP devices due to missing ROM
calls. There are a few parameters that can be tuned, however, and I have
obtained speedups for some loads by tinkering with these.

> I'll add checking memcpy() for large blocks with and without preloads
> to my list.
>
>> Some time ago, I compared a few memcpy() implementations
>> on large blocks, and the Bionic NEON-optimised one was several
>> times faster than glibc. It is of course possible that glibc has
>> improved since then.
>
> A NEON based memcpy() is twice as fast on the A8 for both a cold L1
> and larger blocks as the NEON unit has wider access direct into the L2
> cache. The same effect doesn't occur on the A9.

On the 4460 the effect isn't great, but on the 4430 proper preloads make
a huge difference. Then again, the 4430 is perhaps not something we
should care too much about.

--
Mans Rullgard / mru
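
For anyone wanting to try the "with and without preloads" comparison on
large blocks, here is a minimal sketch of a copy loop with explicit
preload hints. It is not the cortex-strings, Bionic, or glibc code. The
function names, the 64-byte chunk size, and the 256-byte preload distance
are all assumptions made for illustration, and the distance in particular
would need tuning per core. It relies on the GCC/Clang builtin
__builtin_prefetch(), which normally becomes a PLD on ARM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tuning values, not taken from the thread. */
#define CHUNK_BYTES     64u   /* assumed cache line size */
#define PRELOAD_AHEAD  256u   /* assumed preload distance in bytes */

/*
 * Copy n bytes from src to dst, hinting upcoming source data to the
 * cache. The buffers must not overlap. Returns dst, like memcpy().
 */
void *copy_with_preload(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= CHUNK_BYTES) {
        /* Only hint addresses that are still inside the source buffer. */
        if (n > PRELOAD_AHEAD)
            __builtin_prefetch(s + PRELOAD_AHEAD, 0, 0);

        /* The compiler is free to expand this fixed-size copy inline. */
        memcpy(d, s, CHUNK_BYTES);
        d += CHUNK_BYTES;
        s += CHUNK_BYTES;
        n -= CHUNK_BYTES;
    }

    /* Tail shorter than one chunk. */
    if (n > 0)
        memcpy(d, s, n);

    return dst;
}

/* Quick sanity check: the result must match the source byte for byte. */
void copy_with_preload_self_check(void)
{
    unsigned char in[1000], out[1000];
    size_t i;

    for (i = 0; i < sizeof in; i++)
        in[i] = (unsigned char)(i * 7u + 3u);
    memset(out, 0, sizeof out);

    copy_with_preload(out, in, sizeof in);
    assert(memcmp(in, out, sizeof in) == 0);
}
```

Timing this against plain memcpy() on buffers well beyond the L2 size,
with PRELOAD_AHEAD varied, would show whether explicit preloads matter on
a given part. The sketch only demonstrates the mechanism and makes no
claim about the outcome.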