On 15 June 2012 00:33, Michael Hope <michael.h...@linaro.org> wrote:
> On 11 June 2012 21:53, Mans Rullgard <mans.rullg...@linaro.org> wrote:
>> On 11 June 2012 02:14, Michael Hope <michael.h...@linaro.org> wrote:
>>> We talked at Connect about finishing up the cortex-strings work by
>>> upstreaming them into Bionic, Newlib, and GLIBC.  I've written up one
>>> of our standard 'Output' pages:
>>>
>>>  https://wiki.linaro.org/WorkingGroups/ToolChain/Outputs/CortexStrings
>>>
>>> with a summary of what we did, what else exists, benchmark results,
>>> and next steps.  This can be used to justify the routines to the
>>> different upstreams.
>>>
>>> The Android guys are going to upstream these to Bionic.  I need a
>>> volunteer to do Newlib and GLIBC.
>>>
>>> One surprise was that the Newlib plain C routines are very good on
>>> strings, probably due to a good end-of-string detector.
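
For anyone who hasn't seen the trick, the usual word-at-a-time
end-of-string test looks roughly like the sketch below.  This is only
an illustration of the technique (the names are made up), not the
actual Newlib code:

    /* Nonzero iff w contains a zero byte: the subtraction sets the
       top bit of every byte that was zero, and the ~w term masks out
       bytes whose top bit was already set. */
    #include <stddef.h>
    #include <stdint.h>

    static inline int has_zero_byte(uint32_t w)
    {
        return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
    }

    size_t strlen_by_words(const char *s)
    {
        const char *p = s;

        /* Walk bytes until 4-byte aligned. */
        while ((uintptr_t)p & 3) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }

        /* Scan a word at a time until a zero byte shows up. */
        const uint32_t *w = (const uint32_t *)p;
        while (!has_zero_byte(*w))
            w++;

        /* Finish byte by byte within the final word. */
        p = (const char *)w;
        while (*p)
            p++;
        return (size_t)(p - s);
    }

The common case touches four bytes per iteration with no per-byte
branches, which is why a plain C routine can keep up surprisingly well
on strings.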
>>
>> Those graphs end at 4k, which is well within even L1 cache.
>
> Yip, that's deliberate.  Larger sizes are only relevant for memcpy()
> and memset(), and past results show little change once you go outside
> the L1.

That's not what I have observed.
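
To be concrete about what I mean by larger sizes: a sweep of block
sizes from well inside L1 to well outside L2, along the lines of the
sketch below (not my actual harness; the sizes, iteration counts and
clock source are arbitrary):

    /* Rough memcpy() throughput sweep, 1 KB up to 16 MB. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const size_t max = 16 * 1024 * 1024;    /* well past any L2 */
        char *src = malloc(max);
        char *dst = malloc(max);
        if (!src || !dst)
            return 1;
        memset(src, 0x55, max);                 /* fault the pages in */

        for (size_t size = 1024; size <= max; size *= 2) {
            size_t iters = (max / size) * 4;
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t i = 0; i < iters; i++)
                memcpy(dst, src, size);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double sec = (t1.tv_sec - t0.tv_sec) +
                         (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            printf("%8zu bytes: %6.1f MB/s\n",
                   size, size * iters / sec / 1e6);
        }
        free(src);
        free(dst);
        return 0;
    }

Doubling the size each step keeps the run short while still crossing
both cache boundaries, which is where the implementations start to
diverge in my experience.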

>> How do these functions compare for sizes that hit L2 or external memory?
>> I would expect functions doing some prefetching to perform better
>> there.
>
> The routines don't use explicit preload, as the access pattern is
> obvious and better left to the hardware.  Having said that, these were
> run on an OMAP4460, which has the auto preload turned off by default.

And unfortunately it can't be enabled on GP devices due to missing ROM
calls.  There are a few parameters that can be tuned, however, and I
have obtained speedups for some loads by tinkering with these.

> I'll add checking memcpy() for large blocks, with and without
> preloads, to my list.
>
>>  Some time ago, I compared a few memcpy() implementations
>> on large blocks, and the Bionic NEON-optimised one was several
>> times faster than glibc.  It is of course possible that glibc has
>> improved since then.
>
> A NEON-based memcpy() is twice as fast on the A8, both for a cold L1
> and for larger blocks, as the NEON unit has a wider path directly into
> the L2 cache.  The same effect doesn't occur on the A9.

On the 4460 the effect isn't great, but on the 4430 proper preloads make
a huge difference.  Then again, the 4430 is perhaps not something we
should care too much about.
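
For reference, by "proper preloads" I mean explicitly prefetching a
couple of cache lines ahead of the copy, roughly as below.  This is an
illustration only, not the cortex-strings or Bionic code, and the
256-byte preload distance is exactly the kind of parameter that wants
tuning per core and per memory system:

    /* NEON copy loop with an explicit preload ahead of the reads. */
    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    void copy_neon_pld(uint8_t *restrict dst, const uint8_t *restrict src,
                       size_t n)
    {
        while (n >= 32) {
            __builtin_prefetch(src + 256);       /* emits PLD on ARM */
            uint8x16_t a = vld1q_u8(src);
            uint8x16_t b = vld1q_u8(src + 16);
            vst1q_u8(dst, a);
            vst1q_u8(dst + 16, b);
            src += 32;
            dst += 32;
            n -= 32;
        }
        while (n--)                              /* byte tail */
            *dst++ = *src++;
    }

The distance wants to be a small multiple of the cache line size and
far enough ahead to cover the memory latency of the core in question.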

-- 
Mans Rullgard / mru
