On 16/08/2019 09:15, Shaokun Zhang wrote:
Hi Will,

On 2019/8/16 0:46, Will Deacon wrote:
On Thu, May 16, 2019 at 11:14:35AM +0800, Zhangshaokun wrote:
On 2019/5/15 17:47, Will Deacon wrote:
On Mon, Apr 15, 2019 at 07:18:22PM +0100, Robin Murphy wrote:
On 12/04/2019 10:52, Will Deacon wrote:
I'm waiting for Robin to come back with numbers for a C implementation.

Robin -- did you get anywhere with that?

Still not what I would call finished, but where I've got so far (besides an
increasingly elaborate test rig) is as below - it still wants some unrolling
in the middle to really fly (and actual testing on BE), but the worst-case
performance already equals or just beats this asm version on Cortex-A53 with
GCC 7 (by virtue of being alignment-insensitive and branchless except for
the loop). Unfortunately, the advantage of C code being instrumentable does
also come around to bite me...

Is there any interest from anybody in spinning a proper patch out of this?
Shaokun?

HiSilicon's Kunpeng920 (Hi1620) benefits from the do_csum optimization; if Ard and
Robin are OK with it, Lingyan or I can try to do it.
Of course, if anyone else posts the patch, we are happy to test it.
Either way is fine.

I don't mind who posts it, but Robin is super busy with SMMU stuff at the
moment so it probably makes more sense for you or Lingyan to do it.

Thanks for restarting this topic; Lingyan or I will do it soon.

FWIW, I've rolled up what I had so far and dumped it into a quick semi-realistic patch here:

http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=859c5566510c32ae72039aa5072e932a771a3596

So far I'd put most of the effort into the aforementioned benchmarking harness, to compare performance and correctness for all the proposed implementations over all reasonable alignment/length combinations. I think that got pretty much finished, but as Will says I'm unlikely to find time to properly look at this again for several weeks.

Robin.
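
For anyone reading along without the patch to hand, the kind of correctness comparison described above could be sketched roughly as below. This is not the code from the linked commit and not the kernel's do_csum: ref_csum, word_csum, count_mismatches, MAX_OFF and MAX_LEN are all hypothetical names, and the word-at-a-time routine is only an assumed stand-in for "alignment-insensitive, branchless except for the loop". Both routines return the folded 16-bit ones'-complement sum in native byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_OFF 8	/* hypothetical: start offsets 0..7 */
#define MAX_LEN 128	/* hypothetical: lengths 0..128 */

/* Fold a wide accumulator down to 16 bits (2^16 == 1 mod 0xffff). */
static uint16_t fold(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Reference: add native-order 16-bit words one pair of bytes at a time. */
static uint16_t ref_csum(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		uint16_t w;

		memcpy(&w, buf + i, sizeof(w));
		sum += w;
	}
	if (len & 1) {
		/* Trailing odd byte is paired with a zero byte. */
		unsigned char pad[2] = { buf[len - 1], 0 };
		uint16_t w;

		memcpy(&w, pad, sizeof(w));
		sum += w;
	}
	return fold(sum);
}

/* Candidate: 64-bit loads via memcpy, so any start alignment is fine. */
static uint16_t word_csum(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0, w;

	while (len >= 8) {
		memcpy(&w, buf, 8);
		sum += w;
		sum += (sum < w);	/* end-around carry, no branch */
		buf += 8;
		len -= 8;
	}
	if (len) {
		w = 0;
		memcpy(&w, buf, len);	/* zero-padded tail */
		sum += w;
		sum += (sum < w);
	}
	return fold(sum);
}

/* Count (offset, length) pairs on which the two routines disagree. */
static int count_mismatches(void)
{
	unsigned char data[MAX_OFF + MAX_LEN];
	size_t off, len, i;
	int bad = 0;

	for (i = 0; i < sizeof(data); i++)
		data[i] = (unsigned char)(i * 37 + 11);

	for (off = 0; off < MAX_OFF; off++)
		for (len = 0; len <= MAX_LEN; len++)
			if (ref_csum(data + off, len) != word_csum(data + off, len))
				bad++;
	return bad;
}
```

Calling count_mismatches() from a small main() and checking for zero gives the correctness half; the performance half would need timing around the inner calls, which is left out here.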
