Hi Will,

On 2019/5/15 17:47, Will Deacon wrote:
> On Mon, Apr 15, 2019 at 07:18:22PM +0100, Robin Murphy wrote:
>> On 12/04/2019 10:52, Will Deacon wrote:
>>> I'm waiting for Robin to come back with numbers for a C implementation.
>>>
>>> Robin -- did you get anywhere with that?
>>
>> Still not what I would call finished, but where I've got so far (besides an
>> increasingly elaborate test rig) is as below - it still wants some unrolling
>> in the middle to really fly (and actual testing on BE), but the worst-case
>> performance already equals or just beats this asm version on Cortex-A53 with
>> GCC 7 (by virtue of being alignment-insensitive and branchless except for
>> the loop). Unfortunately, the advantage of C code being instrumentable does
>> also come around to bite me...
>
> Is there any interest from anybody in spinning a proper patch out of this?
> Shaokun?
HiSilicon's Kunpeng 920 (Hi1620) benefits from the do_csum() optimization, so
if Ard and Robin are OK with it, Lingyan or I can try to do it. Of course, if
anyone else posts the patch, we are happy to test it - either way works for us.

Thanks,
Shaokun

>
> Will
>
>> /* Looks dumb, but generates nice-ish code */
>> static u64 accumulate(u64 sum, u64 data)
>> {
>> 	__uint128_t tmp = (__uint128_t)sum + data;
>> 	return tmp + (tmp >> 64);
>> }
>>
>> unsigned int do_csum_c(const unsigned char *buff, int len)
>> {
>> 	unsigned int offset, shift, sum, count;
>> 	u64 data, *ptr;
>> 	u64 sum64 = 0;
>>
>> 	offset = (unsigned long)buff & 0x7;
>> 	/*
>> 	 * This is to all intents and purposes safe, since rounding down cannot
>> 	 * result in a different page or cache line being accessed, and @buff
>> 	 * should absolutely not be pointing to anything read-sensitive.
>> 	 * It does, however, piss off KASAN...
>> 	 */
>> 	ptr = (u64 *)(buff - offset);
>> 	shift = offset * 8;
>>
>> 	/*
>> 	 * Head: zero out any excess leading bytes. Shifting back by the same
>> 	 * amount should be at least as fast as any other way of handling the
>> 	 * odd/even alignment, and means we can ignore it until the very end.
>> 	 */
>> 	data = *ptr++;
>> #ifdef __LITTLE_ENDIAN
>> 	data = (data >> shift) << shift;
>> #else
>> 	data = (data << shift) >> shift;
>> #endif
>> 	count = 8 - offset;
>>
>> 	/* Body: straightforward aligned loads from here on... */
>> 	//TODO: fancy stuff with larger strides and uint128s?
>> 	while(len > count) {
>> 		sum64 = accumulate(sum64, data);
>> 		data = *ptr++;
>> 		count += 8;
>> 	}
>> 	/*
>> 	 * Tail: zero any over-read bytes similarly to the head, again
>> 	 * preserving odd/even alignment.
>> 	 */
>> 	shift = (count - len) * 8;
>> #ifdef __LITTLE_ENDIAN
>> 	data = (data << shift) >> shift;
>> #else
>> 	data = (data >> shift) << shift;
>> #endif
>> 	sum64 = accumulate(sum64, data);
>>
>> 	/* Finally, folding */
>> 	sum64 += (sum64 >> 32) | (sum64 << 32);
>> 	sum = sum64 >> 32;
>> 	sum += (sum >> 16) | (sum << 16);
>> 	if (offset & 1)
>> 		return (u16)swab32(sum);
>>
>> 	return sum >> 16;
>> }
>
> .
>
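
For whoever ends up spinning the proper patch, a quick user-space cross-check
in the spirit of the test rig mentioned above might look like the sketch below.
This is only an illustration, not Robin's rig: it assumes a little-endian host,
expects accumulate() and do_csum_c() from the mail to be pasted in (or linked)
alongside small stand-ins for the kernel's u64/u16 types and swab32(), and
compares the result against a byte-at-a-time RFC 1071 reference over a sweep
of buffer alignments and lengths.

/*
 * Hypothetical user-space harness (not the test rig from the thread):
 * compares do_csum_c() against a byte-at-a-time ones' complement reference.
 * Little-endian host assumed.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal stand-ins for the kernel types/helpers the snippet relies on. */
typedef uint64_t u64;
typedef uint16_t u16;
#define swab32(x) __builtin_bswap32(x)
#ifndef __LITTLE_ENDIAN
#define __LITTLE_ENDIAN 1234
#endif

/* Paste accumulate() and do_csum_c() from the mail here, or link them in. */
unsigned int do_csum_c(const unsigned char *buff, int len);

/*
 * Reference: ones' complement sum of 16-bit words, bytes paired by their
 * offset within the buffer (even offsets in the low byte on little-endian),
 * folded to 16 bits with end-around carry.
 */
static unsigned int csum_ref(const unsigned char *buf, int len)
{
	u64 sum = 0;
	int i;

	for (i = 0; i < len; i++)
		sum += (i & 1) ? (u64)buf[i] << 8 : buf[i];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

int main(void)
{
	static unsigned char buf[1024];
	int off, len;
	size_t i;

	srand(1);
	for (i = 0; i < sizeof(buf); i++)
		buf[i] = rand();

	/* Exercise every head/tail alignment the 8-byte loads can see. */
	for (off = 0; off < 16; off++) {
		for (len = 1; len <= 512; len++) {
			unsigned int want = csum_ref(buf + off, len);
			unsigned int got = do_csum_c(buf + off, len);

			if (got != want) {
				printf("mismatch: off=%d len=%d got=%#x want=%#x\n",
				       off, len, got, want);
				return 1;
			}
		}
	}
	printf("all offsets/lengths match\n");
	return 0;
}

The reference pairs bytes by their offset within the buffer, which should line
up with what do_csum() returns on a little-endian kernel once the odd-alignment
swab in do_csum_c() has been applied; the same sweep can also double as a
starting point for timing once the unrolling lands.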