From: Tom Herbert
> Sent: 02 March 2016 22:19
...
> +	/* Main loop using 64byte blocks */
> +	for (; len > 64; len -= 64, buff += 64) {
> +		asm("addq 0*8(%[src]),%[res]\n\t"
> +		    "adcq 1*8(%[src]),%[res]\n\t"
> +		    "adcq 2*8(%[src]),%[res]\n\t"
> +		    "adcq 3*8(%[src]),%[res]\n\t"
> +		    "adcq 4*8(%[src]),%[res]\n\t"
> +		    "adcq 5*8(%[src]),%[res]\n\t"
> +		    "adcq 6*8(%[src]),%[res]\n\t"
> +		    "adcq 7*8(%[src]),%[res]\n\t"
> +		    "adcq $0,%[res]"
> +		    : [res] "=r" (result)
> +		    : [src] "r" (buff),
> +		      "[res]" (result));
Did you try the asm loop that used 'lea %rcx..., jcxz..., jmp...'
without any unrolling? Neither lea nor jcxz modifies the flags, so the
carry from each adc survives from one iteration to the next.
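Something along these lines (untested sketch, my names; I've written
jrcxz rather than jcxz since the index lives in %rcx in 64-bit code,
and it assumes len is a multiple of 8):

	static unsigned long csum64_loop(const void *buff,
					 unsigned long len)
	{
		unsigned long result = 0;
		/* Negative quadword count, walked up towards zero. */
		long count = -(long)(len / 8);
		const unsigned long *end =
			(const unsigned long *)buff + len / 8;

		asm("clc\n"			/* enter with carry clear */
		    "1:	jrcxz 2f\n\t"		/* done when count hits 0 */
		    "adcq (%[end],%[cnt],8),%[res]\n\t"
		    "leaq 1(%[cnt]),%[cnt]\n\t"	/* ++count, flags untouched */
		    "jmp 1b\n"
		    "2:	adcq $0,%[res]"		/* fold in the last carry */
		    : [res] "+r" (result), [cnt] "+c" (count)
		    : [end] "r" (end)
		    : "memory");
		return result;
	}

jrcxz and leaq leave CF alone, which is the whole point: the adc carry
chain runs unbroken around the loop without any unrolling.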
...
> +	/* Sum over any remaining bytes (< 8 of them) */
> +	if (len & 0x7) {
> +		unsigned long val;
> +		/*
> +		 * Since "len" is > 8 here we backtrack in the buffer to load
> +		 * the outstanding bytes into the low order bytes of a quad and
> +		 * then shift to extract the relevant bytes. By doing this we
> +		 * avoid additional calls to load_unaligned_zeropad.
That comment is wrong. Maybe:
* Read the last 8 bytes of the buffer then shift to extract
* the required bytes.
* This is safe because the original length was > 8 and avoids
* any problems reading beyond the end of the valid data.
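I.e. roughly this (untested sketch; little-endian, 'buff'/'result'
follow the patch's names, 'rem' is mine, and the open-coded carry fold
stands in for whatever add-with-carry helper the patch uses):

	unsigned int rem = len & 0x7;

	if (rem) {
		/*
		 * Load the quad that ends at the last valid byte; safe
		 * because the original length was > 8, so buff + rem - 8
		 * is still inside the buffer.
		 */
		unsigned long val = *(const unsigned long *)(buff + rem - 8);

		/*
		 * Shift out the low 8 - rem bytes, which the main loop
		 * has already summed.
		 */
		val >>= 8 * (8 - rem);

		result += val;
		result += (result < val);	/* fold the carry back in */
	}
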
David