From: Tom Herbert
> Sent: 03 February 2016 19:19
...
> + /* Main loop */
> +50: adcq 0*8(%rdi),%rax
> + adcq 1*8(%rdi),%rax
> + adcq 2*8(%rdi),%rax
> + adcq 3*8(%rdi),%rax
> + adcq 4*8(%rdi),%rax
> + adcq 5*8(%rdi),%rax
> + adcq 6*8(%rdi),%rax
> + adcq 7*8(%rdi),%rax
> + adcq 8*8(%rdi),%rax
> + adcq 9*8(%rdi),%rax
> + adcq 10*8(%rdi),%rax
> + adcq 11*8(%rdi),%rax
> + adcq 12*8(%rdi),%rax
> + adcq 13*8(%rdi),%rax
> + adcq 14*8(%rdi),%rax
> + adcq 15*8(%rdi),%rax
> + lea 128(%rdi), %rdi
> + loop 50b
I'd need convincing that unrolling the loop like that gives any significant
gain.
You have a dependency chain on the carry flag, so there will be delays between
the 'adcq' instructions (these may be more significant than the memory reads
from the L1 cache).
I also don't remember the 'loop' instruction being executed quickly (I might
be wrong).
If 'loop' is fast then you will probably find that:
10:	adcq	0(%rdi),%rax
	lea	8(%rdi),%rdi
	loop	10b
is just as fast since the three instructions could all be executed in parallel.
But I suspect that 'dec %rcx; jnz 10b' is actually better (and might execute as
a single micro-op).
IIRC 'adc' and 'dec' will both have dependencies on the flags register
so cannot execute together (which is a shame here).
It is also possible that breaking the carry-chain dependency by doing 32-bit
adds (possibly after 64-bit reads) could be made faster.
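
A minimal sketch of that last idea, in C rather than asm. The function name
csum_words32 and its interface are made up for illustration, it is not taken
from the patch, and whether it beats the 'adcq' chain would need measuring.
Each 64-bit word is read once and its two 32-bit halves are added into
separate 64-bit accumulators, so no add has to wait for the carry flag from
the previous one. The carries pile up in the top halves of the accumulators
and are folded back in at the end.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 * Sums 'nwords' 64-bit words from 'buf' as 32-bit halves and returns the
 * folded (not complemented) 16-bit ones' complement sum.
 * Assumes nwords < 2^31 so the accumulators cannot overflow.
 */
static uint16_t csum_words32(const void *buf, size_t nwords)
{
	const unsigned char *p = buf;
	uint64_t lo = 0, hi = 0, sum;
	size_t i;

	for (i = 0; i < nwords; i++) {
		uint64_t w;

		memcpy(&w, p + 8 * i, sizeof(w));	/* one 64-bit read */
		lo += w & 0xffffffffu;	/* plain add, no carry-in needed */
		hi += w >> 32;		/* independent of the 'lo' chain */
	}

	sum = lo + hi;
	/* Fold 64 -> 32 bits; the second pass absorbs the carry of the first. */
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	/* Fold 32 -> 16 bits in the same way. */
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The fold works because 2^16 is congruent to 1 modulo 0xffff, so adding the
32-bit halves gives the same ones' complement result as adding the 16-bit
words one at a time.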
David