https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534
--- Comment #35 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 9 May 2019, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534
>
> --- Comment #34 from bin cheng <amker at gcc dot gnu.org> ---
> So we could have three different addressing modes here.
> 1. What we have now:
>         leaq    0(,%rbp,8), %rax
>         movsd   8(%rbx,%rax), %xmm0
>         addsd   (%rbx,%rbp,8), %xmm0
>         addq    $8, %rbp
>         addsd   16(%rbx,%rax), %xmm0
>         addsd   24(%rbx,%rax), %xmm0
>         addsd   %xmm0, %xmm1
>         movsd   32(%rbx,%rax), %xmm0
>         addsd   40(%rbx,%rax), %xmm0
>         addsd   48(%rbx,%rax), %xmm0
>         addsd   56(%rbx,%rax), %xmm0
>         addsd   %xmm0, %xmm2
>         cmpq    %rsi, %rbp
> 2. GCC-4.7:
>         fldl    (%esi,%ebx,8)
>         lea     0x8(%ebx),%eax
>         faddl   0x8(%esi,%ebx,8)
>         cmp     %eax,%edi
>         faddl   0x10(%esi,%ebx,8)
>         faddl   0x18(%esi,%ebx,8)
>         faddp   %st,%st(2)
>         fldl    0x20(%esi,%ebx,8)
>         faddl   0x28(%esi,%ebx,8)
>         faddl   0x30(%esi,%ebx,8)
>         faddl   0x38(%esi,%ebx,8)
>         faddp   %st,%st(1)
> 3. With slsr change:
>         leaq    0(%rbp,%rbx,8), %rax
>         addq    $8, %rbx
>         movsd   (%rax), %xmm0
>         addsd   8(%rax), %xmm0
>         addsd   16(%rax), %xmm0
>         addsd   24(%rax), %xmm0
>         addsd   %xmm0, %xmm1
>         movsd   32(%rax), %xmm0
>         addsd   40(%rax), %xmm0
>         addsd   48(%rax), %xmm0
>         addsd   56(%rax), %xmm0
>         addsd   %xmm0, %xmm2
>         cmpq    %rsi, %rbx
>
> This was reported that 2. is better than 1. Also Jeff recommended 3.
>
> What I don't understand are:
> A) why 2. is better than 1.? It seems to have more computations in address.
> B) Is 3. the best one? It has the simplest addressing mode, but does require
> one additional lea because of strength reduction.

I think that depends on the micro-architecture. On most x86
implementations complex addressing modes need an additional uop.
Case 3 is certainly "simple" and also smaller to encode so I'd indeed
say this one is best. Case 2 is definitely a complex addressing
mode which should be avoided unless it's not used very much and saves
a register.

I'd say if you can do it, 3 is the better choice if you look at more
than one memory reference. If you do a transform that only looks
at single memory references 2 might seem to be best.
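
For illustration only, the difference between the indexed forms (cases 1 and 2) and the strength-reduced form (case 3) can be sketched at the source level roughly as below. This is a hypothetical reduction loop, not the testcase from the PR, and the function names are made up. The compiler is free to pick either addressing mode for either function, so this only shows the shape of the address computation, not what GCC will emit.

```c
#include <stddef.h>

/* Hypothetical example, not the PR's testcase.
   Indexed form: every load names base + index*8 + displacement,
   which corresponds to the scaled-index addressing in cases 1 and 2.  */
double sum_indexed(const double *a, size_t n)
{
    double s1 = 0.0, s2 = 0.0;
    for (size_t i = 0; i + 8 <= n; i += 8) {
        s1 += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        s2 += a[i + 4] + a[i + 5] + a[i + 6] + a[i + 7];
    }
    return s1 + s2;
}

/* Strength-reduced form: the address a + i is computed once per
   iteration (the extra lea in case 3), and all eight loads then
   use the simple base + displacement mode.  */
double sum_strength_reduced(const double *a, size_t n)
{
    double s1 = 0.0, s2 = 0.0;
    for (size_t i = 0; i + 8 <= n; i += 8) {
        const double *p = a + i;   /* shared base for this iteration */
        s1 += p[0] + p[1] + p[2] + p[3];
        s2 += p[4] + p[5] + p[6] + p[7];
    }
    return s1 + s2;
}
```

The second form pays one address computation per iteration and amortizes it over eight memory references, which is why looking at more than one reference at a time favors it.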