------- Comment #20 from rguenth at gcc dot gnu dot org  2007-11-04 11:45 -------
With mainline we now get

        .p2align 4,,7
        .p2align 3
.L6:
        addl    $1, %eax
        cmpl    %eax, %edi
        movl    %eax, -20(%ebp)
        jle     .L3
        movl    %eax, %ecx
        movl    %esi, %edx
        .p2align 4,,7
        .p2align 3
.L5:
        movl    -4(%esi), %ebx
        movl    (%edx), %eax
        cmpl    %eax, %ebx
        jle     .L4
        movl    %eax, -4(%esi)
        movl    %ebx, (%edx)
.L4:
        addl    $1, %ecx
        addl    $4, %edx
        cmpl    %ecx, %edi
        jg      .L5
.L3:
        movl    -20(%ebp), %eax
        addl    $4, %esi
        cmpl    -16(%ebp), %eax
        jl      .L6

which looks good, apart from the issue Andrew pointed out (but that's
PR26726):

  lsti_11 = MEM[index: ivtmp.27_14, offset: 0x0fffffffc];
  MEM[index: ivtmp.27_14, offset: 0x0fffffffc] = lstj_15;

4.0 is still faster with the original testcase, but the only difference
I can spot is that mainline uses addl $1, %eax while 4.0 uses incl here.
Oh, and 4.0 uses an extra induction variable(!) for the exit test (and
less loop alignment):

.L3:
        incl    %eax
        cmpl    %eax, 12(%ebp)
        movl    %eax, -20(%ebp)
        jle     .L4
        movl    12(%ebp), %edi
        movl    %esi, %edx
        xorl    %ebx, %ebx
        subl    %eax, %edi
        .p2align 4,,15
.L6:
        movl    -4(%esi), %ecx
        movl    (%edx), %eax
        cmpl    %eax, %ecx
        jle     .L7
        movl    %eax, -4(%esi)
        movl    %ecx, (%edx)
.L7:
        incl    %ebx
        addl    $4, %edx
        cmpl    %edi, %ebx
        jne     .L6
.L4:
        movl    -20(%ebp), %eax
        addl    $4, %esi
        cmpl    -16(%ebp), %eax
        jl      .L3

Using -mtune=core2 on trunk gets back the incl and makes the code faster
than 4.0 (on my Core CPU, that is).  So the generic tuning here makes the
difference for trunk.  4.2 is still broken, though.

I would say let's close this as fixed.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|4.0.4                       |4.0.4 4.3.0
   Last reconfirmed|2006-02-24 15:20:29         |2007-11-04 11:45:07
               date|                            |
            Summary|[4.1/4.2/4.3 Regression]:   |[4.1/4.2 Regression]: code
                   |code pessimization wrt. GCC |pessimization wrt. GCC 4.0
                   |4.0 probably due to         |probably due to
                   |TARGET_MEM_REF              |TARGET_MEM_REF


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26290
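
For anyone reading the listings without the testcase at hand: judging from the assembly and the lsti/lstj names in the dump, the inner loop appears to be a compare-and-swap of lst[i] against lst[j]. A minimal C sketch of that shape follows. It is an assumption reconstructed from the listings, not the testcase attached to the PR; the function name, the parameter names and the outer loop bound are hypothetical.

```c
/* Hypothetical reconstruction of the loop shape, NOT the PR's testcase.
   lsti and lstj are the names seen in the GIMPLE dump; exchange_sort,
   lst, n and the outer bound (n - 1) are made up for illustration. */
void exchange_sort(int *lst, int n)
{
    for (int i = 0; i < n - 1; i++) {
        for (int j = i + 1; j < n; j++) {
            int lsti = lst[i];      /* corresponds to the -4(%esi) load */
            int lstj = lst[j];      /* corresponds to the (%edx) load   */
            if (lsti > lstj) {      /* the jle branch skips the swap    */
                lst[i] = lstj;
                lst[j] = lsti;
            }
        }
    }
}
```

With a loop of this shape, lst[i] is reloaded on every inner iteration because the store to lst[i] inside the branch can change it, which would match the load at the top of the inner loop in both listings.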