The following testcase exposes an optimization problem with current SVN gcc:
--cut here--
extern const int srcshift;

void good (const int *srcdata, int *dstdata)
{
  int i;

  for (i = 0; i < 256; i++)
    dstdata[i] = srcdata[i] << srcshift;
}

void bad (const int *srcdata, int *dstdata)
{
  int i;

  for (i = 0; i < 256; i++)
    {
      dstdata[i] |= srcdata[i] << srcshift;
    }
}
--cut here--
With -O3 -msse2, the loops in the above testcase get vectorized, but the
generated code differs substantially between the good and bad functions:
good:
...
.L8:
xorl %eax, %eax
movd srcshift, %xmm1
.p2align 4,,7
.p2align 3
.L4:
movdqu (%ebx,%eax), %xmm0
pslld %xmm1, %xmm0
movdqa %xmm0, (%esi,%eax)
addl $16, %eax
cmpl $1024, %eax
jne .L4
...
bad:
...
.L21:
movl %esi, %eax (2)
movl %ebx, %edx
leal 1024(%esi), %ecx
.p2align 4,,7
.p2align 3
.L17:
movdqu (%edx), %xmm0
movd srcshift, %xmm1 (1)
pslld %xmm1, %xmm0
movdqu (%eax), %xmm1 (3)
por %xmm1, %xmm0
movdqa %xmm0, (%eax)
addl $16, %eax (4)
addl $16, %edx
cmpl %ecx, %eax
jne .L17
popl %ebx
popl %esi
popl %ebp
ret
In addition to the memory load inside the loop (1), several other problems can
be identified. There is no need to copy the registers (2), because the loop is
followed by the function exit. For some reason, an additional IV is used (4),
and the same address is accessed with both an unaligned load (3) and an aligned
store.
Expected code for the "bad" case would be something like the "good" case with
additional movdqa+por instructions that load dstdata and OR it into the result:
.L8:
xorl %eax, %eax
movd srcshift, %xmm1
.p2align 4,,7
.p2align 3
.L4:
movdqu (%ebx,%eax), %xmm0
movdqa (%esi,%eax), %xmm2
pslld %xmm1, %xmm0
por %xmm2, %xmm0
movdqa %xmm0, (%esi,%eax)
addl $16, %eax
cmpl $1024, %eax
jne .L4
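For illustration only, the expected loop shape can be sketched in C with SSE2
intrinsics. This is a hand-written approximation, not compiler output. The
name bad_expected and the shift parameter (standing in for the extern srcshift)
are hypothetical, and the sketch assumes dstdata is 16-byte aligned, as the
movdqa store in the listings implies. Note that the OR operand is loaded from
dstdata, as the C source requires.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: `shift` stands in for the extern srcshift.
   Assumes dstdata is 16-byte aligned; srcdata may be unaligned.  */
void bad_expected (const int *srcdata, int *dstdata, int shift)
{
  /* Shift count is loaded once, outside the loop (movd).  */
  const __m128i count = _mm_cvtsi32_si128 (shift);
  int i;

  for (i = 0; i < 256; i += 4)
    {
      /* Unaligned load of four source ints (movdqu).  */
      __m128i src = _mm_loadu_si128 ((const __m128i *) (srcdata + i));
      /* Aligned load of four destination ints (movdqa).  */
      __m128i dst = _mm_load_si128 ((const __m128i *) (dstdata + i));
      /* Shift the source and OR it into the destination (pslld, por).  */
      dst = _mm_or_si128 (dst, _mm_sll_epi32 (src, count));
      /* Aligned store back to the destination (movdqa).  */
      _mm_store_si128 ((__m128i *) (dstdata + i), dst);
    }
}
```

A single index drives both arrays here, and the destination is only ever
accessed with aligned operations.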
The missing IV elimination could be attributed to the tree loop optimizations,
but the others are IMO RTL optimization problems, because we enter RTL
generation with:
good:
<bb 3>:
MEM[base: dstdata, index: ivtmp.60] = M*(vect_p.29 + ivtmp.60){misalignment: 0} << srcshift.1;
bad:
<bb 4>:
MEM[index: ivtmp.127] = M*(vector int *) ivtmp.130{misalignment: 0} << srcshift.3 | M*(vector int *) ivtmp.127{misalignment: 0};
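The difference in addressing between the two dumps can be pictured in scalar C.
This is only a rough analogy: the function names and the shift parameter are
hypothetical, and a compiler is free to turn either form into the other. The
first function uses one index for both arrays, like ivtmp.60 in "good". The
second advances two separate pointers, like ivtmp.130 (source) and ivtmp.127
(destination) in "bad".

```c
/* Shape of "good": one index IV addresses both arrays.  */
void one_iv (const int *src, int *dst, int shift)
{
  int i;

  for (i = 0; i < 256; i++)
    dst[i] |= src[i] << shift;
}

/* Shape of "bad": two pointer IVs plus a precomputed end pointer.  */
void two_ivs (const int *src, int *dst, int shift)
{
  const int *s = src;    /* plays the role of ivtmp.130 */
  int *d = dst;          /* plays the role of ivtmp.127 */
  int *end = dst + 256;  /* corresponds to leal 1024(%esi), %ecx */

  for (; d != end; d++, s++)
    *d |= *s << shift;
}
```

Both functions compute the same result; only the number of values updated per
iteration differs.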
--
Summary: Memory load is not eliminated from tight vectorized loop
Product: gcc
Version: 4.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: ubizjak at gmail dot com
GCC target triplet: i686-*-*, x86_64-*-*
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34011