The following testcase exposes an optimization problem with current SVN gcc:

--cut here--
extern const int srcshift;

void good (const int *srcdata, int *dstdata)
{
  int i;

  for (i = 0; i < 256; i++)
    dstdata[i] = srcdata[i] << srcshift;
}


void bad (const int *srcdata, int *dstdata)
{
  int i;

  for (i = 0; i < 256; i++)
    {
      dstdata[i] |= srcdata[i] << srcshift;
    }
}
--cut here--

Using -O3 -msse2, the loops in the above testcase get vectorized, and the
produced code differs substantially between the good and bad functions:

good:
        ...
.L8:
        xorl    %eax, %eax
        movd    srcshift, %xmm1
        .p2align 4,,7
        .p2align 3
.L4:
        movdqu  (%ebx,%eax), %xmm0
        pslld   %xmm1, %xmm0
        movdqa  %xmm0, (%esi,%eax)
        addl    $16, %eax
        cmpl    $1024, %eax
        jne     .L4
        ...

bad:
        ...
.L21:
        movl    %esi, %eax        (2)
        movl    %ebx, %edx
        leal    1024(%esi), %ecx
        .p2align 4,,7
        .p2align 3
.L17:
        movdqu  (%edx), %xmm0
        movd    srcshift, %xmm1   (1)
        pslld   %xmm1, %xmm0
        movdqu  (%eax), %xmm1     (3)
        por     %xmm1, %xmm0
        movdqa  %xmm0, (%eax)
        addl    $16, %eax         (4)
        addl    $16, %edx
        cmpl    %ecx, %eax
        jne     .L17
        popl    %ebx
        popl    %esi
        popl    %ebp
        ret

In addition to the memory load inside the loop (1), several other problems can
be identified: there is no need to copy registers at (2), because the loop is
immediately followed by the function exit; for some reason an additional
induction variable is used (4); and the same address is accessed with both an
unaligned access (3) and an aligned access.

The expected code for the "bad" case would be something like the "good" case,
with an additional dstdata load and a por instruction:

.L8:
        xorl    %eax, %eax
        movd    srcshift, %xmm1
        .p2align 4,,7
        .p2align 3
.L4:
        movdqu  (%ebx,%eax), %xmm0
        movdqa  (%esi,%eax), %xmm2
        pslld   %xmm1, %xmm0
        por     %xmm2, %xmm0
        movdqa  %xmm0, (%esi,%eax)
        addl    $16, %eax
        cmpl    $1024, %eax
        jne     .L4
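
For reference, here is an SSE2 intrinsics sketch of this expected shape (my
illustration only, not compiler output; bad_expected is a hypothetical name,
and it assumes dstdata is 16-byte aligned, which the vectorized loop
guarantees by peeling or versioning):

#include <emmintrin.h>

extern const int srcshift;

/* Hypothetical sketch of the expected code shape for bad(): the shift
   count is loaded once before the loop, a single index addresses both
   arrays, and dstdata is read with an aligned load.  Assumes dstdata
   is 16-byte aligned.  */
void bad_expected (const int *srcdata, int *dstdata)
{
  __m128i shift = _mm_cvtsi32_si128 (srcshift);  /* movd srcshift, %xmm1 */
  int i;

  for (i = 0; i < 256; i += 4)
    {
      __m128i src = _mm_loadu_si128 ((const __m128i *) (srcdata + i)); /* movdqu */
      __m128i dst = _mm_load_si128 ((const __m128i *) (dstdata + i));  /* movdqa */
      dst = _mm_or_si128 (dst, _mm_sll_epi32 (src, shift));            /* pslld + por */
      _mm_store_si128 ((__m128i *) (dstdata + i), dst);                /* movdqa */
    }
}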

The missing IV elimination could be attributed to the tree loop optimizations,
but the others are IMO RTL optimization problems, because we enter RTL
generation with:

good:
<bb 3>:
  MEM[base: dstdata, index: ivtmp.60] = M*(vect_p.29 + ivtmp.60){misalignment: 0} << srcshift.1;

bad:
<bb 4>:
  MEM[index: ivtmp.127] = M*(vector int *) ivtmp.130{misalignment: 0} << srcshift.3 | M*(vector int *) ivtmp.127{misalignment: 0};
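
In C terms, the GIMPLE for bad already has roughly the shape below (my
paraphrase with a hypothetical name, assuming the defining load of srcshift.3
sits in the loop preheader, where invariant motion would normally place it),
so the per-iteration movd must be introduced during or after RTL generation:

extern const int srcshift;

/* Rough scalar paraphrase of the vectorized GIMPLE for bad().
   vec_shift stands for srcshift.3: read once, invariant in the loop.  */
void bad_gimple_shape (const int *srcdata, int *dstdata)
{
  const int vec_shift = srcshift;  /* load hoisted out of the loop */
  int i;

  for (i = 0; i < 256; i++)
    dstdata[i] = (srcdata[i] << vec_shift) | dstdata[i];
}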


-- 
           Summary: Memory load is not eliminated from tight vectorized loop
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: ubizjak at gmail dot com
GCC target triplet: i686-*-*, x86_64-*-*


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34011
