https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202

            Bug ID: 83202
           Summary: Try joining operations on consecutive array elements
                    during tree vectorization
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

void test(double data[4][4])
{
  for (int i = 0; i < 4; i++)
  {
    for (int j = i; j < 4; j+=2)
    {
      data[i][j] = data[i][j] * data[i][j];
      data[i][j+1] = data[i][j+1] * data[i][j+1];
    }
  }
}

gcc creates this:

test(double (*) [4]):
  vmovsd xmm0, QWORD PTR [rdi]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi], xmm0
  vmovsd xmm0, QWORD PTR [rdi+8]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+8], xmm0
  vmovsd xmm0, QWORD PTR [rdi+16]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+16], xmm0
  vmovsd xmm0, QWORD PTR [rdi+24]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+24], xmm0
  vmovsd xmm0, QWORD PTR [rdi+40]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+40], xmm0
  vmovsd xmm0, QWORD PTR [rdi+48]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+48], xmm0
  vmovsd xmm0, QWORD PTR [rdi+56]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+56], xmm0
  vmovsd xmm0, QWORD PTR [rdi+64]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+64], xmm0
  vmovsd xmm0, QWORD PTR [rdi+80]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+80], xmm0
  vmovsd xmm0, QWORD PTR [rdi+88]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+88], xmm0
  vmovsd xmm0, QWORD PTR [rdi+120]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+120], xmm0
  vmovsd xmm0, QWORD PTR [rdi+128]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+128], xmm0
  ret

clang detects that packed operations can be used instead of scalar
ones, and produces the following. Please implement a similar optimization in gcc too.

test(double (*) [4]): # @test(double (*) [4])
  vmovupd xmm0, xmmword ptr [rdi]
  vmovupd xmm1, xmmword ptr [rdi + 16]
  vmovupd xmm2, xmmword ptr [rdi + 40]
  vmovupd xmm3, xmmword ptr [rdi + 56]
  vmulpd xmm0, xmm0, xmm0
  vmovupd xmmword ptr [rdi], xmm0
  vmulpd xmm0, xmm1, xmm1
  vmovupd xmmword ptr [rdi + 16], xmm0
  vmulpd xmm0, xmm2, xmm2
  vmovupd xmmword ptr [rdi + 40], xmm0
  vmulpd xmm0, xmm3, xmm3
  vmovupd xmmword ptr [rdi + 56], xmm0
  vmovupd xmm0, xmmword ptr [rdi + 80]
  vmulpd xmm0, xmm0, xmm0
  vmovupd xmmword ptr [rdi + 80], xmm0
  vmovupd xmm0, xmmword ptr [rdi + 120]
  vmulpd xmm0, xmm0, xmm0
  vmovupd xmmword ptr [rdi + 120], xmm0
  ret
