https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202
Bug ID: 83202
Summary: Try joining operations on consecutive array elements during tree vectorization
Product: gcc
Version: 7.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: bugzi...@poradnik-webmastera.com
Target Milestone: ---

void test(double data[4][4])
{
    for (int i = 0; i < 4; i++)
    {
        for (int j = i; j < 4; j+=2)
        {
            data[i][j] = data[i][j] * data[i][j];
            data[i][j+1] = data[i][j+1] * data[i][j+1];
        }
    }
}

gcc creates this:

test(double (*) [4]):
        vmovsd  xmm0, QWORD PTR [rdi]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+8]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+8], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+16]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+16], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+24]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+24], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+40]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+40], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+48]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+48], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+56]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+56], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+64]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+64], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+80]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+80], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+88]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+88], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+120]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+120], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+128]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+128], xmm0
        ret

clang detects that it is possible to use packed operations instead of scalar ones, and produces the code below. Please implement a similar optimization in gcc too.

test(double (*) [4]):                   # @test(double (*) [4])
        vmovupd xmm0, xmmword ptr [rdi]
        vmovupd xmm1, xmmword ptr [rdi + 16]
        vmovupd xmm2, xmmword ptr [rdi + 40]
        vmovupd xmm3, xmmword ptr [rdi + 56]
        vmulpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [rdi], xmm0
        vmulpd  xmm0, xmm1, xmm1
        vmovupd xmmword ptr [rdi + 16], xmm0
        vmulpd  xmm0, xmm2, xmm2
        vmovupd xmmword ptr [rdi + 40], xmm0
        vmulpd  xmm0, xmm3, xmm3
        vmovupd xmmword ptr [rdi + 56], xmm0
        vmovupd xmm0, xmmword ptr [rdi + 80]
        vmulpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [rdi + 80], xmm0
        vmovupd xmm0, xmmword ptr [rdi + 120]
        vmulpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [rdi + 120], xmm0
        ret
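
For illustration, the packed form corresponds roughly to the source-level rewrite sketched below, where each pair of adjacent elements is squared with one two-lane multiply. This is only a sketch of the requested transformation, not a proposed GCC patch. The names v2df, square_pair, test_packed and test_scalar are made up for this example, and vector_size is a GCC/Clang extension. The array is treated as a flat buffer of at least 17 doubles, an assumption made because the index pattern of the test case reaches byte offset 128 (element 16), as the [rdi+128] accesses in the gcc listing show.

```c
#include <assert.h>
#include <string.h>

/* Two packed doubles; vector_size is a GCC/Clang extension. */
typedef double v2df __attribute__((vector_size(16)));

/* Square the adjacent doubles p[0] and p[1] with one packed multiply.
   memcpy expresses an unaligned load/store without aliasing issues. */
static void square_pair(double *p)
{
    v2df v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);   /* unaligned 16-byte load  */
    v = v * v;                 /* one two-lane multiply   */
    memcpy(p, &v, sizeof v);   /* unaligned 16-byte store */
}

/* Hypothetical hand-vectorized counterpart of test(), indexing the
   data as flat[4*i + j]. The buffer must hold at least 17 doubles. */
void test_packed(double *flat)
{
    for (int i = 0; i < 4; i++)
        for (int j = i; j < 4; j += 2)
            square_pair(&flat[4 * i + j]);
}

/* Scalar reference with the same access pattern, for comparison. */
void test_scalar(double *flat)
{
    for (int i = 0; i < 4; i++)
        for (int j = i; j < 4; j += 2) {
            flat[4 * i + j]     = flat[4 * i + j]     * flat[4 * i + j];
            flat[4 * i + j + 1] = flat[4 * i + j + 1] * flat[4 * i + j + 1];
        }
}
```

Running both functions on identical 17-element buffers should leave the same contents. Six pairs are squared, which matches the six vmulpd instructions in the clang listing.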