https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111894

            Bug ID: 111894
           Summary: Missed vectorization opportunity
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: Mark_B53 at yahoo dot com
  Target Milestone: ---

Consider the following code, which uses views to implement a two-dimensional
`iota`:

    #include <array>
    #include <concepts>
    #include <cstddef>
    #include <ranges>
    #include <tuple>

    template<std::integral T>
    std::ranges::range auto iota2D(T xbound, T ybound) {
        auto fn = [=](T idx) { return std::tuple{idx / ybound, idx % ybound}; };
        return std::views::iota(T{0}, xbound * ybound) | std::views::transform(fn);
    }

    constexpr std::size_t N = 20;
    std::array<std::array<int, N>, N> data;

    __attribute__((noinline)) void init1() {
        for (auto i : std::views::iota(std::size_t{}, N)) {
            for (auto j : std::views::iota(std::size_t{}, N)) {
                data[i][j] = 123;
            }
        }
    }

    __attribute__((noinline)) void init2() {
        for (auto [i, j] : iota2D(N,N)) {
            data[i][j] = 123;
        }
    }

Using gcc 13.2 with -O3, we see that the code using a nested loop is nicely
vectorized:

    init1():
            movdqa  xmm0, XMMWORD PTR .LC0[rip]
            mov     eax, OFFSET FLAT:data
    .L2:
            movaps  XMMWORD PTR [rax], xmm0
            add     rax, 80
            movaps  XMMWORD PTR [rax-64], xmm0
            movaps  XMMWORD PTR [rax-48], xmm0
            movaps  XMMWORD PTR [rax-32], xmm0
            movaps  XMMWORD PTR [rax-16], xmm0
            cmp     rax, OFFSET FLAT:data+1600
            jne     .L2
            ret

The code using iota2D is not vectorized:

    init2():
            xor     eax, eax
    .L6:
            mov     DWORD PTR data[0+rax*4], 123
            add     rax, 1
            cmp     rax, 400
            jne     .L6
            ret
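
The scalar loop above already stores to consecutive addresses
(`data[0+rax*4]` for `rax` from 0 to 399), so the division and modulo from
`iota2D` appear to have been folded back into a single flat index. The
following is a minimal sketch of that index identity, not part of the original
test case: the names `grid`, `index_pairs_are_contiguous` and `init_flat` are
hypothetical, and `N` is assumed to be the same bound as in the report.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 20;  // assumed: same bound as in the report

// Hypothetical stand-in for `data` from the report.
std::array<std::array<int, N>, N> grid;

// Returns true if every (idx / N, idx % N) pair, as produced by
// iota2D(N, N), maps back to the flat row-major offset idx.
bool index_pairs_are_contiguous() {
    for (std::size_t idx = 0; idx < N * N; ++idx) {
        const std::size_t i = idx / N;
        const std::size_t j = idx % N;
        if (i * N + j != idx) {
            return false;
        }
    }
    return true;
}

// Hand-written equivalent of init2: one loop over the flat index,
// split into row and column with the same div/mod as iota2D.
void init_flat() {
    for (std::size_t idx = 0; idx < N * N; ++idx) {
        grid[idx / N][idx % N] = 123;
    }
}
```

Because the mapping is the identity on the flat offset, the loop in `init2`
touches the same 400 contiguous `int`s as `init1`, which suggests it should in
principle be a candidate for the same vector stores.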

Although GCC 13 produces much higher-quality assembly for init2 than previous
versions, it still fails to vectorize the loop, even though the stores go to
consecutive addresses.
