https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120428

--- Comment #13 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---

> 
>     constexpr std::size_t ProcessChunkSize = BlockSize * OrderSize;
> 
>     std::array<std::byte, ProcessChunkSize> buffer{};
> 
>     std::byte* const bytes = reinterpret_cast<std::byte*>(data);
> 
>     for (std::size_t i = 0; i < TotalSize; i += ProcessChunkSize)
>     {
>         std::byte* const values = &bytes[i];
> 
>         for (std::size_t j = 0; j < OrderSize; j++)
>         {
>             auto* const buffer_chunk = &buffer[j * BlockSize];
>             auto* const value_chunk  = &values[order[j] * BlockSize];
> 
>             std::copy(value_chunk, value_chunk + BlockSize, buffer_chunk);
>         }

The inner loop is not completely unrolled because std::copy is lowered to
__builtin_memmove instead of __builtin_memcpy (with -mprefer-vector-width=256
it can be lowered to memcpy).

size: 22-4, last_iteration: 2-2
  Loop size: 22
  Estimated size after unrolling: 144-38
Not unrolling loop 2: contains call and code would grow.

Without complete unrolling, the redundant copies are not eliminated.
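
To illustrate the memmove/memcpy distinction, here is a minimal sketch of
the inner loop with the copy written as std::memcpy directly. This is
legal only because the local buffer cannot overlap the source data. The
sizes, the order array, and the helper name gather_blocks are hypothetical
and not taken from the testcase, and whether this form is actually
unrolled completely has not been checked here.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Hypothetical sizes and permutation, chosen only for illustration.
constexpr std::size_t BlockSize = 4;
constexpr std::size_t OrderSize = 4;
constexpr std::array<std::size_t, OrderSize> order{3, 2, 1, 0};

// Gathers OrderSize blocks from `values` into `buffer` in permuted order.
// std::memcpy requires non-overlapping ranges; that holds here because
// `buffer` is a separate object from the memory `values` points to.
inline void gather_blocks(const std::byte* values,
                          std::array<std::byte, BlockSize * OrderSize>& buffer)
{
    for (std::size_t j = 0; j < OrderSize; j++)
    {
        std::memcpy(&buffer[j * BlockSize],
                    &values[order[j] * BlockSize],
                    BlockSize);
    }
}
```

With a constant size, no possible overlap, and a constant trip count, the
call is spelled as memcpy in the source rather than relying on how
std::copy is lowered.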
