https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120428
--- Comment #13 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
>
>     constexpr std::size_t ProcessChunkSize = BlockSize * OrderSize;
>
>     std::array<std::byte, ProcessChunkSize> buffer{};
>
>     std::byte* const bytes = reinterpret_cast<std::byte*>(data);
>
>     for (std::size_t i = 0; i < TotalSize; i += ProcessChunkSize)
>     {
>         std::byte* const values = &bytes[i];
>
>         for (std::size_t j = 0; j < OrderSize; j++)
>         {
>             auto* const buffer_chunk = &buffer[j * BlockSize];
>             auto* const value_chunk = &values[order[j] * BlockSize];
>
>             std::copy(value_chunk, value_chunk + BlockSize, buffer_chunk);
>         }

The inner loop is not completely unrolled because std::copy is lowered to
__builtin_memmove instead of __builtin_memcpy (with -mprefer-vector-width=256
it can be lowered to memcpy):

size: 22-4, last_iteration: 2-2
  Loop size: 22
  Estimated size after unrolling: 144-38
Not unrolling loop 2: contains call and code would grow.

Without complete unrolling, the redundant copies are not eliminated.
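
For illustration only, here is a minimal sketch of the quoted loop with the copy written as std::memcpy, so that the call reaches the optimizer as __builtin_memcpy rather than __builtin_memmove. This is an assumption about a possible source-level workaround, not a fix proposed in this report, and whether it lets the unroller fire for a given target and set of flags is untested here. The function name reorder_chunks, its signature, and the write-back step are hypothetical; the quoted snippet stops before showing what happens to buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical wrapper around the quoted loop. Only BlockSize, OrderSize,
// ProcessChunkSize, buffer, bytes, values and order come from the report.
template <std::size_t BlockSize, std::size_t OrderSize>
void reorder_chunks(std::byte* bytes, std::size_t total_size,
                    const std::array<std::size_t, OrderSize>& order)
{
    constexpr std::size_t ProcessChunkSize = BlockSize * OrderSize;

    // The loop below assumes the data is a whole number of chunks.
    assert(total_size % ProcessChunkSize == 0);

    std::array<std::byte, ProcessChunkSize> buffer{};

    for (std::size_t i = 0; i < total_size; i += ProcessChunkSize)
    {
        std::byte* const values = &bytes[i];

        for (std::size_t j = 0; j < OrderSize; j++)
        {
            // buffer is a local array and cannot overlap the caller's data,
            // so memcpy (which requires non-overlapping ranges) is valid here.
            std::memcpy(&buffer[j * BlockSize],
                        &values[order[j] * BlockSize],
                        BlockSize);
        }

        // Assumed write-back of the permuted chunk; this step is not part of
        // the quoted snippet.
        std::memcpy(values, buffer.data(), ProcessChunkSize);
    }
}
```

With BlockSize = 2 and order = {1, 0}, each 4-byte chunk has its two 2-byte blocks swapped. Comparing the cunroll dump of this variant against the std::copy version would show whether the "contains call and code would grow" rejection still applies.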