https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120428
--- Comment #14 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #13)
> >
> >     constexpr std::size_t ProcessChunkSize = BlockSize * OrderSize;
> >
> >     std::array<std::byte, ProcessChunkSize> buffer{};
> >
> >     std::byte* const bytes = reinterpret_cast<std::byte*>(data);
> >
> >     for (std::size_t i = 0; i < TotalSize; i += ProcessChunkSize)
> >     {
> >         std::byte* const values = &bytes[i];
> >
> >         for (std::size_t j = 0; j < OrderSize; j++)
> >         {
> >             auto* const buffer_chunk = &buffer[j * BlockSize];
> >             auto* const value_chunk = &values[order[j] * BlockSize];
> >
> >             std::copy(value_chunk, value_chunk + BlockSize, buffer_chunk);
> >         }
> >
> The inner loop is not completely unrolled since std::copy is lowered to
> __builtin_memmove instead of __builtin_memcpy(with
> -mprefer-vector-width=256, it can be lower to memcpy)
>
Since BlockSize is 16, it's just a 16-byte copy.
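
For reference, a minimal self-contained sketch of the loop under discussion, next to a variant that calls std::memcpy directly. Only BlockSize = 16 comes from this report. OrderSize, the contents of order, the function names and the write-back step are hypothetical, since the quoted snippet is cut off before the end of the outer loop. The memcpy variant is valid here only because buffer is a separate local array, so source and destination cannot overlap. Whether it changes unrolling for a given GCC version and set of flags has not been verified.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

// Only BlockSize = 16 is taken from the report; the rest is assumed.
constexpr std::size_t BlockSize = 16;
constexpr std::size_t OrderSize = 4;                            // hypothetical
constexpr std::size_t ProcessChunkSize = BlockSize * OrderSize;
constexpr std::array<std::size_t, OrderSize> order{2, 0, 3, 1}; // hypothetical

// Variant from the report: std::copy on std::byte pointers.
inline void reorder_with_copy(std::byte* bytes, std::size_t total_size)
{
    assert(total_size % ProcessChunkSize == 0);
    std::array<std::byte, ProcessChunkSize> buffer{};

    for (std::size_t i = 0; i < total_size; i += ProcessChunkSize)
    {
        std::byte* const values = &bytes[i];

        for (std::size_t j = 0; j < OrderSize; j++)
        {
            auto* const buffer_chunk = &buffer[j * BlockSize];
            auto* const value_chunk = &values[order[j] * BlockSize];

            std::copy(value_chunk, value_chunk + BlockSize, buffer_chunk);
        }

        // Assumed write-back; the quoted snippet ends before this point.
        std::copy(buffer.begin(), buffer.end(), values);
    }
}

// Same loop with an explicit fixed-size memcpy for the 16-byte block copy.
inline void reorder_with_memcpy(std::byte* bytes, std::size_t total_size)
{
    assert(total_size % ProcessChunkSize == 0);
    std::array<std::byte, ProcessChunkSize> buffer{};

    for (std::size_t i = 0; i < total_size; i += ProcessChunkSize)
    {
        std::byte* const values = &bytes[i];

        for (std::size_t j = 0; j < OrderSize; j++)
        {
            // buffer is a distinct local object, so the ranges never overlap.
            std::memcpy(&buffer[j * BlockSize],
                        &values[order[j] * BlockSize],
                        BlockSize);
        }

        std::memcpy(values, buffer.data(), ProcessChunkSize);
    }
}
```

Both functions should produce identical output for the same input, so comparing the code generated for the two inner loops (for example with -O2, with and without -mprefer-vector-width=256) is one way to check the memmove vs. memcpy lowering described above.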