https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120428

--- Comment #14 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #13)
> > 
> >     constexpr std::size_t ProcessChunkSize = BlockSize * OrderSize;
> > 
> >     std::array<std::byte, ProcessChunkSize> buffer{};
> > 
> >     std::byte* const bytes = reinterpret_cast<std::byte*>(data);
> > 
> >     for (std::size_t i = 0; i < TotalSize; i += ProcessChunkSize)
> >     {
> >         std::byte* const values = &bytes[i];
> > 
> >         for (std::size_t j = 0; j < OrderSize; j++)
> >         {
> >             auto* const buffer_chunk = &buffer[j * BlockSize];
> >             auto* const value_chunk  = &values[order[j] * BlockSize];
> > 
> >             std::copy(value_chunk, value_chunk + BlockSize, buffer_chunk);
> >         }
> 
> The inner loop is not completely unrolled because std::copy is lowered to
> __builtin_memmove instead of __builtin_memcpy (with
> -mprefer-vector-width=256, it can be lowered to memcpy).
> 

Since BlockSize is 16, each std::copy call is just a 16-byte copy.
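
For illustration only, here is a minimal sketch of the two copy forms being
compared. OrderSize, the contents of order, and all function names below are
assumptions made up for the example; only BlockSize == 16 comes from this
thread. It shows that an explicit std::memcpy gives the same result as
std::copy when source and destination do not overlap; it makes no claim about
what GCC generates for either form.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

// Only BlockSize == 16 is taken from the report; the rest is hypothetical.
constexpr std::size_t BlockSize = 16;
constexpr std::size_t OrderSize = 4;
constexpr std::size_t ChunkSize = BlockSize * OrderSize;
constexpr std::array<std::size_t, OrderSize> order{2, 0, 3, 1};

using Chunk = std::array<std::byte, ChunkSize>;

// Same shape as the reported inner loop: std::copy on raw byte pointers.
Chunk gather_with_copy(const Chunk& values)
{
    Chunk buffer{};
    for (std::size_t j = 0; j < OrderSize; j++)
    {
        const std::byte* const value_chunk = &values[order[j] * BlockSize];
        std::copy(value_chunk, value_chunk + BlockSize, &buffer[j * BlockSize]);
    }
    return buffer;
}

// Variant with an explicit 16-byte memcpy. This is valid here because
// buffer and values are distinct objects and therefore never overlap.
Chunk gather_with_memcpy(const Chunk& values)
{
    Chunk buffer{};
    for (std::size_t j = 0; j < OrderSize; j++)
    {
        std::memcpy(&buffer[j * BlockSize], &values[order[j] * BlockSize], BlockSize);
    }
    return buffer;
}

// Fills a chunk with 0, 1, 2, ... and checks that both variants agree.
bool variants_agree()
{
    Chunk values{};
    for (std::size_t i = 0; i < ChunkSize; i++)
    {
        values[i] = static_cast<std::byte>(i);
    }
    const Chunk a = gather_with_copy(values);
    const Chunk b = gather_with_memcpy(values);
    // Output block 0 must start with the first byte of input block order[0].
    assert(a[0] == values[order[0] * BlockSize]);
    return a == b;
}
```

Comparing the output of -fdump-tree-optimized for the two functions should
show which builtin each copy ends up as.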
