We discussed this previously and decided that, since AVX1 supports unaligned accesses, we could skip the alignment check at the start of the function. But as you've discovered, this memcpy issue creates undefined behaviour.
Most performant would probably be an alignment check at the start, followed by manually processing the first N bytes. Another option could be to simply cast data to unsigned char *, so that we can guarantee the compiler doesn't hit alignment issues. What are people's preferences here?

On Fri, 17 Jan 2025 at 08:11, Paul Eggert <egg...@cs.ucla.edu> wrote:
> On 2025-01-16 21:25, Jeffrey Walton wrote:
> > On Fri, Jan 17, 2025 at 12:07 AM Bruno Haible via Gnulib discussion
> > list <bug-gnulib@gnu.org> wrote:
> >> Yes, the undefined behaviour really starts here, in line 35:
> >>
> >>     const __m128i *data = buf;
> >>
> >> 'buf' was not aligned, 'const __m128i *' is 16-byte aligned.
> >
> > Disassemble the code around that line. See which asm instruction is
> > being used for the load. I suspect MOVDQA (aligned) is being used
> > instead of MOVDQU (unaligned).
>
> The compiler is entitled to do that. Bruno's right, the behavior is
> undefined once the code assigns the unaligned pointer to an __m128i *
> variable; see C23 §6.3.2.3 ¶7. Since behavior is undefined, the compiler
> can do whatever it likes.
>
> I installed the attached patch to work around the immediate issue of the
> undefined behavior. This skips the pclmul speedup if the buffer is not
> properly aligned. If that is a significant performance issue in
> Gnulib-using code, I hope Sam or somebody can come up with a
> higher-performance fix.
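
To make the first option (alignment check, then handle the leading bytes by hand) concrete, here is a rough, portable sketch. It is not the Gnulib code: head_bytes, sum_bytes, sum_blocks and sum_split are hypothetical names, and a plain byte sum stands in for the real checksum so that the split can be checked against a byte-wise reference. In the real function the block loop would be the pclmul kernel and the head and tail would go through the existing slow path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: how many leading bytes must be consumed before
   P reaches a 16-byte boundary (capped at LEN).  */
static size_t
head_bytes (const void *p, size_t len)
{
  size_t misalign = (uintptr_t) p % 16;
  size_t head = misalign ? 16 - misalign : 0;
  return head < len ? head : len;
}

/* Stand-in for the slow byte-wise path: a plain sum, not a CRC.  */
static uint32_t
sum_bytes (uint32_t acc, const unsigned char *p, size_t len)
{
  for (size_t i = 0; i < len; i++)
    acc += p[i];
  return acc;
}

/* Stand-in for the fast path.  LEN must be a multiple of 16.  Each
   block is copied into a local with memcpy, which makes no alignment
   assumption about P.  */
static uint32_t
sum_blocks (uint32_t acc, const unsigned char *p, size_t len)
{
  for (size_t i = 0; i < len; i += 16)
    {
      unsigned char block[16];
      memcpy (block, p + i, sizeof block);
      acc = sum_bytes (acc, block, sizeof block);
    }
  return acc;
}

/* Split BUF into an unaligned head, a run of 16-byte blocks starting
   on a 16-byte boundary, and a short tail.  */
static uint32_t
sum_split (const void *buf, size_t len)
{
  const unsigned char *p = buf;
  size_t head = head_bytes (p, len);
  uint32_t acc = sum_bytes (0, p, head);
  p += head;
  len -= head;

  size_t body = len - len % 16;
  acc = sum_blocks (acc, p, body);
  return sum_bytes (acc, p + body, len % 16);
}
```

The pointer stays an unsigned char * throughout, so nothing in the sketch ever forms a misaligned pointer to a wider type. Whether the extra head/tail handling is cheaper than simply loading every block without alignment assumptions would need measuring.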