I'm fairly new to the list and haven't read the earlier discussion about the crc-x86_64 module, so I hope this doesn't come out wrong. I rewrote the CRC CLMUL code[1] in XZ Utils six months ago, and I'm commenting based on that experience.

On 2025-01-16 Paul Eggert wrote:
> That being said, it does appear there are a lot of unaligned word
> accesses nearby, which can't be good for performance even if the
> hardware doesn't trap.

With the code variants I tried last year, the penalty from unaligned reads was very small. The buffer would need to be large before the extra cost of aligning at the beginning would pay off. This was with 128-bit registers. It might be different with wider registers.

On 2025-01-17 Sam Russell wrote:
> Most performant would probably be an alignment check at the start and
> then manually processing the first N bytes. Another option could be
> to simply cast data to unsigned char* and then we can guarantee the
> compiler doesn't hit alignment issues?

Try using __m128i_u* instead of __m128i*. The former has the aligned(1) attribute. It's not available in old GCC versions though.

I kept the input as uint8_t* and cast it for the loads in a wrapper:

    static inline __m128i
    my_load128(const uint8_t *p)
    {
        return _mm_loadu_si128((const __m128i *)p);
    }

I suppose __m128i_u* would be more correct above too, but sanitizers don't complain about it. Perhaps sanitizers treat the intrinsics differently than memcpy.

[1] https://github.com/tukaani-project/xz/blob/master/src/liblzma/check/crc_x86_clmul.h

--
Lasse Collin
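
For comparison, here is a minimal sketch of the load variants discussed above. It is only an illustration, not code from XZ Utils or Gnulib. It assumes an x86-64 target with SSE2 and a compiler whose <emmintrin.h> provides __m128i_u and declares _mm_loadu_si128() with a __m128i_u pointer parameter (reasonably recent GCC and Clang). The names load128_cast, load128_u, load128_memcpy, and loads_agree are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <emmintrin.h> /* SSE2 intrinsics; assumed to provide __m128i_u */

/* Variant from the mail: cast to __m128i* and use the unaligned-load
 * intrinsic. */
static inline __m128i
load128_cast(const uint8_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}

/* Same load, but the pointer type carries aligned(1), so the cast
 * itself does not claim 16-byte alignment. */
static inline __m128i
load128_u(const uint8_t *p)
{
    return _mm_loadu_si128((const __m128i_u *)p);
}

/* Portable fallback: memcpy into a vector. Compilers typically turn
 * this into a single unaligned load when optimizing. */
static inline __m128i
load128_memcpy(const uint8_t *p)
{
    __m128i v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Returns 1 if all three variants read the same 16 bytes from a
 * deliberately misaligned address, otherwise 0. */
static int
loads_agree(void)
{
    uint8_t buf[32];
    for (int i = 0; i < 32; i++)
        buf[i] = (uint8_t)(i * 7 + 1);

    const uint8_t *p = buf + 1; /* odd offset within buf */

    __m128i a = load128_cast(p);
    __m128i b = load128_u(p);
    __m128i c = load128_memcpy(p);

    return memcmp(&a, p, 16) == 0
        && memcmp(&b, p, 16) == 0
        && memcmp(&c, p, 16) == 0;
}
```

All three should produce the same value. The difference is only in what the pointer cast tells the compiler and the sanitizers about alignment.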