xroche wrote: You're right -- I verified this and the vectorizer handles the loop case well.
With `-O2 -march=native` (GFNI + AVX2), the `uint64_t[4]` loop and the manually unrolled version both produce identical vectorized code: `vpxor ymm` + `vgf2p8affineqb` + `vpshufb` + `vpsadbw` + horizontal reduction. The `__uint256_t` version actually produces *worse* code: 4x scalar `popcntq` + `addl`, because the value lives in GPRs, not vector registers.

With AVX-512 VPOPCNTDQ, same story: the loop gets `vpxor ymm` + `vpopcntq ymm` + `vpmovqb` + `vpsadbw` (8 instructions), while `__uint256_t` stays scalar (11 instructions).

The 18% speedup I measured was a red herring -- scalar `popcntq` happened to be faster than the GFNI-based vector popcount path on the specific test CPU, not a real advantage of the type. I'll remove the Hamming distance claim from the PR description. The stronger motivation for `__int256` is arithmetic ergonomics and performance vs `_BitInt(256)` (3x for add/sub/bitwise, 1.5x for division), not SIMD popcount.

https://github.com/llvm/llvm-project/pull/182733

_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
