https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120858
--- Comment #6 from Simon H. <sh1.gccbug at tikouka dot nz> --- Three equivalent functions, FWIW: ``` uint32_t crc_04C11DB7_u64(uint32_t crc, uint64_t w) { return __crc32d(crc, w); } uint32_t crc_04C11DB7_clmul_u64(uint32_t crc, uint64_t x) { constexpr uint64_t k1 = 0xb4e5b025f7011641; // bitrev64(2^128) / 0x104c11db7) constexpr uint64_t k2 = 0x1db710640; // bitrev64(0x04c11db7) * 2 x ^= crc; uint64_t a = clmul(x, k1); uint64_t b = clmul_high(a, k2); return b; } uint64_t crc64_04C11DB700000000_clmul_u64(uint64_t crc, uint64_t x) { constexpr uint64_t k1 = 0xb4e5b025f7011641; // bitrev64(2^128) / 0x104c11db7) constexpr uint64_t k2 = 0xdb71064000000000; // bitrev64(0x04c11db7) * 2 << 32 x ^= crc; uint64_t a = clmul(x, k1); uint64_t b = clmul_high(a, k2) ^ a; return b >> 32; } ``` The last function should support an arbitrary 64-bit polynomial, and it avoids the 65-bit constant by applying an extra exclusive-or after shifting the result down by 64 bits. However, if you're using the 64-bit builtin with a 32-bit polynomial (because you want to checksum 64 bits at once even if the polynomial is only 32-bit) the middle function might be preferable. Of course first function is even more preferable when the polynomial suits. Also, while I was poking about in godbolt I noticed that the materialisation of k1 is unnecessarily complex: (clmul(x, y) & 0xff) == (clmul(x & 0xff, y & 0xff) & 0xff) which means that the _data8 variants need only the low 8 bits of k1, and the _data16 variants need only the low 16 bits of k1, etc.. These can be achieved in fewer instructions than are currently used when the whole 33-bit constant is used.