https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120858

--- Comment #6 from Simon H. <sh1.gccbug at tikouka dot nz> ---
Three equivalent functions, FWIW:

```
uint32_t crc_04C11DB7_u64(uint32_t crc, uint64_t w) {
    return __crc32d(crc, w);
}
uint32_t crc_04C11DB7_clmul_u64(uint32_t crc, uint64_t x) {
    constexpr uint64_t k1 = 0xb4e5b025f7011641;   // bitrev64(2^128) /
0x104c11db7)
    constexpr uint64_t k2 = 0x1db710640;          // bitrev64(0x04c11db7) * 2

    x ^= crc;
    uint64_t a = clmul(x, k1);
    uint64_t b = clmul_high(a, k2);
    return b;
}
uint64_t crc64_04C11DB700000000_clmul_u64(uint64_t crc, uint64_t x) {
    constexpr uint64_t k1 = 0xb4e5b025f7011641;   // bitrev64(2^128) /
0x104c11db7)
    constexpr uint64_t k2 = 0xdb71064000000000;   // bitrev64(0x04c11db7) * 2
<< 32

    x ^= crc;
    uint64_t a = clmul(x, k1);
    uint64_t b = clmul_high(a, k2) ^ a;
    return b >> 32;
}
```

The last function should support an arbitrary 64-bit polynomial, and it avoids
the 65-bit constant by applying an extra exclusive-or after shifting the result
down by 64 bits.  However, if you're using the 64-bit builtin with a 32-bit
polynomial (because you want to checksum 64 bits at once even if the polynomial
is only 32-bit) the middle function might be preferable.  Of course first
function is even more preferable when the polynomial suits.

Also, while I was poking about in godbolt I noticed that the materialisation of
k1 is unnecessarily complex:

(clmul(x, y) & 0xff) == (clmul(x & 0xff, y & 0xff) & 0xff)

which means that the _data8 variants need only the low 8 bits of k1, and the
_data16 variants need only the low 16 bits of k1, etc..  These can be achieved
in fewer instructions than are currently used when the whole 33-bit constant is
used.

Reply via email to