On Sun, 24 Sep 2023, Joern Rennecke wrote:
> It is a stated goal of coremark to test performance for CRC.

I would expect a good CRC benchmark to print CRC throughput in bytes per
cycle or megabytes per second. I don't see where Coremark states that
goal. In the readme at https://github.com/eembc/coremark/blob/main/README.md
it enumerates the three subcategories (linked list, matrix ops, state
machine) and indicates that CRC is used for validation. If it claims that
elsewhere, the way its code employs CRC still does not correspond to
real-world use patterns, such as protocol and filesystem checksumming in
the Linux kernel, or in decompression libraries.

> They do not use a library call to implement CRC, but a specific
> bit-banging algorithm they have chosen. That algorithm is, for the
> vast majority of processors, not representative of the target's
> performance potential in calculating CRCs,

It is, however, representative of the target CPU's ability to run those
basic bitwise ops with good overlap with the rest of the computation,
which is far more relevant for the real-world performance of the CPU.

> thus if a compiler fails to translate this into the CRC implementation
> that would be used for performance code, the compiler frustrates this
> goal of coremark to give a measure of CRC calculation performance.

Are you seriously saying that if a customer chooses CPU A over CPU B
based on Coremark scores, and then discovers that actual performance in,
say, zlib (which uses slice-by-N for CRC) is better on CPU B, that's
entirely fair and the benchmark scores they saw were not misleading?

> > At best we might have
> > a discussion on providing a __builtin_clmul for carry-less multiplication
> > (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
>
> Some processors have specialized instructions for CRC computations.

Only for one or two fixed polynomials. For that matter, some processors
have instructions for AES and SHA, but that doesn't change that clmul is
a more fundamental and flexible primitive than "CRC".

> If you want to recognize a loop that does a CRC on a block, it makes
> sense to start with recognizing the CRC computation for single array
> elements first. We have to learn to walk before we can run.

If only the "walk before you run" logic had been applied in favor of
implementing a portable clmul builtin prior to all this.

> A library can be used to implement built-ins in gcc (we still need to
> define one for block operations, one step at a time...). However,
> someone or something needs to rewrite the existing code to use the
> library. It is commonly accepted that an efficient way to do this is
> to make a compiler do the necessary transformations, as long as it can
> be made to churn out good enough code.

How does this apply to the real world? Among the CRC implementations in
the Linux kernel, ffmpeg, zlib, bzip2, xz-utils, and zstd, I'm aware of
only a single instance where bitwise CRC is used: the table
initialization function in xz-utils. The compiler would transform that
into copying one table into another. Is that a valuable transform?

> Alexander Monakov:
> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
>
> We can provide a fallback implementation for all targets with table
> lookup and/or shifts.

How would that help when they are compiled with LLVM, or with a GCC
version earlier than 14?
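
To make the contrast concrete, here is a minimal illustrative sketch
(my own, not taken from Coremark, zlib, or xz-utils) of the
bit-at-a-time CRC-32 update being discussed, next to the byte-wise
table-lookup form that a portable fallback would typically take; the
function names and the choice of the reflected polynomial 0xEDB88320
are just for the example:

#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time update, reflected CRC-32 (polynomial 0xEDB88320):
   the "bit-banging" style discussed above, one shift/xor per bit.  */
static uint32_t
crc32_bitwise (uint32_t crc, const unsigned char *p, size_t n)
{
  crc = ~crc;
  for (size_t i = 0; i < n; i++)
    {
      crc ^= p[i];
      for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
  return ~crc;
}

/* Byte-wise table lookup ("slice-by-1"): the shape of a portable
   fallback; slice-by-N consumes N bytes per iteration from N tables.  */
static uint32_t crc32_table[256];

static void
crc32_init (void)
{
  for (uint32_t i = 0; i < 256; i++)
    {
      uint32_t c = i;
      for (int k = 0; k < 8; k++)
        c = (c >> 1) ^ ((c & 1) ? 0xEDB88320u : 0);
      crc32_table[i] = c;
    }
}

static uint32_t
crc32_lookup (uint32_t crc, const unsigned char *p, size_t n)
{
  crc = ~crc;
  for (size_t i = 0; i < n; i++)
    crc = (crc >> 8) ^ crc32_table[(crc ^ p[i]) & 0xff];
  return ~crc;
}

(Note that crc32_init is itself the bitwise form applied to constant
inputs, which is exactly the xz-utils pattern mentioned above:
recognizing it would only turn one table-building loop into a table
copy.)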
Alexander