Mariam Harutyunyan:
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,45 @@
> +2023-08-03  Mariam Arutunian  <mariamarutun...@gmail.com>
> +
It is common courtesy to include all authors in the list of authors
for the ChangeLog; also, this may help people in the future understand
the history of the code better.  While most of your patch is new, it
still contains non-trivial parts of mine
( https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591744.html ).
And stripping out the comment why, currently, we can't use linkonce
for crc tables on the RISC-V target is not helpful to someone who
wants to understand the code.

See also the discussion to put this into loop distribution:
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591821.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591866.html

Mariam Harutyunyan:
> It adds internal functions and built-ins specifically designed to
> handle CRC computations efficiently.

This sounds as if this were finished work, although defining built-in
functions to calculate the CRC of single data elements, and
recognizers for some C idioms that do these calculations, is just a
starting point.

Alexander Monakov:
> Jeff, as I understand this all is happening only because Coremark
> contains use of bitwise CRC that affects benchmark scores.  In
> another universe where
> - Coremark was careful to checksum outputs outside of timed
>   sections, or
> - implemented CRC in a manner that is not transparent to the
>   compiler, or
> - did not use CRC at all
> we would not be spending effort on this, correct?

It is a stated goal of coremark to test performance for CRC.  They do
not use a library call to implement CRC, but a specific bit-banging
algorithm they have chosen.  That algorithm is, for the vast majority
of processors, not representative of the target's performance
potential in calculating CRCs.  Thus, if a compiler fails to translate
this into the CRC implementation that would be used for performance
code, it frustrates coremark's goal of giving a measure of CRC
calculation performance.

Alexander Monakov:
> At best we might have a discussion on providing a __builtin_clmul
> for carry-less multiplication (which _is_ a fundamental primitive,
> unlike __builtin_crc), and move on.

Some processors have specialized instructions for CRC computations.

Alexander Monakov:
> Instead, efficient CRC loops have the following structure:
> - they carry an unreduced remainder in the loop, performing final
>   reduction modulo polynomial only once after the loop — this halves
>   the CLMUL count;
> - they overlap multiple CLMUL chains to make the loop
>   throughput-bound rather than latency-bound.  The typical unroll
>   factor is about 4x-8x.

If you want to recognize a loop that does a CRC on a block, it makes
sense to start with recognizing the CRC computation for single array
elements first.  We have to learn to walk before we can run.  Note
that my initial patch already contained a match.pd stanza to recognize
two chained single-element CRC calculations.

Jeff Law:
> The intention is to provide a useful builtin_crc while at the same
> time putting one side of the infrastructure we need for automatic
> detection of CRC loops and turning them into table lookups or CLMULs.

Note that when optimizing for size, for a target with tiny memory, or
when using a non-constant (or constant but undiscoverable by the
compiler) polynomial, we can't use the table lookup.  But still, even
if we don't have CLMUL, we can generally speed up CRC computation over
the coremark algorithm by using something more suited to the target,
like the crcu16_1 function I put into comments in my patch.
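To illustrate, a sketch of the two shapes under discussion (neither
coremark's actual source nor the crcu16_1 from my patch; a reflected
CRC-16 with polynomial 0xA001 is assumed):

  #include <stdint.h>

  /* Bit-at-a-time: the shape of loop a recognizer has to match.  */
  uint16_t
  crcu16_bitwise (uint16_t data, uint16_t crc)
  {
    for (int i = 0; i < 16; i++)
      {
        int flip = (crc ^ data) & 1;
        crc >>= 1;
        data >>= 1;
        if (flip)
          crc ^= 0xA001;
      }
    return crc;
  }

  /* Byte-at-a-time via a 256-entry table; crc_table[i] is the
     bitwise CRC of the single byte i, starting from a zero crc.  */
  extern const uint16_t crc_table[256];

  uint16_t
  crcu16_bytewise (uint16_t data, uint16_t crc)
  {
    crc = (crc >> 8) ^ crc_table[(crc ^ data) & 0xff];
    crc = (crc >> 8) ^ crc_table[(crc ^ (data >> 8)) & 0xff];
    return crc;
  }

The bitwise form is what a recognizer has to match; the table-driven
form is one possible replacement where memory allows.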
Alexander Monakov:
> So... just provide a library?  A library code is easier to develop
> and audit, it can be released independently, people can use it with
> their compiler of choice.  Not everything needs to be in libgcc.

A library can be used to implement built-ins in gcc (we still need to
define one for block operations; one step at a time...).  However,
someone or something needs to rewrite the existing code to use the
library.  It is commonly accepted that an efficient way to do this is
to make a compiler do the necessary transformations, as long as it
can be made to churn out good enough code.

Alexander Monakov:
> Useful to whom?  The Linux kernel?  zlib, bzip2, xz-utils?  ffmpeg?
> These consumers need high-performance blockwise CRC, offering them
> a latency-bound elementwise CRC primitive is a disservice.  And what
> should they use as a fallback when __builtin_crc is unavailable?

We can provide a fallback implementation for all targets with table
lookup and/or shifts.

Alexander Monakov:
> I think offering a conventional library for CRC has substantial
> advantages.

Are you volunteering?  It would make our work to emit code for block
CRC easier if we could just use a library call when we recognize a
block CRC (although making that recognizer is likely still
considerable work if we want to get good coverage over different
coding styles).

Although maybe Oleg Endo's library, as mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591748.html ,
might be suitable?  What is the license for that?

Jeff Law:
> I'm not real comfortable with directly supporting sub-word sizes
> for the output operand.  The vast majority of insns on the risc-v
> port have outputs that use either the X iterator (which maps to
> either SI or DI mode for rv32 and rv64 respectively) or they use the
> GPR iterator.  I don't think many use ANYI.
> Which ultimately makes sense since operations actually write X mode
> bits.
> Presumably the goal here is to capture cases where the resultant CRC
> value is a sub-word size.

Yes, the builtin functions naturally have lots of narrow operations,
and thus the operands are narrow.  Note that in my implementation of
crcsihi4 / crchihi4, I use a Pmode paradoxical subreg of the output
operand, so that the move itself can be in Pmode.

I suppose we could make the output mode wider than the data size, but
that can only be for the targets that want it; we can't bend the
target-independent built-in for the convenience of RISC-V.  So the
builtin function expanding machinery would have to know to use the
mode of the pattern instead of the one implied by the type.
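To illustrate (a sketch only, not the actual expander from my patch;
it assumes the output operand is a pseudo register and that word_mode
is Pmode):

  rtx tmp = gen_reg_rtx (word_mode);
  /* ... compute the CRC into tmp in word_mode ...  */
  /* Paradoxical subreg: although operands[0] is HImode, the final
     store is done as a full word_mode move.  */
  emit_move_insn (gen_rtx_SUBREG (word_mode, operands[0], 0), tmp);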
Jeff Law:
> I wonder if all the table based generation support code should move
> into a generic file so that it could be used by other targets.  I
> don't offhand see anything in there that is RISC-V specific.

Well, the avoidance of linkonce is.  OTOH, other targets might not
have weak.  On the grabber hand, making the table static and
duplicating it all over would be sad.

Jeff Law:
> +  sprintf (buf, "crc_table_for_%u_bit_crc_%llx_polynomial",
> +	   crc_bits, polynom);
> So I think we need to be more careful here with namespace pollution.
> Ideally I think we'd want this symbol to be static and const.

I had:

+  sprintf (buf, "__gcc_crc%s%s4_" HOST_WIDE_INT_PRINT_HEX,
+	   mode_name[data_bits / BITS_PER_UNIT],
+	   mode_name[crc_bits / BITS_PER_UNIT], polynom);

Jeff Law:
> Note that if you were to create a DECL node for the table and attach
> a suitable DECL_INITIAL to it with its initialization data I think
> most, if not all of the assembly output complexity would be handled
> for you automatically.

I suppose we could make it a string constant, to take advantage of
string commoning on targets where that is supported.  However, that
would likely get us into trouble on targets with addressable units
other than 8 bits.

Jeff Law:
> +	rtx op1 = gen_rtx_ASHIFTRT (crc_mode, crc,
> +				    GEN_INT (mode_bit_size - 8));
> In general, I'd suggest gen_int_mode rather than GEN_INT.

Well, in this specific case, we have a non-negative number that fits
well into the space provided.  Although the latter might get tight on
a target with 8 bit shift counts, so if we want to make this properly
portable, you have a point.

However, the code generation actually is somewhat target specific
beyond these technicalities.  The widths of arithmetic used, and how
we order it with memory accesses (which might need arithmetic
transformations), depend on the facilities available and latency
considerations to get good code.  Well, we can still have some
generic implementation, but for optimal performance it makes sense to
look at what assembly code you get and consider target-specific code
generation if you can do better.  The other port we used this
optimization for actually has dedicated crc instructions in some
configurations, and is generally memory-tight, so the fallback is
tightly coded library functions.

Jeff Law:
> Hmm, I'm not sure we can depend on the target actually supporting
> ZERO_EXTEND.
> Similarly, don't we have to worry about whether or not the resulting
> address is actually valid?

It wouldn't be on an alpha, since that doesn't have indexed
addressing.  And that has nothing to do with 12 bits; tab is a symbol
(which usually needs lots of bits, and multiple machine instructions
to set up, on RISC-V), and ix is an index.

..
> And here I don't think we can necessarily depend on the target
> supporting shifts by a constant amount.
..
> Similarly here, some targets may not support AND with immediate or
> may not have enough bits to handle all the immediates that can arise
> here.

Yes, it's RISC-V specific code in a RISC-V file.  If we want to make
this more general, we might be better off constructing some trees or
gimple and expanding them.  We might indeed check if the internal
function or a library function is defined, and substitute a table
lookup otherwise (or leave it alone for -Os in that case).

> So did we ever figure out why we're treating the CRC optab as a
> conversion optab?

Input and output are different sizes, at least for some of the
variants.  So you get an ICE if you treat it as something other than
a conversion.
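For reference, the difference (a sketch; the optab and its name here
are placeholders, not necessarily what the patch ends up defining): a
regular optab is indexed by a single mode, while a conversion optab is
indexed by a pair of modes, which is what allows e.g. crcsihi4 to have
an SImode input and a HImode output:

  /* optab_handler (crc_optab, SImode) could only describe an
     SImode -> SImode pattern; a conversion optab is queried with
     both modes.  */
  insn_code icode = convert_optab_handler (crc_optab, HImode, SImode);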