Mariam Harutyunyan:
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,45 @@
> +2023-08-03  Mariam Arutunian  <mariamarutun...@gmail.com>
> +
It is common courtesy to include all authors in the list of authors
for the ChangeLog; also, this may help people in the future understand
the history of the code better.  While most of your patch is new, it
still contains non-trivial parts of mine
( https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591744.html ).
And stripping out the comment why, currently, we can't use linkonce
for crc tables on the RISC-V target is not helpful to someone who
wants to understand the code.

See also the discussion to put this into loop distribution:
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591821.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591866.html

Mariam Harutyunyan:
> It adds internal functions and built-ins specifically designed to
> handle CRC computations efficiently.

This sounds as if this were finished work, although defining built-in
functions to calculate the CRC of single data elements, and
recognizers for some C idioms that do these calculations, is just a
starting point.

Alexander Monakov:
> Jeff, as I understand this all is happening only because Coremark
> contains use of bitwise CRC that affects benchmark scores.  In
> another universe where
> - Coremark was careful to checksum outputs outside of timed
>   sections, or
> - implemented CRC in a manner that is not transparent to the
>   compiler, or
> - did not use CRC at all
> we would not be spending effort on this, correct?

It is a stated goal of coremark to test performance for CRC.  They do
not use a library call to implement CRC, but a specific bit-banging
algorithm they have chosen.  That algorithm is, for the vast majority
of processors, not representative of the target's performance
potential in calculating CRCs.  Thus, if a compiler fails to translate
this into the CRC implementation that would be used for performance
code, it frustrates coremark's goal of giving a measure of CRC
calculation performance.

Alexander Monakov:
> At best we might have a discussion on providing a __builtin_clmul
> for carry-less multiplication (which _is_ a fundamental primitive,
> unlike __builtin_crc), and move on.

Some processors have specialized instructions for CRC computations.

Alexander Monakov:
> Instead, efficient CRC loops have the following structure:
> - they carry an unreduced remainder in the loop, performing final
>   reduction modulo polynomial only once after the loop — this halves
>   the CLMUL count;
> - they overlap multiple CLMUL chains to make the loop
>   throughput-bound rather than latency-bound.  The typical unroll
>   factor is about 4x-8x.

If you want to recognize a loop that does a CRC on a block, it makes
sense to start with recognizing the CRC computation for single array
elements first.  We have to learn to walk before we can run.  Note
that my initial patch already contained a match.pd stanza to recognize
two chained single-element CRC calculations.

Jeff Law:
> The intention is to provide a useful builtin_crc while at the same
> time putting one side of the infrastructure we need for automatic
> detection of CRC loops and turning them into table lookups or CLMULs.

Note that when optimizing for size, for a target with tiny memory, or
when using a non-constant (or constant but undiscoverable by the
compiler) polynomial, we can't use the table lookup.  But still, even
if we don't have CLMUL, we can generally speed up CRC computation over
the coremark algorithm by using something more suited to the target,
like the crcu16_1 function I put into comments in my patch.
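To illustrate, a sketch of the two shapes under discussion (neither
coremark's actual source nor the crcu16_1 from my patch; a reflected
CRC-16 with polynomial 0xA001 is assumed):

  #include <stdint.h>

  /* Bit-at-a-time: the shape of loop a recognizer has to match.  */
  uint16_t
  crcu16_bitwise (uint16_t data, uint16_t crc)
  {
    for (int i = 0; i < 16; i++)
      {
        int flip = (crc ^ data) & 1;
        crc >>= 1;
        data >>= 1;
        if (flip)
          crc ^= 0xA001;
      }
    return crc;
  }

  /* Byte-at-a-time via a 256-entry table; crc_table[i] is the
     bitwise CRC of the single byte i, starting from a zero crc.  */
  extern const uint16_t crc_table[256];

  uint16_t
  crcu16_bytewise (uint16_t data, uint16_t crc)
  {
    crc = (crc >> 8) ^ crc_table[(crc ^ data) & 0xff];
    crc = (crc >> 8) ^ crc_table[(crc ^ (data >> 8)) & 0xff];
    return crc;
  }

The bitwise form is what a recognizer has to match; the table-driven
form is one possible replacement where memory allows.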
Alexander Monakov:
> So... just provide a library?  A library code is easier to develop
> and audit, it can be released independently, people can use it with
> their compiler of choice.  Not everything needs to be in libgcc.

A library can be used to implement built-ins in gcc (we still need to
define one for block operations; one step at a time...).  However,
someone or something needs to rewrite the existing code to use the
library.  It is commonly accepted that an efficient way to do this is
to make a compiler do the necessary transformations, as long as it
can be made to churn out good enough code.

Alexander Monakov:
> Useful to whom?  The Linux kernel?  zlib, bzip2, xz-utils?  ffmpeg?
> These consumers need high-performance blockwise CRC, offering them
> a latency-bound elementwise CRC primitive is a disservice.  And what
> should they use as a fallback when __builtin_crc is unavailable?

We can provide a fallback implementation for all targets with table
lookup and/or shifts.

Alexander Monakov:
> I think offering a conventional library for CRC has substantial
> advantages.

Are you volunteering?  It would make our work to emit code for block
CRC easier if we could just use a library call when we recognize a
block CRC (although making that recognizer is likely still
considerable work if we want to get good coverage over different
coding styles).

Although maybe Oleg Endo's library, as mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591748.html ,
might be suitable?  What is the license for that?

Jeff Law:
> I'm not real comfortable with directly supporting sub-word sizes
> for the output operand.  The vast majority of insns on the risc-v
> port have outputs that use either the X iterator (which maps to
> either SI or DI mode for rv32 and rv64 respectively) or they use the
> GPR iterator.  I don't think many use ANYI.
> Which ultimately makes sense since operations actually write X mode
> bits.
> Presumably the goal here is to capture cases where the resultant CRC
> value is a sub-word size.

Yes, the builtin functions naturally have lots of narrow operations,
and thus the operands are narrow.  Note that in my implementation of
crcsihi4 / crchihi4, I use a Pmode paradoxical subreg of the output
operand, so that the move itself can be in Pmode.

I suppose we could make the output mode wider than the data size, but
that can only be for the targets that want it; we can't bend the
target-independent built-in for the convenience of RISC-V.  So the
builtin function expanding machinery would have to know to use the
mode of the pattern instead of the one implied by the type.
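To illustrate (a sketch only, not the actual expander from my patch;
it assumes the output operand is a pseudo register and that word_mode
is Pmode):

  rtx tmp = gen_reg_rtx (word_mode);
  /* ... compute the CRC into tmp in word_mode ...  */
  /* Paradoxical subreg: although operands[0] is HImode, the final
     store is done as a full word_mode move.  */
  emit_move_insn (gen_rtx_SUBREG (word_mode, operands[0], 0), tmp);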
Jeff Law:
> I wonder if all the table based generation support code should move
> into a generic file so that it could be used by other targets.  I
> don't offhand see anything in there that is RISC-V specific.

Well, the avoidance of linkonce is.  OTOH, other targets might not
have weak.  On the grabber hand, making the table static and
duplicating it all over would be sad.

Jeff Law:
> +  sprintf (buf, "crc_table_for_%u_bit_crc_%llx_polynomial",
> +	   crc_bits, polynom);
> So I think we need to be more careful here with namespace pollution.
> Ideally I think we'd want this symbol to be static and const.

I had:

+  sprintf (buf, "__gcc_crc%s%s4_" HOST_WIDE_INT_PRINT_HEX,
+	   mode_name[data_bits / BITS_PER_UNIT],
+	   mode_name[crc_bits / BITS_PER_UNIT], polynom);

Jeff Law:
> Note that if you were to create a DECL node for the table and attach
> a suitable DECL_INITIAL to it with its initialization data I think
> most, if not all of the assembly output complexity would be handled
> for you automatically.

I suppose we could make it a string constant, to take advantage of
string commoning on targets where that is supported.  However, that
would likely get us into trouble on targets with addressable units
other than 8 bits.

Jeff Law:
> +	rtx op1 = gen_rtx_ASHIFTRT (crc_mode, crc,
> +				    GEN_INT (mode_bit_size - 8));
> In general, I'd suggest gen_int_mode rather than GEN_INT.

Well, in this specific case, we have a non-negative number that fits
well into the space provided.  Although the latter might get tight on
a target with 8 bit shift counts, so if we want to make this properly
portable, you have a point.

However, the code generation actually is somewhat target specific
beyond these technicalities.  The widths of arithmetic used, and how
we order it with memory accesses (which might need arithmetic
transformations), depend on the facilities available and latency
considerations to get good code.  Well, we can still have some
generic implementation, but for optimal performance it makes sense to
look at what assembly code you get and consider target-specific code
generation if you can do better.  The other port we used this
optimization for actually has dedicated crc instructions in some
configurations, and is generally memory-tight, so the fallback is
tightly coded library functions.

Jeff Law:
> Hmm, I'm not sure we can depend on the target actually supporting
> ZERO_EXTEND.
> Similarly, don't we have to worry about whether or not the resulting
> address is actually valid?

It wouldn't be on an alpha, since that doesn't have indexed
addressing.  And that has nothing to do with 12 bits; tab is a symbol
(which usually needs lots of bits, and multiple machine instructions
to set up, on RISC-V), and ix is an index.

..
> And here I don't think we can necessarily depend on the target
> supporting shifts by a constant amount.
..
> Similarly here, some targets may not support AND with immediate or
> may not have enough bits to handle all the immediates that can arise
> here.

Yes, it's RISC-V specific code in a RISC-V file.  If we want to make
this more general, we might be better off constructing some trees or
gimple and expanding them.  We might indeed check if the internal
function or a library function is defined, and substitute a table
lookup otherwise (or leave it alone for -Os in that case).

> So did we ever figure out why we're treating the CRC optab as a
> conversion optab?

Input and output are different sizes, at least for some of the
variants.  So you get an ICE if you treat it as something other than
a conversion.
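For reference, the difference (a sketch; the optab and its name here
are placeholders, not necessarily what the patch ends up defining): a
regular optab is indexed by a single mode, while a conversion optab is
indexed by a pair of modes, which is what allows e.g. crcsihi4 to have
an SImode input and a HImode output:

  /* optab_handler (crc_optab, SImode) could only describe an
     SImode -> SImode pattern; a conversion optab is queried with
     both modes.  */
  insn_code icode = convert_optab_handler (crc_optab, HImode, SImode);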