On Sat, Jun 8, 2024 at 3:41 PM Richard Sandiford <richard.sandif...@arm.com> wrote:
> Mariam Arutunian <mariamarutun...@gmail.com> writes:
> > This patch introduces two new expanders for the aarch64 backend,
> > dedicated to generating optimized code for CRC computations.
> > The new expanders are designed to leverage specific hardware capabilities
> > to achieve faster CRC calculations,
> > particularly using the pmul or crc32 instructions when supported by the
> > target architecture.
>
> Thanks for porting this to aarch64!
>
> > Expander 1: Bit-Forward CRC (crc<ALLI:mode><ALLX:mode>4)
> > For targets that support the pmul instruction (TARGET_AES),
> > the expander will generate code that uses the pmul (crypto_pmulldi)
> > instruction for CRC computation.
> >
> > Expander 2: Bit-Reversed CRC (crc_rev<ALLI:mode><ALLX:mode>4)
> > The expander first checks if the target supports the CRC32 instruction set
> > (TARGET_CRC32)
> > and the polynomial in use is 0x1EDC6F41 (iSCSI).  If the conditions are met,
> > it emits calls to the corresponding crc32 instruction (crc32b, crc32h,
> > crc32w, or crc32x depending on the data size).
> > If the target does not support crc32 but supports pmul, it then uses the
> > pmul (crypto_pmulldi) instruction for bit-reversed CRC computation.
> >
> > Otherwise a table-based CRC is generated.
> >
> > gcc/config/aarch64/
> >
> >     * aarch64-protos.h (aarch64_expand_crc_using_clmul): New extern
> >     function declaration.
> >     (aarch64_expand_reversed_crc_using_clmul): Likewise.
> >     * aarch64.cc (aarch64_expand_crc_using_clmul): New function.
> >     (aarch64_expand_reversed_crc_using_clmul): Likewise.
> >     * aarch64.md (UNSPEC_CRC, UNSPEC_CRC_REV): New unspecs.
> >     (crc_rev<ALLI:mode><ALLX:mode>4): New expander for reversed CRC.
> >     (crc<ALLI:mode><ALLX:mode>4): New expander for bit-forward CRC.
> >     * iterators.md (crc_data_type): New mode attribute.
> >
> > gcc/testsuite/gcc.target/aarch64/
> >
> >     * crc-1-pmul.c: New test.
> >     * crc-10-pmul.c: Likewise.
> >     * crc-12-pmul.c: Likewise.
> >     * crc-13-pmul.c: Likewise.
> >     * crc-14-pmul.c: Likewise.
> >     * crc-17-pmul.c: Likewise.
> >     * crc-18-pmul.c: Likewise.
> >     * crc-21-pmul.c: Likewise.
> >     * crc-22-pmul.c: Likewise.
> >     * crc-23-pmul.c: Likewise.
> >     * crc-4-pmul.c: Likewise.
> >     * crc-5-pmul.c: Likewise.
> >     * crc-6-pmul.c: Likewise.
> >     * crc-7-pmul.c: Likewise.
> >     * crc-8-pmul.c: Likewise.
> >     * crc-9-pmul.c: Likewise.
> >     * crc-CCIT-data16-pmul.c: Likewise.
> >     * crc-CCIT-data8-pmul.c: Likewise.
> >     * crc-coremark-16bitdata-pmul.c: Likewise.
> >     * crc-crc32-data16.c: New test.
> >     * crc-crc32-data32.c: Likewise.
> >     * crc-crc32-data8.c: Likewise.
> >
> > Signed-off-by: Mariam Arutunian <mariamarutun...@gmail.com>
> >
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index 1d3f94c813e..167e1140f0d 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -1117,5 +1117,8 @@ extern void mingw_pe_encode_section_info (tree, rtx, int);
> >
> >  bool aarch64_optimize_mode_switching (aarch64_mode_entity);
> >  void aarch64_restore_za (rtx);
> > +void aarch64_expand_crc_using_clmul (rtx *);
> > +void aarch64_expand_reversed_crc_using_clmul (rtx *);
> > +
> >
> >  #endif /* GCC_AARCH64_PROTOS_H */
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index ee12d8897a8..05cd0296d38 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -30265,6 +30265,135 @@ aarch64_retrieve_sysreg (const char *regname, bool write_p, bool is128op)
> >    return sysreg->encoding;
> >  }
> >
> > +/* Generate assembly to calculate CRC
> > +   using carry-less multiplication instruction.
> > +   OPERANDS[1] is input CRC,
> > +   OPERANDS[2] is data (message),
> > +   OPERANDS[3] is the polynomial without the leading 1.  */
> > +
> > +void
> > +aarch64_expand_crc_using_clmul (rtx *operands)
>
> This should probably be pmul rather than clmul.
>
> > +{
> > +  /* Check and keep arguments.  */
> > +  gcc_assert (!CONST_INT_P (operands[0]));
> > +  gcc_assert (CONST_INT_P (operands[3]));
> > +  rtx crc = operands[1];
> > +  rtx data = operands[2];
> > +  rtx polynomial = operands[3];
> > +
> > +  unsigned HOST_WIDE_INT
> > +    crc_size = GET_MODE_BITSIZE (GET_MODE (operands[0])).to_constant ();
> > +  gcc_assert (crc_size <= 32);
> > +  unsigned HOST_WIDE_INT
> > +    data_size = GET_MODE_BITSIZE (GET_MODE (data)).to_constant ();
>
> We could instead make the interface:
>
>   void
>   aarch64_expand_crc_using_pmul (scalar_mode crc_mode, scalar_mode data_mode,
>                                  rtx *operands)
>
> so that the lines above don't need the to_constant.  This should "just
> work" on the .md file side, since the modes being passed are naturally
> scalar_mode.
>
> I think it'd be worth asserting also that data_size <= crc_size.
> (Although we could handle any MAX (data_size, crc_size) <= 32
> with some adjustment.)
>
> > +
> > +  /* Calculate the quotient.  */
> > +  unsigned HOST_WIDE_INT
> > +    q = gf2n_poly_long_div_quotient (UINTVAL (polynomial), crc_size + 1);
> > +
> > +  /* CRC calculation's main part.  */
> > +  if (crc_size > data_size)
> > +    crc = expand_shift (RSHIFT_EXPR, DImode, crc, crc_size - data_size,
> > +                        NULL_RTX, 1);
> > +
> > +  rtx t0 = gen_reg_rtx (DImode);
> > +  aarch64_emit_move (t0, gen_int_mode (q, DImode));
>
> It's only a minor simplification, but this could instead be:
>
>   rtx t0 = force_reg (DImode, gen_int_mode (q, DImode));
>
> > +  rtx t1 = gen_reg_rtx (DImode);
> > +  aarch64_emit_move (t1, polynomial);
>
> If polynomial is a constant operand of mode crc_mode, GCC's standard
> CONST_INT representation is to sign-extend the constant to 64 bits.
> E.g. a QImode value of 0b1000_0000 would be represented as -128.
> I think here we want the zero-extended form, so it might be safer to do:
>
>   polynomial = simplify_gen_unary (ZERO_EXTEND, DImode, polynomial,
>                                    crc_mode);
>   rtx t1 = force_reg (DImode, polynomial);
>
> > +
> > +  rtx a0 = expand_binop (DImode, xor_optab, crc, data, NULL_RTX, 1,
> > +                         OPTAB_WIDEN);
> > +
> > +  rtx clmul_res = gen_reg_rtx (TImode);
> > +  emit_insn (gen_aarch64_crypto_pmulldi (clmul_res, a0, t0));
> > +  a0 = gen_lowpart (DImode, clmul_res);
> > +
> > +  a0 = expand_shift (RSHIFT_EXPR, DImode, a0, crc_size, NULL_RTX, 1);
> > +
> > +  emit_insn (gen_aarch64_crypto_pmulldi (clmul_res, a0, t1));
> > +  a0 = gen_lowpart (DImode, clmul_res);
> > +
> > +  if (crc_size > data_size)
> > +    {
> > +      rtx crc_part = expand_shift (LSHIFT_EXPR, DImode, operands[1], data_size,
> > +                                   NULL_RTX, 0);
> > +      a0 =  expand_binop (DImode, xor_optab, a0, crc_part, NULL_RTX, 1,
> > +                          OPTAB_DIRECT);
>
> Formatting nit: extra space after "a0 = "
>
> > +    }
> > +  /* Zero upper bits beyond crc_size.  */
> > +  rtx num_shift = gen_int_mode (64 - crc_size, DImode);
> > +  a0 = expand_shift (LSHIFT_EXPR, DImode, a0, 64 - crc_size, NULL_RTX, 0);
> > +  a0 = expand_shift (RSHIFT_EXPR, DImode, a0, 64 - crc_size, NULL_RTX, 1);
>
> Rather than shift left and then right, I think we should just AND:
>
>   rtx mask = gen_int_mode (GET_MODE_MASK (crc_mode), DImode);
>   a0 = expand_binop (DImode, and_optab, a0, mask, NULL_RTX, 1,
>                      OPTAB_DIRECT);
>
> That said, it looks like operands[0] has crc_mode.  The register bits
> above crc_size therefore shouldn't matter, since they're undefined on read.
> E.g. even though (reg:SI R) is stored in an X register, only the low 32
> bits are defined; the upper 32 bits can be any value.
>
> So I'd expect we could replace this and...
> > > + > > + rtx tgt = simplify_gen_subreg (DImode, operands[0], > > + GET_MODE (operands[0]), 0); > > + aarch64_emit_move (tgt, a0); > > ...this with just: > > aarch64_emitmove (operands[0], gen_lowpart (crc_mode, a0)); > > Perhaps that would break down if operands[0] is a subreg with > SUBREG_PROMOTED_VAR_P set, but I think it's up to target-independent > code to handle that case. > Thank you once again for your review. I made all the recommended changes except for this and sent the new version. On some tests the CRC function was returning the whole computed value, without zeroing the upper part. I'm also unsure about the indentation. Last time, I thought I had set it correctly, but it turned out to be incorrect. This time, I've adjusted the settings, but I'm still not sure if it's correct. [...snip...] Best regards, Mariam