Richard Biener <richard.guent...@gmail.com> writes:
>> Am 16.05.2025 um 19:37 schrieb Richard Sandiford <richard.sandif...@arm.com>:
>> 
>> genemit has traditionally used open-coded gen_rtx_FOO sequences
>> to build up the instruction pattern.  This is now the source of
>> quite a bit of bloat in the binary, and also a source of slow
>> compile times.
>> 
>> Two obvious ways of trying to deal with this are:
>> 
>> (1) Try to identify rtxes that have a similar form and use shared
>>    routines to generate rtxes of that form.
>> 
>> (2) Use a static table to encode the rtx and call a common routine
>>    to expand it.
>> 
>> I did briefly look at (1).  However, it's more complex than (2),
>> and I think suffers from being the worst of both worlds, for reasons
>> that I'll explain below.  This patch therefore does (2).
>> 
>> In theory, one of the advantages of open-coding the calls to
>> gen_rtx_FOO is that the rtx can be populated using stores of known
>> constants (for the rtx code, mode, unspec number, etc).  However,
>> the time spent constructing an rtx is likely to be dominated by
>> the call to rtx_alloc, rather than by the stores to the fields.
>> 
>> Option (1) above loses this advantage of storing constants.
>> The shared routines would parameterise an rtx according to things
>> like the modes on the rtx and its suboperands, so the code would
>> need to fetch the parameters.  In a sense, the rtx structure would
>> be open-coded but the parameters would be table-encoded (albeit
>> in a simple way).
>> 
>> The expansion code also shouldn't be particularly hot.  Anything that
>> treats expand/discard cycles as very cheap would be misconceived,
>> since each discarded expansion generates garbage memory that needs
>> to be cleaned up later.
>> 
>> Option (2) turns out to be pretty simple -- certainly simpler
>> than (1) -- and seems to give a reasonable saving.  Some numbers,
>> all for --enable-checking=yes,rtl,extra:
>> 
>> [A] size of the @progbits sections in insn-emit-*.o, new / old
>> [B] size of the load segments in cc1, new / old
>> [C] time to compile a typical insn-emit*.cc, new / old
>> 
>> Target                 [A]      [B]      [C]
>> --------------------------------------------
>> native aarch64      0.5627   0.9585   0.5677
>> native x86_64       0.5925   0.9467   0.6377
>> aarch64-x-riscv64   0.5555   0.9066   0.2762
>
> Nice.  So how large are the tables, i.e. what’s the effect on .rodata of cc1?
>
> One nice thing about the old way is that you can set breakpoints on the gen_*
> routines.  Can the tables be annotated with comments so it’s easy to look up
> the part for a particular .md entry, and is there a place to break on,
> conditional on some table index, to simulate the old way?

Yeah, the number of gen_* routines is unchanged, and it's still possible
to set breakpoints on them.

I did wonder about removing the out-of-line gen_* functions where possible
and adding the encoding to the recog_data array instead.  But it would
still be necessary to define the gen_* routines as at least macros or
inline functions, since the target code can expect gen_* routines to exist
for any non-* named pattern, even optab ones.  Doing that did sound like it
would hurt debuggability and might be counterproductive in size terms too.

A typical function looks like:

-----------------------------------------------------------------------------
/* .../gcc/config/aarch64/aarch64.md:1162 */
rtx
gen_aarch64_tbnehidi (rtx operand0, rtx operand1, rtx operand2)
{
  rtx operands[3] ATTRIBUTE_UNUSED = { operand0, operand1, operand2 };
  static const uint8_t expand_encoding[] = {
     0x17, 0x00, 0x02, 0x1f, 0x2f, 0x39, 0x00, 0x5c,
     0x00, 0x81, 0x06, 0x11, 0x01, 0x00, 0x27, 0x01,
     0x01, 0x01, 0x27, 0x00, 0x37, 0x00, 0x01, 0x02,
     0x2f, 0x05, 0x02, 0x42
  };
  return expand_rtx (expand_encoding, operands);
}
-----------------------------------------------------------------------------

or, for embedded C++ code:

-----------------------------------------------------------------------------
/* .../gcc/config/aarch64/aarch64.md:3117 */
rtx
gen_addsi3_carryinC (rtx operand0, rtx operand1, rtx operand2)
{
  rtx operands[7] ATTRIBUTE_UNUSED = { operand0, operand1, operand2 };
  start_sequence ();
  {
#define FAIL return (end_sequence (), nullptr)
#define DONE return end_sequence ()
#line 3134 "/home/ricsan01/gnu/src/gcc/gcc/config/aarch64/aarch64.md"
{
  operands[3] = gen_rtx_REG (CC_ADCmode, CC_REGNUM);
  rtx ccin = gen_rtx_REG (CC_Cmode, CC_REGNUM);
  operands[4] = gen_rtx_LTU (DImode, ccin, const0_rtx);
  operands[5] = gen_rtx_LTU (SImode, ccin, const0_rtx);
  operands[6] = immed_wide_int_const (wi::shwi (1, DImode)
                                      << GET_MODE_BITSIZE (SImode),
                                      TImode);
}
#undef DONE
#undef FAIL
  }
  static const uint8_t expand_encoding[] = {
     0x01, 0x17, 0x00, 0x02, 0x1f, 0x01, 0x03, 0x3a,
     0x0b, 0x3b, 0x11, 0x3b, 0x11, 0x01, 0x04, 0x6f,
     0x11, 0x01, 0x01, 0x6f, 0x11, 0x01, 0x02, 0x01,
     0x06, 0x1f, 0x01, 0x00, 0x3b, 0x10, 0x3b, 0x10,
     0x01, 0x05, 0x01, 0x01, 0x01, 0x02
  };
  return complete_seq (expand_encoding, operands);
}
-----------------------------------------------------------------------------

(Adding a #line to switch back to the real source file is still a TODO.
It was previously hampered by earlier standards putting a low limit
on the maximum #line number that implementations need to support.)
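
As a rough illustration of the table-plus-common-routine scheme used by the
functions above, here is a toy sketch.  Everything in it is an assumption
made up for the example: the Node type, the OP_* byte values and the
expand_table routine are hypothetical and are not the encoding or the
expand_rtx implementation from the patch.  It only shows the general shape:
a static byte array describes the tree, and one shared decoder walks it,
substituting operands by index.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical toy node type; GCC's real rtx is far richer.
struct Node {
  std::string name;
  std::vector<std::shared_ptr<Node>> kids;
};
using NodePtr = std::shared_ptr<Node>;

// Hypothetical opcodes for this sketch only; not genemit's encoding.
enum : uint8_t { OP_OPERAND = 0x01, OP_SET = 0x02, OP_PLUS = 0x03 };

static NodePtr leaf(const std::string &name) {
  auto n = std::make_shared<Node>();
  n->name = name;
  return n;
}

// Decode one node starting at pos, advancing pos past the bytes consumed.
static NodePtr expand_node(const uint8_t *&pos, const NodePtr *operands) {
  uint8_t op = *pos++;
  if (op == OP_OPERAND)
    return operands[*pos++];  // the next byte is the operand index
  assert(op == OP_SET || op == OP_PLUS);
  auto node = std::make_shared<Node>();
  node->name = (op == OP_SET) ? "set" : "plus";
  // Both toy codes take exactly two sub-expressions, decoded in order.
  node->kids.push_back(expand_node(pos, operands));
  node->kids.push_back(expand_node(pos, operands));
  return node;
}

// The single shared entry point that every generated function would call.
static NodePtr expand_table(const uint8_t *encoding, const NodePtr *operands) {
  const uint8_t *pos = encoding;
  return expand_node(pos, operands);
}

static std::string to_string(const NodePtr &n) {
  if (n->kids.empty())
    return n->name;
  std::string s = "(" + n->name;
  for (const auto &k : n->kids)
    s += " " + to_string(k);
  return s + ")";
}

// Builds (set r0 (plus r1 r2)) from a static table, in the same spirit as
// a generated gen_* function, and returns its printed form.
static std::string demo() {
  static const uint8_t encoding[] = {
    OP_SET, OP_OPERAND, 0x00, OP_PLUS, OP_OPERAND, 0x01, OP_OPERAND, 0x02
  };
  NodePtr operands[3] = { leaf("r0"), leaf("r1"), leaf("r2") };
  return to_string(expand_table(encoding, operands));
}
```

In this toy version each generated function contributes only a few bytes of
read-only data plus one call, which is the kind of trade-off the size
numbers above are measuring.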

Having shared routines for generating the rtxes might allow new types of
conditional breakpoint, e.g. if you're not sure where exactly an rtx is
coming from.  At the moment it's only possible to break on rtx_alloc
conditional on the code, but not (say) to break on a particular mode/code
combination.
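
To make the breakpoint idea concrete, here is a minimal sketch of a single
shared construction routine that sees both the code and the mode.  The
names (ToyCode, ToyMode, make_toy_rtx, watch) are hypothetical and exist
only for this example; they are not GCC interfaces.  With a routine shaped
like this, a gdb condition such as
"break make_toy_rtx if code == TOY_LTU && mode == TOY_SI" would stop only
on the combination of interest; the counter below simulates that in code.

```cpp
#include <cstdint>

// Hypothetical toy codes and modes, standing in for rtx codes and
// machine modes.
enum ToyCode : uint8_t { TOY_REG, TOY_PLUS, TOY_LTU };
enum ToyMode : uint8_t { TOY_SI, TOY_DI };

struct ToyRtx {
  ToyCode code;
  ToyMode mode;
};

// The code/mode pair we are interested in, plus a hit counter.
struct Watch {
  ToyCode code;
  ToyMode mode;
  int hits;
};
static Watch watch = { TOY_LTU, TOY_SI, 0 };

// Single choke point: every toy rtx is built here, so both the code and
// the mode are available as arguments to condition a breakpoint on.
static ToyRtx make_toy_rtx(ToyCode code, ToyMode mode) {
  if (code == watch.code && mode == watch.mode)
    ++watch.hits;  // a conditional breakpoint would fire here
  return ToyRtx{ code, mode };
}

// Builds three toy rtxes and reports how often the watched pair occurred.
static int demo_hits() {
  watch.hits = 0;
  make_toy_rtx(TOY_LTU, TOY_DI);   // right code, wrong mode
  make_toy_rtx(TOY_LTU, TOY_SI);   // matches the watch
  make_toy_rtx(TOY_PLUS, TOY_SI);  // right mode, wrong code
  return watch.hits;
}
```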

Richard
