RFC -- CSE compile-time stupidity

Jeffrey A Law Mon, 21 Feb 2005 09:22:55 -0800

Sigh.  I can't wait for this code to become less critical, both in terms
of runtime and compile time performance.  There are so many things in 
cse.c that are just plain bad....


cse.c has a fair amount of complexity in its hash table management
code to avoid spending too much time invalidating/removing items
in the hash table when a value of an expression changes.

Specifically we have REG_TICK, REG_IN_TABLE and other fields to track 
when entries are valid.  It's fairly common to have code which looks
something like this (from mention_regs):

      for (i = regno; i < endregno; i++)
        {
          if (REG_IN_TABLE (i) >= 0 && REG_IN_TABLE (i) != REG_TICK (i))
            remove_invalid_refs (i);

          REG_IN_TABLE (i) = REG_TICK (i);
          SUBREG_TICKED (i) = -1;
        }


Looks pretty simple and straightforward.  Let's dig a little deeper into
those accessor macros...

/* Get the number of times this register has been updated in this
   basic block.  */

#define REG_TICK(N) (get_cse_reg_info (N)->reg_tick)

/* Get the point at which REG was recorded in the table.  */

#define REG_IN_TABLE(N) (get_cse_reg_info (N)->reg_in_table)

/* Get the SUBREG set at the last increment to REG_TICK (-1 if not a
   SUBREG).  */

#define SUBREG_TICKED(N) (get_cse_reg_info (N)->subreg_ticked)

Hmmm, it's starting to get interesting -- each macro has a function call
to get_cse_reg_info, which looks like:


static inline struct cse_reg_info *
get_cse_reg_info (unsigned int regno)
{
  struct cse_reg_info *p = &cse_reg_info_table[regno];

  /* If this entry has not been initialized, go ahead and initialize
     it.  */
  if (p->timestamp != cse_reg_info_timestamp)
    get_cse_reg_info_1 (regno);

  return p;
}



OUCH OUCH OUCH.  Yes, this is inlined, but that's of little/no value
with the function call to get_cse_reg_info_1.  Let me show why by
looking at the tree dumps for the conditional above:


  if (REG_IN_TABLE (i) >= 0 && REG_IN_TABLE (i) != REG_TICK (i))
            remove_invalid_refs (i);

  # BLOCK 5
  # PRED: 4 [50.0%]  (true,exec)
<L119>:;
  D.20813_497 = regno_328 * 20;
  ivtmp.200_122 = (struct cse_reg_info *) D.20813_497;
  # SUCC: 6 [100.0%]  (fallthru)
OK.  We're just getting the offset of the entry within the table
that we want.

  # BLOCK 6
  # PRED: 21 [89.0%]  (dfs_back,true,exec) 5 [100.0%]  (fallthru)
  # ivtmp.200_499 = PHI <ivtmp.200_498(21), ivtmp.200_122(5)>;
  # i_33 = PHI <i_370(21), regno_328(5)>;
<L6>:;
  cse_reg_info_table.99_335 = cse_reg_info_table;
  p_336 = cse_reg_info_table.99_335 + ivtmp.200_499;
  D.20520_337 = p_336->timestamp;
  cse_reg_info_timestamp.100_338 = cse_reg_info_timestamp;
  if (D.20520_337 != cse_reg_info_timestamp.100_338) goto <L7>; else
goto <L8>;
  # SUCC: 7 [51.2%]  (true,exec) 8 [48.8%]  (false,exec)
Check and see if the entry's timestamp indicates the entry is
valid/initialized.

  # BLOCK 7
  # PRED: 6 [51.2%]  (true,exec)
<L7>:;
  get_cse_reg_info_1 (i_33);
  # SUCC: 8 [100.0%]  (fallthru,exec)
Initialize this entry (if necessary).

  # BLOCK 8
  # PRED: 6 [48.8%]  (false,exec) 7 [100.0%]  (fallthru,exec)
<L8>:;
  D.20361_341 = p_336->reg_in_table;
  if (D.20361_341 >= 0) goto <L10>; else goto <L18>;
  # SUCC: 9 [79.0%]  (true,exec) 15 [21.0%]  (false,exec)
Load REG_IN_TABLE for this register.  So far so good.


  # BLOCK 9
  # PRED: 8 [79.0%]  (true,exec)
<L10>:;
  cse_reg_info_table.99_383 = cse_reg_info_table;
  p_384 = cse_reg_info_table.99_383 + ivtmp.200_499;
  D.20529_385 = p_384->timestamp;
  cse_reg_info_timestamp.100_386 = cse_reg_info_timestamp;
  if (D.20529_385 != cse_reg_info_timestamp.100_386) goto <L11>; else
goto <L12>;
  # SUCC: 10 [51.2%]  (true,exec) 11 [48.8%]  (false,exec)
Totally useless -- we're verifying that this entry is valid again.
There's no way this entry can be invalid at this point.  So, let's
see.  3 loads, 1 comparison and 1 arithmetic operation we don't need.

  # BLOCK 10
  # PRED: 9 [51.2%]  (true,exec)
<L11>:;
  get_cse_reg_info_1 (i_33);
  # SUCC: 11 [100.0%]  (fallthru,exec)
This can't ever be reached at runtime because there's no way
the entry would need initialization again.

  # BLOCK 11
  # PRED: 9 [48.8%]  (false,exec) 10 [100.0%]  (fallthru,exec)
<L12>:;
  D.20363_389 = p_384->reg_in_table;
  cse_reg_info_table.99_393 = cse_reg_info_table;
  p_394 = cse_reg_info_table.99_393 + ivtmp.200_499;
  D.20538_395 = p_394->timestamp;
  cse_reg_info_timestamp.100_396 = cse_reg_info_timestamp;
  if (D.20538_395 != cse_reg_info_timestamp.100_396) goto <L14>; else
goto <L15>;
  # SUCC: 12 [51.2%]  (true,exec) 13 [48.8%]  (false,exec)
Totally useless.  We don't need to load the REG_IN_TABLE field
again -- we already loaded it in block #8 above.  But the calls
(which can never be reached at runtime) must be considered as
potentially modifying REG_IN_TABLE.  Ugh.  Worse yet we then proceed
to verify this entry is valid AGAIN.  4 loads, 1 comparison and 1
arithmetic operation that we don't need.

  # BLOCK 12
  # PRED: 11 [51.2%]  (true,exec)
<L14>:;
  get_cse_reg_info_1 (i_33);
  # SUCC: 13 [100.0%]  (fallthru,exec)
ANother call that can't be reached at runtime.  We certainly don't
need to initialize the entry 3 times :(

  # BLOCK 13
  # PRED: 11 [48.8%]  (false,exec) 12 [100.0%]  (fallthru,exec)
<L15>:;
  D.20365_399 = p_394->reg_tick;
  if (D.20363_389 != D.20365_399) goto <L17>; else goto <L18>;
  # SUCC: 14 [51.2%]  (true,exec) 15 [48.8%]  (false,exec)
Load REG_TICK and compare.  This is necessary.

  # BLOCK 14
  # PRED: 13 [51.2%]  (true,exec)
<L17>:;
  remove_invalid_refs (i_33);
Remove the entry if all the tests passed.

Basically we've got 7 loads, 2 arithmetic instructions, and 2
compare/branch instructions which we execute, but which are totally
useless.  Plus we've got two calls which can never be reached
at runtime.

Basically each invocation of an accessor  macro results in a
test for validity and conditional initialization -- which can't be
removed.  OUCH OUCH OUCH.


I realize that we are trying to write clean, easy to read code by using
a level of abstraction, but the abstraction is really getting in the
way of achieving reasonable compile-time performance.

Here's what I get by explicitly calling get_cse_reg_info once and
just doing dereferences out of the structure:

  # BLOCK 5
  # PRED: 4 [50.0%]  (true,exec)
<L80>:;
  D.20623_259 = regno_191 * 20;
  ivtmp.200_80 = (struct cse_reg_info *) D.20623_259;
  # SUCC: 6 [100.0%]  (fallthru)
Same as the previous code -- get the offset of the entry
within the table that we want.

  # BLOCK 6
  # PRED: 11 [89.0%]  (dfs_back,true,exec) 5 [100.0%]  (fallthru)
  # ivtmp.200_261 = PHI <ivtmp.200_260(11), ivtmp.200_80(5)>;
  # i_19 = PHI <i_207(11), regno_191(5)>;
<L6>:;
  cse_reg_info_table.99_198 = cse_reg_info_table;
  reg_info_199 = cse_reg_info_table.99_198 + ivtmp.200_261;
  D.20486_200 = reg_info_199->timestamp;
  cse_reg_info_timestamp.100_201 = cse_reg_info_timestamp;
  if (D.20486_200 != cse_reg_info_timestamp.100_201) goto <L7>; else
goto <L8>;
  # SUCC: 7 [51.2%]  (true,exec) 8 [48.8%]  (false,exec)
Test the entry for validity.  Same as previous code.

  # BLOCK 7
  # PRED: 6 [51.2%]  (true,exec)
<L7>:;
  get_cse_reg_info_1 (i_19);
  # SUCC: 8 [100.0%]  (fallthru,exec)
Conditionally initialize the entry.  Same as previous code.


  # BLOCK 8
  # PRED: 6 [48.8%]  (false,exec) 7 [100.0%]  (fallthru,exec)
<L8>:;
  D.20363_205 = reg_info_199->reg_in_table;
  if (D.20363_205 >= 0) goto <L10>; else goto <L12>;
  # SUCC: 9 [79.0%]  (true,exec) 11 [21.0%]  (false,exec)
Whee, load the REG_IN_TABLE bit and test it.

  # BLOCK 9
  # PRED: 8 [79.0%]  (true,exec)
<L10>:;
  D.20364_209 = reg_info_199->reg_tick;
  if (D.20363_205 != D.20364_209) goto <L11>; else goto <L12>;
  # SUCC: 10 [51.2%]  (true,exec) 11 [48.8%]  (false,exec)
Load the REG_TICK and compare it to the REG_IN_TABLE.

  # BLOCK 10
  # PRED: 9 [51.2%]  (true,exec)
<L11>:;
  remove_invalid_refs (i_19);
  # SUCC: 11 [100.0%]  (fallthru,exec)
Invalidate entries if necessary.


As you can see the second is *FAR* more efficient and compact.  In fact,
it's nearly optimal.

Fixing cse.c to not use the accessor macros for REG_IN_TABLE, REG_TICK
and SUBREG_TICKED saves about 1% compilation time for the components
of cc1.  Yes, that's a net 1% improvement by dropping the abstraction
layer.

Comments?

jeff

RFC -- CSE compile-time stupidity

Reply via email to