https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64036

--- Comment #2 from Oleg Endo <olegendo at gcc dot gnu.org> ---
An example function, compiling with -O2 -m4:

int test_0 (unsigned short* x, int y, int z)
{
  return 
     (x[0] + x[1] + x[2] + x[3] + x[4] + x[5] + x[6]
           + x[7] + x[8] + x[9] + x[10]) ? y : z;
}

Without sched1, there are lots of dependencies on the results of memory loads.

With sched1, more code is generally generated and variable live ranges are
longer.  The above function then uses r8 and r9, which is not really
necessary.  Memory load dependencies are reduced and more LS/EX/MT instructions
can be executed in parallel.  Code size for the test function increases from
~37 insns to ~50 insns.  Estimated cycle counts on the SH4 pipeline are ~37
cycles without sched1 and ~33 cycles with sched1.  On SH4A the latency of a
load is 1 cycle, so without sched1 it should be ~28 cycles.

So basically sched1 seems to fill stall cycles with additional (reg-reg move)
instructions, but the end result is almost the same.  The larger code size and
longer live ranges seem to eliminate the benefits.  Using post-inc addressing
together with appropriate scheduling and register allocation (RA), it should
be possible to get ~27 cycles (38 insns) on SH4 for that function.
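
To illustrate what "post-inc addressing" means at the source level, here is a
minimal sketch of the same computation written as a pointer walk.  The name
test_0_postinc is hypothetical and not part of the report.  Each *x++ is the
access pattern that SH's post-increment load form (mov.w @Rm+,Rn) could
serve, but whether GCC actually selects that addressing mode depends on the
compiler version and options, so treat this as an illustration rather than a
workaround.

```c
/* Hypothetical variant of test_0 (name not from the report).
   Same result as the original: sum x[0]..x[10], then pick y if the sum
   is non-zero and z otherwise.  */
int test_0_postinc (unsigned short* x, int y, int z)
{
  int sum = 0;

  /* Each iteration loads one element and advances the pointer, i.e. a
     load followed by a pointer increment.  */
  for (int i = 0; i < 11; ++i)
    sum += *x++;

  return sum ? y : z;
}
```

With all eleven elements zero the function returns z; with any non-zero
element it returns y, just like test_0.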

If we could get that 37 -> 27 cycle drop with sched1, it'd be worth enabling
it.  It looks like addressing mode selection has to be improved first to reduce
pressure on R0.  Even then, to avoid code bloat there should be some way to
prevent sched1 when the resulting code would only fill stall cycles with
reg-reg copy insns, since larger code increases the probability of icache
misses.
