http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57462

            Bug ID: 57462
           Summary: ira-costs considers only a single register at a time
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: josh.m.conner at gmail dot com

In this code:

  int PopCnt(unsigned long long a, unsigned long long b)
  {
    register int c=0;

    while(a) {
      c++;
      a &= a + b;
    }
    return(c);
  }

Built for ARM with:

  gcc test.c -O2 -S -o test.s

The code generated for the loop is:

  .L3:
      fmdrr   d18, r0, r1     @ int
      vadd.i64        d16, d18, d17
      fmrrd   r4, r5, d16     @ int
      and     r0, r0, r4
      and     r1, r1, r5
      orrs    r5, r0, r1
      add     r3, r3, #1
      bne     .L3

There is quite a bit of gymnastics in order to use the FP registers for the add
instruction.  The code is simpler if all registers are allocated to integer
registers:

  .L3:
      adds    r2, r4, r6
      adc     r3, r5, r7
      and     r4, r4, r2
      and     r5, r5, r3
      orrs    r3, r4, r5
      add     r0, r0, #1
      bne     .L3

The code is shorter, and doesn't include the potentially-expensive FP<=>INT
register move operations.

*** The rest of this bug is my analysis, providing an explanation of why I have
put this bug into the rtl-optimization category.

The problem I see is that the register classifier (ira-costs.c) makes decisions
on register classes for each register in relative isolation, without adequately
considering the impact of that decision on other registers.  In this example,
we have 3 main registers we're concerned with: a, b, and a temporary register
(ignoring c, which we don't need to consider).  The code when costs are
calculated is roughly:

  tmp = a + b
  a = a & tmp
  CC = compare (a, 0)

Both the adddi3 and anddi3 operations can be performed in either integer or FP
regs, with a preference for the FP regs because the sequence is shorter (1 insn
instead of 2).

The compare operation can only be performed in an integer register.

In the first pass of the cost analysis:
"a" is assigned to the integer registers, since the cheaper adddi/anddi
operations are outweighed by the cost of having to move the value from FP=>INT
for the compare.
"b" and "tmp" are both assigned to FP registers, since they are only involved
in operations that are cheaper with the FP hardware.

In the second pass of the cost analysis, each register is again analyzed
independently:
"a" is left in the integer register because moving it to a FP register would
add an additional FP=>INT move for the compare.
"b" and "tmp" are both left in FP registers because moving either one would
still leave us with mixed FP/INT operations.

The biggest problem I see is that the first pass should recognize that since
"a" must be in an integer register, there is an unconsidered cost to putting
"b" and "tmp" in FP registers since they are involved in instructions where the
operands must be in the same register class.

A lesser, and probably more difficult, problem is that the second pass could do
better if it would consider changing register classes of more than one register
at a time.  This seems potentially complex, but perhaps we could just consider
register pairs that are involved in instructions with mismatched operand
classes, where the combination is invalid for the instruction.

Reply via email to