http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57462
Bug ID: 57462 Summary: ira-costs considers only a single register at a time Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: josh.m.conner at gmail dot com In this code: int PopCnt(unsigned long long a, unsigned long long b) { register int c=0; while(a) { c++; a &= a + b; } return(c); } Built for ARM with: gcc test.c -O2 -S -o test.s The code generated for the loop is: .L3: fmdrr d18, r0, r1 @ int vadd.i64 d16, d18, d17 fmrrd r4, r5, d16 @ int and r0, r0, r4 and r1, r1, r5 orrs r5, r0, r1 add r3, r3, #1 bne .L3 There is quite a bit of gymnastics in order to use the FP registers for the add instruction. The code is simpler if all registers are allocated to integer registers: .L3: adds r2, r4, r6 adc r3, r5, r7 and r4, r4, r2 and r5, r5, r3 orrs r3, r4, r5 add r0, r0, #1 bne .L3 The code is shorter, and doesn't include the potentially-expensive FP<=>INT register move operations. *** The rest of this bug is my analysis, providing an explanation of why I have put this bug into the rtl-optimization category. The problem I see is that the register classifier (ira-costs.c) makes decisions on register classes for each register in relative isolation, without adequately considering the impact of that decision on other registers. In this example, we have 3 main registers we're concerned with: a, b, and a temporary register (ignoring c, which we don't need to consider). The code when costs are calculated is roughly: tmp = a + b a = a & tmp CC = compare (a, 0) Both the adddi3 and anddi3 operations can be performed in either integer or FP regs, with a preference for the FP regs because the sequence is shorter (1 insn instead of 2). The compare operation can only be performed in an integer register. In the first pass of the cost analysis: "a" is assigned to the integer registers, since the cheaper adddi/anddi operations are outweighed by the cost of having to move the value from FP=>INT for the compare. "b" and "tmp" are both assigned to FP registers, since they are only involved in operations that are cheaper with the FP hardware. In the second pass of the cost analysis, each register is again analyzed independently: "a" is left in the integer register because moving it to a FP register would add an additional FP=>INT move for the compare. "b" and "tmp" are both left in FP registers because moving either one would still leave us with mixed FP/INT operations. The biggest problem I see is that the first pass should recognize that since "a" must be in an integer register, there is an unconsidered cost to putting "b" and "tmp" in FP registers since they are involved in instructions where the operands must be in the same register class. A lesser, and probably more difficult, problem is that the second pass could do better if it would consider changing register classes of more than one register at a time. This seems potentially complex, but perhaps we could just consider register pairs that are involved in instructions with mismatched operand classes, where the combination is invalid for the instruction.