Wei Mi <w...@google.com> wrote:
>Thanks Richard. Yes, without that restriction, the number of
>partitions in the partition map will be increased somewhat. But I
>think it may not increase a lot for 2 reasons. 1. usually the coalesce
>list is not a very big list and only the vars in that list will be
>added to the conflict graph. It already reduces conflict graph bitmaps
>a lot. 2. an ssa var may appear in multiple phi stmts. Suppose in
>phi-stmt1 it has a different basename from the other phi arg, while in
>phi-stmt2 it has the same basename as the other phi arg. In such a
>case, the ssa var will be added to the conflict graph anyway because
>of phi-stmt2, but it will not be added to the coalesce list for
>phi-stmt1 with the restriction. So the restriction blocks the coalesce
>opportunity in phi-stmt1 without reducing memory.
>
>I hacked the out-of-ssa phase and added vars with different names in
>the same phi into the coalesce list. I tried spec2000 int and saw no
>significant memory increase for the expand phase (I used your
>-fmem-report patch to dump the memory usage of each pass. It is
>useful. I am wondering why it didn't go into the trunk).

You should probably dig through the history to find out for which bug
this restriction was added. Coalescing different variables will also
degrade debug information. Note that artificial variables should
nowadays be anonymous ssa names.

Richard.

>Thanks,
>Wei Mi.
>
>
>
>On Fri, Aug 23, 2013 at 5:10 AM, Richard Biener
><richard.guent...@gmail.com> wrote:
>> Wei Mi <w...@google.com> wrote:
>>>For the following case:
>>>
>>>float total = 0.2;
>>>
>>>int main() {
>>>  int i;
>>>
>>>  for (i = 0; i < 1000000000; i++) {
>>>    total += i;
>>>  }
>>>
>>>  return total == 0.3;
>>>}
>>>
>>>The gcc assembly of its kernel loop is:
>>>
>>>.L3:
>>>        movaps  %xmm0, %xmm1
>>>.L2:
>>>        cvtsi2ss        %eax, %xmm0
>>>        addl    $1, %eax
>>>        cmpl    $1000000000, %eax
>>>        addss   %xmm1, %xmm0
>>>        jne     .L3
>>>
>>>The movaps is redundant, the loop could be changed to:
>>>
>>>.L3:
>>>        cvtsi2ss        %eax, %xmm1
>>>        addl    $1, %eax
>>>        cmpl    $1000000000, %eax
>>>        addss   %xmm1, %xmm0
>>>        jne     .L3
>>>
>>>Manually removing the extra movaps improves performance from 1.26s
>>>to 0.95s on sandybridge using trunk (r201859).
>>>
>>>load PRE tries to promote the MEM op of total out of the loop; it
>>>generates a new PHI at the start of the loop body:
>>>
>>>  <bb 2>:
>>>  pretmp_22 = total;
>>>  goto <bb 4>;
>>>
>>>  <bb 3>:
>>>
>>>  <bb 4>:
>>>  # i_15 = PHI <i_8(3), 0(2)>
>>>  # prephitmp_23 = PHI <total.1_6(3), pretmp_22(2)>   ==> PHI generated.
>>>  _4 = (float) i_15;
>>>  total.0_5 = prephitmp_23;
>>>  total.1_6 = _4 + total.0_5;
>>>  total = total.1_6;
>>>  i_8 = i_15 + 1;
>>>  if (i_8 != 1000000000)
>>>    goto <bb 3>;
>>>  else
>>>    goto <bb 5>;
>>>
>>>The out-of-ssa phase should have coalesced prephitmp_23 and
>>>total.1_6(3) to the same temp var, but the existing out-of-ssa has a
>>>limitation that it will not coalesce ssa variables with different
>>>base var names, even if they are in the same phi and their live
>>>ranges don't conflict. So out-of-ssa will insert the redundant mov
>>>pretmp = total.1_6 in bb3.
>>>
>>>  <bb 2>:
>>>  pretmp = total;
>>>  goto <bb 4>;
>>>
>>>  <bb 3>:
>>>  pretmp = total.1_6;    ==> inserted by out-of-ssa.
>>>
>>>  <bb 4>:
>>>  _4 = (float) i_15;
>>>  total.1_6 = _4 + pretmp;
>>>  i_8 = i_15 + 1;
>>>  if (i_8 != 1000000000)
>>>    goto <bb 3>;
>>>  else
>>>    goto <bb 5>;
>>>
>>>The IRA phase has the potential to allocate pretmp and total.1_6 to
>>>the same hardreg and remove the extra mov, but for the above case,
>>>the regmove phase happens to block IRA from doing the cleanup.
>>>regmove guesses the register constraint of an insn and tries to
>>>change the insn to satisfy the constraint before the IRA phase.
>>>Usually it could help IRA make a better decision, but here regmove
>>>decides to merge _4 and total.1_6 into total.1_6 in order to satisfy
>>>the constraint of the two-operand plus on x86 (addss xmm1, xmm2).
>>>After _4 and total.1_6 are merged, the live range of total.1_6
>>>conflicts with that of pretmp in IRA, so they cannot be allocated to
>>>the same hardreg, and the redundant mov (pretmp = total.1_6) can't be
>>>deleted. However, it is not trivial to make regmove choose to merge
>>>total.1_6 and pretmp, because it requires regmove to have global live
>>>range analysis (existing regmove has a simple correctness check in a
>>>range limited to a single bb).
>>>
>>>If we use -mtune=corei7-avx, then the redundant mov disappears. That
>>>is because with avx support, regmove knows avx provides a
>>>three-operand plus: vaddss xmm1, xmm2, xmm3/m32, so it will not merge
>>>total.1_6 and _4, and then IRA can allocate total.1_6 and pretmp to
>>>the same hardreg.
>>>
>>>If we change the type of total from float to int, then the redundant
>>>mov also disappears, for a similar reason. x86 provides the LEA insn,
>>>which can be used as a plus op and can have three operands, so
>>>regmove chooses not to merge total.1_6 and _4.
>>>
>>>My question is, why out-of-ssa cannot do the cleanup by coalescing
>>>all the vars without conflicts in the same phi stmt, instead of only
>>>coalescing the vars with the same base name?
>>
>> The restriction exists to keep conflict bitmaps small. Otherwise
>> you'll have quadratic memory usage for them.
>>
>> Richard.
>>
>>>Thanks,
>>>Wei Mi.
>>
>>
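
For illustration only (not from the thread): a minimal C sketch of the
"quadratic memory usage" concern above. It assumes a simplified model in
which conflicts are stored as one bit per unordered pair of partitions
in a dense triangular matrix; GCC's actual out-of-ssa data structures
are not reproduced here. The names conflict_pairs, conflict_bytes and
conflict_self_check are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of unordered pairs among n partitions,
   i.e. the number of conflict bits in a strict lower-triangular
   matrix (an assumed dense model, not GCC's representation). */
static size_t conflict_pairs(size_t n)
{
    return n < 2 ? 0 : n * (n - 1) / 2;
}

/* Hypothetical helper: bytes needed to store one bit per pair,
   rounded up to a whole byte. */
static size_t conflict_bytes(size_t n)
{
    return (conflict_pairs(n) + 7) / 8;
}

/* Self-check: doubling the number of partitions slightly more than
   quadruples the number of pairs, which is the quadratic growth. */
static void conflict_self_check(void)
{
    assert(conflict_pairs(10) == 45);
    assert(conflict_pairs(20) == 190);
    assert(conflict_pairs(20) > 4 * conflict_pairs(10));
}
```

Under this model, 1000 partitions need 499500 conflict bits (roughly
61 KiB), and every doubling of the partition count about quadruples
that, which is why keeping fewer vars in the conflict graph matters.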