On Thu, Jan 17, 2013 at 6:04 PM, Georg-Johann Lay <[email protected]> wrote:
> Richard Biener wrote:
>> On Thu, Jan 17, 2013 at 12:20 PM, Georg-Johann Lay wrote:
>>> Hi, suppose the following C code:
>>>
>>>
>>> static __inline__ __attribute__((__always_inline__))
>>> _Fract rbits (const int i)
>>> {
>>> _Fract f;
>>> __builtin_memcpy (&f, &i, sizeof (_Fract));
>>> return f;
>>> }
>>>
>>> _Fract func (void)
>>> {
>>> #if B == 1
>>> return rbits (0x1234);
>>> #elif B == 2
>>> return 0.14222r;
>>> #endif
>>> }
>>>
>>>
>>> Type-punning idioms like in rbits above are very common in libgcc,
>>> for example in fixed-bit.c.
>>>
>>> In this example, both compilation variants are equivalent (provided int and
>>> _Fract are 16 bits wide). The problem with the B=1 variant is that it is
>>> inefficient:
>>>
>>> Variant B=1 needs 2 instructions.
>>
>> B == 1 shows that fold-const.c native_interpret/encode_expr lack
>> support for FIXED_POINT_TYPE.
>>
>>> Variant B=2 needs 11 instructions, 9 of them are not needed at all.
>
> I confused B=1 and B=2. The inefficient case with 11 instructions is B=1, of
> course.
>
> Would a patch like below be acceptable in the current stage?
I'd be fine with it (not sure about test coverage). But please also add
native_encode_fixed.
> It's only the native_interpret part, and it is pretty much like the int case.
>
> The difference is that it rejects the conversion if the sizes don't match
> exactly.
Hmm, yeah. I'm not sure why the _interpret routines chose to ignore
tail padding ... was there any special correctness reason you did it
differently than the int variant?
> A new function, const_fixed_from_double_int, is used for the low-level
> construction. Isn't there a better way? I'm surprised that such
> functionality is not already there...
Good question - there are probably more places that could make use of
this.
> A quick test shows that it works reasonably on AVR.
>
>>> The problem goes as follows:
>>>
>>> The memcpy is represented as a VIEW_CONVERT_EXPR<_Fract> and expanded
>>> to memory moves through the frame:
>>>
>>> (insn 5 4 6 (set (reg:HI 45)
>>> (const_int 4660 [0x1234])) bloat.c:5 -1
>>> (nil))
>>>
>>> (insn 6 5 7 (set (mem/c:HI (reg/f:HI 37 virtual-stack-vars) [2 S2 A8])
>>> (reg:HI 45)) bloat.c:5 -1
>>> (nil))
>>>
>>> (insn 7 6 8 (set (reg:HQ 46)
>>> (mem/c:HQ (reg/f:HI 37 virtual-stack-vars) [2 S2 A8])) bloat.c:12 -1
>>> (nil))
>>>
>>> (insn 8 7 9 (set (reg:HQ 43 [ <retval> ])
>>> (reg:HQ 46)) bloat.c:12 -1
>>> (nil))
>>>
>>>
>>> Is there a specific reason why this is not expanded as subreg like this?
>>>
>>>
>>> (set (reg:HQ 46)
>>> (subreg:HQ [(reg:HI 45)] 0))
>>
>> Probably your target does not allow this? Not sure, but then a testcase like
>
> This (movhq) is supported, similar for other fixed-point modes moves:
>
> QQ, UQQ, HQ, UHQ, HA, UHA, SQ, USQ, SA, USA, DQ, UDQ, DA, UDA, TA, UTA.
>
>
>> static __inline__ __attribute__((__always_inline__))
>> _Fract rbits (const int i)
>> {
>> _Fract f;
>> __builtin_memcpy (&f, &i, sizeof (_Fract));
>> return f;
>> }
>>
>> _Fract func (int i)
>> {
>> return rbits (i);
>> }
>>
>> would be more interesting, as your testcase boils down to a constant
>> folding issue (see above).
>
> This case is optimized fine and the generated code is nice.
I see - so it's really confused about the constant-ness. Yeah, I can imagine
that.
Thanks,
Richard.
>>> The insns are analyzed in .dfinit:
>>>
>>> ;; regs ever live 24[r24] 25[r25] 29[r29]
>>>
>>> 24/25 is the return register, 28/29 is the frame pointer:
>>>
>>>
>>> (insn 6 5 7 2 (set (mem/c:HI (plus:HI (reg/f:HI 28 r28)
>>> (const_int 1 [0x1])) [2 S2 A8])
>>> (reg:HI 45)) bloat.c:5 82 {*movhi}
>>>
>>>
>>> Is there a reason why R28 is not marked as live?
>>>
>>>
>>> The memory accesses are optimized out in .fwprop2, so that no frame
>>> pointer is needed any more.
>>>
>>> However, in the subsequent passes, R29 is still reported as "regs ever live"
>>> which is not correct.
>>>
>>> The "regs ever live" don't change until .ira, where R28 springs to
>>> life again:
>>>
>>> ;; regs ever live 24[r24] 25[r25] 28[r28] 29[r29]
>>>
>>> And in .reload:
>>>
>>> ;; regular block artificial uses 28 [r28] 32 [__SP_L__]
>>> ;; eh block artificial uses 28 [r28] 32 [__SP_L__] 34 [argL]
>>> ;; entry block defs 8 [r8] 9 [r9] 10 [r10] 11 [r11] 12 [r12] 13 [r13]
>>> 14 [r14] 15 [r15] 16 [r16] 17 [r17] 18 [r18] 19 [r19] 20 [r20] 21 [r21]
>>> 22 [r22] 23 [r23] 24 [r24] 25 [r25] 28 [r28] 32 [__SP_L__]
>>> ;; exit block uses 24 [r24] 25 [r25] 28 [r28] 32 [__SP_L__]
>>> ;; regs ever live 24[r24] 25[r25] 28[r28] 29[r29]
>>>
>>> The outcome is that the frame pointer is set up without need, which is
>>> very costly on avr.
>>>
>>>
>>> The compiler is trunk from 2013-01-16 configured for avr:
>>>
>>> --target=avr --enable-languages=c,c++ --disable-nls --with-dwarf2
>>>
>>> The example is compiled with
>>>
>>> avr-gcc bloat.c -S -dp -save-temps -Os -da -fdump-tree-optimized -DB=1
>>>
>>>
>>> In which pass does the misoptimization happen?
>>>
>>> Is this worth a PR? (I.e. is there a chance that anybody cares?)
>>
>> Well, AVR folks may be interested, or people caring for fixed-point support.
>
> Besides the possible optimization in fold-const.c, data flow produces strange
> results.
>
> Presumably it misses some properties of the frame pointer, which spans two
> hard registers? This is rather uncommon in GCC as far as I know.
>
> There were attempts to use HARD_FRAME_POINTER in the avr BE, but the results
> were completely unusable because a frame was set up in every function :-(
>
>
> Johann
>