Re: Imprecise data flow analysis leads to code bloat

Georg-Johann Lay Sat, 26 Jan 2013 23:12:24 -0800

Richard Biener schrieb:

On Thu, Jan 17, 2013 at 6:04 PM, Georg-Johann Lay <a...@gjlay.de> wrote:

Richard Biener wrote:

On Thu, Jan 17, 2013 at 12:20 PM, Georg-Johann Lay wrote:

Hi, suppose the following C code:



static __inline__ __attribute__((__always_inline__))
_Fract rbits (const int i)
{
    _Fract f;
    __builtin_memcpy (&f, &i, sizeof (_Fract));
    return f;
}

_Fract func (void)
{
#if B == 1
    return rbits (0x1234);
#elif B == 2
    return 0.14222r;
#endif
}


Type-punning idioms like in rbits above are very common in libgcc, for example
in fixed-bit.c.

In this example, both compilation variants are equivalent (provided int and
_Fract are 16 bits wide).  The problem with the B=1 variant is that it is
inefficient:

Variant B=1 needs 2 instructions.

B == 1 shows that fold-const.c native_interpret/encode_expr lack
support for FIXED_POINT_TYPE.

Variant B=2 needs 11 instructions, 9 of them are not needed at all.

I confused B=1 and B=2.  The inefficient case with 11 instructions is B=1, of
course.

Would a patch like below be acceptable in the current stage?


I'd be fine with it (not sure about test coverage).  But please also add
native_encode_fixed.



For the patch, please f'up to

http://gcc.gnu.org/ml/gcc-patches/2013-01/msg01053.html

Johann

It's only the native_interpret and pretty much like the int case.

Difference is that it rejects if the sizes don't match exactly.


Hmm, yeah.  I'm not sure why the _interpret routines chose to ignore
tail padding ... was there any special correctness reason you did it
differently than the int variant?

 A new function
is used for low-level construction of a const_fixed_from_double_int.  Isn't
there a better way?  I wonder that such functionality is not already there...


Good question - there are probably more places that could make use of
this.

A quick test shows that it works reasonable on AVR.

The problem goes as follows:

The memcpy is represented as a VIEW_CONVERT_EXPR<_Fract> and expanded to memory
moves through the frame:

(insn 5 4 6 (set (reg:HI 45)
        (const_int 4660 [0x1234])) bloat.c:5 -1
     (nil))

(insn 6 5 7 (set (mem/c:HI (reg/f:HI 37 virtual-stack-vars) [2 S2 A8])
        (reg:HI 45)) bloat.c:5 -1
     (nil))

(insn 7 6 8 (set (reg:HQ 46)
        (mem/c:HQ (reg/f:HI 37 virtual-stack-vars) [2 S2 A8])) bloat.c:12 -1
     (nil))

(insn 8 7 9 (set (reg:HQ 43 [ <retval> ])
        (reg:HQ 46)) bloat.c:12 -1
     (nil))


Is there a specific reason why this is not expanded as subreg like this?


  (set (reg:HQ 46)
       (subreg:HQ [(reg:HI 45)] 0))

Probably your target does not allow this?  Not sure, but then a testcase like

This (movhq) is supported, similar for other fixed-point modes moves:

QQ, UQQ, HQ, UHQ, HA, UHA, SQ, USQ, SA, USA, DQ, UDQ, DA, UDA, TA, UTA.

static __inline__ __attribute__((__always_inline__))
_Fract rbits (const int i)
{
  _Fract f;
  __builtin_memcpy (&f, &i, sizeof (_Fract));
  return f;
}

_Fract func (int i)
{
  return rbits (i);
}

would be more interesting, as your testcase boils down to a constant
folding issue (see above).

This case is optimized fine and the generated code is nice.


I see - so it's really confused about the constant-ness.  Yeah, I can imagine
that.

Thanks,
Richard.

The insns are analyzed in .dfinit:

;;  regs ever live       24[r24] 25[r25] 29[r29]

24/25 is the return register, 28/29 is the frame pointer:


(insn 6 5 7 2 (set (mem/c:HI (plus:HI (reg/f:HI 28 r28)
                (const_int 1 [0x1])) [2 S2 A8])
        (reg:HI 45)) bloat.c:5 82 {*movhi}


Is there a reason why R28 is not marked as live?


The memory accesses are optimized out in .fwprop2 so that no frame pointer is
needed any more.

However, in the subsequent passes, R29 is still reported as "regs ever live"
which is not correct.

The "regs ever live" don't change until .ira, where R28 springs to live again:

;;  regs ever live       24[r24] 25[r25] 28[r28] 29[r29]

And in .reload:

;;  regular block artificial uses        28 [r28] 32 [__SP_L__]
;;  eh block artificial uses     28 [r28] 32 [__SP_L__] 34 [argL]
;;  entry block defs     8 [r8] 9 [r9] 10 [r10] 11 [r11] 12 [r12] 13 [r13] 14
[r14] 15 [r15] 16 [r16] 17 [r17] 18 [r18] 19 [r19] 20 [r20] 21 [r21] 22 [r22]
23 [r23] 24 [r24] 25 [r25] 28 [r28] 32 [__SP_L__]
;;  exit block uses      24 [r24] 25 [r25] 28 [r28] 32 [__SP_L__]
;;  regs ever live       24[r24] 25[r25] 28[r28] 29[r29]

Outcome is that the frame pointer is set up without need, which is very costly
on avr.


The compiler is trunk from 2013-01-16 configured for avr:

   --target=avr --enable-languages=c,c++ --disable-nls --with-dwarf2

The example is compiled with

   avr-gcc bloat.c -S -dp -save-temps -Os -da -fdump-tree-optimized -DB=1


In what stage happens the misoptimization?

Is this worth a PR? (I.e. is there a change that anybody cares?)

Well, AVR folks may be interested, or people caring for fixed-point support.

Besides the possible optimization in fold-const.c, data flow produces strange
results.

Presumably it misses some properties of the frame pointer which spans two hard
registers?  This is rather uncommon in GCC as far as I know.

There were attempts to use HARD_FRAME_POINTER in the avr BE but the results
were completely unusable because a frame was set up in any function :-(


Johann

Re: Imprecise data flow analysis leads to code bloat

Reply via email to