Re: Imprecise data flow analysis leads to code bloat

Georg-Johann Lay Thu, 17 Jan 2013 09:04:52 -0800

Richard Biener wrote:
> On Thu, Jan 17, 2013 at 12:20 PM, Georg-Johann Lay wrote:
>> Hi, suppose the following C code:
>>
>>
>> static __inline__ __attribute__((__always_inline__))
>> _Fract rbits (const int i)
>> {
>>     _Fract f;
>>     __builtin_memcpy (&f, &i, sizeof (_Fract));
>>     return f;
>> }
>>
>> _Fract func (void)
>> {
>> #if B == 1
>>     return rbits (0x1234);
>> #elif B == 2
>>     return 0.14222r;
>> #endif
>> }
>>
>>
>> Type-punning idioms like in rbits above are very common in libgcc, for 
>> example
>> in fixed-bit.c.
>>
>> In this example, both compilation variants are equivalent (provided int and
>> _Fract are 16 bits wide).  The problem with the B=1 variant is that it is
>> inefficient:
>>
>> Variant B=1 needs 2 instructions.
> 
> B == 1 shows that fold-const.c native_interpret/encode_expr lack
> support for FIXED_POINT_TYPE.
> 
>> Variant B=2 needs 11 instructions, 9 of them are not needed at all.


I confused B=1 and B=2.  The inefficient case with 11 instructions is B=1, of
course.

Would a patch like below be acceptable in the current stage?

It's only the native_interpret and pretty much like the int case.

Difference is that it rejects if the sizes don't match exactly.  A new function
is used for low-level construction of a const_fixed_from_double_int.  Isn't
there a better way?  I wonder that such functionality is not already there...

A quick test shows that it works reasonable on AVR.

>> The problem goes as follows:
>>
>> The memcpy is represented as a VIEW_CONVERT_EXPR<_Fract> and expanded to 
>> memory
>> moves through the frame:
>>
>> (insn 5 4 6 (set (reg:HI 45)
>>         (const_int 4660 [0x1234])) bloat.c:5 -1
>>      (nil))
>>
>> (insn 6 5 7 (set (mem/c:HI (reg/f:HI 37 virtual-stack-vars) [2 S2 A8])
>>         (reg:HI 45)) bloat.c:5 -1
>>      (nil))
>>
>> (insn 7 6 8 (set (reg:HQ 46)
>>         (mem/c:HQ (reg/f:HI 37 virtual-stack-vars) [2 S2 A8])) bloat.c:12 -1
>>      (nil))
>>
>> (insn 8 7 9 (set (reg:HQ 43 [ <retval> ])
>>         (reg:HQ 46)) bloat.c:12 -1
>>      (nil))
>>
>>
>> Is there a specific reason why this is not expanded as subreg like this?
>>
>>
>>   (set (reg:HQ 46)
>>        (subreg:HQ [(reg:HI 45)] 0))
> 
> Probably your target does not allow this?  Not sure, but then a testcase like

This (movhq) is supported, similar for other fixed-point modes moves:

QQ, UQQ, HQ, UHQ, HA, UHA, SQ, USQ, SA, USA, DQ, UDQ, DA, UDA, TA, UTA.


> static __inline__ __attribute__((__always_inline__))
> _Fract rbits (const int i)
> {
>   _Fract f;
>   __builtin_memcpy (&f, &i, sizeof (_Fract));
>   return f;
> }
> 
> _Fract func (int i)
> {
>   return rbits (i);
> }
> 
> would be more interesting, as your testcase boils down to a constant
> folding issue (see above).

This case is optimized fine and the generated code is nice.

>> The insns are analyzed in .dfinit:
>>
>> ;;  regs ever live       24[r24] 25[r25] 29[r29]
>>
>> 24/25 is the return register, 28/29 is the frame pointer:
>>
>>
>> (insn 6 5 7 2 (set (mem/c:HI (plus:HI (reg/f:HI 28 r28)
>>                 (const_int 1 [0x1])) [2 S2 A8])
>>         (reg:HI 45)) bloat.c:5 82 {*movhi}
>>
>>
>> Is there a reason why R28 is not marked as live?
>>
>>
>> The memory accesses are optimized out in .fwprop2 so that no frame pointer is
>> needed any more.
>>
>> However, in the subsequent passes, R29 is still reported as "regs ever live"
>> which is not correct.
>>
>> The "regs ever live" don't change until .ira, where R28 springs to live 
>> again:
>>
>> ;;  regs ever live       24[r24] 25[r25] 28[r28] 29[r29]
>>
>> And in .reload:
>>
>> ;;  regular block artificial uses        28 [r28] 32 [__SP_L__]
>> ;;  eh block artificial uses     28 [r28] 32 [__SP_L__] 34 [argL]
>> ;;  entry block defs     8 [r8] 9 [r9] 10 [r10] 11 [r11] 12 [r12] 13 [r13] 14
>> [r14] 15 [r15] 16 [r16] 17 [r17] 18 [r18] 19 [r19] 20 [r20] 21 [r21] 22 [r22]
>> 23 [r23] 24 [r24] 25 [r25] 28 [r28] 32 [__SP_L__]
>> ;;  exit block uses      24 [r24] 25 [r25] 28 [r28] 32 [__SP_L__]
>> ;;  regs ever live       24[r24] 25[r25] 28[r28] 29[r29]
>>
>> Outcome is that the frame pointer is set up without need, which is very 
>> costly
>> on avr.
>>
>>
>> The compiler is trunk from 2013-01-16 configured for avr:
>>
>>    --target=avr --enable-languages=c,c++ --disable-nls --with-dwarf2
>>
>> The example is compiled with
>>
>>    avr-gcc bloat.c -S -dp -save-temps -Os -da -fdump-tree-optimized -DB=1
>>
>>
>> In what stage happens the misoptimization?
>>
>> Is this worth a PR? (I.e. is there a change that anybody cares?)
> 
> Well, AVR folks may be interested, or people caring for fixed-point support.

Besides the possible optimization in fold-const.c, data flow produces strange
results.

Presumably it misses some properties of the frame pointer which spans two hard
registers?  This is rather uncommon in GCC as far as I know.

There were attempts to use HARD_FRAME_POINTER in the avr BE but the results
were completely unusable because a frame was set up in any function :-(


Johann

Index: fold-const.c
===================================================================
--- fold-const.c	(revision 195244)
+++ fold-const.c	(working copy)
@@ -7410,6 +7410,58 @@ native_interpret_int (tree type, const u
 
 
 /* Subroutine of native_interpret_expr.  Interpret the contents of
+   the buffer PTR of length LEN as a FIXED_CST of type TYPE.
+   If the buffer cannot be interpreted, return NULL_TREE.  */
+
+static tree
+native_interpret_fixed (tree type, const unsigned char *ptr, int len)
+{
+  int total_bytes = GET_MODE_SIZE (TYPE_MODE (type));
+  int byte, offset, word, words;
+  unsigned char value;
+  double_int result;
+  FIXED_VALUE_TYPE fixed_value;
+
+  if (total_bytes != len)
+    return NULL_TREE;
+  if (total_bytes * BITS_PER_UNIT > HOST_BITS_PER_DOUBLE_INT)
+    return NULL_TREE;
+
+  result = double_int_zero;
+  words = total_bytes / UNITS_PER_WORD;
+
+  for (byte = 0; byte < total_bytes; byte++)
+    {
+      int bitpos = byte * BITS_PER_UNIT;
+      if (total_bytes > UNITS_PER_WORD)
+	{
+	  word = byte / UNITS_PER_WORD;
+	  if (WORDS_BIG_ENDIAN)
+	    word = (words - 1) - word;
+	  offset = word * UNITS_PER_WORD;
+	  if (BYTES_BIG_ENDIAN)
+	    offset += (UNITS_PER_WORD - 1) - (byte % UNITS_PER_WORD);
+	  else
+	    offset += byte % UNITS_PER_WORD;
+	}
+      else
+	offset = BYTES_BIG_ENDIAN ? (total_bytes - 1) - byte : byte;
+      value = ptr[offset];
+
+      if (bitpos < HOST_BITS_PER_WIDE_INT)
+	result.low |= (unsigned HOST_WIDE_INT) value << bitpos;
+      else
+	result.high |= (unsigned HOST_WIDE_INT) value
+		       << (bitpos - HOST_BITS_PER_WIDE_INT);
+    }
+
+  fixed_value = const_fixed_from_double_int (result, TYPE_MODE (type));
+
+  return build_fixed (type, fixed_value);
+}
+
+
+/* Subroutine of native_interpret_expr.  Interpret the contents of
    the buffer PTR of length LEN as a REAL_CST of type TYPE.
    If the buffer cannot be interpreted, return NULL_TREE.  */
 
@@ -7533,6 +7585,9 @@ native_interpret_expr (tree type, const
     case REAL_TYPE:
       return native_interpret_real (type, ptr, len);
 
+    case FIXED_POINT_TYPE:
+      return native_interpret_fixed (type, ptr, len);
+
     case COMPLEX_TYPE:
       return native_interpret_complex (type, ptr, len);
 
@@ -7557,6 +7612,7 @@ can_native_interpret_type_p (tree type)
     case BOOLEAN_TYPE:
     case POINTER_TYPE:
     case REFERENCE_TYPE:
+    case FIXED_POINT_TYPE:
     case REAL_TYPE:
     case COMPLEX_TYPE:
     case VECTOR_TYPE:
Index: fixed-value.c
===================================================================
--- fixed-value.c	(revision 195244)
+++ fixed-value.c	(working copy)
@@ -81,6 +81,24 @@ check_real_for_fixed_mode (REAL_VALUE_TY
   return FIXED_OK;
 }
 
+
+/* Construct a CONST_FIXED from a bit payload and machine mode MODE.
+   The bits in PAYLOAD are used verbatim.  */
+
+FIXED_VALUE_TYPE
+const_fixed_from_double_int (double_int payload, enum machine_mode mode)
+{
+  FIXED_VALUE_TYPE value;
+
+  gcc_assert (GET_MODE_BITSIZE (mode) <= HOST_BITS_PER_DOUBLE_INT);
+
+  value.data = payload;
+  value.mode = mode;
+
+  return value;
+}
+
+
 /* Initialize from a decimal or hexadecimal string.  */
 
 void
Index: fixed-value.h
===================================================================
--- fixed-value.h	(revision 195244)
+++ fixed-value.h	(working copy)
@@ -49,6 +49,11 @@ extern FIXED_VALUE_TYPE fconst1[MAX_FCON
   const_fixed_from_fixed_value (r, m)
 extern rtx const_fixed_from_fixed_value (FIXED_VALUE_TYPE, enum machine_mode);
 
+/* Construct a CONST_FIXED from a bit payload and machine mode MODE.
+   The bits in PAYLOAD are used verbatim.  */
+extern FIXED_VALUE_TYPE const_fixed_from_double_int (double_int,
+						     enum machine_mode);
+
 /* Initialize from a decimal or hexadecimal string.  */
 extern void fixed_from_string (FIXED_VALUE_TYPE *, const char *,
 			       enum machine_mode);

Re: Imprecise data flow analysis leads to code bloat

Reply via email to