[Bug rtl-optimization/56124] Redundant reload for loading from memory

2013-04-18 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56124

bin.cheng  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution||FIXED

--- Comment #2 from bin.cheng  2013-04-18 09:42:48 UTC ---
Fixed by http://gcc.gnu.org/ml/gcc-cvs/2013-04/msg00399.html


[Bug target/54414] New: ARM:mis-compiled prologue/epilogue on cortex-m0 when optimizing with -Os

2012-08-30 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54414

 Bug #: 54414
   Summary: ARM:mis-compiled prologue/epilogue on cortex-m0 when
optimizing with -Os
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com


For the case of pr45070.c as below:

/* PR45070 */
extern void abort(void);

struct packed_ushort {
unsigned short ucs;
} __attribute__((packed));

struct source {
int pos, length;
int flag;
};

static void __attribute__((noinline)) fetch(struct source *p)
{
p->length = 128;
}

static struct packed_ushort __attribute__((noinline)) next(struct source *p)
{
struct packed_ushort rv;

if (p->pos >= p->length) {
if (p->flag) {
p->flag = 0;
fetch(p);
return next(p);
}
p->flag = 1;
rv.ucs = 0x;
return rv;
}
rv.ucs = 0;
return rv;
}

int main(void)
{
struct source s;
int i;

s.pos = 0;
s.length = 0;
s.flag = 0;

for (i = 0; i < 16; i++) {
struct packed_ushort rv = next(&s);
if ((i == 0 && rv.ucs != 0x)
|| (i > 0 && rv.ucs != 0))
abort();
}
return 0;
}
Compile with the options below:
$ arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os pr45070.c -o pr45070.S
The generated assembly code for function next looks like:

next:
        push    {r0, r1, r2, r3, r4, lr}
        ldr     r2, [r0]
        ldr     r3, [r0, #4]
        mov     r4, r0
        cmp     r2, r3
        blt     .L3
        ldr     r2, [r0, #8]
        cmp     r2, #0
        beq     .L4
        mov     r3, #0
        str     r3, [r0, #8]
        add     r0, r0, #4
        bl      fetch.isra.0
        mov     r0, r4
        bl      next
        mov     r3, sp
        sxth    r0, r0
        strb    r0, [r3]
        lsr     r0, r0, #8
        strb    r0, [r3, #1]
        mov     r3, sp
        ldrh    r2, [r3]
        b       .L6
.L4:
        mov     r3, #1
        str     r3, [r0, #8]
        neg     r2, r3
        b       .L6
.L3:
        mov     r2, #0
.L6:
        add     r3, sp, #12
        strh    r2, [r3]
        add     r3, sp, #12
        ldrb    r0, [r3, #1]
        ldrb    r2, [r3]
        lsl     r0, r0, #8
        orr     r0, r2
        @ sp needed for prologue
        pop     {r1, r2, r3, r4, pc}

The pc register is restored with the wrong value.
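
One way to read the listing: the prologue pushes six words but the epilogue pops only five, so pc is loaded from the slot where r4 was saved rather than from the slot holding lr. The following is a toy C model of that push/pop pair, not an ARM emulator. The names `push_regs`, `pop_regs` and `pc_after_epilogue` are made up for illustration, and the model assumes sp is not adjusted between the two instructions, as in the listing.

```c
#include <assert.h>
#include <stddef.h>

/* Toy register file: only the registers named in the listing. */
enum reg { R0, R1, R2, R3, R4, LR, PC, NUM_REGS };

#define STACK_WORDS 16

struct machine {
    unsigned regs[NUM_REGS];
    unsigned stack[STACK_WORDS];
    size_t sp;                  /* index of the lowest used stack word */
};

/* Model of PUSH: the first register in the (ascending) list ends up at
   the lowest address. */
static void push_regs(struct machine *m, const enum reg *list, size_t n)
{
    assert(m->sp >= n);
    m->sp -= n;
    for (size_t i = 0; i < n; i++)
        m->stack[m->sp + i] = m->regs[list[i]];
}

/* Model of POP: registers are loaded from ascending addresses. */
static void pop_regs(struct machine *m, const enum reg *list, size_t n)
{
    assert(m->sp + n <= STACK_WORDS);
    for (size_t i = 0; i < n; i++)
        m->regs[list[i]] = m->stack[m->sp + i];
    m->sp += n;
}

/* Run the prologue/epilogue pair from the listing and return the value
   loaded into pc.  *leaked_words receives the number of stack words the
   epilogue fails to release. */
unsigned pc_after_epilogue(unsigned r4_value, unsigned lr_value,
                           size_t *leaked_words)
{
    static const enum reg pushed[] = { R0, R1, R2, R3, R4, LR };
    static const enum reg popped[] = { R1, R2, R3, R4, PC };
    struct machine m = { .sp = STACK_WORDS };

    m.regs[R4] = r4_value;
    m.regs[LR] = lr_value;
    push_regs(&m, pushed, 6);
    pop_regs(&m, popped, 5);
    *leaked_words = STACK_WORDS - m.sp;
    return m.regs[PC];
}
```

With r4 = 0x1111 and lr = 0x2222 on entry, `pc_after_epilogue` returns 0x1111 and reports one leaked stack word, which matches the symptom in the report.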


[Bug target/54414] ARM:mis-compiled prologue/epilogue on cortex-m0 when optimizing with -Os

2012-08-30 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54414

--- Comment #1 from amker.cheng  2012-08-30 10:17:15 UTC ---
I suspect that the call of arm_size_return_regs in function
thumb1_extra_regs_pushed should also be covered as in
http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00830.html


[Bug rtl-optimization/54133] regrename introduces additional dependencies

2012-09-25 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54133

--- Comment #8 from amker.cheng  2012-09-25 07:45:02 UTC ---
I have spent some time investigating this bug, and I think I now understand the
issue.

The problematic instruction patterns, which save/restore argument/return
registers, are generated and kept on Thumb1 because the ARM back end defines
the target hook TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P.

The intention is to keep the live ranges of hardware registers short, so I
think it is inappropriate to do the propagation before IRA.

I can only think of fixing this in the following ways:
1. Run an additional cprop_hardreg before register renaming. Of course, this
does not seem decent.
2. The post-reload pass supports simple CSE by using cselib, so we can do the
transformation in postreload.

Currently CSELIB can't detect such cases. The root cause is:
1. Argument registers usually have no initialization; the return register is
usually initialized by call_expr.
2. CSELIB uses the first element of the elt_list to define the mode in which
the register was set; if the mode is unknown or the value is no longer valid in
that mode, ELT will be NULL for the first element.
3. CSELIB creates a first NULL elt_list for argument registers in function
"cselib_lookup_1", because such registers have no initialization.
4. CSELIB ignores return registers initialized by call_expr, as in function
"cselib_hash_rtx", and then creates a first NULL elt_list for return registers.
5. In function "cselib_reg_set_mode", CSELIB checks whether the first element
of elt_list is NULL; as a result, argument/return registers won't be CSEd.

But I am not sure whether CSELIB can be improved to address this issue.


[Bug target/54989] FAIL: gcc.dg/hoist-register-pressure.c scan-rtl-dump hoist "PRE/HOIST: end of bb .* copying expression" on darwin

2012-10-19 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54989

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot
   ||com

--- Comment #1 from bin.cheng  2012-10-20 05:40:08 UTC ---
The failure is caused by higher register pressure in the THEN branch of the
test case, though I am not sure why the register pressure is higher than on
x86-linux.

This can be fixed by simplifying the test case as below:

/* { dg-options "-Os -fdump-rtl-hoist" }  */
/* { dg-final { scan-rtl-dump "PRE/HOIST: end of bb .* copying expression" "hoist" } } */

#define BUF 100
int a[BUF];

void com (int);
void bar (int);

int foo (int x, int y, int z)
{
  /* "x+y" won't be hoisted if "-fira-hoist-pressure" is disabled,
     because its rtx_cost is too small.  */
  if (z)
    {
      a[1] = a[0];
      a[2] = a[1];
      a[3] = a[2];
      a[4] = a[3];
      a[5] = a[4];
      a[6] = a[5];
      a[7] = a[6];
      com (x+y);
    }
  else
    {
      bar (x+y);
    }

  return 0;
}

I will send a patch fixing this.


[Bug other/55031] New: Documentation on RTL GCSE pass is outdated

2012-10-22 Thread amker.cheng at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55031

 Bug #: 55031
   Summary: Documentation on RTL GCSE pass is outdated
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com


Quoting from GCCINT, section "9.5 RTL passes":
"When optimizing for size, GCSE is done using Morel-Renvoise Partial Redundancy
Elimination, with the exception that it does not try to move invariants out of
loops—that is left to the loop optimization pass. If MR PRE GCSE is done, code
hoisting (aka unification) is also done, as well as load motion."

While the pass gate function is as below:

static bool
gate_rtl_pre (void)
{
  return optimize > 0 && flag_gcse
&& !cfun->calls_setjmp
&& optimize_function_for_speed_p (cfun)
&& dbg_cnt (pre);
}

This conflicts with the documentation, which says Morel-Renvoise PRE will be
used when optimizing for size. I think the document is outdated.


[Bug target/54989] FAIL: gcc.dg/hoist-register-pressure.c scan-rtl-dump hoist "PRE/HOIST: end of bb .* copying expression" on darwin

2012-10-31 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54989

--- Comment #7 from bin.cheng  2012-10-31 08:45:37 UTC ---
I think this is fixed and it's a bug in 4.8.0.
Hi Jack, could you verify that it is fixed? Thanks very much.


[Bug rtl-optimization/57540] New: stack pointer related loop invariants after reload

2013-06-06 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57540

Bug ID: 57540
   Summary: stack pointer related loop invariants after reload
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amker.cheng at gmail dot com

For below program,

void foo ( unsigned char *len,
 int alphaSize,
 int maxLen )
{
   int i, j, k;
   unsigned char tooLong;

   int parent [ 258 * 2 ];


  parent[0] = -2;

  tooLong = 0;
  for (i = 1; i <= alphaSize; i++)
  {
 j = 0;
 k = i;
 while (parent[k] >= 0)
 {
 k = parent[k];
 j++;
 }
 len[i-1] = j;
 if (j > maxLen)
 tooLong = 1;
  }
}

Compile with command line,
arm-linux-gnueabihf-gcc -S -O2 -marm -mcpu=cortex-a15 -o foo.S -xc foo.E

The generated code is like,
        .cpu cortex-a15
        .eabi_attribute 27, 3
        .eabi_attribute 28, 1
        .fpu vfpv3-d16
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 2
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .file   "foo.E"
        .text
        .align  2
        .global foo
        .type   foo, %function
foo:
        @ args = 0, pretend = 0, frame = 2064
        @ frame_needed = 0, uses_anonymous_args = 0
        str     lr, [sp, #-4]!
        sub     sp, sp, #2064
        mvn     r3, #1
        sub     sp, sp, #4
        cmp     r1, #0
        str     r3, [sp]
        ble     .L1
        mov     ip, sp
        add     r1, r0, r1
.L6:
        ldr     r3, [ip, #4]!
        mov     r2, #0
        cmp     r3, #0
        blt     .L3
.L5:
        add     lr, sp, #2064           @ loop invariant
        add     r2, r2, #1
        add     r3, lr, r3, asl #2
        ldr     r3, [r3, #-2064]
        cmp     r3, #0
        bge     .L5
        uxtb    r2, r2
.L3:
        strb    r2, [r0], #1
        cmp     r0, r1
        bne     .L6
.L1:
        add     sp, sp, #2064
        add     sp, sp, #4
        @ sp needed
        ldr     pc, [sp], #4
        .size   foo, .-foo
        .ident  "GCC: (GNU) 4.9.0 20130524 (experimental)"
        .section        .note.GNU-stack,"",%progbits


Apparently, the first instruction in basic block .L5 is invariant, but it is
kept in the loop because it is generated by reload.

I think this is a common issue.


[Bug rtl-optimization/57540] stack pointer related loop invariants after reload

2013-06-06 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57540

--- Comment #1 from bin.cheng  ---
The dump of loop_init is like,
   72: r178:SI=0
  106: L106:
   90: NOTE_INSN_BASIC_BLOCK 6
   91: r178:SI=r178:SI+0x1
   94: r190:SI=r177:SI<<0x2
  REG_DEAD r177:SI
   95: r191:SI=sfp:SI+r190:SI
  REG_DEAD r190:SI
   96: r192:SI=r191:SI-0x810
  REG_DEAD r191:SI
  REG_DEAD r189:SI
   97: r177:SI=[r192:SI]
  REG_DEAD r192:SI
   98: cc:CC=cmp(r177:SI,0)
   99: pc={(cc:CC>=0)?L104:pc}
  REG_DEAD cc:CC
  REG_BR_PROB 0x238c

Instructions 95/96 should be re-factored as below:
   95: r191:SI=sfp:SI-0x810
  REG_DEAD r190:SI
   96: r192:SI=r191:SI+r190:SI
  REG_DEAD r191:SI
  REG_DEAD r189:SI

Thus instruction 95 becomes loop invariant and can be hoisted. For the ARM
target, the loop can be simplified into:

        blt     .L3

.L5:
        add     r2, r2, #1
        ldr     r3, [sp, r3, asl #2]
        cmp     r3, #0
        bge     .L5
        uxtb    r2, r2
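
The refactoring of instructions 95/96 relies on address arithmetic being reassociable: (sfp + (k << 2)) - 0x810 equals (sfp - 0x810) + (k << 2), and only the second form contains a subexpression that does not depend on k. A minimal C sketch of the two forms follows. The function names are invented here, and `uintptr_t` merely stands in for the SImode registers in the dump.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_OFFSET 0x810u   /* the constant from insn 96 in the dump */

/* Shape of the original insns 94-96: the constant is applied last, so
   every step depends on k and nothing can be hoisted. */
uintptr_t addr_original(uintptr_t sfp, uintptr_t k)
{
    uintptr_t scaled = k << 2;          /* insn 94 */
    uintptr_t sum = sfp + scaled;       /* insn 95 */
    return sum - FRAME_OFFSET;          /* insn 96 */
}

/* Loop-invariant part of the refactored form: depends only on sfp. */
uintptr_t addr_invariant(uintptr_t sfp)
{
    return sfp - FRAME_OFFSET;          /* refactored insn 95 */
}

/* Per-iteration part of the refactored form. */
uintptr_t addr_refactored(uintptr_t invariant, uintptr_t k)
{
    return invariant + (k << 2);        /* refactored insn 96 */
}

/* Self-check: both forms must produce the same address. */
void check_forms_agree(uintptr_t sfp, uintptr_t k)
{
    assert(addr_original(sfp, k) == addr_refactored(addr_invariant(sfp), k));
}
```

Because unsigned arithmetic wraps, the two forms agree for every input; the only difference is that `addr_invariant` can be computed once outside the loop.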


[Bug target/57540] stack pointer related loop invariants after reload

2013-06-07 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57540

bin.cheng  changed:

   What|Removed |Added

  Component|rtl-optimization|target

--- Comment #2 from bin.cheng  ---
This only happens in ARM mode.
For the gimple below,
  k_8 = parent[k_29];
in ARM mode GCC expands it into,
   81: r180:SI=0xf7f0
   82: zero_extract(r180:SI,0x10,0x10)=0x
   83: r181:SI=r165:SI<<0x2
   84: r182:SI=r105:SI+r181:SI
   85: r183:SI=r182:SI+r180:SI
   86: r165:SI=[r183:SI]
while on Thumb2 GCC expands it into,
   88: r185:SI=r105:SI
   89: r186:SI=r105:SI-0x810
   90: r171:SI=[r171:SI*0x4+r186:SI]
thus resulting in much better assembly code,
.L5:
        ldr     r3, [sp, r3, lsl #2]
        adds    r2, r2, #1
        cmp     r3, #0
        bge     .L5
        uxtb    r2, r2


[Bug target/57540] stack pointer related loop invariants after reload for ARM mode

2013-06-09 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57540

--- Comment #3 from bin.cheng  ---
I think this should be handled in expand.  During expansion, GCC tries the
"base + scaled_offset + offset" pattern, which is invalid for targets like ARM.
At this point we still have a chance to refactor "base + offset" and force it
into a register, thus generating "reg + scaled_offset".
By doing this,
1) "base + offset" can be kept as a loop invariant;
2) the multiplication is done by the scaled address, saving another add
instruction.

I am testing a patch and will send it for review once it passes the tests.


[Bug target/56102] Wrong rtx cost calculated for Thumb1

2013-08-06 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56102

bin.cheng  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from bin.cheng  ---
Yes, it's fixed by that checkin.


[Bug target/57540] stack pointer related loop invariants after reload for ARM mode

2013-08-06 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57540

bin.cheng  changed:

   What|Removed |Added

  Component|middle-end  |target

--- Comment #4 from bin.cheng  ---
Sorry, according to http://gcc.gnu.org/ml/gcc-patches/2013-06/msg00932.html,
this should be fixed in the back end.  I will fix this in
arm_legitimize_address, so I am changing this entry to TARGET.


[Bug target/58423] New: [ARM]ICE with shrink-wrap-sibcall.c on a15/neon/hard

2013-09-15 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58423

Bug ID: 58423
   Summary: [ARM]ICE with shrink-wrap-sibcall.c on a15/neon/hard
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amker.cheng at gmail dot com

GCC ICEed with shrink-wrap-sibcall.c on a15 with below command line:
./arm-none-eabi-gcc -O2 -marm -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=hard
shrink-wrap-sibcall.c -S -o shrink-wrap-sibcall.S -fno-diagnostics-show-caret
-fdiagnostics-color=never -O2 -g

ICE msg is:

shrink-wrap-sibcall.c: In function 'baz':
shrink-wrap-sibcall.c:26:1: internal compiler error: in
maybe_record_trace_start, at dwarf2cfi.c:2218
0x82bfe41 maybe_record_trace_start
../../gcc/gcc/dwarf2cfi.c:2218
0x82c22f2 scan_trace
../../gcc/gcc/dwarf2cfi.c:2395
0x82c2a25 create_cfi_notes
../../gcc/gcc/dwarf2cfi.c:2549
0x82c2a25 execute_dwarf2_frame
../../gcc/gcc/dwarf2cfi.c:2904
0x82c2a25 execute
../../gcc/gcc/dwarf2cfi.c:3400
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.


GCC is at revision r202599 and the ICE relates to a15/neon/hard-abi, no matter
how it is configured for arm.


[Bug target/58424] New: [ARM]gcc.target/arm/pr42575.c failed on arm

2013-09-15 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58424

Bug ID: 58424
   Summary: [ARM]gcc.target/arm/pr42575.c failed on arm
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amker.cheng at gmail dot com

gcc is at revision r202599 and is configured as:
../gcc/configure
build=i686-linux-gnu
host=i686-linux-gnu
target=arm-none-eabi
prefix=.../trunk-orig/target/
disable-decimal-float
disable-libffi
disable-libgomp
disable-libmudflap
disable-libquadmath
disable-libssp
disable-libstdcxx-pch
disable-nls
disable-shared
disable-threads
disable-tls
with-gnu-as
with-gnu-ld
with-newlib
with-headers=yes
with-sysroot=.../trunk-orig/target/arm-none-eabi
with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm'
with-mode=thumb
with-arch=armv7-m
disable-multilib
enable-lto
enable-languages=c,c++,lto

The source code is:
/* { dg-options "-O2" }  */
/* Make sure RA does good job allocating registers and avoids
   unnecessary moves.  */
/* { dg-final { scan-assembler-not "mov" } } */

long long longfunc(long long x, long long y)
{
  return x * y;
}


The generated assembly is:
longfunc:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mul     r3, r0, r3
        push    {r4, r5}
        mla     r1, r2, r1, r3
        umull   r4, r5, r0, r2
        add     r5, r5, r1
        mov     r0, r4
        mov     r1, r5
        pop     {r4, r5}
        bx      lr
        .size   longfunc, .-longfunc

But I think the case would fail for other configurations too.
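
For reference, the arithmetic behind the listing: a 64-bit product truncated to 64 bits needs one full 32x32->64 multiply for the low halves plus two 32-bit cross products added into the high word. The C sketch below mirrors the four arithmetic instructions. It assumes the little-endian AAPCS layout (x in r1:r0, y in r3:r2, result in r1:r0), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Compute x * y (mod 2^64) using only 32-bit pieces, in the same order
   as the mul/mla/umull/add sequence in the listing. */
uint64_t mul64_by_parts(uint64_t x, uint64_t y)
{
    uint32_t x_lo = (uint32_t)x, x_hi = (uint32_t)(x >> 32);
    uint32_t y_lo = (uint32_t)y, y_hi = (uint32_t)(y >> 32);

    uint32_t cross = x_lo * y_hi;                 /* mul   r3, r0, r3     */
    cross = y_lo * x_hi + cross;                  /* mla   r1, r2, r1, r3 */
    uint64_t low = (uint64_t)x_lo * y_lo;         /* umull r4, r5, r0, r2 */
    uint32_t hi = (uint32_t)(low >> 32) + cross;  /* add   r5, r5, r1     */

    return ((uint64_t)hi << 32) | (uint32_t)low;
}

/* Self-check against the compiler's own 64-bit multiply. */
void check_mul64(uint64_t x, uint64_t y)
{
    assert(mul64_by_parts(x, y) == x * y);
}
```

The result already sits in a register pair after the `add`, so the two trailing `mov` instructions in the listing only copy it into r1:r0; those are the moves the test's scan-assembler-not "mov" directive objects to.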


[Bug rtl-optimization/50663] New: conditional propagation missed in cprop.c pass

2011-10-08 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50663

 Bug #: 50663
   Summary: conditional propagation missed in cprop.c pass
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com


For following test case:
extern int g;
int main(int a, int b)
{
    if (a == 1)
    {
        b = a;
    }

    g = b;
    return 0;
}

A piece of the dump file for the cprop1 pass looks like:


(insn 8 4 9 2 (set (reg:CC 24 cc)
(compare:CC (reg/v:SI 135 [ a ])
(const_int 1 [0x1]))) test.c:4 200 {*arm_cmpsi_insn}
 (nil))

(jump_insn 9 8 10 2 (set (pc)
(if_then_else (ne (reg:CC 24 cc)
(const_int 0 [0]))
(label_ref 11)
(pc))) test.c:4 212 {*arm_cond_branch}
 (expr_list:REG_DEAD (reg:CC 24 cc)
(expr_list:REG_BR_PROB (const_int 6218 [0x184a])
(nil)))
 -> 11)

(note 10 9 5 3 [bb 3] NOTE_INSN_BASIC_BLOCK)

(insn 5 10 11 3 (set (reg/v:SI 136 [ b ])
(reg/v:SI 135 [ a ])) test.c:4 696 {*thumb2_movsi_insn}
 (expr_list:REG_DEAD (reg/v:SI 135 [ a ])
(expr_list:REG_EQUAL (const_int 1 [0x1])
(nil

The r135 in insn_5 should be handled by conditional propagation, like:

(note 10 9 5 3 [bb 3] NOTE_INSN_BASIC_BLOCK)

(insn 5 10 11 3 (set (reg/v:SI 136 [ b ])
(const_int 1 [0x1])) test.c:4 709 {*thumb2_movsi_insn}
 (expr_list:REG_DEAD (reg/v:SI 135 [ a ])
(expr_list:REG_EQUAL (const_int 1 [0x1])
(nil

It seems cprop misses the conditional propagation for the branch basic block.
FYI, I compiled the test case with the command:
./arm-none-eabi-gcc -march=armv7-m -mthumb -O2 -S test.c -o test.S -da

The gcc is configured for arm-none-eabi and it's on trunk.


[Bug rtl-optimization/50663] conditional propagation missed in cprop.c pass

2011-10-08 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50663

--- Comment #1 from amker.cheng  2011-10-08 10:25:04 UTC ---
Here comes the cause:

Though the cprop.c pass collects the implicit_set information, it is recorded
as local info of the basic block, and cprop only does global propagation.
The result is that such conditional constant propagation opportunities are
missed.

The whole process in the cprop pass is like:

bb0 : if (x)
then
 bb1
else
 bb2
end

1, implicit_set from the preceding bb0 is tagged as local in bb1;
2, in compute_local_properties, the implicit_set is recorded in avloc[bb1];
3, in compute_cprop_available, the implicit_set is only recorded in avout[bb1],
   not in avin[bb1], where it should be;
4, in cprop_insn and find_avail_set, only info recorded in avin[bb1] is
   considered when trying to do propagation for bb1;

Well, I believe it is a small problem: since implicit_set is recorded in
avout[bb1], basic block bb1 is the only one that gets missed in propagation.

I'm working on a patch and will send it for review later.
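
The four steps above can be reproduced with a toy availability computation. The sketch below is not GCC's implementation: it uses one bit per fact and the textbook equations avin[b] = avout of the predecessor and avout[b] = avloc[b] | (avin[b] & ~kill[b]). The three-block CFG and all names are invented for illustration.

```c
#include <assert.h>

#define NUM_BLOCKS 3            /* bb0 -> bb1, bb0 -> bb2 */
#define IMPLICIT_SET 0x1u       /* the fact implied by the branch in bb0 */

struct avail {
    unsigned avloc[NUM_BLOCKS];
    unsigned kill[NUM_BLOCKS];
    unsigned avin[NUM_BLOCKS];
    unsigned avout[NUM_BLOCKS];
};

/* pred[b] is the single predecessor of block b, or -1 for the entry. */
static const int pred[NUM_BLOCKS] = { -1, 0, 0 };

void compute_available(struct avail *a)
{
    /* Blocks are visited in topological order and the CFG has no loops,
       so a single pass reaches the fixed point. */
    for (int b = 0; b < NUM_BLOCKS; b++) {
        assert(pred[b] < b);
        a->avin[b] = pred[b] < 0 ? 0u : a->avout[pred[b]];
        a->avout[b] = a->avloc[b] | (a->avin[b] & ~a->kill[b]);
    }
}

/* Record the implicit set the way the comment describes: as local
   information of bb1. */
void record_implicit_set_in_bb1(struct avail *a)
{
    a->avloc[1] |= IMPLICIT_SET;
}
```

After `record_implicit_set_in_bb1` and `compute_available`, the bit is set in avout[1] but not in avin[1]. A propagation step that only consults avin therefore never sees the fact while processing bb1 itself.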


[Bug rtl-optimization/44025] Multiple load 0 to register

2011-11-01 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44025

--- Comment #4 from amker.cheng  2011-11-02 06:03:56 UTC ---
I noticed that for the attached reduced test case "reduced_test.c",
the cse pass can eliminate such redundant load-constant instructions.
But since cse works on extended basic blocks rather than globally,
it can do nothing for the original case.

The questions are:
1, why pre does not do this optimization;
2, if pre does do the work, the live range of r0 is surely extended, which
might harm register allocation;

I also found regcprop.c, a peephole pass that eliminates redundant
register moves. It should be able to handle redundant constant load insns if
we:
a) extend it in a value numbering way, at least for these constant values;
b) extend it in a global data analysis way;

Such a change might also impact the scheduling pass, and I am not sure how
much benefit there is for common code.


[Bug rtl-optimization/44025] Multiple load 0 to register

2011-11-01 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44025

--- Comment #5 from amker.cheng  2011-11-02 06:05:23 UTC ---
Created attachment 25687
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25687
reduced test case which can be handled by cse pass


[Bug rtl-optimization/52804] IRA/RELOAD allocate wrong register on ARM for cortex-m0

2012-05-14 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52804

--- Comment #6 from amker.cheng  2012-05-15 02:15:59 UTC ---
No regressions have been reported on trunk so far, so I backported it to the
4.7 branch.


[Bug middle-end/51867] GCC generates inconsistent code for same sources calling builtin calls, like sqrtf

2012-06-17 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51867

--- Comment #5 from amker.cheng  2012-06-18 
02:03:21 UTC ---
Should be fixed.


[Bug middle-end/53922] New: VRP: semantic conflict between range_includes_zero_p and value_inside_range

2012-07-10 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53922

 Bug #: 53922
   Summary: VRP: semantic conflict between range_includes_zero_p
and value_inside_range
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com


In tree-vrp.c
function value_inside_range returns:
  1 if VAL is inside value range VR (VR->MIN <= VAL <= VR->MAX),
  0 if VAL is not inside VR,
 -2 if we cannot tell either way.

While function range_includes_zero_p does:
 return (value_inside_range (zero, vr) == 1);
which is bogus, because when value_inside_range returns -2 there is still the
possibility that the value range includes zero.
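
A reduced model of the problem, assuming a simplified range type. The function names mirror those in tree-vrp.c, but the bodies below are stand-ins written for illustration, and `range_may_include_zero_p` is a hypothetical conservative variant, not the committed fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a value range whose bounds may not be comparable with a
   constant (for example, the address of a weak symbol). */
struct toy_range {
    bool comparable;   /* false: comparisons against constants are unknown */
    long min, max;
};

/* Same contract as described above: 1 inside, 0 outside, -2 unknown. */
int value_inside_range(long val, const struct toy_range *vr)
{
    if (!vr->comparable)
        return -2;
    assert(vr->min <= vr->max);
    return (vr->min <= val && val <= vr->max) ? 1 : 0;
}

/* The check as quoted: "unknown" is folded into "does not include zero". */
bool range_includes_zero_p(const struct toy_range *vr)
{
    return value_inside_range(0, vr) == 1;
}

/* Hypothetical conservative variant: only a definite 0 means "excluded". */
bool range_may_include_zero_p(const struct toy_range *vr)
{
    return value_inside_range(0, vr) != 0;
}
```

For a range with `comparable == false`, `range_includes_zero_p` answers false while `range_may_include_zero_p` answers true. A caller that treats "false" as proof of non-zero is only safe with the second form.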

For example:

int x(int a)
{
    return a;
}
int y(int a) __attribute__ ((weak));
int (*scan_func)(int);
extern int g;
int g = 0;
int main()
{
    if (g)
        scan_func = x;
    else
        scan_func = y;

    if (scan_func)
        g = scan_func(10);

    return 0;
}

compiled with command line:
arm-none-eabi-gcc -mthumb -mcpu=cortex-m3 -Os -S test.c -o test.S
-fdump-tree-all

The dump of vrp2 pass is:
main ()
{
  int (*) (int) cstore.6;
  int g.2;
  int g.0;

:
  g.0_1 = g;
  if (g.0_1 != 0)
goto ;
  else
goto ;

:

:
  # cstore.6_9 = PHI 
  scan_func = cstore.6_9;
  g.2_4 = cstore.6_9 (10);
  g = g.2_4;
  return 0;

}

Though the problem shows up with this case on the gcc 4.6 branch with the -Os
option on ARM, I think it exists in 4.7/4.8 too, just concealed by different
gimple statements.

I will work out a patch for this.


[Bug middle-end/53922] VRP: semantic conflict between range_includes_zero_p and value_inside_range

2012-07-11 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53922

--- Comment #2 from amker.cheng  2012-07-11 08:03:11 UTC ---
Yes, the dump before pass vrp2 is like:
main ()
{
  int (*) (int) cstore.6;
  int g.2;
  int g.0;

:
  g.0_1 = g;
  if (g.0_1 != 0)
goto ;
  else
goto ;

:

:
  # cstore.6_9 = PHI 
  scan_func = cstore.6_9;
  if (cstore.6_9 != 0B)
goto ;
  else
goto ;

:
  g.2_4 = cstore.6_9 (10);
  g = g.2_4;

:
  return 0;

}

gcc parses "# cstore.6_9 = PHI " and asserts that cstore.6_9 is
non-zero, then folds the predicate cstore.6_9 != 0B to 1, which is wrong
because the weak symbol y could be zero.


[Bug middle-end/53922] VRP: semantic conflict between range_includes_zero_p and value_inside_range

2012-07-11 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53922

--- Comment #3 from amker.cheng  2012-07-11 10:12:24 UTC ---
vrp processes PHI node " # cstore.6_9 = PHI " in calling sequence:
vrp_visit_phi_node
  -> vrp_meet

When gcc gives up in function vrp_meet, it executes following code to derive an
anti-range against zero:

give_up:
  /* Failed to find an efficient meet.  Before giving up and setting
 the result to VARYING, see if we can at least derive a useful
 anti-range.  FIXME, all this nonsense about distinguishing
 anti-ranges from ranges is necessary because of the odd
 semantics of range_includes_zero_p and friends.  */
  if (!symbolic_range_p (vr0)
  && ((vr0->type == VR_RANGE && !range_includes_zero_p (vr0))
  || (vr0->type == VR_ANTI_RANGE && range_includes_zero_p (vr0)))
  && !symbolic_range_p (vr1)
  && ((vr1->type == VR_RANGE && !range_includes_zero_p (vr1))
  || (vr1->type == VR_ANTI_RANGE && range_includes_zero_p (vr1
{
  set_value_range_to_nonnull (vr0, TREE_TYPE (vr0->min));

  /* Since this meet operation did not result from the meeting of
 two equivalent names, VR0 cannot have any equivalences.  */
  if (vr0->equiv)
bitmap_clear (vr0->equiv);
}

Here vr0 is for "x" in the source code, while vr1 is for "y", which is a weak
symbol.

Function range_includes_zero_p checks whether vr1 includes zero by calling
value_inside_range. value_inside_range works well and returns -2 because of
the WEAK symbol. After that, range_includes_zero_p checks whether the return
value of value_inside_range equals 1, which it does not. Finally, in vrp_meet,
the condition "((vr1->type == VR_RANGE && !range_includes_zero_p (vr1))" holds,
resulting in gcc asserting that cstore.6_9 is non-zero.

Am I missing something?


[Bug rtl-optimization/54133] New: regrename introduces additional dependencies

2012-07-30 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54133

 Bug #: 54133
   Summary: regrename introduces additional dependencies
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com


With test program below:
typedef struct
{
double X, Y;
} Point;

typedef struct
{
Point p1;
Point c1;
Point c2;
Point p2;
} Curve;


double bar(double t, double p0, double p1, double p2, double p3);
void foo( Curve *curve, int count )
{
int n;
int step;
Point point;
Curve c0;
double t;
for ( n = 0; n < count; ++n )
{
c0 = curve[n];

for ( step = 0; step < (10); ++step )
{
t = ((double)(step)) / (double)(10);
point.X = bar( t, c0.p1.X, c0.c1.X, c0.c2.X, c0.p2.X );
point.Y = bar( t, c0.p1.Y, c0.c1.Y, c0.c2.Y, c0.p2.Y );
}
}
}

Compiled with command line:
arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os -frename-registers -S

The dumps before and after regrename look like:
1. before regrename:
(insn 157 80 158 4 (set (reg:SI 4 r4 [180])
(reg:SI 0 r0)) ../office_pointio.E:29 187 {*thumb1_movsi_insn}
 (expr_list:REG_DEAD (reg:SI 0 r0)
(nil)))

(insn 158 157 147 4 (set (reg:SI 5 r5 [+4 ])
(reg:SI 1 r1 [+4 ])) ../office_pointio.E:29 187 {*thumb1_movsi_insn}
 (expr_list:REG_DEAD (reg:SI 1 r1 [+4 ])
(nil)))

(insn 147 158 83 4 (set (reg:DF 2 r2)
(mem/c:DF (plus:SI (reg/f:SI 13 sp)
(const_int 40 [0x28])) [6 %sfp+-56 S8 A64]))
../office_pointio.E:30 205 {*thumb_movdf_insn}
 (nil))

(insn 83 147 148 4 (set (mem:DF (reg/f:SI 13 sp) [0 S8 A64])
(reg:DF 2 r2)) ../office_pointio.E:30 205 {*thumb_movdf_insn}
 (expr_list:REG_DEAD (reg:DF 2 r2)
(nil)))

(insn 148 83 84 4 (set (reg:DF 2 r2)
(mem/c:DF (plus:SI (reg/f:SI 13 sp)
(const_int 56 [0x38])) [6 %sfp+-40 S8 A64]))
../office_pointio.E:30 205 {*thumb_movdf_insn}
 (nil))

(insn 84 148 149 4 (set (mem:DF (plus:SI (reg/f:SI 13 sp)
(const_int 8 [0x8])) [0 S8 A64])
(reg:DF 2 r2)) ../office_pointio.E:30 205 {*thumb_movdf_insn}
 (expr_list:REG_DEAD (reg:DF 2 r2)
(nil)))

(insn 149 84 85 4 (set (reg:DF 2 r2)
(mem/c:DF (plus:SI (reg/f:SI 13 sp)
(const_int 72 [0x48])) [6 %sfp+-24 S8 A64]))
../office_pointio.E:30 205 {*thumb_movdf_insn}
 (nil))

(insn 85 149 159 4 (set (mem:DF (plus:SI (reg/f:SI 13 sp)
(const_int 16 [0x10])) [0 S8 A64])
(reg:DF 2 r2)) ../office_pointio.E:30 205 {*thumb_movdf_insn}
 (expr_list:REG_DEAD (reg:DF 2 r2)
(nil)))

(insn 159 85 160 4 (set (reg:SI 0 r0)
(reg:SI 4 r4 [180])) ../office_pointio.E:30 187 {*thumb1_movsi_insn}
 (nil))

(insn 160 159 87 4 (set (reg:SI 1 r1 [+4 ])
(reg:SI 5 r5 [+4 ])) ../office_pointio.E:30 187 {*thumb1_movsi_insn}
 (nil))

2. after regrename:
(insn 157 80 158 4 (set (reg:SI 4 r4 [180])
(reg:SI 0 r0)) ../office_pointio.E:29 187 {*thumb1_movsi_insn}
 (expr_list:REG_DEAD (reg:SI 0 r0)
(nil)))

(insn 158 157 147 4 (set (reg:SI 5 r5 [+4 ])
(reg:SI 1 r1 [+4 ])) ../office_pointio.E:29 187 {*thumb1_movsi_insn}
 (expr_list:REG_DEAD (reg:SI 1 r1 [+4 ])
(nil)))

(insn 147 158 83 4 (set (reg:DF 0 r0)
(mem/c:DF (plus:SI (reg/f:SI 13 sp)
(const_int 40 [0x28])) [6 %sfp+-56 S8 A64]))
../office_pointio.E:30 205 {*thumb_movdf_insn}
 (nil))

(insn 83 147 148 4 (set (mem:DF (reg/f:SI 13 sp) [0 S8 A64])
(reg:DF 0 r0)) ../office_pointio.E:30 205 {*thumb_movdf_insn}
 (expr_list:REG_DEAD (reg:DF 2 r2)
(nil)))

(insn 148 83 84 4 (set (reg:DF 2 r2)
(mem/c:DF (plus:SI (reg/f:SI 13 sp)
(const_int 56 [0x38])) [6 %sfp+-40 S8 A64]))
../office_pointio.E:30 205 {*thumb_movdf_insn}
 (nil))

(insn 84 148 149 4 (set (mem:DF (plus:SI (reg/f:SI 13 sp)
(const_int 8 [0x8])) [0 S8 A64])
(reg:DF 2 r2)) ../office_pointio.E:30 205 {*thumb_movdf_insn}
 (expr_list:REG_DEAD (reg:DF 2 r2)
(nil)))

(insn 149 84 85 4 (set (reg:DF 1 r1)
(mem/c:DF (plus:SI (reg/f:SI 13 sp)
(const_int 72 [0x48])) [6 %sfp+-24 S8 A64]))
../office_pointio.E:30 205 {*thumb_movdf_insn}
 (nil))

(insn 85 149 159 4 (set (mem:DF (plus:SI (reg/f:SI 13 sp)
(const_int 16 [0x10])) [0 S8 A64])
(reg:DF 1 r1)) ../office_pointio.E:30 205 {*thumb_movdf_insn}
 (expr_list:REG_DEAD (reg:DF 2 r2)
(nil)))

(insn 159 85 160 4 (set (reg:SI 0 r0)
(reg:SI 4 r4 [180])) ../office_pointio.E:30 187 {*thumb1_movsi_insn}
 (nil))

(insn 160 159 87 4 (set (reg:SI 1 r1 [+4 ])
(reg:SI 5 r5 [+4 ])) ../office_pointio.E:30 187 {*

[Bug target/52412] another unnecessary register move on arm

2012-07-31 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52412

amker.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot
   ||com

--- Comment #2 from amker.cheng  2012-07-31 14:12:54 UTC ---
The register move insn is generated by the cse2 pass, and after that there is
no cprop pass until IRA.
The two allocnos for r6/r3 (the original pseudos) conflict with each other;
though they contain the same value and are connected by a move insn, IRA cannot
allocate the same hard register for them.
Moreover, the case is compiled with -Os, where gcc does IRA in a single whole
region, so the live range cannot be split either.

[Bug rtl-optimization/54133] regrename introduces additional dependencies

2012-08-01 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54133

--- Comment #2 from amker.cheng  2012-08-01 07:49:51 UTC ---
I measured this kind of regression in the CSiBE benchmark on
arm-none-eabi/cortex-m0 with -Os optimization. It turns out that most of them
are related to parameter/return register moves, like the reported case.

The logic is:
STEP1: At the prologue or after a call_insn, gcc saves parameter (or return)
registers in pseudos, then loads them from the pseudos when they are needed
(for example, when calling another function with the parameter).
For example:
{
  rx <- r0
  ...
  ...
  r0 <- rx
  call another function
}

If the instructions between the save and the use do not clobber the parameter
register, the hard register can be propagated to remove one redundant move
instruction.

STEP2: Copy propagation before IRA just ignores hard registers, so usually this
can only be done in regcprop.c after IRA.

BUT,
STEP3: Register renaming does not honor any propagation opportunities and may
use r0 for renaming, which introduces additional dependencies. It's a common
regression because regrename always selects the renaming register by scanning
from 0 to FIRST_PSEUDO_REGISTER.


In an experiment, if I exclude r0/r1 from renaming, most of the regressions
observed in CSiBE are gone.

So how should this be fixed? Thanks.
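
The effect described in STEP3 and the r0/r1 experiment can be sketched with a toy register picker. This is not regrename's code: it only models "take the lowest-numbered free hard register", with an optional mask of registers to avoid. The masks and the function name are assumptions made for the example.

```c
#include <assert.h>

#define NUM_HARD_REGS 8   /* low registers r0-r7 */

/* Return the lowest-numbered register that is free and not excluded,
   or -1 if there is none.  Bit n of each mask stands for rn. */
int pick_rename_reg(unsigned free_mask, unsigned excluded_mask)
{
    unsigned candidates = free_mask & ~excluded_mask;

    assert((free_mask >> NUM_HARD_REGS) == 0);
    for (int r = 0; r < NUM_HARD_REGS; r++)
        if (candidates & (1u << r))
            return r;
    return -1;
}
```

With r0-r3 free (mask 0x0f) the picker returns 0, reusing the argument register and blocking a later propagation of r0; with r0/r1 excluded (mask 0x03) it returns 2 instead.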


[Bug rtl-optimization/54133] regrename introduces additional dependencies

2012-08-01 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54133

--- Comment #5 from amker.cheng  2012-08-01 13:48:50 UTC ---
Thanks for your patch. IMHO, I don't think the problem can be fixed in this
way, because:
1. 
   78 r177:DF=r0:DF
   80 [sp:SI]=r166:DF
   81 [sp:SI+0x8]=r168:DF
   82 [sp:SI+0x10]=r170:DF
   84 r2:DF=r164:DF
   85 r0:DF=call [`bar'] argc:0x18
  REG_DEAD: r2:DF
  REG_UNUSED: r0:DF
   86 [sp:SI]=r167:DF
   87 [sp:SI+0x8]=r169:DF
   88 [sp:SI+0x10]=r171:DF
   89 r0:DF=r177:DF
  REG_DEAD: r177:DF
   90 r2:DF=r165:DF
   91 r0:DF=call [`bar'] argc:0x18

The propagation actually increases register pressure from insn 78 to insn 85,
since r177 and r0 are both live now.
Maybe IRA makes a better decision in this case by spilling r177, but I doubt
that holds in general.

2. The reported case is somewhat special, with all related insns confined to
one basic block. In other cases, like the one described in comment 2, the hard
register is saved in the prologue, so the propagation crosses basic blocks.

Anyway, one thing is clear: the problem is closely connected with
parameter/return register moves.


[Bug rtl-optimization/54133] regrename introduces additional dependencies

2012-08-02 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54133

--- Comment #7 from amker.cheng  2012-08-02 
10:18:41 UTC ---
(In reply to comment #6)
> > In experiment, if I disable r0/r1 from renaming, most regressions observed 
> > in
> > CSiBE are gone.
> > 
> > So how should this be fixed? Thanks.
> 
> The choice of the renaming register can be parameterized at the class level,
> but I'm not sure this would work here.  You could also try to add some
> additional heuristics for this choice, as it seems to be clearly
> counter-productive here.

My bad, I did not give details of the method of excluding r0/r1 from
renaming.
Compared to trunk (where regrename is disabled for Os), the method fixes
most of the regrename regressions, which is good.
But it is too conservative, so some renaming opportunities are missed. In
terms of code size, the data show that this method has 700/440 bytes of
benefit/regression against the current implementation of regrename. This means
only about 250 bytes of benefit overall.
The data is collected from CSiBE on arm cortex-m0.

Given that the regressions may cross basic blocks, it's hard to fix them in
regrename without missing renaming opportunities.

Is it possible to run the regcprop pass both before and after regrename?


[Bug target/51835] ARM EABI violation when passing arguments to helper floating functions like __aeabi_d2iz

2012-02-05 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51835

--- Comment #6 from amker.cheng  2012-02-06 
05:51:25 UTC ---
(In reply to comment #5)
> (In reply to comment #2)
> > This is only applicable to the 4.6 branch and trunk since support for the
> > Cortex M4 wasn't added till 4.6. 
> > 
> > cheers
> > Ramana
> 
> Maybe the Cortex M4 wasn't added until 4.6, but the other options are 
> permitted
> by 4.5 and I can easily get 4.5 to produce wrong-looking code.  With -O2
> -mfloat-abi=hard -mfpu=fpv4-sp-d16 -march=armv7-a -marm I see the following
> code generation difference between 4.5 and 4.6:
> 
> @@ -22,8 +22,9 @@
> @ frame_needed = 0, uses_anonymous_args = 0
> stmfd   sp!, {r3, lr}
> bl  __aeabi_f2d
> +   fmrrd   r0, r1, d0
> bl  __aeabi_d2iz
> ldmfd   sp!, {r3, pc}
> .size   func, .-func
> -   .ident  "GCC: (GNU) 4.5.4 20120126 (prerelease)"
> +   .ident  "GCC: (GNU) 4.6.3 20120203 (prerelease)"
> .section    .note.GNU-stack,"",%progbits
> 
> Backporting r183734 from 4.6 to 4.5 makes 4.5 generate the same code as 4.6,
> i.e., with the fmrrd between the two calls.

Besides this patch, Julian Brown's patch r174803 is necessary too.

For now:
1. arguments for both __aeabi_f2d and __aeabi_d2iz are wrong in 4.5;
2. arguments for __aeabi_f2d are wrong in 4.6.
To solve this, we have to:
1. backport r183734 and r174803 to 4.5;
2. backport r174803 to 4.6.


[Bug tree-optimization/43491] Unnecessary temporary for global register variable

2012-02-16 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43491

--- Comment #8 from amker.cheng  2012-02-17 
03:55:24 UTC ---
(In reply to comment #7)
> With tree hoisting we generate
> 
> :
>   pretmp.5_19 = data_0;
>   pretmp.5_20 = data_3;
>   i_21 = pretmp.5_19 + pretmp.5_20;
>   if (data_3(D) != 0)
> goto ;
>   else
> goto ;
> 
> :
> 
> :
>   # v_1 = PHI 
>   # i_2 = PHI 
>   D.1719_14 = v_1 * i_21;
>   D.1718_15 = i_2 * D.1719_14;
>   return D.1718_15;
> 
> instead of
> 
> :
>   if (data_3(D) != 0)
> goto ;
>   else
> goto ;
> 
> :
>   pretmp.5_19 = data_0;
>   pretmp.5_21 = data_3;
>   i_23 = pretmp.5_19 + pretmp.5_21;
>   goto ;
> 
> :
>   data_0.0_4 = data_0;
>   data_3.1_5 = data_3;
>   i_6 = data_0.0_4 + data_3.1_5;
> 
> :
>   # v_1 = PHI 
>   # i_2 = PHI 
>   # i_24 = PHI 
>   D.1719_14 = v_1 * i_24;
>   D.1718_15 = i_2 * D.1719_14;
>   return D.1718_15;
> 
> }
> 
> I suppose that's good enough?  See that PRE still inserts loads from
> register variables, not sure if you'd want to disallow that as well.

I think the reason gcc inserts loads from a global register variable is that
gcc treats loads/uses of such a variable as memory references. If I am right,
it seems to be an SSA issue rather than a PRE one.
As for the original bug, it is caused by loading a const global register
variable and then using the loaded SSA name across function calls (this step is
done by PRE), which introduces an unnecessary register conflict. I guess the
load itself won't hurt, but I am not sure whether hoisting will (as PRE did
before).

BTW, I did not get the hoisted code on trunk. Is it a patch you are working
on?

Thanks.


[Bug middle-end/37780] Conditional expression with __builtin_clz() should be optimized out

2012-03-20 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37780

--- Comment #2 from amker.cheng  2012-03-20 
07:58:09 UTC ---
The special case could easily be detected when gimplifying.
But actually I am not sure whether it can be done in the middle end at all,
since the middle end should not depend on any target information, like
CLZ_DEFINED_VALUE_AT_ZERO, right?
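
For reference, the kind of guarded expression the bug title refers to
presumably looks like the sketch below. The test case from the report is not
quoted in this comment, so the function name clz_or_width and the exact shape
of the expression are assumptions.

```c
#include <assert.h>
#include <limits.h>

/* Width of unsigned int in bits (32 on the ARM targets discussed here).  */
#define UINT_BITS ((int) (sizeof (unsigned int) * CHAR_BIT))

/* __builtin_clz (0) is undefined, so portable code guards the zero case.
   On a target whose clz instruction already yields UINT_BITS for a zero
   input, the guard is redundant and could in principle be removed.  */
static int
clz_or_width (unsigned int x)
{
  assert (UINT_BITS > 0);
  return x != 0 ? __builtin_clz (x) : UINT_BITS;
}
```

Whether the conditional can be dropped depends on what the target's clz
instruction returns for zero, which is exactly the target information in
question.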


[Bug target/52804] New: IRA/RELOAD allocate wrong register on ARM for cortex-m0

2012-03-31 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52804

 Bug #: 52804
   Summary: IRA/RELOAD allocate wrong register on ARM for
cortex-m0
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com


For the following code:
void foo(unsigned char ** i, char *** o,
 unsigned int row, int num);
extern signed long tab[];
extern unsigned int w;
void foo(unsigned char ** i, char *** o,
 unsigned int row, int num)
{
  register int r, g, b;
  register signed long * t = tab;
  register char * pi;
  register char * o0;
  register char * o1;
  register unsigned int c;
  unsigned int n = w;

  while (--num >= 0) {
    pi = *i++;
    o0 = o[0][row];
    o1 = o[1][row];
    row++;
    for (c = 0; c < n; c++) {
      r = ((int) (pi[0]));
      g = ((int) (pi[1]));
      b = ((int) (pi[2]));
      pi += 3;

      o0[c] = (unsigned char)
        ((t[r] + t[g] + t[b]));
      o1[c] = (unsigned char)
        ((t[r] + t[g] + t[b]));
    }
  }
}
Compile it with the following command:
$ arm-none-eabi-gcc -S -mthumb -mcpu=cortex-m0 -O2 -o foo.S foo.c

Compare the ira and reload dumps below:
/*
dump of ira:

(insn 82 81 83 3 (set (reg/f:SI 281 [ *o_15(D) ])
(mem/f:SI (reg/v/f:SI 315 [orig:275 o ] [275]) [2 *o_15(D)+0 S4 A32]))
./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn}
 (expr_list:REG_EQUIV (mem/f:SI (reg/v/f:SI 315 [orig:275 o ] [275]) [2
*o_15(D)+0 S4 A32])
(nil)))

(insn 83 82 84 3 (set (reg/v/f:SI 198 [ o0 ])
(mem/f:SI (plus:SI (reg/f:SI 281 [ *o_15(D) ])
(reg:SI 273 [ D.4183 ])) [2 *D.4088_18+0 S4 A32]))
./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn}
 (expr_list:REG_DEAD (reg/f:SI 281 [ *o_15(D) ])
(nil)))

(insn 84 83 85 3 (set (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ])
(mem/f:SI (plus:SI (reg/v/f:SI 315 [orig:275 o ] [275])
(const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4
A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:19 186 {*thumb1_movsi_insn}
 (expr_list:REG_EQUIV (mem/f:SI (plus:SI (reg/v/f:SI 315 [orig:275 o ]
[275])
(const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4
A32])
(nil)))

(insn 85 84 171 3 (set (reg/v/f:SI 201 [ o1 ])
(mem/f:SI (plus:SI (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ])
(reg:SI 273 [ D.4183 ])) [2 *D.4091_23+0 S4 A32]))
./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:19 186 {*thumb1_movsi_insn}
 (expr_list:REG_DEAD (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ])
(expr_list:REG_DEAD (reg:SI 273 [ D.4183 ])
(nil))))


dump of reload:

(note 82 81 207 3 NOTE_INSN_DELETED)

(insn 207 82 208 3 (set (reg:SI 6 r6)
(reg/v/f:SI 9 r9 [orig:275 o ] [275]))
./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn}
 (nil))

(insn 208 207 209 3 (set (reg:SI 6 r6)
(mem/f:SI (reg:SI 6 r6) [2 *o_15(D)+0 S4 A32]))
./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn}
 (nil))

(insn 209 208 210 3 (set (reg:SI 7 r7)
(mem/f:SI (plus:SI (reg:SI 6 r6)
(reg:SI 3 r3 [orig:273 D.4183 ] [273])) [2 *D.4088_18+0 S4
A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn}
 (nil))

(insn 210 209 84 3 (set (reg/v/f:SI 12 ip [orig:198 o0 ] [198])
(reg:SI 7 r7)) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186
{*thumb1_movsi_insn}
 (nil))

(note 84 210 211 3 NOTE_INSN_DELETED)

(insn 211 84 85 3 (set (reg:SI 0 r0)
(mem/f:SI (plus:SI (reg:SI 6 r6)
(const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4
A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:19 186 {*thumb1_movsi_insn}
 (nil))

(insn 85 211 171 3 (set (reg/v/f:SI 7 r7 [orig:201 o1 ] [201])
(mem/f:SI (plus:SI (reg:SI 0 r0)
(reg:SI 3 r3 [orig:273 D.4183 ] [273])) [2 *D.4091_23+0 S4
A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:19 186 {*thumb1_movsi_insn}
 (nil))

*/
Obviously, r6 is corrupted in insn 208, while it is still used in insn 211.
A piece of the generated assembly code follows:

foo:
    push    {r4, r5, r6, r7, lr}
    mov     r5, r9
    mov     r7, fp
    mov     r6, sl
    mov     r4, r8
    push    {r4, r5, r6, r7}
    mov     sl, r3
    lsl     r2, r2, #2
    ldr     r3, .L11
    sub     r2, r2, r0
    sub     sp, sp, #20
    mov     r9, r1  *step1
    sub     r2, r2, #4
    ldr     r1, [r3]
    ldr     r5, .L11+4
    mov     fp, r0
    str     r2, [sp, #12]
.L8:
    mov     r6, sl
    sub     r6, r6, #1
    mov     sl, r6
    bmi     .L10
.L7:
    mov     r0, fp
    ldr     r4, [sp, #12]
    add     r0, r0, #4
    mov     r6, r9  *step2
    mov     fp, r0
    ldr     r6, [r6]  *step3, r6 corrupted
    mov     r3, r4
    add     r3, 

[Bug target/52804] IRA/RELOAD allocate wrong register on ARM for cortex-m0

2012-04-03 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52804

--- Comment #1 from amker.cheng  2012-04-03 
16:43:30 UTC ---
For insns before ira:


(insn 82 81 83 3 (set (reg/f:SI 281 [ *o_15(D) ])
(mem/f:SI (reg/v/f:SI 315 [orig:275 o ] [275]) [2 *o_15(D)+0 S4 A32]))
pr52804.c:18 186 {*thumb1_movsi_insn}
 (expr_list:REG_EQUIV (mem/f:SI (reg/v/f:SI 315 [orig:275 o ] [275]) [2
*o_15(D)+0 S4 A32])
(nil)))

(insn 83 82 84 3 (set (reg/v/f:SI 198 [ o0 ])
(mem/f:SI (plus:SI (reg/f:SI 281 [ *o_15(D) ])
(reg:SI 273 [ D.4183 ])) [2 *D.4088_18+0 S4 A32])) pr52804.c:18
186 {*thumb1_movsi_insn}
 (expr_list:REG_DEAD (reg/f:SI 281 [ *o_15(D) ])
(nil)))

(insn 84 83 85 3 (set (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ])
(mem/f:SI (plus:SI (reg/v/f:SI 315 [orig:275 o ] [275])
(const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4
A32])) pr52804.c:19 186 {*thumb1_movsi_insn}
 (expr_list:REG_EQUIV (mem/f:SI (plus:SI (reg/v/f:SI 315 [orig:275 o ]
[275])
(const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4
A32])
(nil)))

(insn 85 84 171 3 (set (reg/v/f:SI 201 [ o1 ])
(mem/f:SI (plus:SI (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ])
(reg:SI 273 [ D.4183 ])) [2 *D.4091_23+0 S4 A32])) pr52804.c:19
186 {*thumb1_movsi_insn}
 (expr_list:REG_DEAD (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ])
(expr_list:REG_DEAD (reg:SI 273 [ D.4183 ])
(nil))))

The registers allocated are:
r315 -> r9
r281 -> mem
r273 -> r3
r198 -> r12
r201 -> r7

The insns that need reloads are:
insn 82 (deleted)
insn 84 (deleted)
insn 83
insn 85

The corresponding dump info of reload pass is like:

Reloads for insn # 83
Reload 0: reload_in (SI) = (reg/v/f:SI 9 r9 [orig:275 o ] [275])
BASE_REGS, RELOAD_FOR_INPADDR_ADDRESS (opnum = 1)
reload_in_reg: (reg/v/f:SI 9 r9 [orig:275 o ] [275])
reload_reg_rtx: (reg:SI 6 r6)
Reload 1: reload_in (SI) = (mem/f:SI (reg/v/f:SI 9 r9 [orig:275 o ] [275]) [2
*o_15(D)+0 S4 A32])
LO_REGS, RELOAD_FOR_INPUT_ADDRESS (opnum = 1), can't combine
reload_in_reg: (reg/f:SI 281 [ *o_15(D) ])
reload_reg_rtx: (reg:SI 6 r6)
Reload 2: LO_REGS, RELOAD_FOR_INPUT_ADDRESS (opnum = 1), can't combine,
secondary_reload_p
reload_reg_rtx: (reg:SI 7 r7)
Reload 3: reload_in (SI) = (mem/f:SI (plus:SI (reg/f:SI 281 [ *o_15(D) ])
(reg:SI 3 r3 [orig:273
D.4183 ] [273])) [2 *D.4088_18+0 S4 A32])
CORE_REGS, RELOAD_FOR_INPUT (opnum = 1)
reload_in_reg: (mem/f:SI (plus:SI (reg/f:SI 281 [ *o_15(D) ])
(reg:SI 3 r3 [orig:273
D.4183 ] [273])) [2 *D.4088_18+0 S4 A32])
reload_reg_rtx: (reg/v/f:SI 12 ip [orig:198 o0 ] [198])
secondary_in_reload = 2

Reloads for insn # 85
Reload 0: reload_in (SI) = (reg/v/f:SI 9 r9 [orig:275 o ] [275])
BASE_REGS, RELOAD_FOR_OPADDR_ADDR (opnum = 1)
reload_in_reg: (reg/v/f:SI 9 r9 [orig:275 o ] [275])
reload_reg_rtx: (reg:SI 6 r6)
Reload 1: reload_in (SI) = (mem/f:SI (plus:SI (reg/v/f:SI 9 r9 [orig:275 o ]
[275])
(const_int 4 [0x4])) [2
MEM[(char * * *)o_15(D) + 4B]+0 S4 A32])
LO_REGS, RELOAD_FOR_OPERAND_ADDRESS (opnum = 1), can't combine
reload_in_reg: (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ])
reload_reg_rtx: (reg:SI 0 r0)

We can see that, after reload, the insn sequences for insns 83/85 should be:
insn 83:
  r6 = r9
  r6 = [r6]
  r7 = [r6 + r3]
  r12 = r7
insn 85:
  r6 = r9
  r0 = [r6 + 4]
  r7 = [r0 + r3]

***BUT***
The problem is:
RELOAD records wrong inheritance information when reloading insn 83, i.e.,
reload assumes that r9 has been reloaded into r6 and that r6 is valid for
inheritance when reloading insn 85. This results in using r6, which has already
been corrupted.

After looking into reload, I think function reload_reg_reaches_end_p misses
the following case:
rld[0]
  in : r9
  reg_rtx : r6
  when_needed : RELOAD_FOR_INPADDR_ADDRESS
rld[1]
  in : [r9]
  reg_rtx : r6
  when_needed : RELOAD_FOR_INPUT_ADDRESS

In this case, the call "reload_reg_reaches_end_p(regno(=6), reloadnum(=0))"
should return 0, rather than 1 as it does now, because r6 used in rld[0] is
clobbered by rld[1].
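
The missed case can be illustrated with a small toy model. This is not the
code in reload1.c: struct toy_reload, its integer "order" field (a stand-in
for the emission order implied by when_needed) and toy_reaches_end_p are all
hypothetical names used only for this sketch.

```c
#include <assert.h>

/* Toy model of the situation, not GCC's reload code.  ORDER is a
   hypothetical stand-in for when_needed: a smaller value means the
   reload is emitted earlier for the same insn.  */
struct toy_reload
{
  int regno;   /* hard register chosen as reload_reg_rtx */
  int order;   /* emission order within the reloaded insn */
};

/* Return 1 if the value loaded by reload R is still in its register once
   all reloads of the insn have been emitted, i.e. no later reload writes
   the same register; return 0 otherwise.  */
static int
toy_reaches_end_p (const struct toy_reload *rld, int n, int r)
{
  assert (r >= 0 && r < n);
  for (int i = 0; i < n; i++)
    if (i != r
        && rld[i].regno == rld[r].regno
        && rld[i].order > rld[r].order)
      return 0;
  return 1;
}

/* The four reloads of insn 83 from the dump: r6, r6, r7, ip (r12).  */
static const struct toy_reload insn83_reloads[4] =
  { { 6, 0 }, { 6, 1 }, { 7, 2 }, { 12, 3 } };
```

In this model reload 0 does not reach the end because reload 1 overwrites r6
afterwards, so r6 must not be recorded as still holding r9 for inheritance.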


[Bug target/52804] IRA/RELOAD allocate wrong register on ARM for cortex-m0

2012-04-16 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52804

--- Comment #2 from amker.cheng  2012-04-16 
09:00:08 UTC ---
Any comments?
Or could anyone help me confirm this issue?
Thanks very much.


[Bug rtl-optimization/55190] [SH] ivopts causes loop setup bloat

2013-09-30 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55190

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #3 from bin.cheng  ---
ARM can benefit from the doloop structure too, but it is implemented in a
different way. The ARM backend defines a special addsi_compare pattern and lets
the combine pass merge the decrement and comparison instructions, thus saving
the comparison instruction.

IVOPT can be improved to select two iv candidates for the example loop: an
auto-increment one for the memory access and a decrementing one for the loop
exit check. This is especially good for targets that support both doloop and
auto-increment instructions, like ARM and SH.

BUT most hand-written loops have an incrementing basic iv, so IVOPT depends on
the previous pass ivcanon to rewrite it into a decrementing iv, like below:

for (i = 0; i < 100; i++)
  //loop body

=>
for (i = 100; i > 0; i--)
  //modified loop body

Unfortunately, the ivcanon pass only does this loop transformation for loops
that iterate a constant number of times.
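
As a hand-written illustration of the two forms (not output of any GCC pass),
the sketch below shows a count-up loop and an equivalent loop with a pointer
iv for the memory access plus a down-counting iv for the exit test. The
function names and the sample array are assumptions made for this example;
note that n is a runtime value, which is the case ivcanon does not handle.

```c
#include <assert.h>
#include <stddef.h>

/* Count-up form, as usually written by hand.  */
static long
sum_up (const int *a, size_t n)
{
  long s = 0;

  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Two-iv form: the pointer walks the array (a post-increment candidate),
   while the counter only drives the exit test and is compared against
   zero (a doloop candidate).  */
static long
sum_down (const int *a, size_t n)
{
  long s = 0;

  assert (a != NULL || n == 0);
  for (size_t i = n; i > 0; i--)
    s += *a++;
  return s;
}

static const int sample[5] = { 3, 1, 4, 1, 5 };
```

Both functions visit the elements in the same order, so the loop body does not
have to be rewritten when the counter is reversed.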

It seems difficult for RTL loop passes to revert a decision made by IVOPT, so I
think it should be done in GIMPLE IVOPT. I will give it a try.

Thanks.


[Bug rtl-optimization/50749] Auto-inc-dec does not find subsequent contiguous mem accesses

2013-09-30 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #15 from bin.cheng  ---
There is another scenario for this example; in this case the example:

int test_0 (char* p, int c)
{
  int r = 0;
  r += *p++;
  r += *p++;
  r += *p++;
  return r;
}

should be translated into something like:
  //...
  ldrb [rx]
  ldrb [rx+1]
  ldrb [rx+2]
  add rx, rx, #3
  //...
This way all loads are independent and can be issued in parallel on a
superscalar machine.
Actually, for targets like ARM which support post-increment by a constant
(other than the size of the memory access), it can be further changed into:
  //...
  ldrb [rx], #3
  ldrb [rx-2]
  ldrb [rx-1]
  //...
For now the auto-increment pass can't do this optimization.  I once had a patch
for this, but benchmarks showed the case is not common.

This case is common especially after loop unrolling, and rtl passes deliberately
break down the long dependence on RX, which I think is right.
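
At the source level the difference between the two shapes looks roughly like
the sketch below. It is only an illustration: test_0_indexed and sample_bytes
are names invented here, and the unused parameter c of the original test_0 is
dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: each load is followed by a pointer update, so the
   address of every load depends on the previous increment.  */
static int
test_0 (char *p)
{
  int r = 0;

  assert (p != NULL);
  r += *p++;
  r += *p++;
  r += *p++;
  return r;
}

/* Hand-rewritten equivalent: three loads off one base register with
   constant offsets, with no pointer update in between.  */
static int
test_0_indexed (const char *p)
{
  return p[0] + p[1] + p[2];
}

static char sample_bytes[3] = { 1, 2, 3 };
```

The second form corresponds to the [rx], [rx+1], [rx+2] sequence above, where
the loads do not depend on each other.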


[Bug tree-optimization/39200] ivopts slows down SciMark sparse matrix benchmark

2013-11-19 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39200

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #1 from bin.cheng  ---
This is pretty old.
I tried latest trunk with revision r205025.
gcc -O2 -march=pentium4 [-fomit-frame-pointer]
.L7:
    movl    (%esi,%eax,4), %edx
    fldl    (%edi,%edx,8)
    fmull   (%ebx,%eax,8)
    faddp   %st, %st(1)
    addl    $1, %eax
    cmpl    %ecx, %eax
    jne     .L7
gcc -O2 -march=pentium4 [-fomit-frame-pointer] -fno-ivopts
.L7:
    movl    (%esi,%eax,4), %edx
    fldl    (%edi,%edx,8)
    fmull   (%ebx,%eax,8)
    faddp   %st, %st(1)
    addl    $1, %eax
    cmpl    %eax, %ecx
    jg      .L7

Also works for default arch in my configuration.

Should this be considered fixed?


[Bug tree-optimization/59445] [4.9 Regression] ICE in add_old_iv_candidates, at tree-ssa-loop-ivopts.c:2541

2013-12-10 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445

--- Comment #13 from bin.cheng  ---
Sorry for bothering, I have reverted the patch.  Will investigate it.


[Bug tree-optimization/59445] [4.9 Regression] ICE in add_old_iv_candidates, at tree-ssa-loop-ivopts.c:2541

2013-12-10 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445

--- Comment #14 from bin.cheng  ---
I found out the root cause of this ICE and will use the simplified code given
by comment#9 as an example.

The gimple dump before IVOPT is like:

  :

  :
  # c_2 = PHI 
  __val_comp_iter (D.4949);
  p2 = D.4950;
  c_6 = c_2 + 4294967292;
  _21 = MEM[(int *)c_2 + 4294967292B];
  if (a_11(D) != 0)
goto ;
  else
goto ;

  :
  c_3 = c_2 + 4;
  goto ;

  :
  goto ;

  :
  # c_23 = PHI 
  # _24 = PHI <_29(11), _22(10)>

  :
  # c_20 = PHI 
  # c_15 = PHI 
  # _26 = PHI <_24(6), _21(5)>
  if (_26 != 0)
goto ;
  else
goto ;

  :
  D::m_fn1 (&MEM[(struct G *)&p2].MFI);
  if (_13(D) != 0)
goto ;
  else
goto ;

  :
  goto ;

  :
  *c_20 = 0;
  c_7 = c_15 + 4294967292;
  _22 = *c_7;
  goto ;

  :
  *c_20 = 0;
  c_28 = c_15 + 4294967292;
  _29 = *c_28;
  goto ;

With the patch:
STEP1: # c_20 = PHI  is recognized as an iv. 
STEP2: Since # c_15 = PHI  comes from merging conditional
branches, it shouldn't be marked as a biv in mark_bivs.
STEP3: When mark_bivs handles "# c_20 = PHI ", it should know
that this is a peeled iv and not mark either iv(c_20) or incr_iv(c_15) as bivs.

Unfortunately, the patch asserts later when adding candidates for bivs, whereas
it should add logic in mark_bivs to skip peeled ivs.

The following patch should fix this problem:
@@ -1074,7 +1074,7 @@ find_bivs (struct ivopts_data *data)
 static void
 mark_bivs (struct ivopts_data *data)
 {
-  gimple phi;
+  gimple phi, def;
   tree var;
   struct iv *iv, *incr_iv;
   struct loop *loop = data->current_loop;
@@ -1090,6 +1090,13 @@ mark_bivs (struct ivopts_data *data)
 continue;

   var = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
+  def = SSA_NAME_DEF_STMT (var);
+  /* Don't mark iv peeled from other one as biv.  */
+  if (def
+  && gimple_code (def) == GIMPLE_PHI
+  && gimple_bb (def) == loop->header)
+continue;
+
   incr_iv = get_iv (data, var);
   if (!incr_iv)
 continue;

PS, the example code can be optimized with the fixed version of the patch by
recognizing more address ivs.  I attached the generated assembly code for arm
cortex-m3.


[Bug tree-optimization/59445] [4.9 Regression] ICE in add_old_iv_candidates, at tree-ssa-loop-ivopts.c:2541

2013-12-10 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445

--- Comment #15 from bin.cheng  ---
Created attachment 31414
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31414&action=edit
The generated assembly with/without patch for code in comment #9 on cortex-m3


[Bug tree-optimization/59445] [4.9 Regression] ICE in add_old_iv_candidates, at tree-ssa-loop-ivopts.c:2541

2013-12-11 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445

--- Comment #16 from bin.cheng  ---
I fixed the reported problem and posted a new patch at
http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01159.html
Apologies that I missed java in bootstrap for the previous patch.  This version
passes bootstrap and test for c,c++,lto,fortran,java,go,objc,obj_c++ on x86_64.
 I am not sure whether the java case is covered by bootstrap or by other
applications.  If it's in another application, could anyone help verify that
the issue is addressed on apple-darwin?

Thanks.


[Bug tree-optimization/59479] New: Inlining of static function bloats code size when Os

2013-12-11 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59479

Bug ID: 59479
   Summary: Inlining of static function bloats code size when Os
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amker.cheng at gmail dot com

Created attachment 31424
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31424&action=edit
The preprocessed file for newlib/libc/stdio/findfp.c

Hi, for the attached preprocessed code from newlib/libc/stdio/findfp.c, GCC
inlines static function `std' even when optimizing with -Os.
With command line:
$ ./arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m0 -c -xc findfp.E -o findfp.o

The dumped symbols are like:
21: 0009    16 FUNC    GLOBAL DEFAULT    1 _cleanup_r
...
29: 0055   224 FUNC    GLOBAL DEFAULT    1 __sinit
...
41: 01d5    24 FUNC    GLOBAL DEFAULT    1 __fp_unlock_all

With command line:
$ ./arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m0 -c -xc findfp.E -o findfp.o
-fno-inline

The dumped symbols are like:
 9: 0018     0 NOTYPE  LOCAL  DEFAULT    1 $t
10: 0019    72 FUNC    LOCAL  DEFAULT    1 std.isra.0
...
24: 0009    16 FUNC    GLOBAL DEFAULT    1 _cleanup_r
...
36: 009d    80 FUNC    GLOBAL DEFAULT    1 __sinit

This occurs on trunk and 4_8 branch.


[Bug tree-optimization/59445] [4.9 Regression] ICE in add_old_iv_candidates, at tree-ssa-loop-ivopts.c:2541

2013-12-12 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445

--- Comment #18 from bin.cheng  ---
Hi Dominique d'Humieres,
Thanks for verifying it.


[Bug middle-end/39838] [4.7/4.8/4.9 regression] unoptimal code for two simple loops

2013-12-13 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39838

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #15 from bin.cheng  ---
The situation gets a little bit better on 4_9 trunk.  The Os assembly code on
cortex-m0 (thumb1 as reported) is like:
test:
    push    {r0, r1, r2, r4, r5, r6, r7, lr}
    mov     r6, r0
    mov     r4, #0
    str     r2, [sp, #4]
.L2:
    ldr     r2, [r6]
    cmp     r4, r2
    bge     .L7
    mov     r5, #0
    lsl     r7, r4, #2
    add     r2, r7, #4   <move to before XXX
    str     r2, [sp] <spill
.L3:
    ldr     r3, [sp, #4]
    cmp     r5, r3
    bge     .L8
    ldr     r3, [r6, #4]
    ldr     r2, [sp] <spill
    ldr     r0, [r3, r7]
    ldr     r1, [r3, r2] <XXX
    bl      func
    add     r5, r5, #1
    b       .L3
.L8:
    add     r4, r4, #1
    b       .L2
.L7:
    @ sp needed
    pop     {r0, r1, r2, r4, r5, r6, r7, pc}
    .size   test, .-test

IVOPT chooses the original biv for all uses in the outer loop; the regression
comes from the long live range of "r2" and the corresponding spill.
Then I realized that GCC IVOPT computes the iv (for non-linear uses) at its
original place; we may be able to teach IVOPT to compute the iv just before it
is used in order to shrink the live range of the iv.  The patch I had at
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg00535.html is similar to this,
only it computes iv uses at an appropriate place for iv uses outside the loop.

But this idea won't help this specific case, because LIM will hoist all the
computation to basic block .L2 after IVOPT.


[Bug middle-end/39838] [4.7/4.8/4.9 regression] unoptimal code for two simple loops

2013-12-13 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39838

--- Comment #16 from bin.cheng  ---
For optimization level O2, the dump before IVOPT is like:

  :
  _21 = p_6(D)->count;
  if (_21 > 0)
goto ;
  else
goto ;

  :

  :
  # i_26 = PHI 
  if (count_8(D) > 0)
goto ;
  else
goto ;

  :
  pretmp_23 = (sizetype) i_26;
  pretmp_32 = pretmp_23 + 1;
  pretmp_33 = pretmp_32 * 4;
  pretmp_34 = pretmp_23 * 4;

  :
  # j_27 = PHI 
  _9 = p_6(D)->data;
  _13 = _9 + pretmp_33;
  _14 = *_13;
  _16 = _9 + pretmp_34;
  _17 = *_16;
  func (_17, _14);
  j_19 = j_27 + 1;
  if (count_8(D) > j_19)
goto ;
  else
goto ;

  :
  goto ;

  :

  :
  i_20 = i_26 + 1;
  _7 = p_6(D)->count;
  if (_7 > i_20)
goto ;
  else
goto ;

  :
  goto ;

  :
  return;

There might be two issues that block IVOPT from choosing the biv(i) for
pretmp_33 and pretmp_34:
1) on some targets (like ARM), "(i << 2) + 4" can be done in one instruction;
if its cost is the same as a simple shift or plus, then the overall cost of
biv(i) is lower than that of the two-candidate iv set.  GCC doesn't do such a
check in get_computation_cost_at for now.
2) there is a CSE opportunity between the computations of pretmp_33 and
pretmp_34; for example, they can be computed as below:
   pretmp_34 = i << 2
   pretmp_33 = pretmp_34 + 4
but GCC IVOPT is insensitive to such CSE opportunities between different iv
uses.  I guess this isn't easy because unless the two uses are very close in
the code (like here), such CSE may come to nothing.

These kinds of tweaks to the cost are tricky (and most probably have no overall
benefit) because the cost IVOPT computes from RTL is far too imprecise for such
fine-grained tuning.
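
The CSE opportunity in 2) can be written out as a small sketch. The field
names follow the dump above, where pretmp_34 is i * 4 and pretmp_33 is
(i + 1) * 4; the struct and the function element_offsets are hypothetical and
only serve to show that one shift feeds both values.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offsets of elements i and i + 1 in an array of 4-byte elements.
   Field names follow the GIMPLE dump; the struct itself is invented for
   this example.  */
struct offsets
{
  size_t pretmp_34;   /* i * 4 */
  size_t pretmp_33;   /* (i + 1) * 4 */
};

/* Compute both offsets with a single shift and one add, instead of two
   independent multiply-by-4 computations.  */
static struct offsets
element_offsets (size_t i)
{
  struct offsets o;

  o.pretmp_34 = i << 2;
  o.pretmp_33 = o.pretmp_34 + 4;
  assert (o.pretmp_33 - o.pretmp_34 == 4);
  return o;
}
```

Here the two uses sit next to each other, which is the situation in which
sharing the shift actually pays off.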

Another point, as Zdenek pointed out: IVOPT doesn't know that
pretmp_33/pretmp_34 are going to be used in memory accesses, which means some
of the address computation can be embedded in an appropriate addressing mode.
In other words, the computation of pretmp_33/pretmp_34 shouldn't be counted
when computing the overall cost and choosing the iv candidate set.  Since "_9 +
pretmp_33/pretmp_34" is not an affine iv, the only way to handle this issue is
to lower both memory accesses before IVOPT, into some code like below:

  :
  pretmp_23 = (sizetype) i_26;
  pretmp_32 = pretmp_23 + 1;

  :
  # j_27 = PHI 
  _9 = p_6(D)->data;
  _14 = MEM[_9 + pretmp_32 << 2];
  _17 = MEM[_9 + pretmp_23 << 2];
  func (_17, _14);
  j_19 = j_27 + 1;
  if (count_8(D) > j_19)
goto ;
  else
goto ;

With this code, the iv uses are biv(i), pretmp_23(i_26) and pretmp_32(i_26+1),
and IVOPT won't even add the annoying candidate.


[Bug tree-optimization/59479] Inlining of static function bloats code size when Os

2013-12-13 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59479

--- Comment #2 from bin.cheng  ---
I will investigate it later.  Just to clarify: the function is called three
times by the caller, so inlining it would usually increase code size.

BTW, could you explain a little about "2nd-order effect"?  I am not familiar
with the concept.  Thanks in advance.


[Bug tree-optimization/52272] [4.7/4.8/4.9 regression] Performance regression of 410.bwaves on x86.

2013-12-17 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52272

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #21 from bin.cheng  ---
Hi Richard,
I looked into PR50955, for which the commit that caused this PR was
applied:

Commit 
2012-02-06  Richard Guenther  

PR tree-optimization/50955
* tree-ssa-loop-ivopts.c (get_computation_cost_at): Artificially
raise cost of expressions that replace an address with an
expression based on a different pointer.

I noticed that the offending non-linear use in PR50955 actually comes from a
memory reference.  If I understand the issue correctly, the whole alias issue
is introduced by rewriting an iv use with one base_object through a candidate
with another, incompatible base_object, and it is related to memory references.
A genuine non-linear iv use (the pointer is never dereferenced, as in this PR)
won't have this issue.

So I came up with the idea of relaxing the condition:

-  if (address_p)
+  if (address_p
+  || (use->iv->base_object
+ && cand->iv->base_object
+ && POINTER_TYPE_P (TREE_TYPE (use->iv->base_object))
+ && POINTER_TYPE_P (TREE_TYPE (cand->iv->base_object))))
 {
   /* Do not try to express address of an object with computation based
 on address of a different object.  This may cause problems in rtl

so that it applies only to non-linear uses which truly occur in a memory
reference, something like:

-  if (address_p)
+  if (address_p
+  || (use->in_mem_ref_p
+ && use->iv->base_object
+ && cand->iv->base_object
+ && POINTER_TYPE_P (TREE_TYPE (use->iv->base_object))
+ && POINTER_TYPE_P (TREE_TYPE (cand->iv->base_object))))
 {
   /* Do not try to express address of an object with computation based
 on address of a different object.  This may cause problems in rtl

The flag in_mem_ref_p can be set for appropriate uses when finding interesting
address uses.

With this change, this PR should be resolved while not violating PR50955.
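
The effect of the extra flag on the guard can be sketched as a stand-alone
predicate. This is a toy model, not the code in tree-ssa-loop-ivopts.c: the
structs, the boolean fields and the function needs_base_object_check are
hypothetical, and only the guard is modelled, not the check it protects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the data used in the proposed condition.  */
struct toy_use
{
  bool address_p;        /* the use is an address use */
  bool in_mem_ref_p;     /* hypothetical flag: non-linear use that feeds
                            a memory reference */
  bool base_is_pointer;  /* base_object exists and has pointer type */
};

struct toy_cand
{
  bool base_is_pointer;  /* base_object exists and has pointer type */
};

/* Return true if the "different base object" check should be applied.  */
static bool
needs_base_object_check (const struct toy_use *use,
                         const struct toy_cand *cand)
{
  assert (use != NULL && cand != NULL);
  return use->address_p
         || (use->in_mem_ref_p
             && use->base_is_pointer
             && cand->base_is_pointer);
}

/* A pointer iv that is never dereferenced, and one that feeds a store.  */
static const struct toy_use genuine_nonlinear = { false, false, true };
static const struct toy_use mem_nonlinear = { false, true, true };
static const struct toy_cand ptr_cand = { true };
```

In this model only the use that ends up in a memory reference is subject to
the check, while the genuine non-linear use is left alone.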

I am not very familiar with 50955, so how does this sound? I can send a patch
for review if the idea is in the right direction.

BTW, I cannot reproduce 50955 with the reported revision of GCC.  The store
isn't deleted by pass_cd_dce, though it is re-written just as the PR reported. 
So maybe I just misunderstood something.

Any words?

Thanks,
bin


[Bug tree-optimization/50955] [4.7 Regression] IVopts incorrectly rewrite the address of a global memory access into a local form.

2013-12-18 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50955

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #17 from bin.cheng  ---
Hi Richard,
I am having difficulty understanding the cases of this PR.
For the reported case with two loops:

  for( y=0; y<4; y++, pDst += dstStep ) {
    for( x=y+1; x<4; x++ ) {
      s = ( p1[x-y-1] + p1[x-y] + p1[x-y] + p1[x-y+1] + 2 ) >> 2;
      pDst[x] = (unsigned char)s;
    }

    pDst[y] = p3;
  }

The dump for statement 'pDst[y] = p3;' before IVOPT is like:

:
Invalid sum of incoming frequencies 1667, should be 278
  y.2_64 = (sizetype) y_89;
  D.6421_65 = pDst_88 + y.2_64;
  *D.6421_65 = p3_37;
  pDst_69 = pDst_88 + pretmp.21_118;
  ivtmp.35_116 = ivtmp.35_87 - 1;
  if (ivtmp.35_116 != 0)
goto ;
  else
goto ;


IVOPT chooses candidate 15:
candidate 15
  depends on 3
  var_before ivtmp.154
  var_after ivtmp.154
  incremented before exit test
  type unsigned int
  base (unsigned int) pDst_39(D) - (unsigned int) &p1
  step (unsigned int) (pretmp.21_118 + 1)
for use 1:
use 1
  address
  in statement *D.6421_65 = p3_37;

  at position *D.6421_65
  type unsigned char *
  base pDst_39(D)
  step pretmp.21_118 + 1
  base object (void *) pDst_39(D)
  related candidates 

After rewriting, the dump is like:

:
Invalid sum of incoming frequencies 1667, should be 278
  MEM[symbol: p1, index: ivtmp.154_200, offset: 0B] = p3_37;
  pDst_69 = pDst_88 + pretmp.21_118;
  ivtmp.149_218 = ivtmp.149_249 - 1;
  ivtmp.154_190 = ivtmp.154_200 + D.6617_250;
  if (x_40 != 4)
goto ;
  else
goto ;

Eventually, the store to TMR[p1,ivtmp,0] is considered local and deleted.

BUT, for your reduced case:

  p3 = (unsigned char)(((signed int)p1[1] + (signed int)p2[1]
+ (signed int)p1[0] +(signed int)p1[0] + 2 ) >> 2 );

  for( x=y+1; x<4; x++ ) {
  s = ( p1[x-y-1] + p1[x-y] + p1[x-y] + p1[x-y+1] + 2 ) >> 2;
  pDst[x] = (unsigned char)s;
  }

  pDst[y] = p3;

It is about the TMR in the dump below (before IVOPT):

:
  # vect_pp1.30_166 = PHI 
  # vect_pp1.37_176 = PHI 
  # vect_pp1.46_194 = PHI 
  # vect_p.60_223 = PHI 
  # ivtmp.64_225 = PHI 
  ...
  MEM[(unsigned char *)vect_p.60_223] = vect_var_.58_219;
  vect_pp1.30_167 = vect_pp1.30_166 + 8;
  vect_pp1.37_177 = vect_pp1.37_176 + 8;
  vect_pp1.46_195 = vect_pp1.46_194 + 8;
  vect_p.60_224 = vect_p.60_223 + 8;
  ivtmp.64_226 = ivtmp.64_225 + 1;
  if (ivtmp.64_226 < bnd.27_128)
goto ;
  else
goto ;

Your patch prevents IVOPT from choosing cand 4:
candidate 4 (important)
  var_before ivtmp.110
  var_after ivtmp.110
  incremented before exit test
  type unsigned int
  base (unsigned int) (&p1 + 8)
  step 8
  base object (void *) &p1
for use 3:
use 3
  generic
  in statement vect_p.60_223 = PHI 

  at position 
  type vector(8) unsigned char *
  base batmp.61_221 + 1
  step 8
  base object (void *) batmp.61_221
  is a biv
  related candidates 

To prevent IVOPT from rewriting into:

:
  # ivtmp.107_150 = PHI 
  # ivtmp.110_241 = PHI 
  D.6585_133 = (unsigned int) batmp.61_221;
  p1.131_277 = (unsigned int) &p1;
  D.6587_278 = D.6585_133 - p1.131_277;
  D.6588_279 = D.6587_278 + ivtmp.110_241;
  D.6589_280 = D.6588_279 + 4294967289;
  D.6590_281 = (vector(8) unsigned char *) D.6589_280;
  vect_p.60_223 = D.6590_281;
  ...
  MEM[(unsigned char *)vect_p.60_223] = vect_var_.58_219;
  ivtmp.107_256 = ivtmp.107_150 + 1;
  ivtmp.110_146 = ivtmp.110_241 + 8;
  if (ivtmp.107_256 < bnd.27_128)
goto ;
  else
goto ;

This in turn prevents IVOPT from generating candidate 15 in the outer loop.  (Expressing
use 3 by cand 4 itself is good, right?)


---
But it seems the problem arises because the check:

  if (address_p)
{
  /* Do not try to express address of an object with computation based
 on address of a different object.  This may cause problems in rtl
 level alias analysis (that does not expect this to be happening,
 as this is illegal in C), and would be unlikely to be useful
 anyway.  */
  if (use->iv->base_object
  && cand->iv->base_object
  && !operand_equal_p (use->iv->base_object, cand->iv->base_object, 0))
return infinite_cost;

does not trigger because cand(15)->iv->base_object == NULL.  For the reported case, it's
not about an iv use appearing in a memory reference while not marked as
address_p, and it can be fixed by revising the existing check condition. Is that right?
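To make the NULL case concrete, here is a toy model of the quoted condition. The `toy_*` types and the function name are hypothetical; only base_object is modelled, and plain pointer equality stands in for operand_equal_p. It is a sketch of the logic, not GCC code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the iv structures; base_object is an opaque
   pointer and NULL means "no known base object".  */
struct toy_iv   { const void *base_object; };
struct toy_use  { struct toy_iv *iv; };
struct toy_cand { struct toy_iv *iv; };

/* Same shape as the quoted check: reject (return 1) only when both
   base objects are known and they differ.  */
int rejected_by_check (const struct toy_use *use, const struct toy_cand *cand)
{
  assert (use != NULL && cand != NULL);
  return use->iv->base_object != NULL
         && cand->iv->base_object != NULL
         && use->iv->base_object != cand->iv->base_object;
}
```

With a use based on one object and a candidate whose base_object is NULL, the function returns 0, so the candidate is not rejected. That is the situation described for candidate 15.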

PS, sorry for replying to a fixed PR, I found it's kind of impossible to fix
PR52272 without fully understanding this one.


[Bug tree-optimization/50955] [4.7 Regression] IVopts incorrectly rewrite the address of a global memory access into a local form.

2013-12-18 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50955

--- Comment #19 from bin.cheng  ---
> 
> >not about an iv use appearing in memory reference while not marked as
> >address_p, and can be fixed by revise the existing check condition, is
> >it true?
> 
> No, even expressing an address this way is broken as for example dependence 
> analysis via scev can get confused about the actual base object.
Agreed, but I think it's not scev's responsibility, since scev only cares about the base
value initialized for the loop being analyzed, rather than the BASE object.

> 
> IIRC previously we already avoided the mem-use case and I had to generalize 
> it 
> to also avoid addresses.
Not in all cases.
For the reported case, the use and cand look like:
use 3
  generic
  in statement vect_p.70_247 = PHI 

  at position 
  type vector(8) unsigned char *
  base batmp.71_245 + 1
  step 8
  base object (void *) batmp.71_245
  is a biv
  related candidates 

candidate 15
  depends on 3
  var_before ivtmp.154
  var_after ivtmp.154
  incremented before exit test
  type unsigned int
  base (unsigned int) pDst_39(D) - (unsigned int) &p1
  step (unsigned int) (pretmp.21_118 + 1)

The check:

  if (address_p
  || (use->iv->base_object
  && cand->iv->base_object
  && POINTER_TYPE_P (TREE_TYPE (use->iv->base_object))
  && POINTER_TYPE_P (TREE_TYPE (cand->iv->base_object))))
{
  /* Do not try to express address of an object with computation based
 on address of a different object.  This may cause problems in rtl
 level alias analysis (that does not expect this to be happening,
 as this is illegal in C), and would be unlikely to be useful
 anyway.  */
  if (use->iv->base_object
  && cand->iv->base_object
  && !operand_equal_p (use->iv->base_object, cand->iv->base_object, 0))
return infinite_cost;
}

still evaluates to false because:
   use->iv->base_object != NULL  &&  cand->iv->base_object == NULL


[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap

2013-12-18 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #5 from bin.cheng  ---
I will have a look.
Thanks.


[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap

2013-12-18 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536

--- Comment #6 from bin.cheng  ---
Hi,
Sorry, I don't have an m68k environment to do the bootstrap. Could anyone help dump
"-fdump-tree-all-details -fdump-rtl-all-slim" with and without the patch for
me?  Otherwise I will have to revert the patch and hold it for the future.

Hi Jakub, should I revert the patch for now?

Thanks.


[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap

2013-12-18 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536

--- Comment #8 from bin.cheng  ---
(In reply to Andreas Schwab from comment #1)
> Between r205951 and r205984.

(In reply to H.J. Lu from comment #7)
> (In reply to bin.cheng from comment #6)
> > Hi,
> > Sorry I don't have m68k environment to do the bootstrap, could anyone help
> > dump "-fdump-tree-all-details -fdump-rtl-all-slim" with and without the
> > patch for me?  Otherwise I have to revert the patch and hold it for future.
> > 
> 
> Can't you use cross compiler on preprocessed input to debug it?

The bare-metal tool doesn't seem to handle the preprocessed file correctly, so I am
trying to build cross linux tools.  Unfortunately, cross-ng only supports
uclinux for m68k.  Since I am not familiar with m68k-linux, I am having
difficulty enabling one for now.


[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap

2013-12-18 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536

--- Comment #9 from bin.cheng  ---
It turns out my cross bare-metal tool works after deleting all preprocessed "#
xxx file" lines, but why do these lines matter?


[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap

2013-12-19 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536

--- Comment #10 from bin.cheng  ---
The offending loop before IVOPT is like:

  :
  # var_index_1889 = PHI <1(924), var_index_983(923)>
  # var_index.250_1269 = PHI <1(924), var_index.250_1959(923)>
  if (var_index.250_1269 < _1237)
goto ;
  else
goto ;

  :
  loopi_952 = MEM[(const struct vec
*)pretmp_2270].m_vecdata[var_index.250_1269];
  _947 = loopi_952->num;
  if (_947 == pretmp_2268)
goto ;
  else
goto ;

  :
  var_index_983 = var_index_1889 + 1;
  var_index.250_1959 = (unsigned int) var_index_983;
  goto ;

  :
  goto ;

With the patch, var_index.250_1269 is recognized as an iv {1, 1}_loop, so the
loop is rewritten into:


  :
  # var_index_1889 = PHI <1(924), var_index_983(923)>
  # ivtmp.1067_1968 = PHI 
  var_index.250_1269 = (unsigned int) var_index_1889;
  if (var_index_1889 != _958)
goto ;
  else
goto ;

  :
  _111 = (void *) ivtmp.1067_1968;
  loopi_952 = MEM[base: _111, offset: 0B];
  ivtmp.1067_884 = ivtmp.1067_1968 + 4;
  _947 = loopi_952->num;
  if (_947 == pretmp_2268)
goto ;
  else
goto ;

  :
  var_index_983 = var_index_1889 + 1;
  goto ;

  :
  _1542 = pretmp_2270 + 12;
  ivtmp.1067_696 = (unsigned int) _1542;
  _958 = (int) _1237;
  goto ;

The transformation looks good and takes advantage of post-increment addressing
mode for memory access "MEM[base: _111, offset: 0B]".
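In C terms, the two loop shapes correspond roughly to the sketch below. The element type `loop_like` and both function names are made up for illustration; the real code iterates over a vec, and the offsets in the dump depend on its layout.

```c
#include <assert.h>
#include <stddef.h>

struct loop_like { int num; };   /* hypothetical element type */

/* Index form, as before IVOPT: starts at index 1, as in the dump.  */
int find_by_index (struct loop_like *const *vecdata, unsigned len, int target)
{
  unsigned i;

  for (i = 1; i < len; i++)
    if (vecdata[i]->num == target)
      return (int) i;
  return -1;
}

/* Pointer form, as after IVOPT: the load and the pointer bump can map
   to a single post-increment access on targets that have one.  */
int find_by_pointer (struct loop_like *const *vecdata, unsigned len, int target)
{
  struct loop_like *const *p;
  int i;

  assert (vecdata != NULL);
  if (len <= 1)
    return -1;                   /* guard, so the != exit test is safe */
  p = vecdata + 1;
  for (i = 1; i != (int) len; i++)
    {
      const struct loop_like *elt = *p++;   /* load, then bump */
      if (elt->num == target)
        return i;
    }
  return -1;
}
```

Both functions should return the same index for any input. The second one simply exposes the `*p++` access pattern that the auto-inc-dec pass later turns into a post-increment address.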
The loop is expanded into rtl like:
 4438: L4438:
 1814: NOTE_INSN_BASIC_BLOCK 352
 1815: r626:SI=r817:SI
 1816: cc0=cmp(r817:SI,r492:SI)
 1817: pc={(cc0==0)?L4244:pc}
  REG_BR_PROB 900
 1818: NOTE_INSN_BASIC_BLOCK 353
 1819: r490:SI=[r829:SI]
 1820: r829:SI=r829:SI+0x4
 1821: cc0=cmp([r490:SI],r864:SI)
 1822: pc={(cc0!=0)?L4435:pc}
   ...
 4435: L4435:
 4436: NOTE_INSN_BASIC_BLOCK 952
 4437: r817:SI=r817:SI+0x1
 4439: pc=L4438
 4440: barrier
 4441: L4441:
 4442: NOTE_INSN_BASIC_BLOCK 953
 4443: r829:SI=r865:SI+0xc
 : r492:SI=r621:SI
   44: r817:SI=0x1
 4445: pc=L4438

Then instructions 1819 and 1820 are combined by the auto-inc-dec pass into:

 1819: r490:SI=[r829:SI++]
  REG_INC r829:SI
 1821: cc0=cmp([r490:SI],r864:SI)
  REG_DEAD r490:SI
 1822: pc={(cc0!=0)?L4435:pc}
  REG_BR_PROB 9550

The problem comes from reload, which puts both r490 and r829 into %a0 (reg 8?) and
generates the code below:
 1819: %a0:SI=[%a0:SI++]
  REG_INC %a0:SI
 1821: cc0=cmp([%a0:SI],%d2:SI)
 1822: pc={(cc0!=0)?L4435:pc}
  REG_BR_PROB 9550

Insn 1819 is now bogus and triggers the assertion in cselib.
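The reason the insn is bogus can be stated as a simple predicate. The sketch below uses a hypothetical, much simplified description of a load insn after register allocation; it is not how reload or cselib represent insns.

```c
#include <assert.h>

/* Hypothetical view of "dest = [addr]" or "dest = [addr++]" once hard
   registers have been assigned.  */
struct toy_load
{
  int dest_hard_reg;   /* hard register receiving the loaded value */
  int addr_hard_reg;   /* hard register holding the address */
  int post_inc;        /* nonzero if the address is auto-incremented */
};

/* A post-increment load writes two registers: the destination and the
   address register.  If both are the same hard register, the insn sets
   that register twice.  */
int sets_same_reg_twice (const struct toy_load *insn)
{
  assert (insn != 0);
  return insn->post_inc && insn->dest_hard_reg == insn->addr_hard_reg;
}
```

For insn 1819 after reload, both fields would be hard register 8 with post_inc set, so the predicate is true. With the IRA assignment (r490 in reg 2, r829 in reg 8) it is false.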

In IRA, there are dumps like:
  Popping a1119(r829,l0: a921(r829,l17))  -- assign reg 8
  Popping a1122(r,l0: a924(r,l17))  -- assign reg 8
  Popping a1120(r494,l0: a922(r494,l17))  -- assign reg 9
  Popping a1147(r1054,l0: a1006(r1054,l15))  -- assign reg 8
  Popping a1157(r490,l0: a1124(r490,l17: a959(r490,l18)))  -- assign reg 2

But in reload, there are dumps:

Reloads for insn # 1819
Reload 0: reload_in (SI) = (post_inc:SI (reg:SI 829 [ ivtmp.1067 ]))
reload_out (SI) = (post_inc:SI (reg:SI 829 [ ivtmp.1067 ]))
ADDR_REGS, RELOAD_FOR_OPERAND_ADDRESS (opnum = 1), inc by 4
reload_in_reg: (post_inc:SI (reg:SI 829 [ ivtmp.1067 ]))
reload_reg_rtx: (reg:SI 8 %a0)
Reload 1: reload_out (SI) = (reg/v/f:SI 490 [ loopi ])
GENERAL_REGS, RELOAD_FOR_OUTPUT (opnum = 0), optional
reload_out_reg: (reg/v/f:SI 490 [ loopi ])
Reload 2: reload_in (SI) = (mem/f:SI (post_inc:SI (reg:SI 829 [ ivtmp.1067 ]))
[4 MEM[base: _111, offset: 0B]+0 S4 A16])
GENERAL_REGS, RELOAD_FOR_INPUT (opnum = 1), optional
reload_in_reg: (mem/f:SI (post_inc:SI (reg:SI 829 [ ivtmp.1067 ])) [4
MEM[base: _111, offset: 0B]+0 S4 A16])


So I am not sure whether there is a bug in reload for m68k, or whether ivopt is doing
something very tricky and wrong?

Thanks,
bin


[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap

2013-12-19 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536

bin.cheng  changed:

   What|Removed |Added

 CC||bernds at codesourcery dot com,
   ||uweigand at de dot ibm.com

--- Comment #11 from bin.cheng  ---
Adding the reload maintainers for suggestions.


[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap

2013-12-19 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536

--- Comment #13 from bin.cheng  ---
(In reply to Andreas Schwab from comment #12)
> -fno-auto-inc-dec avoids the crash.  Dup of #52306?

It looks like, AFAICT.  Only this time it's blocking bootstrap :(


[Bug c++/59555] New: bogus error: template with C linkage with preprocessed c++ file

2013-12-19 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59555

Bug ID: 59555
   Summary: bogus error: template with C linkage with preprocessed
c++ file
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amker.cheng at gmail dot com

Created attachment 31478
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31478&action=edit
preprocessed c++ file

For attached preprocessed file, arm-none-eabi-g++ and m68k-unknown-elf-g++ give
below error messages with either "-xc++" or "-xc++-cpp-output":

In file included from
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/stringfwd.h:40:0,
 from
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:39,
 from
/daten/cross/m68k-linux/m68k-linux/sys-root/usr/include/gmp.h:25,
 from ../../gcc/gcc/system.h:647,
 from ../../gcc/gcc/tree-loop-distribution.c:45:
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/memoryfwd.h:63:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/memoryfwd.h:66:3:
error: template specialization with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/memoryfwd.h:70:3:
error: template with C linkage
In file included from
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:39:0,
 from
/daten/cross/m68k-linux/m68k-linux/sys-root/usr/include/gmp.h:25,
 from ../../gcc/gcc/system.h:647,
 from ../../gcc/gcc/tree-loop-distribution.c:45:
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/stringfwd.h:52:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/stringfwd.h:55:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/stringfwd.h:59:3:
error: template specialization with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/stringfwd.h:65:3:
error: template specialization with C linkage
In file included from
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:40:0,
 from
/daten/cross/m68k-linux/m68k-linux/sys-root/usr/include/gmp.h:25,
 from ../../gcc/gcc/system.h:647,
 from ../../gcc/gcc/tree-loop-distribution.c:45:
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/postypes.h:111:3:
error: template with C linkage
In file included from
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:40:0,
 from
/daten/cross/m68k-linux/m68k-linux/sys-root/usr/include/gmp.h:25,
 from ../../gcc/gcc/system.h:647,
 from ../../gcc/gcc/tree-loop-distribution.c:45:
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/postypes.h:214:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/postypes.h:219:3:
error: template with C linkage
In file included from
/daten/cross/m68k-linux/m68k-linux/sys-root/usr/include/gmp.h:25:0,
 from ../../gcc/gcc/system.h:647,
 from ../../gcc/gcc/tree-loop-distribution.c:45:
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:76:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:79:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:82:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:85:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:88:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:91:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:95:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:99:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:103:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:107:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:110:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:113:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:116:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:119:3:
error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:122:3:
error: template with

[Bug middle-end/52306] [4.8/4.9 regression] ICE in cselib_record_set, at cselib.c:2158

2013-12-19 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52306

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #27 from bin.cheng  ---
(In reply to Andreas Schwab from comment #26)
> What does that mean, it's too late?

We are in stage 3 now, enabling LRA needs non-trivial work, so it's very likely
we can't make it work in time.


[Bug tree-optimization/59519] [4.9 Regression] ICE on valid code at -O3 on x86_64-linux-gnu in slpeel_update_phi_nodes_for_guard1, at tree-vect-loop-manip.c:486

2013-12-19 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59519

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #3 from bin.cheng  ---
I will look into it.


[Bug tree-optimization/59519] [4.9 Regression] ICE on valid code at -O3 on x86_64-linux-gnu in slpeel_update_phi_nodes_for_guard1, at tree-vect-loop-manip.c:486

2013-12-19 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59519

--- Comment #4 from bin.cheng  ---
First clue.

b_lsm.11_13 is recognized as chrec {1, +, 1}_2 with the patch, thus the loop
can be vectorized now.
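For readers less familiar with the notation: an affine chrec {base, +, step}_loop describes a value that starts at base and grows by step on every iteration of the given loop. A minimal sketch of that reading follows; the function name is made up and this is not GCC's chrec machinery.

```c
#include <assert.h>

/* Value of the affine chrec {base, +, step} at iteration n, n >= 0.  */
long toy_chrec_value (long base, long step, long n)
{
  assert (n >= 0);
  return base + step * n;
}
```

So {1, +, 1}_2 takes the values 1, 2, 3, ... in successive iterations of loop 2.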

  :

  :
  # b.4_30 = PHI 
  # prephitmp_28 = PHI 
  # b_lsm.11_13 = PHI 
  # ivtmp_46 = PHI 
  c.1_9 = prephitmp_28 | 1;
  b.4_12 = b.4_30 + 1;
  ivtmp_45 = ivtmp_46 - 1;
  if (ivtmp_45 != 0)
goto ;
  else
goto ;

The problem arises in a call stack like:
vect_do_peeling_for_loop_bound
  slpeel_tree_peel_loop_to_edge
slpeel_update_phi_nodes_for_guard1
for phi node : # b_lsm.11_13 = PHI 

It looks like loop peeling has difficulty coping with a peeled phi node.


[Bug tree-optimization/59519] [4.9 Regression] ICE on valid code at -O3 on x86_64-linux-gnu in slpeel_update_phi_nodes_for_guard1, at tree-vect-loop-manip.c:486

2013-12-20 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59519

--- Comment #5 from bin.cheng  ---
For the offending loop:

  :

  :
  # b.4_30 = PHI 
  # prephitmp_28 = PHI 
  # b_lsm.11_13 = PHI 
  # ivtmp_46 = PHI 
  c.1_9 = prephitmp_28 | 1;
  b.4_12 = b.4_30 + 1;
  ivtmp_45 = ivtmp_46 - 1;
  if (ivtmp_45 != 0)
goto ;
  else
goto ;

Now SCEV recognizes b_lsm.11_13 as {1,1}_2, and the vectorizer considers that the loop can be
vectorized.
The problem comes in function slpeel_update_phi_nodes_for_guard1 for phi node
:# b_lsm.11_13 = PHI .  It's special because its loop_arg:
b.4_12 has already been handled in previous node and has non-null current
definition, resulting in assertion failure at line:
  gcc_assert (get_current_def (current_new_name) == NULL_TREE);

It seems the loop manipulation utility for vectorization can't cope with this kind
of PEELED phi node.
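The failing situation can be modelled with a small table of current definitions. Everything in the sketch below is hypothetical: SSA names are small integers, `current_def` stands in for get_current_def/set_current_def, and the function returns the offending phi index where the real code would hit the gcc_assert.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NAMES 8

/* current_def[i] models the current definition of SSA name i;
   -1 means "not set".  */
struct toy_state { int current_def[MAX_NAMES]; };

void toy_init (struct toy_state *s)
{
  int i;

  for (i = 0; i < MAX_NAMES; i++)
    s->current_def[i] = -1;
}

/* Process phi nodes in order.  loop_args[k] is the loop-carried
   argument of phi k and new_names[k] the name created for it.
   Returns the index of the first phi whose loop argument already has
   a current definition, or -1 if there is none.  */
int first_conflicting_phi (struct toy_state *s, const int *loop_args,
                           const int *new_names, int n_phis)
{
  int k;

  for (k = 0; k < n_phis; k++)
    {
      int arg = loop_args[k];

      assert (arg >= 0 && arg < MAX_NAMES);
      if (s->current_def[arg] != -1)
        return k;               /* the real code asserts here */
      s->current_def[arg] = new_names[k];
    }
  return -1;
}
```

If the first and third phi share the same loop argument, as b.4_30 and b_lsm.11_13 do with b.4_12, the function reports index 2.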

We can get more loops vectorized if we can handle this issue in vectorization.
For example, the more complicated example reported can be vectorized
successfully.

But I think it's a little bit difficult to handle the case, because it's
possible to have the PEELED phi node come before the phi node from which it's
peeled (b.4_30, in this case), just like:

  :

  :
  # b_lsm.11_13 = PHI

[Bug tree-optimization/59519] [4.9 Regression] ICE on valid code at -O3 on x86_64-linux-gnu in slpeel_update_phi_nodes_for_guard1, at tree-vect-loop-manip.c:486

2014-01-02 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59519

--- Comment #7 from bin.cheng  ---
(In reply to Jakub Jelinek from comment #6)
> Created attachment 31562 [details]
> gcc49-pr59519.patch
> 
> I wonder if this isn't just a checking issue, the two PHI nodes created in
> *new_exit_bb have the same argument, so I think it is just fine if the two
> PHI results are used interchangeably, later optimization passes should
> hopefully coalesce them into a single IV.

I tested a similar patch before.  It passed x86_64 bootstrap and the normal
regression test.  Some ada cases (and one go case) failed when I ran the regression
test with the "-O3" option.  The failures look like noise to me, though I am not
sure about that.  What are your test results?

One potential drawback is that it introduces additional PHIs/copies of different ssa
names and makes the generated code somewhat ugly and hard to read, but just
as you pointed out, later passes should be able to coalesce them (I am not
sure about that, especially after seeing ssa names not get coalesced in some
more regular cases.)

Thanks.


[Bug tree-optimization/59519] [4.9 Regression] ICE on valid code at -O3 on x86_64-linux-gnu in slpeel_update_phi_nodes_for_guard1, at tree-vect-loop-manip.c:486

2014-01-03 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59519

--- Comment #10 from bin.cheng  ---
(In reply to Jakub Jelinek from comment #9)
> BTW, the patch can hardly regress anything, it only affects cases that ICEd
> before the patch.

Hmm, I am worried about whether vectorization handled peeled phis correctly in every
scenario before, because I barely know the implementation.  That's why I asked
for your suggestions in the first place.

Thanks.


[Bug rtl-optimization/43491] Unnecessary temporary for global register variable

2011-11-22 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43491

amker.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot
   ||com

--- Comment #2 from amker.cheng  2011-11-23 
05:50:51 UTC ---
Noticed that pass 097t.copyprop4 propagates reg.0_12 to statement Y in
following dump:
-
:
  reg.0_12 = reg;
  D.4705_13 = MEM[(unsigned int *)reg.0_12 + 8B];   <-statement Z
  if (D.4705_13 != 0)
goto ;
  else
goto ;

:

:
  c ();
  reg.0_1 = reg.0_12;   <-statement X
  D.4705_3 = MEM[(unsigned int *)reg.0_1 + 8B]; <-statement Y
  if (D.4705_3 != 0)
goto ;
  else
goto ;

:
  goto ;

:
  return;
-
to be:
  reg.0_1 = reg.0_12;   <-statement X
  D.4705_3 = MEM[(unsigned int *)reg.0_12 + 8B]; <-statement Y

So, should it propagate reg directly? Could this be done on ssa?

Also I found 
1) there are similar cases on redundant copy or load constant, for example,
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44025
2) some of these cases are generated after expanding into rtl;
3) redundant copy might be handled in IRA, but redundant load const might be
more difficult.

How about extending regcprop.c pass into a global pass?


[Bug rtl-optimization/43491] Unnecessary temporary for global register variable

2011-11-24 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43491

--- Comment #3 from amker.cheng  2011-11-24 
09:24:37 UTC ---
(In reply to comment #1)

> 
> I'm thinking that this is perfectly normal thing to do, and that the redundant
> move is meant to disappear in a later pass.  My guess is that IRA is choosing
> not to assign the pseudo to r4, but I do not know why at the moment.

As the dump in 191r.sched1 shows:
--
(insn 5 7 6 2 (set (reg/f:SI 135 [ reg.0 ])
(reg/v:SI 4 r4 [ reg ])) pr43491.c:16 709 {*thumb2_movsi_insn}
 (expr_list:REG_DEAD (reg/v:SI 4 r4 [ reg ])
(nil)))

(insn 6 5 8 2 (set (reg:SI 137 [ MEM[(unsigned int *)reg.0_12 + 8B] ])
(mem:SI (plus:SI (reg/f:SI 135 [ reg.0 ])
(const_int 8 [0x8])) [2 MEM[(unsigned int *)reg.0_12 + 8B]+0 S4
A32])) pr43491.c:16 709 {*thumb2_movsi_insn}
 (nil))

(jump_insn 8 6 49 2 (parallel [
(set (pc)
(if_then_else (eq (reg:SI 137 [ MEM[(unsigned int *)reg.0_12 +
8B] ])
(const_int 0 [0]))
(label_ref:SI 22)
(pc)))
(clobber (reg:CC 24 cc))
]) pr43491.c:16 747 {*thumb2_cbz}
 (expr_list:REG_DEAD (reg:SI 137 [ MEM[(unsigned int *)reg.0_12 + 8B] ])
(expr_list:REG_UNUSED (reg:CC 24 cc)
(expr_list:REG_BR_PROB (const_int 900 [0x384])
(nil
 -> 22)

(code_label 49 8 48 3 4 "" [1 uses])

(note 48 49 16 3 [bb 3] NOTE_INSN_BASIC_BLOCK)

(note 16 48 14 3 NOTE_INSN_DELETED)

(call_insn 14 16 15 3 (parallel [
(call (mem:SI (symbol_ref:SI ("c") [flags 0x41]  ) [0 c S4 A32])
(const_int 0 [0]))
(use (const_int 0 [0]))
(clobber (reg:SI 14 lr))
]) pr43491.c:17 247 {*call_symbol}
 (nil)
(nil))

(insn 15 14 17 3 (set (reg:SI 138 [ MEM[(unsigned int *)reg.0_12 + 8B] ])
(mem:SI (plus:SI (reg/f:SI 135 [ reg.0 ])
(const_int 8 [0x8])) [2 MEM[(unsigned int *)reg.0_12 + 8B]+0 S4
A32])) pr43491.c:16 709 {*thumb2_movsi_insn}
 (nil))
--
Since reg is explicitly declared in r4, function globalize_reg sets r4 in
fixed_reg_set/call_used_reg_set/call_fixed_reg_set. IRA then adds r4 to
allocno(r135)'s conflict_hard_regs. That's why IRA does not assign the pseudo(r135)
to r4. I guess that's natural unless we can make IRA aware of constant registers.


[Bug rtl-optimization/43491] Unnecessary temporary for global register variable

2011-12-20 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43491

--- Comment #4 from amker.cheng  2011-12-21 
03:44:03 UTC ---
This bug is even worse on mips.

The cause is that ssa-pre eliminates the global register variable when it is the RHS of
a single assignment statement, while the following passes do not handle the const/register
attributes of the variable.
It can be handled in tree-ssa-pre.c without hurting true redundancy elimination
on global register variables.

So could somebody change the tag from rtl-optimization to tree-optimization?


[Bug target/51835] New: ARM EABI violation when passing arguments to helper floating functions like __aeabi_d2iz

2012-01-12 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51835

 Bug #: 51835
   Summary: ARM EABI violation when passing arguments to helper
floating functions like __aeabi_d2iz
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com


For the following program:
int func(float f)
{
  double d = (double)f;
  return (int)d;
}
compile it with the following command:
$ arm-none-eabi-gcc -O2 -mthumb -mcpu=cortex-m4 -mfloat-abi=hard
-mfpu=fpv4-sp-d16 -S test.c -o test.S

the generated assembly code is:
---
fun:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
push    {r3, lr}
fmrs    r0, s0
bl      __aeabi_f2d
fmdrr   d0, r0, r1
bl      __aeabi_d2iz
pop     {r3, pc}
.size   fun, .-fun

The argument of __aeabi_d2iz is passed in an fp register, while the ARM RTABI document
says that such functions should use the soft-float ABI, even when
-mfloat-abi=hard is specified.
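Under the soft-float convention a double travels as two 32-bit words in a core register pair, which is what the `fmdrr d0, r0, r1` above undoes by moving the pair into d0. The sketch below shows that view of a double in portable C. The struct and function names are made up, and the order of the two words in memory is left as the host's, so nothing here is specific to ARM endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two 32-bit words standing in for a core register pair such as
   r0/r1.  "first" and "second" follow host memory order.  */
struct reg_pair { uint32_t first, second; };

/* Split a double into the two words that would travel in the pair.  */
struct reg_pair double_to_pair (double d)
{
  struct reg_pair p;
  uint32_t w[2];

  assert (sizeof (double) == 2 * sizeof (uint32_t));
  memcpy (w, &d, sizeof d);
  p.first = w[0];
  p.second = w[1];
  return p;
}

/* Rebuild the double from the two words, as the callee would see it.  */
double pair_to_double (struct reg_pair p)
{
  double d;
  uint32_t w[2];

  w[0] = p.first;
  w[1] = p.second;
  memcpy (&d, w, sizeof d);
  return d;
}
```

A helper that follows the soft-float ABI expects its argument in this two-word form, so passing the value in d0 instead leaves the core registers holding whatever was there before.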

The problem exists at least on trunk and the 4.6 branch.

I am working on a patch and will send it for review later.


[Bug middle-end/51867] New: GCC generates inconsistent code for same sources calling builtin calls, like sqrtf

2012-01-16 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51867

 Bug #: 51867
   Summary: GCC generates inconsistent code for same sources
calling builtin calls, like sqrtf
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: trivial
  Priority: P3
 Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com


Compile the following program:
--
#include 
int a(float x) {
 return sqrtf(x);
}
int b(float x) {
 return sqrtf(x);
}

With command:
arm-none-eabi-gcc -mthumb -mhard-float -mfpu=fpv4-sp-d16
-mcpu=cortex-m4 -O0 -S a.c -o a.S

The generated assembly code looks like:
--
a:
   @ args = 0, pretend = 0, frame = 8
   @ frame_needed = 1, uses_anonymous_args = 0
   push    {r7, lr}
   sub sp, sp, #8
   add r7, sp, #0
   fsts    s0, [r7, #4]
   flds    s15, [r7, #4]
   fsqrts  s15, s15
   fcmps   s15, s15
   fmstat
   beq .L2
   flds    s0, [r7, #4]
   bl  sqrtf
   fcpys   s15, s0
.L2:
   ftosizs s15, s15
   fmrs    r3, s15 @ int
   mov r0, r3
   add r7, r7, #8
   mov sp, r7
   pop {r7, pc}
   .size   a, .-a
   .align  2
   .global b
   .thumb
   .thumb_func
   .type   b, %function
b:
   @ args = 0, pretend = 0, frame = 8
   @ frame_needed = 1, uses_anonymous_args = 0
   push    {r7, lr}
   sub sp, sp, #8
   add r7, sp, #0
   fsts    s0, [r7, #4]
   flds    s0, [r7, #4]
   bl  sqrtf
   fcpys   s15, s0
   ftosizs s15, s15
   fmrs    r3, s15 @ int
   mov r0, r3
   add r7, r7, #8
   mov sp, r7
   pop {r7, pc}
   .size   b, .-b

The problem exists on trunk and is triggered only at -O0.
The problem occurs on the x86 target too.


[Bug middle-end/51867] GCC generates inconsistent code for same sources calling builtin calls, like sqrtf

2012-01-16 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51867

--- Comment #1 from amker.cheng  2012-01-16 
10:15:59 UTC ---
The cause is in function expand_builtin, where gcc checks the following conditions:
--
 /* When not optimizing, generate calls to library functions for a certain
set of builtins.  */
 if (!optimize
 && !called_as_built_in (fndecl)
 && DECL_ASSEMBLER_NAME_SET_P (fndecl)
 && fcode != BUILT_IN_ALLOCA
 && fcode != BUILT_IN_ALLOCA_WITH_ALIGN
 && fcode != BUILT_IN_FREE)
   return expand_call (exp, target, ignore);

The control flow is:
1. DECL_ASSEMBLER_NAME_SET_P (fndecl) is false the first time, when compiling
a;
2. It is then set by the code that follows when expanding the sqrtf call in function a;
3. When compiling function b, gcc checks DECL_ASSEMBLER_NAME_SET_P (fndecl)
again and this time it's true.
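The three steps can be reproduced with a toy model of the lazily set flag. The names below are hypothetical and the model keeps only the two parts of the condition that matter here (`!optimize` and the name-set flag); called_as_built_in and the fcode tests are left out.

```c
#include <assert.h>

enum expansion { EXPAND_INLINE, EXPAND_LIBCALL };

/* Stand-in for the builtin's decl; the flag models
   DECL_ASSEMBLER_NAME_SET_P.  */
struct toy_fndecl { int assembler_name_set; };

/* At -O0 a plain library call is emitted only if the assembler name
   has already been set.  Otherwise the builtin is expanded, and that
   expansion sets the name as a side effect.  */
enum expansion toy_expand_builtin (struct toy_fndecl *fndecl, int optimize)
{
  assert (fndecl != 0);
  if (!optimize && fndecl->assembler_name_set)
    return EXPAND_LIBCALL;
  fndecl->assembler_name_set = 1;
  return EXPAND_INLINE;
}
```

Calling the function twice on the same decl with optimize set to 0 gives EXPAND_INLINE for the first call and EXPAND_LIBCALL for the second, which matches the difference between a and b in the listing.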


[Bug middle-end/51867] GCC generates inconsistent code for same sources calling builtin calls, like sqrtf

2012-01-17 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51867

--- Comment #3 from amker.cheng  2012-01-17 
10:35:14 UTC ---
Test case c-c++-common/dfp/signbit-2.c depends on this check.
The case looks like:
-
/* { dg-options "-O0" } */

/* Check that the compiler uses builtins for signbit; if not the link
   will fail because library functions are in libm.  */

#include "dfp-dbg.h"

volatile _Decimal32 sd = 2.3df;
volatile _Decimal64 dd = -4.5dd;
volatile _Decimal128 tf = 5.3dl;
volatile float f = 1.2f;
volatile double d = -7.8;
volatile long double ld = 3.4L;

EXTERN int signbitf (float);
EXTERN int signbit (double);
EXTERN int signbitl (long double);
EXTERN int signbitd32 (_Decimal32);
EXTERN int signbitd64 (_Decimal64);
EXTERN int signbitd128 (_Decimal128);

int
main ()
{
  if (signbitf (f) != 0) FAILURE
  if (signbit (d) == 0) FAILURE
  if (signbitl (ld) != 0) FAILURE
  if (signbitd32 (sd) != 0) FAILURE
  if (signbitd64 (dd) == 0) FAILURE
  if (signbitd128 (tf) != 0) FAILURE

  FINISH
}

It is compiled without optimization and will fail if no builtin_* functions are
used.
I'm not sure whether that is intended or not.
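For comparison, a call that names the builtin explicitly does not depend on this check at all. The sketch below covers only the binary floating-point cases of the test, since decimal float support varies by target; the wrapper names are made up.

```c
#include <assert.h>

/* Call the GCC builtins by their __builtin_ names, so no libm symbol
   is referenced even at -O0.  Each wrapper returns 1 for a set sign
   bit and 0 otherwise.  */
int sign_of_float (float f)                { return __builtin_signbitf (f) != 0; }
int sign_of_double (double d)              { return __builtin_signbit (d) != 0; }
int sign_of_long_double (long double ld)   { return __builtin_signbitl (ld) != 0; }

/* Same values as the binary-float part of the test case.  */
void check_signs (void)
{
  volatile float f = 1.2f;
  volatile double d = -7.8;
  volatile long double ld = 3.4L;

  assert (sign_of_float (f) == 0);
  assert (sign_of_double (d) == 1);
  assert (sign_of_long_double (ld) == 0);
}
```

The test case itself calls signbitf, signbit and signbitl by their plain names, which is why its result depends on whether gcc decides to treat those calls as builtins when not optimizing.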


[Bug tree-optimization/88932] [8/9 Regression] ICE: verify_ssa failed (Error: definition in block 29 does not dominate use in block 25)

2019-01-31 Thread amker.cheng at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88932

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #4 from bin.cheng  ---
(In reply to Jakub Jelinek from comment #3)
> This has been approved for trunk, are you going to commit it?

Thanks for the reminder, I will commit it tomorrow.  I would also need an approval
for the 8 branch.

[Bug tree-optimization/82965] [8 regression][armeb] gcc.dg/vect/pr79347.c starts failing after r254379

2018-02-17 Thread amker.cheng at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82965

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #10 from bin.cheng  ---
A proposed patch: https://gcc.gnu.org/ml/gcc-patches/2018-01/msg02419.html

[Bug tree-optimization/28364] poor optimization choices when iterating over a std::string (probably not c++-specific)

2018-03-04 Thread amker.cheng at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28364

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #31 from bin.cheng  ---
This is a really old issue!  I will also check status of this issue on trunk.

[Bug tree-optimization/49498] [4.7/4.8 Regression]: gcc.dg/uninit-pred-8_b.c bogus warning line 20

2012-11-19 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49498



bin.cheng  changed:



   What|Removed |Added



 CC||amker.cheng at gmail dot com

--- Comment #17 from bin.cheng  2012-11-20 07:08:18 UTC ---

Hi,

I spent some time analyzing this bug and I think I understand the problem now.

For below dump file from trunk/cris-elf when compiling the attached k.c:

;; Function foo (foo, funcdef_no=0, decl_uid=1323, cgraph_uid=0)



;; 1 loops found

;;

;; Loop 0

;;  header 0, latch 1

;;  depth 0, outer -1

;;  nodes: 0 1 2 3 4 5 6 7 8 9 10 11

;; 2 succs { 10 3 }

;; 3 succs { 11 4 }

;; 4 succs { 11 }

;; 5 succs { 6 7 }

;; 6 succs { 9 }

;; 7 succs { 6 8 }

;; 8 succs { 9 }

;; 9 succs { 1 }

;; 10 succs { 5 6 }

;; 11 succs { 5 8 }

foo (int n, int l, int m, int r)

{

  int v;

  int g.1;

  int g.0;



  :

  if (n_4(D) <= 9)

goto ;

  else

goto ;



  :

  if (m_5(D) > 100)

goto ;

  else

goto ;



  :

  goto ;



  :

  # v_14 = PHI 

  g.0_9 = g;

  g.1_10 = g.0_9 + 1;

  g = g.1_10;

  if (n_4(D) <= 9)

goto ;

  else

goto ;



  :

  # v_17 = PHI 

  blah (v_17);

  goto ;



  :

  if (m_5(D) > 100)

goto ;

  else

goto ;



  :



  :

  return 0;



  :

  if (m_5(D) != 0)

goto ;

  else

goto ;



  :

  # v_13 = PHI 

  if (m_5(D) != 0)

goto ;

  else

goto ;



}



There are two flaws in tree-ssa-uninit.c that reveal this bug.

1. GCC tries to find def_chains from cd_root (the closest dominating bb for
phi_bb) to phi_bb, but only finds use_predicates from phi_bb to use_bb. In the
general case with a canonical CFG this is fine, but in a non-canonical CFG it
is possible for def_chains to contain an ancestor basic block of phi_bb with a
branch that never reaches phi_bb, like basic block 10 reported in this PR. In
this scenario the corresponding condition (edge<10, 5> in this case) should not
be counted in def_chains.

There are two methods to fix this:
   a) find use predicates from dom(phi_bb), rather than phi_bb, in
non-canonical CFGs.
   b) prune branch conditions that are irrelevant to this use/def in
def_chains.
Method a) is simpler, but the problem is that it results in more dep_chains,
which might exceed the limit MAX_NUM_CHAINS. As for method b), I don't yet have
any idea how to implement it.

2. When calling is_use_properly_guarded in find_uninit_use, GCC finds
predicates from the source basic block if the use_stmt is a phi node. This
results in a missing condition at the end of each def_chain. Unlike the first
issue, this can be easily fixed.
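
The idea behind method b) can be illustrated on the successor lists from the
dump above.  This is a toy sketch, not GCC code: reaches and the succs table
are hypothetical names, and block 0 is given no successors because the dump
does not list any.  An outgoing edge of an ancestor block is only relevant to
a chain if its destination can still reach the target block.

```c
#include <stdbool.h>
#include <string.h>

#define NUM_BB 12

/* Successor lists copied from the ";; N succs { ... }" lines of the dump.
   -1 terminates a list.  Block 1 is the exit block.  */
static const int succs[NUM_BB][3] = {
  { -1, -1, -1 },   /* bb 0: not listed in the dump */
  { -1, -1, -1 },   /* bb 1: exit */
  { 10,  3, -1 },   /* bb 2 */
  { 11,  4, -1 },   /* bb 3 */
  { 11, -1, -1 },   /* bb 4 */
  {  6,  7, -1 },   /* bb 5 */
  {  9, -1, -1 },   /* bb 6 */
  {  6,  8, -1 },   /* bb 7 */
  {  9, -1, -1 },   /* bb 8 */
  {  1, -1, -1 },   /* bb 9 */
  {  5,  6, -1 },   /* bb 10 */
  {  5,  8, -1 },   /* bb 11 */
};

/* Depth-first search helper; VISITED prevents revisiting blocks.  */
static bool
reaches_1 (int from, int to, bool *visited)
{
  if (from == to)
    return true;
  if (visited[from])
    return false;
  visited[from] = true;
  for (int i = 0; succs[from][i] >= 0; i++)
    if (reaches_1 (succs[from][i], to, visited))
      return true;
  return false;
}

/* Return true if block TO is reachable from block FROM.  */
bool
reaches (int from, int to)
{
  bool visited[NUM_BB];
  memset (visited, 0, sizeof visited);
  return reaches_1 (from, to, visited);
}
```

For example, taking block 8 as the target, the edge 10->5 can still reach it
(via block 7) while the edge 10->6 cannot, so a pruning step in this spirit
would ignore the condition attached to the latter.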


[Bug tree-optimization/55424] New: [4.8 Regression]gcc.dg/uninit-pred-8_b.c bogus warning line 23 on ARM/Cortex-M0/-Os

2012-11-21 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55424



 Bug #: 55424

   Summary: [4.8 Regression]gcc.dg/uninit-pred-8_b.c bogus warning

line 23 on ARM/Cortex-M0/-Os

Classification: Unclassified

   Product: gcc

   Version: 4.8.0

Status: UNCONFIRMED

  Severity: minor

  Priority: P3

 Component: tree-optimization

AssignedTo: unassig...@gcc.gnu.org

ReportedBy: amker.ch...@gmail.com





The test case requires optimization level "-O2", and it passes on ARM/cortex-m0
with "-O2", but the failure with "-Os" does reveal a potential bug in
tree-ssa-uninit.c.

Test command line:
arm-none-eabi-gcc ./uninit-pred-8_b.c -fno-diagnostics-show-caret
-Wuninitialized -fno-tree-dominator-opts -S -mthumb -mcpu=cortex-m0 -Os -o
uninit-pred-8_b.s

The warning info:

.../trunk-orig/gcc/gcc/testsuite/gcc.dg/uninit-pred-8_b.c: In function 'foo':

.../trunk-orig/gcc/gcc/testsuite/gcc.dg/uninit-pred-8_b.c:23:11: warning: 'v'

may be used uninitialized in this function [-Wmaybe-uninitialized]

.../trunk-orig/gcc/gcc/testsuite/gcc.dg/uninit-pred-8_b.c: In function 'foo_2':

.../trunk-orig/gcc/gcc/testsuite/gcc.dg/uninit-pred-8_b.c:42:11: warning: 'v'

may be used uninitialized in this function [-Wmaybe-uninitialized]



This failure occurs after checking in r193687. The patch prefers to generate

branches on ARM/cortex-m0.
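
The shape of code involved can be sketched as follows.  This is a
reconstruction from the predicates that appear in the dump of
tree-ssa-uninit.c later in this report (n <= 9, m > 100, r <= 19, l != 0),
not a verbatim copy of uninit-pred-8_b.c; the stub bodies of bar and blah and
the two counters are assumptions added so the sketch is self-contained.

```c
int g;
int bar_calls, blah_calls;   /* counters used only by this sketch */

void bar (void) { bar_calls++; }
void blah (int v) { (void) v; blah_calls++; }

int
foo (int n, int l, int m, int r)
{
  int v;

  /* Definition guard: v is set when any of the four conditions holds.  */
  if (n < 10 || m > 100 || r < 20 || l)
    v = r;

  if (m)
    g++;
  else
    bar ();

  /* Use guard: a subset of the definition guard, so v is always set.  */
  if (n < 10 || m > 100 || r < 20)
    blah (v);

  /* r < 10 implies r < 20, so this use is guarded as well.  */
  if (n < 10 || m > 100 || r < 10)
    blah (v);

  return 0;
}
```

Every use of v is dominated by a condition that implies the definition guard,
so a warning on either blah (v) call is a false positive.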



After investigating the dump from tree-ssa-uninit.c, I think:

tree-ssa-uninit.c computes control dependent chains for the uses/defs of a
variable and checks whether each use is guarded by a def. It has an upper bound
on the number of control dependent chains (MAX_NUM_CHAINS == 8) and simply
falls back to issuing a (possibly false) warning if the number of chains
exceeds MAX_NUM_CHAINS. In our scenario, the number of chains exceeds
MAX_NUM_CHAINS because we now prefer short circuiting, resulting in false
warnings. These false warnings cannot be fully removed as long as
MAX_NUM_CHAINS exists, but we can improve the situation in the following way:
Many of the control dependent chains currently computed in tree-ssa-uninit.c
are invalid and should be pruned. I have already implemented a quick fix and it
works for our scenario.

I am not sure it should be fixed in this way, so please comment if you have any
opinions.



Thanks



Dump of tree-ssa-uninit.c:



;; Function foo (foo, funcdef_no=0, decl_uid=4065, cgraph_uid=0)





Use in stmt v_24 = PHI 

is guarded by :

 (.NOT.) if (m_6(D) != 0)

Operand defs of phi v_1 = PHI 

is guarded by :

 (.NOT.) if (n_5(D) <= 9)

(.AND.)

 (.NOT.) if (m_6(D) > 100)

(.AND.)

if (r_7(D) <= 19)

(.OR.)

if (n_5(D) <= 9)

(.OR.)

 (.NOT.) if (n_5(D) <= 9)

(.AND.)

 (.NOT.) if (m_6(D) > 100)

(.AND.)

 (.NOT.) if (r_7(D) <= 19)

(.AND.)

if (l_8(D) != 0)

foo (int n, int l, int m, int r)

{

  int v;

  int g.1;

  int g.0;



  :

  if (n_5(D) <= 9)

goto ;

  else

goto ;



  :

  if (m_6(D) > 100)

goto ;

  else

goto ;



  :

  if (r_7(D) <= 19)

goto ;

  else

goto ;



  :

  if (l_8(D) != 0)

goto ;

  else

goto ;



  :



  :

  # v_1 = PHI 

  if (m_6(D) != 0)

goto ;

  else

goto ;



  :

  # v_25 = PHI 

  g.0_11 = g;

  g.1_12 = g.0_11 + 1;

  g = g.1_12;

  goto ;



  :

  bar ();



  :

  # v_24 = PHI 

  if (n_5(D) <= 9)

goto ;

  else

goto ;



  :

  if (m_6(D) > 100)

goto ;

  else

goto ;



  :

  if (r_7(D) <= 19)

goto ;

  else

goto ;



  :

  if (m_6(D) > 100)

goto ;

  else

goto ;



  :

  blah (v_24);

  if (n_5(D) <= 9)

goto ;

  else

goto ;



  :

  blah (v_24);

  goto ;



  :

  if (r_7(D) <= 9)

goto ;

  else

goto ;



  :

  return 0;



  :

  # v_22 = PHI 

  goto ;



}


[Bug tree-optimization/49498] [4.7/4.8 Regression]: gcc.dg/uninit-pred-8_b.c bogus warning line 20

2012-11-21 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49498



--- Comment #19 from bin.cheng  2012-11-21 13:24:02 UTC ---

(In reply to comment #18)

> *** Bug 55424 has been marked as a duplicate of this bug. ***



Just for the record.

If the analysis I gave in Comment #17 is right, this PR reveals another bug in

tree-ssa-uninit.c, apart from the limitation of MAX_NUM_CHAINS, while PR55424

is only about MAX_NUM_CHAINS.


[Bug rtl-optimization/54910] ARM: Missed optimization of very simple ctz function

2012-11-28 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54910



bin.cheng  changed:



   What|Removed |Added



 CC||amker.cheng at gmail dot com

--- Comment #2 from bin.cheng  2012-11-29 02:17:37 UTC ---

This is fixed if the ldr of the constant is replaced by movw/movt.
Unfortunately, the problem still exists on Thumb1/Cortex-M0, since it has no
movw/movt instructions.


[Bug tree-optimization/55906] New: suboptimal code generated for post-inc on Thumb1

2013-01-07 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55906



 Bug #: 55906

   Summary: suboptimal code generated for post-inc on Thumb1

Classification: Unclassified

   Product: gcc

   Version: 4.8.0

Status: UNCONFIRMED

  Severity: normal

  Priority: P3

 Component: tree-optimization

AssignedTo: unassig...@gcc.gnu.org

ReportedBy: amker.ch...@gmail.com





For the program below:

int
ffs(int word)
{
  int i;

  if (!word)
    return 0;

  i = 0;
  for (;;)
    {
      if (((1 << i++) & word) != 0)
        return i;
    }
}



The dump of 164t.optimized is like:

ffs (int word)

{

  int i;

  int _6;

  int _7;



  :

  if (word_3(D) == 0)

goto ;

  else

goto ;



  :



  :

  # i_1 = PHI <0(3), i_5(5)>

  i_5 = i_1 + 1;

  _6 = word_3(D) >> i_1;

  _7 = _6 & 1;

  if (_7 != 0)

goto ;

  else

goto ;



  :

  goto ;



  :

  # i_2 = PHI <0(2), i_5(4)>

  return i_2;



}

GCC increases i before i_1 is used, causing i_5 and i_1 to be partitioned into

different partitions as in expanded rtl:

2: r115:SI=r0:SI

3: NOTE_INSN_FUNCTION_BEG

9: pc={(r115:SI==0)?L33:pc}

  REG_BR_PROB 0xf3c

   10: NOTE_INSN_BASIC_BLOCK 4

4: r110:SI=0

   18: L18:

   11: NOTE_INSN_BASIC_BLOCK 5

   12: r111:SI=r110:SI+0x1    <- i_5/i_1 in different pseudos

   13: r116:SI=r115:SI>>r110:SI

   14: r118:SI=0x1

   15: r117:SI=r116:SI&r118:SI

  REG_EQUAL r116:SI&0x1

   16: pc={(r117:SI!=0)?L21:pc}

  REG_BR_PROB 0x384

   17: NOTE_INSN_BASIC_BLOCK 6

5: r110:SI=r111:SI

   19: pc=L18

   20: barrier

   33: L33:

   32: NOTE_INSN_BASIC_BLOCK 7

6: r111:SI=0

   21: L21:

   22: NOTE_INSN_BASIC_BLOCK 8

   23: r114:SI=r111:SI

   27: r0:SI=r114:SI

   30: use r0:SI



Finally, suboptimal code is generated:
ffs:
        mov     r3, #0
        push    {r4, lr}
        cmp     r0, r3
        beq     .L2
        mov     r2, r3
        mov     r1, #1
.L3:
        mov     r4, r0
        asr     r4, r4, r2
        add     r3, r2, #1
        tst     r4, r1
        bne     .L2
        mov     r2, r3
        b       .L3
.L2:
        mov     r0, r3
        @ sp needed
        pop     {r4, pc}

while GCC 4.6 generates better code:
ffs:
        push    {lr}
        sub     r3, r0, #0
        beq     .L2
        mov     r3, #0
        mov     r2, #1
.L3:
        mov     r1, r0
        asr     r1, r1, r3
        add     r3, r3, #1
        tst     r1, r2
        beq     .L3
.L2:
        mov     r0, r3
        @ sp needed for prologue
        pop     {pc}





The command line is:

arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os -S ffs.c -o ffs.S



Same problem exists when optimizing with "-O2"
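
For comparison, the two orderings of "use i" and "increment i" can be written
out at source level.  This is only a sketch: ffs_postinc is the reported
function under a different name (to avoid clashing with the C library's ffs),
and ffs_explicit is a hypothetical, functionally equivalent variant.  Whether
the second form actually yields the better Thumb1 code is not verified here.

```c
/* Variant from the report: the increment is folded into the shift.  */
int
ffs_postinc (int word)
{
  int i;

  if (!word)
    return 0;

  i = 0;
  for (;;)
    {
      if (((1 << i++) & word) != 0)
        return i;
    }
}

/* Hypothetical variant: test bit i first, then increment, so the old
   and new values of i are never both needed in the source.  */
int
ffs_explicit (int word)
{
  int i;

  if (!word)
    return 0;

  for (i = 0; ((1 << i) & word) == 0; i++)
    ;
  return i + 1;
}
```

Both functions return the 1-based index of the lowest set bit, or 0 when word
is 0.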


[Bug target/56058] New: GCC arm-none-eabi build failure

2013-01-20 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56058



 Bug #: 56058

   Summary: GCC arm-none-eabi build failure

Classification: Unclassified

   Product: gcc

   Version: 4.8.0

Status: UNCONFIRMED

  Severity: major

  Priority: P3

 Component: target

AssignedTo: unassig...@gcc.gnu.org

ReportedBy: amker.ch...@gmail.com





I configured the gcc with:

../gcc/configure 

--build=i686-linux-gnu

--host=i686-linux-gnu

--target=arm-none-eabi

--prefix=...

--disable-decimal-float

--disable-libffi

--disable-libgomp

--disable-libmudflap

--disable-libquadmath

--disable-libssp 

--disable-libstdcxx-pch 

--disable-lto 

--disable-nls

--disable-shared

--disable-threads

--disable-tls

--with-gnu-as

--with-gnu-ld

--with-newlib 

--with-headers=yes

--with-sysroot=...

--with-gmp=... --with-mpfr=... --with-mpc=... --with-ppl=...

--with-cloog=... --with-libelf=...

--with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm'

--enable-languages=c,c++



And it failed with message:



build/gengtype  \

-S ../../gcc/gcc -I gtyp-input.list -w tmp-gtype.state

/bin/sh ../../gcc/gcc/../move-if-change tmp-gtype.state gtype.state

build/gengtype  \

-r gtype.state

echo timestamp > s-gtype

build/genattrtab ../../gcc/gcc/config/arm/arm.md insn-conditions.md \

-Atmp-attrtab.c -Dtmp-dfatab.c -Ltmp-latencytab.c

genattrtab: unknown value `alu' for `type' attribute

make[1]: *** [s-attrtab] Error 1

make[1]: Leaving directory

`/home/binche01/work/gcc-patches/arm-none-eabi/trunk-scan_one_insn/build/gcc'

make: *** [all-gcc] Error 2





The build works if I revert r195295.


[Bug target/56102] New: Wrong rtx cost calculated for Thumb1

2013-01-24 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56102



 Bug #: 56102

   Summary: Wrong rtx cost calculated for Thumb1

Classification: Unclassified

   Product: gcc

   Version: 4.8.0

Status: UNCONFIRMED

  Severity: normal

  Priority: P3

 Component: target

AssignedTo: unassig...@gcc.gnu.org

ReportedBy: amker.ch...@gmail.com





For the program below:

double g = 1.0;
double func(int a, double d)
{
    if (a > 0)
        return 0.0 + g;
    else
        return 2.0 + d;
}



compiling with:

./arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os test.c -S -o test.S



The assembly code is:
        .cpu cortex-m0
        .fpu softvfp
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 1
        .eabi_attribute 30, 4
        .eabi_attribute 34, 0
        .eabi_attribute 18, 4
        .code   16
        .file   "main.c"
        .global __aeabi_dadd
        .text
        .align  1
        .global func
        .code   16
        .thumb_func
        .type   func, %function
func:
        push    {r3, lr}
        cmp     r0, #0
        ble     .L2
        ldr     r3, .L6+16
        ldr     r0, [r3]
        ldr     r1, [r3, #4]
        ldr     r3, .L6+4
        ldr     r2, .L6
        b       .L4
.L2:
        mov     r0, r2
        mov     r1, r3
        ldr     r2, .L6+8
        ldr     r3, .L6+12
.L4:
        bl      __aeabi_dadd
        @ sp needed
        pop     {r3, pc}
.L7:
        .align  3
.L6:
        .word   0
        .word   0
        .word   0
        .word   1073741824
        .word   .LANCHOR0
        .size   func, .-func
        .global g
        .data
        .align  3
        .set    .LANCHOR0,. + 0
        .type   g, %object
        .size   g, 8
g:
        .word   0
        .word   1072693248
        .ident  "GCC: (GNU) 4.8.0 20130122 (experimental)"

The problem is that the double-word constants aren't split by GCC, which
results in bigger code size.


[Bug target/56102] Wrong rtx cost calculated for Thumb1

2013-01-24 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56102



--- Comment #1 from bin.cheng  2013-01-25 03:46:59 UTC ---

I have investigated this issue.

GCC uses the function init_lower_subreg to initialize the costs of MOVE insns
in different modes, then uses this information to decide whether to decompose
multi-word pseudo registers into individual word-sized registers.

The problem is that the ARM backend returns the wrong rtx cost for a SET insn
in a multi-word mode. Specifically, if you define LOG_COSTS in lower-subreg.c,
GCC will dump rtx costs when compiling with:

arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os/-O2



The dump is:

Size costs

==



SI move: from zero cost 4, from reg cost 4

DI move: original cost 4, split cost 4 * 2

TI move: original cost 4, split cost 4 * 4

EI move: original cost 4, split cost 4 * 6

OI move: original cost 4, split cost 4 * 8

CI move: original cost 4, split cost 4 * 12

XI move: original cost 4, split cost 4 * 16

DQ move: original cost 4, split cost 4 * 2

TQ move: original cost 4, split cost 4 * 4

UDQ move: original cost 4, split cost 4 * 2

UTQ move: original cost 4, split cost 4 * 4

DA move: original cost 4, split cost 4 * 2

TA move: original cost 4, split cost 4 * 4

UDA move: original cost 4, split cost 4 * 2

UTA move: original cost 4, split cost 4 * 4

DF move: original cost 4, split cost 4 * 2

XF move: original cost 4, split cost 4 * 3

DD move: original cost 4, split cost 4 * 2

TD move: original cost 4, split cost 4 * 4

CSI move: original cost 4, split cost 4 * 2

CDI move: original cost 4, split cost 4 * 4

CTI move: original cost 4, split cost 4 * 8

CEI move: original cost 4, split cost 4 * 12

COI move: original cost 4, split cost 4 * 16

CCI move: original cost 4, split cost 4 * 24

CXI move: original cost 4, split cost 4 * 32

SC move: original cost 4, split cost 4 * 2

DC move: original cost 4, split cost 4 * 4

XC move: original cost 4, split cost 4 * 6

V8QI move: original cost 4, split cost 4 * 2

V4HI move: original cost 4, split cost 4 * 2

V2SI move: original cost 4, split cost 4 * 2

V16QI move: original cost 4, split cost 4 * 4

V8HI move: original cost 4, split cost 4 * 4

V4SI move: original cost 4, split cost 4 * 4

V2DI move: original cost 4, split cost 4 * 4

V4HF move: original cost 4, split cost 4 * 2

V2SF move: original cost 4, split cost 4 * 2

V8HF move: original cost 4, split cost 4 * 4

V4SF move: original cost 4, split cost 4 * 4

V2DF move: original cost 4, split cost 4 * 4



Speed costs

===



SI move: from zero cost 4, from reg cost 4

DI move: original cost 4, split cost 4 * 2

TI move: original cost 4, split cost 4 * 4

EI move: original cost 4, split cost 4 * 6

OI move: original cost 4, split cost 4 * 8

CI move: original cost 4, split cost 4 * 12

XI move: original cost 4, split cost 4 * 16

DQ move: original cost 4, split cost 4 * 2

TQ move: original cost 4, split cost 4 * 4

UDQ move: original cost 4, split cost 4 * 2

UTQ move: original cost 4, split cost 4 * 4

DA move: original cost 4, split cost 4 * 2

TA move: original cost 4, split cost 4 * 4

UDA move: original cost 4, split cost 4 * 2

UTA move: original cost 4, split cost 4 * 4

DF move: original cost 4, split cost 4 * 2

XF move: original cost 4, split cost 4 * 3

DD move: original cost 4, split cost 4 * 2

TD move: original cost 4, split cost 4 * 4

CSI move: original cost 4, split cost 4 * 2

CDI move: original cost 4, split cost 4 * 4

CTI move: original cost 4, split cost 4 * 8

CEI move: original cost 4, split cost 4 * 12

COI move: original cost 4, split cost 4 * 16

CCI move: original cost 4, split cost 4 * 24

CXI move: original cost 4, split cost 4 * 32

SC move: original cost 4, split cost 4 * 2

DC move: original cost 4, split cost 4 * 4

XC move: original cost 4, split cost 4 * 6

V8QI move: original cost 4, split cost 4 * 2

V4HI move: original cost 4, split cost 4 * 2

V2SI move: original cost 4, split cost 4 * 2

V16QI move: original cost 4, split cost 4 * 4

V8HI move: original cost 4, split cost 4 * 4

V4SI move: original cost 4, split cost 4 * 4

V2DI move: original cost 4, split cost 4 * 4

V4HF move: original cost 4, split cost 4 * 2

V2SF move: original cost 4, split cost 4 * 2

V8HF move: original cost 4, split cost 4 * 4

V4SF move: original cost 4, split cost 4 * 4

V2DF move: original cost 4, split cost 4 * 4





The original MOVE insn in a multi-word mode has a lower cost than the split
insns, thus preventing gcc from splitting.

The root cause is that thumb1_rtx_costs/thumb1_size_rtx_costs do not handle
SET/ASHIFT/ASHIFTRT/LSHIFTRT/ROTATERT patterns in multi-word modes the way
rtx_cost does.
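
The effect of the dumped numbers can be illustrated with a toy model of the
decision.  It is a sketch only: should_split_move is a hypothetical name, and
the comparison is an assumption about how lower-subreg.c weighs the "original
cost" against the "split cost" printed above, not a copy of its code.

```c
#include <stdbool.h>

/* Toy model: split a multi-word move when the FACTOR word-sized moves
   are no more expensive than the single wide move.  */
bool
should_split_move (int original_cost, int word_cost, int factor)
{
  return original_cost >= word_cost * factor;
}
```

With the values from the "DI move" line (original cost 4, split cost 4 * 2)
the model declines to split.  If the backend reported a cost of 8 for the DI
move, the same model would choose to split.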



I am working on this and will send a patch.


[Bug target/56102] Wrong rtx cost calculated for Thumb1

2013-01-24 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56102



--- Comment #2 from bin.cheng  2013-01-25 07:25:34 UTC ---

Created attachment 29270
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29270
correct test case

The previous test case is not appropriate, because gcc won't split even with
the correct thumb1_rtx_cost.
The right test case is attached here.


[Bug rtl-optimization/56124] New: Redundant reload for loading from memory

2013-01-27 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56124



 Bug #: 56124

   Summary: Redundant reload for loading from memory

Classification: Unclassified

   Product: gcc

   Version: 4.8.0

Status: UNCONFIRMED

  Severity: normal

  Priority: P3

 Component: rtl-optimization

AssignedTo: unassig...@gcc.gnu.org

ReportedBy: amker.ch...@gmail.com





For the test case below:

typedef __builtin_va_list __gnuc_va_list;
typedef __gnuc_va_list va_list;

struct _reent
{
    int _stdout;
};
struct _reent *_impure_ptr;
int bar (struct _reent *, int, const char *, va_list);
int
foo(const char *fmt , ...)
{
  int ret;
  va_list ap;
  struct _reent *ptr = _impure_ptr;

  __builtin_va_start(ap,fmt);
  ret = bar (ptr, ((ptr)->_stdout), fmt, ap);
  __builtin_va_end(ap);
  return ret;
}



The dump of reload pass is:

1: NOTE_INSN_DELETED

4: NOTE_INSN_BASIC_BLOCK 2

   28: r3:SI=sp:SI+0x10

  REG_EQUAL sp:SI+0x10

2: r2:SI=[r3:SI++]

  REG_INC r3:SI

  REG_EQUIV [afp:SI]

   31: [sp:SI+0x10]=r2:SI

3: NOTE_INSN_FUNCTION_BEG

6: r2:SI=[`*.LC0']

  REG_EQUIV `_impure_ptr'

7: r0:SI=[r2:SI]

9: [sp:SI+0x4]=r3:SI

   10: r1:SI=[r0:SI]

   14: r2:SI=[sp:SI+0x10]

   16: r0:SI=call [`bar'] argc:0

   25: use r0:SI

   29: NOTE_INSN_DELETED



which could be:



1: NOTE_INSN_DELETED

4: NOTE_INSN_BASIC_BLOCK 2

   28: r3:SI=sp:SI+0x10

  REG_EQUAL sp:SI+0x10

2: r2:SI=[r3:SI++]

  REG_INC r3:SI

  REG_EQUIV [afp:SI]

3: NOTE_INSN_FUNCTION_BEG

6: r1:SI=[`*.LC0']

  REG_EQUIV `_impure_ptr'

7: r0:SI=[r1:SI]

9: [sp:SI+0x4]=r3:SI

   10: r1:SI=[r0:SI]

   16: r0:SI=call [`bar'] argc:0

   25: use r0:SI

   29: NOTE_INSN_DELETED



Clearly, insns 31 and 14 are generated/kept by a redundant reload.



The command line is:

arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os ...


[Bug rtl-optimization/56124] Redundant reload for loading from memory

2013-01-27 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56124



--- Comment #1 from bin.cheng  2013-01-28 02:43:10 UTC ---

The root cause is in IRA's scan_one_insn function.

It decreases the memory cost for a pseudo that is the target of a load from
memory:

  if (set != 0 && REG_P (SET_DEST (set)) && MEM_P (SET_SRC (set))
      && (note = find_reg_note (insn, REG_EQUIV, NULL_RTX)) != NULL_RTX
      && ((MEM_P (XEXP (note, 0)))
          || (CONSTANT_P (XEXP (note, 0))
              && targetm.legitimate_constant_p (GET_MODE (SET_DEST (set)),
                                                XEXP (note, 0))
              && REG_N_SETS (REGNO (SET_DEST (set))) == 1))
      && general_operand (SET_SRC (set), GET_MODE (SET_SRC (set))))
    {
      enum reg_class cl = GENERAL_REGS;
      rtx reg = SET_DEST (set);
      int num = COST_INDEX (REGNO (reg));

      COSTS (costs, num)->mem_cost
        -= ira_memory_move_cost[GET_MODE (reg)][cl][1] * frequency;
      record_address_regs (GET_MODE (SET_SRC (set)),
                           MEM_ADDR_SPACE (SET_SRC (set)),
                           XEXP (SET_SRC (set), 0), 0, MEM, SCRATCH,
                           frequency * 2);
      counted_mem = true;
    }

The problem is that if the source memory rtx (like in insn 2) has side
effects, the original load insn won't be eliminated, which causes a redundant
reload.

A patch will be sent for review.


[Bug tree-optimization/56139] New: unmodified static data could go in .rodata, not .data

2013-01-29 Thread amker.cheng at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56139



 Bug #: 56139

   Summary: unmodified static data could go in .rodata, not .data

Classification: Unclassified

   Product: gcc

   Version: 4.8.0

Status: UNCONFIRMED

  Severity: enhancement

  Priority: P3

 Component: tree-optimization

AssignedTo: unassig...@gcc.gnu.org

ReportedBy: amker.ch...@gmail.com





For the program below:

static int x[] = {1, 2, 3, 4};

void bar (int x);
int func(int i)
{
    int * const p = (int * const)&x;
    bar(p[i]);

    return 0;
}



build with:

arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os ...



The generated assembly code is:
        .text
        .align  1
        .global func
        .code   16
        .thumb_func
        .type   func, %function
func:
        push    {r3, lr}
        ldr     r3, .L2
        lsl     r0, r0, #2
        ldr     r0, [r0, r3]
        bl      bar
        @ sp needed for prologue
        mov     r0, #0
        pop     {r3, pc}
.L3:
        .align  2
.L2:
        .word   .LANCHOR0
        .size   func, .-func
        .data
        .align  2
        .set    .LANCHOR0,. + 0
        .type   x, %object
        .size   x, 16
x:
        .word   1
        .word   2
        .word   3
        .word   4



while GCC 4.6 puts x in .rodata.
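
Independently of what the compiler infers, the usual source-level way to get
such data into a read-only section is to declare it const.  This is a minimal
sketch, assuming a typical ELF target where const objects with static storage
and constant initializers are placed in .rodata; the bar stub and last_arg
are hypothetical additions so the example is self-contained.

```c
/* const-qualified, so the array may be placed in a read-only section.  */
static const int x[] = {1, 2, 3, 4};

int last_arg;                      /* stub state, sketch only */
void bar (int v) { last_arg = v; }

int
func (int i)
{
  const int *const p = x;
  bar (p[i]);

  return 0;
}
```

This does not address the report itself, which is about the compiler noticing
on its own that an unmodified static array never changes.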


[Bug target/53090] suboptimal ivopt

2014-03-29 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53090

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #2 from bin.cheng  ---
I tried the simple case; gcc doesn't work as expected on x86_64, but x86 is
fine.

I think there are several issues in ivopt causing this.  The first issue is
that IVOPT is too conservative when representing an iv_use with an iv_cand
whose type has smaller precision.
Consider below use/cand:
use 0
  address
  in statement t_mp_11 = *_10;

  at position *_10
  type int *
  base perm_9(D) + 4
  step 4
  base object (void *) perm_9(D)
  related candidates 
candidate 5 (important)
  var_before ivtmp.8
  var_after ivtmp.8
  incremented before exit test
  type unsigned int
  base 1
  step 1
candidate 6 (important)
  original biv
  type int
  base 1
  step 1

Use 0 is in type "int *" which has precision 64 on x86_64; the cand is in type
"int" which has precision 32 on x86_64.  In function get_computation_cost_at,
there is the code below:

  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
    {
      /* We do not have a precision to express the values of use.  */
      return infinite_cost;
    }
But this is too conservative because the loop runs "(j-i)/2" times, which can
be expressed by the candidate even though the candidate has a smaller type
than the iv_use.

We should add some code here that checks the loop niters against the
candidate's coverage.  For example, the generated assembly then changes into:
.L14:
        movl    (%rdx), %edi
        movslq  %eax, %rcx
        addl    $1, %eax
        movl    (%r15,%rcx,4), %esi
        subq    $4, %rdx
        movl    %edi, (%r15,%rcx,4)
        movl    %r8d, %ecx
        subl    %eax, %ecx
        movl    %esi, 4(%rdx)
        cmpl    %ecx, %eax
        jl      .L14

Now the original candidate is chosen as rcs for original induction variable
"i".  Unfortunately there are some other issues which prevent IVOPT from
choosing the right candidate for the original induction variable "j".

I will keep looking into it to see what's going on.
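
The kind of check meant by "loop niters against candidate's coverage" can be
sketched as plain C.  This is a toy model, not IVOPT code: type_max and
cand_covers_niters are hypothetical names, the candidate is assumed to have
step 1, and the iteration count is assumed to be known as a number.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest value representable in a type of PRECISION bits (1..64).  */
static uint64_t
type_max (unsigned precision, bool is_unsigned)
{
  unsigned value_bits = is_unsigned ? precision : precision - 1;
  return value_bits >= 64 ? UINT64_MAX : (UINT64_C (1) << value_bits) - 1;
}

/* Return true if a candidate starting at BASE with step 1 can count
   NITERS iterations without exceeding its type's maximum.  */
bool
cand_covers_niters (unsigned precision, bool is_unsigned,
                    uint64_t base, uint64_t niters)
{
  uint64_t max = type_max (precision, is_unsigned);
  return base <= max && niters <= max - base;
}
```

For the 32-bit "int" candidate with base 1 above, any loop that runs fewer
than 2^31 - 1 iterations passes this check, so the narrower candidate could be
used even though the use has 64-bit precision.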


[Bug tree-optimization/60363] [4.9 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4

2014-03-31 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363

--- Comment #10 from bin.cheng  ---
Patch sent at http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00857.html , but it
needs to wait for stage 1.

I will xfail it for now.


[Bug tree-optimization/60363] [4.9/4.10 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4

2014-05-06 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363

--- Comment #15 from bin.cheng  ---
Should be fixed now.


[Bug target/61367] New: Annoying rtx cost information in middle end dumps on arm/aarch64 targets

2014-05-29 Thread amker.cheng at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61367

Bug ID: 61367
   Summary: Annoying rtx cost information in middle end dumps on
arm/aarch64 targets
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amker.cheng at gmail dot com

Created attachment 32877
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=32877&action=edit
zipped dump files.

Given a simple program like:

#define LEN (32000)

__attribute__((aligned(16))) float a[LEN],b[LEN];

int s174 (int M)
{
  for (int i = 0; i < M; i++)
    {
      a[i+M] = a[i] + b[i];
    }
  return 0;
}

Build with -O2/-O3 and the -fdump-tree-all-details -fdump-rtl-all-details
options.  The middle-end's dump files contain lots of rtx cost information,
which clutters the real dump information.  The dump files of ivopt/cse2 are
attached to show this annoying problem.


[Bug target/61411] [NEON] ICE in reload_cse_simplify_operands, at postreload.c:411

2014-06-04 Thread amker.cheng at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61411

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com,
   ||mshawcroft at gcc dot gnu.org,
   ||vmakarov at gcc dot gnu.org

--- Comment #1 from bin.cheng  ---
The patch can fix the issue, but the question is why GCC/LRA generated a
register-indexing ([reg+reg]) addressing mode for V8HImode in the first place,
since without this patch the address expression is illegal and shouldn't be
generated.  I didn't look into LRA's code and am not sure whether this patch
is just covering up the problem.

Also added Marcus and Vlad to the CC list.


[Bug target/61411] [NEON] ICE in reload_cse_simplify_operands, at postreload.c:411

2014-06-05 Thread amker.cheng at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61411

--- Comment #3 from bin.cheng  ---
Then I think it's a latent bug in LRA.  It should consult back-end about
legitimized address expressions.


[Bug tree-optimization/60280] New: gcc.target/arm/ivopts.c and gcc.target/arm/ivopts-2.c failed caused by preserving loop structure.

2014-02-19 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60280

Bug ID: 60280
   Summary: gcc.target/arm/ivopts.c and gcc.target/arm/ivopts-2.c
failed caused by preserving loop structure.
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amker.cheng at gmail dot com

gcc.target/arm/ivopts-2.c is like:

/* { dg-do assemble } */
/* { dg-options "-Os -fdump-tree-ivopts -save-temps" } */

extern void foo2 (short*);

void
tr4 (short array[], int n)
{
  int x;
  if (n > 0)
    for (x = 0; x < n; x++)
      foo2 (&array[x]);
}

/* { dg-final { scan-tree-dump-times "PHI 

[Bug tree-optimization/60280] gcc.target/arm/ivopts.c and gcc.target/arm/ivopts-2.c failed caused by preserving loop structure.

2014-02-19 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60280

--- Comment #1 from bin.cheng  ---
It's caused by the patch at (revision r198333):
http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01530.html

After that patch, forwarder basic block 6 in the dump below doesn't get
removed:
tr4 (short int * array, int n)
{
  int x;
  unsigned int x.0;
  unsigned int _7;
  short int * _9;

  :
  if (n_4(D) > 0)
goto ;
  else
goto ;

  :

  :
  # x_14 = PHI 
  x.0_6 = (unsigned int) x_14;
  _7 = x.0_6 * 2;
  _9 = array_8(D) + _7;
  foo2 (_9);
  x_11 = x_14 + 1;
  if (x_11 < n_4(D))
goto ;
  else
goto ;

  :
  return;

  :
  goto ;
}

After expanding, pre-header is filled with pre-loop initialization instructions
and the problem turns into a cfglayout problem:
5: NOTE_INSN_BASIC_BLOCK 2
2: r115:SI=r0:SI
  REG_DEAD r0:SI
3: NOTE_INSN_DELETED
4: NOTE_INSN_FUNCTION_BEG
7: {cc:CC=cmp(r1:SI,0);r116:SI=r1:SI;}
  REG_DEAD r1:SI
8: pc={(cc:CC>0)?L24:pc}
  REG_DEAD cc:CC
  REG_BR_PROB 0x1f98
;;  succ:   4
;;  5
   29: L29:
   13: NOTE_INSN_BASIC_BLOCK 3
   14: r0:SI=r110:SI
   15: call [`foo2'] argc:0
  REG_DEAD r0:SI
   16: r110:SI=r110:SI+0x2
   18: cc:CC=cmp(r110:SI,r114:SI)
   19: pc={(cc:CC!=0)?L29:pc}
  REG_DEAD cc:CC
  REG_BR_PROB 0x2333
;;  succ:   3
;;  5
   24: L24:
   25: NOTE_INSN_BASIC_BLOCK 4
   26: r110:SI=r115:SI
  REG_DEAD r115:SI
   27: NOTE_INSN_DELETED
   28: r114:SI=r116:SI*0x2+r110:SI
  REG_DEAD r116:SI
;;  succ:   3
   32: L32:
   33: NOTE_INSN_BASIC_BLOCK 5
;;  succ:   EXIT


After outof_cfglayout, a jump (in bb3) to exit block is introduced:
5: NOTE_INSN_BASIC_BLOCK 2
3: NOTE_INSN_DELETED
4: NOTE_INSN_FUNCTION_BEG
7: {cc:CC=cmp(r1:SI,0);r1:SI=r1:SI;}
8: pc={(cc:CC>0)?L24:pc}
  REG_BR_PROB 0x1f98
;;  succ:   6
;;  3
   55: NOTE_INSN_BASIC_BLOCK 3
   56: pc=L32
;;  succ:   7
   29: L29:
   13: NOTE_INSN_BASIC_BLOCK 4
   14: r0:SI=r4:SI
   15: call [`foo2'] argc:0
   16: r4:SI=r4:SI+0x2
   18: cc:CC=cmp(r4:SI,r5:SI)
   19: pc={(cc:CC!=0)?L29:pc}
  REG_BR_PROB 0x2333
;;  succ:   4
;;  5
   58: NOTE_INSN_BASIC_BLOCK 5
   59: pc=L32
;;  succ:   7
   24: L24:
   25: NOTE_INSN_BASIC_BLOCK 6
   26: r4:SI=r0:SI
   27: NOTE_INSN_DELETED
   28: r5:SI=r1:SI*0x2+r4:SI
   61: pc=L29
;;  succ:   4
   32: L32:
   33: NOTE_INSN_BASIC_BLOCK 7
;;  succ:   EXIT

Ideally, basic block reordering could fix this, but before that, the
pro_and_epilogue pass threads the jump in bb3 to a direct return instruction,
and bb reordering can no longer do anything.

So:
1) Unless we can teach passes before pro_and_epilogue to do some bb reordering
work, it's inappropriate to fix it on RTL.
2) It's natural to fix it on GIMPLE, but that is disruptive because the cfg
code is shared by all GIMPLE (and even RTL) optimizers. Yet this method makes
more sense than 1).

I am trying to work out a less intrusive patch for stage 4.
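
For readers less used to RTL, the loop in the expanded dump above corresponds
roughly to the pointer form below.  This is a sketch for illustration: tr4_ptr
is a hypothetical name, the foo2 stub and its two globals are assumptions
added so the code is self-contained, and the register names in the comments
refer to the first RTL dump in this comment.

```c
int foo2_calls;          /* stub state, sketch only */
short *foo2_last;

void foo2 (short *p) { foo2_calls++; foo2_last = p; }

/* Index form, as in the test case.  */
void
tr4 (short array[], int n)
{
  int x;
  if (n > 0)
    for (x = 0; x < n; x++)
      foo2 (&array[x]);
}

/* Pointer form, mirroring the RTL: one induction variable stepped by
   sizeof (short) and compared against a precomputed end address.  */
void
tr4_ptr (short array[], int n)
{
  if (n > 0)
    {
      short *p = array;          /* r110 = array */
      short *end = array + n;    /* r114 = n * 2 + r110 */
      do
        foo2 (p++);
      while (p != end);
    }
}
```

Both forms call foo2 on the same n addresses.  The second one shows why the
pre-header (the block that sets up p and end) matters: its placement relative
to the loop body determines whether an extra jump is needed.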


[Bug tree-optimization/60280] [4.9 Regression] gcc.target/arm/ivopts.c and gcc.target/arm/ivopts-2.c failed caused by preserving loop structure.

2014-02-20 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60280

--- Comment #3 from bin.cheng  ---
I think 4_8 is OK for this case.  If I am right, it at least doesn't have
http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01530.html committed.


[Bug tree-optimization/60280] [4.9 Regression] gcc.target/arm/ivopts.c and gcc.target/arm/ivopts-2.c failed caused by preserving loop structure.

2014-02-25 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60280

bin.cheng  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from bin.cheng  ---
Patch applied.  Fixed, I think.


[Bug regression/60363] [4.9 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4

2014-03-09 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363

bin.cheng  changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #2 from bin.cheng  ---
Created attachment 32315
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32315&action=edit
tar of dump files.


[Bug regression/60363] [4.9 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4

2014-03-09 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363

--- Comment #3 from bin.cheng  ---
After applying 208165, there are two more jump threading opportunities for the
dom1 pass.  Jump threading itself is doing fine; the interesting question is why
there were no such opportunities before the patch.

I attached the related dump files with and without the patch.  The dumps before
the vrp1 pass look pretty similar, while after vrp1 the dump with the patch shows
the two additional jump threading opportunities.  In other words, they are
somehow already fixed (not introduced) in the vrp1 pass when the patch is not
applied.

For now I can just change ssa-dom-thread-4.c to expect the two extra jump
threadings, or should I first look into vrp to find where the difference comes
from?


[Bug regression/60363] [4.9 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4

2014-03-09 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363

--- Comment #4 from bin.cheng  ---
Although it may be irrelevant, I found that the loop's latch doesn't get updated
after removing the forwarder latch basic block.  The previous patch only covers
function remove_forwarder_block, but remove_forwarder_block_with_phi should be
handled too.

I will send a patch picking this up.
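
To make the latch issue concrete, here is a minimal sketch on a toy CFG. It
assumes a forwarder block with a single predecessor, and none of it is GCC's
real cfgloop code: struct block, struct loop and remove_forwarder are all
hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical miniature CFG; these are not GCC's basic_block or
   struct loop types.  Each block has at most one successor here.  */
struct block
{
  int index;
  struct block *succ;
};

struct loop
{
  struct block *header;
  struct block *latch;
};

/* Remove the forwarder block FWD, whose single predecessor is PRED,
   by making PRED jump straight to FWD's successor.  If FWD was the
   latch of LP, record PRED as the new latch so that the loop
   structure does not keep pointing at a removed block.  */
void
remove_forwarder (struct loop *lp, struct block *pred, struct block *fwd)
{
  pred->succ = fwd->succ;
  fwd->succ = NULL;

  if (lp->latch == fwd)
    lp->latch = pred;
}
```

Without the final check, lp->latch would still name the removed block, which is
the kind of stale loop information described above.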


[Bug regression/60363] [4.9 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4

2014-03-11 Thread amker.cheng at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363

--- Comment #5 from bin.cheng  ---
Vrp1 generates the code below:


  :
  if (b_elt_11(D) != 0B)
goto ;
  else
goto ;

  :
  # kill_elt_10 = PHI 
  goto ;

  :
  kill_elt_14 = kill_elt_2->next;

  :
  # kill_elt_2 = PHI 
  if (kill_elt_2 != 0B)
goto ;
  else
goto ;

  :
  _12 = kill_elt_2->indx;
  _13 = b_elt_11(D)->indx;
  if (_12 < _13)
goto ;
  else
goto ;

...


  :
  goto ;

  :
  # kill_elt_41 = PHI <0B(6)>
  if (b_elt_11(D) != 0B)
goto ;
  else
goto ;

The whole bb 19 is unnecessary since we know "b_elt_11(D) != 0" holds.
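
The redundancy can be shown at the source level with a small hypothetical
reduction. This is not the actual ssa-dom-thread-4.c test: struct elt,
skip_with_retest and skip_threaded are names assumed for illustration, and the
two functions are only meant to show that dropping the second test does not
change behaviour.

```c
#include <stddef.h>

struct elt
{
  struct elt *next;
  unsigned int indx;
};

/* Written the way the dump reads: after the walk runs off the end of
   the kill list, B is tested again although this path is reachable
   only when B is non-null.  Returns the number of skipped elements,
   or -1 when B is null.  */
int
skip_with_retest (const struct elt *kill, const struct elt *b)
{
  int skipped = 0;

  if (b != NULL)
    {
      while (kill != NULL && kill->indx < b->indx)
        {
          kill = kill->next;
          skipped++;
        }
      if (kill == NULL)
        {
          /* Redundant: b != NULL already holds here.  */
          if (b != NULL)
            return skipped;
          return -1;
        }
      return skipped;
    }
  return -1;
}

/* The same function with the redundant test removed.  */
int
skip_threaded (const struct elt *kill, const struct elt *b)
{
  int skipped = 0;

  if (b == NULL)
    return -1;
  while (kill != NULL && kill->indx < b->indx)
    {
      kill = kill->next;
      skipped++;
    }
  return skipped;
}
```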

