http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49057
           Summary: benchmark of gcc: a loop compiled by gcc-4.5.1 is
                    slower than the same loop compiled by gcc-4.4.2 when
                    run on cortex-a9
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: kun...@mediatek.com

The following C code is used for an "integer add" test. The variables n, i,
i1, i2 and loop_cnt are all of type 'int'. The initial values are
loop_cnt=5000000, i=0, i1=3, i2=-3.

    for (n = loop_cnt; n > 0; n--) {
                        /*  i   i1  i2                  */
                        /*  0   x   -x  - initial value */
        i  += i1;       /*  x   x   -x  */
        i1 += i2;       /*  x   0   -x  */
        i1 += i2;       /*  x   -x  -x  */
        i2 += i;        /*  x   -x  0   */
        i2 += i;        /*  x   -x  x   */
        i  += i1;       /*  0   -x  x   */
        i  += i1;       /*  -x  -x  x   */
        i1 += i2;       /*  -x  0   x   */
        i1 += i2;       /*  -x  x   x   */
        i2 += i;        /*  -x  x   0   */
        i2 += i;        /*  -x  x   -x  */
        i  += i1;       /*  0   x   -x  */
        /*
         * Note that at loop end, i1 = -i2
         * which is as we started.  Thus,
         * the values in the loop are stable
         */
    }

I compiled this C code with gcc-4.4.2 and with gcc-4.5.1, and the two
compilers generate different binary code for the loop.

gcc-4.4.2:

    284:   e0800003    add   r0, r0, r3
    288:   e2511001    subs  r1, r1, #1    ; 0x1
    28c:   e0833082    add   r3, r3, r2, lsl #1
    290:   e0822080    add   r2, r2, r0, lsl #1
    294:   e0800083    add   r0, r0, r3, lsl #1
    298:   e0833082    add   r3, r3, r2, lsl #1
    29c:   e0822080    add   r2, r2, r0, lsl #1
    2a0:   e0830000    add   r0, r3, r0
    2a4:   1afffff6    bne   284 <add_int+0x4c>

gcc-4.5.1:

    138:   e0800003    add   r0, r0, r3
    13c:   e0833082    add   r3, r3, r2, lsl #1
    140:   e0822080    add   r2, r2, r0, lsl #1
    144:   e2511001    subs  r1, r1, #1
    148:   e0800083    add   r0, r0, r3, lsl #1
    14c:   e0833082    add   r3, r3, r2, lsl #1
    150:   e0822080    add   r2, r2, r0, lsl #1
    154:   e0830000    add   r0, r3, r0
    158:   1afffff6    bne   138 <add_int+0x4c>

As you can see, the only difference is the position of "subs r1, r1, #1":
gcc-4.4.2 places it second in the loop body, while gcc-4.5.1 places it
fourth. This difference alone leads to a large difference in performance:
the gcc-4.5.1 code reaches only about 80% of the performance of the
gcc-4.4.2 code.
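
To make the comparison easier to reproduce, the loop from this report can be wrapped in a small timing harness. What follows is only a sketch under stated assumptions, not the reporter's actual benchmark: the function name add_int is taken from the disassembly symbol, but its signature, the result struct, and the time_add_int helper are hypothetical, and the report does not say how the loop was timed or which compiler options were used.

```c
#include <time.h>

/* Hypothetical helper type: final values of the three accumulators. */
struct add_int_result {
    int i, i1, i2;
};

/*
 * The "integer add" loop from the report. The name add_int comes from
 * the disassembly symbol; the signature is an assumption.
 */
struct add_int_result add_int(int loop_cnt)
{
    int n, i = 0, i1 = 3, i2 = -3;
    struct add_int_result r;

    for (n = loop_cnt; n > 0; n--) {
        i  += i1;
        i1 += i2;
        i1 += i2;
        i2 += i;
        i2 += i;
        i  += i1;
        i  += i1;
        i1 += i2;
        i1 += i2;
        i2 += i;
        i2 += i;
        i  += i1;
    }

    /* Returning the values keeps the compiler from discarding the loop. */
    r.i = i;
    r.i1 = i1;
    r.i2 = i2;
    return r;
}

/*
 * Hypothetical timing wrapper: returns the CPU time in seconds that
 * add_int(loop_cnt) consumed, measured with the standard clock().
 */
double time_add_int(int loop_cnt, struct add_int_result *out)
{
    clock_t start = clock();
    *out = add_int(loop_cnt);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_add_int(5000000, &r) from a small main() built once with each compiler version, and running both binaries on the same Cortex-A9 board, should show whether the reported gap appears. Because the values are stable across iterations, r.i, r.i1 and r.i2 should come back as 0, 3 and -3, which gives a quick check that both builds compute the same thing.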