[Bug rtl-optimization/49057] New: benchmark of gcc. a piece of loop code compiled by gcc-4.5.1 is slower compiled by gcc-4.4.2 when run on cortex-a9.

kun.he at mediatek dot com Wed, 18 May 2011 23:51:30 -0700

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49057


           Summary: benchmark of gcc. a piece of loop code compiled by
                    gcc-4.5.1 is slower compiled by gcc-4.4.2 when run on
                    cortex-a9.
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: kun...@mediatek.com


The following C code performs an “integer add” test. The variables n, i, i1,
i2 and loop_cnt are all of type ‘int’. The initial values are loop_cnt=5000000,
i=0, i1=3, i2=-3.
for (n = loop_cnt; n > 0; n--) {        /*    0    x     -x  - initial value */
                i += i1;                /*    x    x     -x   */
                i1 += i2;               /*    x    0     -x   */
                i1 += i2;               /*    x    -x    -x   */
                i2 += i;                /*    x    -x    0    */
                i2 += i;                /*    x    -x    x    */
                i += i1;                /*    0    -x    x    */
                i += i1;                /*    -x   -x    x    */
                i1 += i2;               /*    -x   0     x    */
                i1 += i2;               /*    -x   x     x    */
                i2 += i;                /*    -x   x     0    */
                i2 += i;                /*    -x   x     -x   */
                i += i1;                /*    0    x     -x   */
                /*
                 * Note that at loop end, i1 = -i2, which is as we
                 * started.  Thus, the values in the loop are stable.
                 */
        }
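For anyone who wants to reproduce this, the loop can be wrapped in a small self-contained function along the following lines. This is only a sketch: the name add_int is taken from the symbol shown in the disassembly listings in this report, but its signature, the volatile sink variable, the helper time_add_int and the use of clock() for timing are assumptions made for illustration, not the original benchmark source.

```c
#include <time.h>

/* Sink so the compiler cannot discard the loop results (assumption). */
volatile int add_int_sink;

/* Hypothetical signature; only the name add_int appears in the disassembly. */
int add_int(int loop_cnt, int i, int i1, int i2)
{
    int n;

    for (n = loop_cnt; n > 0; n--) {
        i += i1;
        i1 += i2;
        i1 += i2;
        i2 += i;
        i2 += i;
        i += i1;
        i += i1;
        i1 += i2;
        i1 += i2;
        i2 += i;
        i2 += i;
        i += i1;
    }
    add_int_sink = i1 + i2;     /* 0 whenever i1 == -i2 at loop end */
    return i;
}

/* Hypothetical helper: CPU seconds for one call with the reported
 * initial values i=0, i1=3, i2=-3. */
double time_add_int(int loop_cnt)
{
    clock_t start = clock();

    (void)add_int(loop_cnt, 0, 3, -3);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_add_int(5000000) from a small main(), once per compiler build, should give comparable timings for the two compilers, provided both builds use the same optimization flags.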
I compiled this C code with gcc-4.4.2 and gcc-4.5.1, and the two compilers
generate different binary code.
gcc-4.4.2:
 284:    e0800003     add    r0, r0, r3
 288:    e2511001     subs    r1, r1, #1    ; 0x1
 28c:    e0833082     add    r3, r3, r2, lsl #1
 290:    e0822080     add    r2, r2, r0, lsl #1
 294:    e0800083     add    r0, r0, r3, lsl #1
 298:    e0833082     add    r3, r3, r2, lsl #1
 29c:    e0822080     add    r2, r2, r0, lsl #1
 2a0:    e0830000     add    r0, r3, r0
 2a4:    1afffff6     bne    284 <add_int+0x4c>

gcc-4.5.1:
 138:    e0800003     add    r0, r0, r3
 13c:    e0833082     add    r3, r3, r2, lsl #1
 140:    e0822080     add    r2, r2, r0, lsl #1
 144:    e2511001     subs    r1, r1, #1
 148:    e0800083     add    r0, r0, r3, lsl #1
 14c:    e0833082     add    r3, r3, r2, lsl #1
 150:    e0822080     add    r2, r2, r0, lsl #1
 154:    e0830000     add    r0, r3, r0
 158:    1afffff6     bne    138 <add_int+0x4c>

As you can see, the only difference is the position of “subs    r1, r1, #1”,
yet this difference leads to a large difference in performance: the gcc-4.5.1
code reaches only 80% of the performance of the gcc-4.4.2 code.
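
Since subs only writes r1 and the flags, and none of the add instructions set or read the flags, moving it should not change what the loop computes. The sketch below models both instruction orders in C to check that. The register mapping (r0 = i, r3 = i1, r2 = i2, r1 = n) is inferred from the listings, and all type and function names are hypothetical. It says nothing about timing on cortex-a9, only that the two sequences produce the same register values.

```c
#include <stdint.h>
#include <string.h>

/* Register model: r[0]=i, r[1]=n, r[2]=i2, r[3]=i1 (inferred mapping). */
typedef struct { uint32_t r[4]; } regs_t;

/* One loop iteration. subs_pos is the number of adds issued before
 * "subs r1, r1, #1". Returns the Z flag set by subs (1 if r1 became 0). */
static int iterate(regs_t *s, int subs_pos)
{
    int z = 0, k;

    for (k = 0; k < 7; k++) {
        if (k == subs_pos) {
            s->r[1] -= 1;
            z = (s->r[1] == 0);
        }
        switch (k) {
        case 0: s->r[0] = s->r[0] + s->r[3];        break; /* add r0, r0, r3         */
        case 1: s->r[3] = s->r[3] + (s->r[2] << 1); break; /* add r3, r3, r2, lsl #1 */
        case 2: s->r[2] = s->r[2] + (s->r[0] << 1); break; /* add r2, r2, r0, lsl #1 */
        case 3: s->r[0] = s->r[0] + (s->r[3] << 1); break; /* add r0, r0, r3, lsl #1 */
        case 4: s->r[3] = s->r[3] + (s->r[2] << 1); break; /* add r3, r3, r2, lsl #1 */
        case 5: s->r[2] = s->r[2] + (s->r[0] << 1); break; /* add r2, r2, r0, lsl #1 */
        case 6: s->r[0] = s->r[3] + s->r[0];        break; /* add r0, r3, r0         */
        }
    }
    return z;
}

/* Run the whole loop from i=0, i1=3, i2=-3; n must be at least 1. */
static regs_t run_loop(uint32_t n, int subs_pos)
{
    regs_t s = { { 0u, n, (uint32_t)-3, 3u } };

    while (!iterate(&s, subs_pos))   /* bne: repeat while Z is clear */
        ;
    return s;
}

/* 1 if the gcc-4.4.2 order (subs after 1 add) and the gcc-4.5.1 order
 * (subs after 3 adds) leave identical register contents. */
int sequences_agree(uint32_t n)
{
    regs_t a = run_loop(n, 1);
    regs_t b = run_loop(n, 3);

    return memcmp(a.r, b.r, sizeof a.r) == 0;
}
```

If the model is right, sequences_agree() returns 1 for any n >= 1, which suggests the slowdown comes from how the instructions are scheduled and issued rather than from any change in the work done.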
