http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623



--- Comment #3 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 
2012-12-09 11:18:56 UTC ---

(In reply to comment #2)

> This is an ARM (both arm32 and arm64) specific issue due to the shifts being
> "free".  If you look at the mips assembly, it looks good for a dual issue
> processor as it is scheduled as an add followed by a shift.
> 
> I think the issue is reassoc does not know that shifts are free on arm.



This does not look like an ARM-only issue. To demonstrate it properly on MIPS,
even without dual issue, all the additions can simply be replaced with
multiplications (because multiplication is a long-latency instruction). In
this case we get:



unsigned int f1(unsigned int x)
{
    unsigned int a, b;
    a = x >> 1;
    b = x >> 2;
    a *= x >> 3;
    b *= x >> 4;
    a *= x >> 5;
    b *= x >> 6;
    a *= x >> 7;
    b *= x >> 8;
    a *= x >> 9;
    b *= x >> 10;
    a *= x >> 11;
    b *= x >> 12;
    a *= x >> 13;
    b *= x >> 14;
    a *= x >> 15;
    b *= x >> 16;
    a *= x >> 17;
    b *= x >> 18;
    a *= x >> 19;
    b *= x >> 20;
    a *= x >> 21;
    b *= x >> 22;
    a *= x >> 23;
    b *= x >> 24;
    return a * b;
}



unsigned int f2(unsigned int x)
{
    unsigned int a, b;
    a = x >> 1;
    b = x >> 2;
    a *= x >> 3;
    b *= x >> 4;
    a *= x >> 5;
    b *= x >> 6;
    a *= x >> 7;
    b *= x >> 8;
    a *= x >> 9;
    b *= x >> 10;
    a *= x >> 11;
    b *= x >> 12;
    a *= x >> 13;
    b *= x >> 14;
    a *= x >> 15;
    b *= x >> 16;
    a *= x >> 17;
    b *= x >> 18;
    a *= x >> 19;
    b *= x >> 20;
    a *= x >> 21;
    b *= x >> 22;
    a *= x >> 23;
    b *= x >> 24;
    /* Empty asm with "+r" (a): the compiler must treat a as modified here,
       so the a chain cannot be merged with the b chain. */
    asm ("" : "+r" (a));
    return a * b;
}



And the benchmark run on MIPS 74K:



$ gcc -O2 -march=mips32r2 -mtune=74kc -o badschedmul badschedmul.c
$ time ./badschedmul 1

real    0m34.934s
user    0m34.689s
sys     0m0.073s

$ time ./badschedmul 2

real    0m19.261s
user    0m19.122s
sys     0m0.050s
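
The driver part of badschedmul.c is not shown in this comment. As a rough
sketch only, a timing loop for such a test might look like the code below.
The names run_kernel, kernel_fn and demo_kernel are hypothetical, and
demo_kernel is just a stand-in so that the fragment compiles on its own; the
real test would pass f1 or f2 instead, presumably selected by the command-line
argument.

```c
/* Hypothetical timing loop; the original driver is not shown in the report. */
typedef unsigned int (*kernel_fn)(unsigned int);

/* Stand-in kernel so this sketch is self-contained; replace with f1 or f2. */
unsigned int demo_kernel(unsigned int x)
{
    return (x >> 1) * (x >> 2);
}

/* Call fn repeatedly and accumulate the results so the calls stay live. */
unsigned int run_kernel(kernel_fn fn, unsigned int x, unsigned long iterations)
{
    unsigned int sum = 0;
    unsigned long i;

    for (i = 0; i < iterations; i++) {
        /* Empty asm makes x opaque to the compiler on every iteration,
           so the call cannot be hoisted out of the loop. */
        __asm__ volatile ("" : "+r" (x));
        sum += fn(x);
    }
    return sum;
}
```

A main() built on this would parse argv[1], pick the function to test, call
run_kernel with a large iteration count, and print the returned sum so that
the work is not optimized away.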



The symptoms are still the same: GCC merges two independent calculations into
a single dependency chain, whereas I would have expected the opposite (breaking
dependency chains so that the code runs faster on the target CPU).
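
To make the difference in dependency structure concrete, here is a compact
source-level sketch of the two shapes. This is not GCC's actual output, and
the function names are made up for illustration. Both functions multiply the
same 24 shifted values, and since unsigned multiplication wraps modulo 2^32
they return the same result. The single chain has 23 dependent multiplies on
its critical path; the two-chain form has 11 per chain plus one final
multiply, so 12.

```c
/* One chain: every multiply depends on the result of the previous one. */
unsigned int shifts_single_chain(unsigned int x)
{
    unsigned int p = x >> 1;
    unsigned int s;

    for (s = 2; s <= 24; s++)
        p *= x >> s;
    return p;
}

/* Two chains: a takes the odd shifts, b the even ones (as in f1 as written).
   The a and b multiplies are independent and can overlap in the pipeline. */
unsigned int shifts_two_chains(unsigned int x)
{
    unsigned int a = x >> 1;
    unsigned int b = x >> 2;
    unsigned int s;

    for (s = 3; s <= 23; s += 2) {
        a *= x >> s;
        b *= x >> (s + 1);
    }
    return a * b;
}
```

Note that an optimizing compiler is free to unroll or reassociate these loops
as well, so the generated code has to be inspected to see which shape actually
ends up in the binary.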