The code generated by GCC 4.4 (trunk) for 68K/ColdFire runs slowly on the superscalar pipelines of the 68060 and ColdFire V4/V5 cores.

If you compile this example:

    #include <stddef.h>
    #include <stdint.h>

    uint32_t fletcher( uint16_t *data, size_t len )
    {
        uint32_t sum1 = 0xffff, sum2 = 0xffff;

        while (len) {
            unsigned tlen = len > 360 ? 360 : len;
            len -= tlen;
            do {
                sum1 += *data++;
                sum2 += sum1;
            } while (--tlen);
            sum1 = (sum1 & 0xffff) + (sum1 >> 16);
            sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* Second reduction step to reduce sums to 16 bits */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return sum2 << 16 | sum1;
    }

with

    m68k-linux-gnu-gcc -mcpu=68060 -fomit-frame-pointer -O3 -S -o example.s example.c

you will see that this definition

    uint32_t sum1 = 0xffff, sum2 = 0xffff;

generates the following code:

    moveq #0,%d2
    not.w %d2
    move.l %d2,%d3

These are THREE dependent instructions in a row: each one needs the result of the one before it. Even with result forwarding, these three instructions need 3 clocks to execute.

Instead of these three instructions, the compiler could have generated two independent ones:

    move.l #0x0000ffff,%d2
    move.l #0x0000ffff,%d3

Or the compiler could have placed other, independent instructions between them.

GCC does not try to reduce instruction dependencies. The code that GCC generates does not follow the scheduling recommendations for the 68040/68060 and later superscalar CPUs.

Please be so kind as to correct this.

-- 
           Summary: Generated 68K code bad for pipelining
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: gunnar at greyhound-data dot com
  GCC host triplet: m68k-linux-gnu
GCC target triplet: m68k-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36487
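
For anyone comparing differently scheduled builds of the test case, a small self-check may help confirm that the generated code still computes the same checksum. The sketch below repeats the routine from the report so that it compiles on its own. The `fletcher_self_check` helper and the two-word input `{ 1, 2 }` are assumptions added for illustration and are not part of the original report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Checksum routine as given in the report. */
uint32_t fletcher(uint16_t *data, size_t len)
{
    uint32_t sum1 = 0xffff, sum2 = 0xffff;

    while (len) {
        unsigned tlen = len > 360 ? 360 : len;
        len -= tlen;
        do {
            sum1 += *data++;
            sum2 += sum1;
        } while (--tlen);
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
    }
    /* Second reduction step to reduce sums to 16 bits */
    sum1 = (sum1 & 0xffff) + (sum1 >> 16);
    sum2 = (sum2 & 0xffff) + (sum2 >> 16);
    return sum2 << 16 | sum1;
}

/* Hypothetical helper (not in the report): checks one hand-computed value.
 * For the words { 1, 2 } the running sums reduce to sum1 = 3 and sum2 = 4,
 * so the packed result is 0x00040003. */
void fletcher_self_check(void)
{
    uint16_t words[2] = { 1, 2 };
    assert(fletcher(words, 2) == 0x00040003u);
}
```

Calling `fletcher_self_check()` from a small `main` in each build (for example, one compiled with and one without instruction scheduling options) should pass in both cases if only the instruction order differs.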