avr-gcc misses a number of optimisations when copying 4-byte values or assigning a single-byte value to a 4-byte value. The issue actually applies to values of other sizes as well, but since 4-byte values are common (32-bit ints and floats, for example), the issue is especially relevant there.
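To make the suggestion concrete at the source level, the "set up a pointer, then unroll" strategy described below corresponds roughly to a hand-unrolled byte store through a pointer. The helper here is purely illustrative (it is not part of the test cases that follow); each post-increment store maps naturally onto an "st Z+, rN" instruction once the compiler has placed the pointer in the Z register pair:

    /* Illustration only: an explicitly unrolled pointer walk, the code
       shape suggested below as the better choice for the compiler. */
    static void broadcast4(unsigned char *dst, unsigned char v)
    {
        *dst++ = v;
        *dst++ = v;
        *dst++ = v;
        *dst++ = v;
    }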
In summary, the compiler tends to produce code that is either a series of direct memory accesses, or uses indirect access (through Z) in a loop. A better choice would often be to set up Z as a pointer, then unroll the indirect pointer loop. All code was compiled using avr-gcc 4.3.2 from winavr-20090313, using -Os.

Look at the code:

    typedef unsigned char uint8_t;
    typedef unsigned long int uint32_t;

    #define sz 4    /* sz is assumed to be 4, matching the array sizes and the generated code */

    static uint8_t as[4];
    static uint8_t bs[4];

    void foo1(void) {
        for (uint8_t i = 0; i < sz; i++) {
            bs[i] = as[1];
        }
    }

    void foo2(void) {
        for (uint8_t i = 0; i < sz; i++) {
            *(bs + i) = *(as + 1);
        }
    }

foo1 compiles to:

        lds r24, as+1
        sts bs, r24
        sts bs+1, r24
        sts bs+2, r24
        sts bs+3, r24
        ret

Excluding the "ret", this is 10 words and 10 cycles.

foo2 is logically identical (array access and pointer access are the same thing), but compiles to:

        lds r24, as+1
        ldi r30, lo8(bs)
        ldi r31, hi8(bs)
    .L1:
        st Z+, r24
        ldi r25, hi8(bs+4)
        cpi r30, lo8(bs+4)
        cpc r31, r25
        brne .L1
        ret

Excluding the "ret", this is 9 words and 31 cycles (27 on the XMega). Hoisting the "ldi r25, hi8(bs+4)" above the label would save three cycles (it would execute once rather than on every iteration).

An implementation that is smaller than both of these, slightly slower than foo1's code on the Mega and slightly faster on the XMega, is:

        lds r24, as+1
        ldi r30, lo8(bs)
        ldi r31, hi8(bs)
        st Z+, r24
        st Z+, r24
        st Z+, r24
        st Z+, r24
        ret

Excluding the "ret", this is 8 words and 12 cycles (8 on the XMega).

For the code:

    static uint32_t al, bl;
    static float af;

    void foo3(void) {
        al = 0;
    }

    void foo4(void) {
        af = 0;
    }

we get:

    foo3:
        sts al, __zero_reg__
        sts (al)+1, __zero_reg__
        sts (al)+2, __zero_reg__
        sts (al)+3, __zero_reg__
        ret

That's 8 words and 8 cycles (plus "ret"). Using

        ldi r30, lo8(al)
        ldi r31, hi8(al)
        st Z+, __zero_reg__
        st Z+, __zero_reg__
        st Z+, __zero_reg__
        st Z+, __zero_reg__
        ret

gives 6 words and 10 cycles, or 6 cycles on the XMega (plus "ret").

Function foo4() should of course give the same code, but instead compiles to the very inefficient:

    foo4:
        ldi r24, lo8(0x00)
        ldi r25, hi8(0x00)
        ldi r26, hlo8(0x00)
        ldi r27, hhi8(0x00)
        sts af, __zero_reg__
        sts (af)+1, __zero_reg__
        sts (af)+2, __zero_reg__
        sts (af)+3, __zero_reg__
        ret

That's 12 words and 12 cycles, and it uses four registers unnecessarily.

Similar code is produced when copying values:

    void foo5(void) {
        al = bl;
    }

compiles to:

    foo5:
        lds r24, bl
        lds r25, (bl)+1
        lds r26, (bl)+2
        lds r27, (bl)+3
        sts al, r24
        sts (al)+1, r25
        sts (al)+2, r26
        sts (al)+3, r27
        ret

Using Z and either the X or Y pointer would make this code slightly smaller but marginally slower on the Mega (and marginally faster on the XMega). Even without that, re-arranging the code would allow a single register to be used rather than four.

--
Summary: Missed optimisation when setting 4-byte values
Product: gcc
Version: 4.3.2
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: david dot brown at hesbynett dot no
GCC host triplet: mingw
GCC target triplet: avr-gcc

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39819
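A possible source-level workaround for the foo4() case (illustrative only; the names below are not from the test cases, and the result has not been verified with avr-gcc 4.3.2) is to clear the float through memset rather than assigning 0.0. All-zero bytes represent 0.0f in the 32-bit IEEE format avr-gcc uses, so the effect is the same; how well the memset itself is expanded at -Os is a separate question:

    #include <string.h>

    static float af2;                   /* stand-in for "af" above */

    static void foo4_alt(void)
    {
        /* Clears all four bytes without a floating-point constant; whether
           this beats the foo4 code above depends on how the compiler expands
           a fixed-size memset. */
        memset(&af2, 0, sizeof af2);
    }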