avr-gcc misses a number of optimisations when copying 4-byte values, or when
assigning a single-byte value to a 4-byte object.  The issue applies to other
sizes as well, but since 4-byte values are common (32-bit ints, floats), it is
especially relevant there.

In summary, the compiler tends to produce either a series of direct memory
accesses, or an indirect-access loop through Z.  A better choice would often
be to set up Z as a pointer and then unroll the indirect accesses.

All code was compiled using avr-gcc 4.3.2 from winavr-20090313, using -Os.

Consider the following code:

typedef unsigned char uint8_t;
typedef unsigned long int uint32_t;

#define sz 4    /* loop count; the generated code below is for sz == 4 */

static uint8_t as[4];
static uint8_t bs[4];

void foo1(void) {
        for (uint8_t i = 0; i < sz; i++) {
                bs[i] = as[1];
        }
}

void foo2(void) {
        for (uint8_t i = 0; i < sz; i++) {
                *(bs + i) = *(as + 1);
        }
}

foo1 compiles to:

lds r24, as+1
sts bs, r24
sts bs+1, r24
sts bs+2, r24
sts bs+3, r24
ret

Excluding the "ret", this is 10 words and 10 cycles.

foo2 is logically identical (array access and pointer access are the same
thing), but compiles to:

lds r24, as+1
ldi r30, lo8(bs)
ldi r31, hi8(bs)
.L1:
st Z+, r24
ldi r25, hi8(bs+4)
cpi r30, lo8(bs+4)
cpc r31, r25
brne .L1
ret

Excluding the "ret", this is 9 words and 31 cycles (27 on the XMega).  Hoisting
the "ldi r25, hi8(bs+4)" above the label would save four cycles.

An implementation that is smaller than both of these, slightly slower on the
Mega, and slightly faster on the XMega, is:

lds r24, as+1
ldi r30, lo8(bs)
ldi r31, hi8(bs)
st Z+, r24
st Z+, r24
st Z+, r24
st Z+, r24
ret

Excluding the "ret" this is 8 words, and 12 cycles (8 on the XMega).


For the code:

static uint32_t al, bl;
static float af;

void foo3(void) {
        al = 0;
}

void foo4(void) {
        af = 0;
}

we get:

foo3:
sts al, __zero_reg__
sts (al)+1, __zero_reg__
sts (al)+2, __zero_reg__
sts (al)+3, __zero_reg__
ret

That's 8 words and 8 cycles (plus "ret").  Using

ldi r30, lo8(al)
ldi r31, hi8(al)
st Z+, __zero_reg__
st Z+, __zero_reg__
st Z+, __zero_reg__
st Z+, __zero_reg__
ret

gives 6 words and 10 cycles, or 6 cycles on the XMega (plus "ret").

Function foo4() should of course give the same code, since 0.0f has an
all-zero bit pattern, but instead compiles to the very inefficient:

foo4:
ldi r24, lo8(0x00)
ldi r25, hi8(0x00)
ldi r26, hlo8(0x00)
ldi r27, hhi8(0x00)
sts af, __zero_reg__
sts (af)+1, __zero_reg__
sts (af)+2, __zero_reg__
sts (af)+3, __zero_reg__
ret

That's 12 words and 12 cycles, and it wastes four registers: the values loaded
into r24..r27 are never used, as the stores all use __zero_reg__.
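
The ideal output is then the same Z-pointer sequence suggested for foo3
above, pointed at "af"; a sketch:

ldi r30, lo8(af)
ldi r31, hi8(af)
st Z+, __zero_reg__
st Z+, __zero_reg__
st Z+, __zero_reg__
st Z+, __zero_reg__
ret

That would again be 6 words and 10 cycles (6 on the XMega), plus the "ret".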


Similar code is produced when copying values:

void foo5(void) {
        al = bl;
}

compiles to:

foo5:
lds r24, bl
lds r25, (bl) + 1
lds r26, (bl) + 2
lds r27, (bl) + 3
sts al, r24
sts (al) + 1, r25
sts (al) + 2, r26
sts (al) + 3, r27
ret

Using the Z pointer for the loads and either X or Y for the stores would make
this code slightly smaller, but marginally slower on the Mega (and marginally
faster on the XMega).  Even without that, interleaving the loads and stores
would allow a single register to be used rather than four, as sketched below.
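
A sketch of the interleaved, single-register variant (same size and speed as
the output above, but using only r24):

lds r24, bl
sts al, r24
lds r24, (bl) + 1
sts (al) + 1, r24
lds r24, (bl) + 2
sts (al) + 2, r24
lds r24, (bl) + 3
sts (al) + 3, r24
ret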



-- 
           Summary: Missed optimisation when setting 4-byte values
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: david dot brown at hesbynett dot no
  GCC host triplet: mingw
GCC target triplet: avr-gcc


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39819
