lower-subreg.c: Extreme code bloat for MEM splits

Georg-Johann Lay Sun, 04 Mar 2012 12:21:38 -0800

For the following small test case, lower-subreg.c produces unbelievable code
bloat.


The code reads a 4-byte value from two of AVR's named address spaces, __memx
and __flash1:


long readx (const __memx long *p)
{
    return *p;
}

long read1 (const __flash1 long *p)
{
    return *p;
}


Compiled with 4.8.0


$ avr-gcc flash.c -S -dp -Os -mmcu=avr51 -fno-split-wide-types

This yields

readx:
/* prologue: function */
        movw r30,r22
        mov r21,r24
        call __xload_4
        ret

read1:
/* prologue: function */
        movw r30,r24
        ldi r18,1
        out __RAMPZ__,r18
        elpm r22,Z+
        elpm r23,Z+
        elpm r24,Z+
        elpm r25,Z+
        ret


This is reasonable: loads from the __memx address space are expensive, so they
are outsourced to the libgcc function __xload_4.

But without -fno-split-wide-types, the code is:

readx:
        push r12
        push r13
        push r14
/* prologue: function */
        mov r26,r24
        movw r24,r22
        movw r18,r24
        mov r20,r26
        subi r18,-1
        sbci r19,-1
        sbci r20,-1
        movw r30,r18
        mov r21,r20
        call __xload_1
        mov r23,r22
        ldi r18,lo8(2)
        mov r12,r18
        mov r13,__zero_reg__
        mov r14,__zero_reg__
        add r12,r24
        adc r13,r25
        adc r14,r26
        movw r18,r24
        mov r20,r26
        subi r18,-3
        sbci r19,-1
        sbci r20,-1
        movw r30,r12
        mov r21,r14
        call __xload_1
        mov r24,r22
        movw r30,r18
        mov r21,r20
        call __xload_1
        mov r25,r22
/* epilogue start */
        pop r14
        pop r13
        pop r12
        ret

read1:
/* prologue: function */
        movw r30,r24
        ldi r18,1
        out __RAMPZ__,r18
        elpm r22,Z+
        ldi r18,1
        out __RAMPZ__,r18
        elpm r23,Z
        movw r18,r24
        subi r18,-2
        sbci r19,-1
        movw r20,r24
        subi r20,-3
        sbci r21,-1
        movw r30,r18
        ldi r24,1
        out __RAMPZ__,r24
        elpm r24,Z
        movw r30,r20
        ldi r25,1
        out __RAMPZ__,r25
        elpm r25,Z
        ret

You don't need to know anything about AVR to see that the code is *really* bad
and bloated to the maximum.

Besides that, the code is wrong: there are just 3 __xload_1 calls instead of 4.
But that appears to be a different issue, PR52484.

The reason is that lower-subreg.c does not care about costs at all and greedily
splits everything it gets hold of.
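
To make the effect of the split concrete, here is a rough sketch in plain,
portable C (no AVR address spaces; load_wide and load_split are names made up
for illustration, not anything from GCC). It only models the shape of the two
code sequences: one access that uses the base address once, versus four byte
accesses that each need their own base + offset address.

```c
#include <stdint.h>
#include <string.h>

/* One wide access: the base address is used once for the whole 4-byte
   object, comparable to the single __xload_4 call or the elpm Z+ run.  */
uint32_t load_wide (const unsigned char *p)
{
    uint32_t v;
    memcpy (&v, p, sizeof v);
    return v;
}

/* Split access: four independent byte loads.  Every load gets its own
   base + offset address, which is what the split sequences above have
   to compute (and keep alive in registers) again and again.  */
uint32_t load_split (const unsigned char *p)
{
    unsigned char b[4];
    uint32_t v;

    b[0] = *(p + 0);
    b[1] = *(p + 1);
    b[2] = *(p + 2);
    b[3] = *(p + 3);

    /* Reassemble in memory order, so the result matches load_wide
       regardless of the host's endianness.  */
    memcpy (&v, b, sizeof v);
    return v;
}
```

Both functions return the same value; the difference is only in how many
address computations (and, for __memx, how many libgcc calls) are needed.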

A second reason is that GCC shies away from pre/post
increment/modify/decrement addressing modes.

Any idea how to fix this in the backend?

There is TARGET_MODE_DEPENDENT_ADDRESS_P, and it can fix the first case, which
uses PSImode as pointer mode.

The second case, however, uses Pmode, and in that hook there is no way to tell
whether an address refers to the generic address space or to a special one:
the hook does not pass this information to the backend, and there is no
address-space flavour of the hook.
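
The problem can be shown with a toy model in plain C. Everything here is
hypothetical (the toy_* enums and functions are made up for illustration and
are not GCC definitions or an existing hook); it assumes, as on AVR, that Pmode
is a 16-bit mode, modelled as TOY_HImode. A decision that sees only the address
can single out __memx by its PSImode pointer, but it cannot separate __flash1
from the generic space. A flavour that is told the address space could.

```c
#include <stdbool.h>

/* Toy stand-ins for illustration only; these are NOT GCC definitions.  */
enum toy_addr_space { TOY_AS_GENERIC, TOY_AS_FLASH1, TOY_AS_MEMX };
enum toy_mode { TOY_HImode, TOY_PSImode };

/* Pointer mode per address space in this model: __memx uses PSImode,
   everything else uses Pmode (modelled here as TOY_HImode).  */
enum toy_mode toy_pointer_mode (enum toy_addr_space as)
{
    return as == TOY_AS_MEMX ? TOY_PSImode : TOY_HImode;
}

/* Decision based on the address alone: should a wide MEM with this
   address be kept whole instead of being split?  Only the pointer
   mode is visible, so only __memx can be recognized.  */
bool toy_keep_whole_by_mode (enum toy_mode addr_mode)
{
    return addr_mode == TOY_PSImode;
}

/* Hypothetical address-space flavour: the address space is passed in,
   so every non-generic space can be kept whole.  */
bool toy_keep_whole_by_space (enum toy_addr_space as)
{
    return as != TOY_AS_GENERIC;
}
```

In this model, toy_keep_whole_by_mode gives the same answer for __flash1 and
for the generic space because both use the same pointer mode, which is exactly
the information that is missing in the second case.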

Any ideas what to do about that?

Is it a reasonable hack to make
   TARGET_MODE_DEPENDENT_ADDRESS_P (PSImode) = false
?

Why does lower-subreg not care about costs at all?
...and even if it did, MEMORY_MOVE_COST is not sensitive to address spaces
either.

Johann
