On Fri, 17 Jun 2011, Arnd Bergmann wrote:

> On Friday 17 June 2011, Michael Hope wrote:
> > Hi Paolo.  I've had a look into this and updated the ticket.  The
> > problem is due to a change in behaviour between 4.5 and 4.6 when
> > accessing strcuts marked with __attribute__((packed)).  4.5 would do a
> > word-wise access where 4.6 assumes the worst and does a byte-by-byte
> > access.
> > 
> > There was a discussion on the GCC list:
> >  http://gcc.gnu.org/ml/gcc/2011-02/msg00035.html
> > 
> > which petered out unfortunately.  IMO the compiler is working
> > correctly and the problem should be fixed in the kernel initially and
> > then a long term plan worked out between the kernel and compiler.
> > 
> > Arnd, do you want to pick up that thread?  Both Ubuntu and Debian are
> > now 4.6 based so this problem will come up more often.  I think Patrik
> > said there's a fix for this specific problem pending.
> 
> I fear Russell will not like this too much, but I guess it has to be done.
> It will also come back to the toolchain team as a request to make the
> inline assembly syntax more helpful. When we last talked about it on the
> mailing list, the reason that was given for having writel() not do an
> inline assembly to access I/O memory was that we cannot express all
> possible addressing modes well.

I played with this a bit.  I think that GCC might have improved on this 
front.  It at least doesn't seem to produce the much suboptimal code 
compared to the plain C implementation as it used to do in the past.  
See my explanation and experiments below.

> The function in question is 
> 
> #define __raw_writel(v,a) (*(volatile unsigned int __force   *)(a) = (v))
> 
> AFAICT, this should really be written as
> 
> static inline void __raw_writel(unsigned int v, volatile __iomem unsigned int 
> *a)
> {
>       asm volatile("str %0, %1" : "=o" (*a) : "r" (v) : "memory");
> }
> 
> What I don't know is whether this form is ideal in all cases (or even 
> correct),
> and what the constraints should be. 

The problem here is that the assembly constraint doesn't let us specify 
which type of access this constraint is for.

Let's take the str instruction as an example.  It may use many different 
addressing modes:

        str     r0, [r1]        # destination address in r1
        str     r0, [r1, r2]    # base address in r1 with offset in r2
        str     r0, [r1, #<i>]  # base address in r1 with an immediate offset 

The third form is the most common in driver code.  You have the iobase 
in a variable, used with register offsets.  So, for example, if you have 
this code:

#define REG_A   0x00
#define REG_B   0x04
#define REG_C   0x08
#define REG_D   0x0c
#define REG_E   0x800

#define writel(v,a) (*(volatile unsigned int *)(a) = (v))

void foobar_init(void * iobase)
{
        /* initialize A, C, and E regs to zero */
        writel(0, iobase + REG_A);
        writel(0, iobase + REG_C);
        writel(0, iobase + REG_E);

        /* initialize B and D to -1 */
        writel(-1, iobase + REG_B);
        writel(-1, iobase + REG_D);
}

This compiles to:

foobar_init:
        mov     r3, #0
        mvn     r2, #0
        str     r3, [r0, #0]
        str     r3, [r0, #8]
        str     r3, [r0, #2048]
        str     r2, [r0, #4]
        str     r2, [r0, #12]
        bx      lr

Now the writel could have been written as a static inline:

static inline void writel(unsigned int v, volatile unsigned int *a)
{
        *a = v;
}

The result is the same as gcc is smart enough to share the common iobase 
and spread the immediate offsets. This also provides better type 
checking.  But last time this was attempted, a large amount of code was 
then complaining due to the fact that sometimes the address is provided 
as a pointer, sometimes as an integer.  this deserves to be cleaned, but 
this should really be orthogonal to this issue and the move to a static 
inline be done first, and separately from the actual access 
implementation.

Now let's try your suggestion, with a minor ordering fix:

static inline void writel(unsigned int v, volatile unsigned int *a)
{
        asm volatile ("str\t%1, %0" : : "=o" (*a) : "r" (v));
}

The result is then:

foobar_init:
        mov     r1, #0
        str     r1, [r0, #0]
        str     r1, [r0, #8]
        str     r1, [r0, #2048]
        mvn     r3, #0
        str     r3, [r0, #4]
        str     r3, [r0, #12]
        bx      lr

Not obvious consequences here, but the scheduling is 
different.  I suspect this would be more important in the readl() 
case as every ldr instructions have result delay slots to fill with 
other instructions when possible.  I don't know if the compiler is able 
to infer this from the constraint.

Arguably, this might not be significant given that IO accesses are 
typically much more costly with much longer stalls than a delay slot 
might be.

But more importantly, the immediate offset has a limited range of 
course.  But the problem is about the fact that not every store 
instructions have the same immediate offset range in their addressing 
mode.  For example, the strh instruction may accommodate only a 8-bit 
offset compared to 12-bit offset for a word.

Oh wait!  Looks like the o constraint is smart enough to adapt to the 
type of the pointer.  So if the pointer for a is a volatile unsigned 
short then we get this:

foobar_init:
        mov     r1, #0
        str     r1, [r0, #0]
        str     r1, [r0, #8]
        add     r2, r0, #2048
        str     r1, [r2, #0]
        mvn     r3, #0
        str     r3, [r0, #4]
        str     r3, [r0, #12]
        bx      lr

Obviously the inline asm should have used a strh instruction as well, 
but what I actually wanted to verify here is the restriction on the 
indexed addressing range.  Looks like GCC is doing the right thing here.  
I can't say for sure, but I have a vague recollection of GCC doing it 
wrong in some distant past.  At the time the only thing that seemed to 
be OK in all cases with an inline asm implementation was:

static inline void writeh(unsigned int v, volatile unsigned short  *a)
{
        asm volatile ("strh\t%1, [%0]" : : "r" (a), "r" (v));
}

which produced this:

foobar_init:
        mov     r1, #0
        strh    r1, [r0]
        mov     ip, #0
        add     r2, r0, #8
        strh    ip, [r2]
        add     r3, r0, #2048
        strh    ip, [r3]
        mvn     r2, #0
        add     r1, r0, #4
        strh    r2, [r1]
        mvn     r3, #0
        add     r0, r0, #12
        strh    r3, [r0]
        bx      lr

As you can see, this is far less optimal and larger.  And then GCC even 
gets confused wrt liveness of the value to store despite using -O2.  
Hence the implementation using pure C which produced far better code.

> If I understood Uli correctly, it should be possible to express this 
> with gcc syntax, but of course there may be a catch somewhere.

Yes.  You don't want to use a memory clobber.  That tells GCC to forget 
everything about what it has cached into registers and forces it to 
reload everything from memory, including the iobase pointer if it comes 
from a struct device's private data.

Consider, for example, this code:

struct device {
        void *iobase;
};

void foobar_init(struct device *dev)
{
        /* initialize A, C, and E regs to zero */
        writel(0, dev->iobase + REG_A);
        writel(0, dev->iobase + REG_C);
        writel(0, dev->iobase + REG_E);

        /* initialize B and D to -1 */
        writel(-1, dev->iobase + REG_B);
        writel(-1, dev->iobase + REG_D);
}

If you keep the "memory" clobber, GCC will produce this:

foobar_init:
        ldr     r1, [r0, #0]
        mov     r3, #0
        str     r3, [r1, #0]
        ldr     r2, [r0, #0]
        str     r3, [r2, #8]
        ldr     ip, [r0, #0]
        str     r3, [ip, #2048]
        mvn     r2, #0
        ldr     r1, [r0, #0]
        str     r2, [r1, #4]
        ldr     r3, [r0, #0]
        str     r2, [r3, #12]
        bx      lr

As you can see, the iobase value is reloaded all the time needlessly.  
Another recollection of mine is that a while ago GCC was considering 
static inline functions like full memory barriers too, which doesn't 
seem to be the case anymore.

But for the constraints the following should be best:

static inline void writel(unsigned int v, volatile unsigned int *a)
{
        asm volatile ("str\t%1, %0" : : "+o" (*a) : "r" (v));
}

static inline unsigned int readl(volatile unsigned int *a)
{
        unsigned int v;
        asm volatile ("ldr\t%1, %0" : : "+o" (*a), "=r" (v));
        return v;
}

The bidirectional qualifier indicates that we're going to modify the 
memory pointed by a, and yet it is going to be different after that, 
which pretty much matches how hardware registers behave.


Nicolas

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain

Reply via email to