Tuesday Dec 06, 2005

GCC inline assembly, part 2

Long ago, I promised to write more about gcc inline assembly, in particular a few cases that are tricky to get right. Here, somewhat belatedly, are those cases. These examples are taken from libc, but the concepts apply to any inline assembly fragments you write for gcc. As I mentioned previously, these concerns apply only to gcc-style inlines; the Studio-style inline format doesn't require the same level of caution.

gcc expects you to write assembly fragments (even in a "separate" inline function) as if they were logically part of the caller. That is, the compiler will allocate registers or other appropriate storage locations to each of the input and output C variables. This requires that you describe to the compiler, very carefully, how you use each variable and how the variables relate to one another. The advantage is much better register allocation: the compiler is free to allocate whatever registers it wishes to your input and output variables, in a manner that is transparent to you.

Studio, by contrast, requires that you code the fragment as if it were a leaf function, so the compiler does not do any register allocation for you. You are permitted to use the caller-saved registers any way you wish, and even to use the caller's stack as if you were in a leaf function. Arguments and return values are stored in their ABI-defined locations. Depending on the optimization level you use, this can be wasteful of registers (though the peephole optimizer can often clean up some of this waste) and can also make writing the fragment much more difficult. In exchange, however, you don't have to be nearly as careful to express the fragment's operation to the compiler.

Inputs, Outputs, and Clobbers (oh my!)

Each assembly fragment may have any or all of outputs, inputs, and clobbers. Each input and output maps a C variable or literal to a string suitable for use as an assembly operand. These operands can then be referenced as %0, %1, %2, etc. Operands are numbered from 0, starting with the first output and continuing through the inputs. Alternatively, newer versions of gcc allow the use of symbolic names for each input and output. Clobbers are somewhat different; they express the set of registers and/or memory whose values are changed by the fragment but are not expressed in the outputs. Inputs which are also changed must be listed as outputs, not clobbers. Normally, the clobbers include explicit registers used by certain instructions, but may also include "cc" to indicate that the condition code registers are modified and/or "memory" to indicate that arbitrary memory addresses have had their contents altered.
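To make the numbering concrete, here is a minimal sketch. It assumes an x86 target and gcc's default AT&T syntax, and falls back to plain C elsewhere; the names add_positional() and add_named() are hypothetical and do not come from libc. Both functions compute the same sum; the first refers to its operands by number, the second by symbolic name. The "+" modifier used on the output is described under Constraints.

```c
#include <stdint.h>

/* Hypothetical example: sum += b, using positional operand numbers. */
static inline uint32_t
add_positional(uint32_t a, uint32_t b)
{
	uint32_t sum = a;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__asm__(
		"addl	%1, %0"		/* %0 is sum (first output), %1 is b */
		: "+r" (sum)		/* outputs are numbered first */
		: "r" (b)		/* inputs follow */
		: "cc");		/* addl changes the condition codes */
#else
	sum += b;			/* plain C fallback for other targets */
#endif
	return (sum);
}

/* The same fragment, using symbolic operand names instead of numbers. */
static inline uint32_t
add_named(uint32_t a, uint32_t b)
{
	uint32_t sum = a;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__asm__(
		"addl	%[addend], %[sum]"
		: [sum] "+r" (sum)
		: [addend] "r" (b)
		: "cc");
#else
	sum += b;			/* plain C fallback for other targets */
#endif
	return (sum);
}
```

Symbolic names cost nothing at run time and keep the template readable when operands are added or reordered later.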

Constraints

Outputs and inputs are expressed as constraints, in a language specifying the type of operand that will contain the value of a variable. Common constraints include "r", indicating that a general register should be allocated, and "m", indicating that some type of memory location should be used. The complete list of constraints is found in the gcc documentation.

These constraints may contain modifiers, which give gcc more information about how the operand will be used. The most common modifiers are "=", "+", and "&". The "=" modifier indicates that the operand is output-only; it may appear only in the constraint for an output variable. Even if the constraint is applied to a variable containing an existing value in your program, there is no guarantee that the operand will contain that value when your assembly fragment is executed. If you need that, you must use the "+" modifier instead of "="; this tells the compiler that the operand is both an input and an output. Even so, a variable with this constraint is listed only in the outputs section of the fragment's specification. The documentation describes an alternate way to express the same thing: a matching constraint, in which an input's constraint is the number of the output operand it must share a location with (for example, "0"). Note that simply providing the same variable as both an input and an output does not guarantee that the same location (register, address, etc.) will be used for both of them. Thus the following is generally incorrect:

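static inline int
add(int var1, int var2)
{
	/*
	 * WRONG: nothing ties the input var1 (%1) to the output var1 (%0),
	 * so %0 need not hold var1's value when the add executes.
	 */
	__asm__(
		"add	%2, %0"
	: "=r" (var1)
	: "r" (var1), "r" (var2));

	return (var1);
}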
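For comparison, here are two hedged sketches of how the fragment could be written correctly, assuming an x86 target with AT&T syntax (other targets take the plain C path). The names add_rw() and add_matching() are made up for illustration. The first uses the "+" modifier; the second uses a matching constraint, "0", which forces the input to occupy the same location as output operand 0.

```c
/* Read-write operand: var1 is consumed and produced in one location. */
static inline int
add_rw(int var1, int var2)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__asm__(
		"addl	%1, %0"
		: "+r" (var1)		/* %0: input and output */
		: "r" (var2)		/* %1 */
		: "cc");
#else
	var1 += var2;			/* plain C fallback for other targets */
#endif
	return (var1);
}

/* Matching constraint: input "0" must share the location of output %0. */
static inline int
add_matching(int var1, int var2)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__asm__(
		"addl	%2, %0"
		: "=r" (var1)			/* %0 */
		: "0" (var1), "r" (var2)	/* %1 is tied to %0; %2 */
		: "cc");
#else
	var1 += var2;			/* plain C fallback for other targets */
#endif
	return (var1);
}
```

Either form tells gcc that the register named by %0 already holds var1 when the fragment begins, which is exactly what the incorrect version fails to say.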

The "&" modifier is used on an output operand whose value is overwritten before all the input operands are consumed. This exists to prevent gcc from using the same register for both the input and output operands. For example, for swap32() (see also the Studio inline function), we might think to write:

extern __inline__ uint32_t
swap32(volatile uint32_t *__memory, uint32_t __value)
{
	...
	uint32_t __tmp1, __tmp2;
	__asm__ __volatile__(
		"ld [%3], %1\n\t"
		"1:\n\t"
		"mov %0, %2\n\t"
		"cas [%3], %1, %2\n\t"
		"cmp %1, %2\n\t"
		"bne,a,pn %%icc, 1b\n\t"
		"  mov %2, %1"
		: "+r" (__value), "=r" (__tmp1), "=r" (__tmp2)
		: "r" (__memory)
		: "cc");
	return (__tmp2);
}

But suppose gcc decided to allocate o0 to both __tmp1 and __memory. This is allowable, because the "=r" constraint implies that the corresponding register is set only after all input-only operands are no longer needed (input/output operands obviously don't have this problem). In the case above, the first load would clobber o0 and the cas would operate on an arbitrary location. Instead, we must write "=&r" for both __tmp1 and __tmp2; neither variable may safely be allocated the same register as the input operand.

Bugs caused by omitting the earlyclobber are painful to track down because they often appear and disappear from one compilation to the next as entirely unrelated code changes cause increases or decreases in register pressure.
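The same hazard can be reproduced in a much smaller fragment. The following is a sketch only, assuming an x86 target with AT&T syntax; double_then_add() is a hypothetical name. The output register is written by the first instruction, while the input b is not read until the last one, so the output must be marked earlyclobber.

```c
#include <stdint.h>

/* Hypothetical example: returns 2 * a + b. */
static inline uint32_t
double_then_add(uint32_t a, uint32_t b)
{
	uint32_t result;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__asm__(
		"movl	%1, %0\n\t"	/* result = a: output written early */
		"addl	%0, %0\n\t"	/* result = 2 * a */
		"addl	%2, %0"		/* b is still needed here */
		: "=&r" (result)	/* "&": never share a register with an input */
		: "r" (a), "r" (b)
		: "cc");
#else
	result = 2 * a + b;		/* plain C fallback for other targets */
#endif
	return (result);
}
```

With plain "=r", gcc would be free to place result and b in the same register, in which case the first movl would destroy b before the final addl reads it.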

This is not an academic concern. Consider this example program:

#include <stdint.h>	/* for uint32_t */

static __inline__ void
incr32(volatile uint32_t *__memory)
{
        uint32_t __tmp1, __tmp2;
        __asm__ __volatile__(
        "ld [%2], %0\n\t"
        "1:\n\t"
        "add %0, 1, %1\n\t"
        "cas [%2], %0, %1\n\t"
        "cmp %0, %1\n\t"
        "bne,a,pn %%icc, 1b\n\t"
        "  mov %1, %0"
        : "=r" (__tmp1), "=r" (__tmp2)
        : "r" (__memory)
        : "cc");
}

uint32_t
func(uint32_t x)
{
        uint32_t y = 4;
        uint32_t z = x + y;

        incr32(&y);

        z = x + y;

        return (z);
}

gcc compiles this (use -O2 -mcpu=v9 -mv8plus) into:

func()
    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               90 02 20 04  add          %o0, 0x4, %o0
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              c2 00 40 00  ld           [%g1], %g1	<===
    func+0x18:              9a 00 60 01  add          %g1, 0x1, %o5
    func+0x1c:              db e0 50 01  cas          [%g1] , %g1, %o5	<= SEGV
    func+0x20:              80 a0 40 0d  cmp          %g1, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              82 10 00 0d  mov          %o5, %g1
    func+0x2c:              81 c3 e0 08  retl         
    func+0x30:              9c 23 bf 88  sub          %sp, -0x78, %sp

In this case, gcc has allocated g1 to both __tmp1 and __memory, and o5 to __tmp2. Note the highlighted instructions: g1 initially holds sp+0x64, which is simply the address of y. The initial load overwrites that address with the value stored there, which is 4, and the subsequent cas then attempts to operate on address 4 rather than on y. This program is compiled incorrectly due to improper constraints, and will cause a segmentation fault if the code in question is executed.

If instead we use "=&r" for both __tmp1 and __tmp2, gcc generates the following code:

func()
    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               90 02 20 04  add          %o0, 0x4, %o0
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              d8 00 40 00  ld           [%g1], %o4	<===
    func+0x18:              9a 03 20 01  add          %o4, 0x1, %o5
    func+0x1c:              db e0 50 0c  cas          [%g1] , %o4, %o5	<= OK
    func+0x20:              80 a3 00 0d  cmp          %o4, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              98 10 00 0d  mov          %o5, %o4
    func+0x2c:              81 c3 e0 08  retl         
    func+0x30:              9c 23 bf 88  sub          %sp, -0x78, %sp

This code now assigns o4 to __tmp1, which eliminates the problem described above. This function, however, still does not do the right thing. Why not?

Reloading

Compilers keep track of where each live variable in the program can be found; many variables can be found both at some memory location and in a register. Sometimes, the compiler chooses to use a register for a different variable, and stores the value back to its memory location (if it has changed) before doing so. Later, if this value is needed, the value must be loaded back into a register before being used. This is known as reloading. Other reasons reloading may be required include a variable's declaration as volatile and the case that concerns us here, a variable's modification via side effects.

In the example above, incr32() is actually operating on a memory address, not a register. So why did we assign __memory the "r" constraint instead of more correctly expressing the operand as "+m" (*__memory)? It turns out that the "m" constraint allows a variety of addressing modes. On SPARC, these include the register/offset mode (such as [%sp+0x64]). This is fine for instructions like ld and st, but the cas instruction is special: it allows no offset.

No constraint exists to describe this condition. The "V" constraint looks similar but is not correct: a bare register ([%g1]) is an offsettable address, so "V" would actually exclude the case we want. Conversely, "o", the inverse of "V", includes the register/offset addressing mode we specifically wish to exclude. So the only way to express this requirement is "r". But this does nothing to capture the fact that, although the pointer itself is not modified, the value at *__memory is altered by the assembly fragment. Is this a problem? Let's look at the assembly generated for func() a little more closely:

func()
    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               90 02 20 04  add          %o0, 0x4, %o0	<===
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              d8 00 40 00  ld           [%g1], %o4
    func+0x18:              9a 03 20 01  add          %o4, 0x1, %o5
    func+0x1c:              db e0 50 0c  cas          [%g1] , %o4, %o5
    func+0x20:              80 a3 00 0d  cmp          %o4, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              98 10 00 0d  mov          %o5, %o4
    func+0x2c:              81 c3 e0 08  retl         			<===
    func+0x30:              9c 23 bf 88  sub          %sp, -0x78, %sp

We see that gcc has assigned z the o0 register, which is not surprising given that it's the return value. But after o0 is set to x + 4 at the beginning of the function, it's never set again. The second assignment, z = x + y, has been discarded by the compiler! This is because gcc does not know that our inline assembly modified the value of y, so it did not reload the value and recalculate z.

There are two ways we can correct this problem: (a) add a "+m" output operand for *__memory, or (b) add "memory" to the list of clobbers. The "memory" clobber is special: it tells gcc not to trust the values in any registers it would otherwise believe to hold the current values of variables stored in memory. In short, it tells gcc that any such value must be reloaded from memory before it is used again. This is somewhat inefficient when we know which piece of memory has been touched, so (a) is preferable for better performance. Whichever solution we choose, gcc now compiles our code to:

func()
    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               98 10 00 08  mov          %o0, %o4
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              d6 00 40 00  ld           [%g1], %o3
    func+0x18:              9a 02 e0 01  add          %o3, 0x1, %o5
    func+0x1c:              db e0 50 0b  cas          [%g1] , %o3, %o5
    func+0x20:              80 a2 c0 0d  cmp          %o3, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              96 10 00 0d  mov          %o5, %o3
    func+0x2c:              d0 03 a0 64  ld           [%sp + 0x64], %o0	<===
    func+0x30:              90 03 00 08  add          %o4, %o0, %o0	<===
    func+0x34:              81 c3 e0 08  retl         
    func+0x38:              9c 23 bf 88  sub          %sp, -0x78, %sp

Note the reload of y from [%sp + 0x64]; the function now returns the correct result. There are actually two other ways to correct this, although the use of "+m" is the most correct. First, we could declare y to be volatile in func(). This would force gcc to reload its value from memory any time that value is required. Use of the volatile keyword is mainly useful when some external thread (or hardware) may change the value at any time; using it as a substitute for correct constraints will cause unnecessary reloading, degrading performance. Second, and perhaps best of all, the compiler could be modified to accept a SPARC-specific constraint for use with the cas instruction, one which requires the address of the operand to be stored in a general register.
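The two main fixes, (a) and (b), can be sketched side by side. This is an illustration under assumptions, not the libc code: it targets x86 with AT&T syntax, substitutes a locked incl for the SPARC cas loop, and uses made-up names (HAVE_X86_ASM, incr32_m(), incr32_clobber(), func_m(), func_clobber()). As in the SPARC case, the instruction addresses memory through an "r" operand; the "+m" operand in (a) is never referenced in the template and exists only to tell gcc which object changes.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define	HAVE_X86_ASM	1	/* hypothetical guard macro for this sketch */
#endif

/* (a) Name the modified object as a "+m" operand; address it via "r". */
static inline void
incr32_m(volatile uint32_t *mem)
{
#ifdef HAVE_X86_ASM
	__asm__ __volatile__(
		"lock; incl (%1)"
		: "+m" (*mem)		/* %0: unused in the template */
		: "r" (mem)		/* %1: bare register address */
		: "cc");
#else
	*mem = *mem + 1;		/* plain C fallback, not atomic */
#endif
}

/* (b) Say nothing specific; tell gcc that any memory may have changed. */
static inline void
incr32_clobber(volatile uint32_t *mem)
{
#ifdef HAVE_X86_ASM
	__asm__ __volatile__(
		"lock; incl (%0)"
		:
		: "r" (mem)
		: "cc", "memory");
#else
	*mem = *mem + 1;		/* plain C fallback, not atomic */
#endif
}

uint32_t
func_m(uint32_t x)
{
	uint32_t y = 4;

	incr32_m(&y);
	return (x + y);			/* y is reloaded after the fragment */
}

uint32_t
func_clobber(uint32_t x)
{
	uint32_t y = 4;

	incr32_clobber(&y);
	return (x + y);			/* y is reloaded after the fragment */
}
```

On SPARC the same operand changes would attach to the cas loop in incr32(): add "+m" (*__memory) to the outputs (which renumbers the pointer operand), or add "memory" next to "cc" in the clobbers.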

You can find more inline assembly examples in libc (math functions), MD5 acceleration, and the kernel illustrating these concepts. Be sure to read and understand the documentation completely before writing your own inline assembly for gcc, and always test your understanding by constructing and compiling simple test programs like these.

Posted at 01:32AM Dec 06, 2005 by wesolows in General | Comments[3] | Permalink

Comments:

Very interesting article. Keep 'em coming! Question #1: wouldn't it have been simpler to write straight assembler .S code and assemble it with `as`, then link it with the rest of the C code, rather than having to play these games and dance around GCC? Question #2: does it make sense to fiddle with GCC in these ways when Sun Studio compilers are now free? Note: the most ideal case would be to have the famous Amiga ASM-One assembler IDE available on Solaris, which is of course next to impossible because the source code is in MC680xx assembler. Reference: http://www.euronet.nl/users/jdm/documents/asmone.html

Posted by ux-admin on December 06, 2005 at 10:12 AM UTC #

ux-admin, yes it is simpler to write assembly in a separate file and use the normal function call interface. But it's also much slower, especially since most of these functions are just a few instructions long. They really need to be inlined for performance reasons, especially the really simple functions like caller() and curthread(). Second, yes, this is worthwhile for several reasons. From a purely technical point of view, gcc catches bugs that neither cc nor lint does, and gcc will be a boon to anyone doing a port since Studio doesn't offer support for any other architectures. So using both compilers will help us keep the code portable and bug-free. From a philosophical point of view, Studio is free as in beer, but personally I'd rather not shut out people who want to use a Free compiler with a Free operating system. Perhaps at some point Studio will be open and this argument will be moot, but the technical merits would remain.

Posted by Keith M Wesolowski on December 07, 2005 at 04:56 PM UTC #

[linuxkernelnewbies] GCC inline assembly, part 2 : Edicts from CLUSTRON
