Long ago, I promised to write more about gcc inline
assembly, in
particular a few cases that are tricky to get right. Here, somewhat
belatedly, are those cases. These examples are taken from libc, but the
concepts apply to any inline assembly fragments you write for gcc. As I
mentioned previously,
these concerns apply only to gcc-style inlines; the Studio-style inline
format doesn't require that you use this same level of caution. gcc
expects you to write assembly fragments (even in a "separate" inline
function) as if they are logically a part of the caller. That is, the
compiler will allocate registers or other appropriate storage locations
to each of the input and output C variables. This requires that you
instruct the compiler very carefully as to your use of each variable,
and the variables' relationships to one another. The advantage is much
better register allocation; the compiler is free to allocate whatever
registers it wishes to your input and output variables in a manner that
is transparent to you. By contrast, Studio requires that you code the
fragment as if it were a leaf function, so the compiler does not do any
register allocation for you. You are permitted to use the caller-saved
registers any way you wish, and even to use the caller's stack as if
you are in a leaf function. Arguments and return values are stored in their
ABI-defined locations. Depending on the optimization level you use,
this can be wasteful of registers (though the peephole optimizer can
often clean up some of this waste) and can also make writing the
fragment much more difficult. In exchange, however, you don't have to
be nearly as careful to express the fragment's operation to the
compiler.
Inputs, Outputs, and Clobbers (oh my!)
Each assembly fragment may have any or all of outputs,
inputs, and
clobbers. Each input and output maps a C variable or literal to a
string suitable for use as an assembly operand. These operands can then
be referenced as %0, %1, %2, etc. Operands are numbered from 0,
beginning with the outputs in the order they are listed and continuing
with the inputs. Alternatively, newer versions of gcc (3.1 and later)
allow the use of symbolic names, written %[name], for each input and
output. Clobbers are somewhat
different; they express the set of registers and/or memory whose values
are changed by the fragment but are not expressed in the outputs.
Inputs which are also changed must be listed as outputs, not clobbers.
Normally, the clobbers include explicit registers used by certain
instructions, but may also include "cc" to indicate that
the condition code registers are modified and/or "memory"
to indicate that arbitrary memory addresses have had their contents
altered.
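As a rough illustration of operand numbering and symbolic names, the sketch below writes the same addition twice. The function names add_numbered() and add_named() are hypothetical and do not come from libc. The x86 branch and the plain C fallback are assumptions added only so the fragment can be built on machines other than SPARC.

```c
/* Hypothetical helpers for illustration; not taken from libc. */

/* Numbered operands: %0 is the first output, %1 and %2 are the inputs. */
static inline int
add_numbered(int a, int b)
{
    int sum;

#if defined(__sparc__)
    __asm__("add %1, %2, %0"
        : "=r" (sum)
        : "r" (a), "r" (b));
#elif defined(__x86_64__) || defined(__i386__)
    /* Assumed x86 equivalent; addl changes the flags, hence "cc". */
    /* "=&r": sum is written by movl before %2 has been read. */
    __asm__("movl %1, %0\n\t"
        "addl %2, %0"
        : "=&r" (sum)
        : "r" (a), "r" (b)
        : "cc");
#else
    sum = a + b;    /* no inline assembly on other targets */
#endif
    return (sum);
}

/* Symbolic names: the same fragment, with operands referenced by name. */
static inline int
add_named(int a, int b)
{
    int sum;

#if defined(__sparc__)
    __asm__("add %[lhs], %[rhs], %[out]"
        : [out] "=r" (sum)
        : [lhs] "r" (a), [rhs] "r" (b));
#elif defined(__x86_64__) || defined(__i386__)
    __asm__("movl %[lhs], %[out]\n\t"
        "addl %[rhs], %[out]"
        : [out] "=&r" (sum)
        : [lhs] "r" (a), [rhs] "r" (b)
        : "cc");
#else
    sum = a + b;    /* no inline assembly on other targets */
#endif
    return (sum);
}
```

Symbolic names make it harder to misnumber operands when an output is added or removed later, since every later operand would otherwise shift by one.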
Constraints
Each output and input is described by a constraint, a short string in a
small language specifying the type of operand that will contain the
value of a variable. Common constraints include "r", indicating that
a general register should be allocated, and "m"
indicating
that some type of memory location should be used. The complete list of
constraints is found in the
gcc documentation. These constraints may contain modifiers, which
give gcc more information about how the operand will be used. The most
common modifiers are "=", "+", and
"&". The "=" modifier is used to
indicate
that the operand is output-only; it may appear only in the constraint
for an output variable. Even if the constraint is applied to a variable
containing an existing value in your program, there is no guarantee
that
it will contain that value when your assembly fragment is executed. If
you need that, you must use the "+" modifier instead of
"="; this tells the compiler that this operand is
both an
input and an output. Even so, the variable with this constraint is
listed only in the outputs section of the fragment's specification.
An alternative way to express the same thing, described in the
documentation, is a matching constraint: the variable is listed as an
output with "=" and again as an input whose constraint is the number of
that output operand (for example "0"). Note that simply providing the
same variable as both an input and an output does not guarantee that
the same location (register, address, etc.) will be used for both of
them. Thus the following is generally incorrect:
static inline int
add(int var1, int var2)
{
    __asm__(
        "add %2, %0"
        : "=r" (var1)
        : "r" (var1), "r" (var2));
    return (var1);
}
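For comparison, here is a minimal sketch of two ways the fragment could be written so that the input and output are guaranteed to share a location. The names add_rw() and add_matched() are hypothetical. The x86 branch and the C fallback are assumptions added so the example builds on non-SPARC machines.

```c
/* Hypothetical corrected variants; names are illustrative only. */

/* "+r": var1 is read and written through a single operand. */
static inline int
add_rw(int var1, int var2)
{
#if defined(__sparc__)
    __asm__("add %0, %1, %0"
        : "+r" (var1)
        : "r" (var2));
#elif defined(__x86_64__) || defined(__i386__)
    /* Assumed x86 equivalent; addl changes the flags, hence "cc". */
    __asm__("addl %1, %0"
        : "+r" (var1)
        : "r" (var2)
        : "cc");
#else
    var1 += var2;    /* no inline assembly on other targets */
#endif
    return (var1);
}

/* Matching constraint: input "0" must occupy operand 0's location. */
static inline int
add_matched(int var1, int var2)
{
    int sum;

#if defined(__sparc__)
    __asm__("add %0, %2, %0"
        : "=r" (sum)
        : "0" (var1), "r" (var2));
#elif defined(__x86_64__) || defined(__i386__)
    __asm__("addl %2, %0"
        : "=r" (sum)
        : "0" (var1), "r" (var2)
        : "cc");
#else
    sum = var1 + var2;    /* no inline assembly on other targets */
#endif
    return (sum);
}
```

The matching-constraint form is handy when the input and output are different C variables that should nevertheless live in the same register.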
The "&" (earlyclobber) modifier is used on an output operand whose
value is overwritten before all the input operands are consumed. It
prevents gcc from using the same register for that output and for any
input operand. For example, for swap32() (see also the Studio inline
function), we might think to write:
extern __inline__ uint32_t
swap32(volatile uint32_t *__memory, uint32_t __value)
{
    ...
    uint32_t __tmp1, __tmp2;
    __asm__ __volatile__(
        "ld [%3], %1\n\t"
        "1:\n\t"
        "mov %0, %2\n\t"
        "cas [%3], %1, %2\n\t"
        "cmp %1, %2\n\t"
        "bne,a,pn %%icc, 1b\n\t"
        " mov %2, %1"
        : "+r" (__value), "=r" (__tmp1), "=r" (__tmp2)
        : "r" (__memory)
        : "cc");
    return (__tmp2);
}
But suppose gcc decided to allocate o0 to
both
__tmp1 and __memory. This is
allowable,
because the "=r" constraint implies that the
corresponding
register is set only after all input-only operands are no longer needed
(input/output operands obviously don't have this problem). In the case
above, the first load would clobber o0 and the
cas would operate on an arbitrary location.
Instead, we
must write "=&r" for both __tmp1 and
__tmp2; neither variable may safely be allocated
the same
register as the input operand.
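The same rule can be seen in a much smaller fragment. The sketch below uses a hypothetical helper, twice_plus(), that computes 2*a + b in more than one instruction, so its output is written while an input is still unread. The x86 branch and the C fallback are assumptions added so the example builds on non-SPARC machines.

```c
/* Hypothetical example: returns 2*a + b. */
static inline int
twice_plus(int a, int b)
{
    int out;

#if defined(__sparc__)
    /*
     * The first add writes %0 before the second add reads %2. Without
     * "&", gcc would be free to give out and b the same register, and
     * b would be destroyed before it was used.
     */
    __asm__("add %1, %1, %0\n\t"
        "add %0, %2, %0"
        : "=&r" (out)
        : "r" (a), "r" (b));
#elif defined(__x86_64__) || defined(__i386__)
    /* Assumed x86 equivalent; addl changes the flags, hence "cc". */
    __asm__("movl %1, %0\n\t"
        "addl %1, %0\n\t"
        "addl %2, %0"
        : "=&r" (out)
        : "r" (a), "r" (b)
        : "cc");
#else
    out = 2 * a + b;    /* no inline assembly on other targets */
#endif
    return (out);
}
```

A single-instruction fragment never needs "&", because the instruction reads all of its inputs before it writes its result.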
Bugs caused by omitting the earlyclobber are painful to
track down
because they often appear and disappear from one compilation to the
next as entirely unrelated code changes cause increases or decreases
in register pressure.
This is not an academic concern. Consider this example
program:
#include <stdint.h>
static __inline__ void
incr32(volatile uint32_t *__memory)
{
    uint32_t __tmp1, __tmp2;
    __asm__ __volatile__(
        "ld [%2], %0\n\t"
        "1:\n\t"
        "add %0, 1, %1\n\t"
        "cas [%2], %0, %1\n\t"
        "cmp %0, %1\n\t"
        "bne,a,pn %%icc, 1b\n\t"
        " mov %1, %0"
        : "=r" (__tmp1), "=r" (__tmp2)
        : "r" (__memory)
        : "cc");
}
uint32_t
func(uint32_t x)
{
    uint32_t y = 4;
    uint32_t z = x + y;
    incr32(&y);
    z = x + y;
    return (z);
}
gcc compiles this (use -O2 -mcpu=v9 -mv8plus) into:
func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 90 02 20 04 add %o0, 0x4, %o0
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: c2 00 40 00 ld [%g1], %g1 <===
func+0x18: 9a 00 60 01 add %g1, 0x1, %o5
func+0x1c: db e0 50 01 cas [%g1] , %g1, %o5 <= SEGV
func+0x20: 80 a0 40 0d cmp %g1, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 82 10 00 0d mov %o5, %g1
func+0x2c: 81 c3 e0 08 retl
func+0x30: 9c 23 bf 88 sub %sp, -0x78, %sp
In this case, gcc has allocated g1 to both __tmp1 and __memory, and o5
to __tmp2. Note the highlighted instructions: the initial load destroys
the address held in g1, and the subsequent cas then uses the value that
was stored at *__memory when the fragment began as if it were an
address. In this example g1 initially holds sp+0x64, which is simply
the address of y, so after the load it holds 4, the value of y. This
program is compiled incorrectly due to improper constraints, and will
cause a segmentation fault if the code in question is executed.
If instead we use "=&r" for both __tmp1
and __tmp2, gcc generates the following code:
func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 90 02 20 04 add %o0, 0x4, %o0
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: d8 00 40 00 ld [%g1], %o4 <===
func+0x18: 9a 03 20 01 add %o4, 0x1, %o5
func+0x1c: db e0 50 0c cas [%g1] , %o4, %o5 <= OK
func+0x20: 80 a3 00 0d cmp %o4, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 98 10 00 0d mov %o5, %o4
func+0x2c: 81 c3 e0 08 retl
func+0x30: 9c 23 bf 88 sub %sp, -0x78, %sp
This code now assigns o4 to __tmp1,
which
eliminates the problem described above. This function, however, still
does not do the right thing. Why not?
Reloading
Compilers keep track of where each live variable in the
program can
be found; many variables can be found both at some memory location and
in a register. Sometimes, the compiler chooses to use a register for a
different variable, and stores the value back to its memory location
(if
it has changed) before doing so. Later, if this value is needed, the
value must be loaded back into a register before being used. This is
known as reloading. Other reasons reloading may be required include a
variable's declaration as volatile and the case that
concerns us here, a variable's modification via side effects.
In the example above, incr32() is actually
operating on
a memory address, not a register. So why did we assign
__memory the "r" constraint instead
of more
correctly expressing the constraint as "+m" (*__memory)?
It turns out that the "m" constraint allows a variety of
possible addressing modes. On SPARC, this includes the register/offset
mode (such as [%sp+0x64]). This is fine for instructions
like ld and st, but the cas
instruction is special: it allows no offset. No constraint exists to
describe this condition. The "V" constraint, which matches a memory
operand that is not offsettable, looks similar but is not correct: a
bare register ([%g1]) is an offsettable address, so "V" would actually
exclude the case we want. Conversely, "o", the inverse constraint of
"V", includes the register/offset addressing mode we specifically wish
to exclude. So, the only way to satisfy cas is to pass the address
itself in a register using "r". But this does nothing to capture the
fact that although the pointer itself is not modified, the value at
*__memory is altered by the assembly fragment. Is
this a
problem? Let's look at the assembly generated for func()
a
little more closely:
func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 90 02 20 04 add %o0, 0x4, %o0 <===
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: d8 00 40 00 ld [%g1], %o4
func+0x18: 9a 03 20 01 add %o4, 0x1, %o5
func+0x1c: db e0 50 0c cas [%g1] , %o4, %o5
func+0x20: 80 a3 00 0d cmp %o4, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 98 10 00 0d mov %o5, %o4
func+0x2c: 81 c3 e0 08 retl <===
func+0x30: 9c 23 bf 88 sub %sp, -0x78, %sp
We see that gcc has assigned z the o0
register, which is not surprising given that it's the return value. But
after o0 is set to x + 4 at the beginning
of
the function, it's never set again. The line z = x + y
has
been discarded by the compiler! This is because it does not know that
our inline assembly modified the value of y, so it did
not
reload the value and recalculate z.
There are two ways we can correct this problem: (a) add a "+m" output
operand for *__memory, or (b) add "memory" to the list of clobbers. The
latter is a special clobber that tells gcc not to trust any register it
would otherwise believe to hold the current value of a variable stored
in memory. In short, it tells gcc that any value cached from memory
must be reloaded before it is used again. This is somewhat inefficient
when we know which piece of memory has been touched, so (a) is
preferable for better performance.
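A sketch of incr32() with option (a) and the earlyclobber fix both applied might look like the following. Symbolic operand names are used so that adding the "+m" operand does not renumber the template. The names old, new, mem and addr are illustrative, and the non-SPARC fallback using the gcc builtin __sync_fetch_and_add() is an assumption added only so the file builds elsewhere. It is a stand-in, not a demonstration of the constraints.

```c
#include <stdint.h>

static __inline__ void
incr32(volatile uint32_t *__memory)
{
#if defined(__sparc__)
    uint32_t __tmp1, __tmp2;

    /*
     * "=&r" keeps both temporaries out of the register that holds the
     * address; "+m" tells gcc that *__memory is read and modified.
     * For option (b), drop the "+m" operand and use the clobber list
     * "cc", "memory" instead.
     */
    __asm__ __volatile__(
        "ld [%[addr]], %[old]\n\t"
        "1:\n\t"
        "add %[old], 1, %[new]\n\t"
        "cas [%[addr]], %[old], %[new]\n\t"
        "cmp %[old], %[new]\n\t"
        "bne,a,pn %%icc, 1b\n\t"
        " mov %[new], %[old]"
        : [old] "=&r" (__tmp1), [new] "=&r" (__tmp2),
          [mem] "+m" (*__memory)
        : [addr] "r" (__memory)
        : "cc");
#else
    /* Assumed stand-in for non-SPARC targets. */
    (void) __sync_fetch_and_add(__memory, 1);
#endif
}

uint32_t
func(uint32_t x)
{
    uint32_t y = 4;
    uint32_t z = x + y;

    incr32(&y);
    z = x + y;    /* y must be re-read here; it is now 5 */
    return (z);
}
```

On SPARC this needs the same -mcpu=v9 -mv8plus flags as before, since cas is a V9 instruction.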
Whichever solution we choose, gcc now compiles our code to:
func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 98 10 00 08 mov %o0, %o4
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: d6 00 40 00 ld [%g1], %o3
func+0x18: 9a 02 e0 01 add %o3, 0x1, %o5
func+0x1c: db e0 50 0b cas [%g1] , %o3, %o5
func+0x20: 80 a2 c0 0d cmp %o3, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 96 10 00 0d mov %o5, %o3
func+0x2c: d0 03 a0 64 ld [%sp + 0x64], %o0 <===
func+0x30: 90 03 00 08 add %o4, %o0, %o0 <===
func+0x34: 81 c3 e0 08 retl
func+0x38: 9c 23 bf 88 sub %sp, -0x78, %sp
Note the reload; the function will now return the correct result.
There are actually two other ways to correct this, although the use of
"+m" is the most correct. First, we could declare y to be volatile in
func(). This would force gcc to reload its value from memory any time
that value is required. Use of the volatile keyword is mainly useful
when some external thread (or hardware) may change the value at any
time; using it as a substitute for correct constraints will cause
unnecessary reloading, degrading performance. Second, and perhaps best
of all, the compiler could be modified to accept a SPARC-specific
constraint for use with the cas instruction, one which
requires the address of the operand to be stored in a general register.
You can find more inline assembly examples illustrating these concepts
in libc (math functions), MD5 acceleration, and the kernel. Be sure to
read and understand the documentation completely before writing your
own inline assembly for gcc, and always test your understanding by
constructing and compiling simple test programs like these.
Posted by ux-admin on December 06, 2005 at 10:12 AM UTC #
Posted by Keith M Wesolowski on December 07, 2005 at 04:56 PM UTC #