Re: [DOC PATCH] PowerPC extended asm example

Alan Modra Tue, 04 Apr 2017 05:15:24 -0700

Revised patch.

        * doc/extend.texi (Extended Asm <Clobbers>): Rename to
        "Clobbers and Scratch Registers".  Add OpenBLAS example.


diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 0f44ece..0b0a021 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -7869,7 +7869,7 @@ A comma-separated list of C expressions read by the instructions in the
 @item Clobbers
 A comma-separated list of registers or other values changed by the 
 @var{AssemblerTemplate}, beyond those listed as outputs.
-An empty list is permitted.  @xref{Clobbers}.
+An empty list is permitted.  @xref{Clobbers and Scratch Registers}.
 
 @item GotoLabels
 When you are using the @code{goto} form of @code{asm}, this section contains 
@@ -8229,7 +8229,7 @@ The enclosing parentheses are a required part of the syntax.
 
 When the compiler selects the registers to use to 
 represent the output operands, it does not use any of the clobbered registers 
-(@pxref{Clobbers}).
+(@pxref{Clobbers and Scratch Registers}).
 
 Output operand expressions must be lvalues. The compiler cannot check whether 
 the operands have data types that are reasonable for the instruction being 
@@ -8465,7 +8465,8 @@ as input.  The enclosing parentheses are a required part of the syntax.
 @end table
 
 When the compiler selects the registers to use to represent the input 
-operands, it does not use any of the clobbered registers (@pxref{Clobbers}).
+operands, it does not use any of the clobbered registers
+(@pxref{Clobbers and Scratch Registers}).
 
 If there are no output operands but there are input operands, place two 
 consecutive colons where the output operands would go:
@@ -8516,9 +8517,10 @@ asm ("cmoveq %1, %2, %[result]"
    : "r" (test), "r" (new), "[result]" (old));
 @end example
 
-@anchor{Clobbers}
-@subsubsection Clobbers
+@anchor{Clobbers and Scratch Registers}
+@subsubsection Clobbers and Scratch Registers
 @cindex @code{asm} clobbers
+@cindex @code{asm} scratch registers
 
 While the compiler is aware of changes to entries listed in the output 
 operands, the inline @code{asm} code may modify more than just the outputs. For
@@ -8589,6 +8591,110 @@ ten bytes of a string, use a memory input like:
 
 @end table
 
+Rather than allocating fixed registers via clobbers to provide scratch
+registers for an @code{asm} statement, there are better techniques you
+can use which give the compiler more freedom.  There are also better
+ways than using a @code{"memory"} clobber to tell GCC that an
+@code{asm} statement accesses or modifies memory.  The following
+PowerPC example taken from OpenBLAS illustrates some of these
+techniques.
+
+In the function shown below, all of the function parameters are inputs
+except for the @code{y} array, which is modified by the function.
+Only the first few lines of assembly in the @code{asm} statement are
+shown, along with a comment that is handy for checking register
+assignments.  These insns set up some registers for later use in
+loops; in particular, they set up four pointers into the @code{ap}
+array: @code{a0=ap}, @code{a1=ap+lda}, @code{a2=ap+2*lda}, and
+@code{a3=ap+3*lda}.  The rest of the assembly is too large to include here.
+
+@smallexample
+static void
+dgemv_kernel_4x4 (long n, const double *ap, long lda,
+                  const double *x, double *y, double alpha)
+@{
+  double *a0;
+  double *a1;
+  double *a2;
+  double *a3;
+
+  __asm__
+    (
+       "lxvd2x         34, 0, %10      \n\t"   // x0, x1
+       "lxvd2x         35, %11, %10    \n\t"   // x2, x3
+       "xxspltd        32, %x9, 0      \n\t"   // alpha, alpha
+       "sldi           %6, %13, 3      \n\t"   // lda * sizeof (double)
+       "xvmuldp        34, 34, 32      \n\t"   // x0 * alpha, x1 * alpha
+       "xvmuldp        35, 35, 32      \n\t"   // x2 * alpha, x3 * alpha
+       "add            %4, %3, %6      \n\t"   // a0 = ap, a1 = a0 + lda
+       "add            %6, %6, %6      \n\t"   // 2 * lda
+       "xxspltd        32, 34, 0       \n\t"   // x0 * alpha, x0 * alpha
+       "xxspltd        33, 34, 1       \n\t"   // x1 * alpha, x1 * alpha
+       "xxspltd        34, 35, 0       \n\t"   // x2 * alpha, x2 * alpha
+       "xxspltd        35, 35, 1       \n\t"   // x3 * alpha, x3 * alpha
+       "add            %5, %3, %6      \n\t"   // a2 = a0 + 2 * lda
+       "add            %6, %4, %6      \n\t"   // a3 = a1 + 2 * lda
+     ...
+     "#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
+     "#a0=%3 a1=%4 a2=%5 a3=%6"
+     :
+       "+m" (*y),
+       "+r" (n),       // 1
+       "+b" (y),       // 2
+       "=b" (a0),      // 3
+       "=b" (a1),      // 4
+       "=&b" (a2),     // 5
+       "=&b" (a3)      // 6
+     :
+       "m" (*x),
+       "m" (*ap),
+       "d" (alpha),    // 9
+       "r" (x),        // 10
+       "b" (16),       // 11
+       "3" (ap),       // 12
+       "4" (lda)       // 13
+     :
+       "cr0",
+       "vs32","vs33","vs34","vs35","vs36","vs37",
+       "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
+     );
+@}
+@end smallexample
+
+Allocating scratch registers is done by declaring a variable and
+making it an early-clobber @code{asm} output as with @code{a2} and
+@code{a3}, or making it an output tied to an input as with @code{a0}
+and @code{a1}.  You can use a normal @code{asm} output if all inputs
+that might share the same register are consumed before the scratch is
+used.  The VSX registers clobbered by the @code{asm} statement could
+have used the same technique except for GCC's limit on the number of
+@code{asm} parameters.  Given the description above, it should be no
+surprise that @code{a0} is tied to @code{ap}.  @code{lda} is only
+used in the fourth machine insn shown above, so its register is
+available for reuse as @code{a1}.  Note that tying an input to an
+output is the way to set up an initialized temporary register modified
+by an @code{asm} statement.  The example also shows an initialized
+register unchanged by the @code{asm} statement; @code{"b" (16)} sets
+up @code{%11} to 16.
+
+Rather than using a @code{"memory"} clobber, the @code{asm} has
+@code{"+m" (*y)} in the list of outputs to tell GCC that the @code{y}
+array is both read and written by the @code{asm} statement.
+@code{"m" (*x)} and @code{"m" (*ap)} in the inputs tell GCC that these
+arrays are read.  At a minimum, aliasing rules allow GCC to know what
+memory @emph{doesn't} need to be flushed, and if the function is
+inlined, GCC may be able to do even better.  Also, if GCC can
+prove that all of the outputs of an @code{asm} statement are unused,
+then the @code{asm} may be deleted.  Removal of dead @code{asm}
+statements will not happen if they clobber @code{"memory"}.  Notice
+that @code{x}, @code{y}, and @code{ap} all appear twice in the
+@code{asm} parameters, once to specify memory accessed, and once to
+specify a base register used by the @code{asm}.  You won't normally be
+wasting a register by doing this as GCC can use the same register for
+both purposes.  However, it would be foolish to use both @code{%0} and
+@code{%2} for @code{y} in this @code{asm} assembly and expect them to
+be the same.
+
 @anchor{GotoLabels}
 @subsubsection Goto Labels
 @cindex @code{asm} goto labels
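
For anyone who wants to try the two techniques without a PowerPC
toolchain, here is a minimal sketch of the same ideas, not part of the
patch.  It assumes x86-64 with an LP64 ABI and falls back to plain C
elsewhere.  The names add2 and struct pair are hypothetical, and the
x86 instructions merely stand in for the VSX code above.  The scratch
register is an early-clobber output rather than a fixed clobber, and
typed memory operands replace a "memory" clobber.

```c
/* Sketch only: add2 and struct pair are hypothetical names, not taken
   from OpenBLAS or from the patch.  */
struct pair { long v[2]; };

/* y[0] += x[0]; y[1] += x[1];  */
static void
add2 (long *y, const long *x)
{
#if defined (__GNUC__) && defined (__x86_64__) && defined (__LP64__)
  long tmp;	/* scratch; the compiler chooses the register */

  __asm__ ("movq\t(%[x]), %[tmp]\n\t"
	   "addq\t%[tmp], (%[y])\n\t"
	   "movq\t8(%[x]), %[tmp]\n\t"
	   "addq\t%[tmp], 8(%[y])"
	   : "+m" (*(struct pair *) y),		/* y[0..1] read and written */
	     [tmp] "=&r" (tmp)			/* early-clobber scratch */
	   : "m" (*(const struct pair *) x),	/* x[0..1] read */
	     [x] "r" (x),			/* base register for x */
	     [y] "r" (y)			/* base register for y */
	   : "cc");
#else
  /* Portable fallback so the sketch builds on other targets.  */
  y[0] += x[0];
  y[1] += x[1];
#endif
}
```

The early-clobber modifier on tmp matters here because x and y are
still needed after tmp is first written.  As in the PowerPC example, x
and y each appear twice, once as memory accessed and once as a base
register, and the template only uses the register operands.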

-- 
Alan Modra
Australia Development Lab, IBM
