================ @@ -0,0 +1,1990 @@ +=============== +AMDGPU Builtins +=============== + +.. contents:: + :local: + :depth: 2 + +This document describes the AMDGPU target-specific builtins available in Clang. +Most of these builtins provide direct access to AMDGPU hardware instructions +and intrinsics. + +.. note:: + + This document was generated with AI assistance, cross-referencing the + following sources: + + - ``clang/include/clang/Basic/BuiltinsAMDGPU.td`` (builtin definitions) + - ``llvm/include/llvm/IR/IntrinsicsAMDGPU.td`` (intrinsic definitions) + - ``clang/lib/Sema/SemaAMDGPU.cpp`` (argument validation and constraints) + - ``clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp`` (lowering logic) + - `GPUOpen Machine-Readable ISA <https://gpuopen.com/machine-readable-isa/>`_ + (ISA documents) + +.. warning:: + + These builtins, including their names, arguments, and target requirements, + are all subject to change without warning across LLVM releases. + +All AMDGPU builtins use the ``__builtin_amdgcn_`` prefix (or ``__builtin_r600_`` +for R600 targets). Some arguments must be compile-time constant expressions; +this is noted in the descriptions where applicable. + +ABI / Special Register Builtins +=============================== + +These builtins provide access to kernel dispatch metadata, work-item and +workgroup identification, and other ABI-level information. They are available +on all SI+ targets. + +Pointer Builtins +---------------- + +.. list-table:: + :header-rows: 1 + :widths: 40 60 + + * - Builtin + - Description + * - ``void __constant * __builtin_amdgcn_dispatch_ptr()`` + - Returns a read-only pointer to the dispatch packet, which contains + workgroup size, grid size, and other dispatch parameters. + * - ``void __constant * __builtin_amdgcn_kernarg_segment_ptr()`` + - Returns a pointer to the beginning of the kernel argument segment. + * - ``void __constant * __builtin_amdgcn_implicitarg_ptr()`` + - Returns a pointer to the implicit arguments appended after explicit + kernel arguments. Layout depends on the code object version. + * - ``void __constant * __builtin_amdgcn_queue_ptr()`` + - Returns a pointer to the queue_t object for the queue executing the + current kernel. + +Work-Item and Workgroup Identification +-------------------------------------- + +These builtins take no arguments and have no side effects. They return +``unsigned int`` (or ``unsigned short`` for workgroup size). + +.. list-table:: + :header-rows: 1 + :widths: 40 30 30 + + * - Builtin + - Return Type + - Description + * - ``__builtin_amdgcn_workgroup_id_{x,y,z}()`` + - ``unsigned int`` + - Workgroup ID in the specified dimension. + * - ``__builtin_amdgcn_workitem_id_{x,y,z}()`` + - ``unsigned int`` + - Work-item (thread) ID within the workgroup. + * - ``__builtin_amdgcn_workgroup_size_{x,y,z}()`` + - ``unsigned short`` + - Workgroup size in the specified dimension. + * - ``__builtin_amdgcn_grid_size_{x,y,z}()`` + - ``unsigned int`` + - Total grid size in the specified dimension. + +**GFX1250+ Cluster Identification** (requires ``gfx1250-insts``): + +.. list-table:: + :header-rows: 1 + :widths: 50 50 + + * - Builtin + - Description + * - ``__builtin_amdgcn_cluster_id_{x,y,z}()`` + - Cluster ID in the specified dimension. + * - ``__builtin_amdgcn_cluster_workgroup_id_{x,y,z}()`` + - Workgroup ID within the cluster. + * - ``__builtin_amdgcn_cluster_workgroup_flat_id()`` + - Flat (linearized) workgroup ID within the cluster. + * - ``__builtin_amdgcn_cluster_workgroup_max_id_{x,y,z}()`` + - Maximum workgroup ID within the cluster. + * - ``__builtin_amdgcn_cluster_workgroup_max_flat_id()`` + - Maximum flat workgroup ID within the cluster. + +Other ABI Builtins +------------------ + +.. list-table:: + :header-rows: 1 + :widths: 40 60 + + * - Builtin + - Description + * - ``unsigned int __builtin_amdgcn_mbcnt_lo(unsigned int mask, unsigned int val)`` + - Counts the number of bits set in ``mask`` among lanes 0--31 that are + below the current lane, and adds ``val``. Lanes 32--63 count all 32 + bits. + * - ``unsigned int __builtin_amdgcn_mbcnt_hi(unsigned int mask, unsigned int val)`` + - Counts the number of bits set in ``mask`` among lanes 32--63 that are + below the current lane, and adds ``val``. Lanes 0--31 add zero. + + **Typical usage**: To count the number of matching lanes below the + current lane across a full 64-lane wavefront, call ``mbcnt_lo`` first + and pass its result as ``val`` to ``mbcnt_hi``. + * - ``uint64_t __builtin_amdgcn_s_memtime()`` + - Returns the current 64-bit timestamp. Requires ``s-memtime-inst``. + +Instruction Builtins +==================== + +Scalar Instruction Builtins +--------------------------- + +.. list-table:: + :header-rows: 1 + :widths: 40 60 + + * - Builtin + - Description + * - ``unsigned int __builtin_amdgcn_s_getreg(int hwreg)`` + - Reads a hardware register field. ``hwreg`` is an encoded specifier + (register ID, bit offset, and bit width packed into 16 bits). + * - ``void __builtin_amdgcn_s_setreg(int hwreg, unsigned int val)`` + - Writes ``val`` into a hardware register field. ``hwreg`` is encoded + the same way as in ``s_getreg`` and must be in range [0, 65535]. + * - ``uint64_t __builtin_amdgcn_s_getpc()`` + - Returns the address of the next instruction (program counter). + * - ``void __builtin_amdgcn_s_waitcnt(int cnt)`` + - Waits until the number of outstanding memory and export operations drops + to the levels encoded in ``cnt``. + * - ``void __builtin_amdgcn_s_sendmsg(int msg, unsigned int gsdata)`` + - Sends a hardware message identified by ``msg``, with auxiliary data in + ``gsdata``. + * - ``void __builtin_amdgcn_s_sendmsghalt(int msg, unsigned int gsdata)`` + - Sends a hardware message and then halts the wavefront. + * - ``void __builtin_amdgcn_s_barrier()`` + - Synchronizes waves within a threadgroup (workgroup barrier). + * - ``void __builtin_amdgcn_s_ttracedata(int data)`` + - Sends ``data`` as user data to the thread trace stream. + * - ``void __builtin_amdgcn_s_sleep(int duration)`` + - Causes the wave to sleep for approximately ``duration`` cycles (up to + ~8000 clocks), or until an external event wakes the wave. + * - ``void __builtin_amdgcn_s_incperflevel(int level)`` + - Increments the performance counter specified in ``level`` by 1. + * - ``void __builtin_amdgcn_s_decperflevel(int level)`` + - Decrements the performance counter specified in ``level`` by 1. + * - ``void __builtin_amdgcn_s_setprio(short prio)`` + - Sets the wave scheduling priority to ``prio``. + * - ``void __builtin_amdgcn_s_dcache_inv()`` + - Invalidates the scalar data L0 cache. + * - ``void __builtin_amdgcn_buffer_wbinvl1()`` + - Writes back and invalidates the shader L1 cache. + * - ``unsigned int __builtin_amdgcn_groupstaticsize()`` + - Returns the size of static LDS allocation in the current workgroup. + * - ``unsigned int __builtin_amdgcn_wavefrontsize()`` + - Returns the wavefront size (32 or 64). + * - ``void __builtin_amdgcn_wave_barrier()`` + - Acts as a scheduling barrier within a wave, preventing the compiler from + moving memory operations across this point. + +Division and Math Builtins +-------------------------- + +Division Support +^^^^^^^^^^^^^^^^ + +These builtins implement individual steps of AMDGPU's iterative floating-point +division algorithm. They are typically used together to perform a full division. + +``__builtin_amdgcn_div_scale`` / ``__builtin_amdgcn_div_scalef`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + double __builtin_amdgcn_div_scale(double numer, double denom, bool select_quotient, bool *flag_out); + float __builtin_amdgcn_div_scalef(float numer, float denom, bool select_quotient, bool *flag_out); + +Scales the numerator or denominator to prepare for iterative division, +preventing subnormal intermediate values that would reduce precision. + +- ``numer``: The numerator. +- ``denom``: The denominator. +- ``select_quotient``: If ``true``, selects the numerator for scaling; if + ``false``, selects the denominator. +- ``flag_out``: Pointer to a ``bool`` where the flag indicating whether + post-scaling is required is written. + +**Implementation note**: The intrinsics return both the result and the flag. The +builtin unpacks them, returning the result directly and writing the flag through +the pointer. + +``__builtin_amdgcn_div_fmas`` / ``__builtin_amdgcn_div_fmasf`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + double __builtin_amdgcn_div_fmas(double a, double b, double c, bool vcc); + float __builtin_amdgcn_div_fmasf(float a, float b, float c, bool vcc); + +Fused multiply-add for division. Computes ``a * b + c``, then applies an +exponent correction if the ``vcc`` flag (from ``div_scale``) indicates it is +needed. + +- ``a``, ``b``, ``c``: FMA operands. +- ``vcc``: The flag from ``div_scale`` indicating whether scaling is needed. + +``__builtin_amdgcn_div_fixup`` / ``__builtin_amdgcn_div_fixupf`` / ``__builtin_amdgcn_div_fixuph`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + double __builtin_amdgcn_div_fixup(double a, double b, double c); + float __builtin_amdgcn_div_fixupf(float a, float b, float c); + __fp16 __builtin_amdgcn_div_fixuph(__fp16 a, __fp16 b, __fp16 c); // requires 16-bit-insts + +Final fixup step for division. Given a quotient in ``a``, denominator in ``b``, +and numerator in ``c``, handles special cases (divide by zero, NaN inputs, +overflow) and corrects the quotient accordingly. Also raises the appropriate +floating-point exceptions (invalid, denormal, divide-by-zero). + +Trigonometric Argument Reduction +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: c + + double __builtin_amdgcn_trig_preop(double src, int segment); + float __builtin_amdgcn_trig_preopf(float src, int segment); + +Performs a step of trigonometric argument reduction by looking up a segment of +the constant 2/PI, selected by ``segment`` (bits [4:0]). The result is scaled +by the exponent of ``src`` and returned. To reduce a large angle, call this +builtin multiple times with different ``segment`` values and combine the +results. Rounding is toward zero. + +Single-Argument Math Builtins +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +These builtins perform single-instruction math operations. Reciprocal, square +root, and reciprocal square root follow IEEE rules. Sine, cosine, log, and exp +are hardware approximations (roughly 0.5 ULP for most inputs). +The ``h``-suffixed variants operate on half-precision and require +``16-bit-insts``. + +.. list-table:: + :header-rows: 1 + :widths: 25 25 25 25 + + * - Operation + - f64 + - f32 + - f16 + * - Reciprocal + - ``__builtin_amdgcn_rcp`` + - ``__builtin_amdgcn_rcpf`` + - ``__builtin_amdgcn_rcph`` + * - Square root + - ``__builtin_amdgcn_sqrt`` + - ``__builtin_amdgcn_sqrtf`` + - ``__builtin_amdgcn_sqrth`` + * - Reciprocal sqrt + - ``__builtin_amdgcn_rsq`` + - ``__builtin_amdgcn_rsqf`` + - ``__builtin_amdgcn_rsqh`` + * - Reciprocal sqrt clamp + - ``__builtin_amdgcn_rsq_clamp`` + - ``__builtin_amdgcn_rsq_clampf`` + - + * - Sine (input: turns) + - + - ``__builtin_amdgcn_sinf`` + - ``__builtin_amdgcn_sinh`` + * - Cosine (input: turns) + - + - ``__builtin_amdgcn_cosf`` + - ``__builtin_amdgcn_cosh`` + * - Log2 + - + - ``__builtin_amdgcn_logf`` + - + * - Log clamp ---------------- jayfoad wrote:
Should this be "Log2 clamp"? (I'm not familiar with this builtin) https://github.com/llvm/llvm-project/pull/181193 _______________________________________________ cfe-commits mailing list [email protected] https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
