[clang] [RFC][Docs][Clang][AMDGPU] Add AMDGPU builtins documentation (PR #181193)

Jay Foad via cfe-commits Mon, 16 Feb 2026 03:25:28 -0800

================
@@ -0,0 +1,1990 @@
+===============
+AMDGPU Builtins
+===============
+
+.. contents::
+   :local:
+   :depth: 2
+
+This document describes the AMDGPU target-specific builtins available in Clang.
+Most of these builtins provide direct access to AMDGPU hardware instructions
+and intrinsics.
+
+.. note::
+
+   This document was generated with AI assistance, cross-referencing the
+   following sources:
+
+   - ``clang/include/clang/Basic/BuiltinsAMDGPU.td`` (builtin definitions)
+   - ``llvm/include/llvm/IR/IntrinsicsAMDGPU.td`` (intrinsic definitions)
+   - ``clang/lib/Sema/SemaAMDGPU.cpp`` (argument validation and constraints)
+   - ``clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp`` (lowering logic)
+   - `GPUOpen Machine-Readable ISA 
<https://gpuopen.com/machine-readable-isa/>`_
+     (ISA documents)
+
+.. warning::
+
+   These builtins, including their names, arguments, and target requirements,
+   are all subject to change without warning across LLVM releases.
+
+All AMDGPU builtins use the ``__builtin_amdgcn_`` prefix (or 
``__builtin_r600_``
+for R600 targets). Some arguments must be compile-time constant expressions;
+this is noted in the descriptions where applicable.
+
+ABI / Special Register Builtins
+===============================
+
+These builtins provide access to kernel dispatch metadata, work-item and
+workgroup identification, and other ABI-level information. They are available
+on all SI+ targets.
+
+Pointer Builtins
+----------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``void __constant * __builtin_amdgcn_dispatch_ptr()``
+     - Returns a read-only pointer to the dispatch packet, which contains
+       workgroup size, grid size, and other dispatch parameters.
+   * - ``void __constant * __builtin_amdgcn_kernarg_segment_ptr()``
+     - Returns a pointer to the beginning of the kernel argument segment.
+   * - ``void __constant * __builtin_amdgcn_implicitarg_ptr()``
+     - Returns a pointer to the implicit arguments appended after explicit
+       kernel arguments. Layout depends on the code object version.
+   * - ``void __constant * __builtin_amdgcn_queue_ptr()``
+     - Returns a pointer to the queue_t object for the queue executing the
+       current kernel.
+
+Work-Item and Workgroup Identification
+--------------------------------------
+
+These builtins take no arguments and have no side effects. They return
+``unsigned int`` (or ``unsigned short`` for workgroup size).
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 30 30
+
+   * - Builtin
+     - Return Type
+     - Description
+   * - ``__builtin_amdgcn_workgroup_id_{x,y,z}()``
+     - ``unsigned int``
+     - Workgroup ID in the specified dimension.
+   * - ``__builtin_amdgcn_workitem_id_{x,y,z}()``
+     - ``unsigned int``
+     - Work-item (thread) ID within the workgroup.
+   * - ``__builtin_amdgcn_workgroup_size_{x,y,z}()``
+     - ``unsigned short``
+     - Workgroup size in the specified dimension.
+   * - ``__builtin_amdgcn_grid_size_{x,y,z}()``
+     - ``unsigned int``
+     - Total grid size in the specified dimension.
+
+**GFX1250+ Cluster Identification** (requires ``gfx1250-insts``):
+
+.. list-table::
+   :header-rows: 1
+   :widths: 50 50
+
+   * - Builtin
+     - Description
+   * - ``__builtin_amdgcn_cluster_id_{x,y,z}()``
+     - Cluster ID in the specified dimension.
+   * - ``__builtin_amdgcn_cluster_workgroup_id_{x,y,z}()``
+     - Workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_flat_id()``
+     - Flat (linearized) workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_max_id_{x,y,z}()``
+     - Maximum workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_max_flat_id()``
+     - Maximum flat workgroup ID within the cluster.
+
+Other ABI Builtins
+------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_mbcnt_lo(unsigned int mask, unsigned 
int val)``
+     - Counts the number of bits set in ``mask`` among lanes 0--31 that are
+       below the current lane, and adds ``val``. Lanes 32--63 count all 32
+       bits.
+   * - ``unsigned int __builtin_amdgcn_mbcnt_hi(unsigned int mask, unsigned 
int val)``
+     - Counts the number of bits set in ``mask`` among lanes 32--63 that are
+       below the current lane, and adds ``val``. Lanes 0--31 add zero.
+
+       **Typical usage**: To count the number of matching lanes below the
+       current lane across a full 64-lane wavefront, call ``mbcnt_lo`` first
+       and pass its result as ``val`` to ``mbcnt_hi``.
+   * - ``uint64_t __builtin_amdgcn_s_memtime()``
+     - Returns the current 64-bit timestamp. Requires ``s-memtime-inst``.
+
+Instruction Builtins
+====================
+
+Scalar Instruction Builtins
+---------------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_s_getreg(int hwreg)``
+     - Reads a hardware register field. ``hwreg`` is an encoded specifier
+       (register ID, bit offset, and bit width packed into 16 bits).
+   * - ``void __builtin_amdgcn_s_setreg(int hwreg, unsigned int val)``
+     - Writes ``val`` into a hardware register field. ``hwreg`` is encoded
+       the same way as in ``s_getreg`` and must be in range [0, 65535].
+   * - ``uint64_t __builtin_amdgcn_s_getpc()``
+     - Returns the address of the next instruction (program counter).
+   * - ``void __builtin_amdgcn_s_waitcnt(int cnt)``
+     - Waits until the number of outstanding memory and export operations drops
+       to the levels encoded in ``cnt``.
+   * - ``void __builtin_amdgcn_s_sendmsg(int msg, unsigned int gsdata)``
+     - Sends a hardware message identified by ``msg``, with auxiliary data in
+       ``gsdata``.
+   * - ``void __builtin_amdgcn_s_sendmsghalt(int msg, unsigned int gsdata)``
+     - Sends a hardware message and then halts the wavefront.
+   * - ``void __builtin_amdgcn_s_barrier()``
+     - Synchronizes waves within a threadgroup (workgroup barrier).
+   * - ``void __builtin_amdgcn_s_ttracedata(int data)``
+     - Sends ``data`` as user data to the thread trace stream.
+   * - ``void __builtin_amdgcn_s_sleep(int duration)``
+     - Causes the wave to sleep for approximately ``duration`` cycles (up to
+       ~8000 clocks), or until an external event wakes the wave.
+   * - ``void __builtin_amdgcn_s_incperflevel(int level)``
+     - Increments the performance counter specified in ``level`` by 1.
+   * - ``void __builtin_amdgcn_s_decperflevel(int level)``
+     - Decrements the performance counter specified in ``level`` by 1.
+   * - ``void __builtin_amdgcn_s_setprio(short prio)``
+     - Sets the wave scheduling priority to ``prio``.
+   * - ``void __builtin_amdgcn_s_dcache_inv()``
+     - Invalidates the scalar data L0 cache.
+   * - ``void __builtin_amdgcn_buffer_wbinvl1()``
+     - Writes back and invalidates the shader L1 cache.
+   * - ``unsigned int __builtin_amdgcn_groupstaticsize()``
+     - Returns the size of static LDS allocation in the current workgroup.
+   * - ``unsigned int __builtin_amdgcn_wavefrontsize()``
+     - Returns the wavefront size (32 or 64).
+   * - ``void __builtin_amdgcn_wave_barrier()``
+     - Acts as a scheduling barrier within a wave, preventing the compiler from
+       moving memory operations across this point.
+
+Division and Math Builtins
+--------------------------
+
+Division Support
+^^^^^^^^^^^^^^^^
+
+These builtins implement individual steps of AMDGPU's iterative floating-point
+division algorithm. They are typically used together to perform a full 
division.
+
+``__builtin_amdgcn_div_scale`` / ``__builtin_amdgcn_div_scalef``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   double __builtin_amdgcn_div_scale(double numer, double denom, bool 
select_quotient, bool *flag_out);
+   float  __builtin_amdgcn_div_scalef(float numer, float denom, bool 
select_quotient, bool *flag_out);
+
+Scales the numerator or denominator to prepare for iterative division,
+preventing subnormal intermediate values that would reduce precision.
+
+- ``numer``: The numerator.
+- ``denom``: The denominator.
+- ``select_quotient``: If ``true``, selects the numerator for scaling; if
+  ``false``, selects the denominator.
+- ``flag_out``: Pointer to a ``bool`` where the flag indicating whether
+  post-scaling is required is written.
+
+**Implementation note**: The intrinsics return both the result and the flag. 
The
+builtin unpacks them, returning the result directly and writing the flag 
through
+the pointer.
+
+``__builtin_amdgcn_div_fmas`` / ``__builtin_amdgcn_div_fmasf``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   double __builtin_amdgcn_div_fmas(double a, double b, double c, bool vcc);
+   float  __builtin_amdgcn_div_fmasf(float a, float b, float c, bool vcc);
+
+Fused multiply-add for division. Computes ``a * b + c``, then applies an
+exponent correction if the ``vcc`` flag (from ``div_scale``) indicates it is
+needed.
+
+- ``a``, ``b``, ``c``: FMA operands.
+- ``vcc``: The flag from ``div_scale`` indicating whether scaling is needed.
+
+``__builtin_amdgcn_div_fixup`` / ``__builtin_amdgcn_div_fixupf`` / 
``__builtin_amdgcn_div_fixuph``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   double __builtin_amdgcn_div_fixup(double a, double b, double c);
+   float  __builtin_amdgcn_div_fixupf(float a, float b, float c);
+   __fp16 __builtin_amdgcn_div_fixuph(__fp16 a, __fp16 b, __fp16 c);  // 
requires 16-bit-insts
+
+Final fixup step for division. Given a quotient in ``a``, denominator in ``b``,
+and numerator in ``c``, handles special cases (divide by zero, NaN inputs,
+overflow) and corrects the quotient accordingly. Also raises the appropriate
+floating-point exceptions (invalid, denormal, divide-by-zero).
+
+Trigonometric Argument Reduction
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: c
+
+   double __builtin_amdgcn_trig_preop(double src, int segment);
+   float  __builtin_amdgcn_trig_preopf(float src, int segment);
+
+Performs a step of trigonometric argument reduction by looking up a segment of
+the constant 2/PI, selected by ``segment`` (bits [4:0]). The result is scaled
+by the exponent of ``src`` and returned. To reduce a large angle, call this
+builtin multiple times with different ``segment`` values and combine the
+results. Rounding is toward zero.
+
+Single-Argument Math Builtins
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These builtins perform single-instruction math operations. Reciprocal, square
+root, and reciprocal square root follow IEEE rules. Sine, cosine, log, and exp
+are hardware approximations (roughly 0.5 ULP for most inputs).
+The ``h``-suffixed variants operate on half-precision and require
+``16-bit-insts``.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 25 25 25 25
+
+   * - Operation
+     - f64
+     - f32
+     - f16
+   * - Reciprocal
+     - ``__builtin_amdgcn_rcp``
+     - ``__builtin_amdgcn_rcpf``
+     - ``__builtin_amdgcn_rcph``
+   * - Square root
+     - ``__builtin_amdgcn_sqrt``
+     - ``__builtin_amdgcn_sqrtf``
+     - ``__builtin_amdgcn_sqrth``
+   * - Reciprocal sqrt
+     - ``__builtin_amdgcn_rsq``
+     - ``__builtin_amdgcn_rsqf``
+     - ``__builtin_amdgcn_rsqh``
+   * - Reciprocal sqrt clamp
+     - ``__builtin_amdgcn_rsq_clamp``
+     - ``__builtin_amdgcn_rsq_clampf``
+     -
+   * - Sine (input: turns)
+     -
+     - ``__builtin_amdgcn_sinf``
+     - ``__builtin_amdgcn_sinh``
+   * - Cosine (input: turns)
+     -
+     - ``__builtin_amdgcn_cosf``
+     - ``__builtin_amdgcn_cosh``
+   * - Log2
+     -
+     - ``__builtin_amdgcn_logf``
+     -
+   * - Log clamp
----------------
jayfoad wrote:


Should this be "Log2 clamp"? (I'm not familiar with this builtin)

https://github.com/llvm/llvm-project/pull/181193
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[clang] [RFC][Docs][Clang][AMDGPU] Add AMDGPU builtins documentation (PR #181193)

Reply via email to