[clang] [RFC][Docs][Clang][AMDGPU] Add AMDGPU builtins documentation (PR #181193)

Jay Foad via cfe-commits Mon, 16 Feb 2026 03:25:28 -0800

================
@@ -0,0 +1,1990 @@
+===============
+AMDGPU Builtins
+===============
+
+.. contents::
+   :local:
+   :depth: 2
+
+This document describes the AMDGPU target-specific builtins available in Clang.
+Most of these builtins provide direct access to AMDGPU hardware instructions
+and intrinsics.
+
+.. note::
+
+   This document was generated with AI assistance, cross-referencing the
+   following sources:
+
+   - ``clang/include/clang/Basic/BuiltinsAMDGPU.td`` (builtin definitions)
+   - ``llvm/include/llvm/IR/IntrinsicsAMDGPU.td`` (intrinsic definitions)
+   - ``clang/lib/Sema/SemaAMDGPU.cpp`` (argument validation and constraints)
+   - ``clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp`` (lowering logic)
+   - `GPUOpen Machine-Readable ISA 
<https://gpuopen.com/machine-readable-isa/>`_
+     (ISA documents)
+
+.. warning::
+
+   These builtins, including their names, arguments, and target requirements,
+   are all subject to change without warning across LLVM releases.
+
+All AMDGPU builtins use the ``__builtin_amdgcn_`` prefix (or 
``__builtin_r600_``
+for R600 targets). Some arguments must be compile-time constant expressions;
+this is noted in the descriptions where applicable.
+
+ABI / Special Register Builtins
+===============================
+
+These builtins provide access to kernel dispatch metadata, work-item and
+workgroup identification, and other ABI-level information. They are available
+on all SI+ targets.
+
+Pointer Builtins
+----------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``void __constant * __builtin_amdgcn_dispatch_ptr()``
+     - Returns a read-only pointer to the dispatch packet, which contains
+       workgroup size, grid size, and other dispatch parameters.
+   * - ``void __constant * __builtin_amdgcn_kernarg_segment_ptr()``
+     - Returns a pointer to the beginning of the kernel argument segment.
+   * - ``void __constant * __builtin_amdgcn_implicitarg_ptr()``
+     - Returns a pointer to the implicit arguments appended after explicit
+       kernel arguments. Layout depends on the code object version.
+   * - ``void __constant * __builtin_amdgcn_queue_ptr()``
+     - Returns a pointer to the queue_t object for the queue executing the
+       current kernel.
+
+Work-Item and Workgroup Identification
+--------------------------------------
+
+These builtins take no arguments and have no side effects. They return
+``unsigned int`` (or ``unsigned short`` for workgroup size).
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 30 30
+
+   * - Builtin
+     - Return Type
+     - Description
+   * - ``__builtin_amdgcn_workgroup_id_{x,y,z}()``
+     - ``unsigned int``
+     - Workgroup ID in the specified dimension.
+   * - ``__builtin_amdgcn_workitem_id_{x,y,z}()``
+     - ``unsigned int``
+     - Work-item (thread) ID within the workgroup.
+   * - ``__builtin_amdgcn_workgroup_size_{x,y,z}()``
+     - ``unsigned short``
+     - Workgroup size in the specified dimension.
+   * - ``__builtin_amdgcn_grid_size_{x,y,z}()``
+     - ``unsigned int``
+     - Total grid size in the specified dimension.
+
+**GFX1250+ Cluster Identification** (requires ``gfx1250-insts``):
+
+.. list-table::
+   :header-rows: 1
+   :widths: 50 50
+
+   * - Builtin
+     - Description
+   * - ``__builtin_amdgcn_cluster_id_{x,y,z}()``
+     - Cluster ID in the specified dimension.
+   * - ``__builtin_amdgcn_cluster_workgroup_id_{x,y,z}()``
+     - Workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_flat_id()``
+     - Flat (linearized) workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_max_id_{x,y,z}()``
+     - Maximum workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_max_flat_id()``
+     - Maximum flat workgroup ID within the cluster.
+
+Other ABI Builtins
+------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_mbcnt_lo(unsigned int mask, unsigned 
int val)``
+     - Counts the number of bits set in ``mask`` among lanes 0--31 that are
+       below the current lane, and adds ``val``. Lanes 32--63 count all 32
+       bits.
+   * - ``unsigned int __builtin_amdgcn_mbcnt_hi(unsigned int mask, unsigned 
int val)``
+     - Counts the number of bits set in ``mask`` among lanes 32--63 that are
+       below the current lane, and adds ``val``. Lanes 0--31 add zero.
+
+       **Typical usage**: To count the number of matching lanes below the
+       current lane across a full 64-lane wavefront, call ``mbcnt_lo`` first
+       and pass its result as ``val`` to ``mbcnt_hi``.
+   * - ``uint64_t __builtin_amdgcn_s_memtime()``
+     - Returns the current 64-bit timestamp. Requires ``s-memtime-inst``.
+
+Instruction Builtins
+====================
+
+Scalar Instruction Builtins
+---------------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_s_getreg(int hwreg)``
+     - Reads a hardware register field. ``hwreg`` is an encoded specifier
+       (register ID, bit offset, and bit width packed into 16 bits).
+   * - ``void __builtin_amdgcn_s_setreg(int hwreg, unsigned int val)``
+     - Writes ``val`` into a hardware register field. ``hwreg`` is encoded
+       the same way as in ``s_getreg`` and must be in range [0, 65535].
+   * - ``uint64_t __builtin_amdgcn_s_getpc()``
+     - Returns the address of the next instruction (program counter).
+   * - ``void __builtin_amdgcn_s_waitcnt(int cnt)``
+     - Waits until the number of outstanding memory and export operations drops
+       to the levels encoded in ``cnt``.
+   * - ``void __builtin_amdgcn_s_sendmsg(int msg, unsigned int gsdata)``
+     - Sends a hardware message identified by ``msg``, with auxiliary data in
+       ``gsdata``.
+   * - ``void __builtin_amdgcn_s_sendmsghalt(int msg, unsigned int gsdata)``
+     - Sends a hardware message and then halts the wavefront.
+   * - ``void __builtin_amdgcn_s_barrier()``
+     - Synchronizes waves within a threadgroup (workgroup barrier).
+   * - ``void __builtin_amdgcn_s_ttracedata(int data)``
+     - Sends ``data`` as user data to the thread trace stream.
+   * - ``void __builtin_amdgcn_s_sleep(int duration)``
+     - Causes the wave to sleep for approximately ``duration`` cycles (up to
+       ~8000 clocks), or until an external event wakes the wave.
+   * - ``void __builtin_amdgcn_s_incperflevel(int level)``
+     - Increments the performance counter specified in ``level`` by 1.
+   * - ``void __builtin_amdgcn_s_decperflevel(int level)``
+     - Decrements the performance counter specified in ``level`` by 1.
+   * - ``void __builtin_amdgcn_s_setprio(short prio)``
+     - Sets the wave scheduling priority to ``prio``.
+   * - ``void __builtin_amdgcn_s_dcache_inv()``
+     - Invalidates the scalar data L0 cache.
+   * - ``void __builtin_amdgcn_buffer_wbinvl1()``
+     - Writes back and invalidates the shader L1 cache.
+   * - ``unsigned int __builtin_amdgcn_groupstaticsize()``
+     - Returns the size of static LDS allocation in the current workgroup.
+   * - ``unsigned int __builtin_amdgcn_wavefrontsize()``
+     - Returns the wavefront size (32 or 64).
+   * - ``void __builtin_amdgcn_wave_barrier()``
+     - Acts as a scheduling barrier within a wave, preventing the compiler from
+       moving memory operations across this point.
----------------
jayfoad wrote:


Seems odd to call these last three "instruction" builtins since they don't 
correspond to any single instruction in the ISA.

https://github.com/llvm/llvm-project/pull/181193
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[clang] [RFC][Docs][Clang][AMDGPU] Add AMDGPU builtins documentation (PR #181193)

Reply via email to