================ @@ -0,0 +1,1990 @@ +=============== +AMDGPU Builtins +=============== + +.. contents:: + :local: + :depth: 2 + +This document describes the AMDGPU target-specific builtins available in Clang. +Most of these builtins provide direct access to AMDGPU hardware instructions +and intrinsics. + +.. note:: + + This document was generated with AI assistance, cross-referencing the + following sources: + + - ``clang/include/clang/Basic/BuiltinsAMDGPU.td`` (builtin definitions) + - ``llvm/include/llvm/IR/IntrinsicsAMDGPU.td`` (intrinsic definitions) + - ``clang/lib/Sema/SemaAMDGPU.cpp`` (argument validation and constraints) + - ``clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp`` (lowering logic) + - `GPUOpen Machine-Readable ISA <https://gpuopen.com/machine-readable-isa/>`_ + (ISA documents) + +.. warning:: + + These builtins, including their names, arguments, and target requirements, + are all subject to change without warning across LLVM releases. + +All AMDGPU builtins use the ``__builtin_amdgcn_`` prefix (or ``__builtin_r600_`` +for R600 targets). Some arguments must be compile-time constant expressions; +this is noted in the descriptions where applicable. + +ABI / Special Register Builtins +=============================== + +These builtins provide access to kernel dispatch metadata, work-item and +workgroup identification, and other ABI-level information. They are available +on all SI+ targets. + +Pointer Builtins +---------------- + +.. list-table:: + :header-rows: 1 + :widths: 40 60 + + * - Builtin + - Description + * - ``void __constant * __builtin_amdgcn_dispatch_ptr()`` + - Returns a read-only pointer to the dispatch packet, which contains + workgroup size, grid size, and other dispatch parameters. + * - ``void __constant * __builtin_amdgcn_kernarg_segment_ptr()`` + - Returns a pointer to the beginning of the kernel argument segment. + * - ``void __constant * __builtin_amdgcn_implicitarg_ptr()`` + - Returns a pointer to the implicit arguments appended after explicit + kernel arguments. Layout depends on the code object version. + * - ``void __constant * __builtin_amdgcn_queue_ptr()`` + - Returns a pointer to the queue_t object for the queue executing the + current kernel. + +Work-Item and Workgroup Identification +-------------------------------------- + +These builtins take no arguments and have no side effects. They return +``unsigned int`` (or ``unsigned short`` for workgroup size). + +.. list-table:: + :header-rows: 1 + :widths: 40 30 30 + + * - Builtin + - Return Type + - Description + * - ``__builtin_amdgcn_workgroup_id_{x,y,z}()`` + - ``unsigned int`` + - Workgroup ID in the specified dimension. + * - ``__builtin_amdgcn_workitem_id_{x,y,z}()`` + - ``unsigned int`` + - Work-item (thread) ID within the workgroup. + * - ``__builtin_amdgcn_workgroup_size_{x,y,z}()`` + - ``unsigned short`` + - Workgroup size in the specified dimension. + * - ``__builtin_amdgcn_grid_size_{x,y,z}()`` + - ``unsigned int`` + - Total grid size in the specified dimension. + +**GFX1250+ Cluster Identification** (requires ``gfx1250-insts``): + +.. list-table:: + :header-rows: 1 + :widths: 50 50 + + * - Builtin + - Description + * - ``__builtin_amdgcn_cluster_id_{x,y,z}()`` + - Cluster ID in the specified dimension. + * - ``__builtin_amdgcn_cluster_workgroup_id_{x,y,z}()`` + - Workgroup ID within the cluster. + * - ``__builtin_amdgcn_cluster_workgroup_flat_id()`` + - Flat (linearized) workgroup ID within the cluster. + * - ``__builtin_amdgcn_cluster_workgroup_max_id_{x,y,z}()`` + - Maximum workgroup ID within the cluster. + * - ``__builtin_amdgcn_cluster_workgroup_max_flat_id()`` + - Maximum flat workgroup ID within the cluster. + +Other ABI Builtins +------------------ + +.. list-table:: + :header-rows: 1 + :widths: 40 60 + + * - Builtin + - Description + * - ``unsigned int __builtin_amdgcn_mbcnt_lo(unsigned int mask, unsigned int val)`` + - Counts the number of bits set in ``mask`` among lanes 0--31 that are + below the current lane, and adds ``val``. Lanes 32--63 count all 32 + bits. + * - ``unsigned int __builtin_amdgcn_mbcnt_hi(unsigned int mask, unsigned int val)`` + - Counts the number of bits set in ``mask`` among lanes 32--63 that are + below the current lane, and adds ``val``. Lanes 0--31 add zero. + + **Typical usage**: To count the number of matching lanes below the + current lane across a full 64-lane wavefront, call ``mbcnt_lo`` first + and pass its result as ``val`` to ``mbcnt_hi``. + * - ``uint64_t __builtin_amdgcn_s_memtime()`` + - Returns the current 64-bit timestamp. Requires ``s-memtime-inst``. + +Instruction Builtins +==================== + +Scalar Instruction Builtins +--------------------------- + +.. list-table:: + :header-rows: 1 + :widths: 40 60 + + * - Builtin + - Description + * - ``unsigned int __builtin_amdgcn_s_getreg(int hwreg)`` + - Reads a hardware register field. ``hwreg`` is an encoded specifier + (register ID, bit offset, and bit width packed into 16 bits). + * - ``void __builtin_amdgcn_s_setreg(int hwreg, unsigned int val)`` + - Writes ``val`` into a hardware register field. ``hwreg`` is encoded + the same way as in ``s_getreg`` and must be in range [0, 65535]. + * - ``uint64_t __builtin_amdgcn_s_getpc()`` + - Returns the address of the next instruction (program counter). + * - ``void __builtin_amdgcn_s_waitcnt(int cnt)`` + - Waits until the number of outstanding memory and export operations drops + to the levels encoded in ``cnt``. + * - ``void __builtin_amdgcn_s_sendmsg(int msg, unsigned int gsdata)`` + - Sends a hardware message identified by ``msg``, with auxiliary data in + ``gsdata``. + * - ``void __builtin_amdgcn_s_sendmsghalt(int msg, unsigned int gsdata)`` + - Sends a hardware message and then halts the wavefront. + * - ``void __builtin_amdgcn_s_barrier()`` + - Synchronizes waves within a threadgroup (workgroup barrier). + * - ``void __builtin_amdgcn_s_ttracedata(int data)`` + - Sends ``data`` as user data to the thread trace stream. + * - ``void __builtin_amdgcn_s_sleep(int duration)`` + - Causes the wave to sleep for approximately ``duration`` cycles (up to + ~8000 clocks), or until an external event wakes the wave. + * - ``void __builtin_amdgcn_s_incperflevel(int level)`` + - Increments the performance counter specified in ``level`` by 1. + * - ``void __builtin_amdgcn_s_decperflevel(int level)`` + - Decrements the performance counter specified in ``level`` by 1. + * - ``void __builtin_amdgcn_s_setprio(short prio)`` + - Sets the wave scheduling priority to ``prio``. + * - ``void __builtin_amdgcn_s_dcache_inv()`` + - Invalidates the scalar data L0 cache. + * - ``void __builtin_amdgcn_buffer_wbinvl1()`` + - Writes back and invalidates the shader L1 cache. + * - ``unsigned int __builtin_amdgcn_groupstaticsize()`` + - Returns the size of static LDS allocation in the current workgroup. + * - ``unsigned int __builtin_amdgcn_wavefrontsize()`` + - Returns the wavefront size (32 or 64). + * - ``void __builtin_amdgcn_wave_barrier()`` + - Acts as a scheduling barrier within a wave, preventing the compiler from + moving memory operations across this point. ---------------- jayfoad wrote:
Seems odd to call these last three "instruction" builtins since they don't correspond to any single instruction in the ISA. https://github.com/llvm/llvm-project/pull/181193 _______________________________________________ cfe-commits mailing list [email protected] https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
