This is an automated email from the ASF dual-hosted git repository.
ruihangl pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm.git
The following commit(s) were added to refs/heads/main by this push:
new 1d02ac3f2c [Docs] Add Relax VM architecture documentation (#19389)
1d02ac3f2c is described below
commit 1d02ac3f2c305f9d5b83fb7094492cdd65dad8cc
Author: Shushi Hong <[email protected]>
AuthorDate: Sun Apr 12 11:46:25 2026 -0400
[Docs] Add Relax VM architecture documentation (#19389)
- Add a dedicated architecture document (`docs/arch/relax_vm.rst`) for the Relax Virtual
  Machine, covering the compilation pipeline (Relax IR → VMCodeGen → VMExecutable), the
  4-opcode instruction set and encoding format, the register-based execution model (VMFrame,
  RunLoop, VMClosure dispatch), built-in operations, serialization, and the Python-level
  interface (direct invocation, stateful API, profiling, instrumentation).
- Refactor the inline VM section in `docs/arch/index.rst` into a brief summary with a
  cross-reference to the new page, and add `relax_vm` to the toctree.
---
docs/arch/index.rst | 29 ++--
docs/arch/relax_vm.rst | 441 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 453 insertions(+), 17 deletions(-)
diff --git a/docs/arch/index.rst b/docs/arch/index.rst
index 17b26a5ddc..9479d22948 100644
--- a/docs/arch/index.rst
+++ b/docs/arch/index.rst
@@ -243,23 +243,18 @@ Relax Virtual Machine
 
 Relax defines *what* to compute — it is a graph-level IR that describes the operators and dataflow
 of a model. The Relax Virtual Machine (VM) handles *how* to run it — it is the runtime component
-that executes the compiled result. During compilation, ``tvm.compile()`` invokes ``VMCodeGen`` to
-translate Relax functions into a compact bytecode representation. The resulting ``VMExecutable``
-bundles the bytecode together with a constant pool and per-function metadata, and can be serialized
-to disk for deployment.
-
-The VM uses a register-based interpreter with an intentionally minimal instruction set — only four
-opcodes: ``Call``, ``Ret``, ``Goto``, and ``If``. The VM itself performs no mathematical computation;
-it only orchestrates control flow (function calls, conditional branches, loops). The actual
-compute-intensive work — matrix multiplications, convolutions, and other operators — is carried out
-by TIR functions that have been compiled down to native GPU/CPU kernels, or by external libraries
-such as cuBLAS and cuDNN. The VM dispatches to them through the PackedFunc mechanism. Internally the
-VM recognizes three function kinds: *PackedFunc* for external C/C++ functions, *VMFunc* for
-bytecode-interpreted Relax functions, and *VMTIRFunc* for compiled TIR kernels.
-
-On the Python side, users interact with the VM through ``relax.VirtualMachine(executable, device)``,
-which provides both a direct invocation interface and a stateful set-input / invoke / get-output
-interface suitable for RPC-based remote execution.
+that executes the compiled result. The VM uses a register-based interpreter with only four opcodes
+(``Call``, ``Ret``, ``Goto``, ``If``) and performs no mathematical computation itself — it
+orchestrates control flow while dispatching actual work to compiled TIR kernels or external
+libraries.
+
+See :ref:`relax-vm-arch` for the full architecture documentation, including the compilation
+pipeline, instruction set details, execution model, and Python interface.
+
+.. toctree::
+   :maxdepth: 1
+
+   relax_vm
 
 Disco: Distributed Runtime
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/docs/arch/relax_vm.rst b/docs/arch/relax_vm.rst
new file mode 100644
index 0000000000..dddee57b36
--- /dev/null
+++ b/docs/arch/relax_vm.rst
@@ -0,0 +1,441 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+.. _relax-vm-arch:
+
+Relax Virtual Machine
+=====================
+
+Relax defines *what* to compute — it is a graph-level IR that describes the operators and dataflow
+of a model. The Relax Virtual Machine (VM) handles *how* to run it — it is the runtime component
+that executes the compiled result. This document explains the VM architecture in detail, covering
+the compilation pipeline from Relax IR to bytecode, the instruction set and executable format, the
+runtime execution model, and the Python-level user interface.
+
+Overview
+--------
+
+The end-to-end flow from model to execution is:
+
+1. **Relax IR** — a high-level computational graph (``relax.Function`` inside an ``IRModule``).
+2. **Compilation** — ``tvm.compile()`` applies the Relax transformation pipeline, then invokes
+   ``VMCodeGen`` to translate each Relax function into bytecode instructions.
+3. **Linking** — TIR functions are compiled to native kernels (via LLVM, CUDA, etc.); the bytecode,
+   constant pool, and compiled kernels are packaged together into a ``VMExecutable``.
+4. **Execution** — at runtime, a ``VirtualMachine`` loads the executable, initializes devices and
+   memory allocators, and runs the bytecode.
+
+.. code-block:: text
+
+    IRModule (Relax + TIR)
+         │
+         ▼  relax_pipeline (FuseOps, LegalizeOps, ...)
+    IRModule (optimized)
+         │
+         ▼  VMCodeGen
+    ExecBuilder (bytecode)  +  IRModule (TIR only)
+         │                              │
+         │                              ▼  tirx.build()
+         │               runtime.Module (native kernels)
+         │                              │
+         ▼  VMLink                      ▼
+    VMExecutable ◄────────────── linked together
+         │
+         ▼  VirtualMachine(exec, device)
+    Runtime execution
+
+
+Compilation: From Relax IR to Bytecode
+--------------------------------------
+
+Build entry point
+~~~~~~~~~~~~~~~~~
+
+The main entry point is ``tvm.compile()`` (which delegates to ``relax.build()`` in
+``python/tvm/relax/vm_build.py``):
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relax
+    from tvm.script import relax as R
+
+    @tvm.script.ir_module
+    class MyModule:
+        @R.function
+        def main(x: R.Tensor((3, 4), "float32")):
+            return R.add(x, x)
+
+    target = tvm.target.Target("llvm")
+    ex = tvm.compile(MyModule, target)
+
+Internally, ``relax.build()`` performs these steps:
+
+1. Apply the **Relax pipeline** (``relax.get_pipeline("default")``), which includes operator
+   legalization, fusion, buffer planning, and other graph-level passes.
+2. Create an ``ExecBuilder`` and run **VMCodeGen** (``src/relax/backend/vm/codegen_vm.cc``),
+   which walks each ``relax.Function`` and emits bytecode instructions. The Relax functions are
+   removed from the IRModule; only TIR functions remain.
+3. Compile the remaining TIR functions to native code via ``tirx.build()``.
+4. **Link** the bytecode executable with the compiled native module using ``VMLink``, producing
+   a ``VMExecutable``.
+
+Two execution modes are supported:
+
+- ``exec_mode="bytecode"`` (default): Relax functions are interpreted by the VM's bytecode
+  dispatch loop.
+- ``exec_mode="compiled"``: Relax functions are compiled into TIR functions (``VMTIRCodeGen``)
+  that directly manipulate the register file, bypassing the interpreter loop. This avoids
+  dispatch overhead but produces more code.
+
+Bytecode generation
+~~~~~~~~~~~~~~~~~~~
+
+The ``CodeGenVM`` class (``src/relax/backend/vm/codegen_vm.cc``), which implements the
+``VMCodeGen`` step, is an ``ExprFunctor`` that visits each Relax expression and emits
+instructions through the ``ExecBuilder``:
+
+- Each ``relax.Var`` is mapped to a register.
+- Function parameters occupy registers 0 through N-1.
+- Each binding in a ``SeqExpr`` generates one or more instructions; the result is stored in a
+  new register.
+- Function calls (``R.call_tir``, ``R.call_packed``, operator calls) become ``Call`` instructions.
+- Conditional expressions (``relax.If``, written as Python ``if`` in TVMScript) become an ``If``
+  instruction that jumps to the false branch when the condition is zero, plus a ``Goto`` at the
+  end of the true branch that skips over the false branch.
+- The function body ends with a ``Ret`` instruction.
+
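+The register rules above can be illustrated with a toy lowering pass. The sketch below is plain
+Python and is not TVM's ``CodeGenVM``: bindings are hypothetical ``(var, callee, arg_vars)``
+tuples, and instructions are Python tuples rather than encoded words. It places parameters in
+registers 0 through N-1, gives every binding a fresh register, and ends the body with ``Ret``;
+the final register count plays the role of the function's register file size.
+
+.. code-block:: python
+
+    from dataclasses import dataclass, field
+
+
+    @dataclass
+    class ToyCodeGen:
+        """Hypothetical mini code generator following the register rules above."""
+
+        registers: dict = field(default_factory=dict)  # variable name -> register index
+        instrs: list = field(default_factory=list)     # emitted instruction tuples
+        num_regs: int = 0                              # becomes the register file size
+
+        def new_register(self):
+            reg = self.num_regs
+            self.num_regs += 1
+            return reg
+
+        def lower(self, params, bindings, result):
+            # Parameters occupy registers 0 .. N-1, in declaration order.
+            for p in params:
+                self.registers[p] = self.new_register()
+            # Each binding becomes one Call whose result lands in a fresh register.
+            for var, callee, args in bindings:
+                dst = self.new_register()
+                self.instrs.append(("Call", dst, callee, [self.registers[a] for a in args]))
+                self.registers[var] = dst
+            # The body ends by returning the result variable's register.
+            self.instrs.append(("Ret", self.registers[result]))
+            return self.instrs
+
+
+    cg = ToyCodeGen()
+    prog = cg.lower(["x", "y"], [("lv0", "add", ["x", "y"]), ("lv1", "relu", ["lv0"])], "lv1")
+    print(prog)         # [('Call', 2, 'add', [0, 1]), ('Call', 3, 'relu', [2]), ('Ret', 3)]
+    print(cg.num_regs)  # 4
+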
+
+Instruction Set
+---------------
+
+The VM uses a **register-based** architecture with an intentionally minimal instruction set.
+There are only four opcodes:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 15 30 55
+
+   * - Opcode
+     - Fields
+     - Semantics
+   * - ``Call``
+     - ``dst``, ``func_idx``, ``num_args``, ``args[]``
+     - Call function ``func_idx`` with the given arguments; store the result in register ``dst``.
+   * - ``Ret``
+     - ``result``
+     - Return the value in register ``result`` to the caller.
+   * - ``Goto``
+     - ``pc_offset``
+     - Jump forward or backward by ``pc_offset`` instructions.
+   * - ``If``
+     - ``cond``, ``false_offset``
+     - If register ``cond`` is nonzero, fall through (pc++); otherwise jump by ``false_offset``.
+
+The VM itself performs **no mathematical computation**. All actual work — matrix multiplications,
+convolutions, elementwise operations — is carried out by compiled TIR kernels or external
+libraries (cuBLAS, cuDNN, etc.), dispatched through ``Call`` instructions.
+
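+The relative-offset semantics of ``Goto`` and ``If`` can be checked with a control-flow-only
+simulator. This is an illustrative plain-Python sketch, not the VM's ``RunLoop``: ``Call`` is
+reduced to recording a name, the condition sits in a named toy register, and the program uses one
+plausible if/else layout (``If``, true branch, ``Goto`` over the false branch, false branch),
+assumed here for illustration.
+
+.. code-block:: python
+
+    def run_control_flow(program, cond):
+        """Execute toy ("If", reg, false_offset), ("Goto", offset), ("Call", name), ("Ret",)
+        tuples and return the names of the calls that were executed."""
+        regs = {"c": cond}
+        trace = []
+        pc = 0
+        while True:
+            op = program[pc][0]
+            if op == "Call":
+                trace.append(program[pc][1])
+                pc += 1
+            elif op == "If":
+                _, reg, false_offset = program[pc]
+                # Nonzero: fall through to the true branch; zero: jump relative to the If.
+                pc += 1 if regs[reg] != 0 else false_offset
+            elif op == "Goto":
+                pc += program[pc][1]  # relative jump; a negative offset would form a loop
+            elif op == "Ret":
+                return trace
+            else:
+                raise ValueError(f"unknown opcode {op!r}")
+
+
+    program = [
+        ("If", "c", 4),       # pc 0: when c == 0, jump to pc 4 (false branch)
+        ("Call", "true_a"),   # pc 1
+        ("Call", "true_b"),   # pc 2
+        ("Goto", 2),          # pc 3: skip the false branch, land on pc 5
+        ("Call", "false_a"),  # pc 4
+        ("Call", "join"),     # pc 5
+        ("Ret",),             # pc 6
+    ]
+    print(run_control_flow(program, 1))  # ['true_a', 'true_b', 'join']
+    print(run_control_flow(program, 0))  # ['false_a', 'join']
+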
+Instruction encoding
+~~~~~~~~~~~~~~~~~~~~
+
+Each instruction argument (``Instruction::Arg``) is a 64-bit word encoded as:
+
+- **Bits [63:56]** — ``ArgKind`` (8 bits): ``kRegister`` (0), ``kImmediate`` (1), ``kConstIdx`` (2),
+  or ``kFuncIdx`` (3).
+- **Bits [55:0]** — value (56 bits, two's complement; sign-extended to 64 bits when decoded).
+
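+The packing can be reproduced with ordinary integer arithmetic. The sketch below is plain Python
+following the layout just described (an 8-bit kind in the top byte, a 56-bit two's-complement
+value below it); ``encode_arg`` and ``decode_arg`` are hypothetical helper names, not TVM
+functions.
+
+.. code-block:: python
+
+    KIND_BITS = 8
+    VALUE_BITS = 64 - KIND_BITS                 # 56 bits for the value
+    VALUE_MASK = (1 << VALUE_BITS) - 1
+    ARG_KINDS = {0: "kRegister", 1: "kImmediate", 2: "kConstIdx", 3: "kFuncIdx"}
+
+
+    def encode_arg(kind, value):
+        """Pack an ArgKind code and a signed value into one unsigned 64-bit word."""
+        if not -(1 << (VALUE_BITS - 1)) <= value < (1 << (VALUE_BITS - 1)):
+            raise OverflowError("value does not fit in 56 signed bits")
+        # Masking keeps the low 56 bits of the two's-complement representation.
+        return (kind << VALUE_BITS) | (value & VALUE_MASK)
+
+
+    def decode_arg(word):
+        """Unpack a 64-bit word into (kind name, signed value)."""
+        kind = word >> VALUE_BITS
+        value = word & VALUE_MASK
+        if value & (1 << (VALUE_BITS - 1)):     # bit 55 set: negative, so sign-extend
+            value -= 1 << VALUE_BITS
+        return ARG_KINDS[kind], value
+
+
+    print(decode_arg(encode_arg(1, -3)))  # ('kImmediate', -3)
+    print(decode_arg(encode_arg(0, 5)))   # ('kRegister', 5)
+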
+Two special register values exist:
+
+- ``kVoidRegister``: indicates "no destination" (the return value is discarded).
+- ``kVMRegister``: refers to the VM context pointer itself, passed as the first argument to
+  closures.
+
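+The instruction stream is stored as a flat ``vector<ExecWord>`` (``instr_data``) with an offset
+table (``instr_offset``) recording where each instruction starts, so any instruction can be
+located directly even though instructions differ in length.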
+
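+The offset table is what makes random access possible, because a ``Call`` carries a variable
+number of arguments. The sketch below models this in plain Python under simplifying assumptions:
+each instruction is an opcode word followed by its operand words, operands are shown as plain
+integers rather than encoded ``Arg`` words, and the opcode numbering is illustrative rather than
+taken from ``bytecode.h``.
+
+.. code-block:: python
+
+    OPCODES = {"Call": 1, "Ret": 2, "Goto": 3, "If": 4}  # illustrative numbering
+
+
+    def emit(instr_data, instr_offset, opcode, *operands):
+        """Append one variable-length instruction and record where it starts."""
+        instr_offset.append(len(instr_data))
+        instr_data.append(OPCODES[opcode])
+        instr_data.extend(operands)
+
+
+    def fetch(instr_data, instr_offset, i):
+        """Return the words of instruction i without scanning earlier instructions."""
+        start = instr_offset[i]
+        end = instr_offset[i + 1] if i + 1 < len(instr_offset) else len(instr_data)
+        return instr_data[start:end]
+
+
+    instr_data, instr_offset = [], []
+    emit(instr_data, instr_offset, "Call", 2, 0, 2, 0, 1)  # dst=2, func_idx=0, num_args=2, args=[0, 1]
+    emit(instr_data, instr_offset, "Ret", 2)               # return register 2
+    print(instr_offset)                        # [0, 6]
+    print(fetch(instr_data, instr_offset, 1))  # [2, 2]
+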
+
+Executable
+----------
+
+A ``VMExecutable`` (``include/tvm/runtime/vm/executable.h``) bundles everything needed for
+execution:
+
+- **Function table** (``func_table``): a ``vector<VMFuncInfo>`` describing every function. Each
+  entry records the function's kind, name, instruction range (``start_instr`` to ``end_instr``),
+  number of arguments, register file size, and parameter names.
+- **Constant pool** (``constants``): model weights, shape tuples, and other compile-time constants.
+- **Bytecode** (``instr_data`` + ``instr_offset``): the instruction stream.
+- **Imported modules**: the compiled TIR kernels and external libraries.
+
+Function kinds
+~~~~~~~~~~~~~~
+
+The VM recognizes three function kinds (``VMFuncInfo::FuncKind``):
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Kind
+     - Description
+   * - ``kPackedFunc``
+     - An external C/C++ function looked up from imported modules or the global PackedFunc
+       registry. Examples: ``vm.builtin.alloc_shape_heap``, ``vm.builtin.match_shape``.
+   * - ``kVMFunc``
+     - A bytecode-interpreted Relax function. The VM interprets its instructions in ``RunLoop()``.
+   * - ``kVMTIRFunc``
+     - A Relax function compiled to a TIR function (``exec_mode="compiled"``). Found in
+       imports under the name ``__vmtir__<func_name>``. Called directly with register file
+       pointers, bypassing the interpreter loop.
+
+Serialization
+~~~~~~~~~~~~~
+
+The executable supports binary serialization for deployment:
+
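+.. code-block:: python
+
+    import tvm
+    from tvm import relax
+
+    # Save the compiled executable as a shared library
+    ex.export_library("model.so")
+
+    # Load it back; the device must match the compile target (llvm -> CPU here)
+    loaded = tvm.runtime.load_module("model.so")
+    vm = relax.VirtualMachine(loaded, tvm.cpu())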
+
+The binary format starts with a magic number (``0xD225DE2F4214151E``) and a version string
+(currently ``"0.14"``), followed by four sections: globals (the function table), memory scopes,
+constant pool, and bytecode. ``AsText()`` and ``AsPython()`` provide human-readable representations
+for debugging.
+
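+A loader checks this header before reading any section. The sketch below writes and validates
+such a header in plain Python. It assumes the magic is stored as a little-endian ``uint64`` and
+the version as a ``uint64`` byte length followed by the string bytes; that encoding is an
+assumption for illustration, and the authoritative layout is in ``src/runtime/vm/executable.cc``.
+
+.. code-block:: python
+
+    import io
+    import struct
+
+    MAGIC = 0xD225DE2F4214151E
+    VERSION = "0.14"
+
+
+    def write_header(stream):
+        """Write magic and version (assumed layout: <Q magic, <Q length, UTF-8 bytes)."""
+        stream.write(struct.pack("<Q", MAGIC))
+        encoded = VERSION.encode("utf-8")
+        stream.write(struct.pack("<Q", len(encoded)))
+        stream.write(encoded)
+
+
+    def read_header(stream):
+        """Validate the header and return the version string; raise on mismatch."""
+        (magic,) = struct.unpack("<Q", stream.read(8))
+        if magic != MAGIC:
+            raise ValueError(f"not a Relax VM executable (magic {magic:#x})")
+        (length,) = struct.unpack("<Q", stream.read(8))
+        version = stream.read(length).decode("utf-8")
+        if version != VERSION:
+            raise ValueError(f"unsupported executable version {version!r}")
+        return version
+
+
+    buf = io.BytesIO()
+    write_header(buf)
+    buf.seek(0)
+    print(read_header(buf))  # 0.14
+
+Checking the version as well as the magic means that, in this sketch, an executable written by an
+incompatible release is rejected up front instead of having its sections misread.
+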
+
+Runtime Execution
+-----------------
+
+VM initialization
+~~~~~~~~~~~~~~~~~
+
+At runtime, a ``VirtualMachine`` is created and initialized:
+
+.. code-block:: python
+
+    import tvm
+    from tvm.relax import VirtualMachine
+
+    vm = VirtualMachine(exec_module, tvm.cuda())
+
+Under the hood:
+
+1. **LoadExecutable**: the bytecode and metadata are loaded from the ``VMExecutable``.
+2. **Init**: devices and memory allocators are set up. Each device gets an ``Allocator``
+   (either ``NAIVE_ALLOCATOR`` or ``POOLED_ALLOCATOR``, defaulting to pooled). A CPU device
+   is always added for shape computations.
+3. **InitFuncPool**: the function pool is populated — ``kPackedFunc`` entries are resolved from
+   imports or the global registry; ``kVMFunc`` and ``kVMTIRFunc`` entries are wrapped in
+   ``VMClosure`` objects.
+4. **Constant pool**: model constants are loaded and optionally transferred to the target device.
+
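+The resolution rules of step 3 can be summarized with a toy pool builder. This is plain Python
+under stated assumptions: ``imports`` and ``registry`` are ordinary dicts standing in for the
+imported modules and the global PackedFunc registry, ``func_table`` is a list of ``(kind, name)``
+pairs, and ``ToyClosure`` is a hypothetical stand-in for ``VMClosure``. The lookup order (imports
+before the registry) is also an assumption of the sketch.
+
+.. code-block:: python
+
+    from dataclasses import dataclass
+    from typing import Callable
+
+
+    @dataclass
+    class ToyClosure:
+        """Hypothetical stand-in for VMClosure: a name plus an impl taking the VM first."""
+
+        func_name: str
+        impl: Callable
+
+
+    def init_func_pool(func_table, imports, registry, invoke_bytecode):
+        """Build a pool with the same indices as func_table."""
+        pool = []
+        for kind, name in func_table:
+            if kind == "kPackedFunc":
+                func = imports.get(name) or registry.get(name)
+                if func is None:
+                    raise KeyError(f"cannot find packed function {name!r}")
+                pool.append(func)  # stored as a plain function, not a closure
+            elif kind == "kVMTIRFunc":
+                # Compiled Relax function, exported under a prefixed symbol name.
+                pool.append(ToyClosure(name, imports["__vmtir__" + name]))
+            else:  # kVMFunc: run by the bytecode interpreter
+                pool.append(
+                    ToyClosure(name, lambda vm, *args, n=name: invoke_bytecode(vm, n, args))
+                )
+        return pool
+
+
+    pool = init_func_pool(
+        [("kPackedFunc", "vm.builtin.copy"), ("kVMFunc", "main")],
+        imports={},
+        registry={"vm.builtin.copy": lambda x: x},
+        invoke_bytecode=lambda vm, name, args: (name, args),
+    )
+    print(pool[0](7))                # 7
+    print(pool[1].impl(None, 1, 2))  # ('main', (1, 2))
+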
+The bytecode dispatch loop
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a ``kVMFunc`` is invoked, the VM enters ``InvokeBytecode()``:
+
+1. A new ``VMFrame`` is pushed onto the call stack. Each frame contains:
+
+   - A **register file** (``vector<ffi::Any>``) — type-erased slots that can hold tensors,
+     shapes, closures, or any TVM object. The size is determined at compile time
+     (``VMFuncInfo::register_file_size``).
+   - The **return program counter** — where to resume after the function returns.
+   - The **caller's return register** — which register in the parent frame receives the result.
+
+2. Function arguments are written to registers 0..N-1.
+3. The program counter (``pc_``) is set to the function's ``start_instr``.
+4. ``RunLoop()`` executes instructions until a ``Ret`` is encountered:
+
+   - **Call**: resolve arguments (from registers, immediates, constant pool, or function pool),
+     invoke the target function via ``InvokeClosurePacked()``, and store the result in ``dst``.
+   - **Ret**: read the return value from the specified register, write it to the caller's
+     return register, and return from ``RunLoop()`` (the frame is popped by an RAII guard when
+     ``InvokeBytecode()`` exits).
+   - **Goto**: adjust ``pc_`` by the offset.
+   - **If**: check the condition register; if nonzero, fall through; otherwise jump by
+     ``false_offset``.
+
+The dispatch loop is implemented in ``src/runtime/vm/vm.cc`` (``VirtualMachineImpl::RunLoop``).
+
+.. code-block:: text
+
+    Frame Stack              Register File (per frame)
+    ┌─────────────┐          ┌────┬────┬────┬─────┬────┐
+    │   Frame 2   │ ───────► │ R0 │ R1 │ R2 │ ... │ Rn │
+    ├─────────────┤          └────┴────┴────┴─────┴────┘
+    │   Frame 1   │ ───────► [register file]
+    ├─────────────┤
+    │   Frame 0   │ ───────► [register file]
+    └─────────────┘
+
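+The frame discipline can be sketched as a small recursive interpreter in plain Python. It is a
+toy, not ``VirtualMachineImpl``: instructions are tuples, ``Call`` targets are looked up by name
+in dicts rather than by ``func_idx``, only ``Call`` and ``Ret`` are modeled, and the return value
+reaches the caller's ``dst`` through Python's own call stack instead of a stored return register.
+A ``try``/``finally`` block plays the role of the RAII guard that pops the frame.
+
+.. code-block:: python
+
+    class ToyVM:
+        """Hypothetical mini VM illustrating frames, register files, Call, and Ret."""
+
+        def __init__(self, functions, packed):
+            self.functions = functions  # name -> (register_file_size, instruction list)
+            self.packed = packed        # name -> Python callable standing in for a PackedFunc
+            self.frames = []            # call stack; each entry is a register file
+
+        def invoke_bytecode(self, name, args):
+            num_regs, code = self.functions[name]
+            regs = [None] * num_regs     # register file sized at "compile time"
+            regs[: len(args)] = args     # arguments go to registers 0 .. N-1
+            self.frames.append(regs)
+            try:
+                return self.run_loop(code, regs)
+            finally:
+                self.frames.pop()        # like the RAII guard: pop even on error
+
+        def run_loop(self, code, regs):
+            pc = 0
+            while True:
+                instr = code[pc]
+                if instr[0] == "Call":
+                    _, dst, callee, arg_regs = instr
+                    args = [regs[r] for r in arg_regs]
+                    if callee in self.functions:  # another bytecode function: new frame
+                        regs[dst] = self.invoke_bytecode(callee, args)
+                    else:                         # a packed function: call directly
+                        regs[dst] = self.packed[callee](*args)
+                    pc += 1
+                elif instr[0] == "Ret":
+                    return regs[instr[1]]
+                else:
+                    raise ValueError(f"opcode {instr[0]!r} is not modeled here")
+
+
+    vm = ToyVM(
+        functions={
+            "square": (2, [("Call", 1, "mul", [0, 0]), ("Ret", 1)]),
+            "main": (3, [("Call", 1, "square", [0]), ("Call", 2, "add", [1, 0]), ("Ret", 2)]),
+        },
+        packed={"mul": lambda a, b: a * b, "add": lambda a, b: a + b},
+    )
+    print(vm.invoke_bytecode("main", [3]))  # 12, i.e. 3 * 3 + 3
+    print(vm.frames)                         # [] since every frame was popped
+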
+VMClosure and function dispatch
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Functions in the VM are stored in a ``func_pool_`` indexed by function table position.
+``kVMFunc`` and ``kVMTIRFunc`` entries are wrapped as ``VMClosure`` objects, while ``kPackedFunc``
+entries are stored as plain ``ffi::Function`` objects. A ``VMClosure`` stores:
+
+- ``func_name``: the function's string name.
+- ``impl``: an ``ffi::Function`` that takes the VM context pointer as its first argument, followed
+  by the actual parameters.
+
+When the VM encounters a ``Call`` instruction, it looks up the function in ``func_pool_`` by
+index and dispatches via ``InvokeClosurePacked()``. If the target is a ``VMClosure``, the VM
+pointer is prepended to the arguments and ``impl`` is invoked. If it is a plain
+``ffi::Function``, it is called directly.
+
+``VMClosure::BindLastArgs`` enables partial application: it creates a new function with some
+arguments pre-bound at the end of the argument list, which is useful for implementing Relax
+closures whose captured variables are passed as trailing arguments.
+
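+Both dispatch paths and the effect of ``BindLastArgs`` fit in a few lines of plain Python. The
+names ``Closure``, ``invoke_closure_packed``, and ``bind_last_args`` are hypothetical stand-ins for
+``VMClosure``, ``InvokeClosurePacked()``, and ``VMClosure::BindLastArgs``; the real versions work
+on ``ffi::Function`` and ``ffi::Any`` values, and here a string stands in for the VM context
+pointer.
+
+.. code-block:: python
+
+    from dataclasses import dataclass
+    from typing import Callable
+
+
+    @dataclass
+    class Closure:
+        func_name: str
+        impl: Callable  # called as impl(vm, *params)
+
+
+    def invoke_closure_packed(vm, target, args):
+        """Dispatch like a Call instruction: closures get the VM pointer prepended."""
+        if isinstance(target, Closure):
+            return target.impl(vm, *args)
+        return target(*args)  # plain packed function: called directly
+
+
+    def bind_last_args(func, last_args):
+        """Return a function whose trailing arguments are pre-bound to `last_args`."""
+        def bound(*args):
+            return func(*args, *last_args)
+        return bound
+
+
+    # A closure body taking (vm, x, captured_scale); bind scale=10 as the trailing argument.
+    body = Closure("scale", lambda vm, x, scale: x * scale)
+    scaled = Closure(body.func_name, bind_last_args(body.impl, [10]))
+
+    print(invoke_closure_packed("vm-ctx", scaled, [4]))                 # 40
+    print(invoke_closure_packed("vm-ctx", lambda a, b: a - b, [9, 2]))  # 7
+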
+Built-in operations
+~~~~~~~~~~~~~~~~~~~
+
+The VM relies on several built-in PackedFuncs (registered in ``src/runtime/vm/builtin.cc``)
+for runtime support:
+
+- ``vm.builtin.alloc_shape_heap``: allocate workspace for symbolic shape computations.
+- ``vm.builtin.match_shape``: validate tensor shapes against expected patterns at runtime. Each
+  dimension is handled by one of four codes: assert that it equals an immediate
+  (``kAssertEqualToImm``) or a value already stored in the shape heap (``kAssertEqualToLoad``),
+  store it into the shape heap as the value of a symbolic variable (``kStoreToHeap``), or skip it
+  (``kNoOp``).
+- ``vm.builtin.make_shape``: construct shape tuples from immediates or heap-loaded values.
+- ``vm.builtin.match_prim_value``: validate primitive values (e.g., integers) against expected
+  patterns.
+- ``vm.builtin.copy``: copy a value into a register. Used in several codegen scenarios:
+  materializing non-register arguments (immediates, constants) into registers, ensuring each
+  variable binding gets its own register, and merging results from if/else branches.
+
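+The interplay between ``match_shape``, the shape heap, and ``make_shape`` can be sketched in plain
+Python. The per-dimension ``(code, value)`` pairs and the ``("imm" | "load", value)`` pairs for
+``make_shape`` are a simplified, hypothetical argument format, and the numeric code values are
+arbitrary; only the four code names come from the list above.
+
+.. code-block:: python
+
+    # Arbitrary numbering for the sketch; the real enum values are not assumed here.
+    K_NO_OP, K_ASSERT_EQUAL_TO_IMM, K_STORE_TO_HEAP, K_ASSERT_EQUAL_TO_LOAD = range(4)
+
+
+    def match_shape(shape, heap, pattern):
+        """Check `shape` dimension by dimension against (code, value) pairs."""
+        if len(shape) != len(pattern):
+            raise ValueError(f"expected rank {len(pattern)}, got {len(shape)}")
+        for dim, (code, value) in zip(shape, pattern):
+            if code == K_ASSERT_EQUAL_TO_IMM and dim != value:
+                raise ValueError(f"dimension {dim} != {value}")
+            if code == K_STORE_TO_HEAP:
+                heap[value] = dim  # bind a symbolic variable to heap slot `value`
+            if code == K_ASSERT_EQUAL_TO_LOAD and dim != heap[value]:
+                raise ValueError(f"dimension {dim} != heap[{value}] = {heap[value]}")
+            # K_NO_OP: nothing to check for this dimension
+
+
+    def make_shape(heap, pattern):
+        """Build a shape tuple from immediates ("imm", v) or heap loads ("load", slot)."""
+        return tuple(v if kind == "imm" else heap[v] for kind, v in pattern)
+
+
+    heap = [0, 0]  # stands in for a two-slot heap from vm.builtin.alloc_shape_heap
+    # x: (n, 4) binds n to slot 0; y: (n, n) must then agree with the stored n.
+    match_shape((8, 4), heap, [(K_STORE_TO_HEAP, 0), (K_ASSERT_EQUAL_TO_IMM, 4)])
+    match_shape((8, 8), heap, [(K_ASSERT_EQUAL_TO_LOAD, 0), (K_ASSERT_EQUAL_TO_LOAD, 0)])
+    print(make_shape(heap, [("load", 0), ("imm", 2)]))  # (8, 2)
+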
+
+Python Interface
+----------------
+
+Users interact with the VM through ``tvm.relax.VirtualMachine``:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relax
+    import numpy as np
+
+    # Compile (MyModule as defined in the build example above)
+    ex = tvm.compile(MyModule, target="llvm")
+
+    # Create VM
+    vm = relax.VirtualMachine(ex, tvm.cpu())
+
+    # Direct invocation
+    inp = tvm.runtime.tensor(np.random.rand(3, 4).astype("float32"))
+    result = vm["main"](inp)
+
+    # Stateful interface (useful for RPC)
+    vm.set_input("main", inp)
+    vm.invoke_stateful("main")
+    output = vm.get_outputs("main")
+
+Key methods:
+
+- ``vm["func_name"](*args)`` — direct invocation, returns the result.
+- ``vm.set_input()`` / ``vm.invoke_stateful()`` / ``vm.get_outputs()`` — stateful interface
+  that avoids sending output over the wire, useful for RPC-based remote execution.
+- ``vm.save_function(func_name, saved_name, *args)`` — pre-bind arguments for repeated calls,
+  reducing dictionary lookup overhead during benchmarking.
+- ``vm.time_evaluator(func_name, dev)`` — returns a timing function following the same convention
+  as ``tvm.runtime.Module.time_evaluator``.
+- ``vm.profile(func_name, *args)`` — returns a per-operator profiling report (requires
+  ``profile=True`` at VM construction).
+- ``vm.set_instrument(func)`` — register an instrumentation callback that is invoked before/after
+  every ``Call`` instruction. The callback can return ``VMInstrumentReturnKind.SKIP_RUN`` to
+  skip the call.
+
+Profiling and instrumentation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The VM supports two levels of observability:
+
+**Profiling** via ``VirtualMachine(exec, dev, profile=True)``:
+
+.. code-block:: python
+
+    # profile=True enables per-operator profiling; the device must match the compile target
+    vm = relax.VirtualMachine(ex, tvm.cpu(), profile=True)
+    report = vm.profile("main", inp)
+    print(report)
+
+This produces a ``tvm.runtime.profiling.Report`` with a per-operator timing breakdown.
+
+**Instrumentation** via ``set_instrument()``:
+
+.. code-block:: python
+
+    from tvm.runtime.vm import VMInstrumentReturnKind
+
+    def my_instrument(func, func_symbol, before_run, ret_value, *args):
+        if before_run:
+            print(f"About to call: {func_symbol}")
+        # NO_OP lets the call proceed; returning SKIP_RUN before the call would skip it
+        return VMInstrumentReturnKind.NO_OP
+
+    vm.set_instrument(my_instrument)
+    vm["main"](inp)
+
+The instrument function is called before and after every ``Call`` instruction, receiving the
+function object, its symbol name, a flag indicating before/after (``before_run``), the return
+value (meaningful only after the call), and all arguments of the call.
+
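+To see how an instrument's return value steers execution, here is a plain-Python toy of an
+instrumented call site. It is not TVM's implementation: ``ReturnKind`` is a local mirror of
+``VMInstrumentReturnKind`` with assumed values, and the toy assumes that a call skipped with
+``SKIP_RUN`` also gets no after-call notification.
+
+.. code-block:: python
+
+    from enum import IntEnum
+
+
+    class ReturnKind(IntEnum):
+        """Local mirror of VMInstrumentReturnKind's NO_OP / SKIP_RUN (values assumed)."""
+
+        NO_OP = 0
+        SKIP_RUN = 1
+
+
+    def instrumented_call(instrument, func, func_symbol, args):
+        """Toy call site: before-hook, then (unless skipped) the call and an after-hook."""
+        if instrument(func, func_symbol, True, None, *args) == ReturnKind.SKIP_RUN:
+            return None  # assumption: skipped calls produce no value and no after-hook
+        ret = func(*args)
+        instrument(func, func_symbol, False, ret, *args)
+        return ret
+
+
+    log = []
+
+
+    def skip_matmul(func, func_symbol, before_run, ret_value, *args):
+        log.append((func_symbol, "before" if before_run else f"after -> {ret_value}"))
+        if before_run and func_symbol == "matmul":
+            return ReturnKind.SKIP_RUN
+        return ReturnKind.NO_OP
+
+
+    instrumented_call(skip_matmul, lambda a, b: a + b, "add", [1, 2])
+    instrumented_call(skip_matmul, lambda a, b: a @ b, "matmul", [None, None])
+    print(log)  # [('add', 'before'), ('add', 'after -> 3'), ('matmul', 'before')]
+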
+
+Inspecting Bytecode
+-------------------
+
+The executable provides text and Python representations of the compiled bytecode:
+
+.. code-block:: python
+
+    ex = tvm.compile(MyModule, target="llvm")
+    print(ex.as_text())    # Human-readable instruction listing
+    print(ex.as_python())  # Equivalent Python program
+    print(ex.stats())      # Summary statistics
+
+These are invaluable for debugging compilation issues — they show exactly which functions
+are called, in what order, and how registers are used.
+
+
+Source Code Map
+---------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 45 55
+
+   * - Path
+     - Contents
+   * - ``include/tvm/runtime/vm/bytecode.h``
+     - Instruction, Opcode, and Arg definitions
+   * - ``include/tvm/runtime/vm/executable.h``
+     - VMExecutable, VMFuncInfo, serialization
+   * - ``include/tvm/runtime/vm/vm.h``
+     - VirtualMachine base class, VMClosure
+   * - ``src/runtime/vm/vm.cc``
+     - VirtualMachineImpl, RunLoop, InvokeBytecode
+   * - ``src/runtime/vm/executable.cc``
+     - Serialization/deserialization, text output
+   * - ``src/runtime/vm/builtin.cc``
+     - Built-in operations (shape matching, allocation)
+   * - ``src/relax/backend/vm/codegen_vm.cc``
+     - CodeGenVM: Relax IR → bytecode
+   * - ``src/relax/backend/vm/codegen_vm_tir.cc``
+     - VMTIRCodeGen: Relax IR → compiled TIR
+   * - ``python/tvm/runtime/vm.py``
+     - Python VirtualMachine wrapper
+   * - ``python/tvm/relax/vm_build.py``
+     - ``relax.build()`` and VMExecutable Python class