junrushao created an issue (apache/tvm-ffi#459)
## Context
TVM-FFI currently provides two mechanisms for attaching behavior to types:
| Mechanism | Storage | Lookup | Inheritance | Designed For |
| --------- | ------- | ------ | ----------- | ------------ |
| **Member methods** | Row-major (per-type dict) | Name-based hash | Yes (base -> derived) | Per-type methods (`sum`, `append`, ...) |
| **TypeAttrColumn** | Column-major (dense array indexed by `type_index`) | O(1) pointer offset | No (per-type opt-in) | Builtin dunder dispatch (`__any_hash__`, `__ffi_repr__`, ...) |
TypeAttrColumn is currently a **closed set** — only a handful of builtin
attributes are registered as dense-array columns. The implicit rule is that type
attrs should be "rare and restrained by tvm-ffi package."
This document proposes making type attrs an **open set** backed by hash maps,
while preserving dense-array **fast paths** for the few builtins that genuinely
need
them.
## Motivation: IR Trait Dispatch
Consider a compiler IR built on tvm-ffi. Every IR node can have **traits** —
static properties that optimization passes query:
- `SideEffectFree`: Can this node be eliminated if its result is unused?
- `Commutative`: Can operands be reordered?
- `HasCanonicalizer`: Does this node type provide a canonicalization rule?
- `Fusible`: Can this node be fused into a neighboring kernel?
These traits are queried across all node types during compilation:
```cpp
void DCE(const Object* node) {
if (HasTrait(node, "side_effect_free")) {
// safe to remove if unused
}
}
```
Under the current closed-set policy, an IR framework cannot define new type
attrs — it must fall back to member method lookup or reinvent its own dispatch
table.
## Problem with Dense Arrays at Scale
The current TypeAttrColumn uses a dense array indexed by `type_index`. This
works well for a small fixed set of builtins where nearly every type *could*
have the attribute. But it does **not** scale to an open set:
- With 500 registered types and 100 user-defined trait columns, each dense
array allocates 500 entries (8 bytes each) = **400 KB of mostly-null
pointers**.
- As the type registry grows (downstream packages registering types), every
existing column must be resized.
- The sparsity is wasteful: if `Commutative` only applies to 10 out of 500 op
types, 98% of the column is null.
Dense arrays are the right choice for a small, fixed set of hot-path builtins.
They are the wrong choice for an open, extensible set of user-defined traits.
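As a sanity check, the arithmetic behind these figures, taking the counts quoted above as given:

```python
# Back-of-the-envelope check of the dense-array sparsity numbers above.
num_types = 500      # registered types
num_columns = 100    # user-defined trait columns
bytes_per_slot = 8   # one pointer per (type, column) slot

dense_bytes = num_types * num_columns * bytes_per_slot
print(dense_bytes)   # 400000 bytes, i.e. ~400 KB of mostly-null pointers

# If `Commutative` applies to only 10 of the 500 types:
null_slots = num_types - 10
print(100 * null_slots // num_types)  # 98 -> 98% of the column is null
```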
## Proposed Design: Two-Tier Type Attr Dispatch
### Tier 1: General type attrs via `unordered_map`
The default storage for type attrs is a hash map keyed by `(attr_name,
type_index)`. Any framework can register arbitrary type attrs without
pre-declaring columns:
```cpp
// IR framework registers traits (no EnsureTypeAttrColumn needed)
TypeAttrSet(AddOp::type_index, "ir_trait.commutative", true);
TypeAttrSet(AddOp::type_index, "ir_trait.side_effect_free", true);
// Query: hash map lookup (~20-50 cycles)
AnyView val = TypeAttrGet("ir_trait.commutative", node->type_index());
```
**Performance**: A hash map lookup costs roughly 20-50 cycles. For IR trait queries that run millions of times during compilation, this adds on the order of 10-50 ms per million queries at typical clock rates, negligible compared to the IR transformations themselves. This is fast enough for the vast majority of cross-type dispatch use cases.
**Memory**: Only entries that are actually set are stored. No pre-allocation,
no sparsity waste.
**Open set**: Any downstream framework can define new attrs with any string
key. No allowlist, no pre-registration.
### Tier 2: Specialized fast paths for hot builtins
For the few builtins that sit on genuinely critical paths — where every
object must be dispatched and the overhead compounds — the system provides
specialized fast paths:
| Builtin | Why it's hot | Specialization |
| ------- | ------------ | -------------- |
| `__any_hash__` | Called on every key in every Map operation | Dense-array column (existing) |
| `__structural_equal__` | Called on every node pair in IR comparison | Dense-array column (existing) |
| `__ffi_repr__` | Called once per object in repr | General tier is fine |
| `__ffi_shallow_copy__` | Called once per object in copy | General tier is fine |
The criterion for tier-2 specialization is empirical: **does hash-map overhead
show up in profiles?** Most dunder methods (repr, copy, serialization) are
called infrequently enough that the general tier suffices. Only a handful
(hash, equality) need the dense-array fast path.
This is analogous to how CPython works: most dunder methods go through
`tp_dict` (hash map), but the most critical ones (`tp_hash`, `tp_richcompare`,
`tp_iter`) have dedicated function pointer slots in the type struct.
### Implementation Sketch
```cpp
class TypeAttrRegistry {
public:
// Tier 1: general hash-map storage
void Set(int32_t type_index, const std::string& attr_name, Any value) {
general_[attr_name][type_index] = std::move(value);
}
AnyView Get(const std::string& attr_name, int32_t type_index) const {
auto col_it = general_.find(attr_name);
if (col_it == general_.end()) return AnyView();
auto val_it = col_it->second.find(type_index);
if (val_it == col_it->second.end()) return AnyView();
return AnyView(val_it->second);
}
// Tier 2: dense-array fast path (opt-in per attr)
// Existing TypeAttrColumn mechanism, unchanged.
// Used only for __any_hash__, __structural_equal__, etc.
private:
// attr_name -> (type_index -> value)
std::unordered_map<std::string, std::unordered_map<int32_t, Any>> general_;
};
```
For tier-2, the existing `TVMFFITypeAttrColumn` dense-array mechanism is
retained as-is. The two tiers coexist:
- Hot-path builtins are registered in *both* the dense column (for fast C++
dispatch) and the general map (for uniform Python-side access via
`getattr`).
- User-defined attrs go only in the general map.
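To make the coexistence concrete, here is a minimal, runnable Python model of the two tiers. The names (`TypeAttrRegistry`, `set_attr`, `get_attr`, `declare_dense`) and the example values are illustrative stand-ins, not the actual tvm-ffi API:

```python
class TypeAttrRegistry:
    """Toy model of the proposed two-tier type-attr storage."""

    def __init__(self):
        # Tier 1: attr_name -> {type_index -> value}
        self._general = {}
        # Tier 2: attr_name -> dense list indexed by type_index
        self._dense = {}

    def declare_dense(self, attr_name):
        """Opt a hot builtin into the dense fast path."""
        self._dense[attr_name] = []

    def set_attr(self, type_index, attr_name, value):
        # Every attr lands in the general map (uniform access).
        self._general.setdefault(attr_name, {})[type_index] = value
        # Declared hot builtins are mirrored into their dense column.
        if attr_name in self._dense:
            col = self._dense[attr_name]
            if type_index >= len(col):
                col.extend([None] * (type_index + 1 - len(col)))
            col[type_index] = value

    def get_attr(self, attr_name, type_index):
        col = self._dense.get(attr_name)
        if col is not None:  # tier-2 fast path: direct index
            return col[type_index] if type_index < len(col) else None
        return self._general.get(attr_name, {}).get(type_index)


reg = TypeAttrRegistry()
reg.declare_dense("__any_hash__")
reg.set_attr(7, "__any_hash__", "hash_fn")     # tier-2 builtin
reg.set_attr(7, "ir_trait.commutative", True)  # tier-1 user attr
print(reg.get_attr("__any_hash__", 7))           # -> hash_fn
print(reg.get_attr("ir_trait.commutative", 7))   # -> True
print(reg.get_attr("ir_trait.commutative", 8))   # -> None
```

The routing rule mirrors the bullets above: writes to a declared hot builtin land in both stores, reads on such builtins take the dense column, and user-defined attrs never touch it.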
### Optimization: Per-Column Hash Maps
If even the two-level hash lookup (`attr_name` then `type_index`) is too
slow for certain user-defined attrs, a framework can cache the inner map:
```cpp
// Cache a pointer to the inner (type_index -> value) map once at startup.
// Note: this assumes a GetColumn accessor that exposes the inner map;
// the sketch above would need to add it alongside Get, which returns a
// single value rather than a column.
static const auto* commutative_map =
    registry.GetColumn("ir_trait.commutative");
// Fast query: a single hash probe keyed by the integer type_index
bool is_commutative = commutative_map->count(node->type_index()) > 0;
```
This gives ~20-cycle lookup (single `unordered_map` probe on a small integer
key) without requiring a dense array allocation for every registered type.
## What This Achieves
### For tvm-ffi core (Tianqi's concerns)
- **Hot-path builtins stay fast**: `__any_hash__` and `__structural_equal__`
keep their dense-array O(1) lookup. Zero regression.
- **No sparsity waste**: User-defined attrs use hash maps, not dense arrays.
- **No ABI bloat**: The general tier is a pure C++ implementation detail.
Dense-array columns remain the ABI for builtins.
- **Clear boundary**: Tier-2 specialization is driven by profiling data, not
an arbitrary allowlist.
### For downstream frameworks (Junru's concerns)
- **Open set**: Any framework can define new type attrs with any string key.
No gatekeeping.
- **Uniform API**: `TypeAttrGet(name, type_index)` works for both builtins and
user-defined attrs. Python-side `getattr` works for everything.
- **No confusion**: Users don't need to know about the two tiers. They define
attrs, and the system routes them to the right storage.
- **IR traits work**: `SideEffectFree`, `Commutative`, etc. can be defined as
type attrs and queried efficiently across all node types.
## Python API
```python
@ffi.py_class
class MyIRNode(ffi.Object):
# Trait: registered as general type attr
ir_trait = SideEffectFree
# Repr: registered as type attr (tier-1 general map is fine for repr)
def __ffi_repr__(self, fn_print):
return f"MyIRNode(...)"
# Normal method: registered as member method (row store)
def simplify(self):
...
```
The `@ffi.py_class` decorator routes:
- `__ffi_*__` methods -> type attrs (general tier; tier-2 for known hot
builtins)
- Class-level trait assignments -> type attrs (general tier)
- Everything else -> member methods
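A hedged sketch of these routing rules (none of the names below come from the real `ffi.py_class`; `HOT_BUILTINS` and `classify_members` are assumptions for illustration):

```python
# Hot builtins that would additionally get a dense tier-2 column.
HOT_BUILTINS = {"__any_hash__", "__structural_equal__"}

def classify_members(cls):
    """Sort class members into the three routing buckets described above."""
    routes = {"dense_and_general": [], "general": [], "member": []}
    for name, value in vars(cls).items():
        if name.startswith("__ffi_") or name in HOT_BUILTINS:
            tier = "dense_and_general" if name in HOT_BUILTINS else "general"
            routes[tier].append(name)  # dunder -> type attr
        elif not callable(value) and not name.startswith("__"):
            routes["general"].append(name)  # class-level trait assignment
        elif callable(value):
            routes["member"].append(name)  # normal method (row store)
    return routes

class MyIRNode:
    ir_trait = "SideEffectFree"  # placeholder trait value
    def __ffi_repr__(self):
        return "MyIRNode(...)"
    def simplify(self):
        ...

print(classify_members(MyIRNode))
```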
## Comparison with CPython
CPython uses the same two-tier approach:
| Aspect | CPython | Proposed TVM-FFI |
| ------ | ------- | ---------------- |
| General methods | `tp_dict` (hash map) | General type attr map |
| Hot slots | `tp_hash`, `tp_richcompare`, `tp_iter`, ... | Dense-array columns for `__any_hash__`, ... |
| Criterion | Fixed set of ~40 type slots | Empirically profiled hot builtins |
| Open/closed | Open (any method in `tp_dict`) | Open (any attr in general map) |
## Summary
The current TypeAttrColumn system conflates two concerns:
1. **Open extensibility**: Downstream frameworks need to define new type attrs.
2. **Hot-path performance**: A few builtins need sub-10-cycle dispatch.
Dense arrays solve (2) but block (1). Hash maps solve (1) but are too slow for
(2). The solution is both:
- **General tier** (`unordered_map`): Open set, no sparsity, ~20-50 cycle
lookup. Good enough for IR traits, repr, copy, serialization, and the vast
majority of cross-type dispatch.
- **Specialized tier** (dense array): Closed set of profiled hot builtins.
~4 cycle lookup. Reserved for `__any_hash__`, `__structural_equal__`, and
any future builtin that shows up in profiles.
This gives Tianqi's performance guarantee where it matters, and Junru's open
extensibility everywhere else.