Since Relay is a graph-level IR, its ops carry only input and output types rather than compute and schedule definitions, so latency measurement has to happen at the TIR level. If you want to profile the latency of each op, you could turn off op fusion.
However, simply turning off fusion will result in errors, because TVM requires
every op to be in a primitive function during lowering. The right way to turn
off fusion is to write a simple Relay pass that puts every single op into its
own function. For example:
```
%1 = nn.conv2d(...)
%2 = nn.bias_add(%1, ...)
%3 = nn.relu(%2)
```
becomes
```
%1 = fn(..., Primitive=1) {
  nn.conv2d(...)
}
%2 = %1(...)
%3 = fn(..., Primitive=1) {
  nn.bias_add(...)
}
%4 = %3(%2, ...)
%5 = fn(..., Primitive=1) {
  nn.relu(...)
}
%6 = %5(%4)
```
Then each function will contain a single op.
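Here is a minimal sketch of such a pass written as an `ExprMutator` in Python. The class name `WrapEachOp` and the helper `wrap_ops` are my own, and the sketch assumes the module is type-inferred so that `checked_type` is available on the call arguments:
```python
import tvm
from tvm import relay
from tvm.relay.expr_functor import ExprMutator


class WrapEachOp(ExprMutator):
    """Wrap every operator call in its own function marked Primitive=1,
    so fusion cannot merge ops and lowering still succeeds."""

    def visit_call(self, call):
        # Rewrite the arguments first (post-order traversal).
        new_args = [self.visit(arg) for arg in call.args]
        if not isinstance(call.op, tvm.ir.Op):
            # Calls to Relay functions (e.g. already-wrapped ones) are left alone.
            return relay.Call(self.visit(call.op), new_args, call.attrs)
        # One fresh parameter per argument, typed from the original expression.
        params = [
            relay.var("p%d" % i, type_annotation=arg.checked_type)
            for i, arg in enumerate(call.args)
        ]
        body = relay.Call(call.op, params, call.attrs)
        wrapper = relay.Function(params, body)
        # "Primitive" is the same attribute the fusion pass uses to mark
        # functions that should be lowered as a single unit.
        wrapper = wrapper.with_attr("Primitive", 1)
        return relay.Call(wrapper, new_args)


def wrap_ops(mod):
    """Apply the rewrite to the main function of a module."""
    mod = relay.transform.InferType()(mod)
    mod["main"] = WrapEachOp().visit(mod["main"])
    return relay.transform.InferType()(mod)
```
After this rewrite you can build the module as usual; since every call site now targets a Primitive function, FuseOps should have nothing left to merge.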
That said, I personally don't recommend this profiling approach, because in the
normal compilation flow op fusion will definitely happen. If you would like to
know whether offloading some ops to your device could improve end-to-end
performance, you should compare the latency of a fused function against the
latency of offloading that same function to your device to get a fair
conclusion.
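For example, to get the fused baseline you can build the model normally (with fusion on) and time it with `time_evaluator`. This is a minimal sketch assuming an LLVM CPU target and an input named "data" with shape (1, 3, 224, 224); adapt both to your model and device:
```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Build with the normal flow, i.e. with op fusion enabled.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm")  # "llvm" is a placeholder target

dev = tvm.cpu()
m = graph_executor.GraphModule(lib["default"](dev))
m.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))

# Time the whole fused graph; compare this against the same model with the
# candidate functions offloaded to your device (e.g. via a BYOC target).
timer = m.module.time_evaluator("run", dev, number=10, repeat=3)
print(timer())
```
If you do want a per-function breakdown (e.g. on the unfused module produced by the pass above), the debug executor in `tvm.contrib.debugger` can report per-node time.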