Since Relay is a graph-level IR, its ops carry only input and output types, not compute and schedule definitions, so latency measurement has to happen at the TIR level. If you want to profile the latency of each op, you could turn off op fusion.
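
For reference, disabling a pass is normally done through the `PassContext`. A minimal sketch might look like the following (here `mod` and `params` are assumed to come from a Relay frontend importer), although as the next paragraph explains, this alone is not enough:

```
import tvm
from tvm import relay

# Assumption: `mod` and `params` were produced by a Relay frontend importer.
# Disabling FuseOps by itself will fail during lowering, as explained below.
with tvm.transform.PassContext(opt_level=3, disabled_pass=["FuseOps"]):
    lib = relay.build(mod, target="llvm", params=params)
```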

However, simply turning off fusion will result in errors, because TVM requires 
every op to be in a primitive function during lowering. The right way to turn 
off fusion is to write a simple Relay pass that puts every single op into its own 
function. For example:

```
%1 = nn.conv2d(...)
%2 = nn.bias_add(%1, ...)
%3 = nn.relu(%2)
```

becomes

```
%1 = fn(..., Primitive=1) {
  nn.conv2d(...)
}
%2 = %1(...)
%3 = fn(..., Primitive=1) {
  nn.bias_add(...)
}
%4 = %3(%2, ...)
%5 = fn(..., Primitive=1) {
  nn.relu(...)
}
%6 = %5(...)
```

Then each function will contain a single op.
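
As a starting point, here is a minimal sketch of such a pass written with `ExprMutator` (the class name `WrapOpsInPrimitiveFunc` and the `wrap_each_op` helper are my own; details may need adjusting for your TVM version):

```
import tvm
from tvm import relay
from tvm.relay.expr_functor import ExprMutator


class WrapOpsInPrimitiveFunc(ExprMutator):
    """Wrap every operator call in its own function marked Primitive=1."""

    def visit_call(self, call):
        # Rewrite the arguments first so nested calls are also wrapped.
        new_args = [self.visit(arg) for arg in call.args]
        if not isinstance(call.op, tvm.ir.Op):
            # Leave calls to non-operators (e.g. local functions) untouched.
            return relay.Call(call.op, new_args, call.attrs, call.type_args)
        # Build a single-op function: one fresh parameter per argument.
        params = [relay.var("p%d" % i) for i in range(len(new_args))]
        body = relay.Call(call.op, params, call.attrs, call.type_args)
        func = relay.Function(params, body)
        func = func.with_attr("Primitive", tvm.tir.IntImm("int32", 1))
        # Call the wrapper with the original (rewritten) arguments.
        return relay.Call(func, new_args)


def wrap_each_op(mod):
    """Apply the mutator to main and re-run type inference."""
    mod["main"] = WrapOpsInPrimitiveFunc().visit(mod["main"])
    return relay.transform.InferType()(mod)
```

You would then call something like `mod = wrap_each_op(mod)` before `relay.build`; as far as I know, FuseOps leaves functions already marked primitive alone, so each op stays in its own function during lowering.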

On the other hand, I personally don't recommend this profiling approach, 
because in the normal compilation flow op fusion always happens. If you would 
like to know whether offloading some ops to your device could improve the 
end-to-end performance, you should compare the latency of the fused function 
against the latency of offloading that same function to your device to reach a 
fair conclusion.
