Since Relay is a graph-level IR, its ops carry only input and output types rather than compute and schedule definitions, so latency measurement has to happen at the TIR level. If you want to profile the latency of each op, you could turn off op fusion.
However, simply turning off fusion will result in errors, because TVM requires every op to be in a primitive function during lowering. The right way to turn off fusion is to write a simple Relay pass that puts every single op into its own function (a sketch is included at the end of this post). For example:

```
%1 = nn.conv2d(...)
%2 = nn.bias_add(%1, ...)
%3 = nn.relu(%2)
```

becomes

```
%1 = fn(..., Primitive=1) { nn.conv2d(...) }
%2 = %1(...)
%3 = fn(..., Primitive=1) { nn.bias_add(...) }
%4 = %3(%2, ...)
%5 = fn(..., Primitive=1) { nn.relu(...) }
%6 = %5(...)
```

Then each function will contain a single op. On the other hand, I personally don't recommend this profiling approach, because in the normal compilation flow op fusion would definitely happen. If you would like to know whether offloading some ops to your device could improve end-to-end performance, you should compare the latency of a fused function against the latency of offloading that same function to your device to get a fair conclusion.
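For reference, here is a minimal sketch of such a pass, assuming the Relay Python `ExprMutator` API in a recent TVM release; the class name `WrapOpsInPrimitiveFunctions` and the helper `wrap_all_ops` are just illustrative:

```python
import tvm
from tvm import relay
from tvm.relay.expr_functor import ExprMutator


class WrapOpsInPrimitiveFunctions(ExprMutator):
    """Wrap every operator call in its own function marked with Primitive=1."""

    def visit_call(self, call):
        # Rewrite the arguments first so nested calls get wrapped too.
        new_args = [self.visit(arg) for arg in call.args]
        if not isinstance(call.op, tvm.ir.Op):
            # Calls to functions (rather than ops) are left as they are.
            return relay.Call(call.op, new_args, call.attrs, call.type_args)
        # Build fn(%p0, %p1, ...) { op(%p0, %p1, ...) } with Primitive=1 ...
        params = [relay.var("p{}".format(i)) for i in range(len(new_args))]
        body = relay.Call(call.op, params, call.attrs, call.type_args)
        prim_fn = relay.Function(params, body)
        prim_fn = prim_fn.with_attr("Primitive", tvm.tir.IntImm("int32", 1))
        # ... and call it with the original arguments.
        return relay.Call(prim_fn, new_args)


def wrap_all_ops(mod):
    """Usage sketch: rewrite main and re-run type inference.

    `mod` is assumed to be an existing tvm.IRModule.
    """
    new_main = WrapOpsInPrimitiveFunctions().visit(mod["main"])
    new_mod = tvm.IRModule.from_expr(new_main)
    return relay.transform.InferType()(new_mod)
```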