I have been trying to study how TVM performs layout transformations at runtime
(e.g., NHWC16c -> NHWC4c). Where in the source code is the required copy or
movement of the data tensor handled?
And where is the same handled for the weights tensor?
Is it in the `CopyDataFromTo` function of
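For concreteness, here is a minimal sketch of the kind of runtime transform I mean (the shape and layouts are made up for illustration); I want to know which code path performs the actual data movement when the compiled `layout_transform` op runs:

```
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Made-up shape/layouts: transform a single tensor from NCHW to NCHW16c.
x = relay.var("x", shape=(1, 64, 56, 56), dtype="float32")
y = relay.layout_transform(x, "NCHW", "NCHW16c")
mod = tvm.IRModule.from_expr(relay.Function([x], y))

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm")

dev = tvm.cpu()
m = graph_executor.GraphModule(lib["default"](dev))
m.set_input("x", np.random.rand(1, 64, 56, 56).astype("float32"))
m.run()
print(m.get_output(0).shape)  # (1, 4, 56, 56, 16)
```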
Are the times reported by `tvm.runtime.profiler_vm.VirtualMachineProfiler`'s
`profile()` function the actual times observed when running inference on the
target device?
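This is the call pattern I am asking about (a sketch; the tiny module below stands in for my actual model):

```
import numpy as np
import tvm
from tvm import relay
from tvm.runtime import profiler_vm

# Tiny stand-in module; in my case it is ResNet-50.
x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

exe = relay.vm.compile(mod, target="llvm")
dev = tvm.cpu()
vm = profiler_vm.VirtualMachineProfiler(exe, dev)
inp = tvm.nd.array(np.random.rand(1, 3, 224, 224).astype("float32"), dev)
report = vm.profile(inp, func_name="main")
print(report)  # per-op Name / Duration (us) / Percent table
```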
---
Hi @tkonolige,

Sorry for the delayed response.

I changed the target to "llvm -mcpu=cascadelake" to match the target device and
re-did the tuning. I now get a much better inference time of < 100 ms from both
benchmark and the VirtualMachineProfiler, but a 4x discrepancy still remains
between the outputs of the two profilers.
[2] Without graph tuning
(a) profiler_vm
```
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Name  Duration (us)  Percent
```
I have run both profilers multiple times. The vm_profiler's inference times are
consistently 270-272 ms, while the debug_executor's are within 800 ms to 1.2 s.
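For reference, the comparison is essentially the following (a sketch: ResNet-50 from `relay.testing` stands in for my model, and the exact reporting calls may differ slightly across TVM versions):

```
import numpy as np
import tvm
from tvm import relay
from tvm.relay import testing
from tvm.contrib.debugger import debug_executor
from tvm.runtime import profiler_vm

target = "llvm -mcpu=cascadelake"
dev = tvm.cpu()
mod, params = testing.resnet.get_workload(num_layers=50, batch_size=1)
inp = np.random.rand(1, 3, 224, 224).astype("float32")

# (1) profiler_vm
exe = relay.vm.compile(mod, target=target, params=params)
vm = profiler_vm.VirtualMachineProfiler(exe, dev)
print(vm.profile(tvm.nd.array(inp, dev), func_name="main"))

# (2) debug_executor on the same module
lib = relay.build(mod, target=target, params=params)
dbg = debug_executor.create(lib.get_graph_json(), lib.get_lib(), dev)
dbg.set_input("data", inp, **lib.get_params())
dbg.run()  # prints/dumps the per-operator timing table
```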
Here is the whole code just in case:
```
import numpy as np
import pytest
from io import StringIO
import csv
import os
import json
```
---
@tkonolige Thank you for responding.

I just want to find out how much time is spent on data layout transformations
while running inference on ResNet-50. profiler_vm seems to report a much lower
inference cost (1) than debug_executor (2). Does this not contradict your
statement that profiler_vm reports the actual times observed on the target
device?