Hi @tkonolige, sorry for the delayed response. I changed the target to "llvm -mcpu=cascadelake" as suggested and re-did the tuning. I now get a much better inference time of < 100 ms from both benchmark and VirtualMachineProfiler, but a 4x discrepancy still remains between the outputs of the two profilers. The outputs are attached below ([1]). I tried ResNet-18 as well and observe the same discrepancy there.
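To put a number on the discrepancy, here is a quick sanity check in plain Python, using the "Total" and top two "Duration" figures copied from the outputs attached below:

```python
# "Total" rows reported by the two profilers (microseconds),
# copied from the outputs attached below.
vm_total_us = 99_441.43      # VirtualMachineProfiler (profiler_vm)
debug_total_us = 383_036.30  # debug_executor

ratio = debug_total_us / vm_total_us
print(f"debug_executor / profiler_vm = {ratio:.2f}x")

# The gap is concentrated in two kernels: the top two rows of the
# debug_executor table alone account for most of its total.
top_two_us = 139_559.24 + 118_024.98
print(f"top two ops = {top_two_us / debug_total_us:.0%} of debug_executor total")
```

So the ~4x figure comes almost entirely from two conv2d kernels that are cheap under profiler_vm but dominate under debug_executor.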
Without graph tuning, I observe almost no discrepancy. Interestingly, the debug_executor's total inference time worsens when I enable graph tuning, while that of the other two improves. The outputs are attached below ([2]). I haven't yet been able to get hold of another system to install and run these experiments on; I will update this thread as soon as that happens.

---

Outputs:

[1] With Graph Tuning

(a) profiler_vm

```
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Name Duration (us) Percent Count out_layout Device data_layout kernel_layout Hash layout Argument Shapes dst_layout weight_layout src_layout
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8 15,909.52 16.00 6 NCHW16c cpu0 NCHW16c OIHW16i16o efb9044cdd43e0b8 float32[1, 16, 14, 14, 16], float32[16, 16, 3, 3, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5 10,522.82 10.58 4 NCHW32c cpu0 NCHW8c OIHW8i32o 0d551fd3800939e1 float32[1, 16, 28, 28, 8], float32[4, 16, 3, 3, 8, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11 9,095.54 9.15 3 NCHW16c cpu0 NCHW16c OIHW16i16o 68695c5cd347ce57 float32[1, 32, 7, 7, 16], float32[32, 32, 3, 3, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2 8,034.25 8.08 3 NCHW32c cpu0 NCHW64c OIHW64i32o 83e0f5d1673ff2ae float32[1, 1, 56, 56, 64], float32[2, 1, 3, 3, 64, 32], float32[1, 2, 1, 1, 32], float32[1, 2, 56, 56, 32]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_9 6,451.60 6.49 5 NCHW16c cpu0 NCHW16c OIHW16i16o c8d2fb74508242fa float32[1, 64, 14, 14, 16], float32[16, 64, 1, 1, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_2 6,219.45 6.25 5 NCHW16c cpu0 NCHW4c OIHW4i16o 991e77362efe315d float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_1 4,069.38 4.09 3 NCHW64c cpu0 NCHW32c OIHW32i64o b8f45dade76ef8ee float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 28, 28, 64]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6 3,627.03 3.65 3 NCHW32c cpu0 NCHW64c OIHW64i32o 435cfe42fcb8d0b0 float32[1, 8, 28, 28, 64], float32[4, 8, 1, 1, 64, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]
fused_nn_contrib_conv2d_NCHWc_add 3,069.46 3.09 2 NCHW16c cpu0 NCHW32c OIHW32i16o 6fb734c77ed64bde float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu 2,898.89 2.92 1 NCHW16c cpu0 NCHW3c OIHW3i16o 10a40e9231ff15a6 float32[1, 1, 224, 224, 3], float32[4, 1, 7, 7, 3, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 112, 112, 16]
fused_nn_contrib_conv2d_NCHWc_3 2,659.84 2.67 1 NCHW16c cpu0 NCHW16c OIHW16i16o 9c3ea371f8ec4054 float32[1, 64, 14, 14, 16], float32[128, 64, 1, 1, 16, 16], float32[1, 128, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_12 2,592.41 2.61 2 NCHW16c cpu0 NCHW16c OIHW16i16o 1cc8a4dccc794a64 float32[1, 128, 7, 7, 16], float32[32, 128, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_add_3 2,587.97 2.60 2 NCHW16c cpu0 NCHW32c OIHW32i16o 528b9cb523882d7e float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_1 2,568.47 2.58 1 NCHW64c cpu0 NCHW16c OIHW16i64o 9b9c1d5fc56b0353 float32[1, 16, 56, 56, 16], float32[8, 16, 1, 1, 16, 64], float32[1, 8, 28, 28, 64]
fused_nn_contrib_conv2d_NCHWc_2 2,560.30 2.57 1 NCHW16c cpu0 NCHW64c OIHW64i16o 371a9e61ecaeecce float32[1, 8, 28, 28, 64], float32[64, 8, 1, 1, 64, 16], float32[1, 64, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3 2,393.13 2.41 2 NCHW64c cpu0 NCHW16c OIHW16i64o 850ecaa157c95aac float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu 1,519.12 1.53 1 NCHW16c cpu0 NCHW32c OIHW32i16o abe40a1f08b34bad float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc 1,382.10 1.39 1 NCHW16c cpu0 NCHW16c OIHW16i16o 7661eb48c0b8a7e6 float32[1, 4, 56, 56, 16], float32[16, 4, 1, 1, 16, 16], float32[1, 16, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_1 1,319.25 1.33 1 NCHW64c cpu0 NCHW32c OIHW32i64o 88bbb32f8f542f98 float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_2 1,299.49 1.31 1 NCHW16c cpu0 NCHW4c OIHW4i16o c7b912640028a9e2 float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_multiply_add_nn_relu 1,252.95 1.26 1 NCHW16c cpu0 NCHW32c OIHW32i16o 21cb6d538731ba92 float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]
fused_add_nn_relu 823.04 0.83 2 cpu0 e907ce81104cda7a float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_10 759.67 0.76 1 NCHW16c cpu0 NCHW16c OIHW16i16o 8d07031ff51d0737 float32[1, 64, 14, 14, 16], float32[32, 64, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
fused_nn_contrib_dense_pack_add 710.21 0.71 1 cpu0 7641a0cce9852143 float32[1, 2048], float32[40, 2048, 25], float32[1, 1000], float32[1, 1000] NC25n
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7 693.32 0.70 1 NCHW16c cpu0 NCHW64c OIHW64i16o dc31662fedbb8185 float32[1, 8, 28, 28, 64], float32[16, 8, 1, 1, 64, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4 656.53 0.66 1 NCHW128c cpu0 NCHW16c OIHW16i128o 9b01f6479b89fd68 float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 128], float32[1, 1, 1, 1, 128], float32[1, 1, 28, 28, 128]
fused_add_nn_relu_1 631.05 0.63 3 cpu0 0e82013d73aa68c1 float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]
fused_add_nn_relu_2 542.71 0.55 5 cpu0 f12067172f61c850 float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]
fused_nn_max_pool2d_add_nn_relu 364.59 0.37 1 cpu0 6f701a4fa071030f NCHW16c float32[1, 4, 112, 112, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1 330.20 0.33 1 NCHW64c cpu0 NCHW16c OIHW16i64o 0f7bbb0e363c360c float32[1, 4, 56, 56, 16], float32[1, 4, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]
fused_layout_transform_1 188.19 0.19 3 cpu0 b8cbb72b4035894d float32[1, 4, 28, 28, 32], float32[1, 16, 28, 28, 8] NCHW8c NCHW32c
fused_layout_transform_2 172.13 0.17 6 cpu0 f5e631fb93d23d4d float32[1, 16, 14, 14, 16], float32[1, 64, 14, 14, 4] NCHW4c NCHW16c
fused_add_nn_relu_3 106.41 0.11 2 cpu0 5d16c15878cc73d4 float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]
fused_add_layout_transform 96.21 0.10 1 cpu0 69355d3cc810f874 float32[1, 3, 224, 224], float32[3, 1, 1], float32[1, 1, 224, 224, 3] NCHW3c NCHW
fused_nn_global_avg_pool2d 56.33 0.06 1 cpu0 f18307e2786f4cb3 NCHW16c float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16]
fused_layout_transform 52.16 0.05 1 cpu0 2c5d64d5f9faa001 float32[1, 1, 28, 28, 128], float32[1, 16, 28, 28, 8] NCHW8c NCHW128c
fused_layout_transform_3 48.26 0.05 3 cpu0 add43c0d2d8a8a3c float32[1, 32, 7, 7, 16], float32[1, 16, 7, 7, 32] NCHW32c NCHW16c
fused_nn_softmax 9.76 0.01 1 cpu0 ca61e79ea24e53f0 float32[1, 1000], float32[1, 1000]
fused_layout_transform_nn_batch_flatten 1.73 0.00 1 cpu0 2db99463d18696a4 float32[1, 128, 1, 1, 16], float32[1, 2048] NCHW NCHW16c
----------
Sum 98,275.48 98.83 84
Total 99,441.43 1 cpu0
```

(b) debug_executor

```
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Name Duration (us) Percent Count out_layout Device data_layout kernel_layout Hash layout Argument Shapes dst_layout weight_layout src_layout
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_3 139,559.24 36.43 2 NCHW16c cpu0 NCHW32c OIHW32i16o 6fb734c77ed64bde float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 56, 56, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_12 118,024.98 30.81 1 NCHW16c cpu0 NCHW3c OIHW3i16o 10a40e9231ff15a6 float32[1, 1, 224, 224, 3], float32[4, 1, 7, 7, 3, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 112, 112, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc 23,051.66 6.02 1 NCHW16c cpu0 NCHW16c OIHW16i16o 7661eb48c0b8a7e6 float32[1, 4, 56, 56, 16], float32[16, 4, 1, 1, 16, 16], float32[1, 16, 56, 56, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3 15,185.61 3.96 6 NCHW16c cpu0 NCHW16c OIHW16i16o efb9044cdd43e0b8 float32[1, 16, 14, 14, 16], float32[16, 16, 3, 3, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_9 13,328.36 3.48 3 NCHW32c cpu0 NCHW64c OIHW64i32o 83e0f5d1673ff2ae float32[1, 1, 56, 56, 64], float32[2, 1, 3, 3, 64, 32], float32[1, 2, 1, 1, 32], float32[1, 2, 56, 56, 32]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11 13,159.49 3.44 1 NCHW64c cpu0 NCHW16c OIHW16i64o 0f7bbb0e363c360c float32[1, 4, 56, 56, 16], float32[1, 4, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6 10,205.32 2.66 4 NCHW32c cpu0 NCHW8c OIHW8i32o 0d551fd3800939e1 float32[1, 16, 28, 28, 8], float32[4, 16, 3, 3, 8, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu 7,727.92 2.02 3 NCHW16c cpu0 NCHW16c OIHW16i16o 68695c5cd347ce57 float32[1, 32, 7, 7, 16], float32[32, 32, 3, 3, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_1 5,840.79 1.52 5 NCHW16c cpu0 NCHW4c OIHW4i16o 991e77362efe315d float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4 5,746.35 1.50 5 NCHW16c cpu0 NCHW16c OIHW16i16o c8d2fb74508242fa float32[1, 64, 14, 14, 16], float32[16, 64, 1, 1, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_2 3,745.35 0.98 3 NCHW64c cpu0 NCHW32c OIHW32i64o b8f45dade76ef8ee float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 28, 28, 64]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7 3,425.00 0.89 3 NCHW32c cpu0 NCHW64c OIHW64i32o 435cfe42fcb8d0b0 float32[1, 8, 28, 28, 64], float32[4, 8, 1, 1, 64, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_2 2,508.48 0.65 1 NCHW16c cpu0 NCHW64c OIHW64i16o 371a9e61ecaeecce float32[1, 8, 28, 28, 64], float32[64, 8, 1, 1, 64, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_10 2,400.83 0.63 2 NCHW64c cpu0 NCHW16c OIHW16i64o 850ecaa157c95aac float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_1 2,396.47 0.63 1 NCHW64c cpu0 NCHW16c OIHW16i64o 9b9c1d5fc56b0353 float32[1, 16, 56, 56, 16], float32[8, 16, 1, 1, 16, 64], float32[1, 8, 28, 28, 64]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1 2,271.00 0.59 2 NCHW16c cpu0 NCHW16c OIHW16i16o 1cc8a4dccc794a64 float32[1, 128, 7, 7, 16], float32[32, 128, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add 2,260.06 0.59 2 NCHW16c cpu0 NCHW32c OIHW32i16o 528b9cb523882d7e float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_3 2,240.88 0.59 1 NCHW16c cpu0 NCHW16c OIHW16i16o 9c3ea371f8ec4054 float32[1, 64, 14, 14, 16], float32[128, 64, 1, 1, 16, 16], float32[1, 128, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_2 1,401.25 0.37 1 NCHW16c cpu0 NCHW32c OIHW32i16o abe40a1f08b34bad float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_1 1,249.88 0.33 1 NCHW64c cpu0 NCHW32c OIHW32i64o 88bbb32f8f542f98 float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_multiply_add_nn_relu 1,220.86 0.32 1 NCHW16c cpu0 NCHW32c OIHW32i16o 21cb6d538731ba92 float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu 1,160.16 0.30 1 NCHW16c cpu0 NCHW4c OIHW4i16o c7b912640028a9e2 float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8 599.86 0.16 1 NCHW128c cpu0 NCHW16c OIHW16i128o 9b01f6479b89fd68 float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 128], float32[1, 1, 1, 1, 128], float32[1, 1, 28, 28, 128]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5 579.10 0.15 1 NCHW16c cpu0 NCHW64c OIHW64i16o dc31662fedbb8185 float32[1, 8, 28, 28, 64], float32[16, 8, 1, 1, 64, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2 571.03 0.15 1 NCHW16c cpu0 NCHW16c OIHW16i16o 8d07031ff51d0737 float32[1, 64, 14, 14, 16], float32[32, 64, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
tvmgen_default_fused_add_nn_relu_3 519.35 0.14 2 cpu0 e907ce81104cda7a float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]
tvmgen_default_fused_nn_contrib_dense_pack_add 488.16 0.13 1 cpu0 7641a0cce9852143 float32[1, 2048], float32[40, 2048, 25], float32[1, 1000], float32[1, 1000] NC25n
tvmgen_default_fused_add_nn_relu_2 360.30 0.09 3 cpu0 0e82013d73aa68c1 float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]
tvmgen_default_fused_nn_max_pool2d_add_nn_relu 342.65 0.09 1 cpu0 6f701a4fa071030f NCHW16c float32[1, 4, 112, 112, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]
tvmgen_default_fused_add_nn_relu_1 291.22 0.08 5 cpu0 f12067172f61c850 float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_layout_transform_2 106.25 0.03 3 cpu0 b8cbb72b4035894d float32[1, 4, 28, 28, 32], float32[1, 16, 28, 28, 8] NCHW8c NCHW32c
tvmgen_default_fused_add_layout_transform 76.45 0.02 1 cpu0 69355d3cc810f874 float32[1, 3, 224, 224], float32[3, 1, 1], float32[1, 1, 224, 224, 3] NCHW3c NCHW
tvmgen_default_fused_layout_transform_1 68.45 0.02 6 cpu0 f5e631fb93d23d4d float32[1, 16, 14, 14, 16], float32[1, 64, 14, 14, 4] NCHW4c NCHW16c
tvmgen_default_fused_add_nn_relu 51.61 0.01 2 cpu0 5d16c15878cc73d4 float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]
tvmgen_default_fused_nn_global_avg_pool2d 46.06 0.01 1 cpu0 f18307e2786f4cb3 NCHW16c float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16]
tvmgen_default_fused_layout_transform_3 36.66 0.01 1 cpu0 2c5d64d5f9faa001 float32[1, 1, 28, 28, 128], float32[1, 16, 28, 28, 8] NCHW8c NCHW128c
tvmgen_default_fused_layout_transform 11.41 0.00 3 cpu0 add43c0d2d8a8a3c float32[1, 32, 7, 7, 16], float32[1, 16, 7, 7, 32] NCHW32c NCHW16c
tvmgen_default_fused_nn_softmax 9.50 0.00 1 cpu0 ca61e79ea24e53f0 float32[1, 1000], float32[1, 1000]
tvmgen_default_fused_layout_transform_nn_batch_flatten 1.05 0.00 1 cpu0 2db99463d18696a4 float32[1, 128, 1, 1, 16], float32[1, 2048] NCHW NCHW16c
----------
Sum 382,269.07 99.80 84
Total 383,036.30 1 cpu0
```

(c) benchmark

```
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Evaluate inference time cost...
Execution time summary:
 mean (ms)   median (ms)   max (ms)   min (ms)   std (ms)
   95.1157       95.0706    95.2259    95.0505     0.0784
```

---

[Visit Topic](https://discuss.tvm.apache.org/t/difference-in-profiler-outputs/11255/7) to respond.