@tkonolige Thank you for responding. I just want to find out how much time is spent on data layout transformations while running inference on ResNet-50. profiler_vm seems to report a much lower inference cost (1) than debug_executor (2). Does this not contradict your statement that profiler_vm may be slower than the graph executor? I also ran benchmarking via `tvm.contrib.graph_executor`:

```
with autotvm.apply_graph_best(opt_sch_file):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build_module.build(mod, target=target, params=params)

# runtime is tvm.contrib.graph_executor
module = runtime.GraphModule(lib["default"](dev))
module.set_input("data", data)
print("Evaluate inference time cost...")
print(module.benchmark(dev, func_name="main", number=100, repeat=3, end_to_end=True))
```

The inference cost I get this way (3) is always close to, but lower than, (1). Do you have any idea why this is so?
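On the layout-transformation question, the per-operator rows can simply be totaled by name. A minimal sketch in plain Python (no TVM APIs assumed), with the durations hand-copied from profiler_vm output (1) below; note that names like `fused_add_nn_relu_layout_transform` are fused kernels, so their cost is only partly the transform itself:

```python
# Durations (us) of the rows from profiler output (1) whose fused kernel
# includes a layout_transform, copied by hand from the report below.
rows = {
    "fused_add_nn_relu_layout_transform": 814.00,
    "fused_layout_transform": 173.49,
    "fused_add_nn_relu_layout_transform_1": 172.54,
    "fused_layout_transform_3": 138.92,
    "fused_add_layout_transform": 90.72,
    "fused_layout_transform_4": 79.73,
    "fused_layout_transform_2": 50.75,
    "fused_layout_transform_5": 39.88,
    "fused_layout_transform_1": 14.62,
    "fused_layout_transform_nn_batch_flatten": 1.41,
}

# Sum everything that contains a layout_transform stage.
total_us = sum(v for name, v in rows.items() if "layout_transform" in name)
total_run_us = 272_418.15  # "Total" row of output (1)

print(f"layout transforms: {total_us:.2f} us "
      f"({100 * total_us / total_run_us:.2f}% of total)")
```

By this reading, layout transformations account for roughly 1.6 ms, i.e. well under 1% of the profiled run in (1).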
The Outputs:

(1) [profiler_vm]

```
Config for target=llvm -keys=cpu -link-params=0, workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Config for target=llvm -keys=cpu -link-params=0, workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.

Name  Duration (us)  Percent  Count  layout/out_layout  Device  data_layout  kernel_layout  Hash  Argument Shapes  src_layout  dst_layout  weight_layout
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11 38,648.93 14.19 5 NCHW16c cpu0 NCHW64c OIHW64i16o 5c16c122a657ba21 float32[1, 4, 14, 14, 64], float32[16, 4, 3, 3, 64, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6 31,069.39 11.41 4 NCHW8c cpu0 NCHW16c OIHW16i8o f2c6de1cbe5c0ddb float32[1, 8, 28, 28, 16], float32[16, 8, 3, 3, 16, 8], float32[1, 16, 1, 1, 8], float32[1, 16, 28, 28, 8]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_13 23,726.42 8.71 3 NCHW8c cpu0 NCHW2c OIHW2i8o cb108aaf00eff9e2 float32[1, 256, 7, 7, 2], float32[64, 256, 3, 3, 2, 8], float32[1, 64, 1, 1, 8], float32[1, 64, 7, 7, 8]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_10 18,153.16 6.66 5 NCHW8c cpu0 NCHW1024c OIHW1024i8o e4cba4831bd46d2c float32[1, 1, 14, 14, 1024], float32[32, 1, 1, 1, 1024, 8], float32[1, 32, 1, 1, 8], float32[1, 32, 14, 14, 8]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4 15,697.88 5.76 2 NCHW16c cpu0 NCHW16c OIHW16i16o b2d690588ecaac96 float32[1, 4, 56, 56, 16], float32[4, 4, 3, 3, 16, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_add_3 14,098.72 5.18 4 NCHW16c cpu0 NCHW16c OIHW16i16o 84bec82add215ebe float32[1, 16, 14, 14, 16], float32[64, 16, 1, 1, 16, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7 10,840.88 3.98 3 NCHW16c cpu0 NCHW16c OIHW16i16o d930aa7bf46c34e1 float32[1, 32, 28, 28, 16], float32[8, 32, 1, 1, 16, 16], float32[1, 8, 1, 1, 16], float32[1, 8, 28, 28, 16]
fused_nn_contrib_conv2d_NCHWc_add_1 10,638.57 3.91 3 NCHW16c cpu0 NCHW8c OIHW8i16o 6beba43d92784786 float32[1, 16, 28, 28, 8], float32[32, 16, 1, 1, 8, 16], float32[1, 32, 28, 28, 16], float32[1, 32, 28, 28, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu 8,112.57 2.98 1 NCHW8c cpu0 NCHW3c OIHW3i8o 2f8575d36cac57f0 float32[1, 1, 224, 224, 3], float32[8, 1, 7, 7, 3, 8], float32[1, 8, 1, 1, 8], float32[1, 8, 112, 112, 8]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_9 7,847.28 2.88 1 NCHW8c cpu0 NCHW16c OIHW16i8o 7baee5c8a4d8e4ab float32[1, 16, 14, 14, 16], float32[32, 16, 3, 3, 16, 8], float32[1, 32, 1, 1, 8], float32[1, 32, 14, 14, 8]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2 7,684.11 2.82 1 NCHW16c cpu0 NCHW32c OIHW32i16o 25fd1c3d9d4e561e float32[1, 2, 56, 56, 32], float32[4, 2, 3, 3, 32, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_add 7,625.64 2.80 2 NCHW32c cpu0 NCHW16c OIHW16i32o 667036afd5deee1b float32[1, 4, 56, 56, 16], float32[8, 4, 1, 1, 16, 32], float32[1, 8, 56, 56, 32], float32[1, 8, 56, 56, 32]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3 7,622.32 2.80 2 NCHW16c cpu0 NCHW32c OIHW32i16o 6e49d3c836077ac7 float32[1, 8, 56, 56, 32], float32[4, 8, 1, 1, 32, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]
fused_nn_contrib_conv2d_NCHWc_2 7,530.83 2.76 1 NCHW16c cpu0 NCHW16c OIHW16i16o b6e66601adaeb1e3 float32[1, 32, 28, 28, 16], float32[64, 32, 1, 1, 16, 16], float32[1, 64, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_4 7,305.51 2.68 2 NCHW16c cpu0 NCHW4c OIHW4i16o d0d1536228842867 float32[1, 128, 7, 7, 4], float32[128, 128, 1, 1, 4, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_3 7,303.69 2.68 1 NCHW8c cpu0 NCHW1024c OIHW1024i8o 493c374dd5e37c2b float32[1, 1, 14, 14, 1024], float32[256, 1, 1, 1, 1024, 8], float32[1, 256, 7, 7, 8]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_14 7,199.44 2.64 2 NCHW8c cpu0 NCHW2048c OIHW2048i8o af5e7bf563de2757 float32[1, 1, 7, 7, 2048], float32[64, 1, 1, 1, 2048, 8], float32[1, 64, 1, 1, 8], float32[1, 64, 7, 7, 8]
fused_nn_contrib_conv2d_NCHWc_1 7,185.16 2.64 1 NCHW16c cpu0 NCHW32c OIHW32i16o 5e7a95757d65e24e float32[1, 8, 56, 56, 32], float32[32, 8, 1, 1, 32, 16], float32[1, 32, 28, 28, 16]
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu 3,905.42 1.43 1 NCHW32c cpu0 NCHW16c OIHW16i32o 18ea4e7c768c292e float32[1, 4, 56, 56, 16], float32[8, 4, 1, 1, 16, 32], float32[1, 8, 56, 56, 32], float32[1, 8, 1, 1, 32], float32[1, 8, 56, 56, 32]
fused_nn_contrib_conv2d_NCHWc 3,776.76 1.39 1 NCHW32c cpu0 NCHW8c OIHW8i32o 7ff40af88acd710e float32[1, 8, 56, 56, 8], float32[8, 8, 1, 1, 8, 32], float32[1, 8, 56, 56, 32]
fused_nn_contrib_conv2d_NCHWc_add_multiply_add_nn_relu 3,693.25 1.36 1 NCHW16c cpu0 NCHW4c OIHW4i16o a3a86603f87a1daa float32[1, 128, 7, 7, 4], float32[128, 128, 1, 1, 4, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_1 3,616.06 1.33 1 NCHW16c cpu0 NCHW8c OIHW8i16o faa415ce8e443d42 float32[1, 16, 28, 28, 8], float32[32, 16, 1, 1, 8, 16], float32[1, 32, 28, 28, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 28, 28, 16]
fused_nn_contrib_conv2d_NCHWc_add_2 3,601.05 1.32 1 NCHW16c cpu0 NCHW8c OIHW8i16o c3c48546ccd1c8e4 float32[1, 32, 14, 14, 8], float32[64, 32, 1, 1, 8, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_2 3,509.62 1.29 1 NCHW16c cpu0 NCHW16c OIHW16i16o 237b36f60eadc660 float32[1, 16, 14, 14, 16], float32[64, 16, 1, 1, 16, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_12 2,119.10 0.78 1 NCHW16c cpu0 NCHW16c OIHW16i16o 8d07031ff51d0737 float32[1, 64, 14, 14, 16], float32[32, 64, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8 1,969.95 0.72 1 NCHW16c cpu0 NCHW16c OIHW16i16o 8ec1781e87f7f62e float32[1, 32, 28, 28, 16], float32[16, 32, 1, 1, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5 1,869.16 0.69 1 NCHW16c cpu0 NCHW32c OIHW32i16o 39975a03990f0ed6 float32[1, 8, 56, 56, 32], float32[8, 8, 1, 1, 32, 16], float32[1, 8, 1, 1, 16], float32[1, 8, 28, 28, 16]
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1 920.43 0.34 1 NCHW32c cpu0 NCHW8c OIHW8i32o ce29dd2da9289ac4 float32[1, 8, 56, 56, 8], float32[2, 8, 1, 1, 8, 32], float32[1, 2, 1, 1, 32], float32[1, 2, 56, 56, 32]
fused_add_nn_relu_layout_transform 814.00 0.30 5 cpu0 7590737f314ee1d9 float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 1, 14, 14, 1024] NCHW16c NCHW1024c
fused_add_nn_relu 751.40 0.28 2 cpu0 f6724216088f2bf7 float32[1, 8, 56, 56, 32], float32[1, 8, 1, 1, 32], float32[1, 8, 56, 56, 32]
fused_nn_contrib_dense_pack_add 658.90 0.24 1 cpu0 ced18cccebfa2ada float32[1, 2048], float32[125, 2048, 8], float32[1, 1000], float32[1, 1000] NC8n
fused_add_nn_relu_1 624.30 0.23 3 cpu0 848825acfc73218b float32[1, 32, 28, 28, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 28, 28, 16]
fused_nn_max_pool2d_add_nn_relu 378.72 0.14 NCHW8c 1 cpu0 4883943910905d24 float32[1, 8, 112, 112, 8], float32[1, 8, 1, 1, 8], float32[1, 8, 56, 56, 8]
fused_layout_transform 173.49 0.06 5 cpu0 0693edb3d97dc77f float32[1, 32, 14, 14, 8], float32[1, 4, 14, 14, 64] NCHW8c NCHW64c
fused_add_nn_relu_layout_transform_1 172.54 0.06 2 cpu0 468080b095af509a float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 1, 7, 7, 2048] NCHW16c NCHW2048c
fused_layout_transform_3 138.92 0.05 1 cpu0 6dda5720a553f260 float32[1, 64, 14, 14, 16], float32[1, 1, 14, 14, 1024] NCHW16c NCHW1024c
fused_add_layout_transform 90.72 0.03 1 cpu0 69355d3cc810f874 float32[1, 3, 224, 224], float32[3, 1, 1], float32[1, 1, 224, 224, 3] NCHW NCHW3c
fused_nn_global_avg_pool2d 83.09 0.03 NCHW16c 1 cpu0 f18307e2786f4cb3 float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16]
fused_layout_transform_4 79.73 0.03 1 cpu0 aad3e266e27c5054 float32[1, 256, 7, 7, 8], float32[1, 128, 7, 7, 16] NCHW8c NCHW16c
fused_layout_transform_2 50.75 0.02 3 cpu0 bd0b0c2ae84f7e09 float32[1, 64, 7, 7, 8], float32[1, 128, 7, 7, 4] NCHW8c NCHW4c
fused_layout_transform_5 39.88 0.01 2 cpu0 69f132fa7e1d6749 float32[1, 64, 7, 7, 8], float32[1, 256, 7, 7, 2] NCHW8c NCHW2c
fused_layout_transform_1 14.62 0.01 1 cpu0 9bd937910d443787 float32[1, 32, 7, 7, 16], float32[1, 256, 7, 7, 2] NCHW16c NCHW2c
fused_nn_softmax 7.80 0.00 1 cpu0 ca61e79ea24e53f0 float32[1, 1000], float32[1, 1000]
fused_layout_transform_nn_batch_flatten 1.41 0.00 1 cpu0 2db99463d18696a4 float32[1, 128, 1, 1, 16], float32[1, 2048] NCHW16c NCHW
----------
Sum 2,71,351.60 99.61 84
Total 2,72,418.15 1 cpu0
```

(2): [debug_executor]

```
Config for target=llvm -keys=cpu -link-params=0, workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Config for target=llvm -keys=cpu -link-params=0, workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Name  Duration (us)  Percent  Count  layout/out_layout  Device  data_layout  kernel_layout  Hash  Argument Shapes  src_layout  dst_layout  weight_layout
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_14 5,68,263.92 48.76 1 NCHW8c cpu0 NCHW3c OIHW3i8o 2f8575d36cac57f0 float32[1, 1, 224, 224, 3], float32[8, 1, 7, 7, 3, 8], float32[1, 8, 1, 1, 8], float32[1, 8, 112, 112, 8]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_2 1,75,988.78 15.10 1 NCHW32c cpu0 NCHW16c OIHW16i32o 18ea4e7c768c292e float32[1, 4, 56, 56, 16], float32[8, 4, 1, 1, 16, 32], float32[1, 8, 56, 56, 32], float32[1, 8, 1, 1, 32], float32[1, 8, 56, 56, 32]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_10 82,241.79 7.06 2 NCHW16c cpu0 NCHW16c OIHW16i16o b2d690588ecaac96 float32[1, 4, 56, 56, 16], float32[4, 4, 3, 3, 16, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc 67,905.70 5.83 1 NCHW32c cpu0 NCHW8c OIHW8i32o 7ff40af88acd710e float32[1, 8, 56, 56, 8], float32[8, 8, 1, 1, 8, 32], float32[1, 8, 56, 56, 32]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3 39,639.91 3.40 5 NCHW16c cpu0 NCHW64c OIHW64i16o 5c16c122a657ba21 float32[1, 4, 14, 14, 64], float32[16, 4, 3, 3, 64, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7 31,242.14 2.68 4 NCHW8c cpu0 NCHW16c OIHW16i8o f2c6de1cbe5c0ddb float32[1, 8, 28, 28, 16], float32[16, 8, 3, 3, 16, 8], float32[1, 16, 1, 1, 8], float32[1, 16, 28, 28, 8]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_4 29,317.11 2.52 2 NCHW32c cpu0 NCHW16c OIHW16i32o 667036afd5deee1b float32[1, 4, 56, 56, 16], float32[8, 4, 1, 1, 16, 32], float32[1, 8, 56, 56, 32], float32[1, 8, 56, 56, 32]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu 23,174.76 1.99 3 NCHW8c cpu0 NCHW2c OIHW2i8o cb108aaf00eff9e2 float32[1, 256, 7, 7, 2], float32[64, 256, 3, 3, 2, 8], float32[1, 64, 1, 1, 8], float32[1, 64, 7, 7, 8]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4 18,815.10 1.61 5 NCHW8c cpu0 NCHW1024c OIHW1024i8o e4cba4831bd46d2c float32[1, 1, 14, 14, 1024], float32[32, 1, 1, 1, 1024, 8], float32[1, 32, 1, 1, 8], float32[1, 32, 14, 14, 8]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_1 14,143.22 1.21 4 NCHW16c cpu0 NCHW16c OIHW16i16o 84bec82add215ebe float32[1, 16, 14, 14, 16], float32[64, 16, 1, 1, 16, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8 10,807.49 0.93 3 NCHW16c cpu0 NCHW16c OIHW16i16o d930aa7bf46c34e1 float32[1, 32, 28, 28, 16], float32[8, 32, 1, 1, 16, 16], float32[1, 8, 1, 1, 16], float32[1, 8, 28, 28, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_3 10,635.87 0.91 3 NCHW16c cpu0 NCHW8c OIHW8i16o 6beba43d92784786 float32[1, 16, 28, 28, 8], float32[32, 16, 1, 1, 8, 16], float32[1, 32, 28, 28, 16], float32[1, 32, 28, 28, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_13 8,887.56 0.76 1 NCHW32c cpu0 NCHW8c OIHW8i32o ce29dd2da9289ac4 float32[1, 8, 56, 56, 8], float32[2, 8, 1, 1, 8, 32], float32[1, 2, 1, 1, 32], float32[1, 2, 56, 56, 32]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11 8,865.53 0.76 2 NCHW16c cpu0 NCHW32c OIHW32i16o 6e49d3c836077ac7 float32[1, 8, 56, 56, 32], float32[4, 8, 1, 1, 32, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5 8,017.28 0.69 1 NCHW8c cpu0 NCHW16c OIHW16i8o 7baee5c8a4d8e4ab float32[1, 16, 14, 14, 16], float32[32, 16, 3, 3, 16, 8], float32[1, 32, 1, 1, 8], float32[1, 32, 14, 14, 8]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_12 7,585.56 0.65 1 NCHW16c cpu0 NCHW32c OIHW32i16o 25fd1c3d9d4e561e float32[1, 2, 56, 56, 32], float32[4, 2, 3, 3, 32, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_2 7,442.40 0.64 1 NCHW16c cpu0 NCHW16c OIHW16i16o b6e66601adaeb1e3 float32[1, 32, 28, 28, 16], float32[64, 32, 1, 1, 16, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1 7,293.48 0.63 2 NCHW8c cpu0 NCHW2048c OIHW2048i8o af5e7bf563de2757 float32[1, 1, 7, 7, 2048], float32[64, 1, 1, 1, 2048, 8], float32[1, 64, 1, 1, 8], float32[1, 64, 7, 7, 8]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_3 7,140.03 0.61 1 NCHW8c cpu0 NCHW1024c OIHW1024i8o 493c374dd5e37c2b float32[1, 1, 14, 14, 1024], float32[256, 1, 1, 1, 1024, 8], float32[1, 256, 7, 7, 8]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_1 7,041.60 0.60 1 NCHW16c cpu0 NCHW32c OIHW32i16o 5e7a95757d65e24e float32[1, 8, 56, 56, 32], float32[32, 8, 1, 1, 32, 16], float32[1, 32, 28, 28, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add 6,836.18 0.59 2 NCHW16c cpu0 NCHW4c OIHW4i16o d0d1536228842867 float32[1, 128, 7, 7, 4], float32[128, 128, 1, 1, 4, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_2 3,727.41 0.32 1 NCHW16c cpu0 NCHW8c OIHW8i16o c3c48546ccd1c8e4 float32[1, 32, 14, 14, 8], float32[64, 32, 1, 1, 8, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_1 3,596.31 0.31 1 NCHW16c cpu0 NCHW8c OIHW8i16o faa415ce8e443d42 float32[1, 16, 28, 28, 8], float32[32, 16, 1, 1, 8, 16], float32[1, 32, 28, 28, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 28, 28, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_multiply_add_nn_relu 3,468.59 0.30 1 NCHW16c cpu0 NCHW4c OIHW4i16o a3a86603f87a1daa float32[1, 128, 7, 7, 4], float32[128, 128, 1, 1, 4, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu 3,440.23 0.30 1 NCHW16c cpu0 NCHW16c OIHW16i16o 237b36f60eadc660 float32[1, 16, 14, 14, 16], float32[64, 16, 1, 1, 16, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_9 3,144.19 0.27 1 NCHW16c cpu0 NCHW32c OIHW32i16o 39975a03990f0ed6 float32[1, 8, 56, 56, 32], float32[8, 8, 1, 1, 32, 16], float32[1, 8, 1, 1, 16], float32[1, 8, 28, 28, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2 1,997.84 0.17 1 NCHW16c cpu0 NCHW16c OIHW16i16o 8d07031ff51d0737 float32[1, 64, 14, 14, 16], float32[32, 64, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6 1,783.56 0.15 1 NCHW16c cpu0 NCHW16c OIHW16i16o 8ec1781e87f7f62e float32[1, 32, 28, 28, 16], float32[16, 32, 1, 1, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]
tvmgen_default_fused_add_nn_relu_1 473.00 0.04 2 cpu0 f6724216088f2bf7 float32[1, 8, 56, 56, 32], float32[1, 8, 1, 1, 32], float32[1, 8, 56, 56, 32]
tvmgen_default_fused_add_nn_relu 338.92 0.03 3 cpu0 848825acfc73218b float32[1, 32, 28, 28, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 28, 28, 16]
tvmgen_default_fused_add_nn_relu_layout_transform_1 286.62 0.02 5 cpu0 7590737f314ee1d9 float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 1, 14, 14, 1024] NCHW16c NCHW1024c
tvmgen_default_fused_nn_contrib_dense_pack_add 265.74 0.02 1 cpu0 ced18cccebfa2ada float32[1, 2048], float32[125, 2048, 8], float32[1, 1000], float32[1, 1000] NC8n
tvmgen_default_fused_nn_max_pool2d_add_nn_relu 251.56 0.02 NCHW8c 1 cpu0 4883943910905d24 float32[1, 8, 112, 112, 8], float32[1, 8, 1, 1, 8], float32[1, 8, 56, 56, 8]
tvmgen_default_fused_layout_transform_3 132.62 0.01 5 cpu0 0693edb3d97dc77f float32[1, 32, 14, 14, 8], float32[1, 4, 14, 14, 64] NCHW8c NCHW64c
tvmgen_default_fused_nn_global_avg_pool2d 69.42 0.01 NCHW16c 1 cpu0 f18307e2786f4cb3 float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16]
tvmgen_default_fused_layout_transform_4 60.92 0.01 1 cpu0 aad3e266e27c5054 float32[1, 256, 7, 7, 8], float32[1, 128, 7, 7, 16] NCHW8c NCHW16c
tvmgen_default_fused_layout_transform_5 58.94 0.01 1 cpu0 6dda5720a553f260 float32[1, 64, 14, 14, 16], float32[1, 1, 14, 14, 1024] NCHW16c NCHW1024c
tvmgen_default_fused_add_layout_transform 56.01 0.00 1 cpu0 69355d3cc810f874 float32[1, 3, 224, 224], float32[3, 1, 1], float32[1, 1, 224, 224, 3] NCHW NCHW3c
tvmgen_default_fused_add_nn_relu_layout_transform 54.40 0.00 2 cpu0 468080b095af509a float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 1, 7, 7, 2048] NCHW16c NCHW2048c
tvmgen_default_fused_layout_transform_1 42.67 0.00 2 cpu0 69f132fa7e1d6749 float32[1, 64, 7, 7, 8], float32[1, 256, 7, 7, 2] NCHW8c NCHW2c
tvmgen_default_fused_layout_transform 33.90 0.00 3 cpu0 bd0b0c2ae84f7e09 float32[1, 64, 7, 7, 8], float32[1, 128, 7, 7, 4] NCHW8c NCHW4c
tvmgen_default_fused_layout_transform_2 19.34 0.00 1 cpu0 9bd937910d443787 float32[1, 32, 7, 7, 16], float32[1, 256, 7, 7, 2] NCHW16c NCHW2c
tvmgen_default_fused_nn_softmax 7.03 0.00 1 cpu0 ca61e79ea24e53f0 float32[1, 1000], float32[1, 1000]
tvmgen_default_fused_layout_transform_nn_batch_flatten 0.96 0.00 1 cpu0 2db99463d18696a4 float32[1, 128, 1, 1, 16], float32[1, 2048] NCHW16c NCHW
----------
Sum 11,64,595.59 99.94 84
Total 11,65,326.68 1 cpu0
```

(3) [benchmark]

```
Config for target=llvm -keys=cpu -link-params=0, workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Config for target=llvm -keys=cpu -link-params=0, workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Evaluate inference time cost...
Execution time summary:
 mean (ms)   median (ms)   max (ms)   min (ms)   std (ms)
  269.9458     270.0297    270.0697   269.7381     0.1478
```

---

[Visit Topic](https://discuss.tvm.apache.org/t/difference-in-profiler-outputs/11255/3) to respond.
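P.S. A quick sanity check on (1) vs. (3), under the assumption (mine) that the profiler's "Total" row in (1) corresponds to one end-to-end run: the gap is under 1%, which the per-operator instrumentation overhead of profiler_vm would plausibly explain, consistent with (3) being close to but below (1).

```python
# Compare the profiler_vm "Total" from (1) with the graph_executor
# benchmark mean from (3); figures copied from the reports above.
profiler_vm_total_ms = 272_418.15 / 1000  # "Total" row of (1), us -> ms
benchmark_mean_ms = 269.9458              # mean of (3)

overhead_ms = profiler_vm_total_ms - benchmark_mean_ms
print(f"difference: {overhead_ms:.2f} ms "
      f"({100 * overhead_ms / benchmark_mean_ms:.2f}% of the benchmark mean)")
```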