I compared two similar BERT models running on CPU with TVM: one converted from PyTorch, 
the other from MXNet. Because of the large performance gap between them, I did some 
profiling. The result shows that the run time of the same operator (matmul) with the 
same workload differs a lot between the two models.

ENV:

1. TVM: built with MKL.
2. Intel CPU
3. OpenMP: `KMP_AFFINITY=compact,1,0 OMP_NUM_THREADS=24` (set before the process starts; see the sketch below)
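
For completeness, a minimal sketch of how the threading environment is applied in the benchmark script (the exact script is not shown here); the variables have to be set before MKL/OpenMP initializes:

    import os

    # Sketch: affinity and thread count must be exported before the OpenMP
    # runtime starts, i.e. before importing tvm.
    os.environ["KMP_AFFINITY"] = "compact,1,0"
    os.environ["OMP_NUM_THREADS"] = "24"

    import tvm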

Model inference time:
 
    # mxnet model
    TVM Mean inference time: 5.53 ms
    # pytorch model
    TVM Mean inference time: 23.05 ms
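
The mean inference time above was measured with TVM's time evaluator, roughly as sketched below (`graph`, `lib`, `params`, `ctx`, and `input_ids` are placeholders for the build outputs, CPU context, and model input; the exact benchmark script is not shown):

    import numpy as np
    from tvm.contrib import graph_runtime

    # Sketch: measure the mean inference time over repeated runs.
    # graph, lib, params come from relay.build(...); ctx is tvm.cpu(0).
    module = graph_runtime.create(graph, lib, ctx)
    module.set_input(**params)
    module.set_input("input_ids", input_ids)      # illustrative input name
    ftimer = module.module.time_evaluator("run", ctx, number=100, repeat=3)
    times_ms = np.array(ftimer().results) * 1000  # per-repeat mean runtime in ms
    print("TVM Mean inference time: %.2f ms" % np.mean(times_ms))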

Profiling result:

    # MXNet model
    Node Name              Ops                   Time(us)  Time(%)  Shape      Inputs  Outputs
    ---------
    fused_nn_dense_add_15  fused_nn_dense_add_1  308.926   5.58     (32, 768)  3       1
    fused_nn_dense_add_11  fused_nn_dense_add_1  307.277   5.551    (32, 768)  3       1

    # PyTorch model
    Node Name              Ops                   Time(us)  Time(%)  Shape      Inputs  Outputs
    ---------
    fused_nn_dense_add_3   fused_nn_dense_add_3  1783.75   7.631    (32, 768)  3       1
    fused_nn_dense_add_31  fused_nn_dense_add_3  1593.08   6.815    (32, 768)  3       1

IR code (identical for the PyTorch and MXNet models):

      attr [0] "compute_scope" = "fused_nn_dense_add_3_compute_";
      attr [C: handle] "storage_scope" = "global";
      allocate(C, float32, [24576]) {
        attr [0] "extern_scope" = 0;
        @tir.tvm_call_packed("tvm.contrib.cblas.matmul", 
@tir.tvm_stack_make_array(placeholder, @tir.tvm_stack_make_shape(32, 3072, 
dtype=handle), 0, 2, 0f32, 0, dtype=handle), 
@tir.tvm_stack_make_array(placeholder_1, @tir.tvm_stack_make_shape(768, 3072, 
dtype=handle), 0, 2, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(C, 
@tir.tvm_stack_make_shape(32, 768, dtype=handle), 0, 2, 0f32, 0, dtype=handle), 
False, True, dtype=int32)
        for (ax0: int32, 0, 32) "parallel" {
          for (ax1: int32, 0, 768) {
            T_add[((ax0*768) + ax1)] = ((float32*)C[((ax0*768) + ax1)] + 
(float32*)placeholder_2[ax1])
          }
        }
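
For reference, the `tvm.contrib.cblas.matmul` packed call appears because the models are built with a BLAS-enabled target; a minimal sketch of the build (target string and variable names are illustrative, the exact build script is not shown):

    import tvm
    from tvm import relay

    # Sketch: with -libs=cblas and TVM built against MKL, nn.dense is
    # offloaded to tvm.contrib.cblas.matmul, the packed call in the IR above.
    # mod/params are the Relay module and params from the PyTorch/MXNet frontend.
    target = "llvm -libs=cblas"
    with tvm.transform.PassContext(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)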

However, when setting `OMP_NUM_THREADS=1`, the inference times of the two models are 
the same, so it seems to be a problem related to multi-threading.

What could cause this difference?

Refer to: https://github.com/apache/incubator-tvm/issues/6354




