Recently I tried to implement GE-SpMM in TVM using hybrid script:

```python
import tvm


def mergespmm(num_rows, num_cols, nnz, indice_type, feat_type, feat_len):
    indptr = tvm.te.placeholder((num_rows + 1,), indice_type, 'indptr')
    indices = tvm.te.placeholder((nnz,), indice_type, name='indices')
    ufeat = tvm.te.placeholder((num_cols, feat_len), feat_type, name='ufeat')
    CF = 1 if feat_len < 64 else 2          # feature tiles handled per thread
    row_factor = 4 if feat_len < 64 else 8  # rows (one warp each) per thread block

    @tvm.te.hybrid.script
    def _mergespmm(indptr, indices, ufeat):
        out = output_tensor((indptr.shape[0] - 1, ufeat.shape[1]), 'float32')
        sm_k = allocate((32 * row_factor,), 'int32', 'shared')  # one 32-entry slice per warp
        result = allocate((CF,), 'float32', 'local')
        row_start = allocate((1,), 'int32', 'local')
        row_end = allocate((1,), 'int32', 'local')
        for row_outer in bind('blockIdx.x', (indptr.shape[0] + row_factor - 2) // row_factor):
            for feat_outer in bind('blockIdx.y', feat_len // 32 // CF):
                for row_inner in bind('threadIdx.y', row_factor):
                    for elem_inner in bind('threadIdx.x', 32):
                        if row_outer * row_factor + row_inner < indptr.shape[0] - 1:
                            row_start[0] = indptr[row_outer * row_factor + row_inner]
                            row_end[0] = indptr[row_outer * row_factor + row_inner + 1]
                            for cf in unroll(CF):
                                result[cf] = 0.0  # accumulator must start at zero (missing in my first version)
                            for elem_outer in range((row_end[0] - row_start[0] + 31) // 32):
                                # stage up to 32 column indices of this row into shared memory
                                if row_start[0] + elem_outer * 32 + elem_inner < row_end[0]:
                                    sm_k[row_inner * 32 + elem_inner] = indices[row_start[0] + elem_outer * 32 + elem_inner]
                                # accumulate the selected feature rows
                                for kk in range(32):
                                    if row_start[0] + elem_outer * 32 + kk < row_end[0]:
                                        for cf in unroll(CF):
                                            result[cf] += ufeat[sm_k[row_inner * 32 + kk], feat_outer * CF * 32 + cf * 32 + elem_inner]
                            for cf in unroll(CF):
                                out[row_outer * row_factor + row_inner, feat_outer * CF * 32 + cf * 32 + elem_inner] = result[cf]
        return out

    out = _mergespmm(indptr, indices, ufeat)
    sched = tvm.te.create_schedule(out.op)
    f = tvm.build(sched, [indptr, indices, ufeat, out], target='cuda')
    print(f.imported_modules[0].get_source())
    return f
```
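For context, the computation here is a row-wise CSR SpMM with implicit all-ones edge values: each output row sums the feature rows selected by that row's column indices, and `sm_k` merely stages 32 of those indices per warp in shared memory. A plain NumPy reference of the semantics (my own sketch for clarity, not part of the repro):

```python
import numpy as np


def spmm_ref(indptr, indices, ufeat):
    """Row-wise CSR SpMM with implicit all-ones values: out[i] is the sum of
    the feature rows selected by row i's column indices."""
    num_rows = len(indptr) - 1
    out = np.zeros((num_rows, ufeat.shape[1]), dtype=ufeat.dtype)
    for i in range(num_rows):
        for e in range(indptr[i], indptr[i + 1]):
            out[i] += ufeat[indices[e]]
    return out
```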
Building this fails at `src/tir/transforms/thread_storage_sync.cc:100` with an error saying syncs cannot be inserted inside a condition. The check is reasonable in general, because a `__syncthreads()` placed inside a divergent branch can deadlock. In this kernel, however, warps within a block do not share shared memory (each warp only reads and writes its own 32-entry slice of `sm_k`), so no `__syncthreads()` is actually needed. Is it possible to let programmers control sync operations? Or do we need another pass that checks whether a sync is really required?

Besides, I don't know whether it is related, but I ran into this issue before: https://discuss.tvm.apache.org/t/tvm-access-beyond-array-boundary/6998 @Huyuwei
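For completeness, this is roughly how I check such a kernel once it builds. The shapes, density, and device below are arbitrary assumptions on my side; SciPy's CSR matmul (with the values forced to ones, since the kernel ignores edge values) serves as the reference:

```python
import numpy as np
import scipy.sparse as sp
import tvm

num_rows = num_cols = 128
feat_len = 64  # feat_len >= 64 exercises the CF = 2 / row_factor = 8 path
adj = sp.random(num_rows, num_cols, density=0.05, format='csr', dtype='float32')
adj.data[:] = 1.0  # the kernel sums neighbor features, i.e. implicit all-ones values
feat = np.random.rand(num_cols, feat_len).astype('float32')

f = mergespmm(num_rows, num_cols, adj.nnz, 'int32', 'float32', feat_len)
dev = tvm.gpu(0)
tvm_indptr = tvm.nd.array(adj.indptr.astype('int32'), dev)
tvm_indices = tvm.nd.array(adj.indices.astype('int32'), dev)
tvm_feat = tvm.nd.array(feat, dev)
tvm_out = tvm.nd.array(np.zeros((num_rows, feat_len), dtype='float32'), dev)

f(tvm_indptr, tvm_indices, tvm_feat, tvm_out)
np.testing.assert_allclose(tvm_out.asnumpy(), adj @ feat, rtol=1e-4)
```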