[quote="LeiWang1999, post:1, topic:18685"]
improve both readability and performance
[/quote]

I tested the CUDA code below and did get different instruction sequences and 
register usage counts for the two kernels. It is a surprise, since I expected the 
backend compiler to optimize them to the same binary :joy:.

```C++
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = tid; i < n; i += stride) {
        C[i] = A[i] + B[i];
    }
}

__global__ void vecAdd2(const float *A, const float *B, float *C, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
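    // Unit-step form: every thread runs ceil(n / stride) iterations.
    for (int j = 0; j < (n + stride - 1) / stride; ++j) {
        int i = tid + j * stride;
        // Guard the tail: the last iteration can step past n.
        if (i < n) {
            C[i] = A[i] + B[i];
        }
    }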
}
```

So it seems worthwhile to support a stepped loop node. Are there already any 
(pre-)RFCs on this topic? cc @LeiWang1999 @tqchen
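
For anyone who wants to sanity-check the two loop shapes without a GPU, here is a rough host-side sketch in plain C++. It only emulates which indices each simulated thread would touch; it says nothing about the generated SASS or register counts. The helper names (`steppedIndices`, `normalizedIndices`, `formsAgree`, `unguardedOutOfBounds`) are made up for this illustration and are not part of TVM or CUDA. It assumes the unit-step form carries an `i < n` guard, since its uniform trip count `ceil(n / stride)` can step past `n` on the last iteration.

```C++
#include <cassert>
#include <vector>

// Indices one thread visits in the stepped form:
//   for (int i = tid; i < n; i += stride)
std::vector<int> steppedIndices(int tid, int stride, int n) {
    assert(stride > 0);
    std::vector<int> out;
    for (int i = tid; i < n; i += stride) out.push_back(i);
    return out;
}

// Indices one thread visits in the unit-step form with a tail guard:
//   for (int j = 0; j < ceil(n / stride); ++j) { i = tid + j * stride; if (i < n) ... }
std::vector<int> normalizedIndices(int tid, int stride, int n) {
    assert(stride > 0);
    std::vector<int> out;
    int trips = (n + stride - 1) / stride;
    for (int j = 0; j < trips; ++j) {
        int i = tid + j * stride;
        if (i < n) out.push_back(i);
    }
    return out;
}

// True if every simulated thread (tid in [0, stride)) visits the same indices
// in the same order under both loop forms.
bool formsAgree(int stride, int n) {
    for (int tid = 0; tid < stride; ++tid) {
        if (steppedIndices(tid, stride, n) != normalizedIndices(tid, stride, n)) {
            return false;
        }
    }
    return true;
}

// Number of indices >= n that the unit-step form would touch WITHOUT the guard,
// summed over all simulated threads.
int unguardedOutOfBounds(int stride, int n) {
    assert(stride > 0);
    int count = 0;
    int trips = (n + stride - 1) / stride;
    for (int tid = 0; tid < stride; ++tid) {
        for (int j = 0; j < trips; ++j) {
            if (tid + j * stride >= n) ++count;
        }
    }
    return count;
}
```

Calling `formsAgree(stride, n)` for a few shapes where `n` is not a multiple of `stride` is the interesting case: the index sets match only because of the guard, which is extra work the stepped form never needs. That may be one reason the two kernels compile differently, though I have not verified this against the SASS.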