In many CUDA kernels, the conventional pattern for thread iteration looks like 
this:

```cuda
for (int i = thread_idx; i < numel; i += num_threads)
    out[i] = 0;
```

However, in TileLang we currently have to write:

```python
for i in T.serial(0, T.ceildiv(numel - thread_idx, num_threads)):
    j = thread_idx + i * num_threads
    out[j] = 0
```

This is not only cumbersome, since it requires manually computing the trip 
count and rebuilding the index inside the loop body, but it may also increase 
register usage and make the index computation less efficient.
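To make the rewrite concrete, here is a small plain-Python sketch (no TileLang or TVM involved) that checks the two forms visit the same indices. The helper names `ceildiv`, `strided_indices`, `normalized_indices` and `check_equivalence` are made up for illustration; `ceildiv` is only assumed to behave like `T.ceildiv` for a positive divisor.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division for a positive divisor b (illustrative helper)."""
    return -(-a // b)


def strided_indices(thread_idx: int, numel: int, num_threads: int) -> list:
    # CUDA-style loop: for (i = thread_idx; i < numel; i += num_threads)
    return list(range(thread_idx, numel, num_threads))


def normalized_indices(thread_idx: int, numel: int, num_threads: int) -> list:
    # Unit-step form: compute the trip count first, then rebuild the index.
    trip_count = ceildiv(numel - thread_idx, num_threads)
    # range() is empty when trip_count <= 0, i.e. when thread_idx >= numel.
    return [thread_idx + i * num_threads for i in range(trip_count)]


def check_equivalence(max_numel: int = 40, max_threads: int = 8) -> bool:
    # Exhaustively compare both forms on a small grid of sizes.
    for numel in range(max_numel):
        for num_threads in range(1, max_threads + 1):
            for thread_idx in range(num_threads):
                a = strided_indices(thread_idx, numel, num_threads)
                b = normalized_indices(thread_idx, numel, num_threads)
                if a != b:
                    return False
    return True
```

The extra multiply-add per iteration in `normalized_indices` is the index transformation mentioned above; a stepped loop would carry the index directly.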

Introducing a `step` attribute to `ForNode` could simplify such patterns and 
improve both readability and performance, but I guess there are a lot of 
challenges in doing so.




