In many CUDA kernels, the conventional pattern for thread iteration looks like
this:
```cuda
for (int i = thread_idx; i < numel; i += num_threads)
    out[i] = 0;
```
However, in TileLang we currently have to write:
```python
for i in T.serial(0, T.ceildiv(numel - thread_idx, num_threads)):
    j = thread_idx + i * num_threads
    out[j] = -1
```
This is cumbersome, since the trip count and the index transformation have to
be written by hand. It also introduces additional register usage and makes
index computation less efficient, since `j` is derived from `i` on every
iteration rather than incremented directly.
Introducing a `step` attribute to `ForNode` could simplify such patterns and
improve both readability and performance, though I expect there are a lot of
challenges in doing so.
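To make the equivalence between the two loops concrete, here is a minimal plain-Python sketch. It is not TileLang or TVM code: the helpers `ceildiv`, `strided_indices`, and `normalized_indices` are hypothetical names used only for illustration, and `ceildiv` is assumed to behave like the `T.ceildiv` call in the snippet above. It checks that the step-1 loop with a manual index transformation visits the same indices as the strided loop, which is the property a `step`-aware `ForNode` would presumably have to preserve when lowering.

```python
def ceildiv(a, b):
    """Ceiling division for positive b (assumed to match T.ceildiv's result)."""
    return -(-a // b)


def strided_indices(thread_idx, numel, num_threads):
    """Indices visited by the CUDA-style loop:
    for (i = thread_idx; i < numel; i += num_threads)."""
    return list(range(thread_idx, numel, num_threads))


def normalized_indices(thread_idx, numel, num_threads):
    """Indices visited by the step-1 loop with a hand-written index transformation."""
    trip_count = ceildiv(numel - thread_idx, num_threads)
    # range() is empty for a non-positive trip count, e.g. when thread_idx >= numel.
    return [thread_idx + i * num_threads for i in range(trip_count)]


if __name__ == "__main__":
    # Compare both forms over a small grid of sizes and thread counts.
    for numel in range(0, 20):
        for num_threads in range(1, 8):
            for thread_idx in range(num_threads):
                assert strided_indices(thread_idx, numel, num_threads) == \
                    normalized_indices(thread_idx, numel, num_threads)
    print("both loop forms visit the same indices")
```

The check only covers a positive constant step with `thread_idx` in `[0, num_threads)`; negative or symbolic steps would need separate treatment.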