krzysz00 wrote:
Ok, let me write all this out just to make sure I've convinced myself of what
you're going for in the hopes that it'll help improve the documentation.
One way to look at asyncwait(N) is that, once asyncwait(N) completes, any
operation in an async group other than one of the last N groups to be created
must have completed. These async operations (setting aside the fact that the
gfx9/10 ones key off of vmem) map onto different hardware counters for
different types of async operations.
What this implies is that, to materialize/implement asyncwait(N), the compiler
must first determine, along each control-flow path, which sequence of async
groups will be the N live ones. Then, for each such sequence and each counter
type, we determine a lower bound on the number of async operations in those
groups that increment that counter. For example, if we executed two async
loads to LDS and one tensor load in the previous async group, we'd know we need
asynccnt <= 2 and tensorcnt <= 1 after an asyncwait(1) - or, equivalently, that
there are >= 2 async ops and >= 1 tensor op in the previous group. In a more
complex scenario, like a double-buffered loop, we could have something like this:
```
tensor_load(... to A1)
tensor_load(... to B1)
asyncmark()
for (..) {
  barrier
  tensor_load(... to A2)
  tensor_load(... to B2)
  asyncmark()
  asyncwait(1)
  barrier
  compute(A1, B1) // no async ops
  barrier
  tensor_load(... to A1)
  tensor_load(... to B1)
  asyncmark()
  asyncwait(1)
  barrier
  compute(A2, B2)
}
asyncwait(0)
barrier
compute(A1, B1)
```
we know that every one of those asyncwait(1) calls should be implemented as a
tensor-count wait of 2 and an async-count wait of 0, and the latter can be
eliminated (that is, we have lower bounds of async count >= 0 and tensor count
>= 2 if the only async operations come from the previously marked group ... and
then a later analysis can go "hey, wait a minute, there's an async wait that's
trivially a nop, let's delete it").
For another case,
```
tensor_load(...)
asyncmark() // A
async_to_lds(...)
asyncmark() // B
if (...) {
  async_to_lds(...)
  asyncmark() // C
} else {
  tensor_load(...)
  asyncmark() // D
}
asyncwait(2)
```
the asyncwait will become asynccnt_wait(1) and tensorcnt_wait(0) because
- If the two most recent marks are C and B, we can leave two async ops
outstanding and no tensor ops outstanding
- If the two most recent marks are D and B, we can leave one async op and one
tensor op outstanding
Intersecting these two bounds gives us a pattern of waits that'll always work,
and, unlike in the previous example, we can't eliminate the wait for 0 tensor
ops, because it's possible that the program's execution path was A B C, which
means we need the wait operation to drive the tensor load from group A to
completion.
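The intersection-of-lower-bounds computation in both examples can be sketched
as follows (a toy model; `materialize_wait`, the dict-of-counters group
encoding, and the counter names are my own illustration, not the proposed IR):

```python
# Toy model of materializing asyncwait(N) as per-counter hardware waits.
def materialize_wait(paths, n):
    """paths: one list of async groups (in program order) per control-flow
    path reaching the wait; each group maps counter name -> ops it issued.
    Returns the safe per-counter wait: the minimum, over all paths, of the
    ops of that kind in the last n (live) groups."""
    counters = {c for groups in paths for g in groups for c in g}
    waits = {}
    for groups in paths:
        alive = groups[-n:] if n > 0 else []
        for c in counters:
            total = sum(g.get(c, 0) for g in alive)
            waits[c] = min(waits.get(c, total), total)
    return waits

# The if/else example: mark A covers one tensor_load, B one async_to_lds,
# then either C (one async_to_lds) or D (one tensor_load).
a, b = {"tensorcnt": 1}, {"asynccnt": 1}
c, d = {"asynccnt": 1}, {"tensorcnt": 1}
assert materialize_wait([[a, b, c], [a, b, d]], 2) == \
    {"asynccnt": 1, "tensorcnt": 0}
```

Taking the minimum per counter is exactly the "intersection" above: it yields
the strongest wait that is still correct on every path.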
Now, function calls.
Since the thing we care about is lower bounds on how much each mark increments
each async counter, it's safe to assume that each function call increments no
counters at all. That's the key point I was missing until now. If the function
_did_ increment a counter, the effect of assuming it didn't is only that our
waits become stronger: we'd also drive the operations we are explicitly
tracking in our analysis of its asyncwait()-using caller to completion, because
we didn't account for the new ones the function issued.
Now, if we can prove that the number of async operations of a particular kind a
function launches is in [L, U] for some L and U, we can be less conservative.
In particular, if we know that some function f does not contain any async
operations of a given kind (or, most generally, if we could stick an
`amdgpu-noasync` attribute on it), then spurious async waits (say, a
tensor-count wait in a program that never does any tensor loads) could be
eliminated by the logic that does that sort of thing. If we had more precise
knowledge of what the function was doing - say, we knew it issued no async LDS
ops and between 2 and 4 tensor DMAs - we'd be able to use that information to
get more precise bounds for our asyncwait()s, but that might be out of scope
for the initial work.
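As a sketch of how such [L, U] summaries might feed the per-group lower bounds
(hypothetical names; the key point is that an unanalyzed callee must be treated
as contributing zero, per the argument above):

```python
def group_lower_bound(direct_ops, callee_summaries):
    """Safe lower bound on the ops of one counter kind issued in a group.
    direct_ops: ops of this kind issued directly in the group.
    callee_summaries: one entry per call in the group - an (L, U) interval
    from an attributor-style analysis, or None for an unanalyzed callee,
    which we must conservatively treat as issuing 0 ops."""
    return direct_ops + sum((s[0] if s is not None else 0)
                            for s in callee_summaries)

# Two direct tensor DMAs, one callee known to issue [2, 4] of them, and
# one opaque callee: we can still promise at least 4.
assert group_lower_bound(2, [(2, 4), None]) == 4
```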
Now, suppose that there's an asyncwait(N) inside a non-entry-point function. If
that function established at least N async marks, then everything's fine and we
just analyze those locally. The callers of that function don't need to know
about this - it just means that their own local analysis of how the marks work
will insert waits for operations that have already been driven to completion.
(The most extreme example is a function that starts or ends with an
asyncwait(0) and never performs any async operations, which makes the caller's
more precise analysis pointless by way of introducing the "memory is
synchronized at calls" behavior we have for non-async ops.)
However, if a function issues N async marks and then does an asyncwait(N + K),
we don't know anything about that function's callers (ok, maybe sometimes we
do, but we're not going there in v1 or possibly ever), so it can only formulate
its lower bounds based on the N async marks it has access to. This is still
safe, but it won't be performant or do what one really expects.
What all this means is that, while we probably want an attributor scheme for
performance's sake, we don't need to reason about the contents of called
functions for correctness - which is what you said, and which took me a while
to realize.
There's also an interesting corollary here: it is always correct to replace
asyncwait(N) with asyncwait(M) for M < N - it may be bad for performance, but
it'll be correct.
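A quick way to convince yourself of the corollary (a toy model, nothing more):
identify each async group with its index and look at which groups a wait
forces to completion.

```python
# asyncwait(n) forces every async group except the last n to complete, so
# a smaller argument forces a superset of groups - strictly stronger, hence
# always correct (if potentially slower).
def groups_forced_complete(total_groups, n):
    return set(range(max(total_groups - n, 0)))

for n in range(5):
    for m in range(n):  # every m < n
        assert groups_forced_complete(8, n) <= groups_forced_complete(8, m)
```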
https://github.com/llvm/llvm-project/pull/173259