krzysz00 wrote:
Ok, let me write all this out just to make sure I've convinced myself of what
you're going for in the hopes that it'll help improve the documentation.
One way to look at asyncwait(N) is that, once asyncwait(N) completes, any
operation in an async group other than one of the last N groups to be created
must have completed. These async operations (setting aside the fact that the
gfx9/10 ones key off of vmem) map onto different hardware counters for
different types of async operations.
What this implies is that, to materialize/implement asyncwait(N), the compiler
must first determine, along each control-flow path, which sequence of async
groups will be the N live ones. Then, for each such sequence and each counter
type, we determine a lower bound on the number of async operations in those
groups that increment that counter. For example, if we executed two async
loads to LDS and one tensor load in the previous async group, we'd know we need
asynccnt <= 2 and tensorcnt <= 1 after an asyncwait(1) - or, equivalently, that
there are >= 2 async ops and >= 1 tensor op in the previous group. In a more
complex scenario, like a double-buffered loop, we could have something like this:
```
tensor_load(... to A1)
tensor_load(... to B1)
asyncmark()
for (..) {
  barrier
  tensor_load(... to A2)
  tensor_load(... to B2)
  asyncmark()
  asyncwait(1)
  barrier
  compute(A1, B1) // no async ops
  barrier
  tensor_load(... to A1)
  tensor_load(... to B1)
  asyncmark()
  asyncwait(1)
  barrier
  compute(A2, B2)
}
asyncwait(0)
barrier
compute(A1, B1)
```
we know that every one of those asyncwait(1) calls should be implemented as a
tensor-count wait of 2 and an async-count wait of 0, and the latter can be
eliminated (that is, we have lower bounds of async count >= 0 and tensor count
>= 2 if the only async operations come from the previously marked group ... and
then a later analysis can go "hey, wait a minute, there's an async wait that's
trivially a nop, let's delete it").
For another case,
```
tensor_load(...)
asyncmark() // A
async_to_lds(...)
asyncmark() // B
if (...) {
  async_to_lds(...)
  asyncmark() // C
} else {
  tensor_load(...)
  asyncmark() // D
}
asyncwait(2)
```
the asyncwait will become asynccnt_wait(1) and tensorcnt_wait(0) because
- If the two most recent marks are C and B, we can leave two async ops
outstanding and no tensor ops outstanding
- If the two most recent marks are D and B, we can leave one async op and one
tensor op outstanding
Intersecting these two bounds gives us a pattern of waits that'll always work,
and, unlike in the previous example, we can't eliminate the wait for 0 tensor
ops, because it's possible that the program's execution path was A B C, which
means we need the wait operation to drive the tensor load from group A to
completion.
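The intersection-of-lower-bounds computation in both examples can be sketched
as follows (a toy model; `materialize_wait`, the dict-of-counters group
encoding, and the counter names are my own illustration, not the proposed IR):

```python
# Toy model of materializing asyncwait(N) as per-counter hardware waits.
def materialize_wait(paths, n):
    """paths: one list of async groups (in program order) per control-flow
    path reaching the wait; each group maps counter name -> ops it issued.
    Returns the safe per-counter wait: the minimum, over all paths, of the
    ops of that kind in the last n (live) groups."""
    counters = {c for groups in paths for g in groups for c in g}
    waits = {}
    for groups in paths:
        alive = groups[-n:] if n > 0 else []
        for c in counters:
            total = sum(g.get(c, 0) for g in alive)
            waits[c] = min(waits.get(c, total), total)
    return waits

# The if/else example: mark A covers one tensor_load, B one async_to_lds,
# then either C (one async_to_lds) or D (one tensor_load).
a, b = {"tensorcnt": 1}, {"asynccnt": 1}
c, d = {"asynccnt": 1}, {"tensorcnt": 1}
assert materialize_wait([[a, b, c], [a, b, d]], 2) == \
    {"asynccnt": 1, "tensorcnt": 0}
```

Taking the minimum per counter is exactly the "intersection" above: it yields
the strongest wait that is still correct on every path.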
Now, function calls.
Since the thing we care about is lower bounds on how much each mark increments
each async counter, it's safe to assume that each function call increments no
counters at all. That's the key point I was missing until now. If the function
_did_ increment a counter, the effect of assuming it didn't is only that our
waits become stronger: we'd also drive the operations we are explicitly
tracking in our analysis of its asyncwait()-using caller to completion, because
we didn't account for the new ones the function issued.
Now, if we can prove that the number of async operations of a particular kind a
function launches is in [L, U] for some L and U, we can be less conservative.
In particular, if we know that some function f does not contain any async
operations of a given kind (or, most generally, if we could stick an
`amdgpu-noasync` attribute on it), then spurious async waits (say, a
tensor-count wait in a program that never does any tensor loads) could be
eliminated by the logic that does that sort of thing. If we had more precise
knowledge of what the function was doing - say, we knew it issued no async LDS
ops and between 2 and 4 tensor DMAs - we'd be able to use that information to
get more precise bounds for our asyncwait()s, but that might be out of scope
for the initial work.
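As a sketch of how such [L, U] summaries might feed the per-group lower bounds
(hypothetical names; the key point is that an unanalyzed callee must be treated
as contributing zero, per the argument above):

```python
def group_lower_bound(direct_ops, callee_summaries):
    """Safe lower bound on the ops of one counter kind issued in a group.
    direct_ops: ops of this kind issued directly in the group.
    callee_summaries: one entry per call in the group - an (L, U) interval
    from an attributor-style analysis, or None for an unanalyzed callee,
    which we must conservatively treat as issuing 0 ops."""
    return direct_ops + sum((s[0] if s is not None else 0)
                            for s in callee_summaries)

# Two direct tensor DMAs, one callee known to issue [2, 4] of them, and
# one opaque callee: we can still promise at least 4.
assert group_lower_bound(2, [(2, 4), None]) == 4
```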
Now, suppose that there's an asyncwait(N) inside a non-entry-point function. If
that function established at least N async marks, then everything's fine and we
just analyze those locally. The callers of that function don't need to know
about this - it just means that their own local analysis of how the marks work
will insert waits for operations that have already been driven to completion.
(The most extreme example is a function that starts or ends with an
asyncwait(0) and never performs any async operations, which makes the caller's
more precise analysis pointless by way of introducing the "memory is
synchronized at calls" behavior we have for non-async ops.)
However, if a function issues N async marks and then does an asyncwait(N + K),
we don't know anything about that function's callers (ok, maybe sometimes we
do, but we're not going there in v1 or possibly ever), so it can only formulate
its lower bounds based on the N async marks it has access to. This is still
safe, but it won't be performant or do what one really expects.
What all this means is that, while we probably want an attributor scheme for
performance's sake, we don't need to reason about the contents of called
functions for correctness - which is what you said, and which took me a while
to realize.
There's also an interesting corollary here: it is always correct to replace
asyncwait(N) with asyncwait(M) for M < N - it may be bad for performance, but
it'll be correct.
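A quick way to convince yourself of the corollary (a toy model, nothing more):
identify each async group with its index and look at which groups a wait
forces to completion.

```python
# asyncwait(n) forces every async group except the last n to complete, so
# a smaller argument forces a superset of groups - strictly stronger, hence
# always correct (if potentially slower).
def groups_forced_complete(total_groups, n):
    return set(range(max(total_groups - n, 0)))

for n in range(5):
    for m in range(n):  # every m < n
        assert groups_forced_complete(8, n) <= groups_forced_complete(8, m)
```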
https://github.com/llvm/llvm-project/pull/173259