On Mon, Nov 11, 2024 at 01:57:15PM +0100, Christian König wrote:
> Am 08.11.24 um 23:27 schrieb Matthew Brost:
> > On Tue, Sep 24, 2024 at 11:30:53AM +0200, Simona Vetter wrote:
> > > Apologies for the late reply ...
> > >
> > Also late reply, just read this.
> >
> > > On Wed, Sep 04, 2024 at 01:34:18PM +0200, Christian König wrote:
> > > > Hi Boris,
> > > >
> > > > Am 04.09.24 um 13:23 schrieb Boris Brezillon:
> > > > > > > > > Please read up here on why that stuff isn't allowed:
> > > > > > > > > https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> > > > > > > > panthor doesn't yet have a shrinker, so all memory is pinned,
> > > > > > > > which means memory management easy mode.
> > > > > > > Ok, that at least makes things work for the moment.
> > > > > > Ah, perhaps this should have been spelt out more clearly ;)
> > > > > >
> > > > > > The VM_BIND mechanism that's already in place jumps through some
> > > > > > hoops to ensure that memory is preallocated when the memory
> > > > > > operations are enqueued. So any memory required should have been
> > > > > > allocated before any sync object is returned. We're aware of the
> > > > > > issue with memory allocations on the signalling path and are
> > > > > > trying to ensure that we don't have that.
> > > > > >
> > > > > > I'm hoping that we don't need a shrinker which deals with
> > > > > > (active) GPU memory with our design.
> > > > > That's actually what we were planning to do: the panthor shrinker
> > > > > was about to rely on fences attached to GEM objects to know if it
> > > > > can reclaim the memory. This design relies on each job attaching
> > > > > its fence to the GEM mapped to the VM at the time the job is
> > > > > submitted, such that memory that's in-use or about-to-be-used
> > > > > doesn't vanish before the GPU is done.
> > > > Yeah, and exactly that doesn't work any more when you are using user
> > > > queues, because the kernel has no opportunity to attach a fence for
> > > > each submission.
> > > >
> > > > > > Memory which user space thinks the GPU might need should be
> > > > > > pinned before the GPU work is submitted. APIs which require any
> > > > > > form of 'paging in' of data would need to be implemented by the
> > > > > > GPU work completing and being resubmitted by user space after
> > > > > > the memory changes (i.e. there could be a DMA fence pending on
> > > > > > the GPU work).
> > > > > Hard pinning memory could work (ioctl() around gem_pin/unpin()),
> > > > > but that means we can't really transparently swap out GPU memory,
> > > > > or we have to constantly pin/unpin around each job, which means
> > > > > even more ioctl()s than we have now. Another option would be to
> > > > > add the XGS fence to the BOs attached to the VM, assuming it's
> > > > > created before the job submission itself, but you're no longer
> > > > > reducing the number of user <-> kernel round trips if you do that,
> > > > > because you now have to create an XGS job for each submission, so
> > > > > you basically get back to one ioctl() per submission.
> > > > For AMDGPU we are currently working on the following solution with
> > > > memory management and user queues:
> > > >
> > > > 1. User queues are created through a kernel IOCTL, submissions work
> > > > by writing into a ring buffer and ringing a doorbell.
> > > >
> > > > 2. Each queue can request the kernel to create fences for the
> > > > currently pushed work for a queue, which can then be attached to
> > > > BOs, syncobjs, syncfiles etc...
> > > >
> > > > 3. In addition to that we have an eviction/preemption fence attached
> > > > to all BOs, page tables, whatever resources we need.
> > > >
> > > > 4. When these eviction fences are requested to signal they first
> > > > wait for all submission fences and then suspend the user queues and
> > > > block creating new submission fences until the queues are restarted
> > > > again.
> > > Yup this works, at least when I play it out in my head.
> > >
> > I just started experimenting with user submission in Xe last week and
> > ended up landing on a different PoC, blissfully unaware of future
> > fences / the Mesa submit thread. However, after Sima filled me in, I've
> > essentially landed on exactly what Christian is describing in Xe. I
> > haven't coded it yet, but have the design in my head.
>
> Sounds like going over that design again and again was well-invested
> time.
>
> And yeah we have it working and at least so far it really looks like it
> works.
>
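To make sure I'm reading steps 3 & 4 the same way you are, below is the
rough shape I currently have in my head for Xe. It's only a sketch to
check my understanding, nothing I've actually coded; all the my_* names
are placeholders I made up, only the dma_fence / dma_resv / workqueue
calls are real kernel API:

/*
 * Rough sketch of steps 3 & 4 as I picture them -- my_* names are
 * hypothetical, only dma_fence / dma_resv / workqueue calls are real.
 */
#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct my_user_queue;	/* hypothetical per-queue driver object */

/* Hypothetical helpers: flush the already created submission fences
 * (step 2), then suspend the queue and block new submission fences. */
void my_queue_wait_submission_fences(struct my_user_queue *q);
void my_queue_suspend(struct my_user_queue *q);

struct my_preempt_fence {
	struct dma_fence base;
	spinlock_t lock;
	struct work_struct suspend_work;
	struct my_user_queue *q;
};

static void my_preempt_fence_suspend_work(struct work_struct *w)
{
	struct my_preempt_fence *f =
		container_of(w, struct my_preempt_fence, suspend_work);

	/* Step 4: wait for submission fences, suspend, then signal. */
	my_queue_wait_submission_fences(f->q);
	my_queue_suspend(f->q);
	dma_fence_signal(&f->base);
}

static bool my_preempt_fence_enable_signaling(struct dma_fence *fence)
{
	struct my_preempt_fence *f =
		container_of(fence, struct my_preempt_fence, base);

	/* Can't block in the signalling path, punt to a worker. */
	queue_work(system_unbound_wq, &f->suspend_work);
	return true;
}

static const char *my_preempt_fence_driver_name(struct dma_fence *fence)
{
	return "my_driver";
}

static const char *my_preempt_fence_timeline_name(struct dma_fence *fence)
{
	return "preempt";
}

static const struct dma_fence_ops my_preempt_fence_ops = {
	.get_driver_name = my_preempt_fence_driver_name,
	.get_timeline_name = my_preempt_fence_timeline_name,
	.enable_signaling = my_preempt_fence_enable_signaling,
};

static struct my_preempt_fence *
my_preempt_fence_create(struct my_user_queue *q)
{
	struct my_preempt_fence *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;

	f->q = q;
	spin_lock_init(&f->lock);
	INIT_WORK(&f->suspend_work, my_preempt_fence_suspend_work);
	dma_fence_init(&f->base, &my_preempt_fence_ops, &f->lock,
		       dma_fence_context_alloc(1), 1);
	return f;
}

/* Step 3: attach the eviction/preempt fence to every BO / page table
 * resv. BOOKKEEP usage so it is ignored by implicit sync and only seen
 * by code that explicitly asks for it. */
static int my_attach_preempt_fence(struct dma_resv *resv,
				   struct my_preempt_fence *f)
{
	int ret;

	dma_resv_assert_held(resv);

	ret = dma_resv_reserve_fences(resv, 1);
	if (ret)
		return ret;

	dma_resv_add_fence(resv, &f->base, DMA_RESV_USAGE_BOOKKEEP);
	return 0;
}

The key property being that enable_signaling() never blocks: the wait on
the submission fences and the queue suspend happen from a worker, and
only after that does the preempt/eviction fence signal.
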
From the progress I've made, yeah I think this can work with some
cooperation from user space. I'm engaging with the Mesa team this week to
get a clearer picture on that end.

> > I also generally agree with Sima's comments about having a somewhat
> > generic preempt fence (Christian refers to this as an eviction fence)
> > as well.
>
> Well that is really a bikeshed.
>
> I don't care if we call it preempt fence or eviction fence as long as
> everybody understands what that thing is supposed to do.
>
> Probably something we should document.
>

Agreed, something common / documented would be good given how tricky this
will be to get correct.

> > Additionally, I'm thinking it might be beneficial for us to add a new
> > 'preempt' dma-resv slot to track these, which would make it easier to
> > enforce the ordering of submission fence signaling before preempt
> > fences.
>
> That's exactly what DMA_RESV_USAGE_BOOKKEEP is good for.
>

We can move this specific discussion to the RFC I posted. I saw you
replied there and will respond shortly, once I have followed up on a few
things on my end.

> And yes, I spent really *a lot of time* planning this :)
>

I've looked at the list and a bunch of patches are on there modifying
code which is not merged to drm-tip, e.g. this patch [1]. Trying to piece
together what AMD is doing compared to what I had in mind is a bit
difficult without being able to directly look at the code. Any chance you
have a public repo I can look at?

[1] https://patchwork.freedesktop.org/patch/622074/?series=140648&rev=1

> > Depending on bandwidth, I may post an RFC to the list soon. I'll also
> > gauge the interest and bandwidth from our Mesa team to begin UMD work.
>
> Please loop me in as well.
>

Will do. It is coming together pretty quickly, so it shouldn't be too
long until I can post something or push a public branch. I think the Mesa
side is going to be the more difficult piece, and I'm unsure how much
interest there will be to move this along.

Matt

> Regards,
> Christian.
>
> >
> > Matt
> >
> > > Note that the completion fence is only deadlock free if userspace is
> > > really, really careful. Which in practice means you need the very
> > > carefully constructed rules for e.g. vulkan winsys fences, otherwise
> > > you do indeed deadlock.
> > >
> > > But if you keep that promise in mind, then it works, and step 2 is
> > > entirely optional, which means we can start userspace in a pure
> > > long-running compute mode where there's only eviction/preemption
> > > fences. And then if userspace needs a vulkan winsys fence, we can
> > > create that with step 2 as needed.
> > >
> > > But the important part is that you need really strict rules on
> > > userspace for when step 2 is ok and won't result in deadlocks. And
> > > those rules are uapi, which is why I think doing this in panthor
> > > without the shrinker and eviction fences (i.e. steps 3&4 above) is a
> > > very bad mistake.
> > > >
> > > > This way you can still do your memory management inside the kernel
> > > > (e.g. move BOs from local to system memory) or even completely
> > > > suspend and resume applications without their interaction, but as
> > > > Sima said it is just horribly complicated to get right.
> > > >
> > > > We have been working on this for like two years now and it still
> > > > could be that we missed something since it is not in production
> > > > testing yet.
> > > Ack.
> > > -Sima
> > > --
> > > Simona Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
