This series implements cached maps and explicit flushing for both
panfrost and panthor. To avoid code/bug duplication, the tricky guts of
the cache flushing ioctl, which walk the sg list, are broken out into a
new common shmem helper that can be used by any driver.
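For illustration, here is a minimal sketch of what such a range-based
sg-list walk can look like. The actual helper added by this series is
drm_gem_shmem_sync(); the function name, signature and clamping logic
below are assumptions of this sketch, not the patch's code:

    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>
    #include <drm/drm_gem_shmem_helper.h>

    /*
     * Sketch only: sync the CPU caches for [offset, offset + size) of
     * a CPU-cached shmem BO by walking its sg list and only touching
     * the entries that overlap the requested range.
     */
    static void shmem_sync_range_sketch(struct drm_gem_shmem_object *shmem,
                                        size_t offset, size_t size,
                                        enum dma_data_direction dir,
                                        bool for_cpu)
    {
            struct device *dev = shmem->base.dev->dev;
            struct scatterlist *sgl;
            unsigned int i;

            for_each_sgtable_dma_sg(shmem->sgt, sgl, i) {
                    size_t sg_len = sg_dma_len(sgl), len;

                    /* Skip entries that end before the range starts. */
                    if (offset >= sg_len) {
                            offset -= sg_len;
                            continue;
                    }

                    /* Clamp to whatever is left of the range. */
                    len = min(size, sg_len - offset);

                    if (for_cpu)
                            dma_sync_single_for_cpu(dev,
                                                    sg_dma_address(sgl) + offset,
                                                    len, dir);
                    else
                            dma_sync_single_for_device(dev,
                                                       sg_dma_address(sgl) + offset,
                                                       len, dir);

                    offset = 0;
                    size -= len;
                    if (!size)
                            break;
            }
    }

The point being that each sg entry is only synced for the part that
overlaps the requested range, so partial flushes stay cheap.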
The PanVK MR to use this lives here:
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36385

A few words about this v5: Faith and I discussed a few cache maintenance
aspects that were bothering me, and she suggested forbidding exports of
CPU-cacheable BOs to simplify things. But while doing that I realized
this was still broken on systems where the Mali GPU is coherent but
other potential importers are not, because in that case even BOs
declared without WB_MMAP are upgraded to CPU-cacheable.

I first considered making those truly uncached, but that's not
possible: coherency is flagged at the device level, and making a BO
non-cacheable on these systems messes with how shmem zeroes out the
pages at alloc time (with a temporary cached mapping) and how arm64
lazily defers flushes. Because dma_map_sgtable() ends up being a NOP
when the device is coherent, we end up with dirty cachelines hanging
around even after we've created an uncached mapping to the same memory
region. This leads to corruptions later on, or worse, to potential data
leaks, because the uncached mapping can access memory before it's been
zeroed out.

TL;DR: CPU cache maintenance is a mess on Arm unless everything is
coherent, and we'll have to sort it out at some point, but I believe
this is orthogonal to us implementing proper
dma_buf_ops::{begin,end}_cpu_access(). So I eventually decided to
implement dma_buf_ops::{begin,end}_cpu_access() instead of pretending
we're good if we only export CPU-uncached BOs (which we can't really do
on systems where Mali is coherent, see above). I've hacked an IGT test
to make sure this does the right thing, and it seems to work.

Now, the real question is whether this is the right thing to do. I
basically went for the system dma_heap approach, where not only the
exporter but also all importers have dma_sync_sgtable_for_xxx() called
on their sgtable to prepare/end the CPU access (a rough sketch of this
follows below). This should work as long as only CPU cache maintenance
is involved, but as soon as an importer needs more than that (copying
memory around to make it CPU/device visible), it's going to be
problematic. The only way to do that properly would be to add
{begin,end}_cpu_access() hooks to dma_buf_attach_ops and let the
dma_buf core walk the importers to forward CPU access requests.

Sorry for the noise if you've been Cc-ed and don't care, but my goal is
to gather feedback on what exactly is expected from GPU drivers
exporting their CPU-cacheable buffers, and why so many drivers get away
with no dma_buf_ops::xxx_cpu_access() hooks, or very simple ones that
don't cover the cases I described above.
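To make the dma_heap comparison concrete, here is a minimal sketch of
the begin_cpu_access() shape described above. It assumes the exporter
can reach the importers' sgtables through dmabuf->attachments, which is
a simplification: whether attach->sgt is populated depends on the
attach mode, and system_heap actually tracks its own per-attachment
state instead. Everything below is illustrative, not this series' code:

    #include <linux/dma-buf.h>
    #include <linux/dma-mapping.h>
    #include <linux/dma-resv.h>
    #include <drm/drm_gem_shmem_helper.h>

    /*
     * Sketch only: dma_heap-style begin_cpu_access(). The exporter's
     * sgtable and every mapped importer's sgtable get synced for CPU
     * access. end_cpu_access() would be the mirror image using
     * dma_sync_sgtable_for_device().
     */
    static int sketch_begin_cpu_access(struct dma_buf *dmabuf,
                                       enum dma_data_direction dir)
    {
            struct drm_gem_object *obj = dmabuf->priv;
            struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
            struct dma_buf_attachment *attach;

            /* The attachment list is protected by the resv lock. */
            dma_resv_lock(dmabuf->resv, NULL);

            /* Exporter side: hand the pages back to the CPU. */
            dma_sync_sgtable_for_cpu(obj->dev->dev, shmem->sgt, dir);

            /* Importer side: one sync per attached device. */
            list_for_each_entry(attach, &dmabuf->attachments, node) {
                    /* Assumption: skip importers with no mapped sgtable. */
                    if (!attach->sgt)
                            continue;

                    dma_sync_sgtable_for_cpu(attach->dev, attach->sgt, dir);
            }

            dma_resv_unlock(dmabuf->resv);

            return 0;
    }

As noted above, this only covers importers whose CPU-access preparation
is pure cache maintenance; anything fancier would need per-attachment
hooks forwarded by the dma_buf core.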
Changes in v2:
- Expose the coherency so userspace can know when it should skip cache
  maintenance
- Hook things up at the drm_gem_object_funcs level so dma-buf cpu_prep
  hooks can be implemented generically
- Revisit the semantics of the flags passed to gem_sync()
- Add BO_QUERY_INFO ioctls to query BO flags on imported objects and
  let the UMD know when cache maintenance is needed on those

Changes in v3:
- New patch to fix panthor_gpu_coherency_set()
- No other major changes, check each patch changelog for more details

Changes in v4:
- Two trivial fixes, check each patch changelog for more details

Changes in v5:
- Add a way to overload dma_buf_ops while still relying on the
  drm_prime boilerplate
- Add a default shmem implementation for
  dma_buf_ops::{begin,end}_cpu_access()
- Provide custom dma_buf_ops to deal with CPU cache flushes around CPU
  accesses when the BO is CPU-cacheable
- Go back to a version of drm_gem_shmem_sync() that only deals with
  cache maintenance, and adjust the semantics to make it clear this is
  the only thing it cares about
- Adjust the BO_SYNC ioctls according to the new drm_gem_shmem_sync()
  semantics

Boris Brezillon (10):
  drm/prime: Simplify life of drivers needing custom dma_buf_ops
  drm/shmem: Provide a generic {begin,end}_cpu_access() implementation
  drm/panthor: Provide a custom dma_buf implementation
  drm/panthor: Fix panthor_gpu_coherency_set()
  drm/panthor: Expose the selected coherency protocol to the UMD
  drm/panthor: Add a PANTHOR_BO_SYNC ioctl
  drm/panthor: Add an ioctl to query BO flags
  drm/panfrost: Provide a custom dma_buf implementation
  drm/panfrost: Expose the selected coherency protocol to the UMD
  drm/panfrost: Add an ioctl to query BO flags

Faith Ekstrand (5):
  drm/shmem: Add a drm_gem_shmem_sync() helper
  drm/panthor: Bump the driver version to 1.6
  drm/panfrost: Add a PANFROST_SYNC_BO ioctl
  drm/panfrost: Add flag to map GEM object Write-Back Cacheable
  drm/panfrost: Bump the driver version to 1.6

Loïc Molinari (1):
  drm/panthor: Add flag to map GEM object Write-Back Cacheable

 drivers/gpu/drm/drm_gem_shmem_helper.c     | 207 +++++++++++++++++++++
 drivers/gpu/drm/drm_prime.c                |  14 +-
 drivers/gpu/drm/panfrost/panfrost_device.h |   1 +
 drivers/gpu/drm/panfrost/panfrost_drv.c    |  98 +++++++++-
 drivers/gpu/drm/panfrost/panfrost_gem.c    |  73 ++++++++
 drivers/gpu/drm/panfrost/panfrost_gem.h    |   9 +
 drivers/gpu/drm/panfrost/panfrost_gpu.c    |  26 ++-
 drivers/gpu/drm/panfrost/panfrost_regs.h   |  10 +-
 drivers/gpu/drm/panthor/panthor_device.c   |  10 +-
 drivers/gpu/drm/panthor/panthor_drv.c      |  80 +++++++-
 drivers/gpu/drm/panthor/panthor_gem.c      |  77 +++++++-
 drivers/gpu/drm/panthor/panthor_gem.h      |   6 +
 drivers/gpu/drm/panthor/panthor_gpu.c      |   2 +-
 drivers/gpu/drm/panthor/panthor_sched.c    |  18 +-
 include/drm/drm_drv.h                      |   8 +
 include/drm/drm_gem_shmem_helper.h         |  24 +++
 include/uapi/drm/panfrost_drm.h            |  76 +++++++-
 include/uapi/drm/panthor_drm.h             | 160 +++++++++++++++-
 18 files changed, 875 insertions(+), 24 deletions(-)

-- 
2.51.0
