Re: Future direction of the Mesa Vulkan runtime (or "should we build a new gallium?")
Hello Faith and everyfrogy!

I've been developing a new Vulkan driver for Mesa, Terakan, for AMD TeraScale Evergreen and Northern Islands GPUs, since May 2023. You can find it in amd/terascale/vulkan on the Terakan branch of my fork at Triang3l/mesa. While it currently lacks many graphical features, the architecture of state management, meta, and descriptors has already largely been implemented in its code.

I'm overall relatively new to Mesa: in the past I contributed the fragment shader interlock implementation to RADV, which included working with its state management, but I've never written a Gallium driver, or a Vulkan driver in the ANV copy-pasting era, so this may be a somewhat fresh (although quite conservative) take on the subject.

Due to various hardware and kernel driver differences (bindings being individually loaded into fixed slots as part of the command buffer state; the lack of command buffer chaining in the kernel, which means all of the state has to be reapplied when the size of the hardware command buffer exceeds the HW/KMD limits), I've been designing the architecture of my Vulkan driver largely from scratch, without using the existing Mesa drivers as a reference. Unfortunately, it seems like we ended up going in fundamentally opposite directions in our designs, so I'd say I'm much more scared about this approach than I am excited about it.

My primary concerns about this architecture can be summarized into two categories:

• The obligation to manage pipeline and dynamic state in the common representation (essentially mostly the same Vulkan function call arguments, but with an additional layer for processing pNext and merging pipeline and dynamic state) restricts drivers' ability to optimize state management for specific hardware. Most importantly, it hampers precompiling state in pipeline objects. In state management, this would make Mesa Vulkan implementations closer not even to Gallium, but to the dreaded OpenGL.

• Certain parts of the common code are designed around assumptions that hold for the majority of the hardware; however, some devices have large architectural differences in specific areas, and trying to fit the programming of such hardware subsystems into the common model results in having to write suboptimal algorithms, as well as sometimes artificially restricting the VkPhysicalDeviceLimits the device can report. An example from my driver is the meaning of a pipeline layout on fixed-slot TeraScale. Because it uses flat binding indices throughout all sets (sets don't exist in the hardware at all), it needs base offsets for each set within the stage's bindings, which are precomputed at pipeline layout creation (see the sketch at the end of this message). This is fundamentally incompatible with MR !27024's direction of removing the concept of a pipeline layout; and if the common vkCmdBindDescriptorSets makes the VK_KHR_maintenance6 layout-object-less path the only available one, it would add a lot of overhead by making it necessary to recompute the offsets at every bind.

I think what we need to consider about pipeline state (in the broader sense, including both state objects and dynamic state) is that it inherently has very different properties from anything the common runtime already covers. What most of the current objects in the common runtime have in common is that they:

• Are largely hardware-independent and can work everywhere the same way.
• Either:
  • provide a complex solution to a large-scale problem, essentially being a sort of advanced "middleware": examples are WSI, synchronization, the pipeline cache, secondary command buffer emulation, and render pass emulation;
  • or solve a trivial task in a way that's non-intrusive towards the algorithms employed by the drivers, such as managing object handles, invoking allocators, reference-counting descriptor set and pipeline layouts, or pooling VkCommandBuffer instances.
• Rarely influence the design of "hot path" functions, such as changes to pipeline state and bindings.

On the other hand, pipeline state:

1. Is entirely hardware-specific.
2. Is modified very frequently, making up the majority of command buffer recording time.
3. Can be precompiled in pipeline objects, which is highly desirable due to the previous point.

Because of 1, there's almost nothing in the pipeline state that the common runtime can help share between drivers. Yes, it can potentially be used to automate running some NIR passes for baking static state into shaders, but currently it looks like the runtime is going in a somewhat different direction, and that needs only some helper functions invoked at pipeline creation time. Aside from that, I can't see it being useful for anything other than merging static and dynamic state into a single structure. For drivers where developers would prefer this approach for various
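Coming back to the pipeline layout example above, here is a minimal sketch of the kind of precomputation I mean, with hypothetical names and structures rather than Terakan's actual code:

#include <stdint.h>

#define EXAMPLE_MAX_SETS 32

/* Hypothetical pipeline layout for hardware with flat, fixed binding slots:
 * per-set base slots are computed once at pipeline layout creation. */
struct example_pipeline_layout {
   uint32_t set_count;
   /* Flat hardware slot where each set's bindings start (the per-stage
    * dimension is omitted for brevity). */
   uint32_t set_base_slot[EXAMPLE_MAX_SETS];
};

/* set_slot_counts[i] is the number of hardware slots used by set layout i;
 * in a real driver this would come from the VkDescriptorSetLayout. */
void
example_pipeline_layout_init(struct example_pipeline_layout *layout,
                             const uint32_t *set_slot_counts,
                             uint32_t set_count)
{
   uint32_t slot = 0;
   layout->set_count = set_count;
   for (uint32_t i = 0; i < set_count && i < EXAMPLE_MAX_SETS; i++) {
      layout->set_base_slot[i] = slot;
      slot += set_slot_counts[i];
   }
}

/* vkCmdBindDescriptorSets with the layout object: one array lookup. */
static inline uint32_t
example_first_set_base_slot(const struct example_pipeline_layout *layout,
                            uint32_t first_set)
{
   return layout->set_base_slot[first_set];
}

Without a layout object, the accumulation loop in example_pipeline_layout_init would effectively have to be re-run, up to firstSet, inside every bind call instead.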
Re: Future direction of the Mesa Vulkan runtime (or "should we build a new gallium?")
we should not forget that the communication between them is two-way, which includes:

• Interface calls made by the app.
• Limits and features exposed by the driver.

Having accurate information about the other party is important for both to be able to make optimal decisions considering the real strong points and the real constraints of the two. And note that the reason I'm talking about interface calls and limits collectively is that the application's Vulkan usage approaches essentially represent the "limits" of the application as well: like whether it has sufficient information to precompile pipeline state. When the common runtime just gets in the way here, it means that it's basically acting *against* the goal of the two… green sus 🐸

If NVK developers consider that for their target hardware, the near-unprocessed representation of the state behind both the static and the dynamic interfaces is sufficiently optimal, that's fine. But other drivers should not be punished for essentially doing what the application and the specification are enabling and even expecting them to do. Like baking immutable samplers into shader code. Or taking advantage of static, non-update-after-bind descriptors to assign UBOs to a fast hardware path in a more straightforward way. Or preprocessing static state in a pipeline object.

Maybe, in reality, it wouldn't have been a very big deal performance-wise if, in my driver Terakan, calling vkCmdBindPipeline with static blending equation state resulted in 6 enum translations and some shifts/ORs instead of just one 32-bit assignment (a rough sketch of the pretranslated path follows below). Or if, with my target hardware having fixed binding slots, every vkCmdBindDescriptorSets call ran a += loop up to firstSet instead of looking up the base slots in the pipeline layout. However, the conceptual thing here is that I'm not trying to make small improvements over some "default" behavior. There's no "default" here. I'm not supposed to implement static on top of dynamic in the first place: as I said, they are not just different concepts, but opposite ones.

Of course there are always exceptions on a case-by-case, driver-by-driver basis. For instance, due to the constraints of the memory and binding architectures of my target hardware, the kernel driver and the microcode, it's more optimal for my driver to record secondary command buffers on the level of Vulkan commands using Mesa's common encoder. But this kind of cherry-picking aligns much more closely with the "opt-in" approach than with the "watertight" one.

On the topic of limits, I also think the best we can do is to actually be honest about the specific driver and the specific hardware, and to view limits from the perspective of enabling richer communication with the application. Blatantly lying (at least without an environment variable switch), like in the scary old days when, as I've heard, some drivers resorted to essentially becoming LLVMpipe upon encountering a repeating NPOT texture, definitely doesn't contribute to productive communication. But at the same time, if AMD sees how apps can take advantage of the explicit cubemap 3D>2D transformation from VK_AMD_gcn_shader, or there are potential scenarios for something like VK_ARM_render_pass_striped… why can't I just tell the app that spreading descriptors across sets more granularly costs my driver nothing, and report maxBoundDescriptorSets = UINT32_MAX, or at least an integer-overflow-safer (maxSamplers + maxUB + maxSB + maxSampled + maxSI) * 6 + maxIA, giving it one more close-to-metal tool for the cases where it may be useful?
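To make the pretranslation point a bit more concrete, here is a minimal hypothetical sketch of "translate once at pipeline creation, assign once at bind time". The field offsets and the enum mapping below are made up for illustration (they are not the actual Evergreen blend control encoding), and the names are not Terakan's:

#include <stdint.h>
#include <vulkan/vulkan.h>

/* Placeholder translations; a real driver would map the Vulkan enums to the
 * hardware's blend factor/function encodings here. */
static uint32_t
example_hw_blend_factor(VkBlendFactor factor)
{
   return (uint32_t)factor & 0x1f; /* illustrative only */
}

static uint32_t
example_hw_blend_op(VkBlendOp op)
{
   return (uint32_t)op & 0x7; /* illustrative only */
}

/* Pipeline creation: translate the six enums once and pack them into one
 * register-like 32-bit word stored in the pipeline object. */
static uint32_t
example_pack_blend_control(const VkPipelineColorBlendAttachmentState *att)
{
   return example_hw_blend_factor(att->srcColorBlendFactor) << 0 |
          example_hw_blend_factor(att->dstColorBlendFactor) << 5 |
          example_hw_blend_op(att->colorBlendOp) << 10 |
          example_hw_blend_factor(att->srcAlphaBlendFactor) << 13 |
          example_hw_blend_factor(att->dstAlphaBlendFactor) << 18 |
          example_hw_blend_op(att->alphaBlendOp) << 23;
}

/* vkCmdBindPipeline with static blend state: a single 32-bit assignment into
 * the command buffer's close-to-hardware current state, with no per-record
 * translation. */
static void
example_bind_pipeline_blend(uint32_t *cmdbuf_blend_control,
                            uint32_t pipeline_packed_blend_control)
{
   *cmdbuf_blend_control = pipeline_packed_blend_control;
}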
P.S.: So far, the list of architectural concepts I'm not willing to sacrifice in Terakan, the loss of which I'd consider a major regression, includes:

• Pipeline objects with pretranslated fixed-function state, as well as everything needed to enable that (like storing the current state in a close-to-hardware representation, which may require custom vkCmdSet* implementations).
• Pipeline layout objects wherever they are still available, most importantly in vkCmdBindDescriptorSets and vkCmdPushDescriptorSetKHR.
• maxBoundDescriptorSets, maxPushDescriptors and maxPushConstantsSize significantly higher than on hardware with root-signature-based binding.
• Inside a VkCommandPool, separate pooling of entities allocated at different frequencies (sketched below): hardware command buffer portions; command encoders (containing, among other things, the current state, which is pretty large due to the fixed binding slots, and the relocation hash maps); BOs with push constants and dynamic vertex fetch subroutines.
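Rough hypothetical sketch of that last point, with illustrative names and members rather than Terakan's actual structures:

#include <stdint.h>

/* One free list per allocation frequency inside the command pool, so that
 * recycling one kind of object doesn't interfere with the others. */

/* Hardware command buffer portions: recycled very frequently. */
struct example_hw_cmdbuf_chunk {
   struct example_hw_cmdbuf_chunk *next_free;
   uint32_t *map;
   uint32_t size_dwords;
};

/* Command encoders: large (current fixed-slot state, relocation hash maps),
 * recycled roughly once per command buffer. */
struct example_cmd_encoder {
   struct example_cmd_encoder *next_free;
   /* ...close-to-hardware current state, relocation hash maps... */
};

/* BOs for push constants and dynamic vertex fetch subroutines: recycled once
 * the GPU is done with them. */
struct example_upload_bo {
   struct example_upload_bo *next_free;
   uint64_t gpu_address;
   uint32_t size;
};

struct example_cmd_pool {
   struct example_hw_cmdbuf_chunk *free_hw_chunks;
   struct example_cmd_encoder *free_encoders;
   struct example_upload_bo *free_upload_bos;
};

— Triang3l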
Re: Future direction of the Mesa Vulkan runtime (or "should we build a new gallium?")
elation of vertex buffer bindings and CSOs, I just didn't have anything useful to say.

• Faster iteration inside the common meta code, with the meta interface not having to take the demands of regular draws into account as much, and vice versa, of course. This matters especially when it comes to implementing new extensions, many of which would, with Gallium2, still need handling in every driver, and additionally in the Gallium2 interface itself.
• Breaking changes to the meta-specific interface would only require adjusting meta handling in the affected drivers. Breaking changes to something used by everyone across a vast code surface… Maybe you, Faith, are already well used to doing them, but that's still a very special kind of fun 😜

— Triang3l
Re: time for amber2 branch?
The shader compiler in R600g is actively developed (and I think OpenGL 4.6 support is among the main goals); I don't see why it needs to be moved to a low-priority branch, or why it should stop getting new NIR infrastructure updates, given the current amount of maintenance it receives.

On 19/06/2024 18:26, Thomas Debesse wrote:
> Maybe the work-in-progress “Terakan” vulkan driver for r600 cards
> won't be affected because vulkan is not a gallium thing.

Terakan uses R600g files heavily: specifically the register structures, as well as the shader compiler (with additional intrinsics implemented, and plans to move some SFN parts, primarily those interacting with bindings such as storage images/buffers, to NIR lowerings in R600g too for more straightforward code sharing) and other shader-related parts.

Moreover, some parts of Terakan (architectural-scale bugfixes primarily) will likely be backported to R600g in the somewhat distant future, once they're fully implemented and well-tested in Terakan, specifically:

• GL_ARB_shader_draw_parameters.
• New vertex fetch subroutine generation, correctly dividing by the instance divisor (see the sketch below) and co-issuing instructions where possible.
• A fetch subroutine workaround for the 2048 vertex stride on pre-Cayman GPUs (whose stride field only has bits for values up to 2047).
• Color attachment index compaction in fragment shaders, allowing the gaps to be filled with storage resources.
• Handling the different pitch alignment calculated by the texture fetching hardware for 1D-thin-tiled mips of depth and stencil surface aspects, which can't be respected on the depth/stencil attachment side where the pitch register is shared (likely via an intermediate overaligned surface).
• True indirect compute dispatch via a different packet sequence on existing kernel versions, and later, when the involved command parsing is fixed in the kernel, via actual INDIRECT_DISPATCH type-3 packets.
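As a side note on the instance divisor item, this is the index computation (per the OpenGL instanced arrays rules) that the generated fetch subroutine has to reproduce; plain illustrative C rather than the actual fetch shader code:

#include <stdint.h>

/* Element index fetched for an instanced vertex attribute (OpenGL semantics):
 * floor(instance / divisor) + baseinstance, for a non-zero divisor, where
 * instance is the zero-based instance number within the draw. */
static uint32_t
instanced_attrib_index(uint32_t instance, uint32_t divisor,
                       uint32_t base_instance)
{
   return instance / divisor + base_instance;
}

— Triang3l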
Re: time for amber2 branch?
lly all, I don't know for sure yet; AMD R8xx hardware, for instance, hangs with linear storage images according to one comment in R800AddrLib, and that's why a quad drawn to a color target may be preferable for copying — and it also has fast resolves inside its color buffer hardware, as well as a DMA engine). — Triang3l