Re: Future direction of the Mesa Vulkan runtime (or "should we build a new gallium?")

2024-01-20 Thread Triang3l

Hello Faith and everyfrogy!

I've been developing a new Vulkan driver for Mesa — Terakan, for AMD
TeraScale Evergreen and Northern Islands GPUs — since May of 2023. You can
find it in amd/terascale/vulkan on the Terakan branch of my fork at
Triang3l/mesa. While it currently lacks many of the graphical features, the
architecture of state management, meta, and descriptors has already
largely been implemented in its code. I'm overall relatively new to Mesa,
having previously contributed the fragment shader interlock implementation
to RADV, which involved working with its state management, but never having
written a Gallium driver, or a Vulkan driver in the ANV copy-pasting era,
so this may be a somewhat fresh — although quite conservative — take on
this.

Due to various hardware and kernel driver differences (bindings being
individually loaded into fixed slots as part of the command buffer state,
the lack of command buffer chaining in the kernel resulting in having to
reapply all of the state when the size of the hardware command buffer
exceeds the HW/KMD limits), I've been designing the architecture of my
Vulkan driver largely from scratch, without using the existing Mesa drivers
as a reference.

Unfortunately, it seems like we ended up going in fundamentally opposite
directions in our designs, so I'd say that I'm much more scared of this
approach than I am excited about it.

My primary concerns about this architecture can be summarized into two
categories:

• The obligation to manage pipeline and dynamic state in the common
  representation — essentially mostly the same Vulkan function call
  arguments, but with an additional layer for processing pNext and merging
  pipeline and dynamic state — restricts the abilities of drivers to
  optimize state management for specific hardware. Most importantly, it
  hampers precompiling of state in pipeline objects.
  In state management, this would make Mesa Vulkan implementations closer
  not even to Gallium, but to the dreaded OpenGL.

• Certain parts of the common code are designed around assumptions that hold
  for the majority of the hardware; however, some devices have large
  architectural differences in specific areas, and trying to adapt the way
  such hardware subsystems are programmed to the common model results in
  having to write suboptimal algorithms, and sometimes in artificially
  restricting the VkPhysicalDeviceLimits the device can report.
  An example from my driver is the meaning of a pipeline layout on
  fixed-slot TeraScale. Because it uses flat binding indices throughout all
  sets (sets don't exist in the hardware at all), it needs base offsets for
  each set within the stage's bindings — which are precomputed at pipeline
  layout creation. This is fundamentally incompatible with MR !27024's
  direction to remove the concept of a pipeline layout — and if the common
  vkCmdBindDescriptorSets makes the VK_KHR_maintenance6 layout-object-less
  path the only available one, it would add a lot of overhead by making it
  necessary to recompute the offsets at every bind (see the sketch after
  this list).
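
To make the difference concrete, here's a minimal, hedged sketch in C (all
names are invented for the example and are not the actual Terakan code) of
precomputing the per-set base offsets within a stage's flat slot space:

#include <stdint.h>

/* At vkCreatePipelineLayout time: one running sum over the number of flat
 * binding slots each set layout occupies (set_slot_counts would come from
 * the VkDescriptorSetLayout objects in the create info). */
static void
example_precompute_set_bases(uint32_t *set_base_slot,
                             const uint32_t *set_slot_counts,
                             uint32_t set_count)
{
   uint32_t base = 0;
   for (uint32_t i = 0; i < set_count; i++) {
      set_base_slot[i] = base;
      base += set_slot_counts[i];
   }
}

With a layout object available, vkCmdBindDescriptorSets can then start
writing hardware slots at set_base_slot[firstSet] in O(1); the
layout-object-less maintenance6 path would have to redo the running sum over
the preceding sets on every bind.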


I think what we need to consider about pipeline state (in the broader
sense, including both state objects and dynamic state) is that it
inherently has very different properties from anything the common runtime
already covers. What most of the current objects in the common runtime have
in common is that they:

• Are largely hardware-independent and can work everywhere the same way.
• Either:
  • Provide a complex solution to a large-scale problem, essentially being
    sort of advanced "middleware". Examples are WSI, synchronization,
    pipeline cache, secondary command buffer emulation, render pass
    emulation.
  • Or, solve a trivial task in a way that's non-intrusive towards
    algorithms employed by the drivers — such as managing object handles,
    invoking allocators, reference-counting descriptor set and pipeline
    layouts, pooling VkCommandBuffer instances.
• Rarely influence the design of "hot path" functions, such as changes to
  pipeline state and bindings.

On the other hand, pipeline state:

1. Is entirely hardware-specific.
2. Is modified very frequently — making up the majority of command buffer
   recording time.
3. Can be precompiled in pipeline objects — and that's highly desirable due
   to the previous point.

Because of 1, there's almost nothing in the pipeline state that the common
runtime can help share between drivers. Yes, it can potentially be used to
automate running some NIR passes for baking static state into shaders, but
currently it looks like the runtime is going in a somewhat different
direction, and that needs only some helper functions invoked at pipeline
creation time. Aside from that, I can't see it being useful for
anything other than merging static and dynamic state into a single
structure. For drivers where developers would prefer this approach for
various 

Re: Future direction of the Mesa Vulkan runtime (or "should we build a new gallium?")

2024-01-24 Thread Triang3l
we should not forget that the communication between them is
two-way, which includes:
 • Interface calls done by the app.
 • Limits and features exposed by the driver.

Having accurate information about the other party is important for both to
be able to make optimal decisions considering the real strong points and
the real constraints of the two. And note that the reason I'm talking
about interface calls and limits collectively is that the application's
Vulkan usage approaches essentially represent the "limits" of the
application as well — like whether it has sufficient information to
precompile pipeline state.

When the common runtime just gets in the way here, it means that it's
basically acting *against* the goal of the two… green sus 🐸


If NVK developers consider that for their target hardware, the
near-unprocessed representation of the state behind both static and dynamic
interfaces is sufficiently optimal, it's fine. But other drivers should not
be punished for essentially doing what the application and the
specification are enabling and even expecting them to do. Like baking
immutable samplers into shader code. Or taking advantage of static,
non-update-after-bind descriptors to assign UBOs to a fast hardware path in
a more straightforward way. Or preprocessing static state in a pipeline
object.
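
For the UBO case, that boils down to a check that can be made once at
descriptor set layout creation; a hedged sketch in C (the helper name is
invented):

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* A binding can be routed to a fast fixed hardware path only when its
 * contents are guaranteed not to change after the set has been bound. */
static bool
example_ubo_can_use_fast_path(VkDescriptorType type,
                              VkDescriptorBindingFlags flags)
{
   return type == VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER &&
          !(flags & VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT);
}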

Maybe, in reality, it wouldn't have been a very big deal from the
performance perspective if, in my driver Terakan, calling vkCmdBindPipeline
with static blending equation state had resulted in 6 enum translations and
some shifts/ORs instead of just one 32-bit assignment. Or if, with my target
hardware having fixed binding slots, every vkCmdBindDescriptorSets call had
run a += loop up to firstSet instead of looking up the base slots in the
pipeline layout.
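
To make that concrete, here's a hedged sketch of the precompiled path (the
bit layout and the translation stubs are invented for illustration and are
not the real CB blend control encoding): the six enum translations happen
once at pipeline creation, and the bind itself is a plain copy.

#include <stdint.h>
#include <vulkan/vulkan.h>

/* Stubs standing in for real enum-to-hardware translation tables. */
static uint32_t example_hw_blend_factor(VkBlendFactor f) { return (uint32_t)f; }
static uint32_t example_hw_blend_op(VkBlendOp op) { return (uint32_t)op; }

/* Pipeline creation: six enum translations plus shifts/ORs, done once; the
 * resulting word is stored in the pipeline object. */
static uint32_t
example_pack_blend_control(const VkPipelineColorBlendAttachmentState *att)
{
   return example_hw_blend_factor(att->srcColorBlendFactor) |
          (example_hw_blend_factor(att->dstColorBlendFactor) << 5) |
          (example_hw_blend_op(att->colorBlendOp) << 10) |
          (example_hw_blend_factor(att->srcAlphaBlendFactor) << 13) |
          (example_hw_blend_factor(att->dstAlphaBlendFactor) << 18) |
          (example_hw_blend_op(att->alphaBlendOp) << 23);
}

/* vkCmdBindPipeline with static blend state: just copying 32-bit words. */
static void
example_bind_static_blend(uint32_t *current, const uint32_t *precompiled,
                          uint32_t rt_count)
{
   for (uint32_t rt = 0; rt < rt_count; rt++)
      current[rt] = precompiled[rt];
}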

However, the conceptual thing here is that I'm not trying to make small
improvements over some "default" behavior. There's no "default" here.
I'm not supposed to implement static on top of dynamic in the first place;
as I said, they are not only different concepts, but even opposite ones.

Of course there are always exceptions on a case-by-case, driver-by-driver
basis. For instance, due to the constraints of the memory and binding
architectures of my target hardware, the kernel driver and the microcode,
it's more optimal for my driver to record secondary command buffers on the
level of Vulkan commands using Mesa's common encoder. But this kind of
cherry-picking aligns much more closely with the "opt-in" approach than the
"watertight" one.


On the topic of limits, I also think the best we can do is to actually be
honest about the specific driver and the specific hardware, and to view
them from the perspective of enabling richer communication with the
application. Blatantly lying (at least without an environment variable
switch), like in the scary old days when, as I heard, some drivers resorted
to behaving like LLVMpipe upon encountering a repeating NPOT texture, is not
contributing to productive communication. But at the same time, if AMD see
how apps can take advantage of the explicit cubemap 3D>2D transformation
from VK_AMD_gcn_shader, or there are potential scenarios for something like
VK_ARM_render_pass_striped… why can't I just tell the app that spreading
descriptors across sets more granularly costs my driver nothing and report
maxBoundDescriptorSets = UINT32_MAX or at least an integer-overflow-safer
(maxSamplers + maxUB + maxSB + maxSampled + maxSI) * 6 + maxIA, giving it
one more close-to-metal tool for cases where it may be useful?
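
In VkPhysicalDeviceLimits terms, my hedged reading of that formula (assuming
the shorthand maps to the per-stage descriptor limits and that the factor of
6 stands for the shader stages) would be something like:

#include <stdint.h>
#include <vulkan/vulkan.h>

/* Hedged interpretation: with at least one descriptor per set, an app can't
 * usefully bind more sets than its total per-stage descriptor budget across
 * the 6 shader stages, which keeps the value well below UINT32_MAX. */
static uint32_t
example_max_bound_descriptor_sets(const VkPhysicalDeviceLimits *l)
{
   return (l->maxPerStageDescriptorSamplers +
           l->maxPerStageDescriptorUniformBuffers +
           l->maxPerStageDescriptorStorageBuffers +
           l->maxPerStageDescriptorSampledImages +
           l->maxPerStageDescriptorStorageImages) * 6 +
          l->maxPerStageDescriptorInputAttachments;
}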



P.S.: So far, the list of architectural concepts I'm not willing to
sacrifice in Terakan, the loss of which I'd consider a major regression,
includes:
 • Pipeline objects with pretranslated fixed-function state, as well as
   everything needed to enable that (like storing the current state in a
   close-to-hardware representation, which may require custom vkCmdSet*
   implementations).
 • Pipeline layout objects where already available, most importantly in
   vkCmdBindDescriptorSets and vkCmdPushDescriptorSetKHR.
 • maxBoundDescriptorSets, maxPushDescriptors, maxPushConstantsSize
   significantly higher than on hardware with root-signature-based binding.
 • Inside a VkCommandPool, separate pooling of entities allocated at
   different frequencies: hardware command buffer portions, command
   encoders (containing things like the current state that is pretty large
   due to fixed binding slots, relocation hash maps), BOs with push
   constants and dynamic vertex fetch subroutines (a rough sketch follows
   below).
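
A rough sketch of what that separate pooling could look like (structure and
field names are invented for the illustration, not taken from Terakan):

#include <stdint.h>

struct example_hw_cs_chunk {           /* hardware command buffer portion */
   struct example_hw_cs_chunk *next;   /* free-list link inside the pool */
   uint32_t size_dw;
};

struct example_cmd_encoder {           /* large: fixed-slot state, reloc maps */
   struct example_cmd_encoder *next;
};

struct example_dynamic_bo {            /* push constants, fetch subroutines */
   struct example_dynamic_bo *next;
};

struct example_cmd_pool {
   /* Entities recycled at different rates each get their own free list, so
    * resetting or trimming the pool can treat them differently. */
   struct example_hw_cs_chunk *free_chunks;
   struct example_cmd_encoder *free_encoders;
   struct example_dynamic_bo *free_dynamic_bos;
};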

— Triang3l


Re: Future direction of the Mesa Vulkan runtime (or "should we build a new gallium?")

2024-01-25 Thread Triang3l
elation of vertex buffer bindings and CSOs, I just didn't have anything
useful to say.
 • Faster iteration inside the common meta code, with the meta interface
   not having to take the demands of regular draws into account as much.
   And vice versa, of course — especially when it comes to implementing new
   extensions, many of which would still need handling in every driver with
   Gallium2, and additionally in the Gallium2 interface itself.
 • Breaking changes to the meta-specific interface would only require
   adjusting meta handling in affected drivers.
   Breaking changes to something used by everyone across a vast code
   surface… Maybe you, Faith, are already well used to doing them, but
   that's still a very special kind of fun 😜

— Triang3l



Re: time for amber2 branch?

2024-06-19 Thread Triang3l

The shader compiler in R600g is actively developed (and I think OpenGL 4.6
support is among the main goals), so I don't see why it needs to be moved to a
low-priority branch or to stop getting new NIR infrastructure updates with the
current amount of maintenance it receives.

On 19/06/2024 18:26, Thomas Debesse wrote:
> Maybe the work-in-progress “Terakan” vulkan driver for r600 cards
> won't be affected because vulkan is not a gallium thing.

Terakan uses R600g files heavily, specifically the register structures, as
well as the shader compiler (with additional intrinsics implemented, and
plans to move some SFN parts, primarily those interacting with bindings, such
as storage images/buffers, to NIR lowerings in R600g too for more
straightforward code sharing) and other shader-related parts.

Moreover, there likely will be backporting of some parts of Terakan to R600g
(architectural-scale bugfixes primarily) in the somewhat distant future (when
they're fully implemented and well-tested in Terakan), specifically:
 • GL_ARB_shader_draw_parameters.
 • New vertex fetch subroutine generation, correctly dividing by the instance
   divisor, and co-issuing instructions where possible.
 • 2048 vertex stride fetch subroutine workaround on pre-Cayman GPUs (which
   have bits only for up to 2047).
 • Color attachment index compaction in fragment shaders to allow gaps to be
   filled with storage resources (see the sketch after this list).
 • Handling the different alignment of the pitch calculated by the texture
   fetching hardware for 1D-thin-tiled mips of depth and stencil surface
   aspects, which can't be respected on the depth/stencil attachment side
   where the pitch register is shared (likely will be using an intermediate
   overaligned surface).
 • True indirect compute dispatch via a different packet sequence on the
   existing kernel versions, and later, when the involved command parsing is
   fixed in the kernel, using actual INDIRECT_DISPATCH type-3 packets.
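
For the color attachment index compaction item above, a hedged sketch of the
remapping step (names and the limit macro are made up; the real change would
live in the fragment shader compilation path):

#include <stdint.h>

#define EXAMPLE_MAX_COLOR_ATTACHMENTS 8

/* Map each written Vulkan color output location to a dense hardware export
 * index; unused locations get -1, and every index past the returned count
 * becomes available for storage resources. */
static uint32_t
example_compact_color_exports(uint32_t written_mask,
                              int8_t hw_index[EXAMPLE_MAX_COLOR_ATTACHMENTS])
{
   uint32_t next = 0;
   for (uint32_t loc = 0; loc < EXAMPLE_MAX_COLOR_ATTACHMENTS; loc++)
      hw_index[loc] = (written_mask & (1u << loc)) ? (int8_t)next++ : -1;
   return next; /* number of color exports actually used */
}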

— Triang3l


Re: time for amber2 branch?

2024-06-20 Thread Triang3l
lly all, I don't know for sure yet; AMD R8xx hardware, for instance, hangs
with linear storage images according to one comment in R800AddrLib, and
that's why a quad with a color target may be preferable for copying — and it
also has fast resolves inside its color buffer hardware, as well as a DMA
engine).


— Triang3l