NIR: is_used_once breaks multi-pass rendering

2022-01-20 Thread Marek Olšák
Hi,

"is_used_once" within an inexact transformation in nir_opt_algebraic can
lead to geometry differences with multi-pass rendering, causing incorrect
output. Here's an example to prove it:

Let's assume there is a pass that writes out some intermediate value from
the position calculation as a varying. Let's assume there is another pass
that does the same thing, but only draws to the depth buffer, so varyings
are eliminated. The second pass would get "is_used_once" because there is
just the position, and let's assume there is an inexact transformation with
"is_used_once" that matches that. On the other hand, the first pass
wouldn't get "is_used_once" because there is the varying. Now the same
position calculation is different for each pass, causing depth test
functions commonly used in multi-pass rendering such as EQUAL to fail.
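
For concreteness, an inexact pattern gated on "is_used_once" looks roughly
like this in nir_opt_algebraic.py; this is an illustrative sketch in the
file's Python pattern syntax, not an exact quote of any real entry:

  # a, b, c are pattern variables, as defined near the top of the real file.
  a, b, c = 'a', 'b', 'c'

  optimizations = [
      # Fuse a*b + c into ffma(a, b, c), but only when the fmul result has
      # exactly one use.  The leading '~' marks the fadd as inexact, so the
      # rule is skipped for values marked exact (e.g. invariant/precise).
      (('~fadd', ('fmul(is_used_once)', a, b), c), ('ffma', a, b, c)),
  ]

Whether such a rule fires depends on how many users the fmul has, which is
exactly what differs between the two passes above.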

The application might even use the exact same shader for both passes, and
the driver might just look for depth-only rendering and remove the varyings
based on that. Or it can introduce more "is_used_once" cases via uniform
inlining. From the app's point of view, the positions should be identical
between both passes if it's the exact same shader.

The workaround we have for this issue is called
"vs_position_always_invariant", which was added for inexact FMA fusing, but
it works with all inexact transformations containing "is_used_once".

This issue could be exacerbated by future optimizations.

Some of the solutions are:
- Remove "is_used_once" (safe)
- Enable vs_position_always_invariant by default (not helpful if the data
flow is shader->texture->shader->position)
- Always suppress inexact transformations containing "is_used_once" for all
instructions contributing to the final position value (less aggressive than
vs_position_always_invariant; it needs a proof that it's equivalent to
vs_position_always_invariant in terms of invariance, not behavior)
- Continue using app workarounds.

Just some food for thought.

Marek


Re: NIR: is_used_once breaks multi-pass rendering

2022-01-20 Thread Connor Abbott
There are precise rules for when calculations in GL must return the
same result, which are laid out in Appendix A ("Invariance"). The
relevant rule here:

"Rule 4 The same vertex or fragment shader will produce the same result when
run multiple times with the same input. The wording ‘the same shader’ means a
program object that is populated with the same source strings, which
are compiled
and then linked, possibly multiple times, and which program object is
then executed
using the same GL state vector. Invariance is relaxed for shaders with
side effects,
such as accessing atomic counters (see section A.5)"

The key part is "using the same GL state vector." In particular that
includes which buffers are attached - and the entire program has to be
the same, which allows varying linking optimizations. The intent of
this was probably exactly to enable these sorts of optimizations based
on what's using a value while allowing aggressive cross-stage
optimizations. This means that in your example, the app is broken and
it should be using "invariant gl_Position;" (i.e. what the app
workaround does).

There's a question of whether apps are broken often enough that we
ought to enable vs_position_always_invariant by default. That will
obviously come with some performance cost, although maybe it's
acceptable enough in practice.

Connor

On Thu, Jan 20, 2022 at 9:31 AM Marek Olšák  wrote:
>
> Hi,
>
> "is_used_once" within an inexact transformation in nir_opt_algebraic can lead 
> to geometry differences with multi-pass rendering, causing incorrect output. 
> Here's an example to prove it:
>
> Let's assume there is a pass that writes out some intermediate value from the 
> position calculation as a varying. Let's assume there is another pass that 
> does the same thing, but only draws to the depth buffer, so varyings are 
> eliminated. The second pass would get "is_used_once" because there is just 
> the position, and let's assume there is an inexact transformation with 
> "is_used_once" that matches that. On the other hand, the first pass wouldn't 
> get "is_used_once" because there is the varying. Now the same position 
> calculation is different for each pass, causing depth test functions commonly 
> used in multi-pass rendering such as EQUAL to fail.
>
> The application might even use the exact same shader for both passes, and the 
> driver might just look for depth-only rendering and remove the varyings based 
> on that. Or it can introduce more "is_used_once" cases via uniform inlining. 
> From the app's point of view, the positions should be identical between both 
> passes if it's the exact same shader.
>
> The workaround we have for this issue is called 
> "vs_position_always_invariant", which was added for inexact FMA fusing, but 
> it works with all inexact transformations containing "is_used_once".
>
> This issue could be exacerbated by future optimizations.
>
> Some of the solutions are:
> - Remove "is_used_once" (safe)
> - Enable vs_position_always_invariant by default (not helpful if the data 
> flow is shader->texture->shader->position)
> - Always suppress inexact transformations containing "is_used_once" for all 
> instructions contributing to the final position value (less aggressive than 
> vs_position_always_invariant; it needs a proof that it's equivalent to 
> vs_position_always_invariant in terms of invariance, not behavior)
> - Continue using app workarounds.
>
> Just some food for thought.
>
> Marek


Re: [PATCH v2 4/4] drm/i915/uapi: document behaviour for DG2 64K support

2022-01-20 Thread Ramalingam C
On 2022-01-18 at 17:50:37 +, Robert Beckett wrote:
> From: Matthew Auld 
> 
> On discrete platforms like DG2, we need to support a minimum page size
> of 64K when dealing with device local-memory. This is quite tricky for
> various reasons, so try to document the new implicit uapi for this.
> 
> v2: Fixed suggestions on formatting [Daniel]
> 
> Signed-off-by: Matthew Auld 
> Signed-off-by: Ramalingam C 
> Signed-off-by: Robert Beckett 
> cc: Simon Ser 
> cc: Pekka Paalanen 
> Cc: Jordan Justen 
> Cc: Kenneth Graunke 
> Cc: mesa-dev@lists.freedesktop.org
> Cc: Tony Ye 
> Cc: Slawomir Milczarek 
> ---
>  include/uapi/drm/i915_drm.h | 44 -
>  1 file changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 5e678917da70..486b7b96291e 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1118,10 +1118,16 @@ struct drm_i915_gem_exec_object2 {
>   /**
>* When the EXEC_OBJECT_PINNED flag is specified this is populated by
>* the user with the GTT offset at which this object will be pinned.
> +  *
>* When the I915_EXEC_NO_RELOC flag is specified this must contain the
>* presumed_offset of the object.
> +  *
>* During execbuffer2 the kernel populates it with the value of the
>* current GTT offset of the object, for future presumed_offset writes.
> +  *
> +  * See struct drm_i915_gem_create_ext for the rules when dealing with
> +  * alignment restrictions with I915_MEMORY_CLASS_DEVICE, on devices with
> +  * minimum page sizes, like DG2.
>*/
>   __u64 offset;
>  
> @@ -3145,11 +3151,39 @@ struct drm_i915_gem_create_ext {
>*
>* The (page-aligned) allocated size for the object will be returned.
>*
> -  * Note that for some devices we have might have further minimum
> -  * page-size restrictions(larger than 4K), like for device local-memory.
> -  * However in general the final size here should always reflect any
> -  * rounding up, if for example using the I915_GEM_CREATE_EXT_MEMORY_REGIONS
> -  * extension to place the object in device local-memory.
> +  *
> +  * **DG2 64K min page size implications:**
> +  *
> +  * On discrete platforms, starting from DG2, we have to contend with GTT
> +  * page size restrictions when dealing with I915_MEMORY_CLASS_DEVICE
> +  * objects.  Specifically the hardware only supports 64K or larger GTT
> +  * page sizes for such memory. The kernel will already ensure that all
> +  * I915_MEMORY_CLASS_DEVICE memory is allocated using 64K or larger page
> +  * sizes underneath.
> +  *
> +  * Note that the returned size here will always reflect any required
> +  * rounding up done by the kernel, i.e 4K will now become 64K on devices
> +  * such as DG2.
> +  *
> +  * **Special DG2 GTT address alignment requirement:**
> +  *
> +  * The GTT alignment will also need be at least 2M for  such objects.
> +  *
> +  * Note that due to how the hardware implements 64K GTT page support, we
> +  * have some further complications:
> +  *
> +  *   1) The entire PDE(which covers a 2MB virtual address range), must
> +  *   contain only 64K PTEs, i.e mixing 4K and 64K PTEs in the same
> +  *   PDE is forbidden by the hardware.
> +  *
> +  *   2) We still need to support 4K PTEs for I915_MEMORY_CLASS_SYSTEM
> +  *   objects.
> +  *
> +  * To keep things simple for userland, we mandate that any GTT mappings
> +  * must be aligned to and rounded up to 2MB. As this only wastes virtual
> +  * address space and avoids userland having to copy any needlessly
> +  * complicated PDE sharing scheme (coloring) and only affects GD2, this
> +  * id deemed to be a good compromise.
"only affects DG2, this is" 

Apart from these typos, the patch looks good to me.

Reviewed-by: Ramalingam C 
>*/
>   __u64 size;
>   /**
> -- 
> 2.25.1
> 


[PATCH v3 5/5] drm/i915/uapi: document behaviour for DG2 64K support

2022-01-20 Thread Robert Beckett
From: Matthew Auld 

On discrete platforms like DG2, we need to support a minimum page size
of 64K when dealing with device local-memory. This is quite tricky for
various reasons, so try to document the new implicit uapi for this.

v3: fix typos and less emphasis
v2: Fixed suggestions on formatting [Daniel]

Signed-off-by: Matthew Auld 
Signed-off-by: Ramalingam C 
Signed-off-by: Robert Beckett 
Acked-by: Jordan Justen 
cc: Simon Ser 
cc: Pekka Paalanen 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: mesa-dev@lists.freedesktop.org
Cc: Tony Ye 
Cc: Slawomir Milczarek 
---
 include/uapi/drm/i915_drm.h | 44 -
 1 file changed, 39 insertions(+), 5 deletions(-)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 5e678917da70..77e5e74c32c1 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1118,10 +1118,16 @@ struct drm_i915_gem_exec_object2 {
/**
 * When the EXEC_OBJECT_PINNED flag is specified this is populated by
 * the user with the GTT offset at which this object will be pinned.
+*
 * When the I915_EXEC_NO_RELOC flag is specified this must contain the
 * presumed_offset of the object.
+*
 * During execbuffer2 the kernel populates it with the value of the
 * current GTT offset of the object, for future presumed_offset writes.
+*
+* See struct drm_i915_gem_create_ext for the rules when dealing with
+* alignment restrictions with I915_MEMORY_CLASS_DEVICE, on devices with
+* minimum page sizes, like DG2.
 */
__u64 offset;
 
@@ -3145,11 +3151,39 @@ struct drm_i915_gem_create_ext {
 *
 * The (page-aligned) allocated size for the object will be returned.
 *
-* Note that for some devices we have might have further minimum
-* page-size restrictions(larger than 4K), like for device local-memory.
-* However in general the final size here should always reflect any
-* rounding up, if for example using the I915_GEM_CREATE_EXT_MEMORY_REGIONS
-* extension to place the object in device local-memory.
+*
+* DG2 64K min page size implications:
+*
+* On discrete platforms, starting from DG2, we have to contend with GTT
+* page size restrictions when dealing with I915_MEMORY_CLASS_DEVICE
+* objects.  Specifically the hardware only supports 64K or larger GTT
+* page sizes for such memory. The kernel will already ensure that all
+* I915_MEMORY_CLASS_DEVICE memory is allocated using 64K or larger page
+* sizes underneath.
+*
+* Note that the returned size here will always reflect any required
+* rounding up done by the kernel, i.e 4K will now become 64K on devices
+* such as DG2.
+*
+* Special DG2 GTT address alignment requirement:
+*
+* The GTT alignment will also need to be at least 2M for such objects.
+*
+* Note that due to how the hardware implements 64K GTT page support, we
+* have some further complications:
+*
+*   1) The entire PDE (which covers a 2MB virtual address range), must
+*   contain only 64K PTEs, i.e mixing 4K and 64K PTEs in the same
+*   PDE is forbidden by the hardware.
+*
+*   2) We still need to support 4K PTEs for I915_MEMORY_CLASS_SYSTEM
+*   objects.
+*
+* To keep things simple for userland, we mandate that any GTT mappings
+* must be aligned to and rounded up to 2MB. As this only wastes virtual
+* address space and avoids userland having to copy any needlessly
+* complicated PDE sharing scheme (coloring) and only affects DG2, this
+* is deemed to be a good compromise.
 */
__u64 size;
/**
-- 
2.25.1
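
For userspace, a minimal sketch of the arithmetic the documentation above
mandates on DG2 (the constants and helper below are assumptions for
illustration only, not part of the uapi):

  SZ_64K = 64 * 1024          # minimum device local-memory page size on DG2
  SZ_2M = 2 * 1024 * 1024     # required GTT alignment for such objects

  def align_up(value, alignment):
      # Round value up to the next multiple of a power-of-two alignment.
      return (value + alignment - 1) & ~(alignment - 1)

  # A 4K create request in I915_MEMORY_CLASS_DEVICE comes back as 64K.
  allocated_size = align_up(4096, SZ_64K)             # -> 65536

  # Any GTT mapping of such an object must be 2M aligned and 2M rounded.
  gtt_offset = align_up(0x123000, SZ_2M)              # -> 0x200000
  mapping_size = align_up(allocated_size, SZ_2M)      # -> 0x200000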



Re: Replacing NIR with SPIR-V?

2022-01-20 Thread Abel Bernabeu
In principle, all the properties you highlight in your blog as key points
of NIR also apply to SPIR-V. I was curious to know which details I am
missing that make NIR start shining as a more suitable IR than SPIR-V for
the task of communicating between the front-end and the back-end. By the
way, thanks for putting together that blog post.

As it seems clear that the NIR question is well settled within the mesa
community and I really see value in having mesa drivers, I promise to pay
as much attention to the NIR use cases as I did with SPIR-V :-)

By the way, we are not planning to add dedicated RISC-V instructions for
everything that has an instruction in SPIR-V. Regarding the two areas you
mention:

- Arrays and structs: SPIR-V's OpAccessChain would need to be processed by
a backend and translated to pointer arithmetic plus dereferencing (kind of
the same thing as having to process a nir_deref). This translation can be
done in RISC-V with no issue, whether it is OpAccessChain or nir_deref.

- Trigonometric operations: personally, I consider that only "sin" and "cos"
are needed additions for RISC-V. The precision is unclear yet, likely 8 bits,
serving as the initial value for a Newton-Raphson style computation.

Regards.

On Thu, Jan 20, 2022 at 2:36 AM Jason Ekstrand  wrote:

> > - Does it make sense to move to SPIR-V?
>
> None whatsoever.  SPIR-V is an interchange format, not a set of
> manipulatable data structures suitable for compiler lowering and
> optimization.
>
> You also don't want to build hardware around consuming SPIR-V.  There are
> lots of things that the SPIR-V has which you wouldn't want to support
> natively in hardware such as structures and arrays in SSA values or complex
> trig ops like atan2().  Part of the purpose of NIR is to lower these things
> to simpler constructs which are supported in native hardware.
>
> --Jason
>
> On Wed, Jan 19, 2022 at 7:17 PM Abel Bernabeu <
> abel.berna...@esperantotech.com> wrote:
>
>> Hi,
>>
>> My name is Abel Bernabeu and I currently chair the Graphics and ML Special
>> Interest Group within RISC-V.
>>
>> As part of my work for RISC-V I am currently looking at what is needed
>> for supporting a graphics product that uses a (potentially extended) RISC-V
>> ISA for its shading cores. My initial focus has been on analyzing the
>> functional gap between RISC-V and SPIR-V, assuming that whatever is needed
>> for a modern graphics accelerator is inevitably present on SPIR-V.
>>
>> Now, the thing is that most of the potential adopters on our committee
>> will likely be interested in using mesa for developing their drivers and
>> that means using NIR as intermediate representation. Thus, I also need to
>> consider NIR when looking at the functional gap, doubling the amount of
>> work during the analysis.
>>
>> Why is mesa using NIR as intermediate representation rather than SPIR-V?
>> It would make my life easier if mesa used SPIR-V rather than NIR for
>> communicating the front-end and the backends.
>>
>> I know it is a lot of work to migrate to SPIR-V, but I am interested in
>> knowing what is the opinion of the mesa developers:
>>
>> - My understanding is that when mesa adopted NIR, there was no SPIR-V.
>> Was a comparison made after the SPIR-V ratification?
>>
>> - Does it make sense to move to SPIR-V?
>>
>> - Is it feasible in terms of functionality supported by SPIR-V?
>>
>> - Is the cost worth the potential advantage of using a more commonly
>> adopted standard?
>>
>> Thanks in advance for your time and thoughts.
>>
>> Regards.
>>
>


Re: Replacing NIR with SPIR-V?

2022-01-20 Thread Jason Ekstrand
On Thu, Jan 20, 2022 at 5:49 PM Abel Bernabeu <
abel.berna...@esperantotech.com> wrote:

> In principle, all the properties you highlight in your blog as key points
> of NIR also apply to SPIR-V.
>

First off, that blog post is truly ancient.  Based on the quote from
nir_opt_algebraic.c, it looks like it was written less than 6 months after
the original NIR patches landed, which puts it at 5-6 years old.  A lot has
changed since then.


> I was curious to know which details I am missing that make NIR start
> shining as a more suitable IR than SPIR-V for the task of communicating
> between the front-end and the back-end. By the way, thanks for putting
> together that blog post.
>

In terms of what they're capable of communicating, yes, SPIR-V and NIR can
express many of the same things.  But that's not the point.  The point is
that there's a lot that happens between coming out of GLSL or SPIR-V and
going into the back-end.  A lot of what we do with NIR is share as much of
that lowering across drivers as possible.  Yes, we could convert back to
SPIR-V before going into back-ends but there's really no point since they
need their own IRs anyway.  If you're dumping straight into LLVM or
similar, then maybe you don't need any of that, but if you're building a
custom back-end, you really want to let NIR do that lowering and you don't
want to handle it all on your own.


> As it seems clear that the NIR question is well settled within the mesa
> community and I really see value in having mesa drivers, I promise to pay
> as much attention to the NIR use cases as I did with SPIR-V :-)
>
> By the way, we are not planning to add dedicated RISC-V instructions for
> everything that has an instruction in SPIR-V. Regarding the two areas you
> mention:
>
> - Arrays and structs: SPIR-V's OpAccessChain would need to be processed by
> a backend and translated to pointer arithmetic plus dereferencing (kind of
> the same thing as having to process a nir_deref). This translation can be
> done in RISC-V with no issue, whether it is OpAccessChain or nir_deref.
>

A big part of the point of NIR is to get rid of these things so that
drivers don't have to deal with them.  Yes, NIR has arrays and structs and
nir_deref to deal with them, but by the time you get into the back-end, all
the nir_derefs are gone and you're left with load/store messages with
actual addresses (either a 64-bit memory address or an index+offset pair
for a bound resource).  Again, unless you're going to dump straight into
LLVM, you really don't want to handle that in your back-end unless you
really have to.
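
To make that concrete, here is a toy sketch of what is left once the derefs
are lowered: the chain of struct member and array indices collapses into
plain address arithmetic feeding an ordinary load (the layout, names and
numbers below are made up for illustration):

  # Hypothetical std430-style block: struct { float weights[16]; vec4 colors[8]; }
  FLOAT_SIZE = 4
  WEIGHTS_OFFSET = 0
  COLORS_OFFSET = WEIGHTS_OFFSET + 16 * FLOAT_SIZE   # colors[] follows weights[16]
  VEC4_STRIDE = 4 * FLOAT_SIZE

  def colors_component_offset(array_index, component):
      # block -> .colors -> [array_index] -> component, as a byte offset
      return COLORS_OFFSET + array_index * VEC4_STRIDE + component * FLOAT_SIZE

  # The back-end then just sees a (binding index, byte offset) load, e.g.:
  offset = colors_component_offset(array_index=3, component=1)   # -> 116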

Overall, I think you're asking the wrong set of questions.  If you're
trying to understand Mesa GPU compilers, looking at NIR from documentation
and blog posts and comparing it with SPIR-V is likely to raise more
questions than answers.  I would instead recommend looking at an actual
driver and seeing how things flow through the compiler stack.  That's
likely to teach you a lot more about how the Mesa compiler stack works than
reading blogs.  That, or start implementing a NIR back-end, see what you
run into, and ask questions on #dri-devel.

--Jason



> - Trigonometric operations: personally, I consider that only "sin" and
> "cos" are needed additions for RISC-V. The precision is unclear yet, likely
> 8 bits, serving as the initial value for a Newton-Raphson style computation.
>
> Regards.
>
> On Thu, Jan 20, 2022 at 2:36 AM Jason Ekstrand 
> wrote:
>
>> > - Does it make sense to move to SPIR-V?
>>
>> None whatsoever.  SPIR-V is an interchange format, not a set of
>> manipulatable data structures suitable for compiler lowering and
>> optimization.
>>
>> You also don't want to build hardware around consuming SPIR-V.  There are
>> lots of things that the SPIR-V has which you wouldn't want to support
>> natively in hardware such as structures and arrays in SSA values or complex
>> trig ops like atan2().  Part of the purpose of NIR is to lower these things
>> to simpler constructs which are supported in native hardware.
>>
>> --Jason
>>
>> On Wed, Jan 19, 2022 at 7:17 PM Abel Bernabeu <
>> abel.berna...@esperantotech.com> wrote:
>>
>>> Hi,
>>>
>>> My name is Abel Bernabeu and I currently chair the Graphics and ML Special
>>> Interest Group within RISC-V.
>>>
>>> As part of my work for RISC-V I am currently looking at what is needed
>>> for supporting a graphics product that uses a (potentially extended) RISC-V
>>> ISA for its shading cores. My initial focus has been on analyzing the
>>> functional gap between RISC-V and SPIR-V, assuming that whatever is needed
>>> for a modern graphics accelerator is inevitably present on SPIR-V.
>>>
>>> Now, the thing is that most of the potential adopters on our committee
>>> will likely be interested in using mesa for developing their drivers and
>>> that means using NIR as intermediate representation. Thus, I also need to
>>> consider NIR when looking at the functional gap, doubling the amount of
>>> work