gen7: split instructions that run into exec masking bugs

Francisco Jerez Thu, 07 Jul 2016 19:39:09 -0700

Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes:

> From: Iago Toral Quiroga <ito...@igalia.com>
>
> In fp64 we can produce code like this:
>
> mov(16) vgrf2<2>:UD, vgrf3<2>:UD
>
> That our simd lowering pass would typically split in instructions with a
> width of 8, writing to two consecutive registers each. Unfortunately, gen7
> hardware has a bug affecting execution masking and as a result, the
> second GRF register write won't work properly. Curro verified this:
>
> "The problem is that pre-Gen8 EUs are hardwired to use the QtrCtrl+1
>  (where QtrCtrl is the 8-bit quarter of the execution mask signals
>  specified in the instruction control fields) for the second
>  compressed half of any single-precision instruction (for
>  double-precision instructions it's hardwired to use NibCtrl+1),
>  which means that the EU will apply the wrong execution controls
>  for the second sequential GRF write if the number of channels per
>  GRF is not exactly eight in single-precision mode (or four in
>  double-float mode)."
>
> In practice, this means that we cannot write more than one
> consecutive GRF in a single instruction if the number of channels
> per GRF is not exactly eight in single-precision mode (or four
> in double-float mode).
>
> This patch makes our SIMD lowering pass split this kind of instructions
> so that the split versions only write to a single register. In the
> example above this means that we split the write in 4 instructions, each
> one writing 4 UD elements (width = 4) to a single register.
> ---
>  src/mesa/drivers/dri/i965/brw_fs.cpp | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
> b/src/mesa/drivers/dri/i965/brw_fs.cpp
> index 2f473cc..caf88d1 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> @@ -4677,6 +4677,26 @@ static unsigned
>  get_fpu_lowered_simd_width(const struct brw_device_info *devinfo,
>                             const fs_inst *inst)
>  {
> +   /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where QtrCtrl is
> +    * the 8-bit quarter of the execution mask signals specified in the
> +    * instruction control fields) for the second compressed half of any
> +    * single-precision instruction (for double-precision instructions
> +    * it's hardwired to use NibCtrl+1), which means that the EU will


When I found out the hardware issue you describe in this comment I only
had a HSW at hand, so I looked into this again today in order to verify
whether IVB/VLV behave in the same way, and I'm afraid they don't...  I
haven't tried it on real IVB hardware, but at least the simulator
behaves the same as HSW for single precision execution types, while for
double precision types instruction decompression seems to be completely
busted.  AFAICT it applies the same channel enable signals to both
halves of the compressed instruction which will be just wrong under
non-uniform control flow.  Can you clarify in the comment above that the
text in parentheses referring to double-precision instructions may only
apply to HSW?

Have you been able to get any of the FP64 non-uniform control flow tests
to pass on IVB?  If you have I guess this may be a simulator-only bug,
although I'm not sure the FS tests you have written will be non-uniform
enough to reproduce the issue.  If you haven't, we may have to add
another check here in order to lower all non-force_writemask_all DF
instructions to SIMD4 on IVB/VLV...  :(

> +    * apply the wrong execution controls for the second sequential GRF
> +    * write if the number of channels per GRF is not exactly eight in
> +    * single-precision mode (or four in double-float mode).
> +    *
> +    * In this situation we calculate the maximum size of the split
> +    * instructions so they only ever write to a single register.
> +    */
> +   unsigned type_size = type_sz(inst->dst.type);
> +   unsigned channels_per_grf = inst->exec_size / inst->regs_written;

This will cause a division by zero if the instruction doesn't write any
registers.  Strictly speaking you'd need to check the source types too
in order to find out whether the instruction is compressed...

> +   assert(channels_per_grf > 0);
> +   if (devinfo->gen < 8 && inst->regs_written > 1 &&
> +       channels_per_grf != REG_SIZE / type_size) {

I believe the hardware is more stupid than that, it doesn't really
calculate the number of components that fit in a single GRF and then
shifts QtrCtrl based on that, but rather it's hardwired to shift by four
channels in DF mode (at least on HSW) or by eight channels for any other
execution type, so you need to find out what the execution type of the
instruction is (which is not necessarily the same as the destination
type).

> +      return channels_per_grf;

For this to interact nicely with the other restrictions implemented in
the same function in case several of them ever apply at the same time,
move the check down and have it assign 'max_width = MIN2(max_width,
channels_per_grf)' instead of the early return.

> +   }
> +
>     /* Maximum execution size representable in the instruction controls. */
>     unsigned max_width = MIN2(32, inst->exec_size);
>  
> -- 
> 2.7.4
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev

signature.asc
Description: PGP signature

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 3/6] i965/fs/gen7: split instructions that run into exec masking bugs

Reply via email to