Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: > From: Iago Toral Quiroga <ito...@igalia.com> > > In fp64 we can produce code like this: > > mov(16) vgrf2<2>:UD, vgrf3<2>:UD > > That our simd lowering pass would typically split in instructions with a > width of 8, writing to two consecutive registers each. Unfortunately, gen7 > hardware has a bug affecting execution masking and as a result, the > second GRF register write won't work properly. Curro verified this: > > "The problem is that pre-Gen8 EUs are hardwired to use the QtrCtrl+1 > (where QtrCtrl is the 8-bit quarter of the execution mask signals > specified in the instruction control fields) for the second > compressed half of any single-precision instruction (for > double-precision instructions it's hardwired to use NibCtrl+1), > which means that the EU will apply the wrong execution controls > for the second sequential GRF write if the number of channels per > GRF is not exactly eight in single-precision mode (or four in > double-float mode)." > > In practice, this means that we cannot write more than one > consecutive GRF in a single instruction if the number of channels > per GRF is not exactly eight in single-precision mode (or four > in double-float mode). > > This patch makes our SIMD lowering pass split this kind of instructions > so that the split versions only write to a single register. In the > example above this means that we split the write in 4 instructions, each > one writing 4 UD elements (width = 4) to a single register. > --- > src/mesa/drivers/dri/i965/brw_fs.cpp | 20 ++++++++++++++++++++ > 1 file changed, 20 insertions(+) > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp > b/src/mesa/drivers/dri/i965/brw_fs.cpp > index 2f473cc..caf88d1 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp > @@ -4677,6 +4677,26 @@ static unsigned > get_fpu_lowered_simd_width(const struct brw_device_info *devinfo, > const fs_inst *inst) > { > + /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where QtrCtrl is > + * the 8-bit quarter of the execution mask signals specified in the > + * instruction control fields) for the second compressed half of any > + * single-precision instruction (for double-precision instructions > + * it's hardwired to use NibCtrl+1), which means that the EU will
When I found out the hardware issue you describe in this comment I only had a HSW at hand, so I looked into this again today in order to verify whether IVB/VLV behave in the same way, and I'm afraid they don't... I haven't tried it on real IVB hardware, but at least the simulator behaves the same as HSW for single precision execution types, while for double precision types instruction decompression seems to be completely busted. AFAICT it applies the same channel enable signals to both halves of the compressed instruction which will be just wrong under non-uniform control flow. Can you clarify in the comment above that the text in parentheses referring to double-precision instructions may only apply to HSW? Have you been able to get any of the FP64 non-uniform control flow tests to pass on IVB? If you have I guess this may be a simulator-only bug, although I'm not sure the FS tests you have written will be non-uniform enough to reproduce the issue. If you haven't, we may have to add another check here in order to lower all non-force_writemask_all DF instructions to SIMD4 on IVB/VLV... :( > + * apply the wrong execution controls for the second sequential GRF > + * write if the number of channels per GRF is not exactly eight in > + * single-precision mode (or four in double-float mode). > + * > + * In this situation we calculate the maximum size of the split > + * instructions so they only ever write to a single register. > + */ > + unsigned type_size = type_sz(inst->dst.type); > + unsigned channels_per_grf = inst->exec_size / inst->regs_written; This will cause a division by zero if the instruction doesn't write any registers. Strictly speaking you'd need to check the source types too in order to find out whether the instruction is compressed... > + assert(channels_per_grf > 0); > + if (devinfo->gen < 8 && inst->regs_written > 1 && > + channels_per_grf != REG_SIZE / type_size) { I believe the hardware is more stupid than that, it doesn't really calculate the number of components that fit in a single GRF and then shifts QtrCtrl based on that, but rather it's hardwired to shift by four channels in DF mode (at least on HSW) or by eight channels for any other execution type, so you need to find out what the execution type of the instruction is (which is not necessarily the same as the destination type). > + return channels_per_grf; For this to interact nicely with the other restrictions implemented in the same function in case several of them ever apply at the same time, move the check down and have it assign 'max_width = MIN2(max_width, channels_per_grf)' instead of the early return. > + } > + > /* Maximum execution size representable in the instruction controls. */ > unsigned max_width = MIN2(32, inst->exec_size); > > -- > 2.7.4 > > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev