On Fri, 2016-07-08 at 09:21 +0200, Iago Toral wrote: > On Thu, 2016-07-07 at 19:36 -0700, Francisco Jerez wrote: > > > > Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: > > > > > > > > > > > From: Iago Toral Quiroga <ito...@igalia.com> > > > > > > In fp64 we can produce code like this: > > > > > > mov(16) vgrf2<2>:UD, vgrf3<2>:UD > > > > > > That our simd lowering pass would typically split in instructions > > > with a > > > width of 8, writing to two consecutive registers each. > > > Unfortunately, gen7 > > > hardware has a bug affecting execution masking and as a result, > > > the > > > second GRF register write won't work properly. Curro verified > > > this: > > > > > > "The problem is that pre-Gen8 EUs are hardwired to use the > > > QtrCtrl+1 > > > (where QtrCtrl is the 8-bit quarter of the execution mask > > > signals > > > specified in the instruction control fields) for the second > > > compressed half of any single-precision instruction (for > > > double-precision instructions it's hardwired to use NibCtrl+1), > > > which means that the EU will apply the wrong execution controls > > > for the second sequential GRF write if the number of channels > > > per > > > GRF is not exactly eight in single-precision mode (or four in > > > double-float mode)." > > > > > > In practice, this means that we cannot write more than one > > > consecutive GRF in a single instruction if the number of channels > > > per GRF is not exactly eight in single-precision mode (or four > > > in double-float mode). > > > > > > This patch makes our SIMD lowering pass split this kind of > > > instructions > > > so that the split versions only write to a single register. In > > > the > > > example above this means that we split the write in 4 > > > instructions, > > > each > > > one writing 4 UD elements (width = 4) to a single register. > > > --- > > > src/mesa/drivers/dri/i965/brw_fs.cpp | 20 ++++++++++++++++++++ > > > 1 file changed, 20 insertions(+) > > > > > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp > > > b/src/mesa/drivers/dri/i965/brw_fs.cpp > > > index 2f473cc..caf88d1 100644 > > > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp > > > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp > > > @@ -4677,6 +4677,26 @@ static unsigned > > > get_fpu_lowered_simd_width(const struct brw_device_info > > > *devinfo, > > > const fs_inst *inst) > > > { > > > + /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where > > > QtrCtrl is > > > + * the 8-bit quarter of the execution mask signals specified > > > in > > > the > > > + * instruction control fields) for the second compressed half > > > of any > > > + * single-precision instruction (for double-precision > > > instructions > > > + * it's hardwired to use NibCtrl+1), which means that the EU > > > will > > When I found out the hardware issue you describe in this comment I > > only > > had a HSW at hand, so I looked into this again today in order to > > verify > > whether IVB/VLV behave in the same way, and I'm afraid they > > don't... I > > haven't tried it on real IVB hardware, but at least the simulator > > behaves the same as HSW for single precision execution types, while > > for > > double precision types instruction decompression seems to be > > completely > > busted. AFAICT it applies the same channel enable signals to both > > halves of the compressed instruction which will be just wrong under > > non-uniform control flow. Can you clarify in the comment above > > that > > the > > text in parentheses referring to double-precision instructions may > > only > > apply to HSW? > Sure. > > > > > Have you been able to get any of the FP64 non-uniform control flow > > tests > > to pass on IVB? If you have I guess this may be a simulator-only > > bug, > > although I'm not sure the FS tests you have written will be non- > > uniform > > enough to reproduce the issue. If you haven't, we may have to add > > another check here in order to lower all non-force_writemask_all DF > > instructions to SIMD4 on IVB/VLV... :( > For now we have focused on HSW only but thanks for the update > regarding > IVB. I'll make sure to pay special attention to non-uniform control- > flow cases there when we start testing things there (we intend to > start > putting effort on IVB in the next days). > > I suppose that for now we can leave this patch as gen7 specific > (rather > than HSW specific) and fix it up if needed for IVB when we actually > verify that it needs extra work there. Does that sound like a good > plan? > > > > > > > > > > > > + * apply the wrong execution controls for the second > > > sequential > > > GRF > > > + * write if the number of channels per GRF is not exactly > > > eight > > > in > > > + * single-precision mode (or four in double-float mode). > > > + * > > > + * In this situation we calculate the maximum size of the > > > split > > > + * instructions so they only ever write to a single register. > > > + */ > > > + unsigned type_size = type_sz(inst->dst.type); > > > + unsigned channels_per_grf = inst->exec_size / inst- > > > > > > > > regs_written; > > This will cause a division by zero if the instruction doesn't write > > any > > registers. Strictly speaking you'd need to check the source types > > too > > in order to find out whether the instruction is compressed... > Aha, makes sense, so I check the dst and sources and pick the largest > to figure out the actual size of the instruction. > > > > > > > > > > > > + assert(channels_per_grf > 0); > > > + if (devinfo->gen < 8 && inst->regs_written > 1 && > > > + channels_per_grf != REG_SIZE / type_size) { > > I believe the hardware is more stupid than that, it doesn't really > > calculate the number of components that fit in a single GRF and > > then > > shifts QtrCtrl based on that, but rather it's hardwired to shift by > > four > > channels in DF mode (at least on HSW) or by eight channels for any > > other > > execution type, so you need to find out what the execution type of > > the > > instruction is (which is not necessarily the same as the > > destination > > type). > Ok, then summing up, the process would be like this: > > Split gen7 instructions that write more than 1 register and have a DF > execution type (i.e. either the dst or any of the sources are DF) and > have a stride that is not exactly 1 on either the DST (if DF) or any > of > the DF sources (with the caveat that for IVB we might need to do > something different in the end).
Well, not DF sources/dst but any source/dst that has a stride other than 1 since the problem happens when we don't use all the channels available in the register for either 32b or 64b types. > Does that sound right to you? > > > > > > > > > > > > + return channels_per_grf; > > For this to interact nicely with the other restrictions implemented > > in > > the same function in case several of them ever apply at the same > > time, > > move the check down and have it assign 'max_width = MIN2(max_width, > > channels_per_grf)' instead of the early return. > > Ah, good point. Will do. > > Thanks Curro! > > > > > > > > > > > > + } > > > + > > > /* Maximum execution size representable in the instruction > > > controls. */ > > > unsigned max_width = MIN2(32, inst->exec_size); > > > > > > _______________________________________________ > > > mesa-dev mailing list > > > mesa-dev@lists.freedesktop.org > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/mesa-dev _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev