On Wed, Jul 15, 2015 at 7:49 AM, Iago Toral <ito...@igalia.com> wrote: > Hi, > > when we sent the patches for the new nir->vec4 backend we mentioned that > we had a few dEQP tests that would fail to link because of register > spilling. Now that we have added GS support we see a few instances of > this problem popping up in a few GS piglit tests too, for example this > one: > > tests/spec/glsl-1.50/execution/variable-indexing/gs-input-array-vec4-index-rd.shader_test > > I have been looking into what is going on with these tests and I came to > the conclusion that the problem is a consequence of various factors, but > probably the main thing contributing to it is the way our SSA pass > works. That said, I am not that experienced with NIR, so it could also > be that my analysis is missing something and I am just arriving to wrong > conclusions, so I'll explain my thoughts below and hopefully someone > else with more NIR experience can jump in and confirm or reject my > analysis. > > The GS code in that test looks like this: > > for (int p = 0; p < 3; p++) { > color = ((index >= ins[p].m1.length() ? > ins[p].m2[index-ins[p].m1.length()] : > ins[p].m1[index]) == expect) ? > vec4(0.0, 1.0, 0.0, 1.0) : vec4(1.0, 0.0, 0.0, 1.0); > gl_Position = gl_in[p].gl_Position; > EmitVertex(); > } > > One thing that is immediately contributing to the register pressure is > some really awful code generated because of the indirect array indexing > on the inputs inside the loop. This is because of the > lower_variable_index_to_cond_assign lowering pass called from > brw_shader.cpp. This pass will convert that color assignment into a > bunch of nested if/else statements which makes the generated GLSL IR > code rather large, involving plenty of temporaries too. This is only > made worse by the fact that loop unrolling will replicate that 3 times. > The result is a huge pile of GLSL IR with a few dozens of nested if/else > statements and temporaries that looks like [1] (that is only a fragment > of the GLSL IR). > > One thing that is particularly relevant in that code is that it has > multiple conditional assignments to the same variable > (dereference_array_value) as a consequence of this lowering pass. > > That much, however, is common to the NIR and non-NIR paths. The problem > in the NIR case is that all these assignments generate new SSA values, > which then become new registers in the final NIR form. This leads to NIR > code like [2]. In contrast, the old vec4 visitor path, is able to have > writes to the same variable write to the same register. > > As a result, if I print the code right before register allocation in the > NIR path [3] and I compare that to what we get with the old vec4 visitor > path at that same point [4], it is clearly visible that this difference > is allowing the vec4 visitor path to reduce register pressure (see how > in [4] we have multiple writes to vgrf5, while in [3] we always write to > a new vgrf every time). > > So, am I missing something or is this kind of result expected with NIR > programs? Is there anything in the nir->vec4 pass that we can do to fix > this or does this need to be fixed when going out of SSA moe inside NIR? > > Iago > > [1] http://pastebin.com/5uA8ex2S > [2] http://pastebin.com/pqLfvAVN > [3] http://pastebin.com/64nSuUH8 > [4] http://pastebin.com/WCrdYxzt > > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Hi Iago, Indeed, NIR does convert conditional writes to conditional selectss -- it's a required part of the conversion to SSA, and since our HW has a conditional select instruction that's just as fast as doing a conditional move, we haven't bothered much to try and change it back during out-of-SSA. However, doing this shouldn't make things worse. In your example, vgrf9, vgrf15, and vgrf17 all have very short live intervals and don't interfere with vgrf11 (unless there's another use of them somewhere after the snippet you pasted), which means that the register allocator is free to allocate the destinations of all the selects to the same register. What's happening, though, is that you're running into our terrible liveness analysis. After doing the proper liveness analysis, we figure out the place each register first becomes live and last becomes dead, and then we consider registers that have overlapping ranges to interfere. So we consider vgrf11 to interfere with vgrf15 and vgrf17, even though it really doesn't. The trouble with making it do the right thing is that we may actually need to extend the live ranges of registers when the exec masks don't match up, either because one uses writemask_all or because they have incompatible exec masks due to containing different datatypes (half-float vs. float, etc.). For example, in your snippet, if vgrf15 were to be written with force_writemask_all, then it actually would interfere with vgrf11. But right now we're simply being conservative and assuming everything is incompatible, which produces extra register pressure in situations like the one you have. We have the same problem with FS, and I have some idea how to make it better there, but I'm not sure if we would care enough to port it to vec4. So the long answer is that there are some unfortunate features of the backend that are making things much worse than they should be wrt register allocation, and separating things out like SSA does seems to make it worse in certain situations. The short answer is: go fix register spilling! Connor _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev