Jason Ekstrand <[email protected]> writes: > On Wed, Sep 6, 2017 at 3:58 PM, Chema Casanova <[email protected]> > wrote: > >> Hi Connor and Curro, >> >> On 28/08/17 12:24, Alejandro Piñeiro wrote: >> > On 27/08/17 20:24, Connor Abbott wrote: >> >> Hi, >> >> >> >> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <[email protected] >> >> <mailto:[email protected]>> wrote: >> >> >> >> On 24/08/17 21:07, Connor Abbott wrote: >> >> > >> >> > Hi Alejandro, >> >> >> >> Hi Connor, >> >> >> >> > >> >> > This seems really suspicious. If the live ranges are really >> >> > independent, then the register allocator should be able to >> >> assign the >> >> > two virtual registers to the same physical register if it needs >> to. >> >> >> >> Yes, it is true, the register allocator should be able to assign two >> >> virtual registers to the same physical register. But that is done >> >> at the >> >> end (or really near the end), so late for the problem this >> >> optimization >> >> is trying to fix. >> >> >> >> >> >> Well, my understanding is that the problem is long compilation times >> >> due to spilling and our not-so-great implementation of it. So no, >> >> register allocation is not late for the problem. As both Curro and I >> >> explained, the change by itself can only pessimise register >> >> allocation, so if it helps then it must be due to a bug in the >> >> register allocator or a problem in a subsequent pass that's getting >> >> hidden by this one. >> > >> > Ok. >> > >> >> >> >> We are also reducing the amount of instructions used. >> >> >> >> >> >> The comments in the source code say otherwise. Any instructions >> >> eliminated were from spilling, which this pass only accidentally >> reduces. >> > >> > Yes, sorry, I explained myself poorly. The optimization itself doesn't >> > remove any instructions. But using it reduces the final number of >> > instructions, although as you say, they are likely due reducing the >> > spilling. >> > >> >> >> >> >> >> >> >> Probably not really clear on the commit message. When I say >> >> "reduce the >> >> pressure of the register allocator" I mean having a code that the >> >> register allocator would be able to handle without using too much >> >> time. >> >> The problem this optimization tries to solve is that for some 16 >> >> bit CTS >> >> tests (some with matrices and geometry shaders), the amount of >> virtual >> >> registers used and instructions was really big. For the record, >> >> initially, some tests needed 24 min just to compile. Right now, >> thanks >> >> to other optimizations, the slower test without this optimization >> >> needs >> >> 1min 30 seconds. Adding some hacky timestamps, the time used at >> >> fs_visitor::allocate_registers (brw_fs.cpp:6096) is: >> >> >> >> * While trying to schedule using the three available pre mode >> >> heuristics: 7 seconds >> >> * Allocation with spilling: 63 seconds >> >> * Final schedule using SCHEDULE_POST: 19 seconds >> >> >> >> With this optimization, the total time goes down to 14 seconds (10 >> >> + 0 + >> >> 3 on the previous bullet point list). >> >> >> >> One could argue that 1min 30 seconds is okish. But taking into >> account >> >> that it goes down to 14 seconds, even with some caveats (see >> below), I >> >> still think that it is worth to use the optimization. >> >> >> >> And a final comment. For that same test, this is the final stats >> >> (using >> >> INTEL_DEBUG): >> >> >> >> * With the optimization: SIMD8 shader: 4610 instructions. 0 loops. >> >> 130320 cycles. 15:9 spills:fills. >> >> * Without the optimization: SIMD8 shader: 12312 instructions. 0 >> >> loops. >> >> 174816 cycles. 751:1851 spills:fills. >> >> >> >> >> >> So, the fact that it helps at all with SIMD8 shows that my theory is >> >> wrong, but since your pass reduces spilling, it clearly must be >> >> avoiding a bug somewhere else. You need to compare the IR for a shader >> >> with the problem with and without this pass right before register >> >> allocation. Maybe the sources and destinations of the conversion >> >> instructions interfere without the change due to some other pass >> >> that's increasing register pressure, in which case that's the problem, >> >> but I doubt it. >> > >> > Ok, thanks for the hints. >> >> After some research we found that we need to adapt the live_variables >> algorithm to support 32 to 16-bit conversions. Because of the HW >> alignment restrictions these conversions need that the result register >> uses stride=2, so it is not continuous (stride!=1) so by definition >> is_partial_write returns true. Any of the next last 3 conditions could >> be true when we use 16-bit types. >> >> bool >> fs_inst::is_partial_write() const >> { >> return ((this->predicate && this->opcode != BRW_OPCODE_SEL) || >> (this->exec_size * type_sz(this->dst.type)) < 32 || >> !this->dst.is_contiguous() || >> this->dst.offset % REG_SIZE != 0); >> } >> >> So at the check on the setup_one_write function at >> brw_fs_live_variables.cpp the variable isn't marked as defined >> completely in the block. >> >> if (inst->dst.file == VGRF && !inst->is_partial_write()) { >> if (!BITSET_TEST(bd->use, var)) >> BITSET_SET(bd->def, var); >> } >> >> That makes that the live start of the variable is expected to defined >> before the beginning of the block. In the commented test by Alejandro we >> have a big unrolled loop that increases the pressure of the register >> artificially making all result registers of a conversion to start life >> at instruction 0. >> > > Quick drive-by comment. Curro and I talked about this quite a bit in the > office yesterday as I've been hitting similar issues with the vec2->double > conversion that we do when loading 64-bit values from UBOs and SSBOs. He's > got a patch to liveness which makes registers which are only ever partially > written not extned their live ranges infinitely upward. That's probably > easier than trying to track things per-component. >
FTR you can find my liveness analysis patch here: https://lists.freedesktop.org/archives/mesa-dev/2017-September/168974.html Chema/Alejandro, let me know if it doesn't have the intended effect on shaders that use FP16, I haven't tested it in combination with your series. Thanks. > --Jason > > >> A simpler approach to the submitted optimization would be to consider as >> a fully defined variable in a block the result of a 32-16 bit >> conversion. This way the register allocator has more freedom to assign >> the register because the live_interval is correct for this variable. >> Something like the following code: >> >> diff --git a/src/intel/compiler/brw_fs_live_variables.cpp >> b/src/intel/compiler/brw_fs_live_variables.cpp >> index c449672a51..98d8583eaa 100644 >> --- a/src/intel/compiler/brw_fs_live_variables.cpp >> +++ b/src/intel/compiler/brw_fs_live_variables.cpp >> @@ -83,7 +83,10 @@ fs_live_variables::setup_one_write(struct block_data >> *bd, fs_inst *inst, >> /* The def[] bitset marks when an initialization in a block completely >> * screens off previous updates of that variable (VGRF channel). >> */ >> - if (inst->dst.file == VGRF && !inst->is_partial_write()) { >> + if (inst->dst.file == VGRF && (!inst->is_partial_write() || >> + ((type_sz(inst->dst.type) == 2) && >> + (inst->opcode == BRW_OPCODE_MOV) && >> + (type_sz(inst->src[0].type) == 4)))) { >> if (!BITSET_TEST(bd->use, var)) >> BITSET_SET(bd->def, var); >> } >> >> This new code reduces the spills from 1217 to 1089 in our worse tests >> compared to using the submitted optimization and without regressions. >> >> Although this is a simple fix that address the problem with the >> conversions (that is the main use that 16bit arithmetic of the >> VK_KHR_16bit_storage extension), there are other cases that we didn't >> addressed with the reuse_16bit_conversions_register that need a more >> complex approach. >> >> I'm thinking about the cases when 16-bit types are packed/unpacked or >> shuffled/unshuffled in the storage operations. They are also >> partial_writes but the aggregation of several operations make that the >> register could be considered as completely defined in a block. >> >> - This would need doing a "subregister" tracking for stride=2 and >> offset=[0,2] for shuffled components that are written using 32-bit >> storage operations. >> - And the same for packet 16-bit values on SIMD8 that use half register >> continuously. >> >> So if only half of the register is defined for a given variable and only >> that half is read by the shader we can consider consistent that the >> variable is completely defined. But in the case that the different >> halves are defined in different blocks things get complicated but could >> stay as partial write. >> >> Does it makes sense to go deeper with an approach like this one? Maybe >> is not strictly needed for this series but it could be an interesting >> follow up for adjusting the liveness of the payloads of the 16-bit store >> operations and any future use of 16-bit types. >> >> Thanks in advance for your feedback. >> >> Chema >> >> >> (IIRC there's a debug option to print out the register pressure for >> >> each instruction, which will be helpful here). >> > >> > I tried to use that option back when I was working on this patch, >> > without too much luck. Will try again. >> > >> >> If that's not the case, you should check to see if the register >> >> allocator thinks the sources and destinations of the conversion >> >> instructions interfere, and if so figure out why - that will likely be >> >> the real bug. >> > >> > Ok. >> > >> > Thanks Connor and Curro for the comments. I will work on a different >> > solution. >> >> _______________________________________________ >> mesa-dev mailing list >> [email protected] >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >> > _______________________________________________ > mesa-dev mailing list > [email protected] > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/mesa-dev
