fs: Add reuse_16bit_conversions_register optimization

Francisco Jerez Thu, 07 Sep 2017 17:59:19 -0700

Jason Ekstrand <[email protected]> writes:

> On Wed, Sep 6, 2017 at 3:58 PM, Chema Casanova <[email protected]>
> wrote:
>
>> Hi Connor and Curro,
>>
>> On 28/08/17 12:24, Alejandro Piñeiro wrote:
>> > On 27/08/17 20:24, Connor Abbott wrote:
>> >> Hi,
>> >>
>> >> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <[email protected]
>> >> <mailto:[email protected]>> wrote:
>> >>
>> >>     On 24/08/17 21:07, Connor Abbott wrote:
>> >>     >
>> >>     > Hi Alejandro,
>> >>
>> >>     Hi Connor,
>> >>
>> >>     >
>> >>     > This seems really suspicious. If the live ranges are really
>> >>     > independent, then the register allocator should be able to
>> >>     assign the
>> >>     > two virtual registers to the same physical register if it needs
>> to.
>> >>
>> >>     Yes, it is true, the register allocator should be able to assign two
>> >>     virtual registers to the same physical register. But that is done
>> >>     at the
>> >>     end (or really near the end), so late for the problem this
>> >>     optimization
>> >>     is trying to fix.
>> >>
>> >>
>> >> Well, my understanding is that the problem is long compilation times
>> >> due to spilling and our not-so-great implementation of it. So no,
>> >> register allocation is not late for the problem. As both Curro and I
>> >> explained, the change by itself can only pessimise register
>> >> allocation, so if it helps then it must be due to a bug in the
>> >> register allocator or a problem in a subsequent pass that's getting
>> >> hidden by this one.
>> >
>> > Ok.
>> >
>> >>
>> >>     We are also reducing the amount of instructions used.
>> >>
>> >>
>> >> The comments in the source code say otherwise. Any instructions
>> >> eliminated were from spilling, which this pass only accidentally
>> reduces.
>> >
>> > Yes, sorry, I explained myself poorly. The optimization itself doesn't
>> > remove any instructions. But using it reduces the final number of
>> > instructions, although as you say, they are likely due reducing the
>> > spilling.
>> >
>> >>
>> >>
>> >>
>> >>     Probably not really clear on the commit message. When I say
>> >>     "reduce the
>> >>     pressure of the register allocator" I mean having a code that the
>> >>     register allocator would be able to handle without using too much
>> >>     time.
>> >>     The problem this optimization tries to solve is that for some 16
>> >>     bit CTS
>> >>     tests (some with matrices and geometry shaders), the amount of
>> virtual
>> >>     registers used and instructions was really big. For the record,
>> >>     initially, some tests needed 24 min just to compile. Right now,
>> thanks
>> >>     to other optimizations, the slower test without this optimization
>> >>     needs
>> >>     1min 30 seconds. Adding some hacky timestamps, the time used  at
>> >>     fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>> >>
>> >>     * While trying to schedule using the three available pre mode
>> >>     heuristics: 7 seconds
>> >>     * Allocation with spilling: 63 seconds
>> >>     * Final schedule using SCHEDULE_POST: 19 seconds
>> >>
>> >>     With this optimization, the total time goes down to 14 seconds (10
>> >>     + 0 +
>> >>     3 on the previous bullet point list).
>> >>
>> >>     One could argue that 1min 30 seconds is okish. But taking into
>> account
>> >>     that it goes down to 14 seconds, even with some caveats (see
>> below), I
>> >>     still think that it is worth to use the optimization.
>> >>
>> >>     And a final comment. For that same test, this is the final stats
>> >>     (using
>> >>     INTEL_DEBUG):
>> >>
>> >>      * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
>> >>     130320 cycles. 15:9 spills:fills.
>> >>      * Without the optimization: SIMD8 shader: 12312 instructions. 0
>> >>     loops.
>> >>     174816 cycles. 751:1851 spills:fills.
>> >>
>> >>
>> >> So, the fact that it helps at all with SIMD8 shows that my theory is
>> >> wrong, but since your pass reduces spilling, it clearly must be
>> >> avoiding a bug somewhere else. You need to compare the IR for a shader
>> >> with the problem with and without this pass right before register
>> >> allocation. Maybe the sources and destinations of the conversion
>> >> instructions interfere without the change due to some other pass
>> >> that's increasing register pressure, in which case that's the problem,
>> >> but I doubt it.
>> >
>> > Ok, thanks for the hints.
>>
>> After some research we found that we need to adapt the live_variables
>> algorithm to support 32 to 16-bit conversions. Because of the HW
>> alignment restrictions these conversions need that the result register
>> uses stride=2, so it is not continuous (stride!=1) so by definition
>> is_partial_write returns true. Any of the next last 3 conditions could
>> be true when we use 16-bit types.
>>
>> bool
>> fs_inst::is_partial_write() const
>> {
>>    return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
>>            (this->exec_size * type_sz(this->dst.type)) < 32 ||
>>            !this->dst.is_contiguous() ||
>>            this->dst.offset % REG_SIZE != 0);
>> }
>>
>> So at the check on the setup_one_write function at
>> brw_fs_live_variables.cpp the variable isn't marked as defined
>> completely in the block.
>>
>>    if (inst->dst.file == VGRF && !inst->is_partial_write()) {
>>       if (!BITSET_TEST(bd->use, var))
>>          BITSET_SET(bd->def, var);
>>    }
>>
>> That makes that the live start of the variable is expected to defined
>> before the beginning of the block. In the commented test by Alejandro we
>> have a big unrolled loop that increases the pressure of the register
>> artificially making all result registers of a conversion to start life
>> at instruction 0.
>>
>
> Quick drive-by comment.  Curro and I talked about this quite a bit in the
> office yesterday as I've been hitting similar issues with the vec2->double
> conversion that we do when loading 64-bit values from UBOs and SSBOs.  He's
> got a patch to liveness which makes registers which are only ever partially
> written not extned their live ranges infinitely upward.  That's probably
> easier than trying to track things per-component.
>


FTR you can find my liveness analysis patch here:

https://lists.freedesktop.org/archives/mesa-dev/2017-September/168974.html

Chema/Alejandro, let me know if it doesn't have the intended effect on
shaders that use FP16, I haven't tested it in combination with your
series.

Thanks.

> --Jason
>
>
>> A simpler approach to the submitted optimization would be to consider as
>> a fully defined variable in a block the result of a 32-16 bit
>> conversion. This way the register allocator has more freedom to assign
>> the register because the live_interval is correct for this variable.
>> Something like the following code:
>>
>> diff --git a/src/intel/compiler/brw_fs_live_variables.cpp
>> b/src/intel/compiler/brw_fs_live_variables.cpp
>> index c449672a51..98d8583eaa 100644
>> --- a/src/intel/compiler/brw_fs_live_variables.cpp
>> +++ b/src/intel/compiler/brw_fs_live_variables.cpp
>> @@ -83,7 +83,10 @@ fs_live_variables::setup_one_write(struct block_data
>> *bd, fs_inst *inst,
>>     /* The def[] bitset marks when an initialization in a block completely
>>      * screens off previous updates of that variable (VGRF channel).
>>      */
>> -   if (inst->dst.file == VGRF && !inst->is_partial_write()) {
>> +   if (inst->dst.file == VGRF && (!inst->is_partial_write() ||
>> +                                  ((type_sz(inst->dst.type) == 2) &&
>> +                                   (inst->opcode == BRW_OPCODE_MOV) &&
>> +                                   (type_sz(inst->src[0].type) == 4)))) {
>>        if (!BITSET_TEST(bd->use, var))
>>           BITSET_SET(bd->def, var);
>>     }
>>
>> This new code reduces the spills from 1217 to 1089 in our worse tests
>> compared to using the submitted optimization and without regressions.
>>
>> Although this is a simple fix that address the problem with the
>> conversions (that is the main use that 16bit arithmetic of the
>> VK_KHR_16bit_storage extension), there are other cases that we didn't
>> addressed with the reuse_16bit_conversions_register that need a more
>> complex approach.
>>
>> I'm thinking about the cases when 16-bit types are packed/unpacked or
>> shuffled/unshuffled in the storage operations. They are also
>> partial_writes but the aggregation of several operations make that the
>> register could be considered as completely defined in a block.
>>
>> - This would need doing a "subregister" tracking for stride=2 and
>> offset=[0,2] for shuffled components that are written using 32-bit
>> storage operations.
>> - And the same for packet 16-bit values on SIMD8 that use half register
>> continuously.
>>
>> So if only half of the register is defined for a given variable and only
>> that half is read by the shader we can consider consistent that the
>> variable is completely defined. But in the case that the different
>> halves are defined in different blocks things get complicated but could
>> stay as partial write.
>>
>> Does it makes sense to go deeper with an approach like this one? Maybe
>> is not strictly needed for this series but it could be an interesting
>> follow up for adjusting the liveness of the payloads of the 16-bit store
>> operations and any future use of 16-bit types.
>>
>> Thanks in advance for your feedback.
>>
>> Chema
>>
>> >> (IIRC there's a debug option to print out the register pressure for
>> >> each instruction, which will be helpful here).
>> >
>> > I tried to use that option back when I was working on this patch,
>> > without too much luck. Will try again.
>> >
>> >> If that's not the case, you should check to see if the register
>> >> allocator thinks the sources and destinations of the conversion
>> >> instructions interfere, and if so figure out why - that will likely be
>> >> the real bug.
>> >
>> > Ok.
>> >
>> > Thanks Connor and Curro for the comments. I will work on a different
>> > solution.
>>
>> _______________________________________________
>> mesa-dev mailing list
>> [email protected]
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
> _______________________________________________
> mesa-dev mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev

signature.asc
Description: PGP signature

_______________________________________________
mesa-dev mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

Reply via email to