Samuel Iglesias Gonsálvez <[email protected]> writes:

> On 14/05/16 01:16, Francisco Jerez wrote:
>> Samuel Iglesias Gonsálvez <[email protected]> writes:
>>
>>> From: Iago Toral Quiroga <[email protected]>
>>>
>>> UBO loads with constant offset use the UNIFORM_PULL_CONSTANT_LOAD
>>> instruction, which reads 16 bytes (a vec4) of data from memory. For dvec
>>> types this only provides components x and y. Thus, if we are reading
>>> more than 2 components we need to issue a second load at offset+16 to
>>> read the next 16-byte chunk with components z and w.
>>>
>>> UBO loads with non-constant offset emit a load for each component
>>> in the vector (and rely on CSE to remove redundant loads), so we only
>>> need to consider the size of the data type when computing the offset
>>> of each element in a vector.
>>>
>>> v2 (Sam):
>>> - Adapt the code to use component() (Curro).
>>>
>>> Signed-off-by: Samuel Iglesias Gonsálvez <[email protected]>
>>> Reviewed-by: Kenneth Graunke <[email protected]>
>>> ---
>>>  src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 52 +++++++++++++++++++++++++++-----
>>>  1 file changed, 45 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> index 2d57fd3..02f1e81 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> @@ -3362,6 +3362,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>>                               nir->info.num_ubos - 1);
>>>        }
>>>
>>> +      /* Number of 32-bit slots in the type */
>>> +      unsigned type_slots = MAX2(1, type_sz(dest.type) / 4);
>>> +
>>>        nir_const_value *const_offset = nir_src_as_const_value(instr->src[1]);
>>>        if (const_offset == NULL) {
>>>           fs_reg base_offset = retype(get_nir_src(instr->src[1]),
>>> @@ -3369,19 +3372,54 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>>
>>>           for (int i = 0; i < instr->num_components; i++)
>>>              VARYING_PULL_CONSTANT_LOAD(bld, offset(dest, bld, i), surf_index,
>>> -                                       base_offset, i * 4);
>>> +                                       base_offset, i * 4 * type_slots);
>>
>> Why not 'i * type_sz(...)'?  As before it seems like type_slots is just
>> going to introduce rounding errors here for no benefit?
>>
>
> Right, I will fix it.
>
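(To make the constant-offset hunk below more concrete: a hypothetical
std140 shader -- illustration only, not taken from this patch or its
tests -- that triggers the dvec4 split described in the commit message:

| #version 400 core
| layout(std140) uniform U {
|    dvec4 v;   /* bytes 0-15 hold v.xy, bytes 16-31 hold v.zw */
| };
| out vec4 color;
| void main()
| {
|    /* Using all four components needs both 16-byte chunks, i.e. two
|     * UNIFORM_PULL_CONSTANT_LOAD messages in the hunk below.
|     */
|    color = vec4(vec2(v.xy), vec2(v.zw));
| }

With v bound at offset 0, the x/y half comes from the first 16B fetch
and the z/w half from a second fetch at offset 16.)
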
>>>       } else {
>>> +         /* Even if we are loading doubles, a pull constant load will load
>>> +          * a 32-bit vec4, so we should only reserve vgrf space for that. If we
>>> +          * need to load a full dvec4 we will have to emit 2 loads. This is
>>> +          * similar to demote_pull_constants(), except that in that case we
>>> +          * see individual accesses to each component of the vector and then
>>> +          * we let CSE deal with duplicate loads. Here we see a vector access
>>> +          * and we have to split it if necessary.
>>> +          */
>>>          fs_reg packed_consts = vgrf(glsl_type::float_type);
>>>          packed_consts.type = dest.type;
>>>
>>> -         struct brw_reg const_offset_reg = brw_imm_ud(const_offset->u32[0] & ~15);
>>> -         bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD, packed_consts,
>>> -                  surf_index, const_offset_reg);
>>> +         unsigned const_offset_aligned = const_offset->u32[0] & ~15;
>>> +
>>> +         /* A vec4 only contains half of a dvec4; if we need more than 2
>>> +          * components of a dvec4 we will have to issue another load for
>>> +          * components z and w.
>>> +          */
>>> +         int num_components;
>>> +         if (type_slots == 1)
>>> +            num_components = instr->num_components;
>>> +         else
>>> +            num_components = MIN2(2, instr->num_components);
>>>
>>> -         const fs_reg consts = byte_offset(packed_consts, const_offset->u32[0] % 16);
>>> +         int remaining_components = instr->num_components;
>>> +         while (remaining_components > 0) {
>>> +            /* Read the vec4 from a 16-byte aligned offset */
>>> +            struct brw_reg const_offset_reg = brw_imm_ud(const_offset_aligned);
>>> +            bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
>>> +                     retype(packed_consts, BRW_REGISTER_TYPE_F),
>>> +                     surf_index, const_offset_reg);
>>>
>>> -         for (unsigned i = 0; i < instr->num_components; i++)
>>> -            bld.MOV(offset(dest, bld, i), component(consts, i));
>>> +            const fs_reg consts = byte_offset(packed_consts,
>>> +                                              (const_offset->u32[0] % 16));
>>
>> This looks really fishy to me: if the initial offset is not 16B aligned
>> you'll apply the same sub-16B offset to the result from each one of the
>> subsequent pull constant loads.
>
> This cannot happen thanks to the layout alignment rules, see below.
>
>> Also you don't seem to take into
>> account whether the initial offset is misaligned in the calculation of
>> num_components -- if it is, it looks like the first pull constant load
>> could return less than "num_components" usable components and you would
>> end up reading past the end of the return payload of the message.
>>
> This is not a problem thanks to the std140 layout alignment rules. The
> only case where we have a 16B misalignment is when we have a bunch of
> scalars (or vec2s) defined together.
>
> In the case of scalars, we are going to pick only one element each time,
> so there is no problem. In the case of vec2s, the second vec2 is at
> offset 8B and we only read two components in total, so there is no second
> pull constant load.
>
> When we are dealing with doubles it is the same but multiplied by two:
> the double case would be like the vec2 one but reading one double. dvec2
> is 16B aligned; dvec3 and dvec4 are 32B aligned. Everything else is
> aligned to 16B (the vec4 size).
>
> The std430 layout doesn't apply to UBOs, although it would not be a
> problem if it did in the future. In Mesa, the rest of the layouts
> (shared, packed) follow std140's alignment rules.
>
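(Concretely, those alignment rules give std140 offsets like the following
-- a hypothetical block for illustration, assuming a #version 400 shader
as in the earlier sketch, not code from the patch:

| layout(std140) uniform Block {
|    float  a;   /* offset  0                                      */
|    float  b;   /* offset  4: second scalar in the same 16B slot  */
|    double c;   /* offset  8: mid-slot, the vec2 case "times two" */
|    vec2   d;   /* offset 16                                      */
|    vec2   e;   /* offset 24: the second vec2 at 8B               */
|    dvec2  f;   /* offset 32: 16B aligned                         */
|    dvec3  g;   /* offset 64: 32B aligned, same as dvec4          */
| };

Only scalars, vec2s and lone doubles ever start mid-slot, and none of
those needs more than the remainder of its 16B slot, so a second load
never has to deal with a misaligned base.)
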
Oh, wow, that's making quite some assumptions about the behavior of the
front-end. It sounds like we could use some assertions verifying that the
assumptions you've made about the upper layers don't accidentally break,
e.g. somewhere close to the num_components computation:

| /* The computation of num_components doesn't take into account
|  * misalignment, which should be okay according to std140 vector
|  * alignment rules.
|  */
| assert(const_offset->u32[0] % 16 +
|        type_sz(dest.type) * num_components <= 16);

and inside the remaining_components while loop:

| /* XXX - This doesn't update the sub-16B offset across iterations of
|  * the loop, which should work for std140 vector alignment rules.
|  */
| assert(dest_offset == 0 || const_offset->u32[0] % 16 == 0);

Honestly, none of these assumptions seems necessary or particularly
helpful for the logic above -- explaining and verifying them seems almost
more effort (and is certainly more confusing) than handling arbitrary
offsets consistently. But because Ken seems rather impatient to get FP64
support upstream, I suggest you just add the two assertions above and my:

Reviewed-by: Francisco Jerez <[email protected]>

to this patch and push it.

As a follow-up clean-up you'd probably be able to both simplify the
iteration logic and handle arbitrary alignment with less than half the
code by doing something like:

| for (unsigned c = 0; c < total_components;) {
|    const fs_reg packed_consts = bld.vgrf(BRW_REGISTER_TYPE_F);
|    const unsigned base = base_offset_from_the_intrinsic + c * type_size;
|
|    /* Number of usable components we get from the next 16B-aligned load. */
|    const unsigned count = MIN2(total_components - c,
|                                (16 - base % 16) / type_size);
|
|    bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
|             packed_consts, surf_index, brw_imm_ud(base & ~15));
|
|    const fs_reg consts = retype(byte_offset(packed_consts, base & 15),
|                                 dest.type);
|
|    for (unsigned d = 0; d < count; d++)
|       bld.MOV(offset(dest, bld, c + d), component(consts, d));
|
|    c += count;
| }

This gets rid of two of the three induction variables you use in the
while loop. The second restriction disappears because you always have the
*actual* offset in bytes of the next constant to fetch available, and the
first restriction is handled by a single '- base % 16' term in the
definition of 'count'.  (E.g. for a hypothetical dvec3 starting at byte 8
of a 16B slot, the first iteration fetches count = (16 - 8) / 8 = 1
component and the second fetches the remaining two components from the
next 16B-aligned block.)

Another possibility to consider is to ditch all of the above and extend
the generator so it's able to fetch a whole 32B block of constants at a
time with a single message, which would let you handle this for 64-bit
types in exactly the same way that the back-end currently handles 32-bit
constant loads, with a single pull load per dvec4.

> Sam
>
>>> +            unsigned dest_offset = instr->num_components - remaining_components;
>>> +
>>> +            for (int i = 0; i < num_components; i++)
>>> +               bld.MOV(offset(dest, bld, i + dest_offset), component(consts, i));
>>> +
>>> +            /* If this is a large enough 64-bit load, we will need to emit
>>> +             * another message.
>>> +             */
>>> +            remaining_components -= num_components;
>>> +            assert(remaining_components == 0 ||
>>> +                   (remaining_components <= 2 && type_slots == 2));
>>> +            num_components = remaining_components;
>>> +            const_offset_aligned += 16;
>>> +         }
>>>       }
>>>       break;
>>>    }
>>> --
>>> 2.5.0
_______________________________________________
mesa-dev mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
