After Igalia's work on SPIR-V 16-bit storage, the question arose of how much more is needed on top of it in order to optimize GLES lowp/mediump with 16-bit floats. I took glb 2.7 trex as a target and started drafting a GLSL lowering pass that re-types mediump floats into float16. In parallel, I added the equivalent support bit by bit into the GLSL -> NIR pass and into the Intel compiler backend.

This series enables lowering for fragment shaders only. This was sufficient for trex, which doesn't use mediump precision for vertex shaders.

First of all, this is not complete work. I'd like to think of it more as an attempt to give an idea of what is currently missing and, by giving concrete (if not ideal) solutions, to make each case a little clearer.

On SKL this runs trex pretty much on par with 32-bit. Intel hardware doesn't have native support for linear interpolation using 16-bit floats, and therefore pln() and lrp() incur additional moves from 16 bits to 32 bits (and vice versa). Both can be replaced relatively efficiently using mad() later on. Comparing shader dumps between 16-bit and 32-bit indicates that all optimization passes kick in nicely (sampler-eot, mad(), etc.). The only additions are the aforementioned conversion instructions.

The series starts with miscellaneous bits needed in GLSL and NIR. This is followed by the equivalent bits in the Intel compiler backend. These are followed by changes that are subject to more debate:

1) Support for SIMD8 fp16 in liveness analysis, copy propagation, dead code elimination, etc.

In order to tell whether one instruction fully overwrites the results of another, one needs to examine how much of a register is written. Until now this has been done at the granularity of a full or partial register, i.e., there is no concept of a "full sub-region write". And until now there was no need, as all data types took 4 bytes per element, resulting in a full 32-byte register even in the case of SIMD8. Partial writes were all special and could be safely ignored in the various analysis passes.

Half precision floats, however, break this assumption. On SIMD8 a full write with 16-bit elements covers only half a register. I tried patching the different passes that examine partial writes one by one, but that started to get out of hand. Moreover, just by looking at a register's type size it is not safe to say whether it really is a full write or not.
The solution here is to store this information explicitly in the registers: I added a new member, fs_reg::pad_per_component. Subsequently patching fs_reg::component_size() to take the padding into account propagates the information to all users.

Patch 28 updates a few users to use component_size() instead of open-coded computations, 29 adds the actual support, and 30-35 update NIR -> FS to signal the padding (these are separated just for review).

It should be noted that here one deals with virtual registers. The final hardware allocator is separate, and using full registers in virtual space shouldn't prevent it from using tighter packing.

Chema, this overlaps with your work, I hope you don't mind.

2) Booleans produced from 16-bit sources.

Whereas for GLSL and for NIR booleans are just booleans, on Intel hardware they are integers, and their representation depends on how they are produced. Comparisons (flt, fge, feq and fne) with 32-bit sources produce 32-bit results (0x00000000/0xFFFFFFFF), while with 16-bit sources one gets 16-bit results (0x0000/0xFFFF).

I thought about introducing a 16-bit boolean into NIR, but that felt like too hardware-specific a thing to do. Instead I patched NIR -> FS to take the type of the producing instruction into account when setting up the SSA values. See patch 39 for the setup and patches 36-38 for consulting the backend SSA store instead of relying on NIR.

Another approach left to try is emitting additional moves into 32 bits (the same way we do for fp64). One could then add an optimization pass that removes unnecessary moves and uses strided sources instead.

3) Following up on 2), GLSL -> NIR decides to emit integer-typed and/or/xor even for originally boolean-typed logic ops. Patch 40 tries to cope with the case where the booleans are produced with non-matching precision.

4) In the Intel compiler backend and the push/pull constant setup, things rely on values being packed in 32-bit slots.
Moreover, these slots are typeless and the loader doesn't know whether it is dealing with floats or integers, let alone their precision.

Patch 42 takes the first step and simply adds type information into the backend. This is not particularly pretty, but I had to start somewhere. This allows the loader to convert float values from the 32-bit store in the core to 16 bits on the fly.

Patch 43 adjusts the compiler to use 32-bit slots. Using 16-bit slots would require substantially more work.

I think there is no question about the core using 32-bit values. And even if the values there were 16-bit, the backend would still need to know the types. My feeling is that we just need to rewrite a fair amount of the Intel push/pull constant setup.

5) Patches 44-50 are all about the GLSL lowering pass. This is really work in progress. What I have here is a crude attempt to do everything in one pass. It also has several hacks working around shortcomings in the Intel backend.

The short story is that there are quite a few things which don't have a precision, and the compiler needs to analyze expressions recursively in order to know what precision to use.

Take, for example, variables that don't have a precision but are referred to from multiple locations. These require the compiler to examine all the expressions involved and use full precision for the variable if even one of the expressions requires it. This in turn alters the requirements in the other expressions: the compiler would need to emit conversions for them. And I don't think this can be done cleanly in one pass.

I also realized that there may be cases where the compiler would need to use full precision instead of half in order to emit the most optimal code. Such shaders sound just evil and I don't even want to think about that now. There is more than enough work to get even the rules covered...

This series doesn't touch the hardware register allocator: it still allocates one full register per 16-bit float component even in the case of SIMD8.
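
To make the size arithmetic behind item 1) concrete, here is a minimal sketch in Python rather than the backend's C++. It is only a model under stated assumptions: the function names and the exact way the padding enters the size are hypothetical and not copied from the patches, and a stride of 1 is assumed throughout.

```python
REG_SIZE = 32  # bytes per register, as stated in item 1)


def component_size(type_size, width, pad_per_component=0):
    """Bytes covered by one component of a virtual register.

    Hypothetical model of a padded component size; stride 1 is assumed.
    """
    return width * type_size + pad_per_component


def covers_full_registers(type_size, width, pad_per_component=0):
    """True when one component spans whole registers only."""
    return component_size(type_size, width, pad_per_component) % REG_SIZE == 0


# SIMD8 with 4-byte elements fills a register exactly.
simd8_fp32 = component_size(4, 8)

# SIMD8 with 2-byte elements fills only half of it.
simd8_fp16 = component_size(2, 8)

# Padding the remainder makes the component register-sized again.
simd8_fp16_padded = component_size(2, 8, pad_per_component=REG_SIZE - simd8_fp16)
```

With the padding recorded in the register itself, a pass that asks only for the component size sees a full-register write and needs no fp16 special case.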
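
The boolean width mismatch from items 2) and 3) can be illustrated with plain integer arithmetic. This is a sketch only: the helper names are hypothetical, and the sign-extending widening stands in for the "additional moves into 32 bits" idea; it is an assumption, not what the patches do.

```python
MASK32 = 0xFFFFFFFF


def compare_result(is_true, src_bits):
    """Boolean as produced by a comparison on src_bits-wide sources:
    all ones at that width for true, zero for false."""
    return (1 << src_bits) - 1 if is_true else 0


def widen_bool16(value16):
    """Sign-extend a 16-bit boolean to 32 bits (assumed conversion move)."""
    signed = value16 - 0x10000 if value16 & 0x8000 else value16
    return signed & MASK32


true16 = compare_result(True, 16)  # 0xFFFF
true32 = compare_result(True, 32)  # 0xFFFFFFFF

# Mixing widths in an integer AND gives a non-canonical "true".
mixed = true16 & true32

# Inverting that value still reads as non-zero, i.e. still "true".
not_mixed = ~mixed & MASK32

# Widening the 16-bit operand first keeps the result canonical.
fixed = widen_bool16(true16) & true32
```

Here mixed is 0x0000FFFF, so a 32-bit inversion yields 0xFFFF0000 instead of zero. That is why the backend has to know which width produced each boolean.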

The patches can be found here (rebased on current master and Igalia's work):

  git://people.freedesktop.org/~tpohjola/mesa:16_bit_gles

There are also some simple shader runner tests I wrote along the way:

  git://people.freedesktop.org/~tpohjola/piglit:fp16

All feedback is very welcome. I'm prepared to keep working on this if people find it useful. Personally I'd be curious to add fp16 for pln() and lrp() and see if 16 bits could beat 32 bits performance-wise. Proper push/pull constant support is another thing on the list. A hardware register allocator with sub-register support sounds both interesting and scary.

CC: Jose Maria Casanova Crespo <jmcasan...@igalia.com>
CC: Jason Ekstrand <ja...@jlekstrand.net>
CC: Kenneth Graunke <kenn...@whitecape.org>
CC: Matt Turner <matts...@gmail.com>
CC: Ian Romanick <i...@freedesktop.org>
CC: Francisco Jerez <curroje...@riseup.net>

Topi Pohjolainen (51):
  nir: Prepare constant folding for 16-bits
  nir: Prepare constant lowering for 16-bits constants
  nir: Add 16-bit float support into algebraic opts
  glsl: Print 16-bit constants
  nir: Print 16-bit constants
  glsl: Add support for 16-bit float constants in nir-conversion
  glsl: Add conversion ops to/from 16-bit floats
  glsl: Add more conversion ops to/from 16-bit floats
  glsl: Allow 16-bit neg() and dot()
  glsl: Allow 16-bit math
  glsl: Enable 16-bit texturing in nir-conversion
  intel/compiler/disasm: Print 16-bit IMM values
  intel/compiler/disasm: Print fp16 also for sampler messages
  intel/compiler/fs: Support for dumping 16-bit IMM values
  intel/compiler: Add support for loading 16-bit constants
  intel/compiler: Move type_size_scalar() into brw_shader.cpp
  intel/compiler: Prepare for glsl mediump float uniforms
  intel/compiler: Allow 16-bit math
  intel/compiler/fs: Add helpers for 16-bit null regs
  intel/compiler/fs: Use two SIMD8 instructions for 16-bit math
  intel/compiler/fs: Use 16-bit null dest with 16-bit math
  intel/compiler/fs: Use 16-bit null dest with 16-bit compare
  intel/compiler: Prepare for 16-bit 3-src ops
  intel/compiler: Add support for negating 16-bit floats
  intel/compiler/fs: Support for combining 16-bit immediates
  intel/compiler/fs: Set 16-bit sampler return format
  intel/compiler/fs: Set tex type for generator to flag fp16
  intel/compiler/fs: Use component_size() instead of open coded
  intel/compiler/fs: Add register padding support
  intel/compiler/fs: Pad 16-bit texture return payloads
  intel/compiler/fs: Pad 16-bit output (store/fb write) payloads
  intel/compiler/fs: Pad 16-bit nir vec* components into full reg
  intel/compiler/fs: Pad 16-bit nir intrinsic dest into full reg
  intel/compiler/fs: Pad 16-bit const loads into full regs
  intel/compiler/fs: Pad 16-bit payload lowering
  intel/compiler/fs: Prepare nir_emit_if() for 16-bit sources
  intel/compiler/fs: Consider original sizes when retyping alu ops
  intel/compiler/fs: Use original reg size when retyping nir src
  intel/compiler/fs: Consider logic ops on 16-bit booleans
  intel/compiler/fs: Prepare 16-bit and/or/xor for 32-bit src
  intel/compiler/eu: Take stride into account in 16-bit ops
  i965: WIP: Support for uploading 16-bit uniforms from 32-bit store
  intel/compiler/fs: WIP: Use 32-bit slots for 16-bit uniforms
  glsl: WIP: Add lowering pass for treating mediump as float16
  glsl: Use 16-bit constants if operation is otherwise 16-bit
  glsl: Lower float conversions to mediump
  glsl: HACK: Force texture return into 16-bits
  glsl: HACK: Treat input varyings as 16-bits by conversion
  glsl: HACK: Lower builtin float outputs to 16-bits by default
  glsl: HACK: Lower all temporary float variables to 16-bits
  i965/fs: Lower gles mediump floats into 16-bits

 src/compiler/Makefile.sources | 1 +
 src/compiler/glsl/glsl_to_nir.cpp | 20 ++
 src/compiler/glsl/ir.cpp | 8 +
 src/compiler/glsl/ir_expression_operation.py | 17 +
 src/compiler/glsl/ir_optimization.h | 1 +
 src/compiler/glsl/ir_print_visitor.cpp | 1 +
 src/compiler/glsl/ir_validate.cpp | 48 ++-
 src/compiler/glsl/lower_mediump.cpp | 405 ++++++++++++++++++++++
 src/compiler/nir/nir_lower_load_const_to_scalar.c | 6 +-
 src/compiler/nir/nir_opt_constant_folding.c | 2 +
 src/compiler/nir/nir_print.c | 5 +
 src/compiler/nir/nir_search.c | 4 +
 src/intel/compiler/brw_compiler.h | 9 +
 src/intel/compiler/brw_disasm.c | 8 +-
 src/intel/compiler/brw_eu_emit.c | 27 +-
 src/intel/compiler/brw_eu_validate.c | 3 +
 src/intel/compiler/brw_fs.cpp | 103 +++---
 src/intel/compiler/brw_fs.h | 4 +-
 src/intel/compiler/brw_fs_builder.h | 37 +-
 src/intel/compiler/brw_fs_combine_constants.cpp | 84 ++++-
 src/intel/compiler/brw_fs_copy_propagation.cpp | 5 +-
 src/intel/compiler/brw_fs_generator.cpp | 10 +-
 src/intel/compiler/brw_fs_nir.cpp | 220 ++++++++++--
 src/intel/compiler/brw_fs_visitor.cpp | 1 +
 src/intel/compiler/brw_inst.h | 4 +
 src/intel/compiler/brw_ir_fs.h | 16 +
 src/intel/compiler/brw_reg_type.c | 2 +
 src/intel/compiler/brw_shader.cpp | 64 +++-
 src/intel/compiler/brw_vec4.cpp | 8 +
 src/intel/compiler/brw_vec4_gs_visitor.cpp | 8 +
 src/intel/compiler/brw_vec4_visitor.cpp | 4 +
 src/mesa/drivers/dri/i965/brw_cs.c | 2 +
 src/mesa/drivers/dri/i965/brw_curbe.c | 2 +
 src/mesa/drivers/dri/i965/brw_disk_cache.c | 14 +
 src/mesa/drivers/dri/i965/brw_gs.c | 2 +
 src/mesa/drivers/dri/i965/brw_link.cpp | 3 +
 src/mesa/drivers/dri/i965/brw_nir_uniforms.cpp | 10 +
 src/mesa/drivers/dri/i965/brw_program.c | 12 +-
 src/mesa/drivers/dri/i965/brw_state.h | 1 +
 src/mesa/drivers/dri/i965/brw_tcs.c | 2 +
 src/mesa/drivers/dri/i965/brw_tes.c | 2 +
 src/mesa/drivers/dri/i965/brw_vs.c | 2 +
 src/mesa/drivers/dri/i965/brw_wm.c | 2 +
 src/mesa/drivers/dri/i965/gen6_constant_state.c | 17 +-
 src/mesa/program/ir_to_mesa.cpp | 8 +
 src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 9 +
 46 files changed, 1112 insertions(+), 111 deletions(-)
 create mode 100644 src/compiler/glsl/lower_mediump.cpp

--
2.11.0

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev