On Wed, Mar 11, 2015 at 01:27:32AM +0100, Marek Olšák wrote:
> On Wed, Mar 11, 2015 at 12:09 AM, Tom Stellard <[email protected]> wrote:
> > On Tue, Mar 10, 2015 at 11:01:21PM +0100, Marek Olšák wrote:
> >> I've looked into how to recognize BFM and BFI and discovered that if
> >> TGSI_OPCODE_BFI is expanded, it's _impossible_ to recognize the
> >> pattern in the backend due to LLVM transformations. The reason it's
> >> impossible is that one particular simplification of the expanded IR
> >> can always be done, and it always changes the IR in a way that BFI
> >> can't be recognized anymore.
> >>
> >> The ideal transformation from TGSI to ISA is (note that this is also
> >> how GLSL defines the opcode):
> >>
> >> TGSI_OPCODE_BFI(base, insert, offset, bits)
> >>   = BFI(BFM(bits, offset), SHL(insert, offset), base) =
> >>
> >>   s_lshl_b32 s1, s4, s6
> >>   s_bfm_b32 s0, s0, s6
> >>   v_mov_b32_e32 v0, s5
> >>   v_mov_b32_e32 v1, s1
> >>   v_bfi_b32 v0, s0, v1, v0
> >>
> >> Ideally 3 instructions if all sources are vector registers.
> >>
> >> However, if TGSI_OPCODE_BFI is expanded into basic bitwise operations
> >> (BTW the result of BFM has 2 uses in BFI), LLVM applies this
> >> transformation:
> >>
> >>   (X << S) & (Y << S) ---> (X & Y) << S
> >>
> >> which breaks recognition of BFI and also the second use of BFM.
> >> Therefore this version calculates the same BFM expression twice. The
> >> first BFM is recognized by pattern matching, but the second BFM as
> >> well as BFI is unrecognizable due to the transformation. The result
> >> is:
> >>
> >>   s_lshl_b32 s1, 1, s0
> >>   s_bfm_b32 s0, s0, s6
> >>   s_add_i32 s1, s1, -1
> >>   s_not_b32 s0, s0
> >>   s_and_b32 s1, s1, s4
> >>   s_and_b32 s0, s0, s5
> >>   s_lshl_b32 s1, s1, s6
> >>   s_or_b32 s0, s1, s0
> >>
> >> There are 2 ways out of this:
> >>
> >> 1) Use BFM and BFI intrinsics in Mesa. Simple and unlikely to break in
> >> the future.
> >>
> >> 2) Try to recognize the expression tree seen by the backend. Changes
> >> in LLVM core can break it.
> >> More complicated shaders with more opportunities for transformations
> >> can break it too:
> >>
> >> def : Pat <
> >>   (i32 (or (i32 (shl (i32 (and (i32 (add (i32 (shl 1, i32:$a)),
> >>                                      -1)), i32:$b)), i32:$c)),
> >>            (i32 (and (i32 (xor (i32 (shl (i32 (add (i32 (shl 1, i32:$a)),
> >>                                           -1)), i32:$c)), -1)), i32:$d)))),
> >>   (V_BFI_B32 (S_BFM_B32 $a, $c),
> >>              (S_LSHL_B32 $b, $c),
> >>              $d)
> >> >;
> >
> > I don't want to waste a lot of time discussing this, because it probably
> > doesn't matter too much in the long run. I'm fine with using
> > intrinsics, but I just wanted to point out a few things in case you or
> > someone else wants to get this working using LLVM IR.
> >
> > 1. Running the instruction combining pass should help with pattern
> > matching. It transforms common sequences into canonical forms, which
> > makes them easier to match in the backend. We should be running this
> > pass anyway, as it has some good optimizations.
> >
> > diff --git a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> > index dce5b55..45c9eb8 100644
> > --- a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> > +++ b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> > @@ -1444,6 +1444,7 @@ void radeon_llvm_finalize_module(struct radeon_llvm_context * ctx)
> >  	LLVMAddPromoteMemoryToRegisterPass(gallivm->passmgr);
> >
> >  	/* Add some optimization passes */
> > +	LLVMAddInstructionCombiningPass(gallivm->passmgr);
> >  	LLVMAddScalarReplAggregatesPass(gallivm->passmgr);
> >  	LLVMAddLICMPass(gallivm->passmgr);
> >  	LLVMAddAggressiveDCEPass(gallivm->passmgr);
>
> I tried this a long time ago and it broke a few tests which used the
> kill intrinsic.
>
> I'm testing it right now and it increases the shader binary size quite
> a lot. It looks like some DAG combines don't work with it anymore and
> the generated code looks worse overall.
>
> I occasionally use llc for debugging, which doesn't seem to use it
> either?
> Anyway, it looks like there's a lot of work needed just to fix
> the worse code generation. And when Mesa starts to use it, llc should
> use it too.
When mesa dumps the LLVM IR, all the passes we add in radeonsi have
already been run, so we wouldn't need to add it to llc.

> > 2. We have a number of patterns for BFI already in AMDGPUInstructions.td.
> > If we are modeling BFI using LLVM IR, we should be using one of those
> > patterns.
>
> All of them are useless for TGSI_OPCODE_BFI.

I guess I don't understand why. If you implemented TGSI_OPCODE_BFI with
the bfm intrinsic and an open-coded bfi, couldn't you match the
open-coded bfi to one of the patterns in AMDGPUInstructions.td?

> > 3. The BFI, BFM, LSHL sequence does not need to be matched in a single
> > pattern. There should be one pattern for BFI, one for BFM, and one for
> > LSHL.
>
> This won't work for TGSI, which I tried to explain in detail. Matching
> the whole pattern is the only way. The big pattern can recognize a
> hidden duplicated expression in it which the CSE can't. Any strict
> subset of the pattern doesn't make any sense alone.

Ok, perhaps I didn't really understand your example. Would you be able
to send the LLVM IR you used?

> > 4. Using intrinsics means no constant folding will be done, which could
> > lead to worse code in some cases.
>
> This is not a big issue. The GLSL compiler understands these
> intrinsics and takes care of constant folding.

I forgot about that. I was also thinking about the case where we use an
intrinsic internally in the driver (as I think you did in one of the
patches), but it seems like this would be even less of an issue.

-Tom

> Marek

_______________________________________________
mesa-dev mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
