On 12/19/18 8:39 AM, Jonathan Marek wrote: > This works by moving the fadd up across the ffma operations, so that it > can eventually can be combined with a fmul. I'm not sure it works in all > cases, but it works in all the common cases. > > This will only affect freedreno since it is the only driver using the > fuse_ffma option.
tl;dr: Optimal generation of FFMAs is much more difficult than you would think it should be. You should collect some actual data before landing this. Any change to ffma generation is likely to have massive, unforeseen changes to lots of shaders. Seemingly simple, obvious changes result in changes to live ranges, register pressure, scheduling, constant folding, and on, and on. I took this patch, substituted !options->lower_ffma for options->fuse_ffma in the pattern you added, and ran it through shader-db for Skylake and Haswell. As I expected, the results were just all over the place (see below). Notice that register spills are helped on one platform but hurt on the other. There are some simple rules in nir_opt_algebraic for generating and reassociating ffmas. Given the complex interactions with live ranges, register pressure, and scheduling, I feel like ffma generation should happen much, much later in the process... it should almost certainly be deep in the backend where register pressure and scheduling information are available. The Intel compiler has its own pass for ffma generation, and I've found that makes really, really bad choices due to lack of this information. For example, consider a sequence like (shaderInputA * uniformB) + (texture(...) * shaderInputC) There are two ways to generate an ffma from that. One will schedule well, and the other will be horrible. You /probably/ want ffma(texture(...), shaderInputC, (shaderInputA * uniformB)) so that the first multiply can happen during the latency of the texture lookup. But maybe not. Maybe shaderInputA and uniformB are still live after the multiply and storing the result of the multiply pushes register pressure too high. Right now our ffma pass is greedy. If it sees a*b+c, it will always generate ffma(a, b, c), regardless of whether or not c is also a multiply. In one of my experiments, I flipped the logic so a*b+c*d would always generate ffma(c, d, a*b). The number of helped and hurt shaders was very close to even. Some shaders were helped by a huge amount, and other shaders were hurt by an equally huge amount. I also tried not generating an ffma at all for the a*b+c*d case. My recollection is that a few shaders were helped by a large amount, and many thousands of shaders were hurt by small amounts. If I add it all up, I probably spent several weeks last year poking at changes like this in our ffma pass. It began to feel like the old woman who swallowed a fly. Every change helped some things, but it made other things fall off a cliff. The next fix helped a few of the things damaged by the previous change, but it made other things fall of a different cliff. I eventually abandoned the project. If I ever pick it back up, it will be as a pass that occurs closer to scheduling and register allocation. Skylake total instructions in shared programs: 15031138 -> 15035206 (0.03%) instructions in affected programs: 1230624 -> 1234692 (0.33%) helped: 1428 HURT: 1067 helped stats (abs) min: 1 max: 671 x̄: 7.08 x̃: 3 helped stats (rel) min: 0.04% max: 24.72% x̄: 2.30% x̃: 1.78% HURT stats (abs) min: 1 max: 1601 x̄: 13.29 x̃: 4 HURT stats (rel) min: 0.05% max: 352.64% x̄: 4.42% x̃: 2.35% 95% mean confidence interval for instructions value: 0.03 3.23 95% mean confidence interval for instructions %-change: 0.24% 0.91% Instructions are HURT. total cycles in shared programs: 369712682 -> 370166527 (0.12%) cycles in affected programs: 128542483 -> 128996328 (0.35%) helped: 1679 HURT: 2639 helped stats (abs) min: 1 max: 27317 x̄: 162.81 x̃: 18 helped stats (rel) min: <.01% max: 60.25% x̄: 2.34% x̃: 1.38% HURT stats (abs) min: 1 max: 57100 x̄: 275.56 x̃: 58 HURT stats (rel) min: <.01% max: 147.37% x̄: 8.62% x̃: 5.01% 95% mean confidence interval for cycles value: 61.86 148.35 95% mean confidence interval for cycles %-change: 4.06% 4.66% Cycles are HURT. total spills in shared programs: 10158 -> 9688 (-4.63%) spills in affected programs: 1829 -> 1359 (-25.70%) helped: 140 HURT: 3 total fills in shared programs: 22117 -> 21371 (-3.37%) fills in affected programs: 2575 -> 1829 (-28.97%) helped: 140 HURT: 3 LOST: 7 GAINED: 0 Haswell total instructions in shared programs: 13625863 -> 13635875 (0.07%) instructions in affected programs: 1554579 -> 1564591 (0.64%) helped: 844 HURT: 1651 helped stats (abs) min: 1 max: 96 x̄: 4.16 x̃: 3 helped stats (rel) min: 0.04% max: 10.26% x̄: 1.91% x̃: 1.90% HURT stats (abs) min: 1 max: 1602 x̄: 8.19 x̃: 5 HURT stats (rel) min: 0.10% max: 346.00% x̄: 2.97% x̃: 1.45% 95% mean confidence interval for instructions value: 2.70 5.33 95% mean confidence interval for instructions %-change: 1.02% 1.63% Instructions are HURT. total cycles in shared programs: 372618507 -> 372381101 (-0.06%) cycles in affected programs: 147167634 -> 146930228 (-0.16%) helped: 1895 HURT: 2001 helped stats (abs) min: 1 max: 8639 x̄: 341.01 x̃: 26 helped stats (rel) min: <.01% max: 29.69% x̄: 2.78% x̃: 1.66% HURT stats (abs) min: 1 max: 40206 x̄: 204.30 x̃: 53 HURT stats (rel) min: 0.01% max: 71.38% x̄: 6.06% x̃: 3.11% 95% mean confidence interval for cycles value: -92.06 -29.82 95% mean confidence interval for cycles %-change: 1.52% 2.00% Inconclusive result (value mean confidence interval and %-change mean confidence interval disagree). total spills in shared programs: 82582 -> 83316 (0.89%) spills in affected programs: 71432 -> 72166 (1.03%) helped: 16 HURT: 348 total fills in shared programs: 93463 -> 94192 (0.78%) fills in affected programs: 72319 -> 73048 (1.01%) helped: 16 HURT: 379 LOST: 6 GAINED: 0 > Example: > matrix * vec4(coord, 1.0) > is compiled as: > fmul, ffma, ffma, fadd > and with this patch: > ffma, ffma, ffma > > Signed-off-by: Jonathan Marek <jonat...@marek.ca> > --- > src/compiler/nir/nir_opt_algebraic.py | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/src/compiler/nir/nir_opt_algebraic.py > b/src/compiler/nir/nir_opt_algebraic.py > index 506d45e55b..97a6c0d8dc 100644 > --- a/src/compiler/nir/nir_opt_algebraic.py > +++ b/src/compiler/nir/nir_opt_algebraic.py > @@ -137,6 +137,7 @@ optimizations = [ > (('~fadd@64', a, ('fmul', c , ('fadd', b, ('fneg', a)))), > ('flrp', a, b, c), '!options->lower_flrp64'), > (('ffma', a, b, c), ('fadd', ('fmul', a, b), c), 'options->lower_ffma'), > (('~fadd', ('fmul', a, b), c), ('ffma', a, b, c), 'options->fuse_ffma'), > + (('~fadd', ('ffma', a, b, c), d), ('ffma', a, b, ('fadd', c, d)), > 'options->fuse_ffma'), > > (('fdot4', ('vec4', a, b, c, 1.0), d), ('fdph', ('vec3', a, b, c), > d)), > (('fdot4', ('vec4', a, 0.0, 0.0, 0.0), b), ('fmul', a, b)), _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev