On Mon, 2020-06-29 at 08:58 +0200, Richard Biener wrote:
> On Sat, Jun 27, 2020 at 4:52 PM Marc Glisse <marc.gli...@inria.fr> wrote:
> > On Fri, 26 Jun 2020, Jeff Law via Gcc-patches wrote:
> > >
> > > In theory yes, but there are cases where paths converge (like
> > > you've shown) and you may have evaluated to a constant on each
> > > path, yet the value is not constant at the convergence point.
> > > You have to be very careful using b_c_p like this; it's been a
> > > regular source of kernel bugs.
> > >
> > > I'd recommend looking at the .ssa dump and walking forward from
> > > there, if the .ssa dump looks correct.
> >
> > Here is the last dump before thread1 (105t.mergephi2). I don't see
> > anything incorrect in it.
> >
> > ledtrig_cpu (_Bool is_active)
> > {
> >   int old;
> >   int iftmp.0_1;
> >   int _5;
> >
> >   <bb 2> [local count: 1073741824]:
> >   if (is_active_2(D) != 0)
> >     goto <bb 4>; [50.00%]
> >   else
> >     goto <bb 3>; [50.00%]
> >
> >   <bb 3> [local count: 536870913]:
> >
> >   <bb 4> [local count: 1073741824]:
> >   # iftmp.0_1 = PHI <1(2), -1(3)>
> >   _5 = __builtin_constant_p (iftmp.0_1);
> >   if (_5 != 0)
> >     goto <bb 5>; [50.00%]
> >   else
> >     goto <bb 8>; [50.00%]
> >
> >   <bb 5> [local count: 536870913]:
> >   if (iftmp.0_1 >= -128)
> >     goto <bb 6>; [50.00%]
> >   else
> >     goto <bb 8>; [50.00%]
> >
> >   <bb 6> [local count: 268435456]:
> >   if (iftmp.0_1 <= 127)
> >     goto <bb 7>; [34.00%]
> >   else
> >     goto <bb 8>; [66.00%]
> >
> >   <bb 7> [local count: 91268056]:
> >   __asm__ __volatile__("asi %0,%1
> > " : "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "i" iftmp.0_1, "Q" MEM[(int *)&num_active_cpus] : "memory", "cc");
> >   goto <bb 9>; [100.00%]
> >
> >   <bb 8> [local count: 982473769]:
> >   __asm__ __volatile__("laa %0,%2,%1
> > " : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "d" iftmp.0_1, "Q" MEM[(int *)&num_active_cpus] : "memory", "cc");
> >
> >   <bb 9> [local count: 1073741824]:
> >   return;
> >
> > }
> >
> > There is a single _b_c_p, the immediate asm argument is exactly the
> > argument of _b_c_p, and it is in the branch protected by _b_c_p.
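
(For readers without the testcase at hand: a reduced C sketch of the
kind of source that produces the dump above. The asm templates and
constraints are copied from the dump; the helper name atomic_add and
the exact guard are assumptions modeled on the s390 kernel pattern,
not taken verbatim from the kernel tree.)

static int num_active_cpus;

/* Add VAL to *PTR.  If VAL is a compile-time constant fitting in a
   signed byte, use the immediate form (asi); otherwise fall back to
   the register form (laa).  */
static inline void atomic_add(int val, int *ptr)
{
	int old;

	if (__builtin_constant_p(val) && val >= -128 && val <= 127) {
		/* The "i" constraint demands a compile-time constant;
		   __builtin_constant_p is supposed to guard it.  */
		__asm__ __volatile__("asi %0,%1\n"
				     : "=Q" (*ptr)
				     : "i" (val), "Q" (*ptr)
				     : "memory", "cc");
		return;
	}
	__asm__ __volatile__("laa %0,%2,%1\n"
			     : "=d" (old), "=Q" (*ptr)
			     : "d" (val), "Q" (*ptr)
			     : "memory", "cc");
}

void ledtrig_cpu(_Bool is_active)
{
	atomic_add(is_active ? 1 : -1, &num_active_cpus);
}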
> >
> > Now the thread1 dump, for comparison:
> >
> > ledtrig_cpu (_Bool is_active)
> > {
> >   int old;
> >   int iftmp.0_4;
> >   int iftmp.0_6;
> >   int _7;
> >   int _12;
> >   int iftmp.0_13;
> >   int iftmp.0_14;
> >
> >   <bb 2> [local count: 1073741824]:
> >   if (is_active_2(D) != 0)
> >     goto <bb 3>; [50.00%]
> >   else
> >     goto <bb 4>; [50.00%]
> >
> >   <bb 3> [local count: 536870912]:
> >   # iftmp.0_6 = PHI <1(2)>
> >   _7 = __builtin_constant_p (iftmp.0_6);
> >   if (_7 != 0)
> >     goto <bb 6>; [50.00%]
> >   else
> >     goto <bb 8>; [50.00%]
> >
> >   <bb 4> [local count: 536870912]:
> >   # iftmp.0_4 = PHI <-1(2)>
> >   _12 = __builtin_constant_p (iftmp.0_4);
> >   if (_12 != 0)
> >     goto <bb 5>; [50.00%]
> >   else
> >     goto <bb 8>; [50.00%]
> >
> >   <bb 5> [local count: 268435456]:
> >   if (iftmp.0_4 >= -128)
> >     goto <bb 7>; [20.00%]
> >   else
> >     goto <bb 8>; [80.00%]
> >
> >   <bb 6> [local count: 214748364]:
> >   if (iftmp.0_6 <= 127)
> >     goto <bb 7>; [12.00%]
> >   else
> >     goto <bb 8>; [88.00%]
> >
> >   <bb 7> [local count: 91268056]:
> >   # iftmp.0_13 = PHI <iftmp.0_6(6), iftmp.0_4(5)>
> >   __asm__ __volatile__("asi %0,%1
> > " : "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "i" iftmp.0_13, "Q" MEM[(int *)&num_active_cpus] : "memory", "cc");
> >   goto <bb 9>; [100.00%]
> >
> >   <bb 8> [local count: 982473769]:
> >   # iftmp.0_14 = PHI <iftmp.0_4(5), iftmp.0_4(4), iftmp.0_6(6), iftmp.0_6(3)>
> >   __asm__ __volatile__("laa %0,%2,%1
> > " : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "d" iftmp.0_14, "Q" MEM[(int *)&num_active_cpus] : "memory", "cc");
> >
> >   <bb 9> [local count: 1073741824]:
> >   return;
> >
> > }
> >
> > Thread1 decides to separate the is_active and !is_active paths
> > (surprisingly, on one path it optimizes out the <= 127 comparison
> > and on the other the >= -128 comparison, while it could optimize
> > both on both paths). And it decides to converge after the
> > comparisons, but before the asm.
> >
> > What the pass did does seem to hurt here. It looks like if we
> > duplicate _b_c_p, we may need to duplicate far enough to include
> > all the blocks dominated by _b_c_p==true (including the asm, here).
> > Otherwise, any _b_c_p can be optimized to true, because for a
> > boolean b,
> >
> >   b is the same as b ? true : false,
> >
> > so __builtin_constant_p(b ? true : false) would be the same as
> > b ? __builtin_constant_p(true) : __builtin_constant_p(false),
> > i.e. true.
> >
> > It is too bad we don't have any optimization pass using ranges
> > between IPA and thread1; that would have gotten rid of the
> > comparisons, and hence of the temptation to thread. Adding
> > always_inline on atomic_add (or flatten on the caller) does help:
> > EVRP removes the comparisons.
> >
> > Do you see a way forward without changing what thread1 does, or
> > declaring the testcase unsupported?
>
> Most of the cases I've seen involve transforms that make _b_c_p
> constant on one path and then introduce a new PHI merging two _b_c_p
> values, which is then tested in a condition that does not simplify.
> I'm not sure how to fend off jump threading here (yeah, it's nearly
> always jump threading doing this...), but certainly the easiest way
> would be to simply disallow jump threading from duplicating _b_c_p
> calls.
>
> Or fold _b_c_p even earlier (though I definitely saw the early
> backwards jump threading mess up such a case).
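
Concretely, that "simply disallow" option would amount to a guard
along these lines in the threader (a hypothetical sketch: the function
and where the threader would call it are invented for illustration;
only gimple_call_builtin_p and BUILT_IN_CONSTANT_P are existing GCC
internals):

/* Return true if BB contains a __builtin_constant_p call.  A jump
   threader could refuse to duplicate such a block.  */
static bool
bb_contains_builtin_constant_p (basic_block bb)
{
  for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
       gsi_next (&gsi))
    if (gimple_call_builtin_p (gsi_stmt (gsi), BUILT_IN_CONSTANT_P))
      return true;
  return false;
}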
I'm not 100% sure it's duplication of the b_c_p call itself that's the
problem. Or perhaps better stated: I think anything that results in an
incremental SSA update involving an SSA_NAME used in a b_c_p call is
potentially problematic. I suspect the two are strongly correlated,
though.
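
To make that concrete, a minimal sketch of the failure mode (not the
kernel code; use_imm and use_reg are hypothetical stand-ins for the
two asm forms):

extern int use_imm(int v);	/* models an asm with an "i" constraint */
extern int use_reg(int v);	/* models the register fallback */

int f(int b)
{
	int v = b ? 1 : -1;

	/* If a pass threads the b/!b paths through this test, each
	   copy of __builtin_constant_p sees a literal (1 or -1) and
	   can fold to 1.  If the paths then re-converge before
	   use_imm, its argument is a PHI of 1 and -1 -- not a
	   constant -- and the "i" constraint it models can no longer
	   be satisfied.  */
	if (__builtin_constant_p(v) && v >= -128 && v <= 127)
		return use_imm(v);
	return use_reg(v);
}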
Jeff

> Richard.
>
> > --
> > Marc Glisse