On Thu, Jul 14, 2022 at 7:32 AM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> On Mon, Jul 11, 2022, H.J. Lu <hjl.to...@gmail.com> wrote:
> > On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
> > > Hi HJ,
> > >
> > > I believe this should now be handled by the post-reload (CSE) pass.
> > > Consider the simple test case:
> > >
> > > __int128 a, b, c;
> > > void foo()
> > > {
> > >   a = 0;
> > >   b = 0;
> > >   c = 0;
> > > }
> > >
> > > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC generates TImode writes:
> > >         movq    $0, a(%rip)
> > >         movq    $0, a+8(%rip)
> > >         movq    $0, b(%rip)
> > >         movq    $0, b+8(%rip)
> > >         movq    $0, c(%rip)
> > >         movq    $0, c+8(%rip)
> > >         ret
> > >
> > > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, a(%rip)
> > >         movaps  %xmm0, b(%rip)
> > >         movaps  %xmm0, c(%rip)
> > >         ret
> > >
> > > You're quite right that internally STV actually generates the equivalent of:
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, a(%rip)
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, b(%rip)
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, c(%rip)
> > >         ret
> > >
> > > and currently, because STV runs before cse2 and combine, the const0_rtx
> > > gets CSE'd by the cse2 pass to produce the code we see.  However, if
> > > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > > you'll see we continue to generate the same optimized code, as the
> > > same const0_rtx gets CSE'd in postreload.
> > >
> > > I can't be certain until I try the experiment, but I believe that the
> > > postreload CSE will clean up all of the same common subexpressions.
> > > Hence, it should be safe to perform all STV at the same point (after
> > > combine), which allows for a few additional optimizations.
> > >
> > > Does this make sense?  Do you have a test case where
> > > -fno-rerun-cse-after-loop produces different/inferior code for TImode
> > > STV chains?
> > >
> > > My guess is that the RTL passes have changed so much in the last six
> > > or seven years that some of the original motivation no longer applies.
> > > Certainly we now try to keep TImode operations visible longer, and
> > > then allow STV to behave like a pre-reload pass to decide which set of
> > > registers to use (vector V1TI or scalar doubleword DI).  Any CSE
> > > opportunities that cse2 finds with V1TI mode could/should equally
> > > well be found for TImode (mostly).
> >
> > You are probably right.  If there are no regressions in the GCC testsuite,
> > my original motivation is no longer valid.
>
> It was good to try the experiment, but H.J. is right: there is still some benefit
> (as well as some disadvantages) to running STV lowering before CSE2/combine.
> A clean-up patch to perform all STV conversion as a single pass (removing a
> pass from the compiler) results in just a single regression in the test suite:
> FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8
> which looks like:
>
> __int128 a, b, c, d, e, f;
> void foo (void)
> {
>   a = 0;
>   b = -1;
>   c = 0;
>   d = -1;
>   e = 0;
>   f = -1;
> }
>
> By performing STV after combine (without CSE), reload prefers to implement
> this function using a single register, which then requires 12 instructions
> rather than 8 (if using two registers).  Alas, there's nothing that
> postreload CSE/GCSE can do.  Doh!
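[For illustration, the two-register sequence that the "8 instructions" count
above refers to would presumably look something like the following hand-written
sketch (not actual compiler output):

        pxor    %xmm0, %xmm0        # xmm0 = 0
        pcmpeqd %xmm1, %xmm1        # xmm1 = all-ones
        movaps  %xmm0, a(%rip)
        movaps  %xmm1, b(%rip)
        movaps  %xmm0, c(%rip)
        movaps  %xmm1, d(%rip)
        movaps  %xmm0, e(%rip)
        movaps  %xmm1, f(%rip)
        ret

i.e. two constant-materializing instructions shared by all six stores, instead
of rematerializing the constant before each store.]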
Hmm, the RA could be taught to make use of more of the register file, I suppose
(shouldn't regrename do this job? - but it runs after postreload-cse).

>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, b(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, c(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, d(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, e(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, f(%rip)
>         ret
>
> I also note that even without STV, the scalar implementation of this function,
> when compiled with -Os, is also larger than it needs to be due to poor CSE
> (notice that in the following we only need a single zero register, and an
> all-ones register would be helpful):
>
>         xorl    %eax, %eax
>         xorl    %edx, %edx
>         xorl    %ecx, %ecx
>         movq    $-1, b(%rip)
>         movq    %rax, a(%rip)
>         movq    %rax, a+8(%rip)
>         movq    $-1, b+8(%rip)
>         movq    %rdx, c(%rip)
>         movq    %rdx, c+8(%rip)
>         movq    $-1, d(%rip)
>         movq    $-1, d+8(%rip)
>         movq    %rcx, e(%rip)
>         movq    %rcx, e+8(%rip)
>         movq    $-1, f(%rip)
>         movq    $-1, f+8(%rip)
>         ret
>
> I need to give the problem some more thought.  It would be good to
> clean up/unify the STV passes, but I/we need to solve/CSE HJ's last test
> case before we do.  Perhaps forbidding "(set (mem:ti) (const_int 0))" in
> movti_internal would force the zero register to become visible, and CSE'd,
> benefiting both vector code and scalar -Os code, and then postreload/peephole2
> could fix up the remaining scalar cases.  It's tricky.

Not sure if related, but ppc(?) folks recently tried to massage CSE to avoid
propagating constants by making sure that rtx_cost handles
(set (...) (const_int ...)) "properly".  But IIRC CSE never does the reverse
transform - splitting a constant out to a pseudo when there are multiple uses
of the same constant - that's probably the job of reload + postreload-cse
right now, but reload probably does not know that there are multiple uses of
the constant and hence that the splitting would be worthwhile.

> Cheers,
> Roger
> --
>
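[Again purely for illustration, if the constants were split out to registers
and shared across uses as discussed above, the -Os scalar code could plausibly
look like the following hand-written sketch (not actual compiler output), with
one zero register and one all-ones register feeding all twelve stores:

        xorl    %eax, %eax          # rax = 0, shared zero register
        movq    $-1, %rdx           # rdx = all-ones, shared -1 register
        movq    %rax, a(%rip)
        movq    %rax, a+8(%rip)
        movq    %rdx, b(%rip)
        movq    %rdx, b+8(%rip)
        movq    %rax, c(%rip)
        movq    %rax, c+8(%rip)
        movq    %rdx, d(%rip)
        movq    %rdx, d+8(%rip)
        movq    %rax, e(%rip)
        movq    %rax, e+8(%rip)
        movq    %rdx, f(%rip)
        movq    %rdx, f+8(%rip)
        ret
]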