On Thu, Jul 14, 2022 at 7:32 AM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> On Mon, Jul 11, 2022, H.J. Lu <hjl.to...@gmail.com> wrote:
> > On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <ro...@nextmovesoftware.com>
> > wrote:
> > > Hi HJ,
> > >
> > > I believe this should now be handled by the post-reload (CSE) pass.
> > > Consider the simple test case:
> > >
> > > __int128 a, b, c;
> > > void foo()
> > > {
> > >   a = 0;
> > >   b = 0;
> > >   c = 0;
> > > }
> > >
> > > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC generates TI mode writes:
> > >         movq    $0, a(%rip)
> > >         movq    $0, a+8(%rip)
> > >         movq    $0, b(%rip)
> > >         movq    $0, b+8(%rip)
> > >         movq    $0, c(%rip)
> > >         movq    $0, c+8(%rip)
> > >         ret
> > >
> > > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, a(%rip)
> > >         movaps  %xmm0, b(%rip)
> > >         movaps  %xmm0, c(%rip)
> > >         ret
> > >
> > > You're quite right: internally, STV actually generates the equivalent of:
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, a(%rip)
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, b(%rip)
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, c(%rip)
> > >         ret
> > >
> > > And currently, because STV runs before cse2 and combine, the const0_rtx
> > > gets CSE'd by the cse2 pass to produce the code we see.  However, if
> > > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > > you'll see we continue to generate the same optimized code, as the
> > > same const0_rtx gets CSE'd in postreload.
> > >
> > > I can't be certain until I try the experiment, but I believe that the
> > > postreload CSE will clean up all of the same common subexpressions.
> > > Hence, it should be safe to perform all STV at the same point (after
> > > combine), which allows for a few additional optimizations.
> > >
> > > Does this make sense?  Do you have a test case where
> > > -fno-rerun-cse-after-loop produces different/inferior code for TImode STV
> > > chains?
> > >
> > > My guess is that the RTL passes have changed so much in the last six
> > > or seven years, that some of the original motivation no longer applies.
> > > Certainly we now try to keep TI mode operations visible longer, and
> > > then allow STV to behave like a pre-reload pass to decide which set of
> > > registers to use (vector V1TI or scalar doubleword DI).  Any CSE
> > > opportunities that cse2 finds with V1TI mode, could/should equally
> > > well be found for TI mode (mostly).
> >
> > You are probably right.  If there are no regressions in GCC testsuite, my
> > original motivation is no longer valid.
>
> It was good to try the experiment, but H.J. is right: there is still some
> benefit (as well as some disadvantages) to running STV lowering before
> CSE2/combine.
> A clean-up patch to perform all STV conversion as a single pass (removing a
> pass from the compiler) results in just a single regression in the test suite:
> FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8
> which looks like:
>
> __int128 a, b, c, d, e, f;
> void foo (void)
> {
>   a = 0;
>   b = -1;
>   c = 0;
>   d = -1;
>   e = 0;
>   f = -1;
> }
>
> By performing STV after combine (without CSE), reload prefers to implement
> this function using a single register, which then requires 12 instructions
> rather than 8 (if using two registers).  Alas, there's nothing that postreload
> CSE/GCSE can do.  Doh!

Hmm, the RA could be taught to make use of more of the register file, I suppose
(shouldn't regrename do this job? But it runs after postreload-cse).

>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, b(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, c(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, d(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, e(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, f(%rip)
>         ret
>
> I also note that even without STV, the scalar implementation of this function
> when compiled with -Os is also larger than it needs to be due to poor CSE
> (notice in the following we only need a single zero register, and an all_ones
> reg would be helpful).
>
>         xorl    %eax, %eax
>         xorl    %edx, %edx
>         xorl    %ecx, %ecx
>         movq    $-1, b(%rip)
>         movq    %rax, a(%rip)
>         movq    %rax, a+8(%rip)
>         movq    $-1, b+8(%rip)
>         movq    %rdx, c(%rip)
>         movq    %rdx, c+8(%rip)
>         movq    $-1, d(%rip)
>         movq    $-1, d+8(%rip)
>         movq    %rcx, e(%rip)
>         movq    %rcx, e+8(%rip)
>         movq    $-1, f(%rip)
>         movq    $-1, f+8(%rip)
>         ret
>
> I need to give the problem some more thought.  It would be good to
> clean-up/unify the STV passes, but I/we need to solve/CSE HJ's last test case
> before we do.  Perhaps forbidding "(set (mem:TI) (const_int 0))" in
> movti_internal would force the zero register to become visible, and CSE'd,
> benefiting both vector code and scalar -Os code; we could then use
> postreload/peephole2 to fix up the remaining scalar cases.  It's tricky.

Not sure if related, but ppc(?) folks recently tried to massage CSE to avoid
propagating constants by making sure that rtx_cost handles
(set (...) (const_int ...)) "properly".  But IIRC CSE never does the reverse
transform - split out a constant to a pseudo from multiple uses of the same
constant - that's probably the job of reload + postreload-CSE right now, but
reload probably does not know that there are multiple uses of the constant
and that the splitting is therefore worthwhile.
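
For illustration only (this is not from the patch or the testsuite), here is
a hand-written C sketch of what that reverse transform would amount to for
the pr70155-17.c style test case: each constant is materialized once in its
own vector register and then stored three times.  The names
foo_split_constants, store_v1ti and check_stores are hypothetical, and the
empty asm is just a trick, assumed sufficient here, to stop the compiler
from propagating the constants back into each store.  It assumes x86-64
with GCC or Clang.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_store_si128, ... */

__int128 a, b, c, d, e, f;

/* Hypothetical helper: one aligned 128-bit store (movaps/movdqa).
   __int128 objects are 16-byte aligned on x86-64, as the store requires. */
static inline void store_v1ti(__int128 *dst, __m128i v)
{
  _mm_store_si128((__m128i *) dst, v);
}

void foo_split_constants(void)
{
  __m128i zero = _mm_setzero_si128();   /* typically pxor    */
  __m128i ones = _mm_set1_epi32(-1);    /* typically pcmpeqd */

  /* Empty asm: makes both values opaque to the optimizers, so each
     constant lives in one SSE register instead of being rematerialized
     before every store. */
  __asm__ ("" : "+x" (zero), "+x" (ones));

  store_v1ti(&a, zero);
  store_v1ti(&b, ones);
  store_v1ti(&c, zero);
  store_v1ti(&d, ones);
  store_v1ti(&e, zero);
  store_v1ti(&f, ones);
}

/* Sanity check that the stores have the same effect as the original
   a = 0; b = -1; ... assignments. */
void check_stores(void)
{
  assert(a == 0 && c == 0 && e == 0);
  assert(b == -1 && d == -1 && f == -1);
}
```

With two registers held live like this, one would hope for two
constant-generating instructions plus six stores, i.e. the 8-instruction
form mentioned above, though the exact output depends on the compiler
version and flags.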

> Cheers,
> Roger
> --
>
>
