On Mon, Aug 21, 2023 at 8:25 PM Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > The following fixes the gcc.target/i386/pr87007-5.c testcase which > changed code generation again after the recent sinking improvements. > We now have > > vxorps %xmm0, %xmm0, %xmm0 > vsqrtsd d2(%rip), %xmm0, %xmm0 > > and an unnecessary xor again in one case, the other vsqrtsd has > a register source and a properly zeroing load: > > vmovsd d3(%rip), %xmm0 > testl %esi, %esi > jg .L11 > .L3: > vsqrtsd %xmm0, %xmm0, %xmm0 > > the following patch XFAILs the scan. I'm not sure what's at > fault here, there are no loops in the CFG, but somehow > r84:DF=sqrt(['d2']) gets a pxor but r84:DF=sqrt(r83:DF) > doesn't. I guess I don't really understand what > remove_partial_avx_dependency is supposed to do so can't > really assess whether the pxor is necessary or not. There's a false dependency on xmm0 when the source operand in the pattern is memory, the pattern only takes xmm0 as dest, but the output instruction takes xmm0 also as input(the second source operand), that's why we need an pxor here. When the source operand in the pattern is register_operand, we can reuse the register_operand for the second source operand. The instructions here are not very obvious, the more representative one should be vsqrtsd %xmm1, %xmm1(rused one), %xmm0. > > OK? Can we add -fno-XXX to disable the optimization to make the assembly more stable? Or current codegen should be optimal(for the sinking), then Ok for the patch.
> > * gcc.target/i386/pr87007-5.c: Update comment, XFAIL > subtest. > --- > gcc/testsuite/gcc.target/i386/pr87007-5.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/gcc/testsuite/gcc.target/i386/pr87007-5.c > b/gcc/testsuite/gcc.target/i386/pr87007-5.c > index a6cdf11522e..5902616d1f1 100644 > --- a/gcc/testsuite/gcc.target/i386/pr87007-5.c > +++ b/gcc/testsuite/gcc.target/i386/pr87007-5.c > @@ -1,6 +1,8 @@ > /* { dg-do compile } */ > /* { dg-options "-Ofast -march=skylake-avx512 -mfpmath=sse > -fno-tree-vectorize -fdump-tree-cddce3-details -fdump-tree-lsplit-optimized" > } */ > -/* Load of d2/d3 is hoisted out, vrndscalesd will reuse loades register to > avoid partial dependence. */ > +/* Load of d2/d3 is hoisted out, the loop is split, store of d1 and sqrt > + are sunk out of the loop and the loop is elided. One vsqrtsd with > + memory operand will need a xor to avoid partial dependence. */ > > #include<math.h> > > @@ -17,4 +19,4 @@ foo (int n, int k) > > /* { dg-final { scan-tree-dump "optimized: loop split" "lsplit" } } */ > /* { dg-final { scan-tree-dump-times "removing loop" 2 "cddce3" } } */ > -/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 0 } } */ > +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */ > -- > 2.35.3 -- BR, Hongtao