https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Michael_S from comment #2)
> (In reply to Richard Biener from comment #1)
> > All below for Part 2.
> > 
> > Without -ffast-math you are seeing GCC using in-order reductions now
> > while with -ffast-math the vectorizer gets a bit confused about
> > reassociations done before, for me producing
> 
> Just to understand, when you are saying "Without -ffast-math", does it
> include my set1 == "-O3 -ffast-math -fno-associative-math"?

No, -fno-associative-math breaks it again.

> BTW, I like your phrasing: "a bit confused about reassociations"
> 
> > .L3:
> >         vmovupd 32(%rsi,%rax), %ymm3
> >         vmovupd (%rdx,%rax), %ymm7
> >         vinsertf128     $1, (%rsi,%rax), %ymm3, %ymm0
> >         vinsertf128     $1, 32(%rdx,%rax), %ymm7, %ymm2
> >         vmovupd 32(%rsi,%rax), %ymm5
> >         vpermpd $136, %ymm0, %ymm4
> >         vpermpd $40, %ymm2, %ymm7
> >         vpermpd $221, %ymm0, %ymm1
> >         vpermpd $125, %ymm2, %ymm3
> >         vperm2f128      $49, (%rsi,%rax), %ymm5, %ymm0
> >         vmovupd (%rdx,%rax), %ymm2
> >         vperm2f128      $49, 32(%rdx,%rax), %ymm2, %ymm2
> >         addq    $64, %rax
> >         vpermpd $136, %ymm0, %ymm5
> >         vpermpd $221, %ymm0, %ymm0
> >         vpermpd $40, %ymm2, %ymm8
> >         vpermpd $125, %ymm2, %ymm2
> >         vmulpd  %ymm8, %ymm5, %ymm5
> >         vmulpd  %ymm2, %ymm0, %ymm0
> >         vfmadd132pd     %ymm3, %ymm5, %ymm1
> >         vfmadd231pd     %ymm7, %ymm4, %ymm0
> >         vaddpd  %ymm0, %ymm1, %ymm0
> >         vaddpd  %ymm0, %ymm6, %ymm6
> >         cmpq    %rcx, %rax
> >         jne     .L3
> > 
> > -ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4
> > reduction chains and this SLP reduction ends up looking like
> > 
> > t5.c:7:21: note: Vectorizing SLP tree:
> > t5.c:7:21: note: node 0x4100c20 (max_nunits=4, refcnt=2)
> > t5.c:7:21: note: stmt 0 acc_imre_158 = acc_imre_3 + _34;
> > t5.c:7:21: note: stmt 1 acc_reim_156 = acc_reim_1 + _8;
> > t5.c:7:21: note: stmt 2 acc_imim_154 = _21 + acc_imim_35;
> > t5.c:7:21: note: stmt 3 acc_rere_146 = _11 + acc_rere_29;
> > t5.c:7:21: note: children 0x3f272e0 0x4100bb0
> > t5.c:7:21: note: node 0x3f272e0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)>
> > t5.c:7:21: note: stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)>
> > t5.c:7:21: note: stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)>
> > t5.c:7:21: note: stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)>
> > t5.c:7:21: note: children 0x4100c20
> > t5.c:7:21: note: node 0x4100bb0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _34 = _36 + _157;
> > t5.c:7:21: note: stmt 1 _8 = _30 + _155;
> > t5.c:7:21: note: stmt 2 _21 = _15 + _153;
> > t5.c:7:21: note: stmt 3 _11 = _6 + _145;
> > t5.c:7:21: note: children 0x4100920 0x4100b40
> > t5.c:7:21: note: node 0x4100920 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _36 = _37 + _73;
> > t5.c:7:21: note: stmt 1 _30 = _32 + _71;
> > t5.c:7:21: note: stmt 2 _15 = _10 + _69;
> > t5.c:7:21: note: stmt 3 _6 = _31 + _61;
> > t5.c:7:21: note: children 0x41004e0 0x41008b0
> > t5.c:7:21: note: node 0x41004e0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _37 = _101 + _129;
> > t5.c:7:21: note: stmt 1 _32 = _99 + _127;
> > t5.c:7:21: note: stmt 2 _10 = _97 + _125;
> > t5.c:7:21: note: stmt 3 _31 = _89 + _117;
> > t5.c:7:21: note: children 0x3f2a550 0x3f28700
> > t5.c:7:21: note: node 0x3f2a550 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _101 = _88 * _94;
> > t5.c:7:21: note: stmt 1 _99 = _86 * _96;
> > t5.c:7:21: note: stmt 2 _97 = _94 * _96;
> > t5.c:7:21: note: stmt 3 _89 = _86 * _88;
> > t5.c:7:21: note: children 0x40b6990 0x3f29e00
> > t5.c:7:21: note: node 0x40b6990 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _88 = *_87;
> > t5.c:7:21: note: stmt 1 _96 = *_95;
> > t5.c:7:21: note: stmt 2 _96 = *_95;
> > t5.c:7:21: note: stmt 3 _88 = *_87;
> > t5.c:7:21: note: load permutation { 1 5 5 1 }
> > t5.c:7:21: note: node 0x3f29e00 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _94 = *_93;
> > t5.c:7:21: note: stmt 1 _86 = *_85;
> > t5.c:7:21: note: stmt 2 _94 = *_93;
> > t5.c:7:21: note: stmt 3 _86 = *_85;
> > t5.c:7:21: note: load permutation { 5 1 5 1 }
> > t5.c:7:21: note: node 0x3f28700 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _129 = _116 * _122;
> > t5.c:7:21: note: stmt 1 _127 = _114 * _124;
> > t5.c:7:21: note: stmt 2 _125 = _122 * _124;
> > t5.c:7:21: note: stmt 3 _117 = _114 * _116;
> > t5.c:7:21: note: children 0x3f287e0 0x3f28770
> > t5.c:7:21: note: node 0x3f287e0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _116 = *_115;
> > t5.c:7:21: note: stmt 1 _124 = *_123;
> > t5.c:7:21: note: stmt 2 _124 = *_123;
> > t5.c:7:21: note: stmt 3 _116 = *_115;
> > t5.c:7:21: note: load permutation { 2 6 6 2 }
> > t5.c:7:21: note: node 0x3f28770 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _122 = *_121;
> > t5.c:7:21: note: stmt 1 _114 = *_113;
> > t5.c:7:21: note: stmt 2 _122 = *_121;
> > t5.c:7:21: note: stmt 3 _114 = *_113;
> > t5.c:7:21: note: load permutation { 6 2 6 2 }
> > t5.c:7:21: note: node 0x41008b0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _73 = _60 * _66;
> > t5.c:7:21: note: stmt 1 _71 = _58 * _68;
> > t5.c:7:21: note: stmt 2 _69 = _66 * _68;
> > t5.c:7:21: note: stmt 3 _61 = _58 * _60;
> > t5.c:7:21: note: children 0x4100290 0x4100810
> > t5.c:7:21: note: node 0x4100290 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _60 = *_59;
> > t5.c:7:21: note: stmt 1 _68 = *_67;
> > t5.c:7:21: note: stmt 2 _68 = *_67;
> > t5.c:7:21: note: stmt 3 _60 = *_59;
> > t5.c:7:21: note: load permutation { 0 4 4 0 }
> > t5.c:7:21: note: node 0x4100810 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _66 = *_65;
> > t5.c:7:21: note: stmt 1 _58 = *_57;
> > t5.c:7:21: note: stmt 2 _66 = *_65;
> > t5.c:7:21: note: stmt 3 _58 = *_57;
> > t5.c:7:21: note: load permutation { 4 0 4 0 }
> > t5.c:7:21: note: node 0x4100b40 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _157 = _144 * _150;
> > t5.c:7:21: note: stmt 1 _155 = _142 * _152;
> > t5.c:7:21: note: stmt 2 _153 = _150 * _152;
> > t5.c:7:21: note: stmt 3 _145 = _142 * _144;
> > t5.c:7:21: note: children 0x4100990 0x4100a50
> > t5.c:7:21: note: node 0x4100990 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _144 = *_143;
> > t5.c:7:21: note: stmt 1 _152 = *_151;
> > t5.c:7:21: note: stmt 2 _152 = *_151;
> > t5.c:7:21: note: stmt 3 _144 = *_143;
> > t5.c:7:21: note: load permutation { 3 7 7 3 }
> > t5.c:7:21: note: node 0x4100a50 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note: stmt 0 _150 = *_149;
> > t5.c:7:21: note: stmt 1 _142 = *_141;
> > t5.c:7:21: note: stmt 2 _150 = *_149;
> > t5.c:7:21: note: stmt 3 _142 = *_141;
> > t5.c:7:21: note: load permutation { 7 3 7 3 }
> > 
> > which eventually shows some non-obvious permute optimization
> > opportunities.  I'm currently working on a permute optimization phase
> > btw. but for start only handling cases that do not help here.
> > 
> > Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc
> > I get the reduction chain variant which optimizes to
> > 
> > .L3:
> >         vmovupd 32(%rsi,%rax), %ymm6
> >         vmovupd 32(%rdx,%rax), %ymm7
> >         vmovupd (%rsi,%rax), %ymm5
> >         vfmadd231pd     (%rdx,%rax), %ymm6, %ymm0
> >         vfmadd231pd     (%rdx,%rax), %ymm5, %ymm3
> >         vfmadd231pd     (%rsi,%rax), %ymm7, %ymm1
> >         addq    $64, %rax
> >         vfmadd231pd     %ymm6, %ymm7, %ymm2
> >         cmpq    %rcx, %rax
> >         jne     .L3
> > 
> > even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc).
> 
> It's very fragile. I made a tiny (and natural for my real app) change in
> the source (see below) and the nice MSVC-like inner loop disappeared.
> 
> void cdot(double* res, const double* a, const double* b, int N)
> {
>   double acc_rere = 0;
>   double acc_imim = 0;
>   double acc_reim = 0;
>   double acc_imre = 0;
>   for (int c = 0; c < N; ++c) {
>     for (int k = 0; k < 4; ++k) {
>       acc_rere += a[c*8+k+0]*b[c*8+k+0];
>       acc_imim += a[c*8+k+4]*b[c*8+k+4];
>       acc_reim -= a[c*8+k+0]*b[c*8+k+4];
>       acc_imre += a[c*8+k+4]*b[c*8+k+0];
>     }
>   }
>   res[0] = acc_rere+acc_imim;
>   res[4] = acc_reim+acc_imre;
> }

wow, OK ... :/

> > Which means the following source change helps:
> > 
> > void __attribute__((optimize("no-tree-reassoc")))
> > cdot(double* res, const double* a, const double* b, int N)
> > {
> >   double acc_rere = 0;
> >   double acc_imim = 0;
> >   double acc_reim = 0;
> >   double acc_imre = 0;
> >   for (int c = 0; c < N; ++c) {
> >     for (int k = 0; k < 4; ++k) {
> >       acc_rere += a[c*8+k+0]*b[c*8+k+0];
> >       acc_imim += a[c*8+k+4]*b[c*8+k+4];
> >       acc_reim += a[c*8+k+0]*b[c*8+k+4];
> >       acc_imre += a[c*8+k+4]*b[c*8+k+0];
> >     }
> >   }
> >   res[0] = acc_rere+acc_imim;
> >   res[4] = acc_imre-acc_reim;
> > }
> 
> IMHO, options like "no-tree-reassoc", including and especially within
> __attribute__((optimize(""))), are for people like you. People like me,
> i.e. not compiler guys, don't consider them, not only for production but
> even for a hobby - unless the hobby is compiler research.

That's true.

> Also, several years ago I was told (not by you, as Richard Biener, but by
> "you" as "gcc maintainers", more specifically by Manuel López-Ibáñez; you,
> as Richard Biener, also took part in the discussion but appeared to hold a
> different opinion, see 70255) that __attribute__((optimize(""))) can't be
> relied on in production. Back then we came to the conclusion that this
> statement has to be in the official docs. And indeed, since GCC 7, section
> 6.33.1 contains the following sentences:
> "The optimize attribute should be used for debugging purposes only. It is
> not suitable in production code."

Yeah, I guess from my HPC background I'd say HPC is never "production" ;)
I would indeed avoid optimize("") in, say, a github-hosted project, but IMHO
it's as valid as splitting out the relevant function and compiling that TU
with -fno-tree-reassoc - it's tuning the setup for maximum performance with
a specific compiler and for a specific host.
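To make the non-attribute alternative concrete, here is a minimal sketch of
splitting the kernel into its own translation unit; the file name, the build
line and the file-scope pragma are illustrative assumptions, not something
taken from the testcase:

/* cdot.c - built on its own so that only this file gets reassociation
   disabled, e.g. with a hypothetical build line like
       gcc -O3 -ffast-math -march=core-avx2 -fno-tree-reassoc -c cdot.c
   The file-scope pragma below asks for the same thing from within the
   source; like the optimize attribute it is documented as a debugging aid,
   so treat this purely as a sketch of the workaround.  */
#pragma GCC optimize ("no-tree-reassoc")

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim -= a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_reim+acc_imre;
}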
> > the reduction epilogue ends up like
> > 
> >         vextractf128    $0x1, %ymm3, %xmm4
> >         vaddpd  %xmm3, %xmm4, %xmm3
> >         vunpckhpd       %xmm3, %xmm3, %xmm4
> >         vaddpd  %xmm3, %xmm4, %xmm3
> >         vextractf128    $0x1, %ymm2, %xmm4
> >         vaddpd  %xmm2, %xmm4, %xmm4
> >         vunpckhpd       %xmm4, %xmm4, %xmm2
> >         vaddpd  %xmm4, %xmm2, %xmm2
> >         vextractf128    $0x1, %ymm1, %xmm4
> >         vaddpd  %xmm1, %xmm4, %xmm4
> >         vaddsd  %xmm2, %xmm3, %xmm2
> >         vunpckhpd       %xmm4, %xmm4, %xmm1
> >         vaddpd  %xmm4, %xmm1, %xmm1
> >         vextractf128    $0x1, %ymm0, %xmm4
> >         vaddpd  %xmm0, %xmm4, %xmm4
> >         vunpckhpd       %xmm4, %xmm4, %xmm0
> >         vaddpd  %xmm4, %xmm0, %xmm0
> >         vsubsd  %xmm1, %xmm0, %xmm0
> >         vzeroupper
> >         vmovsd  %xmm2, (%rdi)
> >         vmovsd  %xmm0, 32(%rdi)
> > 
> > which is not optimal since we miss the opportunity to vectorize the
> > adds of the accumulators
> > 
> >   res[0] = acc_rere+acc_imim;
> >   res[4] = acc_imre-acc_reim;
> 
> The epilogue is a tricky matter.
> There are many ways to do it with about the same speed, and which variant
> is faster could depend on fine microarchitectural details that may not be
> the same even on quite similar CPUs such as Skylake and Zen2, or Skylake
> and Ice Lake, or maybe even Skylake and Skylake-X (well, maybe the last
> one is an exaggeration). The optimal sequence also depends on the
> surrounding code, e.g. if I change the source to:
>   res[0] = acc_rere+acc_imim;
>   res[1] = acc_reim-acc_imre; // results adjacent in memory
> it could already be different.
> In the latter case it would likely be
>   vaddpd     ymm_rere, ymm_imim, ymm_re
>   vsubpd     ymm_reim, ymm_imre, ymm_im
>   vhadd      ymm_im, ymm_re, ymm_reim
>   vperm2f128 $1, ymm_reim, ymm_reim, ymm_reimH
>   vaddpd     xmm_reimH, xmm_reim, xmm_reim
> 
> Even icc can't do it perfectly right now.
> It would be nice (for you personally, at least) if in this case gcc
> generated better code than icc, but that is far [far [far...]] less
> important than robust handling of the inner loop.

Yeah, the issue really is that with reassociation we miss the obviously
better reduction scheme, and the vectorizer cannot translate between one
and the other at the moment (it doesn't itself try reassociating other
than what is implied by vectorizing). We could try to add this capability
without too many issues, I guess - the main problem is that the reduction
scheme is decided upfront and that the decision wouldn't be based on
actual costs.

Thanks for the insightful loop kernels btw!
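For reference, a rough intrinsics rendering of the epilogue sketched above
for the adjacent-store variant (res[0] = re, res[1] = im); the helper name
and the assumption that the four accumulators live in __m256d variables are
purely illustrative, not taken from the testcase:

#include <immintrin.h>

/* Hypothetical epilogue: reduce the four YMM accumulators along the lines
   of the vaddpd/vsubpd/vhadd/vperm2f128 sketch above.  Names invented.  */
static inline void cdot_epilogue(double* res,
                                 __m256d acc_rere, __m256d acc_imim,
                                 __m256d acc_reim, __m256d acc_imre)
{
  __m256d re   = _mm256_add_pd(acc_rere, acc_imim);   /* vaddpd  */
  __m256d im   = _mm256_sub_pd(acc_reim, acc_imre);   /* vsubpd  */
  /* per 128-bit lane: { re0+re1, im0+im1 }               vhaddpd */
  __m256d reim = _mm256_hadd_pd(re, im);
  /* fold the upper lane onto the lower one; vperm2f128 + vaddpd  */
  __m128d lo   = _mm256_castpd256_pd128(reim);
  __m128d hi   = _mm256_extractf128_pd(reim, 1);
  __m128d sum  = _mm_add_pd(lo, hi);
  _mm_storeu_pd(res, sum);                             /* res[0], res[1] */
}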