https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so I have a patch to keep the association linear, which IMHO is good. It fixes the smaller testcase and my testcase, but not the original one, which is now linear but still not homogeneous. The store groups are as follows:

*_115 = (((((*_115) - (f00_re_68 * x0_re_108)) - (f10_re_70 * x1_re_140)) - (f00_im_73 * x0_im_114)) - (f10_im_74 * x1_im_142));
*_117 = (((((*_117) + (f00_im_73 * x0_re_108)) + (f10_im_74 * x1_re_140)) - (f00_re_68 * x0_im_114)) - (f10_re_70 * x1_im_142));
*_119 = (((((*_119) - (f01_re_71 * x0_re_108)) - (f11_re_72 * x1_re_140)) - (f01_im_75 * x0_im_114)) - (f11_im_76 * x1_im_142));
*_121 = (((((*_121) + (f01_im_75 * x0_re_108)) + (f11_im_76 * x1_re_140)) - (f01_re_71 * x0_im_114)) - (f11_re_72 * x1_im_142));

(good)

*_177 = (((((*_177) - (f00_re_68 * x0_re_170)) - (f00_im_73 * x0_im_176)) - (f10_re_70 * x1_re_202)) - (f10_im_74 * x1_im_204));
*_179 = (((((f00_im_73 * x0_re_170) + (f10_im_74 * x1_re_202)) + (*_179)) - (f00_re_68 * x0_im_176)) - (f10_re_70 * x1_im_204));
*_181 = (((((*_181) - (f01_re_71 * x0_re_170)) - (f01_im_75 * x0_im_176)) - (f11_re_72 * x1_re_202)) - (f11_im_76 * x1_im_204));
*_183 = (((((f01_im_75 * x0_re_170) + (f11_im_76 * x1_re_202)) + (*_183)) - (f01_re_71 * x0_im_176)) - (f11_re_72 * x1_im_204));

already bad. Now, this is something to tackle in the vectorizer, which ideally should not try to match up individual adds during SLP discovery but instead (if association is allowed) match the whole addition chain, commutating within the whole chain rather than just swapping individual add operands.

I still think the reassoc change I came up with is good since it avoids the need to linearize in the vectorizer. So testing that now.