https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57992

--- Comment #4 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 45016
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45016&action=edit
Assembly from compiling gfortran_internal_pack_test.f90

The code takes in sets of 3-length vectors and 3x3 symmetric positive definite
matrices (storing only the upper triangle). These are stored across columns.
That is, element 1 of the first vector and element 1 of the second vector are
stored contiguously, while elements 1 and 2 of any one vector are a full stride
apart.
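
To make that layout concrete, here is a minimal sketch in Python (not the
attached Fortran). The function names, the 0-based indexing, and the assumption
that the stride equals the number of vectors are all illustrative and not taken
from the test file.

```python
def soa_index(vec, elem, stride):
    """Flat 0-based index of element `elem` of vector `vec`.

    Hypothetical helper: the same element of neighbouring vectors is
    adjacent, and successive elements of one vector are `stride` apart.
    """
    return elem * stride + vec


def pack_across_columns(vectors):
    """Store equal-length vectors across columns in one flat list."""
    n = len(vectors)           # assumption: stride == number of vectors
    length = len(vectors[0])
    buf = [0.0] * (n * length)
    for i, v in enumerate(vectors):
        for k, value in enumerate(v):
            buf[soa_index(i, k, n)] = value
    return buf
```

With two vectors (1, 2, 3) and (4, 5, 6), the buffer would hold both first
elements, then both second elements, then both third elements.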

The goal is to factor each PD matrix into S = U*U' (not the Cholesky), and then
compute U^{-1} * x.
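
As a rough illustration of that computation for a single vector and matrix,
here is a Python sketch. It assumes U is upper triangular and that the upper
triangle of S arrives in the order (s11, s12, s22, s13, s23, s33); both are
assumptions made for the example, and the function name is hypothetical. It is
not a transcription of the attached Fortran.

```python
import math


def pdbacksolve_sketch(x, s):
    """Return U^{-1} * x, where S = U * U' and U is upper triangular.

    x: (x1, x2, x3)
    s: upper triangle of S as (s11, s12, s22, s13, s23, s33)  [assumed order]
    """
    x1, x2, x3 = x
    s11, s12, s22, s13, s23, s33 = s

    # Factor S = U * U', working from the bottom-right corner upward.
    u33 = math.sqrt(s33)
    u23 = s23 / u33
    u13 = s13 / u33
    u22 = math.sqrt(s22 - u23 * u23)
    u12 = (s12 - u13 * u23) / u22
    u11 = math.sqrt(s11 - u12 * u12 - u13 * u13)

    # Solve U * y = x by back substitution.
    y3 = x3 / u33
    y2 = (x2 - u23 * y3) / u22
    y1 = (x1 - u12 * y2 - u13 * y3) / u11
    return (y1, y2, y3)
```

For example, with U = [[2, 1, 3], [0, 4, 5], [0, 0, 6]] the upper triangle of S
is (14, 19, 41, 18, 30, 36), and passing x = (13, 23, 18) returns (1.0, 2.0, 3.0).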

There is a function that operates on one vector and matrix at a time
(pdbacksolve). Another function operates on blocks of 16 at a time
(vpdbacksolve).

Three versions of functions operate on these:
Version 1 simply loops over the inputs, calling the scalar version.

Version 2 loops over blocks of 16 at a time, calling the blocked version.

Version 3 has the blocked function manually inlined into the do loop.

I used compiler options to ensure that all the functions were inlined into
callers, so that ideally Version 2 and Version 3 would be identical.
Attached assembly shows that they are not.

With N = 1024 total vectors and matrices, on my computer version 1 takes
97 microseconds to run, version 2 takes 35 microseconds, and version 3 takes
1.4 microseconds.
These differences are dramatic!
Version 1 failed to vectorize and was littered with _gfortran_internal_pack@PLT
and _gfortran_internal_unpack@PLT. Version 2 vectorized, but also had all the
pack/unpacks. Version 3 had neither.
Data layout was the same (and optimal for vectorization) in all three cases.

[Also worth pointing out that without -fdisable-tree-cunrolli, version 3 takes
9 microseconds.]

For what it is worth, ifort takes 0.82, 1.5, and 0.88 microseconds
respectively. 

I'd hope it is possible for gfortran's versions 1 and 2 to match its version 3
(1.4 microseconds) rather than being 70x and 25x slower. 1.4 microseconds is a
good time, and the best I managed to achieve with explicit vectorization in
Julia.
I could file a different bug report, because the failed vectorization of
version 1 is probably a different issue. But this is another example of
unnecessary packing/unpacking.
