https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57992
--- Comment #4 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 45016
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45016&action=edit
Assembly from compiling gfortran_internal_pack_test.f90

The code takes in sets of length-3 vectors and 3x3 symmetric positive definite (PD) matrices (storing only the upper triangle). These are stored across columns. That is, element 1 of the first vector and element 1 of the second vector are stored contiguously, while elements 1 and 2 of each vector are a stride apart.

The goal is to factor each PD matrix into S = U*U' (not the Cholesky), and then compute U^{-1} * x.

There is a function that operates on one vector and matrix at a time (pdbacksolve). Another function operates on blocks of 16 at a time (vpdbacksolve). Three versions of functions operate on these:
Version 1 simply loops over the inputs, calling the scalar version.
Version 2 loops over blocks of 16 at a time, calling the blocked version.
Version 3 has the function manually inlined into the do loop.

I used compiler options to ensure that all the functions were inlined into their callers, so that ideally Version 2 and Version 3 would be identical. The attached assembly shows that they are not.

Letting N = 1024 total vectors and matrices, on my computer Version 1 takes 97 microseconds to run, Version 2 takes 35 microseconds, and Version 3 takes 1.4 microseconds. These differences are dramatic! Version 1 failed to vectorize and was littered with calls to _gfortran_internal_pack@PLT and _gfortran_internal_unpack@PLT. Version 2 vectorized, but also had all the pack/unpack calls. Version 3 had neither. Data layout was the same (and optimal for vectorization) in all three cases.
[Also worth pointing out that without -fdisable-tree-cunrolli, Version 3 takes 9 microseconds.]

For what it is worth, ifort takes 0.82, 1.5, and 0.88 microseconds respectively. I'd hope it is possible for gfortran's Versions 1 and 2 to match its Version 3 (1.4 microseconds) rather than being 70x and 25x slower. 1.4 microseconds is a good time, and the best I managed to achieve with explicit vectorization in Julia.

I could file a different bug report, because the failed vectorization of Version 1 is probably a different issue. But this is another example of unnecessary packing/unpacking.
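
For readers without the attachment, here is a minimal Python sketch of the per-problem computation and the data layout described above. It is an illustration only, not the Fortran from gfortran_internal_pack_test.f90. The name pdbacksolve comes from this comment; solve_all, the assumption that U is upper triangular, the order of the six upper-triangle entries, and the flat k*n + i indexing are all hypothetical choices made for the sketch.

```python
import math


def pdbacksolve(x, s):
    """Return y = U^{-1} * x, where S = U*U' and U is assumed upper triangular.

    x is (x1, x2, x3); s holds the upper triangle of S in the assumed
    order (s11, s12, s22, s13, s23, s33).
    """
    x1, x2, x3 = x
    s11, s12, s22, s13, s23, s33 = s
    # Factor S = U*U' starting from the bottom-right corner and working up.
    u33 = math.sqrt(s33)
    u13 = s13 / u33
    u23 = s23 / u33
    u22 = math.sqrt(s22 - u23 * u23)
    u12 = (s12 - u13 * u23) / u22
    u11 = math.sqrt(s11 - u12 * u12 - u13 * u13)
    # Back substitution for U*y = x.
    y3 = x3 / u33
    y2 = (x2 - u23 * y3) / u22
    y1 = (x1 - u12 * y2 - u13 * y3) / u11
    return y1, y2, y3


def solve_all(xs, ss, n):
    """Apply pdbacksolve to n problems (the Version 1 pattern: a scalar loop).

    xs has 3*n entries and ss has 6*n entries. Component k of problem i
    lives at index k*n + i, so the same component of neighbouring problems
    is contiguous and successive components of one problem are n apart.
    """
    out = [0.0] * (3 * n)
    for i in range(n):
        x = tuple(xs[k * n + i] for k in range(3))
        s = tuple(ss[k * n + i] for k in range(6))
        y = pdbacksolve(x, s)
        for k in range(3):
            out[k * n + i] = y[k]
    return out
```

Gathering x and s for one problem walks the inputs with stride n, which is the kind of non-contiguous access the scalar function sees when it is handed one vector and one matrix at a time.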