https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

            Bug ID: 79930
           Summary: Potentially Missed Optimisation for MATMUL /
                    DOT_PRODUCT
           Product: gcc
           Version: 6.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: adam at aphirst dot karoo.co.uk
  Target Milestone: ---

In my codebase I'm performing many "tensor products"; this is by far the
hottest routine. It computes something like

tp = NU^T * P * NV
result [3-vector] = [4-vector]^T [4x4 "matrix" of 3-vectors] [4-vector]

I implement this in three different ways (a minimal sketch of all three
follows this list):

1) an explicit DO (CONCURRENT) loop over i and j that builds a 4x4 result
"matrix", whose x, y and z components are then summed separately into a single
result 3-vector.
2) three separate matmul + dot_product calls (one each for x, y and z):
dot_product(matmul(NU,P),NV)
3) the same, but associated the other way around: dot_product(NU,matmul(P,NV))
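
Here is a minimal sketch of the three variants; the type, kind and procedure
names here are illustrative, not copied from the actual code (which is in the
gist linked below):

module tensor_product
  implicit none
  integer, parameter :: dp = kind(0.d0)

  type :: vec3
    real(dp) :: v(3)   ! the "dimension(3) member" variant
  end type vec3

contains

  ! 1) explicit DO CONCURRENT loop: build the 4x4 "matrix" of scaled
  !    3-vectors, then sum each component separately into the result
  pure function tp_loop(NU, P, NV) result(tp)
    real(dp),   intent(in) :: NU(4), NV(4)
    type(vec3), intent(in) :: P(4,4)
    real(dp)               :: tp(3)
    type(vec3)             :: work(4,4)
    integer                :: i, j
    do concurrent (i = 1:4, j = 1:4)
      work(i,j)%v = NU(i) * P(i,j)%v * NV(j)
    end do
    do i = 1, 3
      tp(i) = sum(work%v(i))   ! sum the i-th component over all 16 entries
    end do
  end function tp_loop

  ! 2) one matmul + dot_product pair per component: (NU^T P) . NV
  pure function tp_matmul_left(NU, P, NV) result(tp)
    real(dp),   intent(in) :: NU(4), NV(4)
    type(vec3), intent(in) :: P(4,4)
    real(dp)               :: tp(3)
    integer                :: i
    do i = 1, 3
      tp(i) = dot_product(matmul(NU, P%v(i)), NV)
    end do
  end function tp_matmul_left

  ! 3) the same, but associated the other way around: NU . (P NV)
  pure function tp_matmul_right(NU, P, NV) result(tp)
    real(dp),   intent(in) :: NU(4), NV(4)
    type(vec3), intent(in) :: P(4,4)
    real(dp)               :: tp(3)
    integer                :: i
    do i = 1, 3
      tp(i) = dot_product(NU, matmul(P%v(i), NV))
    end do
  end function tp_matmul_right

end module tensor_product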

My code is posted at
https://gist.github.com/aphirst/75e0599e2d4b14d182b52daaa6a74098 and, after
discussing it at length with JerryD and dominiq on IRC, I'd like to summarise
our findings.

0) There are two versions of the test code: one implements the 3-vector as a
single real, dimension(3) component, the other as three separate %x, %y and %z
components (a sketch of both follows this list). Across all the tests
described below the performance difference between the two was almost
negligible, on my machine slightly favouring the dimension(3) implementation.
1) With no optimisation and with -fcheck=all, both "Vector" implementations
show the "explicit DO" approach running twice as slow as the matmul
approaches. This case is the exception, presumably because -fcheck=all heavily
penalises the explicit looping.
2) With no optimisation and no -fcheck, both "Vector" implementations show the
"explicit DO" approach running about 1.5x as fast as one matmul orientation,
and very slightly slower than the other.
3) With -Og, regardless of -fcheck, both "Vector" implementations show the
"explicit DO" approach running between 1.5x and 2x as fast as the matmul
orientations. Interestingly, the random number generation now takes about 15%
longer than with no optimisation at all.
4) The same holds for -O2, again regardless of -fcheck, except that the gap
between the "explicit DO" and matmul approaches is slightly larger.

So, to summarise:
* For some reason, either matmul or dot_product is missing some sort of
optimisation here. Whether that optimisation is actually feasible isn't for me
to say, but JerryD reports that, according to the tree dump, the matmul call
isn't being inlined (a sketch of a fused, fully inlined version follows this
list).
* Random number generation surely shouldn't take longer with optimisations
enabled than without, should it?
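
To make the inlining point concrete, here is a sketch of what a fused version
of approach 2 could look like for one scalar component of P: a single loop
nest, with no temporary array for the intermediate matmul result. This only
illustrates the transformation that appears to be missed; it is not a claim
about what gfortran actually emits:

  ! Pk holds one component (x, y or z) of the 4x4 "matrix" P;
  ! dp as in the sketch above
  pure function tp_fused(NU, Pk, NV) result(s)
    real(dp), intent(in) :: NU(4), Pk(4,4), NV(4)
    real(dp) :: s
    integer  :: i, j
    s = 0
    do j = 1, 4
      do i = 1, 4
        s = s + NU(i) * Pk(i,j) * NV(j)
      end do
    end do
  end function tp_fused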

---

I'm running on Arch Linux (x86_64), and gfortran -v gives:

Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/6.3.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc-multilib/src/gcc/configure --prefix=/usr
--libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man
--infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/
--enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared
--enable-threads=posix --enable-libmpx --with-system-zlib --with-isl
--enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu
--disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
--enable-linker-build-id --enable-lto --enable-plugin
--enable-install-libiberty --with-linker-hash-style=gnu
--enable-gnu-indirect-function --enable-multilib --disable-werror
--enable-checking=release
Thread model: posix
gcc version 6.3.1 20170109 (GCC)
