https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600
--- Comment #5 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---

Another interesting data point. I deleted the DGEMM implementation from the file and linked against the serial version of openblas. OK, openblas is based on GOTO blas, so we have to expect a hit for large matrices.

Figures:

ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out
  Size   Loops          Matmul   dgemm   Matmul             Matmul
                fixed explicit          assumed  variable explicit
==================================================================
     2  200000          11.944   0.035    0.136              0.412
     4  200000           1.712   0.257    0.458              0.738
     8  200000           2.080   1.162    0.824              1.077
    16  200000           1.697   3.104    0.939              0.995
    32  200000           1.450   4.814    1.388              1.426
    64   30757           1.485   5.978    1.351              1.371
   128    3829           1.557   6.857    1.534              1.522
   256     477           1.568   7.017    1.589              1.537

So far so good. Looks as if the crossover point for the inline and the dgemm version is between 8 and 16, so let us try this:

ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops -finline-matmul-limit=12 -fexternal-blas bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out
  Size   Loops          Matmul   dgemm   Matmul             Matmul
                fixed explicit          assumed  variable explicit
==================================================================
     2  200000          11.948   0.039    0.156              0.464
     4  200000           1.999   0.305    0.542              0.859
     8  200000           2.435   1.359    0.962              1.255
    16  200000           0.802   3.102    0.798              0.799
    32  200000           4.878   4.990    4.906              4.906
    64   30757           6.045   6.062    5.977              5.968

So, if the user really wants us to call an external BLAS, we had better do so directly and not through our library routines.
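
For anyone who wants to re-derive the crossover from the first set of figures, here is a minimal sketch. It rests on two assumptions that the comment does not state outright: that larger figures mean faster code, and that the "inline" version meant in the crossover remark is the "Matmul fixed explicit" column. The helper name `crossover` is made up for illustration.

```python
# Figures copied from the first run (plain -O2 -funroll-loops build).
# Assumption: larger figures mean faster code, as the crossover remark implies.
sizes = [2, 4, 8, 16, 32, 64, 128, 256]
inline_fixed = [11.944, 1.712, 2.080, 1.697, 1.450, 1.485, 1.557, 1.568]
dgemm = [0.035, 0.257, 1.162, 3.104, 4.814, 5.978, 6.857, 7.017]


def crossover(sizes, a, b):
    """Return (last size where a >= b, first size where b > a), or None."""
    for i in range(1, len(sizes)):
        if a[i - 1] >= b[i - 1] and a[i] < b[i]:
            return sizes[i - 1], sizes[i]
    return None


lo, hi = crossover(sizes, inline_fixed, dgemm)
print(f"dgemm overtakes inline matmul between n={lo} and n={hi}")
```

With these numbers the bracket comes out as 8 and 16, which is why a limit of 12 was a reasonable first guess for -finline-matmul-limit.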