https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
--- Comment #6 from Steve Kargl <sgk at troutmask dot apl.washington.edu> ---
On Tue, Aug 09, 2022 at 05:14:16PM +0000, quanhua.liu at noaa dot gov wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
>
> --- Comment #4 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
> Using
> gfortran -O3 -fexternal-blas -L/..... -lblas testmatrixCal.f90

Which BLAS are you using?  If you are using BLAS from Netlib, then
of course you'll likely get poor results, as the Netlib BLAS is not
tuned.  I specifically wrote

**** use OpenBLAS ****

OpenBLAS is likely tuned for whatever hardware you have.

% gfcx -o z -O3 -march=native a.f90 -fexternal-blas -lopenblas \
  -fdump-tree-optimized && ./z
   2.44969702       1615.08301
   2.00995278       1615.08301

The use of matmul(..., transpose()) is the fastest on an AMD FX(tm)-8350.

% grep gemm z-a.f90.252t.optimized
  sgemm (&"N"[1]{lb: 1 sz: 1}, &"N"[1]{lb: 1 sz: 1}, &C.4300, &C.4301, &C.4302, &C.4303, &a, &C.4304, &bb, &C.4305, &C.4306, &c, &C.4307, 1, 1);
  sgemm (&"N"[1]{lb: 1 sz: 1}, &"T"[1]{lb: 1 sz: 1}, &C.4379, &C.4380, &C.4381, &C.4382, &a, &C.4383, &b, &C.4384, &C.4385, &c, &C.4386, 1, 1);
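
For readers who want to try this themselves, here is a minimal sketch of the kind of program that would produce two sgemm calls like those in the dump above. It is not the a.f90 used in this comment, which is not shown. The program name, the matrix size n, and the timing approach are assumptions chosen for illustration. Only the array names a, b, bb and c and the two matmul forms are taken from the dump.

```fortran
! Hypothetical stand-in for a.f90 (the original source is not shown).
! Compares matmul with an explicitly stored transpose against
! matmul(a, transpose(b)).
program matmul_transpose_sketch
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 512      ! assumed size, large enough to use BLAS
  real, allocatable :: a(:,:), b(:,:), bb(:,:), c(:,:), cref(:,:)
  integer(i8) :: t0, t1, t2, rate

  allocate(a(n,n), b(n,n), bb(n,n), c(n,n), cref(n,n))
  call random_number(a)
  call random_number(b)

  ! Form 1: store the transpose first, then multiply.
  ! In the dump above this corresponds to the "N", "N" sgemm call on bb.
  call system_clock(t0, rate)
  bb = transpose(b)
  c = matmul(a, bb)
  call system_clock(t1)
  cref = c

  ! Form 2: transpose inline. With -fexternal-blas this is expected to
  ! become the "N", "T" sgemm call on b, with no temporary copy.
  c = matmul(a, transpose(b))
  call system_clock(t2)

  print *, 'explicit transpose (s):', real(t1 - t0) / real(rate)
  print *, 'inline transpose   (s):', real(t2 - t1) / real(rate)
  print *, 'max abs difference    :', maxval(abs(c - cref))
end program matmul_transpose_sketch
```

Built with the same options as above (for example, gfortran -O3 -march=native sketch.f90 -fexternal-blas -lopenblas -fdump-tree-optimized, where sketch.f90 is a hypothetical file name), the *.optimized dump can be searched with grep gemm to check which transpose flags were passed. The pass number in the dump file name varies between GCC versions, and the timings depend on the BLAS library and the hardware.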