https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

--- Comment #6 from Steve Kargl <sgk at troutmask dot apl.washington.edu> ---
On Tue, Aug 09, 2022 at 05:14:16PM +0000, quanhua.liu at noaa dot gov wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
> 
> --- Comment #4 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
> Using 
> gfortran -O3 -fexternal-blas -L/..... -lblas testmatrixCal.f90

Which BLAS are you using?  If you are using BLAS from
Netlib, then of course you'll likely get poor results
as the Netlib BLAS is not tuned. 

I specifically wrote **** use OpenBLAS ****

OpenBLAS is likely tuned for whatever hardware you have.

% gfcx -o z -O3 -march=native a.f90 -fexternal-blas -lopenblas \
   -fdump-tree-optimized && ./z
   2.44969702       1615.08301    
   2.00995278       1615.08301    

The use of matmul(..., transpose()) is the fastest on an AMD FX(tm)-8350.

% grep gemm z-a.f90.252t.optimized 
  sgemm (&"N"[1]{lb: 1 sz: 1}, &"N"[1]{lb: 1 sz: 1}, &C.4300, &C.4301, &C.4302,
&C.4303, &a, &C.4304, &bb, &C.4305, &C.4306, &c, &C.4307, 1, 1);
  sgemm (&"N"[1]{lb: 1 sz: 1}, &"T"[1]{lb: 1 sz: 1}, &C.4379, &C.4380, &C.4381,
&C.4382, &a, &C.4383, &b, &C.4384, &C.4385, &c, &C.4386, 1, 1);
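
For readers without the test program at hand, here is a minimal sketch of the kind of comparison behind the two sgemm calls above. It is not the a.f90 from this report: the program name, the matrix size n, the timing method, and the printed checksum are all assumptions chosen for illustration. The names a, b, bb and c simply mirror the variables visible in the dump. Note that -fexternal-blas only takes effect for arrays larger than -fblas-matmul-limit (default 30), so n is chosen well above that.

```fortran
! Hypothetical sketch, NOT the a.f90 from the bug report.
! Build (assuming OpenBLAS is installed):
!   gfortran -O3 -march=native sketch.f90 -fexternal-blas -lopenblas
program matmul_sketch
   implicit none
   integer, parameter :: n = 1000                    ! assumed size, above -fblas-matmul-limit
   integer, parameter :: i8 = selected_int_kind(18)  ! 64-bit integers for system_clock
   real, allocatable :: a(:,:), b(:,:), bb(:,:), c(:,:)
   integer(i8) :: t0, t1, rate

   allocate(a(n,n), b(n,n), bb(n,n), c(n,n))
   call random_number(a)
   call random_number(b)

   ! Variant 1: store the transpose in bb beforehand (untimed here),
   ! then multiply. With -fexternal-blas this corresponds to sgemm("N","N").
   bb = transpose(b)
   call system_clock(t0, rate)
   c = matmul(a, bb)
   call system_clock(t1)
   print *, real(t1 - t0) / real(rate), sum(c)

   ! Variant 2: pass transpose(b) directly to matmul. The dump above
   ! shows gfortran turning this form into sgemm("N","T").
   call system_clock(t0)
   c = matmul(a, transpose(b))
   call system_clock(t1)
   print *, real(t1 - t0) / real(rate), sum(c)
end program matmul_sketch
```

Each line printed is an elapsed time in seconds followed by a checksum of c; the two checksums should agree closely since both variants compute the same product. Whether the explicit transpose belongs inside the timed region is a judgment call; it is left outside here so that the two timings compare only the sgemm calls.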
