MATMUL broken with frontend optimization.

2021-03-18 Thread Steve Kargl via Fortran
It seems that gfortran will inline MATMUL with optimization.
This produces very poor performance.  In fact, gfortran will
inline MATMUL even if one specifies -fexternal-blas.  This is
very bad.

% cat a.f90
program main

   implicit none

   integer, parameter :: imax = 2, jmax = 1
   real, allocatable :: inVect(:), matrix(:,:), outVect(:)
   real :: start, finish

   allocate(invect(imax), matrix(imax,jmax), outvect(jmax))

   call random_number(inVect)
   call random_number(matrix)

   call cpu_time(start)
   outVect = matmul(inVect, matrix)
   call cpu_time(finish)

   print '("Time = ",f10.7," seconds. – First Value = 
",f10.4)',finish-start,outVect(1)
end program main

% gfcx -o z -O0 a.f90 && ./z
Time =  0.2234111 seconds. – First Value =  4982.6362
% nm z | grep matmul
 U _gfortran_matmul_r4@@GFORTRAN_8
% gfcx -o z -O1 a.f90 && ./z
Time =  0.3295890 seconds. – First Value =  4971.0962
% nm z | grep matmul
% gfcx -o z -O2 a.f90 && ./z
Time =  0.3299561 seconds. – First Value =  5025.4902
% nm z | grep matmul
% gfcx -o z -O2 -fexternal-blas a.f90 && ./z
Time =  0.3295580 seconds. – First Value =  5022.8291

This last one is definitely broken.  I did not link with
an external BLAS library.  Please fix before 11.1 is 
released.

-- 
Steve


Fwd: MATMUL broken with frontend optimization.

2021-03-18 Thread Thomas Koenig via Fortran

Hi Steve,

I can confirm the bug (it's already in gcc 10) and will take a look
at it.

Regards

Thomas


Re: [Patch, fortran] PR99602 - [11 regression] runtime error: pointer actual argument not associated

2021-03-18 Thread Tobias Burnus

Hi Paul, hi all fortran@/gcc-patches@ readers,

it looks as if you replied with your patch submission to the wrong email
address – and your re-submission ended up at https://gcc.gnu.org/PR99602#c17

On 16.03.21 18:08, Tobias Burnus wrote:

On 16.03.21 17:42, Paul Richard Thomas via Gcc-patches wrote:

Fortran: Fix runtime errors for class actual arguments [PR99602].
* trans-array.c (gfc_conv_procedure_call): For class formal
arguments, use the _data field attributes for runtime errors.
* gfortran.dg/pr99602.f90: New test.

Shouldn't there also be a testcase which triggers this run-time error?


Note: The new submission consists of a new testcase (now two) and the
actual patch; the new testcase removes 'pointer' from the dummy argument
of prepare_m2_proc/prepare_whizard_m2 and checks via the
-fdump-tree-original dump that run-time check code is now inserted when
passing a (nullified) pointer to a nonpointer dummy argument.
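
To illustrate (my sketch, not the testcase from the patch), the checked
situation is roughly:

   program p
     implicit none
     type t
     end type t
     class(t), pointer :: x => null()
     ! Invalid: disassociated pointer actual, nonpointer dummy -
     ! with -fcheck=pointer a run-time check is inserted here.
     call s(x)
   contains
     subroutine s(y)
       class(t), intent(in) :: y
     end subroutine s
   end program p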

Compared to the previous patch, 'fsym_attr.pointer =
fsym_attr.class_pointer' is new; before, it was 'fsym_attr.pointer =
fsym_attr.pointer'.

Paul Richard Thomas wrote in PR99602:


Good morning all,

I have attached the revised patch and an additional testcase. I had totally
forgotten about the class pointer gotcha.

OK for master?

Paul

Fortran: Fix runtime errors for class actual arguments [PR99602].


LGTM – thanks for the patch.

I am wondering whether the second testcase should be a 'dg-do run' test
instead of 'compile' to ensure that the error is indeed triggered
(currently, it only checks in the tree dump that a check is inserted). What
do you think? [If you do so, you need a dg-shouldfail + dg-output, cf.
e.g. pointer_check_5.f90.]
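
Roughly the directive pattern I mean (illustrative only; see
pointer_check_5.f90 for the real thing):

   ! { dg-do run }
   ! { dg-options "-fcheck=pointer" }
   ! { dg-shouldfail "Pointer check" }
   ! ... code passing a disassociated pointer ...
   ! { dg-output "Fortran runtime error: Pointer actual argument .* is not associated" }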

Thanks,

Tobias



Re: *PING*: Re: [Patch] Fortran: Fix func decl mismatch [PR93660]

2021-03-18 Thread Tobias Burnus

*PING* of my 11.03.21 18:15 CET patch.

The issue is that the TREE_TYPE of the fndecl does not match its arglist.

In some cases, the middle end looks at the function type – and then it 
goes wrong.


The issue only occurs for -fcoarray=lib as other hidden arguments are 
properly handled.


Solution: Add the missing args to the fndecl (in gfc_get_function_type) 
– and increment hidden_typelist in build_function_decl – the latter is 
used in a gcc_assert which was supposed to check for this mismatch ...
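
A sketch of the kind of procedure affected (my example): with
-fcoarray=lib, a coarray dummy argument gets hidden caf_token/offset
arguments, which have to show up in the fndecl's TREE_TYPE as well:

   ! compile with: gfortran -fcoarray=lib -c example.f90
   subroutine s(x)
     implicit none
     integer :: x[*]  ! coarray dummy: hidden token/offset args are added
     x = 1
   end subroutine s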


Tobias

On 14.03.21 12:04, Tobias Burnus wrote:

Early ping – and minor post script:

+  hidden_typelist = TREE_CHAIN (hidden_typelist);

This change is to avoid running into the ICE:

  gcc_assert (hidden_typelist == NULL_TREE
              || TREE_VALUE (hidden_typelist) == void_type_node);

The purpose of this assert is to check that the TREE_TYPE (fndecl)
arg list and the one created by
  create_function_arglist (gfc_symbol * sym)
are the same (at least in terms of the number of arguments). Namely:
   typelist = TYPE_ARG_TYPES (TREE_TYPE (fndecl));
...
  hidden_typelist = typelist;

Tobias


On 11.03.21 18:15, Tobias Burnus wrote:

This fixes an ICE with OpenMP 'omp declare simd' but is a generic bug.

Namely, TREE_TYPE (fndecl) does not match the arglist chain; it is
missing some hidden arguments with -fcoarray=lib.

OK for mainline and GCC 10?

Tobias



[RFC] Fortran: OpenMP (Coarray?) – handling transfer/mapping of allocatable components, esp. polymorphic ones

2021-03-18 Thread Tobias Burnus

Fortran itself: the suggestion is to add a new entry to the vtable
(a breaking change) — thus, please also comment even if you are not
interested in OpenMP (or coarrays).

For OpenMP: When mapping a derived type to a non-shared-memory
(accelerator/GPU) device, it gets complicated with (polymorphic)
allocatable components — as OpenMP requires a deep copy of
_allocatable_ components.
[Side note: 'virtual calls' on the device are also permitted,
i.e. the vtable also has to be mapped properly.]

For coarrays: I thought there was the same issue with CO_REDUCE
(arbitrary type w/ user-defined reduction function), but I now think
that I either missed a constraint or that J3/WG5 forgot to add one.
See the thread starting with my just-sent email (no reply so far)
to J3: https://mailman.j3-fortran.org/pipermail/j3/2021-March/012965.html

[C++: Side note – OpenMP 5.1 now also permits virtual calls;
but the deep-copy problem does not seem to exist (except next item?).]

For OpenMP, I think there is a relation between this issue and
how MAPPER might be implemented. — However, I have not looked at
mappers, hence, it could be a completely separate implementation or not.

 * * *

(A) EXAMPLES AS PREREMARK

  type recursive_t
    ! Recursive component (must be allocatable or pointer); OpenMP: valid since 5.1
    type(recursive_t), allocatable :: A
  end type

  type t
  end type t
  type, extends(t) :: t2
    integer, allocatable :: A(:)  ! allocatable component
  end type t2
  type t3
    class(t), allocatable :: C  ! allocatable polymorphic component
  end type t3

  type(recursive_t) :: rt, rtc[*]
  class(t), allocatable :: B
  type(t3) :: C[*]
...
  !$omp target enter data map(to:B, rt, C)
...
  call CO_REDUCE (rtc, my_reduct_proc, result_image=1)
  call CO_REDUCE (C, my_reduct_proc3, result_image=1)


And for OpenMP also the following (virtual call on device):

  class(*), intent(in) :: dummy_class
  !$omp target map(to:dummy_class)
  select type (dummy_class)
  class is (my_cmplx_class)
    call dummy_class%type_bound_proc(5) ! TBP / virtual call
  end select
  !$omp end target



(B) DESCRIPTION OF THE PROBLEM

Coarrays: While there are some restrictions regarding the use of coarrays,
with user-defined reductions in particular, data has to be accessed on the
remote image, while this_image() has only limited information about the
details on the remote image.

OpenMP: While OpenMP 4.5 mostly avoided all pitfalls, 5.0 permitted a lot
more and 5.1 removed additional restrictions. For unified shared memory
or when not using 'target' constructs, there is no issue beyond the normal
Fortran issues (e.g. data-sharing firstprivate with polymorphic variables).
However, when the memory is not shared it becomes harder.


In any case, the information is distributed over several places:

* Run-time library
  libgomp: knows how to transfer the data between the host and the
  device and update pointers
  libcaf: knows how to access remote memory. I think pointer mapping
  (like remote vs. local vtable) is not required, but it looks as
  if the vtab->hash value has to be obtainable for
  same_type_as(var[i], var[j])

* Type and associated vtable: At the location of the type declaration and
  vtable generation, all details about the type are known (except for
  array bounds and the depth of recursive types, which are both
  only known at run time).

* Code location which calls into the library (OpenMP construct,
  CO_REDUCE call, etc.):
  Here, both the need for the data transfer and the declared type are known
  – including which parts have to be handled in loop form
  (for A%B(:)%recursive, the compiler can generate an outer loop over A%B(i)
  and then an inner loop over A%B(i)%recursive%recursive%...%recursive).

  For the data ref actually used, the compiler can also add code to handle
  the dynamic type of the last part-ref – that is, both the vtable and
  obtaining vtab->size or similar.
  But if the last part-ref is of polymorphic type, neither
  allocatable nonpolymorphic nor allocatable polymorphic components
  are known at the code location.


(C) CURRENT LIB SUPPORT

(1) For OpenMP

The current code generation does not permit run-time dependent
mapping as everything is folded into a single libgomp mapping call:
   map(A)
may become something like:
   map(to:a.p [len: 64]) map(to:*a.p.data [len: D.3953 * 4])
   map(always_pointer: a.p.data [pointer assign, bias: 0]) ...
which then calls
   __builtin_GOMP_target_enter_exit_data (-1, 1, &.omp_data_arr.4,
   &.omp_data_sizes.5, &.omp_data_kinds.6, 0, 0B);

Taking the example of the recursive type (valid since OpenMP 5.1), we would
need something like:
   __builtin_GOMP_target_enter_exit_data_begin
   → map 'A'
   prev = A;
   for (ptr = A.rt; ptr != NULL; ptr = ptr->rt, prev = prev->rt)
     map (ptr) map(alwaysptr: ptr → prev%rt)
   __builtin_GOMP_target_enter_exit_data_end

(2) Likewise for coarrays, which also only get as argument:
  _gfortran_caf_co_reduce (gfc_descriptor_t *a, ..., int a_len)
and this likewise does not help with allocatable components.

While for simple cases, a loop would do (

Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Richard Biener via Fortran
On Thu, Mar 18, 2021 at 8:49 AM Steve Kargl via Fortran wrote:
>
> It seems that gfortran will inline MATMUL with optimization.
> This produces very poor performance.  In fact, gfortran will
> inline MATMUL even if one specifies -fexternal-blas.  This is
> very bad.
>
> % cat a.f90
> program main
>
>implicit none
>
>integer, parameter :: imax = 2, jmax = 1
>real, allocatable :: inVect(:), matrix(:,:), outVect(:)
>real :: start, finish
>
>allocate(invect(imax), matrix(imax,jmax), outvect(jmax))
>
>call random_number(inVect)
>call random_number(matrix)
>
>call cpu_time(start)
>outVect = matmul(inVect, matrix)
>call cpu_time(finish)
>
>print '("Time = ",f10.7," seconds. – First Value = 
> ",f10.4)',finish-start,outVect(1)
> end program main
>
> % gfcx -o z -O0 a.f90 && ./z
> Time =  0.2234111 seconds. – First Value =  4982.6362
> % nm z | grep matmul
>  U _gfortran_matmul_r4@@GFORTRAN_8
> % gfcx -o z -O1 a.f90 && ./z
> Time =  0.3295890 seconds. – First Value =  4971.0962
> % nm z | grep matmul
> % gfcx -o z -O2 a.f90 && ./z
> Time =  0.3299561 seconds. – First Value =  5025.4902
> % nm z | grep matmul
> % gfcx -o z -O2 -fexternal-blas a.f90 && ./z
> Time =  0.3295580 seconds. – First Value =  5022.8291
>
> This last one is definitely broken.  I did not link with
> an external BLAS library.  Please fix before 11.1 is
> released.

Since the libgfortran MATMUL should be vectorized
I think it's not reasonable to inline any but _very_ small
MATMUL at optimization levels that do not enable vectorization.

Richard.

>
> --
> Steve


Re: [Patch, fortran] PR99602 - [11 regression] runtime error: pointer actual argument not associated

2021-03-18 Thread Paul Richard Thomas via Fortran
Hi Tobias,

Thanks for the review. I am resisting dg-run for this patch simply because
the testsuite already takes an oppressive amount of time to run. That the
runtime error is present in the code should be sufficient IMHO.

Regards

Paul


On Thu, 18 Mar 2021 at 08:46, Tobias Burnus wrote:

> Hi Paul, hi all fortran@/gcc-patches@ readers,
>
> it looks as if you replied with your patch submission to the wrong email
> address – and your re-submission ended up at
> https://gcc.gnu.org/PR99602#c17
>
> On 16.03.21 18:08, Tobias Burnus wrote:
> > On 16.03.21 17:42, Paul Richard Thomas via Gcc-patches wrote:
> >> Fortran: Fix runtime errors for class actual arguments [PR99602].
> >> * trans-array.c (gfc_conv_procedure_call): For class formal
> >> arguments, use the _data field attributes for runtime errors.
> >> * gfortran.dg/pr99602.f90: New test.
> > Shouldn't there also be a testcase which triggers this run-time error?
>
> Note: The new submission consists of a new testcase (now two) and the
> actual patch; the new testcase removes 'pointer' from the dummy argument
> of prepare_m2_proc/prepare_whizard_m2 and checks via the
> -fdump-tree-original dump that run-time check code is now inserted when
> passing a (nullified) pointer to a nonpointer dummy argument.
>
> Compared to the previous patch, 'fsym_attr.pointer =
> fsym_attr.class_pointer' is new; before, it was 'fsym_attr.pointer =
> fsym_attr.pointer'.
>
> Paul Richard Thomas wrote in PR99602:
>
> > Good morning all,
> >
> > I have attached the revised patch and an additional testcase. I had
> totally
> > forgotten about the class pointer gotcha.
> >
> > OK for master?
> >
> > Paul
> >
> > Fortran: Fix runtime errors for class actual arguments [PR99602 <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99602>].
>
> LGTM – thanks for the patch.
>
> I am wondering whether the second testcase should be a 'dg-do run' test
> instead of 'compile' to ensure that the error is indeed triggered
> (currently, it only checks in the tree dump that a check is inserted). What
> do you think? [If you do so, you need a dg-shouldfail + dg-output, cf.
> e.g. pointer_check_5.f90.]
>
> Thanks,
>
> Tobias
>


-- 
"If you can't explain it simply, you don't understand it well enough" -
Albert Einstein


Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Tobias Burnus

Richard,

On 18.03.21 13:35, Richard Biener via Fortran wrote:

[...]
Since the libgfortran MATMUL should be vectorized
I think it's not reasonable to inline any but _very_ small
MATMUL at optimization levels that do not enable vectorization.


Besides the obvious if (!flag_external_blas) which should always prevent
inlining (possibly except for tiny N like N=1), your idea is 'if (N
small || flag_tree_loop_vectorize)'?

Or are you thinking of a different or additional flag_... than
flag_tree_loop_vectorize for making this choice?

Tobias



Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Richard Biener via Fortran
On Thu, Mar 18, 2021 at 3:48 PM Tobias Burnus wrote:
>
> Richard,
>
> On 18.03.21 13:35, Richard Biener via Fortran wrote:
> > [...]
> > Since the libgfortran MATMUL should be vectorized
> > I think it's not reasonable to inline any but _very_ small
> > MATMUL at optimization levels that do not enable vectorization.
>
> Besides the obvious if (!flag_external_blas) which should always prevent
> inlining (possibly except for tiny N like N=1), your idea is 'if (N
> small || flag_tree_loop_vectorize)'?
>
> Or are you thinking of a different or additional flag_... than
> flag_tree_loop_vectorize for making this choice?

Yes, I was thinking of flag_tree_loop_vectorize.  Of course libgfortran
is far from having micro-optimized matmul for various architectures
but IIRC it uses attribute(target) to provide several overloads.  So
maybe only ever inlining tiny matmul makes sense as well (does the
runtime have specializations for small sizes?)

Richard.

> Tobias
>


Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Steve Kargl via Fortran
On Thu, Mar 18, 2021 at 04:05:40PM +0100, Richard Biener wrote:
> On Thu, Mar 18, 2021 at 3:48 PM Tobias Burnus wrote:
> >
> > Richard,
> >
> > On 18.03.21 13:35, Richard Biener via Fortran wrote:
> > > [...]
> > > Since the libgfortran MATMUL should be vectorized
> > > I think it's not reasonable to inline any but _very_ small
> > > MATMUL at optimization levels that do not enable vectorization.
> >
> > Besides the obvious if (!flag_external_blas) which should always prevent
> > inlining (possibly except for tiny N like N=1), your idea is 'if (N
> > small || flag_tree_loop_vectorize)'?
> >
> > Or are you thinking of a different or additional flag_... than
> > flag_tree_loop_vectorize for making this choice?
> 
> Yes, I was thinking of flag_tree_loop_vectorize.  Of course libgfortran
> is far from having micro-optimized matmul for various architectures
> but IIRC it uses attribute(target) to provide several overloads.  So
> maybe only ever inlining tiny matmul makes sense as well (does the
> runtime have specializations for small sizes?)
> 

With -fexternal-blas, there is a cross-over value of N=30,
which can be changed by the -fblas-matmul-limit=N option.

I forgot the important example, but Thomas seems to be aware of it.

% gfcx -o z -O2 -fno-frontend-optimize -fexternal-blas a.f90 && ./z
/usr/local/bin/ld: /tmp/ccOe3VoD.o: in function `MAIN__':
a.f90:(.text+0x156): undefined reference to `sgemm_'
collect2: error: ld returned 1 exit status

sgemm_ would come from a tuned BLAS library such as OpenBLAS.
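
For completeness, linking such a library makes the same command line work
(assuming an installed OpenBLAS providing sgemm_):

% gfcx -o z -O2 -fno-frontend-optimize -fexternal-blas a.f90 -lopenblas && ./z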

I was going to suggest adding a testcase that scans a dump
for sgemm.  It seems matmul_blas_1.f tests the -fexternal-blas
and -fblas-matmul-limit=N options, but it doesn't look for sgemm.
This, I believe, does the checking:

diff --git a/gcc/testsuite/gfortran.dg/matmul_blas_1.f b/gcc/testsuite/gfortran.dg/matmul_blas_1.f
index 6a88981c9d7..52298d09cce 100644
--- a/gcc/testsuite/gfortran.dg/matmul_blas_1.f
+++ b/gcc/testsuite/gfortran.dg/matmul_blas_1.f
@@ -237,4 +237,4 @@ C Test calling of BLAS routines
   if (any (c /= cres)) stop 20
 
   end
-! { dg-final { scan-tree-dump-times "_gfortran_matmul" 0 "optimized" } }
+! { dg-final { scan-tree-dump "sgemm" "optimized" } }

-- 
Steve


Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Thomas Koenig via Fortran

OK, so I've had a bit of time to look at the actual test case.  I
missed one very important detail before:  This is a vector-matrix
operation.

For this, we do not have a good library routine (Harald just
removed it because of a bug in buffering), and -fexternal-blas
does not work because we do not handle calls to anything but
*GEMM.

The idea is that, for a vector-matrix multiplication, the
compiler should have enough information


Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Thomas Koenig via Fortran

I didn't finish the previous mail before hitting "send", so here
is the postscript...


OK, so I've had a bit of time to look at the actual test case.  I
missed one very important detail before:  This is a vector-matrix
operation.

For this, we do not have a good library routine (Harald just
removed it because of a bug in buffering), and -fexternal-blas
does not work because we do not handle calls to anything but
*GEMM.


A vector-matrix multiplication would be a call to *GEMV, a worthy
goal, but out of scope so close to a release.


The idea is that, for a vector-matrix multiplication, the
compiler should have enough information about how to optimize
for the relevant architecture, especially if the user compiles
with the right flags.

So, the current idea is that, if we optimize, we can inline.

What would a better heuristic be?

Best regards

Thomas


Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Steve Kargl via Fortran
On Thu, Mar 18, 2021 at 07:24:21PM +0100, Thomas Koenig wrote:
> I didn't finish the previous mail before hitting "send", so here
> is the postscript...
> 
> > OK, so I've had a bit of time to look at the actual test case.  I
> > missed one very important detail before:  This is a vector-matrix
> > operation.
> > 
> > For this, we do not have a good library routine (Harald just
> > removed it because of a bug in buffering), and -fexternal-blas
> > does not work because we do not handle calls to anything but
> > *GEMM.
> 
> A vector-matrix multiplication would be a call to *GEMV, a worthy
> goal, but out of scope so close to a release.

Agreed.

> > The idea is that, for a vector-matrix multiplication, the
> > compiler should have enough information about how to optimize
> for the relevant architecture, especially if the user compiles
> with the right flags.
> 
> So, the current idea is that, if we optimize, we can inline.
> 
> What would a better heuristic be?
> 

Does _gfortran_matmul_r4 (and friends) work for vector-matrix
products?  I haven't checked.  If so, how about disabling
in-lining MATMUL for 11.1; then, for 11.2, this can be revisited
where a small N can be chosen for in-lining.  With -fexternal-blas
and *gemm, the default cross-over is N = 30.
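
Side note: unless I misread the options, the front-end inlining can
already be disabled per invocation as a workaround:

% gfcx -o z -O2 -finline-matmul-limit=0 a.f90 && ./z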

BTW, I came across this on StackOverflow:

https://stackoverflow.com/questions/66682180/why-is-matmul-slower-with-gfortran-compiler-optimization-turned-on

-- 
Steve


Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Thomas Koenig via Fortran



On 18.03.21 at 21:22, Steve Kargl wrote:

On Thu, Mar 18, 2021 at 07:24:21PM +0100, Thomas Koenig wrote:

I didn't finish the previous mail before hitting "send", so here
is the postscript...


OK, so I've had a bit of time to look at the actual test case.  I
missed one very important detail before:  This is a vector-matrix
operation.

For this, we do not have a good library routine (Harald just
removed it because of a bug in buffering), and -fexternal-blas
does not work because we do not handle calls to anything but
*GEMM.


A vector-matrix multiplication would be a call to *GEMV, a worthy
goal, but out of scope so close to a release.


Agreed.


The idea is that, for a vector-matrix multiplication, the
compiler should have enough information about how to optimize
for the relevant architecture, especially if the user compiles
with the right flags.

So, the current idea is that, if we optimize, we can inline.

What would a better heuristic be?



Does _gfortran_matmul_r4 (and friends) work for vector-matrix
products?


Yes.


I haven't checked.  If so, how about disabling
in-lining MATMUL for 11.1;


Absolutely not for the general case. This would cause a huge regression
in execution time for 2*2 matrices, and also for small matrix-vector
multiplications.

What we could do is enable the inlining for vector*matrix only
at -O2 or higher. Again, this will mean a penalty for smaller loops,
but at less than -O2, people probably don't care too much.

If there is agreement on that, I will prepare a patch.

Regards

Thomas


Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Steve Kargl via Fortran
On Thu, Mar 18, 2021 at 09:55:27PM +0100, Thomas Koenig wrote:
> 
> > I haven't checked.  If so, how about disabling
> > in-lining MATMUL for 11.1;
> 
> Absolutely not for the general case. This would cause a huge regression
> in execution time for 2*2 matrices, and also for small matrix-vector
> multiplications.
> 
> What we could do is enable the inlining for vector*matrix only
> at -O2 or higher. Again, this will mean a penalty for smaller loops,
> but at less than -O2, people probably don't care too much.
> 

On my old Core2 CPU, a quick test with NxN matrices suggests a
cross-over near N=1000 for REAL(4).  This CPU doesn't have any AVX*
instructions, so YMMV.  Program follows .sig

-- 
Steve

program t

   implicit none

   character(len=10) str
   integer i, j
   integer, parameter :: &
   &  n(10) = [100, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 1]
   real t0, t1, t3, t4
   real, allocatable :: a(:), b(:,:), c(:)
   !
   ! Loop over n(j) array.  Run each test 5 times and average.
   !
   do j = 1, 10

      allocate(a(n(j)), b(n(j),n(j)), c(n(j)))

      a = 1
      b = 1

      t3 = 0
      do i = 1, 5
         call cpu_time(t0)
         c = matmul(a, b)
         call cpu_time(t1)
         t3 = t3 + (t1 - t0)
         if (c(1) /= n(j)) stop 1
      end do

      t4 = 0
      do i = 1, 5
         call cpu_time(t0)
         c = matmul(b, a)
         call cpu_time(t1)
         t4 = t4 + (t1 - t0)
         if (c(1) /= n(j)) stop 2
      end do

      print '(I5,1X,2(F8.4,1X))', n(j), (t3/5) * 1000, (t4/5) * 1000

      deallocate(a, b, c)
   end do

end program t



Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Thomas Koenig via Fortran



Hi Steve,


On my old Core2 CPU, a quick test with NxN matrices suggests a
cross-over near N=1000 for REAL(4).  This CPU doesn't have any AVX*
instructions, so YMMV.  Program follows .sig


Looking at your data with AVX (which I think we can mostly count
on now),

- The library is always faster for matmul(vector,matrix) for any n >= 100
- For matmul(matrix,vector) there is no appreciable difference

So, putting in the same inline limits for matmul(vector,matrix)
that we have for matmul(matrix,matrix), and leaving
matmul(matrix,vector) alone, seems like a reasonable thing to do.

I'll work on a patch.

Regards

Thomas


Re: MATMUL broken with frontend optimization.

2021-03-18 Thread Thomas Koenig via Fortran


On 19.03.21 at 07:19, Thomas Koenig wrote:


I'll work on a patch.


So, here's a concept patch.  It still needs a ChangeLog and testsuite
adjustment, but this is what I would propose to use.

Regards

Thomas
diff --git a/gcc/fortran/frontend-passes.c b/gcc/fortran/frontend-passes.c
index cfc47471cf1..7d3eae67632 100644
--- a/gcc/fortran/frontend-passes.c
+++ b/gcc/fortran/frontend-passes.c
@@ -3307,7 +3307,7 @@ get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2)
removed by DCE. Only called for rank-two matrices A and B.  */
 
 static gfc_code *
-inline_limit_check (gfc_expr *a, gfc_expr *b, int limit)
+inline_limit_check (gfc_expr *a, gfc_expr *b, int limit, int rank_a)
 {
   gfc_expr *inline_limit;
   gfc_code *if_1, *if_2, *else_2;
@@ -3315,16 +3315,28 @@ inline_limit_check (gfc_expr *a, gfc_expr *b, int limit)
   gfc_typespec ts;
   gfc_expr *cond;
 
+  gcc_assert (rank_a == 1 || rank_a == 2);
+
   /* Calculation is done in real to avoid integer overflow.  */
 
   inline_limit = gfc_get_constant_expr (BT_REAL, gfc_default_real_kind,
                                         &a->where);
   mpfr_set_si (inline_limit->value.real, limit, GFC_RND_MODE);
-  mpfr_pow_ui (inline_limit->value.real, inline_limit->value.real, 3,
+
+  /* Set the limit according to the rank.  */
+  mpfr_pow_ui (inline_limit->value.real, inline_limit->value.real, rank_a + 1,
               GFC_RND_MODE);
 
   a1 = get_array_inq_function (GFC_ISYM_SIZE, a, 1);
-  a2 = get_array_inq_function (GFC_ISYM_SIZE, a, 2);
+
+  /* For rank_a == 1, use one as the size of a along the second
+     dimension, so as to avoid too much code duplication.  */
+
+  if (rank_a == 2)
+    a2 = get_array_inq_function (GFC_ISYM_SIZE, a, 2);
+  else
+    a2 = gfc_get_int_expr (gfc_index_integer_kind, &a->where, 1);
+
   b2 = get_array_inq_function (GFC_ISYM_SIZE, b, 2);
 
   gfc_clear_ts (&ts);
@@ -4243,11 +4255,13 @@ inline_matmul_assign (gfc_code **c, int *walk_subtrees,
   /* Take care of the inline flag.  If the limit check evaluates to a
  constant, dead code elimination will eliminate the unneeded branch.  */
 
-  if (flag_inline_matmul_limit > 0 && matrix_a->rank == 2
+  if (flag_inline_matmul_limit > 0
+      && (matrix_a->rank == 1 || matrix_a->rank == 2)
       && matrix_b->rank == 2)
     {
       if_limit = inline_limit_check (matrix_a, matrix_b,
-                                     flag_inline_matmul_limit);
+                                     flag_inline_matmul_limit,
+                                     matrix_a->rank);
 
       /* Insert the original statement into the else branch.  */
       if_limit->block->block->next = co;
@@ -4757,7 +4771,7 @@ call_external_blas (gfc_code **c, int *walk_subtrees ATTRIBUTE_UNUSED,
     return 0;
 
   /* Generate the if statement and hang it into the tree.  */
-  if_limit = inline_limit_check (matrix_a, matrix_b, flag_blas_matmul_limit);
+  if_limit = inline_limit_check (matrix_a, matrix_b, flag_blas_matmul_limit, 2);
   co_next = co->next;
   (*current_code) = if_limit;
   co->next = NULL;
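
As a usage sketch (my example, not part of the patch), the shape of the
case the extended limit check now covers:

   program vecmat
      implicit none
      real :: a(3), b(3,4), c(4)
      call random_number(a)
      call random_number(b)
      ! Rank-1 times rank-2: the patched inline_limit_check treats the
      ! (nonexistent) second dimension of a as having size 1.
      c = matmul(a, b)
      print *, c
   end program vecmat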