On 18.03.21 at 21:22, Steve Kargl wrote:
> On Thu, Mar 18, 2021 at 07:24:21PM +0100, Thomas Koenig wrote:
>> I didn't finish the previous mail before hitting "send", so here
>> is the postscript...
>>
>> OK, so I've had a bit of time to look at the actual test case. I
>> missed one very important detail before: this is a vector-matrix
>> operation.
>>
>> For this, we do not have a good library routine (Harald just
>> removed it because of a bug in buffering), and -fexternal-blas
>> does not work because we do not handle calls to anything but
>> *GEMM. A vector-matrix multiplication would be a call to *GEMV,
>> a worthy goal, but out of scope so close to a release.
> Agreed.
>> The idea is that, for a vector-matrix multiplication, the
>> compiler should have enough information about how to optimize
>> for the relevant architecture, especially if the user compiles
>> with the right flags.
>>
>> So, the current idea is that, if we optimize, we can inline.
>> What would a better heuristic be?
> Does _gfortran_matmul_r4 (and friends) work for vector-matrix
> products?
Yes.
> I haven't checked.  If so, how about disabling
> in-lining MATMUL for 11.1;
Absolutely not for the general case. This would cause a huge regression
in execution time for 2*2 matrices, and also for small matrix-vector
multiplications.
What we could do is enable the inlining for vector*matrix only
at -O2 or higher. Again, this will mean a penalty for smaller loops,
but at less than -O2, people probably don't care too much.
If there is agreement on that, I will prepare a patch.
Regards
Thomas