OK, so I've had a bit of time to look at the actual test case. I missed one very important detail before: This is a vector-matrix operation.
For this, we do not have a good library routine (Harald just removed it because of a bug in buffering), and -fexternal-blas does not work because we do not handle calls to anything but *GEMM. The idea is that, for a vector-matrix multiplication, the compiler should have enough information about the information