On Sun, Nov 13, 2016 at 04:08:50PM -0800, Jerry DeLisle wrote:
> Hi all,
>
> Attached patch implements a fast blocked matrix multiply. The basic algorithm
> is derived from netlib.org tuned blas dgemm. See matmul.m4 for reference.
>
> The matmul() function is compiled with -Ofast -funroll-loops. This can be
> customized further if there is an undesired optimization being used. This is
> accomplished using #pragma optimize ( string ).
>
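
For anyone following along without the patch open, here is a minimal sketch of
the two ideas in the quoted text: a blocked (tiled) matrix multiply, and
per-function optimization options set from the source file. It is not the code
from the patch or from matmul.m4. The function name matmul_blocked, the BLOCK
tile size, and the row-major square layout are all assumptions made for
illustration. It also assumes GCC, which spells the pragma
"#pragma GCC optimize".

```c
#include <stddef.h>

/* Assumption: GCC. These pragmas set options only for the functions
   defined between push_options and pop_options. */
#pragma GCC push_options
#pragma GCC optimize ("Ofast", "unroll-loops")

#define BLOCK 32  /* hypothetical tile size, not taken from the patch */

/* Smaller of two sizes; used to clip a tile at the matrix edge. */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* C = A * B for n x n row-major matrices, computed tile by tile so the
   working set of the inner loops stays small. */
void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t imax = min_size(ii + BLOCK, n);
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t kmax = min_size(kk + BLOCK, n);
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t jmax = min_size(jj + BLOCK, n);
                /* Multiply one tile of A by one tile of B and
                   accumulate into the matching tile of C. */
                for (size_t i = ii; i < imax; i++) {
                    for (size_t k = kk; k < kmax; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}

#pragma GCC pop_options
```

The innermost j loop is the one -funroll-loops would be expected to act on, so
it is also where an unroll limit would matter.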
Did you run any tests with '--param max-unroll-times=4', or with some value
other than 4? On troutmask, with my code, I've found that 4 seems to work
best with -funroll-loops.

-- 
Steve