http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #5 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-04-17 14:12:30 UTC ---

I have investigated why test_fpu is slower with --param max-inline-insns-auto=400 (11.18s) compared to -finline-limit=600 (10.84s) in the timings of comment #2. This is due to the inlining of dgemm in the fourth test, Lapack 2:

[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param max-inline-insns-auto=385 test_lap.f90
[macbook] lin/test% time a.out
Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.000000000000250
                             total =  2.6 sec
2.824u 0.081s 0:02.90 100.0% 0+0k 0+0io 0pf+0w

[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param max-inline-insns-auto=386 test_lap.f90
[macbook] lin/test% time a.out
Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  3.0 sec  Err= 0.000000000000250
                             total =  3.0 sec
3.214u 0.082s 0:03.29 100.0% 0+0k 0+0io 0pf+0w

Looking at the assembly, I see 'call _dgemm_' three times for 385 and none for 386 (note there are only two calls in the code: one in dgetri, always inlined, and one in dgetrf, not inlined).

It would be interesting to understand why inlining dgemm slows down the code.
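For anyone who wants to repeat the assembly check, here is a minimal sketch of one way to count the remaining calls. It is not necessarily how the counts above were obtained. The helper name count_calls and the output file names are hypothetical, and it assumes that gfc is a local alias for gfortran and that the symbol is spelled _dgemm_ as on Darwin (on other targets it may appear as dgemm_).

```shell
#!/bin/sh
# count_calls FILE SYMBOL
# Print the number of lines in the assembly FILE that contain "call SYMBOL".
# Hypothetical helper, for illustration only.
count_calls() {
    # -w makes the match end at a word boundary, so _dgemm_ does not also
    # count a longer symbol that merely starts with _dgemm_.
    # grep -c prints 0 and exits with status 1 when nothing matches;
    # "|| true" keeps that case from looking like a failure.
    grep -c -E -w "call[[:space:]]+$2" "$1" || true
}

# Assumed usage (left commented out: it needs gfortran and test_lap.f90):
#   gfortran -S -Ofast -funroll-loops -fstack-arrays \
#       --param max-inline-insns-auto=385 test_lap.f90 -o test_lap_385.s
#   gfortran -S -Ofast -funroll-loops -fstack-arrays \
#       --param max-inline-insns-auto=386 test_lap.f90 -o test_lap_386.s
#   count_calls test_lap_385.s _dgemm_
#   count_calls test_lap_386.s _dgemm_
```

Running the helper on the assembly produced with each --param value makes it easy to find the threshold at which the calls disappear.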