http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #5 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-22 18:34:48 UTC ---
(In reply to comment #3)
> is IMHO just a matter whether graphite can -floop-interchange this or not.
> If you swap manually the l and j for lines, the generated code looks better,
> though for some reason we unroll even the l loop which increases register
> pressure too much.

Unfortunately, the issue is not just loop ordering or loop unrolling. I have a code generator that systematically tries all possible loop orderings and all possible unroll factors. For this testcase (matrix sizes 4,10,10), the best Cray output (this one) runs at 10.8 Gflops, while the best gcc-compiled version (smm_dnn_4_10_10_1_1_10_2) runs at 4.7 Gflops. I attach the test code that I use for testing.
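
To give an idea of the search space such a generator covers, here is a minimal sketch. It is not the actual generator: the function names, the mapping of the index names i, j, l to the sizes 4, 10, 10, and the restriction of unroll factors to divisors of each loop bound are all assumptions made for illustration.

```python
from itertools import permutations, product


def divisors(n):
    """Return every positive divisor of n (assumed set of candidate unroll factors)."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_variants(m, n, k):
    """Yield (loop_order, unroll) pairs for C(m,n) += A(m,k) * B(k,n).

    loop_order lists the index names from outermost to innermost;
    unroll maps each index name to its unroll factor.
    The index-to-size mapping below is an assumption, not taken from the testcase.
    """
    bounds = {"i": m, "j": n, "l": k}
    names = sorted(bounds)
    for order in permutations(names):
        for factors in product(*(divisors(bounds[x]) for x in names)):
            yield order, dict(zip(names, factors))


variants = list(enumerate_variants(4, 10, 10))
print(len(variants))  # 6 orderings * (3 * 4 * 4) unroll combinations = 288
```

Each pair would then be turned into source code, compiled, and timed, and the fastest variant kept. That step is omitted here.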