I have had a closer look to the optimization of the polyhedron test induct.f90. 3/4 of the runtime is spent in the subroutine 'mutual_ind_quad_cir_coil' and 1/4 in 'mutual_ind_quad_rec_coil'. The two subroutines contain two main loops with the following structure:
do i = 1, 2*m ... do j = 1, 9 ... do k = 1, 9 q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal) q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal) q_vector(3) = 0.0_longreal ! ! rotate quad vector into the global coordinate system ! rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:)) rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:)) rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:)) ! ! compute and add in quadrature term ! numerator = w1gauss(j) * w2gauss(k) * & dot_product(coil_current_vec,current_vector) denominator = sqrt(dot_product(rot_c_vector-rot_q_vector, & rot_c_vector-rot_q_vector)) l12_lower = l12_lower + numerator/denominator end do end do end do where the six first lines of code in the k loop do not depend on i nor j and can be computed in a 'do k = 1, 9' loop outside the main loop by replacing the length-three vector 'rot_q_vector' by a nine by three array. The original code induct.f90 gives the following timing (all with -O3 -ffast-math -funroll-loops): 93.227u 0.094s 1:33.32 99.9% 0+0k 0+0io 0pf+0w and output: ... Maximum wand/quad abs rel mutual inductance = 5.95379428444656467E-002 Minimum resq coil/quad abs rel mutual inductance = 0.0000000000000000 Maximum resq coil/quad abs rel mutual inductance = 9.63995250242061230E-002 ... Unrolling bye hand 'numerator' and 'denominator' gives (see http://gcc.gnu.org/ml/fortran/2007-11/msg00231.html): 65.563u 0.092s 1:05.66 99.9% 0+0k 0+0io 0pf+0w Looking at the assembly I can see that for the original code the inner loops in k are not unrolled, as guessed by Paul Thomas (only the implied vector loops being unrolled). QUESTION 1: Should the frontend do the unrolling for small vectors itself? or should the middleend be more aggressive for nested loops with small known iterations? Moving the invariants on i and j in the k loops outside the main loops gives: 80.313u 0.074s 1:20.39 99.9% 0+0k 0+1io 0pf+0w Combining the two hand optimizations gives: 35.925u 0.040s 0:35.97 99.9% 0+0k 0+0io 0pf+0w (without -ffast-math the timing is 59.263u 0.067s 0:59.33 99.9% 0+0k 0+1io 0pf+0w) but the results change to: Maximum wand/quad abs rel mutual inductance = 5.95379428444656675E-002 Minimum resq coil/quad abs rel mutual inductance = 0.0000000000000000 Maximum resq coil/quad abs rel mutual inductance = 9.63995250242059842E-002 ( Maximum wand/quad abs rel mutual inductance = 5.95379428444659520E-002 Minimum resq coil/quad abs rel mutual inductance = 0.0000000000000000 Maximum resq coil/quad abs rel mutual inductance = 9.63995250242060675E-002 without -ffast-math). The attached file gives the differences between the original code and the three variants. This is to be compared to further optimizations (indu.v2.f90) leading to: 35.376u 0.062s 0:35.44 99.9% 0+0k 0+1io 0pf+0w or after merging the two loops (indu.v3.f90) to: 34.452u 0.041s 0:34.49 100.0% 0+0k 0+0io 0pf+0w I have counted the number of sqrt() in the assembly code and found 9 of them in the slow codes while I only found 5 (10 for indu.v3.f90) of them for the fast codes. I checked that this was not due to common subexpressions I may have missed. Looking more closely to the assemby I saw that he slow codes used 'sqrtsd' while the fast ones used 'sqrtpd' along with several other packed operations. Now 65.66*5/9=36.48 explaining most of the speedup. Note that 5=9/2+1, i.e., four packed computations followed by a scalar one. Owing to the same structure in the two loops, the two scalar computations could be merged in a packed one, but it is missed by the optimizer. I have tried without success to trigger this "vectorization" by making some variables vectors in k. QUESTION 2: Why the optimizer is able to vectorize in some cases and not in others? Can the frontend help to vectorize? -- Summary: Missed optimizations Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: dominiq at lps dot ens dot fr GCC build triplet: i686-apple-darwin9 GCC host triplet: i686-apple-darwin9 GCC target triplet: i686-apple-darwin9 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265