[Bug tree-optimization/34265] New: Missed optimizations

dominiq at lps dot ens dot fr Wed, 28 Nov 2007 07:27:26 -0800

I have had a closer look to the optimization of the polyhedron test 
induct.f90. 
3/4 of the runtime is spent in the subroutine 'mutual_ind_quad_cir_coil' and
1/4 in 'mutual_ind_quad_rec_coil'.
The two subroutines contain two main loops with the following structure:


do i = 1, 2*m
    ...
    do j = 1, 9
        ...
        do k = 1, 9
            q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
            q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal)
            q_vector(3) = 0.0_longreal
!
!       rotate quad vector into the global coordinate system
!
            rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
            rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
            rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
!
!       compute and add in quadrature term
!
            numerator = w1gauss(j) * w2gauss(k) *                         &
                        dot_product(coil_current_vec,current_vector)
            denominator = sqrt(dot_product(rot_c_vector-rot_q_vector,     &
                                           rot_c_vector-rot_q_vector))
            l12_lower = l12_lower + numerator/denominator
        end do
    end do
end do

where the six first lines of code in the k loop do not depend on i nor j
and can be computed in a 'do k = 1, 9' loop outside the main loop by
replacing the length-three vector 'rot_q_vector' by a nine by three array.

The original code induct.f90 gives the following timing (all with -O3 
-ffast-math -funroll-loops):

93.227u 0.094s 1:33.32 99.9%    0+0k 0+0io 0pf+0w

and output:

...
Maximum wand/quad abs rel mutual inductance =   5.95379428444656467E-002
Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
Maximum resq coil/quad abs rel mutual inductance =   9.63995250242061230E-002
...

Unrolling bye hand 'numerator' and 'denominator' gives (see 
http://gcc.gnu.org/ml/fortran/2007-11/msg00231.html):

65.563u 0.092s 1:05.66 99.9%    0+0k 0+0io 0pf+0w

Looking at the assembly I can see that for the original code the inner loops in
k are not  unrolled, as guessed by Paul Thomas (only the implied vector loops
being unrolled).

QUESTION 1: Should the frontend do the unrolling for small vectors itself? or
should the middleend be more aggressive for nested loops with small known
iterations?

Moving the invariants on i and j in the k loops outside the main loops gives:

80.313u 0.074s 1:20.39 99.9%    0+0k 0+1io 0pf+0w

Combining the two hand optimizations gives:

35.925u 0.040s 0:35.97 99.9%    0+0k 0+0io 0pf+0w

(without -ffast-math the timing is
59.263u 0.067s 0:59.33 99.9%    0+0k 0+1io 0pf+0w)

but the results change to:

Maximum wand/quad abs rel mutual inductance =   5.95379428444656675E-002
Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
Maximum resq coil/quad abs rel mutual inductance =   9.63995250242059842E-002

( Maximum wand/quad abs rel mutual inductance =   5.95379428444659520E-002
  Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
  Maximum resq coil/quad abs rel mutual inductance =   9.63995250242060675E-002
without -ffast-math).

The attached file gives the differences between the original code and the three
variants.

This is to be compared to further optimizations (indu.v2.f90) leading to:

35.376u 0.062s 0:35.44 99.9%    0+0k 0+1io 0pf+0w

or after merging the two loops (indu.v3.f90) to:

34.452u 0.041s 0:34.49 100.0%   0+0k 0+0io 0pf+0w

I have counted the number of sqrt() in the assembly code and found 9 of them in
the slow codes while I only found 5 (10 for indu.v3.f90) of them for the fast
codes. I checked that this was not due to common 
subexpressions I may have missed. Looking more closely to the assemby I saw
that he slow codes used 'sqrtsd' while the fast ones used 'sqrtpd' along with
several other packed operations. Now 65.66*5/9=36.48 explaining most of the
speedup. Note that 5=9/2+1, i.e., four packed computations 
followed by a scalar one.  Owing to the same structure in the two loops, the
two scalar computations could be merged in a packed one, but it is missed by
the optimizer.

I have tried without success to trigger this "vectorization" by making some
variables vectors in k.

QUESTION 2:  Why the optimizer is able to vectorize in some cases and not in
others? Can the frontend help to vectorize?


-- 
           Summary: Missed optimizations
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: dominiq at lps dot ens dot fr
 GCC build triplet: i686-apple-darwin9
  GCC host triplet: i686-apple-darwin9
GCC target triplet: i686-apple-darwin9


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265

[Bug tree-optimization/34265] New: Missed optimizations

Reply via email to