Complex array multiply uses scalar instructions instead of packed instructions.

subroutine complex_mult_test(Iy, Ix, nx)
  implicit none
  integer(kind=kind(1)), intent(in) :: nx
  complex(kind=kind((1.0d0,1.0d0))), dimension(nx), intent(inout) :: Iy
  complex(kind=kind((1.0d0,1.0d0))), dimension(nx), intent(in) :: Ix

  Iy = Iy * Ix
end subroutine complex_mult_test

Code produced by GCC inside the loop body:

  movsd  0x8(%rsi),%xmm3
  movsd  (%rdi),%xmm5
  inc    %rax
  movsd  0x8(%rdi),%xmm4
  movsd  (%rsi),%xmm2
  add    $0x10,%rsi
  movapd %xmm3,%xmm1
  mulsd  %xmm5,%xmm3
  movapd %xmm2,%xmm0
  mulsd  %xmm4,%xmm1
  mulsd  %xmm5,%xmm0
  mulsd  %xmm4,%xmm2
  subsd  %xmm1,%xmm0
  addsd  %xmm3,%xmm2
  movsd  %xmm0,(%rdi)
  movsd  %xmm2,0x8(%rdi)

A complex multiply is (x0,y0)*(x1,y1) = (x0*x1-y0*y1, x0*y1+x1*y0).
This could be implemented using packed instructions. The following
instructions are useful:

  i.   movhpd, movddup, shufpd to arrange the data properly
  ii.  mulpd to do two multiplies at once
  iii. addsubpd to combine the addition and the subtraction

Hand coding gives 9 instructions:

  movupd   (%rdi),%xmm2       //xmm2: x0,y0
  movddup  (%rsi),%xmm0       //xmm0: x1,x1
  mulpd    %xmm2,%xmm0        //xmm0: x1*x0,x1*y0
  movddup  0x8(%rsi),%xmm1    //xmm1: y1,y1
  shufpd   $0x1,%xmm2,%xmm2   //xmm2: y0,x0
  mulpd    %xmm2,%xmm1        //xmm1: y0*y1,x0*y1
  addsubpd %xmm1,%xmm0        //xmm0: x0*x1-y0*y1,x1*y0+x0*y1
  movlpd   %xmm0,(%rdi)
  movhpd   %xmm0,0x8(%rdi)

Other relevant information:

1. Compile flags: -O3 -ffast-math -m64 -march=amdfam10

2. gfortran version:
   gfortran -v
   Using built-in specs.
   Target: x86_64-unknown-linux-gnu
   Configured with: /tmp/src/gcc-4.3.0/configure --prefix=/opt/amd/gcc-4.3.0 --enable-languages=c,c++,fortran --enable-stage1-checking --with-as=/opt/amd/gcc-4.3.0/bin/as --with-ld=/opt/amd/gcc-4.3.0/bin/ld --with-mpfr=/tmp/install/mpfr-2.3.0 --with-gmp=/tmp/install/gmp-4.2.2
   Thread model: posix
   gcc version 4.3.1 20080312 (prerelease) (GCC)

3. model name: AMD Phenom(tm) 8650 Triple-Core Processor

4.
   flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw

--
           Summary: Fortran complex array multiply missed optimization
           Product: gcc
           Version: 4.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: rajiv dot adhikary at amd dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36840