optimization question

2009-05-16 Thread VandeVondele Joost
the attached code (see contract__sparse) is a kernel which I hope gets
optimized well. Unfortunately, compiling it (on opteron or core2) as


gfortran -O3 -march=native -ffast-math -funroll-loops \
  -ffree-line-length-200 test.f90



./a.out

 Sparse: time[s]   0.66804099
 New: time[s]   0.20801300
     speedup    3.2115347
      Gflops    3.1151900
 Error:   1.11022302462515654E-016

shows that the hand-optimized version (see contract__test) is about 3x 
faster. I played around with options, but couldn't get gcc to generate 
fast code for the original source. I think that this would involve 
unrolling a loop and scalarizing the scratch arrays buffer1 and buffer2 
(as done in the hand-optimized version). So, is there any combination of 
options to get that effect?


Second question, even the code generated for the hand-optimized version is 
not quite ideal. The asm of the inner loop appears (like the source) to 
contain about 4*81 multiplies. However, a 'smarter' way to do the 
calculation would be to compute the constants used for multiplying work(i) 
by retaining common subexpressions (i.e. all values of sa_i * sb_j * sc_k 
* sd_l * work[n] can be computed in 9+9+81+81 multiplies instead of the 
current scheme, which has 4*81). That could bring another factor of 2 
speedup. Is there a chance to have gcc see this, or does this need to be
done at the source level?
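
To illustrate (a sketch with made-up names sa, sb, sc, sd and result, not
the actual testcase code), the factored scheme would look something like:

  ! 9 + 9 + 81 + 81 = 180 multiplies instead of the current 4*81 = 324
  REAL(dp) :: sa(3), sb(3), sc(3), sd(3), ab(3,3), cd(3,3), coef
  REAL(dp) :: work(81), result(81)
  INTEGER :: i, j, k, l, n

  DO j = 1, 3
    DO i = 1, 3
      ab(i,j) = sa(i)*sb(j)            ! 9 multiplies
      cd(i,j) = sc(i)*sd(j)            ! 9 multiplies
    END DO
  END DO
  n = 0
  DO i = 1, 3
    DO j = 1, 3
      DO k = 1, 3
        DO l = 1, 3
          n = n + 1
          coef = ab(i,j)*cd(k,l)       ! 81 multiplies
          result(n) = coef*work(n)     ! 81 multiplies
        END DO
      END DO
    END DO
  END DO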


If considered useful, I can add a PR to bugzilla with the testcase.

Joost

MODULE TEST
  IMPLICIT NONE
  INTEGER :: l
  INTEGER, PARAMETER :: dp=8
  INTEGER, PARAMETER :: nco(0:3)=(/((l+1)*(l+2)/2,l=0,3)/)
  INTEGER, PARAMETER :: nso(0:3)=(/(2*l+1,l=0,3)/)
CONTAINS
  ! Contract the 81-element work array with the four (sparse) transformation
  ! matrices sphi_a..sphi_d, one index at a time, accumulating into primitives.
  SUBROUTINE contract__sparse(work, &
      nl_a, nl_b, nl_c, nl_d, &
      sphi_a, sphi_b, sphi_c, sphi_d, &
      primitives, &
      s_offset_a, s_offset_b, s_offset_c, s_offset_d)
    REAL(dp), DIMENSION(3*3*3*3), INTENT(IN) :: work
    INTEGER :: nl_a, nl_b, nl_c, nl_d
    REAL(dp), DIMENSION(3,3*nl_a), INTENT(IN) :: sphi_a
    REAL(dp), DIMENSION(3,3*nl_b), INTENT(IN) :: sphi_b
    REAL(dp), DIMENSION(3,3*nl_c), INTENT(IN) :: sphi_c
    REAL(dp), DIMENSION(3,3*nl_d), INTENT(IN) :: sphi_d
    REAL(dp), DIMENSION(3*nl_a,3*nl_b,3*nl_c,3*nl_d) :: primitives
    INTEGER, INTENT(IN) :: s_offset_a, s_offset_b, s_offset_c, s_offset_d
    REAL(dp), DIMENSION(3*3*3*3) :: buffer1, buffer2
    INTEGER :: imax, jmax, kmax, ia, ib, ic, id, &
               s_offset_a1, s_offset_b1, s_offset_c1, s_offset_d1, &
               i1, i2, i3, i, j, k

    s_offset_a1 = 0
    DO ia = 1,nl_a
      s_offset_b1 = 0
      DO ib = 1,nl_b
        s_offset_c1 = 0
        DO ic = 1,nl_c
          s_offset_d1 = 0
          DO id = 1,nl_d
            ! apply sphi_a to the first index
            buffer1 = 0.0_dp
            imax = 3*3*3
            jmax = 3
            kmax = 3
            DO i = 1,imax
              buffer1(i+imax*(3-1)) = buffer1(i+imax*(3-1)) + work(1+(i-1)*kmax) * sphi_a(1,3+s_offset_a1)
              buffer1(i+imax*(1-1)) = buffer1(i+imax*(1-1)) + work(2+(i-1)*kmax) * sphi_a(2,1+s_offset_a1)
              buffer1(i+imax*(2-1)) = buffer1(i+imax*(2-1)) + work(3+(i-1)*kmax) * sphi_a(3,2+s_offset_a1)
            ENDDO
            ! apply sphi_b to the second index
            buffer2 = 0.0_dp
            imax = 3*3*3
            jmax = 3
            kmax = 3
            DO i = 1,imax
              buffer2(i+imax*(3-1)) = buffer2(i+imax*(3-1)) + buffer1(1+(i-1)*kmax) * sphi_b(1,3+s_offset_b1)
              buffer2(i+imax*(1-1)) = buffer2(i+imax*(1-1)) + buffer1(2+(i-1)*kmax) * sphi_b(2,1+s_offset_b1)
              buffer2(i+imax*(2-1)) = buffer2(i+imax*(2-1)) + buffer1(3+(i-1)*kmax) * sphi_b(3,2+s_offset_b1)
            ENDDO
            ! apply sphi_c to the third index
            buffer1 = 0.0_dp
            imax = 3*3*3
            jmax = 3
            kmax = 3
            DO i = 1,imax
              buffer1(i+imax*(3-1)) = buffer1(i+imax*(3-1)) + buffer2(1+(i-1)*kmax) * sphi_c(1,3+s_offset_c1)
              buffer1(i+imax*(1-1)) = buffer1(i+imax*(1-1)) + buffer2(2+(i-1)*kmax) * sphi_c(2,1+s_offset_c1)
              buffer1(i+imax*(2-1)) = buffer1(i+imax*(2-1)) + buffer2(3+(i-1)*kmax) * sphi_c(3,2+s_offset_c1)
            ENDDO
            ! apply sphi_d to the fourth index, accumulating into primitives
            imax = 3*3*3
            jmax = 3
            kmax = 3
            i = 0
            DO i1 = 1,3
              DO i2 = 1,3
                DO i3 = 1,3
                  i = i + 1
                  primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+3) = &
                    primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+3) &
                    + buffer1(1+(i-1)*kmax) * sphi_d(1,3+s_offset_d1)
                  primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+1) = &
                    primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+1) &
                    + buffer1(2+(i-1)*kmax) * sphi_d(2,1+s_offset_d1)
                  primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+2) = &
                    primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+2) &
                    + buffer1(3+(i-1)*kmax) * sphi_d(3,2+s_offset_d1)
                ENDDO
              ENDDO
            ENDDO
            s_offset_d1 = s_offset_d1 + 3
          END DO
          s_offset_c1 = s_offset_c1 + 3
        END DO
        s_offset_b1 = s_offset_b1 + 3
      END DO
      s_offset_a1 = s_offset_a1 + 3
    END DO
  END SUBROUTINE contract__sparse
END MODULE TEST

[The rest of the attachment (the hand-optimized contract__test and the
timing driver) is truncated in the archive.]

Re: optimization question

2009-05-16 Thread Richard Guenther
On Sat, May 16, 2009 at 10:28 AM, VandeVondele Joost wrote:
> the attached code (see contract__sparse) is a kernel which I hope gets
> optimized well. Unfortunately, compiling it (on opteron or core2) as
>
> gfortran -O3 -march=native -ffast-math -funroll-loops -ffree-line-length-200
>  test.f90
>
>> ./a.out
>
>  Sparse: time[s]   0.66804099
>  New: time[s]   0.20801300
>     speedup    3.2115347
>      Gflops    3.1151900
>  Error:   1.11022302462515654E-016
>
> shows that the hand-optimized version (see contract__test) is about 3x
> faster. I played around with options, but couldn't get gcc to generate fast
> code for the original source. I think that this would involve unrolling a
> loop and scalarizing the scratch arrays buffer1 and buffer2 (as done in the
> hand-optimized version). So, is there any combination of options to get that
> effect?
>
> Second question, even the code generated for the hand-optimized version is
> not quite ideal. The asm of the inner loop appears (like the source) to
> contain about 4*81 multiplies. However, a 'smarter' way to do the
> calculation would be to compute the constants used for multiplying work(i)
> by retaining common subexpressions (i.e. all values of sa_i * sb_j * sc_k *
> sd_l * work[n] can be computed in 9+9+81+81 multiplies instead of the
> current scheme, which has 4*81). That could bring another factor of 2
> speedup. Is there a chance to have gcc see this, or does this need to be
> done at the source level?
>
> If considered useful, I can add a PR to bugzilla with the testcase.

I think it is useful to have a bugzilla here.

The loop bodies are too big to be unrolled early, and the temporary array
is unfortunately addressable (and thus not scalarized early) because of

<bb ...>:
  # D.2754_658 = PHI <...>
  __builtin_memset (&buffer2, 0, 648);

repeating all over the place.  Final unrolling unrolls some of the loops,
but that is way too late for follow-up optimizations.  The factoring you
mention should already be done by reassoc/FRE - it would be interesting
to see why it does not work.
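
A possible source-level workaround (just a sketch, untested) would be to
zero the buffers with an explicit loop instead of a whole-array assignment,
so the frontend emits a plain loop rather than a memset and the arrays
might stay scalarizable:

  ! instead of:  buffer1 = 0.0_dp   (lowered to __builtin_memset)
  DO i = 1, 81
    buffer1(i) = 0.0_dp
  END DO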

I tested 4.4, what did you test?

Richard.

> Joost
>
>


Re: optimization question

2009-05-16 Thread VandeVondele Joost



thanks for the info

> I think it is useful to have a bugzilla here.

will do.

> I tested 4.4, what did you test?

4.3 4.4 4.5

Joost


Re: optimization question

2009-05-16 Thread Richard Guenther
On Sat, May 16, 2009 at 11:29 AM, VandeVondele Joost wrote:
>
>
> thanks for the info
>
>> I think it is useful to have a bugzilla here.
>
> will do.

Btw, complete unrolling is also hindered by the artificial limit of maximally
unrolling 16 iterations.  Your inner loops iterate 27 times.  Also by the
artificial limit of the maximal unrolled size.

With --param max-completely-peel-times=27 --param
max-completely-peeled-insns=666

(values for trunk) the loops are unrolled at -O3.
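
Combined with the original command line, that would be something like:

gfortran -O3 -march=native -ffast-math -funroll-loops -ffree-line-length-200 \
  --param max-completely-peel-times=27 --param max-completely-peeled-insns=666 \
  test.f90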

Richard.

>>
>> I tested 4.4, what did you test?
>>
>
> 4.3 4.4 4.5
>
> Joost
>


Re: optimization question

2009-05-16 Thread VandeVondele Joost



>>> I think it is useful to have a bugzilla here.
>>
>> will do.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40168

> Btw, complete unrolling is also hindered by the artificial limit of maximally
> unrolling 16 iterations.  Your inner loops iterate 27 times.  Also by the
> artificial limit of the maximal unrolled size.
>
> With --param max-completely-peel-times=27 --param
> max-completely-peeled-insns=666
>
> (values for trunk) the loops are unrolled at -O3.

hmmm. but leading to slower code.



Re: optimization question

2009-05-16 Thread Richard Guenther
On Sat, May 16, 2009 at 11:41 AM, VandeVondele Joost wrote:
>
>>>> I think it is useful to have a bugzilla here.
>>>
>>> will do.
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40168
>
>>
>> Btw, complete unrolling is also hindered by the artificial limit of
>> maximally unrolling 16 iterations.  Your inner loops iterate 27 times.
>> Also by the artificial limit of the maximal unrolled size.
>>
>> With --param max-completely-peel-times=27 --param
>> max-completely-peeled-insns=666
>>
>> (values for trunk) the loops are unrolled at -O3.
>
> hmmm. but leading to slower code.

Not for me - but well, the main issue is the memsets which the
frontend generates.

Richard.


[lto] Enabling LTO in selected testsuite directories

2009-05-16 Thread Diego Novillo
On the LTO branch, I am brute-forcing LTO compilation on all the
testsuite directories.  This causes many spurious failures because we
are not going to support LTO compiles on everything.  For instance,
LTO is not supported for fortran, java, ada, mudflap.  Also, for some
tests like pch, the tests fail trivially because of assembly
miscomparison (due to the LTO sections).

I am trying to come up with a generic mechanism that can be used in
individual .exp files so they can decide whether to test the two LTO
modes.  In terms of dg.exp, it would mean adding 3 or 4 new entries to
DG_TORTURE_OPTIONS ({-O0 -flto}, {-O2 -flto}, {-O0 -fwhopr}, {-O2
-fwhopr}).

Do you have any suggestion on how I could implement that?
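
Roughly, I am thinking of something along these lines in each .exp file (a
sketch only; the guard predicate is hypothetical and would still need to be
written):

# sketch: a directory's .exp file opts in to the LTO torture modes
global DG_TORTURE_OPTIONS
if { [check_lto_supported] } {    ;# hypothetical helper, does not exist yet
    lappend DG_TORTURE_OPTIONS \
        { -O0 -flto } { -O2 -flto } { -O0 -fwhopr } { -O2 -fwhopr }
}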


Thanks.  Diego.


Re: Extending constraints using register subclasses

2009-05-16 Thread Jamie Prescott

> From: Andrew Pinski 

> To: Jamie Prescott 
> Cc: gcc@gcc.gnu.org
> Sent: Monday, May 11, 2009 4:47:57 PM
> Subject: Re: Extending constraints using register subclasses
> 
> On Mon, May 11, 2009 at 4:45 PM, Jamie Prescott wrote:
> >
> > Hi!
> > I wanted to add finer (one per) register subclasses, so that I can more
> > finely control the register placement inside the inline assembly.
> 
> You don't need that.
> You can just use asm("registername") on variables.
> like so:
> 
> int f(int a)
> {
>   register int r0 __asm__("r0");
>   asm("use %0": "+r"(r0) );
> }

Now I managed to have the approach based on register subclasses working. The
above works too, but I somehow found it less clear and more "global" than
inline assembly constraints.
One thing that I cannot see how to implement cleanly with the approach above
is an instruction (like the SWI software interrupts I use inside the VM)
where the C type of a register changes between input and output. For some of
them, r0 is a 'char *' on input and an 'int' (error code) on output. With
inline assembly constraints this can be done pretty cleanly, but I am not
aware of how to do it with the approach above. By clean, I mean cast-less.


- Jamie




Re: Extending constraints using register subclasses

2009-05-16 Thread Andrew Pinski
On Sat, May 16, 2009 at 7:57 AM, Jamie Prescott wrote:
> Now I managed to have the approach based on register subclasses working. The
> above works too, but I somehow found it less clear and more "global" than
> inline assembly constraints.

It is not global, as the register variables don't escape out of the
scope.  If you use them in inline functions, it makes them less global
and easier to read, too.

Thanks,
Andrew Pinski


Re: Extending constraints using register subclasses

2009-05-16 Thread Jamie Prescott

> From: Andrew Pinski 

> To: Jamie Prescott 
> Cc: gcc@gcc.gnu.org
> Sent: Saturday, May 16, 2009 8:04:59 AM
> Subject: Re: Extending constraints using register subclasses
> 
> On Sat, May 16, 2009 at 7:57 AM, Jamie Prescott wrote:
> > Now I managed to have the approach based on register subclasses working.
> > The above works too, but I somehow found it less clear and more "global"
> > than inline assembly constraints.
> 
> It is not global as the register variables don't escape out of the
> scope.  If you use them in inline functions, it makes them less global
> and also easier to read too.

And how would you cleanly solve a case like this?

int swi_open_file(char const *path, int mode)
{
    int error;

    asm ("swi 32\n\t" : "=r0" (error) : "r0" (path), "r1" (mode));

    return error;
}
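
The closest register-variable version I can come up with (a sketch, assuming
r0/r1 are word-sized) needs casts at both ends, precisely because r0 changes
C type between input and output:

int swi_open_file(char const *path, int mode)
{
    /* r0 carries a 'char *' on input and an 'int' error code on output,
       so a single register variable forces casts. */
    register long r0 __asm__("r0") = (long) path;
    register int r1 __asm__("r1") = mode;

    asm volatile ("swi 32\n\t" : "+r" (r0) : "r" (r1));

    return (int) r0;
}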


- Jamie




Re: http://gcc.gnu.org/gcc-4.4/changes.html

2009-05-16 Thread Gerald Pfeifer
Hi Jack,

On Tue, 21 Apr 2009, Jack Howarth wrote:
>A couple changes in gcc 4.4.0 were omitted for
> the darwin target. The gcc 4.4.0 release now supports
> a full multilib build on the x86_64-apple-darwin9 and
> x86_64-apple-darwin10 targets. The gfortran compiler
> is now capable of generating binaries linked against
> the static libgfortran library using the compiler option
> -static-libgfortran. FYI.

perhaps you could suggest a patch for the web page (or some
concrete text to add)?  I'd be happy to help get things
updated, but I'm not exactly a subject matter expert on any
of these.

Gerald


Re: Memory leak or incorrect use of '#pragma omp parallel'?

2009-05-16 Thread Mikolaj Golub
On Tue, 12 May 2009 09:33:34 +0300 Mikolaj Golub wrote:

 MG> On Mon, 11 May 2009 19:00:54 +0300 Mikolaj Golub wrote:

 >> Hello,
 >>
 >> If I run the following program
 >>
 >> ---
 >>
 >> #include <omp.h>
 >>
 >> int n = 4, m = 2;
 >>
 >> int main () {
 >>   for (;;) {
 >>     int i;
 >>
 >> #pragma omp parallel num_threads(m)
 >>     {
 >>       int just_to_make_some_code_generated;
 >>       just_to_make_some_code_generated = 0;
 >>     }
 >> #pragma omp parallel for num_threads(n)
 >>     for(i = 0; i < 1; i++) {}
 >>   }
 >>
 >>   return 0;
 >> }
 >>
 >> ---
 >>
 >> on FreeBSD6/7 (i386 and amd64) built with gcc42 or gcc43 -fopenmp (I have
 >> not tried other gcc versions), the memory usage is constantly growing
 >> during the run. The memory growth is observed in case n != m. If n == m,
 >> everything works ok.

 MG> The problem is observed also for constructions like this one:

 MG> #pragma omp parallel for num_threads(m)
 MG> ...
 MG> #pragma omp parallel for num_threads(n)
 MG> ...

 MG> Adding some sleep() code, I see in top that m or n threads run in turn
 MG> during the program run. So it looks like the memory leaks when a thread
 MG> is removed. It might be a FreeBSD problem, not gcc.

The problem is in libgomp/team.c.  gomp_thread_start() calls gomp_sem_init(),
but gomp_sem_destroy() is never called. The FreeBSD implementation of
sem_init() allocates some memory for the semaphore. This patch solves the
problem for me:

--- gcc-4.4-20090227/libgomp/team.c.orig	2009-05-16 22:57:05.0 +0300
+++ gcc-4.4-20090227/libgomp/team.c	2009-05-16 22:57:14.0 +0300
@@ -458,6 +458,7 @@ gomp_team_start (void (*fn) (void *), vo
 void
 gomp_team_end (void)
 {
+  int i;
   struct gomp_thread *thr = gomp_thread ();
   struct gomp_team *team = thr->ts.team;
 
@@ -493,6 +494,9 @@ gomp_team_end (void)
 	}
       while (ws != NULL);
     }
+
+  for(i = 1; i < team->nthreads; i++)
+    gomp_sem_destroy (team->ordered_release[i]);
   gomp_sem_destroy (&team->master_release);
 #ifndef HAVE_SYNC_BUILTINS
   gomp_mutex_destroy (&team->work_share_list_free_lock);

I am going to register this in bugzilla.

The problem is not observed under Linux because it looks like the Linux
implementation of sem_init() does not allocate memory in the process's
virtual address space. The memory use of the test program below grows under
FreeBSD and does not under Linux.

#include <semaphore.h>

int
main(int argc, char *argv[]) {

    sem_t sem;

    /* each sem_init() allocates on FreeBSD; nothing is ever destroyed */
    for(;;) { sem_init(&sem, 0, 0); }

    return 0;
}

-- 
Mikolaj Golub