optimization question
the attached code (see contract__sparse) is a kernel which I hope gets optimized well. Unfortunately, compiling it (on opteron or core2) as

  gfortran -O3 -march=native -ffast-math -funroll-loops -ffree-line-length-200 test.f90
  ./a.out

  Sparse: time[s]  0.66804099
  New:    time[s]  0.20801300
  speedup    3.2115347
  Gflops     3.1151900
  Error:     1.11022302462515654E-016

shows that the hand-optimized version (see contract__test) is about 3x faster. I played around with options, but couldn't get gcc to generate fast code for the original source. I think that this would involve unrolling a loop and scalarizing the scratch arrays buffer1 and buffer2 (as done in the hand-optimized version). So, is there any combination of options to get that effect?

Second question: even the code generated for the hand-optimized version is not quite ideal. The asm of the inner loop appears (like the source) to contain about 4*81 multiplies. However, a 'smarter' way to do the calculation would be to compute the constants used for multiplying work(i) by retaining common subexpressions (i.e. all values of sa_i * sb_j * sc_k * sd_l * work[n] can be computed in 9+9+81+81 multiplies instead of the current scheme, which has 4*81). That could bring another factor of 2 speedup. Is there a chance to have gcc see this, or does this need to be done on the source level?

If considered useful, I can add a PR to bugzilla with the testcase.
Joost

MODULE TEST
  IMPLICIT NONE
  INTEGER :: l
  INTEGER, PARAMETER :: dp=8
  INTEGER, PARAMETER :: nco(0:3)=(/((l+1)*(l+2)/2,l=0,3)/)
  INTEGER, PARAMETER :: nso(0:3)=(/(2*l+1,l=0,3)/)
CONTAINS

  SUBROUTINE contract__sparse(work, &
       nl_a, nl_b, nl_c, nl_d, &
       sphi_a, sphi_b, sphi_c, sphi_d, &
       primitives, &
       s_offset_a, s_offset_b, s_offset_c, s_offset_d)
    REAL(dp), DIMENSION(3*3*3*3), INTENT(IN)         :: work
    INTEGER                                          :: nl_a, nl_b, nl_c, nl_d
    REAL(dp), DIMENSION(3,3*nl_a), INTENT(IN)        :: sphi_a
    REAL(dp), DIMENSION(3,3*nl_b), INTENT(IN)        :: sphi_b
    REAL(dp), DIMENSION(3,3*nl_c), INTENT(IN)        :: sphi_c
    REAL(dp), DIMENSION(3,3*nl_d), INTENT(IN)        :: sphi_d
    REAL(dp), DIMENSION(3*nl_a,3*nl_b,3*nl_c,3*nl_d) :: primitives
    INTEGER, INTENT(IN) :: s_offset_a, s_offset_b, s_offset_c, s_offset_d
    REAL(dp), DIMENSION(3*3*3*3) :: buffer1, buffer2
    INTEGER :: imax, jmax, kmax, ia, ib, ic, id, &
               s_offset_a1, s_offset_b1, s_offset_c1, s_offset_d1, &
               i1, i2, i3, i, j, k

    s_offset_a1 = 0
    DO ia = 1,nl_a
       s_offset_b1 = 0
       DO ib = 1,nl_b
          s_offset_c1 = 0
          DO ic = 1,nl_c
             s_offset_d1 = 0
             DO id = 1,nl_d

                buffer1 = 0.0_dp
                imax=3*3*3
                jmax=3
                kmax=3
                DO i=1,imax
                   buffer1(i+imax*(3-1)) = buffer1(i+imax*(3-1)) + work(1+(i-1)*kmax) * sphi_a(1,3+s_offset_a1)
                   buffer1(i+imax*(1-1)) = buffer1(i+imax*(1-1)) + work(2+(i-1)*kmax) * sphi_a(2,1+s_offset_a1)
                   buffer1(i+imax*(2-1)) = buffer1(i+imax*(2-1)) + work(3+(i-1)*kmax) * sphi_a(3,2+s_offset_a1)
                ENDDO

                buffer2 = 0.0_dp
                imax=3*3*3
                jmax=3
                kmax=3
                DO i=1,imax
                   buffer2(i+imax*(3-1)) = buffer2(i+imax*(3-1)) + buffer1(1+(i-1)*kmax) * sphi_b(1,3+s_offset_b1)
                   buffer2(i+imax*(1-1)) = buffer2(i+imax*(1-1)) + buffer1(2+(i-1)*kmax) * sphi_b(2,1+s_offset_b1)
                   buffer2(i+imax*(2-1)) = buffer2(i+imax*(2-1)) + buffer1(3+(i-1)*kmax) * sphi_b(3,2+s_offset_b1)
                ENDDO

                buffer1 = 0.0_dp
                imax=3*3*3
                jmax=3
                kmax=3
                DO i=1,imax
                   buffer1(i+imax*(3-1)) = buffer1(i+imax*(3-1)) + buffer2(1+(i-1)*kmax) * sphi_c(1,3+s_offset_c1)
                   buffer1(i+imax*(1-1)) = buffer1(i+imax*(1-1)) + buffer2(2+(i-1)*kmax) * sphi_c(2,1+s_offset_c1)
                   buffer1(i+imax*(2-1)) = buffer1(i+imax*(2-1)) + buffer2(3+(i-1)*kmax) * sphi_c(3,2+s_offset_c1)
                ENDDO

                imax=3*3*3
                jmax=3
                kmax=3
                i = 0
                DO i1=1,3
                   DO i2=1,3
                      DO i3=1,3
                         i = i + 1
                         primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+3) = &
                              primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+3) &
                              + buffer1(1+(i-1)*kmax) * sphi_d(1,3+s_offset_d1)
                         primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+1) = &
                              primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+1) &
                              + buffer1(2+(i-1)*kmax) * sphi_d(2,1+s_offset_d1)
                         primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+2) = &
                              primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+2) &
                              + buffer1(3+(i-1)*kmax) * sphi_d(3,2+s_offset_d1)
                      ENDDO
                   ENDDO
                ENDDO

                s_offset_d1 = s_offset_d1 + 3
             END DO
             s_offset_c1 = s_offset_c1 + 3
          END DO
          s_offset_b1 = s_offset_b1 + 3
       END DO
       s_offset_a1 = s_offset_a1 + 3
    END DO
  END SUBROUTINE contract__sparse
Re: optimization question
On Sat, May 16, 2009 at 10:28 AM, VandeVondele Joost wrote:
> the attached code (see contract__sparse) is a kernel which I hope gets
> optimized well. [...] I played around with options, but couldn't get gcc
> to generate fast code for the original source. I think that this would
> involve unrolling a loop and scalarizing the scratch arrays buffer1 and
> buffer2 (as done in the hand-optimized version). So, is there any
> combination of options to get that effect?
>
> [...] Is there a chance to have gcc see this, or does this need to be
> done on the source level?
>
> If considered useful, I can add a PR to bugzilla with the testcase.

I think it is useful to have a bugzilla here.

The loop bodies are too big to be unrolled early, and the temporary array is unfortunately addressable (and thus not scalarized early) because of

  # D.2754_658 = PHI
  __builtin_memset (&buffer2, 0, 648);

repeating all over the place. Final unrolling unrolls some of the loops, but that is way too late for followup optimizations.
The factoring you mention should already be done by reassoc/FRE - it would be interesting to see why it does not work.

I tested 4.4, what did you test?

Richard.
Re: optimization question
thanks for the info

> I think it is useful to have a bugzilla here.

will do.

> I tested 4.4, what did you test?

4.3, 4.4, 4.5

Joost
Re: optimization question
On Sat, May 16, 2009 at 11:29 AM, VandeVondele Joost wrote:
>> I think it is useful to have a bugzilla here.
>
> will do.

Btw, complete unrolling is also hindered by the artificial limit of maximally unrolling 16 iterations. Your inner loops iterate 27 times. It is also hindered by the artificial limit on the maximal unrolled size. With

  --param max-completely-peel-times=27 --param max-completely-peeled-insns=666

(values for trunk) the loops are unrolled at -O3.

Richard.
Re: optimization question
>> I think it is useful to have a bugzilla here.
>
> will do.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40168

> Btw, complete unrolling is also hindered by the artificial limit of
> maximally unrolling 16 iterations. Your inner loops iterate 27 times.
> It is also hindered by the artificial limit on the maximal unrolled size.
>
> With --param max-completely-peel-times=27 --param max-completely-peeled-insns=666
>
> (values for trunk) the loops are unrolled at -O3.

hmmm. but leading to slower code.
Re: optimization question
On Sat, May 16, 2009 at 11:41 AM, VandeVondele Joost wrote:
>>> I think it is useful to have a bugzilla here.
>>
>> will do.
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40168
>
>> With --param max-completely-peel-times=27 --param
>> max-completely-peeled-insns=666
>>
>> (values for trunk) the loops are unrolled at -O3.
>
> hmmm. but leading to slower code.

Not for me - but well, the main issue is the memsets which the frontend generates.

Richard.
[lto] Enabling LTO in selected testsuite directories
On the LTO branch, I am brute-forcing LTO compilation on all the testsuite directories. This causes many spurious failures because we are not going to support LTO compiles on everything. For instance, LTO is not supported for fortran, java, ada, mudflap. Also, for some tests like pch, the tests fail trivially because of assembly miscomparison (due to the LTO sections). I am trying to come up with a generic mechanism that can be used in individual .exp files so they can decide whether to test the two LTO modes. In terms of dg.exp, it would mean adding 3 or 4 new entries to DG_TORTURE_OPTIONS ({-O0 -flto}, {-O2 -flto}, {-O0 -fwhopr}, {-O2 -fwhopr}). Do you have any suggestion on how I could implement that? Thanks. Diego.
Re: Extending constraints using register subclasses
> From: Andrew Pinski
> To: Jamie Prescott
> Cc: gcc@gcc.gnu.org
> Sent: Monday, May 11, 2009 4:47:57 PM
> Subject: Re: Extending constraints using register subclasses
>
> On Mon, May 11, 2009 at 4:45 PM, Jamie Prescott wrote:
>> Hi!
>> I wanted to add finer (one per) register subclasses, so that I can more
>> finely control the register placement inside the inline assembly.
>
> You don't need that.
> You can just use asm("registername") on variables, like so:
>
>   int f(int a)
>   {
>     register int r0 __asm__("r0");
>     asm("use %0": "+r"(r0) );
>   }

Now I managed to have the approach based on register subclasses working. The above works too, but I somehow found it less clear and more "global" than inline assembly constraints.

One thing that I cannot see how to implement clearly with the approach above is when you have an instruction (like I use inside the VM with SWI - software interrupts) where the C type of a register changes between input and output. For some of them the r0 in input can be a 'char*' and an 'int' (error code) on output. With inline assembly this can be done pretty cleanly, while I am not aware of how to do it with the approach above. For clean, I mean cast-less.

- Jamie
Re: Extending constraints using register subclasses
On Sat, May 16, 2009 at 7:57 AM, Jamie Prescott wrote:
> Now I managed to have the approach based on register subclasses working.
> The above works too, but I somehow found it less clear and more "global"
> than inline assembly constraints.

It is not global, as the register variables don't escape out of the scope. If you use them in inline functions, that makes them less global and also easier to read.

Thanks,
Andrew Pinski
Re: Extending constraints using register subclasses
> From: Andrew Pinski
> To: Jamie Prescott
> Cc: gcc@gcc.gnu.org
> Sent: Saturday, May 16, 2009 8:04:59 AM
> Subject: Re: Extending constraints using register subclasses
>
> On Sat, May 16, 2009 at 7:57 AM, Jamie Prescott wrote:
>> Now I managed to have the approach based on register subclasses working.
>> The above works too, but I somehow found it less clear and more "global"
>> than inline assembly constraints.
>
> It is not global as the register variables don't escape out of the
> scope. If you use them in inline functions, it makes them less global
> and also easier to read too.

And how would you cleanly solve a case like this?

  int swi_open_file(char const *path, int mode)
  {
      int error;
      asm ("swi 32\n\t" : "=r0" (error) : "r0" (path), "r1" (mode));
      return error;
  }

- Jamie
Re: http://gcc.gnu.org/gcc-4.4/changes.html
Hi Jack,

On Tue, 21 Apr 2009, Jack Howarth wrote:
> A couple changes in gcc 4.4.0 were omitted for the darwin target. The
> gcc 4.4.0 release now supports a full multilib build on the
> x86_64-apple-darwin9 and x86_64-apple-darwin10 targets. The gfortran
> compiler is now capable of generating binaries linked against the
> static libgfortran library using the compiler option
> -static-libgfortran. FYI.

Perhaps you could suggest a patch for the web page (or some concrete text to add)? I'd be happy to help getting things updated, but I'm not exactly a subject matter expert on any of these.

Gerald
Re: Memory leak or incorrect use of '#pragma omp parallel'?
On Tue, 12 May 2009 09:33:34 +0300 Mikolaj Golub wrote:

MG> On Mon, 11 May 2009 19:00:54 +0300 Mikolaj Golub wrote:
>> Hello,
>>
>> If I run the following program
>>
>> #include <omp.h>
>>
>> int n = 4, m = 2;
>>
>> int main () {
>>   for (;;) {
>>     int i;
>>
>> #pragma omp parallel num_threads(m)
>>     {
>>       int just_to_make_some_code_generated;
>>       just_to_make_some_code_generated = 0;
>>     }
>> #pragma omp parallel for num_threads(n)
>>     for(i = 0; i < 1; i++) {}
>>   }
>>
>>   return 0;
>> }
>>
>> on FreeBSD6/7 (i386 and amd64) built with gcc42 or gcc43 -fopenmp (I
>> have not tried other gcc versions), the memory usage is constantly
>> growing during the run. The memory growth is observed in case n != m.
>> If n == m, everything works ok.

MG> The problem is observed also for constructions like this one:
MG>
MG> #pragma omp parallel for num_threads(m)
MG> ...
MG> #pragma omp parallel for num_threads(n)
MG> ...
MG>
MG> Adding some sleep() code I see in top that during the program run m or
MG> n threads are run in turn. So it looks like memory leaks when a thread
MG> is removed. It might be a FreeBSD problem, not gcc.

The problem is in libgomp/team.c. gomp_thread_start() does gomp_sem_init() but gomp_sem_destroy() is never called. The FreeBSD implementation of sem_init() allocates some memory for the semaphore. This patch solves the problem for me:

--- gcc-4.4-20090227/libgomp/team.c.orig	2009-05-16 22:57:05.0 +0300
+++ gcc-4.4-20090227/libgomp/team.c	2009-05-16 22:57:14.0 +0300
@@ -458,6 +458,7 @@ gomp_team_start (void (*fn) (void *), vo
 void
 gomp_team_end (void)
 {
+  int i;
   struct gomp_thread *thr = gomp_thread ();
   struct gomp_team *team = thr->ts.team;
 
@@ -493,6 +494,9 @@ gomp_team_end (void)
 	}
       while (ws != NULL);
     }
+
+  for (i = 1; i < team->nthreads; i++)
+    gomp_sem_destroy (team->ordered_release[i]);
 
   gomp_sem_destroy (&team->master_release);
 #ifndef HAVE_SYNC_BUILTINS
   gomp_mutex_destroy (&team->work_share_list_free_lock);

I am going to register this in bugzilla.
The problem is not observed under Linux, because it looks like the sem_init() implementation for Linux does not do memory allocation in the process' virtual space. The memory for the test program below grows under FreeBSD and does not under Linux.

#include <semaphore.h>

int main(int argc, char *argv[])
{
    sem_t sem;
    for (;;) {
        sem_init(&sem, 0, 0);
    }
    return 0;
}

-- 
Mikolaj Golub