Hello, I'm new to parallel programming and MPI. I've developed a simulator in C++ for which I would like to decrease the running time by using a Beowulf cluster. I'm not interested in optimizing speed, I'm just looking for a quick and easy way to significantly improve the speed over running the program on a single machine. Basically, I need to learn how to parallelize a C++ function. The following functions in particular take the longest to run in my simulator. The first implements LU decomposition on a large matrix and the second implements the backsubstitution method to solve matrix division.
void NR::ludcmp(Mat_IO_DP &a, Vec_O_INT &indx, DP &d) { const DP TINY=1.0e-20; int i,imax,j,k; DP big,dum,sum,temp; int n=a.nrows(); Vec_DP vv(n); d=1.0; for (i=0;i<n;i++) { big=0.0; for (j=0;j<n;j++) if ((temp=fabs(a[i][j])) > big) big=temp; if (big == 0.0) nrerror("Singular matrix in routine ludcmp"); vv[i]=1.0/big; } for (j=0;j<n;j++) { for (i=0;i<j;i++) { sum=a[i][j]; for (k=0;k<i;k++) sum -= a[i][k]*a[k][j]; a[i][j]=sum; } big=0.0; for (i=j;i<n;i++) { sum=a[i][j]; for (k=0;k<j;k++) sum -= a[i][k]*a[k][j]; a[i][j]=sum; if ((dum=vv[i]*fabs(sum)) >= big) { big=dum; imax=i; } } if (j != imax) { for (k=0;k<n;k++) { dum=a[imax][k]; a[imax][k]=a[j][k]; a[j][k]=dum; } d = -d; vv[imax]=vv[j]; } indx[j]=imax; if (a[j][j] == 0.0) a[j][j]=TINY; if (j != n-1) { dum=1.0/(a[j][j]); for (i=j+1;i<n;i++) a[i][j] *= dum; } } } and... void NR::lubksb(Mat_I_DP &a, Vec_I_INT &indx, Vec_IO_DP &b) { int i,ii=0,ip,j; DP sum; int n=a.nrows(); for (i=0;i<n;i++) { ip=indx[i]; sum=b[ip]; b[ip]=b[i]; if (ii != 0) for (j=ii-1;j<i;j++) sum -= a[i][j]*b[j]; else if (sum != 0.0) ii=i+1; b[i]=sum; } for (i=n-1;i>=0;i--) { sum=b[i]; for (j=i+1;j<n;j++) sum -= a[i][j]*b[j]; b[i]=sum/a[i][i]; } } (The functions are borrowed from the library provided by "Numerical Recipes in C++") I'm currently calling these functions from the main loop with the lines: NR::ludcmp(c,indx,d); and NR::lubksb(c,indx,xv); where the variable 'c' is a large matrix (holding image pixel values) and 'xv' is a vector used in backsubstitution. All of the variables passed into these functions are of types defined in " nr.h" 'c' is a Mat_DP (double-precision matrix) 'indx' is a Vec_INT (integer vector) 'd' is a DP (double precision) 'xv' is a Vec_DP (double precision vector) Is there a simple way to call these functions which will cause the cluster to distribute the load of operations? Currently, when I run the program with a 50x50 array on a single machine, it takes about 5 minutes to process a single iteration through the matrix division. Any help would be greatly appreciated.
_______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf