Hi, please see below.

On Wed, Sep 2, 2009 at 6:57 PM, Mark Hahn <h...@mcmaster.ca> wrote:
> On Wed, 2 Sep 2009, amjad ali wrote:
>
>> Hi All,
>>
>> I have a 4-node (4 CPUs Xeon 3085, 8 cores total) Beowulf cluster on ROCKS-5
>> with Gigabit Ethernet. I tested runs of a 1D CFD code, both serial and
>> parallel, on it. Please reply to the following:
>>
>> 1) When I run my serial code on the dual-core head node (or the parallel code
>> with -np 1), it gives results in about 2 minutes. What I observe is that the
>> "System Monitor" application shows CPU1 becoming 80+% busy and CPU2 around
>> 10% busy for a while; then after some time CPU1 drops to around 10% busy
>> while CPU2 becomes 80+% busy. Such fluctuations / swapping of busy-ness
>> continue till the end. Why is this so? Does this shifting/swapping of
>> busy-ness harm performance/speed?
>
> the kernel decides where to run processes based on demand.  if the machine
> were otherwise idle, your process would stay on the same CPU.  depending on
> the particular kernel release, the kernel uses various heuristics to decide
> how much to "resist" moving the process among cpus.
>
> the cost of moving among cpus depends entirely on how much your code depends
> on the resources tied to one cpu or the other.  for instance, if your code
> has a very small memory footprint, moving will have only trivial cost.
> if your process has a larger working set size, but fits in onchip cache,
> it may be relatively expensive to move to a different processor in the
> system that doesn't share cache.  consider a 6M L3 in a 2-socket system,
> for instance: the inter-socket bandwidth will be approximately memory speed,
> which on a core2 system is something like 6 GB/s.  so migration will incur
> about a 1 ms overhead (possibly somewhat hidden by concurrency.)
>
> in your case (if I have the processor spec right), you have 2 cores
> sharing a single 4M L2.  L1 cache is unshared, but trivial in size,
> so migration cost should be considered near-zero.
>
> the numactl command lets you bind a process to a cpu if you wish.
> this is normally valuable on systems with more complex topologies,
> such as combinations of shared and unshared caches, especially when divided
> over multiple sockets, and with NUMA memory (such as opterons and nehalems.)
>
>> 2) When I run my parallel code with -np 2 on the dual-core head node only,
>> it gives results in about 1 minute. What I observe is that the "System
>> Monitor" application shows CPU1 and CPU2 remaining 100% busy the whole time.
>
> no problem there.  normally, though, it's best to _not_ run extraneous
> processes, and instead only look at the elapsed time that the job takes to
> run.  that is the metric that you should care about.
>
>> 3) When I run my parallel code with "-np 4" and "-np 8" on the dual-core
>> head node only, it gives results in about 2 and 3.20 minutes respectively.
>> What I observe is that the "System Monitor" application shows CPU1 and CPU2
>> remaining 100% busy the whole time.
>
> sure.  with 4 processes, you're overloading the cpus, but they timeslice
> fairly efficiently, so you don't lose.  once you get to 8 processes, you lose
> because the overcommitted processes start interfering (probably their working
> set is blowing the L2 cache.)
>
>> 4) When I run my parallel code with "-np 4" and "-np 8" on the 4-node
>> (8 cores) cluster, it gives results in about 9 (NINE) and 12 minutes.
>
> well, then I think it's a bit hyperbolic to call it a parallel code ;)
> seriously, all you've learned here is that your interconnect is causing
> your code to not scale.  the problem could be your code or the interconnect.
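As a side note on measuring this: rather than watching System Monitor, I will
time the solver section directly. A minimal sketch only -- this is just the
standard MPI_WTIME idiom, not code from my program, and it assumes the same
myrank and ierr variables used in the subroutine further below:

    ! declarations go with the others in the subroutine/program
    DOUBLE PRECISION :: t_start, t_end, t_local, t_max

    CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)   ! line all ranks up first
    t_start = MPI_WTIME()
    ! ... time-stepping loop goes here ...
    t_end   = MPI_WTIME()
    t_local = t_end - t_start
    ! the slowest rank determines the elapsed time, so report the maximum
    CALL MPI_REDUCE(t_local, t_max, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
    IF (myrank == 0) PRINT *, 'elapsed seconds (max over ranks): ', t_max

That should make the -np 2 versus -np 4/8 comparison a matter of numbers rather
than of watching CPU meters.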
>> What I observe is that the "System Monitor" application shows CPU usage
>> fluctuations somewhat as in point number 1 above (CPU1 remains the dominantly
>> busy one most of the time), in the case of -np 4. Does this mean that an MPI
>> process is shifting to different cores/cpus/nodes? Do these shiftings harm
>> performance/speed?
>
> MPI does not shift anything.  the kernel may rebalance runnable processes
> within a single node, but not across nodes.  it's difficult to tell how much
> your monitoring is harming the calculation or perturbing the load-balance.
>
>> 5) Why are "-np 4" and "-np 8" on the cluster taking so much more time than
>> -np 2 on the head node? Obviously it is due to communication overhead! But
>> how do I get better performance -- a lower run time? My code is not too
>> complicated: only 2 values are sent and 2 values are received by each
>> process after each stage.
>
> then do more work between sends and receives.  hard to say without knowing
> exactly what the communication pattern is.

Here is my subroutine:

IF (myrank /= p-1) CALL MPI_ISEND(u_local(Np,E), 1, MPI_REAL8, myrank+1, 55, MPI_COMM_WORLD, right(1), ierr)
IF (myrank /= 0)   CALL MPI_ISEND(u_local(1,B),  1, MPI_REAL8, myrank-1, 66, MPI_COMM_WORLD, left(1),  ierr)
IF (myrank /= 0)   CALL MPI_IRECV(u_left_exterior,  1, MPI_REAL8, myrank-1, 55, MPI_COMM_WORLD, right(2), ierr)
IF (myrank /= p-1) CALL MPI_IRECV(u_right_exterior, 1, MPI_REAL8, myrank+1, 66, MPI_COMM_WORLD, left(2),  ierr)

u0_local  = RESHAPE(u_local,  (/Np*K_local/))
du0_local = RESHAPE(du_local, (/Nfp*Nfaces*K_local/))
q_local   = rx_local*MATMUL(Dr,u_local)

DO I = shift*Nfp*Nfaces+1+1, shift*Nfp*Nfaces+Nfp*Nfaces*K_local-1
   du0_local(I) = (u0_local(vmapM_local(I)) - u0_local(vmapP_local(I)))/2.0_8
ENDDO

I = shift*Nfp*Nfaces+1
IF (myrank == 0) du0_local(I) = (u0_local(vmapM_local(I)) - u0_local(vmapP_local(I)))/2.0_8

IF (myrank /= p-1) CALL MPI_WAIT(right(1), status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(left(1),  status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(right(2), status, ierr)
IF (myrank /= p-1) CALL MPI_WAIT(left(2),  status, ierr)

IF (myrank /= 0) THEN
   du0_local(I) = (u0_local(vmapM_local(I)) - u_left_exterior)/2.0_8
ENDIF

I = shift*Nfp*Nfaces+Nfp*Nfaces*K_local
IF (myrank == p-1) du0_local(I) = (u0_local(vmapM_local(I)) - u0_local(vmapP_local(I)))/2.0_8
IF (myrank /= p-1) THEN
   du0_local(I) = (u0_local(vmapM_local(I)) - u_right_exterior)/2.0_8
ENDIF

IF (myrank == 0)   du0_local(mapI) = 0.0_8
IF (myrank == p-1) du0_local(mapO) = 0.0_8

du_local = RESHAPE(du0_local, (/Nfp*Nfaces,K_local/))
q_local  = q_local - MATMUL(LIFT, Fscale_local*(nx_local*du_local))

IF (myrank /= p-1) CALL MPI_ISEND(q_local(Np,E), 1, MPI_REAL8, myrank+1, 551, MPI_COMM_WORLD, right(1), ierr)
IF (myrank /= 0)   CALL MPI_ISEND(q_local(1,B),  1, MPI_REAL8, myrank-1, 661, MPI_COMM_WORLD, left(1),  ierr)
IF (myrank /= 0)   CALL MPI_IRECV(q_left_exterior,  1, MPI_REAL8, myrank-1, 551, MPI_COMM_WORLD, right(2), ierr)
IF (myrank /= p-1) CALL MPI_IRECV(q_right_exterior, 1, MPI_REAL8, myrank+1, 661, MPI_COMM_WORLD, left(2),  ierr)

q0_local   = RESHAPE(q_local,  (/Np*K_local/))
dq0_local  = RESHAPE(dq_local, (/Nfp*Nfaces*K_local/))
rhsu_local = rx_local*MATMUL(Dr,q_local) + u_local*(u_local-a1)*(1.0_8-u_local) - v_local

DO I = shift*Nfp*Nfaces+1+1, shift*Nfp*Nfaces+Nfp*Nfaces*K_local-1
   dq0_local(I) = (q0_local(vmapM_local(I)) - q0_local(vmapP_local(I)))/2.0_8
ENDDO

I = shift*Nfp*Nfaces+1

IF (myrank /= p-1) CALL MPI_WAIT(right(1), status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(left(1),  status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(right(2), status, ierr)
IF (myrank /= p-1) CALL MPI_WAIT(left(2),  status, ierr)
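! Note on the exchange pattern at this point: the ISEND/IRECV calls are posted
! first, the interior-face loop (which needs no neighbour data) runs in between,
! and the MPI_WAIT calls above come only just before q_left_exterior and
! q_right_exterior are used -- i.e. the usual communication/computation overlap
! recipe.  One thing to check in the declarations (not shown in this excerpt):
! MPI_WAIT needs
!    INTEGER :: status(MPI_STATUS_SIZE)
! As a purely optional tidy-up, the four pending requests could be kept in one
! array (say a hypothetical requests(4) holding right(1), left(1), right(2),
! left(2), with unused entries set to MPI_REQUEST_NULL on ranks 0 and p-1) and
! completed in one go with
!    CALL MPI_WAITALL(4, requests, MPI_STATUSES_IGNORE, ierr)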
IF (myrank /= 0) THEN
   dq0_local(I) = (q0_local(vmapM_local(I)) - q_left_exterior)/2.0_8
ENDIF

I = shift*Nfp*Nfaces+Nfp*Nfaces*K_local
IF (myrank /= p-1) THEN
   dq0_local(I) = (q0_local(vmapM_local(I)) - q_right_exterior)/2.0_8
ENDIF

IF (myrank == 0)   dq0_local(mapI) = q0_local(vmapI) + 0.225_8
IF (myrank == p-1) dq0_local(mapO) = q0_local(vmapO)

dq_local   = RESHAPE(dq0_local, (/Nfp*Nfaces,K_local/))
rhsu_local = rhsu_local - MATMUL(LIFT, (Fscale_local*(nx_local*dq_local)))

END SUBROUTINE
=====================================================================

Is it sufficiently good, or are there some serious problems with the
communication pattern? Here the arrays are not very big, because
Nfp*Nfaces = 2 and K_local <= 30.

> I think you should first validate your cluster to see that the Gb is
> running as fast as expected.  actually, that everything is running right.
> that said, Gb is almost not a cluster interconnect at all, since it's so
> much slower than the main competitors (IB mostly, to some extent 10GE).
> fatter nodes (dual-socket quad-core, for instance) would at least decrease
> the effect of slow interconnect.
>
> you might also try installing Open-MX, which is an ethernet protocol
> optimized for MPI (rather than your current MPI, which is presumably layered
> on top of the usual TCP stack, which is optimized for wide-area
> streaming transfers.)  heck, you can probably obtain some speedup by
> tweaking your coalesce settings via ethtool.
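On validating the GigE first: since each message here is only a single 8-byte
value, latency is what matters most, and a rough ping-pong test between two
nodes gives a direct number to compare against typical GigE/TCP figures. A
sketch only -- the program and variable names below are arbitrary; run it with
two ranks placed on two different nodes (hostfile/machinefile as usual):

    PROGRAM pingpong
      ! rough point-to-point latency check between rank 0 and rank 1
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER, PARAMETER :: reps = 10000
      INTEGER :: myrank, ierr, i, status(MPI_STATUS_SIZE)
      REAL(8) :: buf(1), t0, t1
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      buf = 1.0_8
      CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
      t0 = MPI_WTIME()
      DO i = 1, reps
         IF (myrank == 0) THEN
            ! same message size as the halo exchange: one MPI_REAL8
            CALL MPI_SEND(buf, 1, MPI_REAL8, 1, 99, MPI_COMM_WORLD, ierr)
            CALL MPI_RECV(buf, 1, MPI_REAL8, 1, 99, MPI_COMM_WORLD, status, ierr)
         ELSE IF (myrank == 1) THEN
            CALL MPI_RECV(buf, 1, MPI_REAL8, 0, 99, MPI_COMM_WORLD, status, ierr)
            CALL MPI_SEND(buf, 1, MPI_REAL8, 0, 99, MPI_COMM_WORLD, ierr)
         END IF
      END DO
      t1 = MPI_WTIME()
      IF (myrank == 0) PRINT *, 'average round-trip time (s): ', (t1 - t0)/reps
      CALL MPI_FINALIZE(ierr)
    END PROGRAM pingpong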
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf