Hello Brad
I will do better timing and also try larger problems.
I think the MPI code also has a lot of overhead, since it has to transfer data
between processes, which the Chapel code does not have to do. I also have the
same halo cells in the MPI code as in the Chapel code. In the MPI code, each
process copies data from a global vector to a local vector and then does the
actual computations, which the Chapel code doesn't do. Hence I expected the MPI
code to do worse.
I was wondering whether my Chapel code is not well written. For example, there
are loops like this:

  forall (i,j) in Dx {
    // do some computation
    res[i-1,j] += flux * dy;
    res[i,j]   -= flux * dy;
  }
Do I have to worry about different threads writing into the same location of
the "res" variable?
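For concreteness, here is a self-contained sketch of the pattern I am asking about. The names n, flux, and dy are made up for the example, and the atomic version at the end is only my guess at one race-free way to write the accumulation:

```chapel
config const n = 8;
const Dx = {1..n, 1..n};
const flux = 1.0, dy = 0.5;

// The pattern in question: iteration (i,j) writes res[i-1,j],
// and iteration (i-1,j) writes that same element, so two tasks
// may update one location concurrently.
var res: [0..n, 1..n] real;
forall (i,j) in Dx {
  res[i-1,j] += flux * dy;
  res[i,j]   -= flux * dy;
}

// One possible race-free variant: declare the elements atomic
// and express the updates with add()/sub().
var resAtomic: [0..n, 1..n] atomic real;
forall (i,j) in Dx {
  resAtomic[i-1,j].add(flux * dy);
  resAtomic[i,j].sub(flux * dy);
}
```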
How can I check how much time is spent in different parts of the Chapel code?
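To make the question concrete, this is the kind of per-section measurement I have in mind, using the standard Time module (I am assuming the Timer API here):

```chapel
use Time;

var t: Timer;
t.start();
// ... section of interest, e.g. the flux loop ...
t.stop();
writeln("section took ", t.elapsed(), " seconds");
t.clear();  // reset the timer before measuring the next section
```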
Best
praveen
> On 10-Oct-2016, at 10:49 PM, Brad Chamberlain <[email protected]> wrote:
>
>
> Hi Praveen --
>
> In addition to Jeff's good advice on timing the computation you care about, I
> wanted to point out a difference between the MPI and the Chapel code:
>
> As you know, MPI is designed to be a distributed memory execution model, so
> to take advantage of the four cores on your Mac, you use mpirun -np 4.
>
> Chapel supports both shared- and distributed-memory parallelism, so the way
> you're running on this 4-core Mac is reasonable, yet different than the MPI.
> Specifically, we will create a single process that will use multiple threads
> to implement your forall loops (typically 4). So there will be no
> inter-process communication in the Chapel implementation as there is in the
> MPI version and comparing against an OpenMP implementation would be a more
> fair comparison.
>
> Related: The use of the 'StencilDist' domain map has no positive impact for a
> shared-memory execution like this, and will likely add overhead. It is
> designed for use on distributed-memory executions that do stencil-based
> computations in order to enable caching of values owned by neighboring
> processes. But when you've only got one process like this, there's no remote
> data to cache. So for a shared-memory execution like this, it'd be
> interesting to see how much faster the code would be if the 'dmapped
> StencilDist' clause was commented out (in practice, we often write codes that
> can be compiled with or without distributed data using a 'param' conditional
> -- for example, see the declarations of 'Elems' and 'Nodes' in
> examples/benchmarks/lulesh.chpl).
>
> Running on a distributed memory system using the 'StencilDist' distribution
> against MPI (or better, vs. an MPI + OpenMP code) would also be more of an
> apples-to-apples comparison, though I suspect you'll see Chapel fall further
> behind in terms of performance at that point...
>
> -Brad
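(For reference, the lulesh-style 'param' conditional described above can be sketched as follows; the exact StencilDist constructor arguments are an assumption and should be checked against examples/benchmarks/lulesh.chpl:)

```chapel
use StencilDist;

config param useStencilDist = false;  // compile-time switch
config const n = 100;

const space = {1..n, 1..n};
// Distribute the domain only when compiled with -suseStencilDist=true;
// otherwise fall back to a plain shared-memory domain.
const D = if useStencilDist
            then space dmapped Stencil(space, fluff=(1,1))
            else space;
var res: [D] real;
```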
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users