On Thu, 29 Mar 2007, Richard Walsh wrote:

> Hey Patrick,
>
> Patrick Geoffray wrote:
> > Message aggregation would be much more beneficial in the context of
> > UPC, where the compiler will likely generate many small-grain
> > communications to the same remote process. However, as Greg pointed
> > out, MPI applications often already aggregate messages themselves
> > (this is something even "challenged" application people understand
> > easily).
>
> Right ...
>
> > I would bet that UPC could more efficiently leverage a strided or
> > vector communication primitive instead of message aggregation. I
> > don't know if GASNet provides one; I know ARMCI does.
>
> Having the alternative of compiling to a pseudo-vector pipelined
> GASNet/ARMCI primitive (hiding latency in the pipeline) *or* an
> aggregation primitive (amortizing latency over a large data block)
> would seem to be a good thing, depending on the number and
> distribution of remote memory references in your given kernel. The
> latency bubbles in interconnect-mediated remote memory references are
> of course much larger than in a hardware-mediated global address
> space remote reference. This might change the effectiveness of
> pipeline-based latency hiding. From what I know, the Berkeley UPC
> compiler team is focused on optimization through aggregation rather
> than pipelining.
Hi Richard,

ARMCI, by way of Global Arrays, makes it pretty clear to performance-minded end users that vector pipelining is the optimization they are leveraging, by exposing the dimensionality and size of in-memory references -- that makes the model more explicit than UPC, but much less so than MPI. With UPC, as I'm sure you know, shared pointers contain some information about data layout, but that information can be lost depending on how the shared pointer is manipulated, and can end up being something the compiler has to (or can't) figure out. The fact that UPC has a relaxed shared memory model helps with prefetching, caching, aggregation, and the other latency-hiding techniques that are so vital on systems with orders-of-magnitude latency differences between local and global memory. It's a difficult optimization problem, since some of these optimizations happen at compile time while others depend on the runtime, and the two sides shouldn't step on each other's toes.

For MPI, if it is to be "the assembly language of parallel computing", it is harder to justify implicit message aggregation (just to be clear, by assembly here I think the author of that quote meant the "micro-ops" in your favorite processor, not the whole RISC-vs-CISC-that-is-really-a-RISC debate). Having developed parallel applications using both explicit and implicit programming models, I find that one of the most useful things to know in both is when and how communication happens. "Communication leakage", or unintended communication, is what makes this task more difficult in implicit languages. The "how" part is easier as a performance-oriented task in MPI, whereas the "when" is mostly left as an optimization for the MPI implementation to figure out (although the programmer has synchronous primitives and blocking calls to better control the "when").

I would find it surprising if existing MPI codes really benefited from aggregation, since users have had to understand where and how communication happens as a performance concern for decades now. Also, the longstanding unwritten MPI rule that "larger messages are better" weakens the case for message aggregation as a latency-hiding technique. Regardless, I can accept the case for message aggregation in MPI, but it shouldn't be a de facto component of an MPI implementation -- it should be an on/off switch under the developer's control, and it shouldn't be on by default in benchmarking modes (then again, measuring latency with small-message algorithms that only scale up to 16 nodes shouldn't be the default in benchmarking modes either, but that's a different issue). MPI implementations are horrible beasts to maintain but are beautiful in some regards; they flourish in many directions, and I wouldn't stand in the way of yet another performance-minded feature. But if MPI is to remain the reference explicit model, implementations should be explicit about what's going on under the covers whenever a programmer's expectation of *how* two ranks communicate is seriously affected -- and I think aggregation qualifies as serious here.

> Regardless, at this point our more GUPS-like direct remote memory
> reference patterns in our UPC codes, which perform well on the Cray
> X1E, must be manually aggregated to achieve performance on a cluster.

I also butted my head pretty hard against this problem. For UPC, part of it has to do with 'C' and how difficult it is to assert that loops are free of dependencies (the typical aliasing problems of C, even serially).
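To make the manual-aggregation point concrete, here is roughly what the transformation looks like in UPC. The array name, block size, and sizes are made up for illustration; this is not code from the applications discussed in this thread:

/* Rough sketch of manual aggregation in UPC; names and sizes are
 * illustrative assumptions only. */
#include <upc_relaxed.h>
#include <upc.h>

#define CHUNK 1024                         /* elements per thread's block */

shared [CHUNK] double table[CHUNK * THREADS];

double sum_block_of(int owner)             /* owner: 0 .. THREADS-1 */
{
    double local[CHUNK];
    double s = 0.0;
    int i;

    /* Fine-grained version: every table[] reference below may be a
     * remote get, so a cluster pays per-element latency:
     *
     *   for (i = 0; i < CHUNK; i++)
     *       s += table[owner * CHUNK + i];
     */

    /* Manually aggregated version: one bulk transfer, then local work. */
    upc_memget(local, &table[owner * CHUNK], CHUNK * sizeof(double));
    for (i = 0; i < CHUNK; i++)
        s += local[i];
    return s;
}

The fine-grained loop is the natural way to write it, and it is what performs well on hardware global-address-space machines like the X1E; the upc_memget version is the aggregation you end up doing by hand on a cluster.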
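And to tie that back to the strided/vector primitives mentioned earlier: ARMCI exposes them directly through calls like ARMCI_PutS, which moves a whole strided patch in one operation that the runtime can pipeline, instead of one message per row. A minimal sketch, with made-up array shapes and names, and assuming the destination buffer was allocated as remotely accessible memory (e.g. via ARMCI_Malloc):

/* Rough sketch: one strided ARMCI put instead of nrows small puts.
 * Assumes row-major nrows x ncols double arrays on both sides, with
 * dst_remote remotely accessible on dest_rank; all names here are
 * illustrative assumptions. */
#include <armci.h>

void put_column_block(double *src, double *dst_remote, int dest_rank,
                      int nrows, int ncols, int blk)
{
    int count[2];
    int src_stride[1], dst_stride[1];

    count[0] = blk * sizeof(double);          /* bytes per contiguous chunk */
    count[1] = nrows;                         /* number of chunks (rows)    */
    src_stride[0] = ncols * sizeof(double);   /* bytes between chunks, src  */
    dst_stride[0] = ncols * sizeof(double);   /* bytes between chunks, dst  */

    /* One strided put: the runtime can pipeline the nrows chunks rather
     * than paying full per-message latency for each row. */
    ARMCI_PutS(src, src_stride, dst_remote, dst_stride, count, 1, dest_rank);
}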
From the "runtime systems" level, the X1E's lack of flexible inline assembly made it difficult to construct (cf. name) the scatter/gather ops that are so vital to its performance. But then again, one could say that massaging codes to be vector-friendly has been part of the bargain on vector machines for decades now...

christian
--
[EMAIL PROTECTED]  (QLogic SIG, formerly PathScale)

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf