> However, looking at the user manual for this application however, I
> suspect the bulk of the work can be made parallel, in contrast to the
> original post:
yes - the very first page of the preface mentions a straightforward decomposition that scales to 40 processors. that would be "shared-nothing", so in this context it could mean 40 separate machines.

as for the "ginormous machine" approach, it's going to lose. you can indeed put O(TB) into a single box, single address space, by using ccNUMA (popularized by AMD, now supported by Intel.) the problem is that a single thread sees only a modest increase in memory performance (local versus local+remote). so you've got something like a 4-socket box where three of the sockets contribute nothing but their memory controllers - not to mention that in the active socket, all the cores but one are idle too.

given the "slicing" methodology that this application uses for decomposition, I wonder whether it's actually closer to sequential in its access patterns, rather than random. the point here is that you absolutely must have ram if your access is random, since your only constraint is latency. if a lot of your accesses are sequential, then they are potentially much more IO-like - specifically disk-like.

in short, suppose you set up a machine with a decent amount of ram (say, an lga2011 board with 8x8G dimms) and a lot of swap. then just run your program that uses 512G of virtual address space. depending on the pattern in which it traverses that space, the results will either be horrible (not enough work per page) or quite decent (enough work in the set of hot pages that the kernel can cache in 64G.) of course, swap to SSD reduces the latency of thrashing and is pretty easy to configure. the real appeal of this approach is that it doesn't need any special hardware to test (you wouldn't bother with a raid controller, since they're absolutely useless for raid0-type patterns.) a rough sketch of this kind of access-pattern test is tacked on at the end of this mail.

> have proper memory, it isn't optimized, and as a result you're
> constantly swapping. Merges are a good example of what /should/ work

if the domain is sliced in the "right" direction, merging should be very efficient. even if sliced in the wrong direction, merging should at least be block-able (and thus not terrible.)

> merging on just one of them that is also outfitted with a ramdisk'd 0.5
> TB Fusion-IO PCI-E flash device. If I am not wildly off the mark on the

I wouldn't bother with PCI-E flash, myself. they tend to have dumb/traditional raid controllers on them. doing raid0 across a handful of cheap 2.5" SATA SSDs is ridiculously easy and scales up fairly well (with some attention to the PCIe topology connecting the controllers, of course.)

regards, mark hahn.
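
PS - to make the traversal-pattern point concrete, here's a rough C sketch of the kind of test I mean. it's not your application - the 64G region size, 4K page size and the touch-one-byte-per-page loop are just assumptions for illustration: map a big anonymous region, then walk it sequentially or randomly and see how long the kernel takes to page it through swap.

    /* rough sketch, not a tuned benchmark: the 64 GiB size is an
     * assumption - pick something larger than your physical ram so the
     * kernel has to push cold pages out to swap. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <sys/mman.h>

    #define GiB  (1024UL * 1024 * 1024)
    #define PAGE 4096UL

    int main(int argc, char **argv)
    {
        size_t size  = 64 * GiB;          /* assumed: bigger than ram */
        size_t pages = size / PAGE;
        int seq = (argc > 1 && argv[1][0] == 's');

        /* anonymous mapping; pages cost nothing until first touched */
        char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (size_t i = 0; i < pages; i++) {
            /* sequential: readahead/writeback can stream pages to swap.
             * random: every fault is a latency-bound seek.
             * touching one byte per page is the "not enough work per
             * page" worst case. */
            size_t p = seq ? i : (size_t)rand() % pages;
            buf[p * PAGE] = 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%s walk of %zu pages: %.1f s\n",
               seq ? "sequential" : "random", pages,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        munmap(buf, size);
        return 0;
    }

compile with something like "gcc -O2 pagetouch.c -o pagetouch" (add -lrt on older glibc), run once with "s" and once without: the gap between the two numbers is roughly the thrashing penalty you'd be signing up for if the application's access really is random.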