On 11/27/2012 11:34 AM, Eugen Leitl wrote:
> On Tue, Nov 27, 2012 at 11:13:25AM -0500, Ellis H. Wilson III wrote:
>
>> Are these problems EP such that they could be entirely Map tasks?
>
> Not at all. This particular application is to derive optimal
> feature extraction algorithms from high-resolution volumetric data
> (mammal or primate connectome). At ~8 nm, even a mouse will
> produce a mountain of structural data.

Pardon my possible naiveté on the applied science here, but it's unclear
to me why the state space explosion is tied to it being embarrassingly
parallel or not. Perhaps, to reword my question: can you describe whether,
and if so at what frequency, the extraction algorithms will need to
barrier sync to communicate? If this is indeed "not at all" EP, then you
will likely have a serious communication problem, and 1GbE will not work
if you need to transmit some or all of the data you are reading locally
to some other remote node.

>> Because otherwise you are going to have a fairly significant shuffle
>> stage in your MapReduce application that will lead to overheads moving
>> the data over the network and in and out of memory/disk/etc. Shuffling
>> can be a real PITA, but it tends to be present in most real-world
>> applications I've run into.
>
> The extracted feature set would be much more compact than the
> raw dataset (at least 10^3 to 10^6 more compact), and could
> be loaded over the GBit/s network into the main cluster with
> no problems.

How are you getting the raw data onto the cluster? This time may become
the dominant one if it is not a write-once read-very-many type of
situation. Maybe you have lots of different feature extraction
algorithms to use on that raw data?

>> Maybe you weren't referring to using Hadoop, in which case this
>> basically looks just like the FAWN project I had mentioned in the past
>> that came out of CMU (with the addition of tiered storage).
>
> http://www.cs.cmu.edu/~fawnproj/ ?

Yep, that's the one.

> Cute, and probably the right application for the
> Adapteva project. If the boards are credit-card
> sized you can mount them on a rackmount tray
> along with a 24-port switch, with a couple of
> fans.
>
> However, I'm thinking about a board you directly plug
> your SATA or SAS hard drive into, probably using
> the hard drive itself (which should be 5k rpm then)
> as a heatsink.

Why do you want the HDD to be a heatsink (i.e.
why is that better in any way than just having the HDD right there and
using a normal passive sink)? And can you expound upon the differences
between the FAWN setup, if it had an HDD saddled right next to it, and
what you are describing? I feel like you're saying the exact same thing,
except that you connect an HDD for capacity reasons and use the onboard
flash for cache instead, both of which are reasonably trivial.

Just trying to get a handle on your (interesting IMHO) idea here, no
non-constructive criticism intended,

ellis
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
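
A rough back-of-envelope sketch of the 1GbE concern raised in the message
above, in case it helps frame the question. The thread gives no raw dataset
size, so the 1 PB figure is purely an assumption for illustration, as are
the names `transfer_seconds` and `raw_bytes`. Only the 10^3 to 10^6
compaction range and the 1 Gbit/s link come from the discussion, and the
estimate ignores protocol overhead, disk speed, and contention.

```python
# Back-of-envelope transfer times over a single 1GbE link.
# All sizes below are hypothetical; the thread gives no dataset size.

GBE_BYTES_PER_SEC = 1e9 / 8  # 1 Gbit/s line rate, ignoring protocol overhead


def transfer_seconds(num_bytes: float,
                     link_bytes_per_sec: float = GBE_BYTES_PER_SEC) -> float:
    """Ideal time to move num_bytes over one link at full line rate."""
    return num_bytes / link_bytes_per_sec


raw_bytes = 1e15  # assumed 1 PB raw volume (illustrative only)

# Compaction factors quoted in the thread: 10^3 to 10^6.
for factor in (1e3, 1e6):
    feature_bytes = raw_bytes / factor
    secs = transfer_seconds(feature_bytes)
    print(f"features at 1/{factor:.0e} of raw: {secs:,.0f} s")

# Moving the raw data itself over the same link, expressed in days.
print(f"raw data: {transfer_seconds(raw_bytes) / 86400:,.1f} days")
```

Under these assumed numbers the extracted features move in seconds to a
couple of hours, while the raw volume would need roughly three months on one
link. That gap is why it matters whether the extraction step ever has to
ship raw data between nodes, and how the raw data reaches the cluster in the
first place.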