On 06/08/12 20:06, Bill Broadley wrote:
> A new user on one of my GigE clusters submits batches of 500 jobs that
> need to randomly read a 30-60GB dataset. They aren't the only user of
> said cluster so each job will be waiting in the queue with a mix of others.

With a 160TB cluster and only a 30-60GB dataset, is there any reason why the user isn't simply storing their dataset in HDFS? Does the data change frequently via a non-MapReduce framework, such that it needs to be pulled from NFS before every job?

If the dataset is in a few dozen files and in HDFS in the cluster, there is no reason why MapReduce shouldn't spawn its tasks directly "on" the data, without needing (most of the time) to move all of the data to every node as you mention.

> The clients definitely see MUCH faster performance when access a local
> copy instead of a small share of the performance/bandwidth of a central
> file server.

This makes perfect sense, and is in fact exactly what Hadoop already attempts to do by trying to co-locate MapReduce tasks with pre-placed data in HDFS. Hadoop tries to move the computation to the data in this case, rather than what you are trying to do: move the data to the computation, which tends to be /way/ harder unless you've got killer storage.

All of this said, it is unclear from your email whether this user is using Hadoop, or if that was just a side note and they are operating in a totally different cluster with a different framework (MPI?).

Best,

ellis
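
To put rough numbers on why moving the data to the computation is so hard here, the following is a back-of-envelope sketch in Python. It assumes the worst case, where each of the 500 jobs pulls the whole dataset once through a single GigE link on the file server, and it uses the theoretical 1 Gbit/s line rate (125 MB/s); real throughput would be lower. The function name and parameters are made up for illustration and do not come from any Hadoop or scheduler API.

```python
# Back-of-envelope estimate: how long does a central file server need to
# ship a dataset to many jobs over GigE? All figures are assumptions.

GIGE_BYTES_PER_SEC = 125e6  # theoretical 1 Gbit/s line rate, not measured throughput


def transfer_hours(dataset_gb, n_jobs, server_links=1):
    """Hours for n_jobs to each pull the full dataset through the server.

    dataset_gb   -- dataset size in GB (10^9 bytes)
    n_jobs       -- number of jobs that each read the whole dataset once
    server_links -- number of GigE links on the file server (hypothetical knob)
    """
    total_bytes = dataset_gb * 1e9 * n_jobs
    seconds = total_bytes / (GIGE_BYTES_PER_SEC * server_links)
    return seconds / 3600


if __name__ == "__main__":
    for size_gb in (30, 60):
        hours = transfer_hours(size_gb, n_jobs=500)
        print(f"{size_gb} GB x 500 jobs over one GigE link: {hours:.1f} hours")
    # 30 GB -> 33.3 hours, 60 GB -> 66.7 hours
```

Under these assumptions a batch needs more than a day of pure transfer time before any caching or local copies help, which is the cost that co-locating tasks with data already stored on the nodes is meant to avoid.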
