On 06/13/12 11:43, Peter wrote:
> I read the initial Q that the full data set may be required by any job
> so an upgrade to my personal filters may be required :). If this were

No, you are correct about that, or at least that's how I understood it
as well. So for instance, Job1 has Task1-30 and the 30GB dataset has
Chunk1-30, each 1GB in size, spread over the entire cluster. Hadoop
just matches each task to the chunk it wants to work on. Yes, this
means at least parts of the process must be embarrassingly parallel,
but that's pretty much taken for granted with big data computation.
The serial parts are typically handled by the shuffle and reduce
phases at the end. If you want to see the matching for yourself,
there's a rough sketch below.
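Here's a minimal Java sketch (the class name and path are mine, purely
hypothetical) that asks the NameNode which hosts hold each block of a
file. The scheduler consults the same block-location information when
it tries to place a task on a node that already stores its chunk:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml/hdfs-site.xml from the classpath.
            FileSystem fs = FileSystem.get(new Configuration());
            // Hypothetical path -- point it at one of your 1GB chunks.
            FileStatus stat = fs.getFileStatus(new Path("/data/chunk01"));
            // Ask the NameNode which hosts hold each block of the file.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(stat, 0, stat.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset() + " -> "
                    + Arrays.toString(b.getHosts()));
            }
        }
    }

Run that against one of your chunks and each block should list three
hosts, one per replica.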
> Given that 30-60Gb is small enough copy everywhere, that sort of takes

I wouldn't expect much performance improvement going from 3 to all 30
chunks on a given node, unless you are incredibly unlucky or something
is terribly misconfigured with your Hadoop instance. While 30GB isn't
too bad to copy elsewhere, it's incredibly poor use of storage
resources to keep 30 copies of the data all over the cluster: at the
default 3x replication the dataset occupies about 90GB cluster-wide,
whereas a full copy on each of 30 nodes would occupy about 900GB.

> The comment regarding the obscuring the replication process was directed
> more towards the user experience, they don't need to know it
> automagically happens BUT behind the scenes the copies are happening all
> the same, with the expected impact incurred on IO etc. So HDFS doesn't
> make the process impact free.

Making 30 copies of a 30GB dataset composed of 30 1GB files is quite
different from making 3 copies of each file, both in size and in the
work pushed onto the user to manage. Even if you get unlucky and one
of your tasks does require remote data, Hadoop handles streaming it to
the task while it needs it and cleans up afterwards. It's going to be
far more considerate about storage resources than any human being will
be. (The replication factor itself is per-file and tunable; see the
P.S. below.)

> If you are able to send more to the list regarding HDFS plan B that
> would be great and certainly something I'd be interested in hearing more
> about. Do you have a blog or similar with references regarding any of
> the above ? If so that would be much appreciated.

Not yet. Working on a website as well -- will let you know as soon as
that completes.

Best,
ellis
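P.S. Since replication came up: a minimal sketch of changing a file's
replication factor through the FileSystem API (the path is
hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Replication is a per-file attribute; 3 is the usual
            // default. Hypothetical path -- use one of your chunks.
            boolean accepted =
                fs.setReplication(new Path("/data/chunk01"), (short) 3);
            System.out.println("change accepted: " + accepted);
        }
    }

The "hadoop fs -setrep" shell command does the same thing if you'd
rather not write Java.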
