> On Tue, Oct 13, 2020 at 1:31 PM Douglas Eadline <deadl...@eadline.org>
> wrote:
>
>>
>> The reality is almost all Analytics projects require multiple
>> tools. For instance, Spark is great, but if you do some
>> data munging of CSV files and want to store your results
>> at scale you can't write a single file to your local file
>> system. Often times you write it as a Hive table to HDFS
>> (e.g. in Parquet format) so it is available for Hive SQL
>> queries or for other tools to use.
>>
>
> You can also commit to a database (but you can't have those running on a
> traditional HPC cluster). What would be nice would be HDFS running on a
> traditional cluster. But that would break the whole parallel filesystem
> exposed as a single mount point thing.... It is funny how these things
> evolved apart from each other to the point they are impossible to marry,
> no?

It was two different goals. HDFS was designed to be a write-once,
read-many "distributed" file system. It was never intended to be a
parallel filesystem or general purpose in any way. Streaming through
distributed data from beginning to end was the goal. Most people are
shocked to learn HDFS does not allow random writes to files (only
appends), and while you can seek within a file, it is not built for
random reads. It really should be called "Map Reduce Filesystem."

HPC parallel filesystems are designed to be general purpose and can
be used by Hadoop. There was at one point a shim for Lustre, and other
file systems are supported, but you lose the data locality of HDFS.
Another piece of trivia: an early version of Hadoop used Torque as
the scheduler.

As for HDFS running on a traditional cluster, it can be done, but I
think it is easier to run two clusters. There is no support for IB in
Hadoop or Spark (other than IP over IB, of course), so if you have
invested in IB, it is not going to get used to its fullest potential.

It really depends on what you need to do with Hadoop or Spark. IMO
many organizations don't have enough data to justify standing up a
16-24 node cluster system with a PB of HDFS.

--
Doug

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf