On Tue, 2 Jun 2009, John Hearns wrote:

> In HPC one would hope that all files are different, so de-duplication would not be a feature which you need.

I beg to differ, at least in the academic environment where I come from. Imagine these two scenarios:

1. copying between users

        step1: PhD student does a good job, produces data and writes
                thesis, then leaves the group but keeps data around
                because the final paper is still not written; in
                his/her new position, there's no free time so the
                final paper advances very slowly; however the data
                can't be moved to a slow medium because it's still
                actively worked on
        step2: another PhD student takes over the project and does
                something else that needs the data, so a copy is
                created(*).
        step3: a short-term practical project by an undergraduate
                student who collaborates with the step2 PhD student
                needs access to the data; as the undergrad student is
                not trusted (he/she can make mistakes that
                delete/modify the data), another copy is created

(*) copies are created for various reasons:
        privacy or intellectual property - people protect their data
                using Unix file access rights or ACLs, the copying is
                done with their explicit consent, either by them or by
                the sysadmin.
        fear of change - people writing up (or hoping to) don't want
                their data to change, so that they can e.g. go back and
                redo the graph that a reviewer asked for. They are
                particularly paranoid about their data and would rather
                copy it than allow other people to access it directly.
        laziness - there are technical solutions for the above 2
                reasons (read-only permissions, per-user ACLs - see
                the sketch after this list), but if the people involved
                don't want to make the effort to use them, copying
                seems like a much easier solution.
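
To make the "technical solutions" above concrete, here is a minimal sketch (Python; the default path is just a placeholder) of the simplest one: drop the write bits for group and others on a finished dataset, so it can be shared read-only instead of being copied. For per-user access one would reach for POSIX ACLs (setfacl) instead, but the idea is the same.

#!/usr/bin/env python
# Sketch: share a finished dataset read-only instead of copying it.
# Drops the write permission for group and others on every directory
# and file, so colleagues can read (and traverse) the data but not
# modify it. The default path below is only a placeholder.

import os
import stat
import sys

def make_tree_read_only(top):
    for dirpath, dirnames, filenames in os.walk(top):
        paths = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in paths:
            mode = os.stat(path).st_mode
            # keep the owner's bits, remove write for group and others
            os.chmod(path, mode & ~(stat.S_IWGRP | stat.S_IWOTH))

if __name__ == "__main__":
    make_tree_read_only(sys.argv[1] if len(sys.argv) > 1 else "/data/md_runs/thesis")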

2. copying between machines

   Data is stored on a group file server or on the cluster where it
   was created, but needs to be copied somewhere else for a more
   efficient (mostly from an I/O point of view) analysis. A copy is
   made, but later on people don't remember why the copy was made or
   whether the data was modified in any way. Sometimes the results of
   the analysis (which can be very small compared with the actual
   data) are stored there as well, making the whole set look like a
   "package" worth being stored together, independently of the
   original data. This "package" can be copied back (so the two
   copies live in the same file system) or can remain separate (which
   makes the copies harder to detect).
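
As a side note, such copies are not that hard to find after the fact. A rough sketch (Python; the paths are made-up examples) of how one could spot files with identical content across two trees, by grouping on size first and then on a content hash - essentially what a de-duplicating file system does, only at file rather than block granularity:

#!/usr/bin/env python
# Sketch: find files with identical content across directory trees.
# Groups files by size first (cheap), then confirms with a SHA-256 of
# the content - roughly what file-level de-duplication would detect.
# The default paths and the chunk size are arbitrary examples.

import hashlib
import os
import sys
from collections import defaultdict

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(roots):
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    pass  # broken symlink, permission problem, ...
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha256_of(path)].append(path)
        for same in by_hash.values():
            if len(same) > 1:
                yield same

if __name__ == "__main__":
    roots = sys.argv[1:] or ["/scratch/analysis", "/home/group/data"]
    for group in find_duplicates(roots):
        print("identical content:")
        for path in group:
            print("  " + path)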

I do mean all of this in an HPC environment - the analysis mentioned above can involve reading, several times over, files ranging from tens of GB to TB (for the moment...). Even if the analysis itself doesn't run as a parallel job, several (many) such jobs can run at the same time, scanning different parameters. [ the above scenarios actually come from practice - not imagination - and are written with molecular dynamics simulations in mind ]

Also don't forget backup - an HPC resource is usually backed up, to avoid losing data that was obtained with precious CPU time (and maybe an expensive interconnect, memory, etc.), so every duplicate file ends up being backed up again as well.

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
