Re: [Beowulf] Station wagon full of tapes

Chris Dagdigian Tue, 26 May 2009 08:37:21 -0700

The flip side to your arguments is that I may not want my tax dollars spent on allowing the NIH to operate peta-scale data repositories. I can't be more specific than this -- my most recent exposure to a large government life science directorate revealed that they were spending $500K/year on EMC maintenance costs for a few tens-of-TBs worth of disk arrays that were going on 6 years old!
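
To put the maintenance figure above on a per-terabyte basis, here is a minimal arithmetic sketch. The $500K/year comes from the paragraph above; the capacities are hypothetical stand-ins for "a few tens-of-TBs", which the post does not pin down.

```python
# Annual maintenance spend quoted above.
ANNUAL_MAINTENANCE_USD = 500_000

def cost_per_tb_year(annual_usd: float, capacity_tb: float) -> float:
    """Maintenance dollars per terabyte per year."""
    return annual_usd / capacity_tb

# "A few tens of TBs" is not pinned down in the post; these capacities
# are hypothetical values used only to bracket the estimate.
for capacity_tb in (20, 30, 50):
    usd = cost_per_tb_year(ANNUAL_MAINTENANCE_USD, capacity_tb)
    print(f"{capacity_tb} TB -> ${usd:,.0f} per TB per year")
```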

I think my main interest in utility storage providers is that they can offer geographical redundancy and large capacity at efficiencies that can't be matched locally by individual institutions or even local groups of institutions. When I look at the full costs of hosting, operating and replicating the data in a local facility the numbers from the "utility" providers start to look more attractive.

It will be interesting to see how this all shakes out. The rate at which raw disk is dropping in price is amazing and may choke off the profit for the utility providers who have invested heavily in building out.


My $.02 of course!






On May 26, 2009, at 11:16 AM, Robert G. Brown wrote:

On Tue, 26 May 2009, Chris Dagdigian wrote:
I deal quite often with the "next-gen" DNA sequencing instruments that produce 1TB/day in TIFF images that are then distilled down to the DNA basecalls before the short reads are subjected to alignment. Then the resulting longer sequences are usually aligned again against a reference genome.
Lots of data, lots of computation.
The 1 Terabyte of TIFF images typically reduces down to about 200GB in intermediate data which is further distilled down into a few hundred KB of actual sequence data. The entire process is interesting and it is a massive Bio/IT challenge as these types of terabyte-scale data producing lab instruments are popping up everywhere (the cost of one of these instruments is now easily within reach of a single grant-funded researcher at a facility of any size...). We are only a few technology revolutions away from these boxes showing up in your point of care primary physician's office (well not really, probably a backend service lab that your physician outsources to ...)
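
The reduction figures above can be turned into rough ratios. This is only a sketch: the 1 TB and 200 GB values come from the paragraph above, while 300 KB is an assumed stand-in for "a few hundred KB", and decimal units are assumed throughout.

```python
# Rough reduction factors for the sequencing data sizes described above.
# Sizes use decimal units; "a few hundred KB" is taken as 300 KB purely
# for illustration (an assumption, not a figure from the post).
TB = 10**12
GB = 10**9
KB = 10**3

raw_tiff = 1 * TB            # TIFF images per instrument per day
intermediate = 200 * GB      # intermediate data
final_sequence = 300 * KB    # assumed value for "a few hundred KB"

def reduction_factor(before: int, after: int) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after

stage1 = reduction_factor(raw_tiff, intermediate)
stage2 = reduction_factor(intermediate, final_sequence)
overall = reduction_factor(raw_tiff, final_sequence)

print(f"TIFF -> intermediate: {stage1:.0f}x")
print(f"intermediate -> sequence: {stage2:,.0f}x")
print(f"overall: {overall:,.0f}x")
```

Whatever the exact final size, the point stands that almost all of the bytes are in the raw and intermediate stages, which is why where that data ends up matters.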
Anyway the new data ingestion service that Amazon offers is, I think, going to be a big deal in our field.
Sure, but why wouldn't it be cheaper for e.g. NSF or NIH to fund an
exact clone of the service Amazon plans to offer and provide it for free
to its supported research groups (or rather, do bookkeeping but it is
all internal bookkeeping, moving money from one pocket to another).

Amazon has to make a profit.  Granting agencies don't have to pay the
profit that Amazon has to make. Amazon has to take substantial risks to
make its profit.  Granting agencies have no risk.
All of the things you assert for DNA sequencing are true for high energy
physics.  Enormous datasets, lots of computation.  HEP's INTERNATIONAL
solution is ATLAS, not Amazon.

Supporting commercial access into such a DB a la >>google<< but for
genomic data, sure, but that's not really cluster computing, that's a
large shared DB. I could see that as a spin off data service of Amazon
or Google or a new business altogether, but I'd view it as a niche and
not really HPC.

Grant funded research involving large scale shared data resources can
ALWAYS be done more cheaply than by buying the data services from
profit-making third parties unless there are nonlinear e.g. proprietary
IP barriers.  This is trebly true given that research facilities are
typically on very high speed networks e.g. lambda rail that the
government is funding anyway, where Amazon or other commercial third
parties have to rent time on those networks and then resell the rental
back to the government at a profit or use slower commercial networks,
with the same sort of throughput markup.
Are there any such barriers here? I'd have to say that I would be most unhappy seeing my own tax dollars going to make Amazon shareholders rich when they could be spent more efficiently without a middleman raking in a 50 to 100% markup on the service. Of course I'm easily irked -- when I think of all the money spent on Windows by the US government it makes my blood boil.

I'd want to see a solid CBA proving that this is the cheapest way to
proceed before dumping tons of tax money into it, if I were king of the
world (or just in charge of a major granting agency).
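
The core of such a cost-benefit comparison might be sketched as follows. This is an illustration only: the 50% and 100% markups are the range quoted above, while the in-house and provider cost figures are hypothetical placeholders in arbitrary units, not data from anyone in this thread.

```python
def commercial_price(provider_cost: float, markup: float) -> float:
    """Price charged when a provider adds `markup` (0.5 = 50%) to its own cost."""
    return provider_cost * (1.0 + markup)

def breakeven_efficiency(markup: float) -> float:
    """How many times cheaper the provider's underlying cost must be than the
    in-house cost before buying the service beats running it in house."""
    return 1.0 + markup

# Hypothetical in-house cost of 100 (arbitrary units per TB-year).
in_house = 100.0
for markup in (0.5, 1.0):                       # the 50% to 100% range quoted above
    for provider_cost in (100.0, 60.0, 40.0):   # hypothetical provider costs
        price = commercial_price(provider_cost, markup)
        verdict = "buy" if price < in_house else "build"
        print(f"markup={markup:.0%} provider_cost={provider_cost:.0f} "
              f"price={price:.0f} -> {verdict}")
```

Under this simple model the two positions in the thread reduce to one question: whether a provider's economies of scale exceed a factor of 1 + markup relative to what an agency could achieve itself.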

  rgb
For the following reasons:

- Bio people are being buried in data
- Once we process the data to get the derived results, the primary data just needs to go somewhere cheap
- Amazon and other internet-scale people can do peta-scale or exa-scale storage far better & cheaper than any of my customers
- These instruments are popping up in wet labs across campus with weak/anemic network links to IT core facilities and data centers
- Scientists in many cases are required to share data that is grant funded
- Amazon has some neat "downloader pays" models that make it easier for researchers to affordably offer up peta-scale data sets for sharing
I suspect that a very large amount of scientific data will be making a 1-way trip into the cloud. The data will stay there "forever" as a deep store. In the occasional cases where the data needs to be re-processed or re-analyzed it would not be unreasonable to fire up some cloud server nodes to do the re-work in-situ.
The disk ingest service was the final piece. I can see this happening in life science environments:
- Massive data generated in the wet lab
- Captured to local storage (10 - 40TB) with small HPC component
- Data is processed locally into derived and distilled forms
- Derived data replicated to campus/lab facilities for online primary storage
- Derived data (and possibly the full raw data) is compressed, placed onto drives and ingested into Amazon for long term storage
- If re-analysis is ever needed, have existing EC2 AMIs preloaded with the necessary software
Basically it comes down to the fact that Amazon may be able to offer big-yet-slow storage in the terabyte to petabyte range at levels of cost and geographical redundancy that would be extremely difficult to match with local resources at a small non-specialized organization.
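
The trade-off behind the thread's subject line, shipping drives versus pushing the same data over a network link, can be estimated with a back-of-the-envelope sketch. All defaults below (80% link efficiency, 2 days in transit, 100 MB/s copy rate) are assumptions chosen for illustration, and ingest time at the receiving end is ignored.

```python
def network_days(terabytes: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Days needed to push `terabytes` (decimal TB) over a link of `link_mbps`
    megabits per second, assuming a sustained fraction `efficiency` of line rate."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86400

def shipping_days(terabytes: float, transit_days: float = 2.0,
                  copy_mb_per_s: float = 100.0) -> float:
    """Days to copy data onto drives at `copy_mb_per_s` megabytes per second
    and ship them; ingest time at the far end is ignored."""
    copy_seconds = terabytes * 1e12 / (copy_mb_per_s * 1e6)
    return copy_seconds / 86400 + transit_days

# 10 and 40 TB match the local storage range mentioned above;
# the link speeds are hypothetical.
for tb in (10, 40):
    for mbps in (100, 1000):
        print(f"{tb} TB over {mbps} Mbit/s: {network_days(tb, mbps):.1f} days "
              f"vs shipped drives: {shipping_days(tb):.1f} days")
```

With these assumed numbers, shipping wins on a slow campus link and loses on a well-provisioned one, so the answer depends heavily on the "weak/anemic network links" mentioned earlier.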
My $.02 of course

-Chris






On May 26, 2009, at 8:58 AM, Jeff Layton wrote:
Gerry Creager wrote:
There was an interesting brainstorming session at Rocks-A-Palooza a couple of weeks ago. Someone wants to offer Amazon resources. Problem remains for me: How can I get sufficient cloud resources for computing (I'll hammer on dataset transport in a moment) that will handle reasonable weather models with their small message MPI chatter, and lots of file I/O? I've been assured that Amazon's ready to accommodate that.
This is one of the problems - clouds aren't ready for this kind of
usage model yet. They only have GigE and usually it's oversubscribed.
When you say file IO, they hear capacity, not performance (either
throughput or IOPS). And as you point out, the pipe to/from the
cloud is not ready for lots of data.
However, getting data into S3 for availability, when a daily multi-gigabyte dataset is used for initiation, and another is created as output, is going to be expensive, and likely slow. I think there are other approaches that have to be evaluated. I am not sure the cloud is ready for MPI play on a significant basis, just yet.
I haven't seen the cloud ready yet for anything other than embarrassingly parallel codes (i.e. single node, small IO requirements). Has anyone seen differently? (As an example of what might work, CloudBurst seems to be gaining some traction - doing sequencing in the cloud. The only problem is that sequencing can generate a great deal of data pretty rapidly.)
Jeff
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:r...@phy.duke.edu

