On Thu, Dec 6, 2012 at 12:53 PM, Chi Chan <chichan2...@gmail.com> wrote:
> On Wed, Nov 28, 2012 at 12:10 PM, Rayson Ho <raysonlo...@gmail.com> wrote:
>> 1) We ran a 10,000-node cluster on Amazon EC2 for Grid Engine
>> scalability testing a few weeks ago:
>>
>> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
>
> I saw some other large clusters, and most used a larger number of cores
> and fewer nodes.
The main goal of the 10,000-node cluster was to stress the networking layer of
Grid Engine. In Grid Engine, commlib is the low-level library that handles
communication between Grid Engine nodes. In 2005, when Hess Corp. added more
nodes to their Grid Engine cluster, they found that the cluster stopped working
at around 1,000 nodes. (That commlib bug was worked around by Sun, finally
fixed by Ron and me, and the fix is now in every fork of Grid Engine.) There
are some performance issues that we would like to fix before we run something
even larger (like 20,000 nodes and beyond :-D ), and I think we are hitting
the "C10K problem" that web servers encountered a few years ago!

> For example, Numerate’s Drug Design Platform Scales to 10,000+ Cores

Running 10,000 cores in EC2 is way easier than booting up 10,000 nodes. If all
you need is 10,000 cores, then with cc2.8xlarge (Cluster Compute Eight Extra
Large Instance), which has 16 Intel Xeon E5-2670 cores per VM, you only need
around 600 instances (instance = VM). With MIT StarCluster, a 100-instance
cluster can be provisioned with NFS, Open Grid Scheduler/Grid Engine, user
accounts, MPI libraries, etc. in less than 10 minutes. Since the overhead of
provisioning an instance is the same whether it is a small instance with 1
core or a large instance with 16 cores, we can use a larger instance type and
get 160,000 cores in EC2 - forming the 10,000-node cluster in the same amount
of time. On the other hand, the scheduler logic will now see 16 times more job
slots than before, which stresses the scheduler even more.

> Using Spot Instances:
>
> http://numerate.com/blog/?p=155

We also used spot instances to lower the cost. The spot price for cc2.8xlarge
is $0.27/hr, so one can get the same cluster for $2,700/hr if the jobs can be
restarted without issues.
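For those who have not run into it: the "C10K problem" was, at its core, about
select()/poll() scanning every descriptor on each call, plus select()'s
FD_SETSIZE cap (typically 1,024 descriptors - close to where that ~1,000-node
commlib limit sat). This is a toy illustration only, not commlib code: the
event-driven pattern (epoll/kqueue via Python's selectors module) that servers
adopted to get past select():

```python
import selectors
import socket

# Toy sketch of the post-C10K server pattern: register sockets with the
# OS's scalable readiness API (epoll/kqueue, picked by DefaultSelector)
# instead of select(), which rescans every fd and is capped at FD_SETSIZE.
sel = selectors.DefaultSelector()

server = socket.socket()
server.bind(("127.0.0.1", 0))  # any free port
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

# One local client connects and sends a message.
client = socket.create_connection(server.getsockname())
client.sendall(b"ping")

messages = []
for _ in range(2):  # two readiness events expected: accept, then read
    for key, _mask in sel.select(timeout=5):
        if key.fileobj is server:
            conn, _addr = server.accept()  # new connection is ready
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            messages.append(key.fileobj.recv(1024))  # data is ready

print(messages[0])  # b'ping'
```

The same loop shape handles one connection or ten thousand; the kernel only
reports the descriptors that are actually ready, so cost no longer grows with
the total number of registered sockets.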
The on-demand price is $2.40/hr for the 16-core cc2.8xlarge, and that's
$24,000/hr for the 10,000-node, 160,000-core cluster. That can be very
expensive or very cheap depending on how you use the cluster: if all you need
is to run something large, get the results quickly, and then let the cluster
sit idle for a few months, then IMO Cloud HPC is the best choice! (Note: as a
hack, if the cluster is too large, one could break it up into smaller
clusters, but then it is not a *real* large cluster, as the number of job
slots seen by the scheduler logic is much smaller!)

> And I think they are also using bare-metal instead of VM, so it can
> benefit a few types of parallel apps that require low latency.

Yes, that's one of the advantages of Gompute - they offer bare-metal machines
(instead of VMs as in EC2), so the machines are not shared. In EC2, if you run
in a VPC (Virtual Private Cloud), then you can request "Dedicated Instances",
but the cost is quite high if all you need is just 1 or 2 instances. However,
if you have thousands of nodes, then $10/hr is nothing compared to the total
cost.

http://aws.typepad.com/aws/2011/03/amazon-ec2-dedicated-instances.html

Rayson

P.S. I will follow up with you offline if you have further questions...

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

>
> --Chi
>
>
>>
>>
>> 3) Lastly, we should also mention StarCluster from MIT. While it is
>> not backed by any single vendor, StarCluster is used by lots of
>> companies - for example, the BioTeam recommends it, and we also use it
>> for some of our Grid Engine testing. If one just needs a small
>> to medium cluster, then StarCluster can provision it for you in EC2
>> very quickly; for example, a 100-node cluster could be installed in
>> around 10 minutes - and that was using spot instances, which have a
>> slightly longer start time due to the bidding process.
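For anyone who wants to check the cost figures quoted earlier in this message,
the arithmetic is straightforward (prices are the late-2012 numbers from this
thread and will change):

```python
import math

# Back-of-the-envelope check of the cluster figures quoted in this thread.
nodes = 10_000
cores_per_node = 16     # cc2.8xlarge: dual Intel Xeon E5-2670
on_demand_hr = 2.40     # $/hr per cc2.8xlarge, on-demand (2012 pricing)
spot_hr = 0.27          # $/hr per cc2.8xlarge, spot price (2012 pricing)

total_cores = nodes * cores_per_node
print(total_cores)                   # 160000
print(nodes * on_demand_hr)          # -> $24,000/hr on-demand
print(nodes * spot_hr)               # -> $2,700/hr on spot

# And if all you need is ~10,000 *cores* rather than 10,000 nodes:
print(math.ceil(10_000 / cores_per_node))  # 625 instances (the "around 600")
```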
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>>
>>
>> On Fri, Oct 26, 2012 at 5:26 PM, Douglas Eadline <deadl...@eadline.org> wrote:
>>> Since the North East Coast (yea, the capitals are there for you Lux)
>>> will be under some clouds this weekend, I thought my
>>> recent survey of "HPC Cloud" offerings may be of
>>> interest. (notice the quotes)
>>>
>>> Moving HPC to the Cloud
>>> http://hpc.admin-magazine.com/Articles/Moving-HPC-to-the-Cloud
>>>
>>> Some cold water to dump on your head may be found here:
>>>
>>> Will HPC Work In The Cloud?
>>> http://clustermonkey.net/Grid/will-hpc-work-in-the-cloud.html
>>>
>>>
>>> --
>>> Doug

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf