Roland, the OpenHPC integration IS interesting. I am on the OpenHPC list and look forward to the announcement there.
On 17 May 2018 at 15:00, Roland Fehrenbacher <r...@q-leap.de> wrote:
>
> >>>>> "J" == Lux, Jim (337K) <james.p....@jpl.nasa.gov> writes:
>
>     J> The reason I hadn't looked at "diskless boot from a server" is
>     J> the size of the image - assume you don't have a high bandwidth
>     J> or reliable link.
>
> This is not something to worry about with Qlustar. A (compressed)
> Qlustar 10.0 image containing e.g. the core OS + Slurm + OFED + Lustre
> is a mere 165MB to be transferred from the head node to a compute node
> (eating 420MB of RAM when unpacked as the OS on the node). Qlustar
> (and its non-public ancestors) have never used anything but RAM disks
> (with real disks for scratch); the first cluster running this, at the
> end of 2001, was on Athlons ... and eaten-up RAM in the range of 100MB
> still mattered a lot at that time :)
>
> So over the years we perfected our image build mechanism to achieve a
> close to minimal (size-wise) OS, minimal in the sense of: given the
> required functionality (wanted kernel modules, services,
> binaries/scripts, libs), generate an image (module) of minimal size
> providing it. That is maximally light-weight by definition.
>
> Yes, I know, you'll probably say "well, but it's just Ubuntu ...".
> Not for much longer though: CentOS support (incl. OpenHPC integration)
> is coming very soon ... And all open source and free.
>
> Best,
>
> Roland
>
> -------
> https://www.q-leap.com / https://qlustar.com
> --- HPC / Storage / Cloud Linux Cluster OS ---
>
>     J> On 5/12/18, 12:33 AM, "Beowulf on behalf of Chris Samuel"
>     J> <beowulf-boun...@beowulf.org on behalf of ch...@csamuel.org>
>     J> wrote:
>
>     J> On Wednesday, 9 May 2018 2:34:11 AM AEST Lux, Jim (337K) wrote:
>
>     >> While I'd never claim my pack of beagles is HPC, it does share
>     >> some aspects - there's parallel work going on, the nodes need
>     >> to be aware of each other and synchronize their behavior (that
>     >> is, it's not an embarrassingly parallel task that's farmed out
>     >> from a queue), and most importantly, the management has to be
>     >> scalable. While I might have 4 beagles on the bench right now,
>     >> the idea is to scale the approach to hundreds. Typing "sudo
>     >> apt-get install tbd-package" on 4 nodes sequentially might be
>     >> ok (although pdsh and csshx help a lot); it's not viable for
>     >> 100 nodes.
>
>     J> At ${JOB-1} we moved to diskless nodes booting RAMdisk images
>     J> from the management node back in 2013, and it worked really
>     J> well for us. You no longer have the issue of nodes getting out
>     J> of step because one of them was down when you ran your install
>     J> of a package across the cluster, it removes HDD failures from
>     J> the picture (though that's likely less of an issue with SSDs
>     J> these days), and did I mention the peace of mind of knowing
>     J> everything is the same? :-)
>
>     J> It's not new: the Blue Gene systems we had (BG/P 2010-2012 and
>     J> BG/Q 2012-2016) booted RAMdisks, as they were designed from the
>     J> beginning to scale up to huge systems and to remove as many
>     J> points of failure as possible - no moving parts on the node
>     J> cards, no local storage, no local state.
>
>     J> Where I am now we're pretty much the same, except that instead
>     J> of booting a pure RAM disk we boot an initrd that pivots onto
>     J> an image stored on our Lustre filesystem. These nodes do have
>     J> local SSDs for local scratch, but again no real local state.
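
For anyone who hasn't built one of these, the "initrd that pivots onto
an image" approach Chris describes is conceptually pretty small. The
sketch below is only meant to show the moving parts - the filesystem
name, image path and module list are invented, and a real generator
(dracut, Qlustar's own tooling, etc.) handles plenty of corner cases
this ignores:

#!/bin/sh
# Sketch of an initramfs /init: mount a read-only root image from a
# shared filesystem, add a tmpfs overlay, switch onto it. Every name
# here (fsname, image path, modules) is an illustrative assumption.
set -e

mount -t proc     proc     /proc
mount -t sysfs    sysfs    /sys
mount -t devtmpfs devtmpfs /dev

# Whatever the root image and interconnect need (site-specific).
modprobe loop
modprobe squashfs
modprobe overlay
modprobe lustre

# Shared filesystem plus the compressed root image it holds.
mkdir -p /mnt/share /ro /rw /newroot
mount -t lustre mgs@o2ib:/fsname /mnt/share
mount -t squashfs -o ro,loop /mnt/share/images/node-root.squashfs /ro

# Thin writable layer in RAM so the image itself stays read-only.
mount -t tmpfs tmpfs /rw
mkdir -p /rw/upper /rw/work
mount -t overlay overlay \
    -o lowerdir=/ro,upperdir=/rw/upper,workdir=/rw/work /newroot

# Carry the early mounts over and hand off to the real init.
mount --move /proc /newroot/proc
mount --move /sys  /newroot/sys
mount --move /dev  /newroot/dev
exec switch_root /newroot /sbin/init

Reboot the node and you are back to exactly the published image, which
is really all "no local state" means in practice.
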
>
>     J> I think the place where this is going to get hard is on the
>     J> application side of things. There were things like
>     J> Fault-Tolerant MPI (which got subsumed into Open MPI), but it
>     J> still relies on the applications being written to use and cope
>     J> with that. Slurm includes fault tolerance support too, in that
>     J> you can request an allocation and, should a node fail, have
>     J> "hot-spare" nodes replace the dead node - but again your
>     J> application needs to be able to cope with it!
>
>     J> It's a fascinating subject, and the exascale folks have been
>     J> talking about it for a while - LLNL's Dona Crawford gave a
>     J> keynote about it at the Slurm User Group in 2013 and it is
>     J> well worth a read.
>
>     J> https://slurm.schedmd.com/SUG13/keynote.pdf
>
>     J> Slide 21 talks about the reliability/recovery side of things:
>
>     J> # Mean time between failures of minutes or seconds for
>     J> # exascale
>     J> [...]
>     J> # Need 100X improvement in MTTI so that applications can run
>     J> # for many hours. Goal is 10X improvement in hardware
>     J> # reliability. Local recovery and migration may yield another
>     J> # 10X. However, for exascale, applications will need to be
>     J> # fault resilient
>
>     J> She also made the point that checkpoint/restart doesn't scale:
>     J> at exascale you will likely end up spending all your compute
>     J> time doing C/R due to failures and never actually getting any
>     J> work done.
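
On the Slurm side, the "hot-spare" pattern Chris mentions can be
approximated by asking for one more node than the job actually needs
and telling Slurm not to kill the job when a node dies. The sketch
below is only meant to show the shape of it - the node counts,
application name and restart flag are placeholders, and the
application still has to recover from its own checkpoints:

#!/bin/bash
#SBATCH --job-name=ft-sketch
#SBATCH --nodes=5            # application runs on 4, the 5th is a spare
#SBATCH --ntasks-per-node=32
#SBATCH --no-kill            # keep the allocation if a node fails
#SBATCH --time=24:00:00

# Re-launch the job step until it exits cleanly. After a node failure
# Slurm marks the node down, so later steps land on the surviving
# nodes (including the spare). Recovering lost work is still entirely
# up to the application's own checkpoint/restart logic.
until srun --nodes=4 --ntasks=128 ./my_mpi_app --restart=latest
do
    echo "job step failed, restarting on surviving nodes" >&2
    sleep 60
done

Whether that is enough in practice comes back to Chris's (and Dona
Crawford's) point: the application has to be written to tolerate the
failure, otherwise the spare node only helps you re-queue faster.
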
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf