On Fri, 7 Feb 2014, Bogdan Costescu wrote:
On Fri, Jan 31, 2014 at 4:30 PM, Mark Hahn <h...@mcmaster.ca> wrote:
it would split the responsibility into one organization concerned
only with hardware capital and operating costs, and another group
that does purely os/software/user support.

Well, in a way this split already exists in larger HPC centers, where
there are different people taking care of the hardware and software
sides. Except that they are part of the same organization, probably
have easy ways to communicate and work together, and a single boss :)

perhaps VERY large centers.  box-monkeying probably requires one person
per 10k or so nodes, perhaps less, depending on the organization's attitude towards vendor service contracts, time-to-repair, whether nodes
are repaired at all, etc.  I would argue that such activity should really be
regarded as "Facilities", and need essentially no contact, communication or (need for) shared bosses.

I think this is a far more natural division than the usual sysadmin vs user/app-specialist.


If I understand correctly your idea, the two organizations would be
separate entities, possibly with different bosses, and the
virtualization layer would separate them at runtime.

sure, let's call it Facilities and Everyoneelse (EOE).


In my view, this
isn't terribly different from having 2 organizations with a complete
stack (HW+SW) each, where the internal communication/workflow/etc. in

I don't follow you at all.  to me, Facilities has essentially no SW,
and to EOE, hardware is almost invisible.  (well, either HW is available
or it is not).  Amazon/Azure/GCE/etc seem to agree that this is a useful
dividing point (though of course all such IaaS providers also attempt to add
value (extract revenue)) via storage, bandwidth, monitoring/automation, etc.


each of them is better, but there's an overall performance loss as the
two HPC installations can't be used as a single one. And here we come
to the difference between grid (2 HPC installations) and cloud (single
HPC installation with VMs). Pick your poison...

yeah, you lost me.  I'm talking about a horizontal partition in the stack.
I don't see any relation between outdated notions of Grid and what I'm talking about. (the division I'm proposing isn't really even dependent
on using VMs: EOE could offer an API to boot on metal, for instance.)


we wouldn't, obviously.  owning an IB card is only relevant for an MPI
program, and one that is pretty interconnect-intensive.  such jobs
could simply be constrained to operatin in multiples of nodes.

I somehow though that the cluster would be homogeneous.

again, I don't follow you.  if you're buying a cluster for a single dedicated
purpose, then there is no real issue.  my context is "generic" academic
research HPC, which is inherently VERY high variance.


If you operate
with constraints/limits/etc, then I would argue that the most cost
effective way is to buy different types of nodes for different types
of jobs:

thats fine if you have a predictable job mix - you can partition it
and specialize each node.  though I'd argue that this is a bit deceptive:
if your workload is serial, you might well still want IB just to provide
decent IO connectivity (in spite of not caring about latency) so the node
might be identical to an MPI workload.

memory-per-core is certainly a parameter that matters, and to some extent cpu-memory-bandwidth can be optimized (mainly by varying the number and clock of cores, of course: at any moment there is pretty much one reasonable choice of memory.

hmm, I suppose add-in-cards are another anti-generic dimension, since jobs in general don't want a GPU, but some really do. I would claim that
add-in accelerators/coprocessors are not a permanent feature, and don't
really change the picture.  (that is, the field is attempting to become
more generally useful in directions such as Phi and AMD-APU, and that will make GP-GPU no longer a thing anyone talks about eventually.)


IB-equipped for MPI jobs,

IB is still desired for IO, even for non-MPI.


many-core nodes for threaded-but-not-MPI jobs,

I don't think so - MPI and serial jobs still want manycore nodes, since the main point of manycore is mainly compute density/efficiency. in a very broad
sense, synchronization is not dramatically faster amongst cores on a node
versus a fast inter-node fabric (or conversely, message passing can
efficiently use shared memory.)


high-GHz few-core nodes for single CPU jobs.

I don't see that happening much.  if people have only a few serial jobs,
they'll run them on their 3.7 GHz desktop, and people who have many
serial jobs would rather have 32 2.4GHz cores rather than 4x3.7.


Also I somehow though that the discussion was mostly about
tight-coupled communicating jobs.

no.


If jobset is very heterogeneous, are
all types of jobs amenable to run in a VM ? F.e. a job taking one full
node both in terms of cores and RAM and with little communication
needs, could run on the physical HW or in a VM and use the same node.

sure. traditionally, running on bare metal has been somewhat slower to setup, but could be faster when running due to virtualization overheads
(if any).


In this case the cost of virtualization (more below) directly
translates to the organizational split, i.e. there is no technical
advantage of running in a VM.

well, not quite: it's always potentially handy to have the hypervisor:
let it step in to provide isolation, or to perform checkpoints/migration...


I don't know why you ask that.  I'm suggesting VMs as a convenient way
of drawing a line between HW and SW responsibilities, for governance
reasons.

You are indeed drawing a line, what I was arguing about is its
thickness :)

sorry, lost me again.


Let's talk about some practical issues:
- the HW guys receive a VM which requires 1CPU and 64GB RAM.

no, HW guys have a datacenter filled with 10k identical nodes, each
with 20 cores and 64G ram, qdr IB, 4x1T local disks.  they are responsible
for keeping it powered, cooled and repaired.


This is a
hard requirement and the VM will not run on a host with less than
this.

memory is a parameter that might well justify "de-homogenizing" nodes,
but the problem is that this (partitioning in general) always introduces
the opportunity for inefficiency when supply and demand don't match.


This VM might come as a result of some user previously running
through the queuing system on the physical HW and not specifying the
amount of required memory - which is very often the case;

never for us: we require a hard memory limit at submit time. it would be amusing to use VMs to address this though, if users really didn't want to predict limits or couldn't. (arguably our experience is that lots of users are totally useless in setting this parameter...)

one could imagine launching such processes on a box that has vast memory,
then, once you think the process has stabilized, migrating it to a box where it just fits. fundamentally, the question is how much you're going
to save by doing this - is memory cheap?  (yes, you'd also have to deal
with the issue of jobs that have multiple phases with different memory use, so might be migrated ("repacked") more than once.)


- the HW guys have to provide some kind of specifications for the VMs.

facilities guys just run the hardware; yes, some sort of negotiation needs
to take place to ensure that the hardware can actually be used.


It will make a large difference in performance whether the VM (say
KVM) will expose a rtl8139 or an optimized virtio device. Same whether
the VM will just provide an Ethernet device or a passthrough-IB one.

will it?  do you have numbers?


Also same whether the VM exposes a generic CPU architecture or gives
access to SSE/FMA/AVX/etc.

I can't imagine any reason to hide physical capabilities. if one had heterogenous HW, it might be valuable to track cpu feature usage of each
job, to maximize scheduling freedom, of course.  same as tracking memory
usage or MPI intensity or IO patterns. but in none of those cases would you actually lie to clients about availability.


or if it allows access (again passthrough?)
to a GPGPU/MIC device.

high-cost heterogeneity is really a strategic question: do you think that demand will be predictable enough to justify hiving off a specialized
cluster?  using VM/containers makes it *easier* to manage a mixed/dynamic
cluster, since, for instance, most GPU jobs don't fully occupy the CPU cores,
which can then be used by migratory serial jobs.


- HW has failures. Do you use some kind of SLA to deal with this ?

I don't see this as a problem. in my organization, none of the compute nodes even has UPS power, and our MTBF is low enough that people get good work done. in the governance structure I'm proposing, there would be some sort of interface between facilities and EOE, but there's nothing difficult there: either nodes work or they don't. SLAs are a legalistic way to approach it, whereas shared monitoring would make it feel
less zero-sum.


More technical, how does a failure in HW translate to a failure of the
VM or in the VM ?

I don't know what you mean.  are you suggesting that byzatine failure modes
would be widespread and thus a concern? it, that facilities and EOE would have a hard time disagreeing on what constitutes failure?


 though it's true that this could all be done bare-metal
(booting PXE is a little clumsier than starting a VM or even container.)

With the degree of automation we have these days, I don't think in
terms of clumsiness but in terms of time needed until the job can
start. It's true that a container can start faster than booting a node
and VMs can be paused+resumed. But do a few seconds to tens of seconds
make a huge difference in node repurposing time for your jobset ? If a

as I've said, we have no jobset - or rather *all* jobsets. we, like most HPC centers, tell users to make their jobs last at least minutes, because we know that our (crappy old) infrastructure has seconds-to-minutes of overhead. obviously this is not inherent, and if all our users suddenly
wanted to run only 5s jobs, we could figure out a way to do it.

let me put it this way: more isolation always costs more startup overhead
(and sometimes some ongoing speed cost.) this is well-known, though not particularly well-handled anywhere to my knowlege. most Compute Canada centers do whole-node scheduling, and some provide layered systems for doing single-user queueing of sub-jobs for this reason. obviously, it would be far better for the user not to need to get dragged down to this level, assuming the system could manage it efficiently. (I also really hate schedulers that have "array jobs" because it's usually nothing more than admission that the scheduler's per-job overhead is too high.)


typical job runtime is in the hours to days range, I would say that it
doesn't...

we have no discernable job mixture, and I think that accurately reflects the reality of HPC (or "Advanced Research Computing") today. a particular
center may decide "we want nothing to do with anything other than
tight-coupled MPI jobs of 10k ranks or greater that run for at least 1d".
bully for them.  I'm talking about the whole farm, not just cherry picking.


and that many jobs don't do anything that would stress the interconnect
(so could survive with just IP provided by the hypervisor.)

This makes a huge difference.

debatable; do you have any data?


If many/most jobs have this pattern,

as I've said several times, I'm talking about a job stew that has no discernable "many/most".

you can force users to sort themselves into partitions based on the kind
of resources they use.  I'd argue that this is irresponsibly BOFHish,
and simply becomes an even less tractable partitioning problem.

that is, if you decide to cherry-pick big-HPC into a single tier1 center,
and then apple-pick all the small-serial into another, separate center,
you still have to decide how much cash to spend on each - and MORE importantly, how much to spend on the "everyone else" center.)


then a traditional HPC installation is probably the wrong solution, a
deal with one of the large cloud providers would probably provide much
better satisfaction.

the interesting thing here is that you seem to think there's something special about HPC or "large cloud providers". I'm suggesting there is not.
that running datacenters is a very straightforward facilities challenge
(with, incidentally, little economy of scale.) and that a Facilities organization (like Amazon) could do HPC perfectly well. (Amazon happens to make an obscene profit on their Facilities business, which is why it's
not a realistic or rational choice to replace forms of HPC.)


But in this case the HW guys (which side are you
? :)) will remain jobless...

I'm full-stack, myself: from dimms to collaborating with users on research.

regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf

Reply via email to