Hello List,
Recently, I've looked into some dangling lock problems we've had after a
partial power loss.
Here's my analysis of what happens:
-A user application on a compute node requests a lock for a file on an
NFS-mounted file system;
-the NFS server grants the lock;
-a partial power loss (just on the compute-node side) then takes the client
down before it can release the lock, so the lock stays registered on the
server and dangles.
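For concreteness, the client side of the first two steps is just an ordinary
POSIX advisory lock. A minimal sketch of what the application does (the path
and the Python wording are mine, not taken from the real application):

    import fcntl

    # Step 1, as seen from the compute node: take an exclusive POSIX
    # (advisory) lock on a file that lives on the NFS mount.  Over NFS
    # the lock state ends up being held by the server on the client's
    # behalf.
    f = open("/nfs/scratch/app.lock", "w")    # hypothetical path
    fcntl.lockf(f, fcntl.LOCK_EX)             # step 2: the server grants it
    # ... critical section: if the node loses power here, the unlock
    #     below never runs and the server-side lock is left dangling ...
    fcntl.lockf(f, fcntl.LOCK_UN)
    f.close()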
On Thu, 16 Jan 2020 23:24:56 "Lux, Jim (US 337K)" wrote:
What I’m interested in is the idea of jobs that, if spread across many
nodes (dozens), can complete in seconds (<1 minute), providing
essentially “interactive” access, in the context of large jobs taking
days to complete. It’s not clear to
The problem with timeslicing is that when one job is pre-empted, its
state needs to be stored somewhere so the next job can run. Since many
HPC jobs are memory intensive, using RAM for this is not usually an
option. Which leaves writing the state to disk. Since disk is many
orders of magnitude slower than RAM, a single swap of one job's state out and
the next job's state back in costs minutes per node, which defeats the purpose
of a seconds-long interactive slot.
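To put rough, entirely assumed numbers on that:

    # Back-of-envelope cost of swapping one job's state to disk so another
    # can run.  All figures are assumptions chosen only for illustration.
    ram_in_use  = 256 * 2**30    # 256 GiB resident on one node
    write_rate  = 2 * 2**30      # ~2 GiB/s to fast local storage
    print(ram_in_use / write_rate)   # ~128 s to write the state out, and
                                     # a similar cost to read the next
                                     # job's state back in

So even with friendly assumptions a single "context switch" costs minutes per
node, which is hard to square with sub-minute interactive slots.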
Hi Jim,
While we allow both batch and interactive, the scheduler handles them the
same. The scheduler uses queue time, node count, requested wall time,
project id, and others to determine when items run. We have backfill turned
on, so that when the scheduler allocates a large job and the time to drain
enough nodes for it is long, shorter jobs that fit in that drain window can
run in the meantime.
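For anyone not familiar with backfill, the test the scheduler applies is
roughly the following; a toy sketch, not our scheduler's actual logic:

    from collections import namedtuple

    Job = namedtuple("Job", "nodes walltime")

    def can_backfill(job, idle_nodes, now, big_job_start):
        # A waiting job may jump ahead only if it fits on currently idle
        # nodes AND its requested wall time finishes before the large
        # job's reservation begins, so the large job is never delayed.
        return job.nodes <= idle_nodes and now + job.walltime <= big_job_start

    # e.g. a 4-node, 10-minute job fits while 10 nodes drain for an hour:
    print(can_backfill(Job(4, 600), idle_nodes=10, now=0, big_job_start=3600))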
In the Grid Engine world, we've worked around some of the resource
fragmentation issues by assigning static sequence numbers to queue
instances (a node publishing resources to a queue) and then having the
scheduler fill nodes by sequence number rather than spreading jobs across
the cluster. This leaves the hosts at the far end of the sequence empty, so
large jobs can still find whole nodes.
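(If I'm remembering the knob names right, that's the seq_no attribute on the
queue instances plus queue_sort_method seqno in the scheduler configuration.)
The effect is simply "fill from one end of a fixed ordering"; a toy sketch of
the placement policy, not Grid Engine's code:

    def place(cores_needed, nodes):
        # nodes: {host: [seq_no, free_cores]}.  Sorting candidates by a
        # static sequence number (rather than, say, least-loaded first)
        # packs small jobs onto the low-numbered hosts and keeps the
        # high-numbered hosts whole for large allocations.
        for host, slot in sorted(nodes.items(), key=lambda kv: kv[1][0]):
            if slot[1] >= cores_needed:
                slot[1] -= cores_needed
                return host
        return None

    nodes = {"n001": [1, 32], "n002": [2, 32], "n003": [3, 32]}
    print(place(8, nodes), place(8, nodes))   # both land on n001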
That’s an interesting point (cloud vs cluster) – if your jobs are sufficiently
small that they will “fit” on a single node (so there’s no real need for
inter-node communication) and it’s EP, then spinning up 1000 cloud instances might
be a better approach.
From: Beowulf on behalf of Tim Cutt
Hi Jim,
Something like this can be done within traditional resource managers by
using consumable generic resources, and oversubscribing your nodes. E.g.
a 32-core node would be defined as a 64-core node, with 32 "batch"
resources, and 32 "interactive" resources. Submitting a job to a batch
queue requests a "batch" resource, while an interactive session requests an
"interactive" one, so the two classes cannot crowd each other out.
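A sketch of the bookkeeping that gives you, independent of any particular
resource manager's syntax (the names are mine):

    # One physical 32-core node is advertised as 64 schedulable slots,
    # split into two consumable pools so neither class starves the other.
    node = {"batch": 32, "interactive": 32}

    def try_start(job_class, cores, node):
        # A job draws only from its own pool: interactive bursts can never
        # eat the slots set aside for batch work, and vice versa.
        if node[job_class] >= cores:
            node[job_class] -= cores
            return True
        return False

    print(try_start("interactive", 4, node))   # True,  28 interactive left
    print(try_start("batch", 40, node))        # False, the batch pool is only 32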
Indeed, and you can quite easily get into a “boulders and sand” scheduling
problem; if you allow the small interactive jobs (the sand) free access to
everything, the scheduler tends to find them easy to schedule, partially fills
nodes with them, and then finds it can’t find contiguous resources for the
large jobs (the boulders).
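To make that concrete with invented numbers: on 100 nodes of 32 cores each, a
couple of hundred single-core interactive jobs scattered one or two per node
leave the cluster more than 90% idle, yet a boulder that needs, say, 20 whole
nodes cannot start anywhere until the sand drains away.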