Folks, I could use some advice on which cluster job scheduler (batch queuing system) would be most appropriate for my particular needs. I've looked through docs for SGE, Slurm, etc., but without first-hand experience with each one it's not at all clear to me which I should choose...
I've used Sun Grid Engine for this in the past, but the result was very clunky and hard to maintain. SGE seems to have all the necessary features underneath, but no good programming API, and its command-line tools often behave in ways that make them a poor substitute.

Here's my current list of needs/wants, starting with the ones that probably make my use case more unusual:

1. I have lots of embarrassingly parallel, tree-structured jobs which I dynamically generate and submit from top-level user code (which happens to be written in R). E.g., my user code generates 10 or 100 or 1000 jobs, and each of those jobs might itself generate N jobs. Any given job cannot complete until all its children complete. Also, multiple users may be submitting unrelated jobs at the same time, some of their jobs should have higher priority than others, etc. (The usual reasons for wanting a cluster scheduler in the first place, I think.) Thus, merely assigning the individual jobs to compute nodes is not enough; I need the cluster scheduler to also understand the tree relationships between the jobs. Without that, it would be too easy to get into a live-lock situation, where all the nodes are tied up with jobs, none of which can complete because they are waiting for child jobs which cannot be scheduled.

2. Sometimes I can statically figure out the full tree structure of my jobs ahead of time, but other times I can't or won't, so I definitely need a scheduler that lets me submit new sub-jobs on the fly, from any node in the cluster.

3. The jobs are ultimately all submitted by a small group of people who talk to each other, so I don't really care about any fancy security, cost accounting, "grid" support, or other such features aimed at large and/or loosely coupled organizations.

4. I really, really want a good API for programmatically interacting with the cluster scheduler and ALL of its features.
   I don't care too much what language the API is in, as long as it's reasonably sane and I can readily write glue code to interface it with my language of choice.

5. Although I don't currently do any MPI programming, I would very much like the option to do so in the future, and to integrate it smoothly with the cluster scheduler. I assume pretty much all cluster schedulers support that, though. (Erlang integration might also be nice.)

6. Each of my individual leaf-node jobs will typically take c. 3 to 30 minutes to complete, so my use shouldn't stress the scheduler's own performance too much. However, sometimes I screw that up and submit tons of jobs that each want to run for only a small amount of time, say 2 minutes or less, so it would be nice if the scheduler is sufficiently efficient and low-latency to keep up with that.

7. When I submit a job, I should be able to easily (and optionally) give the scheduler my estimates of how much RAM and CPU time the job will need. The scheduler should track what resources the job ACTUALLY uses, make it easy for me to monitor job status for both running and completed jobs, and let me use that information to improve my resource estimates for future jobs. (AKA good APIs, yet again.)

8. Of course the scheduler must have a good way to track all the basic information about my nodes: CPU sockets and cores, RAM, etc. Ideally it would also be straightforward for me to extend the database of node properties as I see fit. Bonus points if it uses a good database (e.g., SQLite, PostgreSQL) and a reasonable data model for that stuff.

Thanks in advance for your help and advice!

--
Andrew Piskorski  <a...@piskorski.com>
http://www.piskorski.com/
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
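To make the live-lock scenario in point 1 concrete, here's a toy Python simulation. It is entirely hypothetical — `simulate` and its parameters are made up for illustration and are not any real scheduler's API. Parents that hold their slots while waiting on children can wedge a fully loaded cluster, whereas a dependency-aware scheduler can let a blocked parent yield its slot so children get to run:

```python
from collections import deque

def simulate(num_slots, num_parents, children_per_parent, dependency_aware):
    """Toy cluster model.  Each parent job, once running, submits its
    children and then occupies a slot until every child has finished.
    Returns True if all jobs complete, False if the cluster live-locks."""
    queue = deque(range(num_parents))     # parent job ids, waiting to run
    pending_children = {}                 # parent id -> unfinished child ids
    running = []
    finished = set()
    next_id = num_parents
    total_jobs = num_parents * (1 + children_per_parent)

    while len(finished) < total_jobs:
        progressed = False
        # Fill free slots from the queue; a parent spawns its children
        # the first time it starts (dynamic, on-the-fly submission).
        while queue and len(running) < num_slots:
            job = queue.popleft()
            if job < num_parents and job not in pending_children:
                kids = range(next_id, next_id + children_per_parent)
                pending_children[job] = set(kids)
                queue.extend(kids)
                next_id += children_per_parent
            running.append(job)
            progressed = True
        # Any running job with no unfinished children can complete.
        still_blocked = []
        for job in running:
            if not pending_children.get(job):   # leaf, or children all done
                finished.add(job)
                for kids in pending_children.values():
                    kids.discard(job)
                progressed = True
            else:
                still_blocked.append(job)
        running = still_blocked
        # Dependency-aware scheduler: when every slot is held by a parent
        # blocked on its children, requeue one so a child can get a slot.
        if dependency_aware and queue and len(running) == num_slots:
            queue.append(running.pop(0))
            progressed = True
        if not progressed:
            return False    # slots full of waiting parents: live-lock
    return True
```

With 4 slots and 4 parents each spawning 2 children, the naive scheduler wedges (every slot held by a waiting parent) while the dependency-aware one drains the whole tree. With 16 slots even the naive version happens to finish, which is exactly why this failure mode is so easy to miss in small-scale testing.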
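For point 7, the feedback loop I have in mind is trivial once the scheduler actually exposes recorded usage — something like this hypothetical helper (name, padding factor, and floor are all invented for illustration):

```python
def next_memory_request(observed_peaks_mb, safety=1.25, floor_mb=256):
    """Suggest the next job's RAM request from the peak usage the
    scheduler recorded for similar completed jobs: the largest observed
    peak, padded by a safety factor, and never below a sane floor."""
    if not observed_peaks_mb:
        return floor_mb                  # no history yet: fall back to the floor
    return max(floor_mb, int(max(observed_peaks_mb) * safety))
```

E.g., past peaks of 900, 1100, and 1000 MB would suggest requesting 1375 MB next time. The point is not this arithmetic, which is trivial, but that the scheduler's API has to hand me the per-job usage history in the first place.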