On 15/04/14 14:02, Gavin W. Burris wrote:
> Yes! No doubt! The "simple" queues presuppose a massive
> distributed system to take advantage of. Bonus points if that
> system can interchangeably be an in-house cluster or major cloud
> provider.
>
> I would be very interested to hear what your preferred tools and
> APIs are for your analysis system. I can easily default to the job
> script and qsub workflow, but restful cloud APIs and simple queues
> seem to be next-level for some workflows.
That's what we've got - we run some stuff in-house for security reasons and some on Amazon EC2, though Rackspace Cloud or even MS Azure could be used (they're just not as cheap). Managing image generation and machine provisioning is half the battle, but there are lots of open source tools that help there (Packer, Puppet, etc).

The orchestration code is all written in Ruby (the algorithms are usually C/Python); the coordination layer is actually Amazon's Simple Workflow Service, which is something of a cop-out, but we're not keen to reinvent the wheel. It provides statekeeping and job queues in one package; replacing it wouldn't be trivial, but it wouldn't be a massive task either. The cost of using it is tiny, though, and it made our life a lot easier.

It's all written in terms of deciders, which make decisions based on the list of events associated with a workflow execution (eg a "finished activity" event will carry the details of that activity being scheduled, started and completed, its output status, etc), and workers, which perform the activities. State is maintained by passing JSON blobs around as messages - rough sketches of both loops are below. There'll be a blog post or two explaining things on our website soonish and I'll post them across if there's interest.

It's being used in production on a regular basis and has had quite a lot of content processed through it so far; these tasks run for 2-6 hours on average and involve ~1GB of data going in and a few megabytes coming out. The APIs are all simple RESTful HTTPS ones, and storage can be cloud provider storage or local shared drive storage.

Not very 'traditional HPC', but it does the job. There's an interesting intersection between HPC and these sorts of more abstract run-anywhere systems, where per-job performance and interprocess communication performance matter less, and robustness and dynamic scalability play a major role.
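For the curious, here's roughly what the decider side looks like. This is a minimal sketch against the aws-sdk gem's SWF client rather than our production code, and the 'analysis' domain, 'default' task list and 'process-content' activity are invented names for illustration:

    # Minimal decider loop - poll for a decision task, look at the
    # workflow execution's event history, respond with decisions.
    require 'aws-sdk-simpleworkflow'

    swf = Aws::SWF::Client.new(region: 'us-east-1')

    loop do
      # Long-poll; returns an empty task if nothing needed deciding.
      task = swf.poll_for_decision_task(
        domain:    'analysis',
        task_list: { name: 'default' },
        identity:  "decider-#{Process.pid}"
      )
      next if task.task_token.to_s.empty?

      # Find the most recent non-decision event in the history.
      last = task.events.reverse.find { |e| !e.event_type.start_with?('Decision') }

      decisions =
        case last.event_type
        when 'WorkflowExecutionStarted'
          # New execution: schedule the first activity, passing the
          # workflow's input JSON straight through as the activity input.
          [{ decision_type: 'ScheduleActivityTask',
             schedule_activity_task_decision_attributes: {
               activity_type: { name: 'process-content', version: '1' },
               activity_id:   'process-1',
               input:         last.workflow_execution_started_event_attributes.input } }]
        when 'ActivityTaskCompleted'
          # Work done: close the execution, carrying the result forward.
          [{ decision_type: 'CompleteWorkflowExecution',
             complete_workflow_execution_decision_attributes: {
               result: last.activity_task_completed_event_attributes.result } }]
        else
          []  # nothing to decide for this event
        end

      swf.respond_decision_task_completed(task_token: task.task_token,
                                          decisions:  decisions)
    end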
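And the matching worker side, which shows the JSON-blob state passing; run_analysis here is a made-up placeholder for the real pipeline, as is the output path:

    # Matching worker loop - poll for an activity task, parse the JSON
    # state blob, do the work, hand back an updated JSON blob.
    require 'aws-sdk-simpleworkflow'
    require 'json'

    # Placeholder for the actual work (usually shelling out to C/Python).
    def run_analysis(state)
      # ... fetch state['input'], crunch ~1GB of data, upload result ...
      "s3://results/#{state['job_id']}.out"  # hypothetical output location
    end

    swf = Aws::SWF::Client.new(region: 'us-east-1')

    loop do
      task = swf.poll_for_activity_task(
        domain:    'analysis',
        task_list: { name: 'default' },
        identity:  "worker-#{Process.pid}"
      )
      next if task.task_token.to_s.empty?

      # The activity input is one of the JSON state blobs mentioned above.
      state = JSON.parse(task.input)
      begin
        output = run_analysis(state)
        swf.respond_activity_task_completed(
          task_token: task.task_token,
          result:     JSON.generate(state.merge('output' => output)))
      rescue => e
        # Failures go back into the event history for the decider to see.
        swf.respond_activity_task_failed(task_token: task.task_token,
                                         reason:     e.class.name,
                                         details:    e.message)
      end
    end

The useful property is that neither loop holds any state of its own - everything lives in the workflow history and the JSON blobs - which is a big part of the robustness and dynamic scalability mentioned above.

--
Cheers,
James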