Sean,

For what it's worth, Grid Engine (SGE) has a utility binary called "qevent" that is not part of the official binary distribution but can be built from the source distribution (http://gridengine.sunsource.net). Do a google search for "sge + qevent" and you'll at least hit a few SGE mailing list messages that cover what it does.

You might also want to check out the DRMAA stuff (http://drmaa.org/wiki/) -- it is supposed to be a DRM-neutral way of submitting jobs to a queuing system. I'm not very familiar with DRMAA so I can't tell you offhand if the current spec includes notification of completed events or not.

Another option that would work with SGE would be the use of queue level epilog scripts that execute each time a job leaves the system for whatever reason. You can put a heck of a lot of logic and programmable activities/notifications into a custom epilog script.
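
To make the epilog idea a bit more concrete, here is a minimal sketch of what such a script might look like in Python. It assumes the epilog sees the job's JOB_ID and JOB_NAME environment variables, and it posts a small JSON message to a callback URL read from NOTIFY_URL. That variable name, the payload layout, and the receiving web service are all hypothetical; you would have to arrange for the URL to be set yourself and adapt the rest to your service.

```python
#!/usr/bin/env python3
"""Sketch of an SGE epilog that reports a finished job to a web service."""
import json
import os
import socket
import urllib.request


def build_payload(env):
    """Collect basic job details from the environment the epilog runs in."""
    return {
        "job_id": env.get("JOB_ID", "unknown"),
        "job_name": env.get("JOB_NAME", "unknown"),
        "host": socket.gethostname(),
    }


def notify(url, payload, timeout=10):
    """POST the payload as JSON to the (hypothetical) callback URL."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def main():
    # NOTIFY_URL is an assumed, site-specific variable; do nothing if unset.
    url = os.environ.get("NOTIFY_URL")
    if not url:
        return
    try:
        notify(url, build_payload(os.environ))
    except OSError:
        # Swallow network errors so the epilog itself still exits cleanly.
        pass


if __name__ == "__main__":
    main()
```

The network error is swallowed on purpose so that a failed notification is not reported back to SGE as a failed epilog. If you need guaranteed delivery, the script could write the message to a spool directory instead and let something else retry it.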

Yet another option is the use of job dependency syntax within grid engine. For each of your web service initiated tasks you would submit two jobs -- the first job is your "worker" job. The second job is your "notifier" job and it is submitted to SGE with a flag that says "this job is dependent on the worker job". Once your notifier job is fired up it can do whatever sort of results checking and notification would be required.
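
As a rough sketch of the worker/notifier pairing, the Python below builds the two qsub command lines. It relies on qsub's -N flag to name the worker and -hold_jid to make the notifier wait on that name. The script names, the "worker_"/"notify_" naming scheme, and the helper functions are made up for illustration.

```python
import subprocess
import uuid


def build_job_pair(worker_script, notifier_script, tag=None):
    """Return qsub argument lists for a worker job and its dependent notifier."""
    # A short unique tag keeps the job names of concurrent requests apart.
    tag = tag or uuid.uuid4().hex[:8]
    worker_name = f"worker_{tag}"
    notifier_name = f"notify_{tag}"
    worker_cmd = ["qsub", "-N", worker_name, worker_script]
    # -hold_jid keeps the notifier on hold until the named worker job is done.
    notifier_cmd = [
        "qsub", "-N", notifier_name, "-hold_jid", worker_name, notifier_script,
    ]
    return worker_cmd, notifier_cmd


def submit_pair(worker_script, notifier_script):
    """Submit both jobs; needs qsub on the PATH of the submitting host."""
    worker_cmd, notifier_cmd = build_job_pair(worker_script, notifier_script)
    for cmd in (worker_cmd, notifier_cmd):
        subprocess.run(cmd, check=True)
```

Calling submit_pair("md_run.sh", "notify.sh") from the host your web service already uses for ssh submission would queue both jobs. The notifier script still has to inspect the worker's output itself to decide whether the run succeeded or failed.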

Regards,
Chris



On Oct 16, 2007, at 10:08 AM, Sean Ward wrote:

I've started work on a web service which contains several potentially long running processing steps (molecular dynamics), which are perfect to farm out to the fairly large (90 node) Beowulf I have access to. The primary issue is translating requests from the event driven web service, to job queues, and back again upon completion. Specifically, the major queuing systems I have immediate access to (Sun Grid Engine and Condor) only support e-mail based notification of job completion. Starting jobs isn't an issue, as my service can simply ssh over and execute shell scripts as needed to start things up; the problem is reliably being informed when the jobs fail or complete, via any programmatic method (such as executing a shell script, calling a web service via SOAP/etc, or an asynchronous message library). My other problem, ensuring that these web service requests don't starve in house jobs on the Beowulf, is easily handled via the priority levels built into all the various job managers, although being able to checkpoint a long running job would be a plus (such as is supported by Condor).

I am currently investigating modifications to either Condor (more complex to update, but checkpoint is useful) or Ruby Queue (very easy to update for reliable notification) to solve this issue, but wanted to be sure I wasn't overlooking any existing solutions to programmatic based queuing and receiving notifications on jobs in a Beowulf environment...
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
