Re: [Beowulf] scheduler and perl

Chris Dagdigian Tue, 01 Aug 2006 18:33:32 -0700

As Joe mentioned, the way we handle this is by using cluster schedulers sitting on robust hardware platforms that are capable of handling large numbers of job submissions without problems. Grid Engine and Platform LSF are two capable products that come to mind and scale well.

The fact that your users are using "qsub" is a good thing that you certainly want to encourage. It puts their jobs under the control of a scheduler and allows you to do policy-based allocation of your computing resources.

The alternative is your users bypassing the scheduler altogether by SSH'ing to a node and just manually starting programs. Attempts to bypass the scheduler are common in some environments, so consider yourself lucky that your users are using the scheduler at all!

The problem of specific users or perl loops bringing the system down with a giant load of rapid qsub submissions is usually best handled on a per-user or per-use-case level.

It's more a matter of education and making sure your users have a resource who can help them with their job scripts and the general tasks of cluster application integration. Your users are (most likely) not intentionally trying to cause problems on the system, but it appears clear that they may need some assistance on how to better use the existing cluster.

Not giving users sufficient application integration and cluster scripting support resources is a problem I see all the time. Too many cluster operators think that training users on a few scheduler submission and status commands is all the integration help that they need to provide. The end result is someone writing a shell or perl script that tries to submit a few million short-running tasks all at once ...


Ways you can deal with the situation:

- Examine the user scripts and see if they can be altered to put "more work" into each individual qsub job submission. This will reduce the number of qsub commands required.

- Tell your users that the use of rapid loops for job submission is causing system problems. Work with them to introduce a small delay into their submissions. It is in everyone's best interest not to bring down the master scheduler.

- Look into a feature that some scheduling systems call "array jobs" or "job arrays" -- for schedulers that support this feature it is a very powerful way to use a single qsub/bsub command to launch hundreds of thousands of jobs. I know that an SGE design goal is to support the submission of a single job array with up to 500,000 individual sub-tasks. Both SGE and LSF do job arrays very well. This feature only works well if the workflow consists of similar commands that vary only slightly (in the input file or a command-line argument, for instance).
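
To make the three options above concrete, here is a rough Python sketch, not anyone's production tooling. It only builds the command lines and hands them to a caller-supplied submit function, so it can be tried without a scheduler installed. The script names run_batch.sh and run_task.sh are hypothetical placeholders, and the one-second default delay is an arbitrary assumption. The array form uses SGE's "qsub -t 1-N"; each task would then pick its input from the SGE_TASK_ID environment variable (LSF expresses the same idea with a bsub -J "name[1-N]" job name and LSB_JOBINDEX).

```python
import shlex
import time
from typing import Callable, List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` entries."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def batched_qsub_commands(inputs: Sequence[str], per_job: int,
                          script: str = "run_batch.sh") -> List[List[str]]:
    """Put "more work" into each submission: one qsub per batch of inputs.

    `script` is a hypothetical wrapper that loops over its arguments.
    """
    return [["qsub", script, *batch] for batch in chunk(inputs, per_job)]


def array_qsub_command(n_tasks: int, script: str = "run_task.sh") -> List[str]:
    """Build a single SGE array-job submission covering task IDs 1..n_tasks.

    `script` is a hypothetical job script that reads $SGE_TASK_ID.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be >= 1")
    return ["qsub", "-t", f"1-{n_tasks}", script]


def submit_throttled(commands: Sequence[List[str]],
                     submit: Callable[[List[str]], None],
                     delay_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> None:
    """Call submit(cmd) for each command, pausing delay_s between calls."""
    for i, cmd in enumerate(commands):
        if i:
            sleep(delay_s)  # small delay so the master is not flooded
        submit(cmd)


if __name__ == "__main__":
    inputs = [f"sample_{i}.dat" for i in range(10)]  # made-up file names
    # Dry run: print the command lines instead of executing them.
    submit_throttled(batched_qsub_commands(inputs, per_job=4),
                     submit=lambda cmd: print(shlex.join(cmd)),
                     delay_s=0.1)
    print(shlex.join(array_qsub_command(len(inputs))))
```

On a real cluster the submit callback could be something like lambda cmd: subprocess.run(cmd, check=True). Where the workflow fits, the array form is usually the better choice, since the scheduler then sees one submission instead of thousands.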


So in summary:

- Be happy users are issuing qsub commands at all!

- Treat the looping problem as a sign that your users may need some application integration assistance/education

- Work with the users that are causing problems, see if they can introduce a delay

- Look into "array job" functionality

Regarding the problem of people bypassing the scheduler and logging into nodes directly via SSH to run tasks -- I've posted on this exact topic on this list before; you may be able to find it in an archive somewhere. In short, my belief is that you'll never win the technological "arms race" with the users when you try to block users who are bypassing the scheduler.

Depending on your organizational environment, it is better to treat the problem of users bypassing the scheduler as a Management/HR/Policy problem rather than a technological problem. Set up a good scheduler with resource allocation policies that have been accepted by the users. Then make a policy that everyone who wants to use shared resources must operate under the scheduler. After that, make sure that people are informed that scheduler/cluster abuse is a policy matter that will be referred up the management chain and eventually to the human resources department. It's a matter of policy and acceptable use, not technology.



My $.02

-Chris




On Aug 1, 2006, at 5:36 PM, Xu, Jerry wrote:

Hi, Thanks, Joe.
I do not mean to "ban" anything immediately; I am just curious how often this happens in the HPC community.
Perl/shell is a really strong tool. One example is using a loop to submit a huge amount of jobs, which puts a burden on the scheduler server; another is having one job sit idle and frequently use system calls to detect the job status and resubmit jobs again and again; another is using system calls and ssh to reach each node and run stuff, bypassing the scheduler... It just drives me crazy sometimes.

How do you guys handle issues like this?


