Date: Tue, 7 Nov 2017 11:19:32 +0100
From: Benjamin Redling <benjamin.ra...@uni-jena.de>
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Having errors trying to run a packed jobs script
Message-ID: <6979a04b-c9c0-badd-b57b-34d4d0ec8...@uni-jena.de>
Content-Type: text/plain; charset=UTF-8

Hi Benjamin,

Thank you for the answer.

> Bigger than a small cluster a decade ago... ;) Nice workhorse I guess.

It is sufficient for the moment :)

[...]

>> The moment I schedule my script I can see that there are 50 instances
>> of my process started and running but just a bit afterwards only 5 or
>> so of them
>>
>> I can see running - so I only get full load for the first 50 instances
>> and not afterwards.

> "a bit afterwards" is too vague to reason anything aside sched_interval just
> being the default 60s:

I know it's not the best choice of words. Before scheduling my script I start "top" on the compute node, so I can see that the first batch of job steps is scheduled simultaneously. After that, I only have 4 - 6 processes running, which results in very poor utilization of the CPU resources.

I get the following output in the logs:

[2017-11-06T11:40:45.625] backfill: Started JobId=1489 in main_compute on cn_burebista
[2017-11-06T11:40:45.697] _pick_step_nodes: Configuration for job 1489 is complete
[2017-11-06T11:44:48.289] slurmctld: agent retry_list size is 101
[2017-11-06T11:44:48.289] retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:51:12.132] slurmctld: agent retry_list size is 101
[2017-11-06T11:51:12.132] retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:52:12.835] job_step_signal step 1489.56 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.59 not found
...

> What's the (average) runtime of the jobs?
> If your jobs are not running longer than the sched_interval default you might
> want to *decrease* that.

The average runtime of a job is 4 minutes. I am preprocessing small "video" files.

I also tried a smaller batch (a smaller number of job steps) by reducing to "--ntasks=25". It seems to improve the total time it takes to process all the files a bit, but not drastically.

Best Regards
Marius