Hi! I'm trying to prepare for and test some incoming jobs that will use multiple
processes (I have no control over this: several executables are started in
parallel within the job and communicate with each other through a customized
zmq layer).
The submission method is that an sbatch script is submitted on my site
(generated by a locally running Compute Element (CE) service), and it contains
an srun line that runs a given script.
For testing, I'm using the same sbatch script format, but with the following
flags added to the srun line:

--ntasks=1 --cpus-per-task=8

The result so far is that I get no errors, but the job also does not actually
run (the payload is just some echo statements), and it leaves the queue too
quickly for me to see it there.
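For reference, the stripped-down test script looks roughly like this (the
CE-generated boilerplate is omitted, and the output paths are placeholders,
not the real ones):

#!/bin/bash
#SBATCH --partition=alien        # the partition shown in the sacct output below
#SBATCH --output=slurm-%j.out    # placeholder; the real script uses CE-set paths
#SBATCH --error=slurm-%j.err

# the payload is just an echo for the test; the flags under test go on srun
srun --ntasks=1 --cpus-per-task=8 echo "test payload, job $SLURM_JOB_ID"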
This is on a cluster configured with:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_LLN
with
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=autobind=threads
So, for the scenario presented above: what site-side settings should I be
aware of or take care of, and what settings should be in the sbatch/srun
components that I can ask the experiment to adhere to?
Should I ask for the resource request to be made in the sbatch file (or on the
sbatch command line) instead? (see the sketch below)
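For example, something like this sketch of the alternative (not my current
script):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# srun then runs inside the job allocation; note that on some Slurm releases
# srun does not inherit --cpus-per-task from the job, so it may still need
# to be repeated on the srun line itself
srun echo "test payload, job $SLURM_JOB_ID"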
In a test where the job stayed in the queue waiting for execution, the job
info (from scontrol show job) showed:
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=3950M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Then, after execution, I have no output at all, not even the stdout and stderr
files. sacct shows just:
aliprod@alien: job_test $ sacct -j 8322339
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
8322339 TEST_JOB_+ alien aliprod 1 FAILED 1:0
8322339.bat+ batch aliprod 1 FAILED 1:0
8322339.ext+ extern aliprod 1 COMPLETED 0:0
The slurmctld log shows no relevant info.
Any idea how I can debug this further?
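Would something along these lines show more, or is there a better place to
look? (the slurmd log path below is just a typical default, not verified on
our installation)

# more detail from accounting, including the derived exit code and node list
sacct -j 8322339 --format=JobID,JobName,State,ExitCode,DerivedExitCode,Start,End,NodeList

# on the node that ran the job; the path may differ per site
grep 8322339 /var/log/slurm/slurmd.log

# temporarily raise controller verbosity while resubmitting the test
scontrol setdebug debug2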
Thanks a lot for any info!
Adrian