I would recommend putting a cleanup process in your epilog script. We
have a check here that sees whether the job has completed; if so, it
terminates all of the user's processes with kill -9 to clean up any
residuals. If that fails, it closes off the node so we can reboot it.
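Roughly along these lines (a sketch, not our exact script; it assumes
nodes are allocated to a single user at a time, and uses the
SLURM_JOB_USER variable that Slurm sets in the epilog environment):

    #!/bin/bash
    # Epilog sketch: kill anything the job's user left behind, and
    # drain the node for a reboot if some processes will not die.
    if [ -n "$SLURM_JOB_USER" ] && [ "$SLURM_JOB_USER" != "root" ]; then
        pkill -9 -u "$SLURM_JOB_USER"
        sleep 2
        if pgrep -u "$SLURM_JOB_USER" > /dev/null; then
            scontrol update NodeName="$(hostname -s)" State=DRAIN \
                Reason="epilog: unkillable user processes"
        fi
    fi
    exit 0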
-Paul Edmon-
On 04/23/2018 08:10 AM, John Hearns wrote:
Nicolo, I cannot say what your problem is.
However, in the past with problems like this I would:
a) Look at ps -eaf --forest and try to see what the parent processes
of these job processes are. Clearly if the parent PID is 1 then
--forest is not much help, but the --forest option is my 'goto' option.
b) Look closely at the Slurm logs. Do not fool yourself - force
yourself to read the logs line by line, around the timestamp when the
job ends.
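Something like this, assuming the common default log locations (adjust
the paths, and 12345 is a placeholder job ID):

    # slurmd runs on the compute node, slurmctld on the controller
    grep 12345 /var/log/slurm/slurmd.log
    grep 12345 /var/log/slurm/slurmctld.log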
Being a bit more helpful: in my last job we had endless problems with
Matlab jobs leaving orphaned processes.
To be fair to Matlab, they have a utility which 'properly' starts
parallel jobs under the control of the batch system (OK, it was PBSpro).
But users can easily start a job and 'fire off' processes in Matlab
which are not under the direct control of the batch daemon, leaving
orphaned processes when the job ends.
Actually, if you think about it, this is how a batch system works.
The batch system daemon starts running processes on your behalf.
When the job is killed, all the daughter processes of that daemon
should die.
It is instructive to run ps -eaf --forest sometimes on a compute node
during a normal job run. Get to know how things are being created, and
what their parents are.
(Two dashes in front of the forest argument.)
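For example, a quick (hypothetical) way to spot candidates for orphans
- processes that have been reparented to init - for a given user:

    # Show PID, parent PID and command for a user's processes, keeping
    # only those whose parent is PID 1 (the username is a placeholder)
    ps -o pid,ppid,user,cmd -u someuser | awk '$2 == 1'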
Now think of users who start a batch job and get a list of compute hosts.
They MAY use a mechanism such as ssh, or indeed pbsdsh, to start running
job processes on those nodes.
You will then have trouble with orphaned processes when the job ends.
Techniques for dealing with this:
a) Use the PAM module which stops ssh logins (actually, this probably
allows ssh login during the job's lifetime, when the user has the node
allocated).
b) My favourite - CPU sets - though actually this won't stop ssh logins
either.
c) Shouting, much shouting. Screaming.
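For (a), in the Slurm world the module is pam_slurm_adopt: it denies
ssh logins to users with no running job on the node, and 'adopts'
allowed sessions into the job's allocation so they are cleaned up when
the job ends. A sketch, assuming the module is installed (the exact
PAM stacking varies by distribution):

    # /etc/pam.d/sshd (excerpt)
    # Deny login unless the user has a job on this node; adopt the
    # session into that job's cgroup so job cleanup catches it.
    account    required     pam_slurm_adopt.so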
Regarding users behaving like this: I have seen several such cases,
for understandable reasons.
On a system which I did not manage, but was asked for advice on, the
vendor had provided a sample script for running Ansys.
The user wanted to run Abaqus on the compute nodes (or some such - a
different application anyway).
So he started an empty Ansys job, which sat doing nothing, then took
the list of hosts provided by the batch system
and fired up an interactive Abaqus session from his terminal.
I honestly hesitate to label this behaviour 'wrong'.
I have also seen similar when running a CFD job.
On 23 April 2018 at 11:50, Nicolò Parmiggiani
<nicolo.parmiggi...@gmail.com> wrote:
Hi,
I have a job that keeps running even though the internal process
is finished.
What could be the problem?
Thank you.