Dear all,

I've installed an OpenHPC 2.3 cluster on CentOS 8.4, running Slurm 20.11.7 (mostly following this guide: https://github.com/openhpc/ohpc/releases/download/v2.3.GA/Install_guide-CentOS8-Warewulf-SLURM-2.3-aarch64.pdf).

I have a master node and hybrid nodes that are GPU/CPU execution hosts and, at the same time, login nodes running X11 (workstations used by the users). I left users the possibility to ssh to other compute nodes even when they are not running jobs there (I created an ssh-allowed group following page 51 of https://software.intel.com/content/dam/www/public/us/en/documents/guides/installguide-openhpc2-centos82-6feb21.pdf, and did not run the command 'echo "account required pam_slurm.so" >> $CHROOT/etc/pam.d/sshd'). We are only a few people using the cluster, so it's not a big deal.
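(For completeness: that state can be double-checked on a compute node - assuming the group is literally named "ssh-allowed" and pam_slurm really was never added - with standard commands like

   getent group ssh-allowed
   grep pam_slurm /etc/pam.d/sshd    # expected to print nothing here

nothing custom.)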

The GPUs are in Persistence-Mode OFF and "Default" Compute-Mode. SELinux is disabled. No firewall is running.
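(Those settings can be verified directly on a node with standard commands, roughly:

   nvidia-smi --query-gpu=name,persistence_mode,compute_mode --format=csv
   getenforce                       # Disabled
   systemctl is-active firewalld    # inactive
)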

I'm having a strange problem of "connection closed by remote host":

1) when a job run by user1 under Slurm finishes (or dies, or is canceled) on the local node (say hybrid-0-1, where user1 is logged in and working in X11), the user is logged out and the GDM login window appears.

2) when a job run by user1 under Slurm (user1 is logged in and working in X11 on hybrid-0-2) runs on a remote host (e.g. hybrid-0-1) and finishes (or dies, or is canceled), the user is logged out of hybrid-0-1. I can check this by ssh-ing from hybrid-0-2 to hybrid-0-1 and seeing that the terminal is disconnected at the end of the job. It happens using both srun and sbatch; a minimal reproduction is sketched below.
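(A minimal reproduction sketch, with hypothetical values - any short job behaves the same:

   # from hybrid-0-2, while an X11/ssh session is open on hybrid-0-1
   srun -p allcpu -w hybrid-0-1 -n 1 sleep 30
   # once the job ends, "loginctl list-sessions" run on hybrid-0-1 no longer shows the session
)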

I think the problem is related to the Slurm configuration rather than the GPU configuration, because both CPU and GPU jobs lead to the logout.
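(Since the nodes run configless, the process-tracking and cleanup settings actually in effect at runtime can be read back with

   scontrol show config | grep -E -i 'proctracktype|taskplugin|epilog'

in case that matters.)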

Here are the sbatch test script, the slurm.conf and the gres.conf:


############## sbatch.test #####

#!/bin/bash
#SBATCH --job-name=test   # Job name
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --partition=allcpu
#SBATCH --nodelist=hybrid-0-1
#SBATCH --output=serial_test_%j.log   # Standard output and error log
# Usage of this script:
#sbatch job-test.sbatch

# The job name below is set automatically when using "qsub job-orca.sh -N jobname" (PBS); here it is taken from SLURM_JOB_NAME. It can alternatively be set manually and should be the name of the input file without extension (.inp or whatever).
export job=$SLURM_JOB_NAME
JOB_NAME="$SLURM_JOB_NAME"
JOB_ID="$SLURM_JOB_ID"

# Communication protocol (remote shell command) used by the application

export RSH_COMMAND="/usr/bin/ssh -x"

#######SERIAL COMMANDS HERE

echo "HELLO WORLD"
sleep 10
echo "done"
#########################################

########## slurm.conf ##################

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=orthrus
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
TaskPlugin=task/affinity
PropagateResourceLimitsExcept=MEMLOCK
JobCompType=jobcomp/filetxt
Epilog=/etc/slurm/slurm.epilog.clean
GresTypes=gpu
NodeName=hybrid-0-1 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
NodeName=hybrid-0-2 Sockets=1 Gres=gpu:titanxp:1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
NodeName=hybrid-0-3 Sockets=1 Gres=gpu:titanxp:1,gpu:gtx1080:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-4 Sockets=1 Gres=gpu:gtx980:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-5 Sockets=1 Gres=gpu:gtx980:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-7 Sockets=1 Gres=gpu:titanxp:1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
PartitionName=gpu Nodes=hybrid-0-[2-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=allcpu Nodes=hybrid-0-[1-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=fastcpu Nodes=hybrid-0-[3-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=fastqm Nodes=hybrid-0-5 Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
SlurmctldParameters=enable_configless
ReturnToService=1

#################################################

########### gres.conf ####################

NodeName=hybrid-0-[2,3,7] Name=gpu Type=titanxp File=/dev/nvidia0 COREs=0
NodeName=hybrid-0-3 Name=gpu Type=gtx1080 File=/dev/nvidia1 COREs=1
NodeName=hybrid-0-[4-5] Name=gpu Type=gtx980 File=/dev/nvidia0 COREs=0


###############
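(And, in case it helps, a sketch of how the node/GPU registration can be cross-checked - standard commands only:

   sinfo -N -l
   scontrol show node hybrid-0-3 | grep -i gres
   srun -p gpu -w hybrid-0-3 --gres=gpu:titanxp:1 nvidia-smi -L
)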


Thanks and sorry for the looong message

Andrea



--




¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Andrea Carotti
Dipartimento di Scienze Farmaceutiche
Università di Perugia
Via del Liceo, 1
06123 Perugia, Italy
phone: +39 075 585 5121
fax: +39 075 585 5161
mail: andrea.caro...@unipg.it

