CentOS 8 is probably not a good idea, as support terminates at the end of this year.
Khushwant Sidhu | Systems Admin, Principal Consultant | Technology Solutions | NTT DATA UK
4020 Lakeside, Birmingham Business Park, Solihull, B37 7YN, United Kingdom
M: +44 (0) 7767111776 | Learn more at nttdata.com/uk

-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Andrea Carotti
Sent: 09 July 2021 11:51
To: slurm-us...@schedmd.com
Subject: [slurm-users] Users Logout when job die or complete

Dear all,

I've installed an OpenHPC 2.3 cluster, CentOS 8.4, running Slurm 20.11.7 (mostly following this guide: https://github.com/openhpc/ohpc/releases/download/v2.3.GA/Install_guide-CentOS8-Warewulf-SLURM-2.3-aarch64.pdf). I have a master node and hybrid nodes that are GPU/CPU execution hosts and login nodes running X11 (workstations used by the users). I left users the possibility to ssh to other compute nodes even when they are not running jobs there (I created an ssh-allowed group following page 51 of https://software.intel.com/content/dam/www/public/us/en/documents/guides/installguide-openhpc2-centos82-6feb21.pdf, and did not run the command: echo "account required pam_slurm.so" >> $CHROOT/etc/pam.d/sshd). We are only a few people using the cluster, so it's not a big deal. The GPUs are in Persistence-Mode OFF and "Default" compute mode. SELinux is disabled. No firewall.

I'm having a strange problem of "connection closed by remote host":

1) When a job run by user1 under Slurm finishes (or dies, or is cancelled) locally, i.e. on the node where user1 is logged in and working in X11 (say hybrid-0-1), the user is logged out and the GDM login window appears.

2) When user1 is logged in and working in X11 on hybrid-0-2 and a job run by user1 finishes (or dies, or is cancelled) on a remote host (e.g. hybrid-0-1), the user is logged out of hybrid-0-1. I can check this by connecting from hybrid-0-2 via ssh to hybrid-0-1 and seeing that the terminal is disconnected at the end of the job.

It happens using both srun and sbatch. I think the problem is related to the Slurm configuration, and not the GPU configuration, because both CPU and GPU jobs lead to the logout problem. Here are the sbatch test, the slurm.conf and the gres.conf:

############## sbatch.test #####
#!/bin/bash
#SBATCH --job-name=test                # Job name
#SBATCH --ntasks=1                     # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --partition=allcpu
#SBATCH --nodelist=hybrid-0-1
#SBATCH --output=serial_test_%j.log    # Standard output and error log

# Usage of this script: sbatch job-test.sbatch
# Jobname below is set automatically when using "qsub job-orca.sh -N jobname".
# Can alternatively be set manually here. Should be the name of the input file
# without extension (.inp or whatever).
export job=$SLURM_JOB_NAME
JOB_NAME="$SLURM_JOB_NAME"
JOB_ID="$SLURM_JOB_ID"

# Here giving the communication protocol
export RSH_COMMAND="/usr/bin/ssh -x"

####### SERIAL COMMANDS HERE
echo "HELLO WORLD"
sleep 10
echo "done"
#########################################
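For reference, a minimal way to reproduce this with the test script above (assuming it is saved as job-test.sbatch, as the usage comment suggests) would be something like:

sbatch job-test.sbatch         # prints "Submitted batch job <jobid>"
squeue -u $USER                # the job runs for about 10 seconds on hybrid-0-1
cat serial_test_<jobid>.log    # should contain "HELLO WORLD" and "done"; <jobid> is the id printed by sbatch
# if the problem occurs, any X11/ssh session of the submitting user on
# hybrid-0-1 is disconnected as soon as the job completes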
########## slurm.conf ##################
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
# slurm.conf file generated by configurator.html.
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=orthrus
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
TaskPlugin=task/affinity
PropagateResourceLimitsExcept=MEMLOCK
JobCompType=jobcomp/filetxt
Epilog=/etc/slurm/slurm.epilog.clean
GresTypes=gpu
NodeName=hybrid-0-1 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
NodeName=hybrid-0-2 Sockets=1 Gres=gpu:titanxp:1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
NodeName=hybrid-0-3 Sockets=1 Gres=gpu:titanxp:1,gpu:gtx1080:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-4 Sockets=1 Gres=gpu:gtx980:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-5 Sockets=1 Gres=gpu:gtx980:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-7 Sockets=1 Gres=gpu:titanxp:1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
PartitionName=gpu Nodes=hybrid-0-[2-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=allcpu Nodes=hybrid-0-[1-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=fastcpu Nodes=hybrid-0-[3-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=fastqm Nodes=hybrid-0-5 Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
SlurmctldParameters=enable_configless
ReturnToService=1
#################################################

########### gres.conf ####################
NodeName=hybrid-0-[2,3,7] Name=gpu Type=titanxp File=/dev/nvidia0 COREs=0
NodeName=hybrid-0-3 Name=gpu Type=gtx1080 File=/dev/nvidia1 COREs=1
NodeName=hybrid-0-[4-5] Name=gpu Type=gtx980 File=/dev/nvidia0 COREs=0
###############
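A quick sanity check that the GRES definitions above are picked up (using the node names from slurm.conf) could be something like:

scontrol show node hybrid-0-2 | grep -i gres    # should show something like Gres=gpu:titanxp:1
sinfo -o "%N %G"                                # generic resources per node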
Thanks and sorry for the looong message,
Andrea

--
Andrea Carotti
Dipartimento di Scienze Farmaceutiche
Università di Perugia
Via del Liceo, 1
06123 Perugia, Italy
phone: +39 075 585 5121
fax: +39 075 585 5161
mail: andrea.caro...@unipg.it