Greetings,

I would like to install SLURM on Clear Linux because of its good benchmarks.  I followed the tutorial at https://docs.01.org/clearlinux/latest/tutorials/hpc.html. When I reached the section "Create slurm.conf configuration file", I noticed that the slurmctld service did not start. The error was related to the slurm.conf file. This was in the log:

jul 11 19:20:00 slurm-controller slurmctld[615]: error: Ignoring obsolete FastSchedule=1 option. Please remove from your configuration.
jul 11 19:20:00 slurm-controller slurmctld[615]: fatal: SallocDefaultCommand has been removed. Please consider setting LaunchParameters=use_interactive_step instead.

I deleted FastSchedule and SallocDefaultCommand. After that I added these lines:

LaunchParameters=use_interactive_step
InteractiveStepOptions="srun -n1 -N1 --pty --preserve-env --mpi=pmix_v3 $SHELL"
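
A minimal sketch of how a slurm.conf could be scanned for the two options the log above complained about. The function name is hypothetical, and the option set contains only what that log named; it is not a complete list of obsolete Slurm options.

```python
# Options that the slurmctld log above reported as obsolete or removed.
# Assumption: this set is only what the log named, not an exhaustive list.
OBSOLETE_OPTIONS = {"fastschedule", "sallocdefaultcommand"}


def find_obsolete_options(conf_text):
    """Return (line_number, option_name) for each active obsolete option."""
    hits = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        # Drop everything after '#', so commented-out options are ignored.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        # Compare case-insensitively so differently cased spellings also match.
        if key.lower() in OBSOLETE_OPTIONS:
            hits.append((lineno, key))
    return hits
```

Calling find_obsolete_options() on the contents of the configuration file (probably /etc/slurm/slurm.conf here, judging by the acct_gather.conf path in the log) should return an empty list once both options have been deleted.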

After correcting that, I still could not continue because of an undefined symbol in a shared object.

This is the log:

[2021-07-11T19:35:14.260] slurmctld version 20.11.8 started on cluster linux
[2021-07-11T19:35:14.261] cred/munge: init: Munge credential signature plugin loaded
[2021-07-11T19:35:14.262] debug: auth/munge: init: Munge authentication plugin loaded
[2021-07-11T19:35:14.262] select/cons_res: common_init: select/cons_res loaded
[2021-07-11T19:35:14.263] select/linear: init: Linear node selection plugin loaded with argument 1
[2021-07-11T19:35:14.263] select/cons_tres: common_init: select/cons_tres loaded
[2021-07-11T19:35:14.263] preempt/none: init: preempt/none loaded
[2021-07-11T19:35:14.264] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2021-07-11T19:35:14.264] debug: acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2021-07-11T19:35:14.264] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2021-07-11T19:35:14.264] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2021-07-11T19:35:14.265] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2021-07-11T19:35:14.265] debug: jobacct_gather/none: init: Job accounting gather NOT_INVOKED plugin loaded
[2021-07-11T19:35:14.265] error: plugin_load_from_file: dlopen(/usr/lib64/slurm/prep_script.so): /usr/lib64/slurm/prep_script.so: undefined symbol: run_script
[2021-07-11T19:35:14.265] error: Couldn't load specified plugin name for prep/script: Dlopen of plugin file failed
[2021-07-11T19:35:14.266] error: prep_plugin_init: cannot create prep context for prep/script
[2021-07-11T19:35:14.266] fatal: failed to initialize prep plugin

Since the slurm.conf file shipped with the Clear Linux bundle (package) is outdated, I thought that using a better configuration file might make the error disappear.  My hypothesis was that I needed to load another plugin that provides the run_script symbol. I then created a new configuration file using https://slurm.schedmd.com/configurator.easy.html, but I got the same error.
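
One way such a hypothesis could be examined is to list which symbols the plugin leaves undefined, since those have to be supplied by whatever loads it. The following is only a sketch: it assumes binutils' nm is installed, the function names are hypothetical, and it has not been run against this system.

```python
import subprocess


def undefined_symbols(nm_output):
    """Return the set of symbols marked 'U' (undefined) in `nm -D` output."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined entries have no address column, only "U <name>".
        if len(parts) == 2 and parts[0] == "U":
            # Strip a symbol-version suffix such as "@GLIBC_2.2.5" if present.
            symbols.add(parts[1].split("@", 1)[0])
    return symbols


def plugin_undefined_symbols(path):
    """Run `nm -D` on a shared object and collect its undefined symbols."""
    result = subprocess.run(
        ["nm", "-D", path], capture_output=True, text=True, check=True
    )
    return undefined_symbols(result.stdout)
```

If "run_script" appears in plugin_undefined_symbols("/usr/lib64/slurm/prep_script.so"), the plugin expects that symbol from its loader. Whether the installed slurmctld binary or one of its libraries exports it would then need a separate check.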

Do you think this is a bug in SLURM, something missing in the configuration, or an error in how the bundle (package) I installed was compiled?  I have noticed similar issues with precompiled packages in other Linux distributions, although they involve other shared objects and other symbols.

If the problem is Clear Linux, what is the best Linux distribution for SLURM?

I am attaching my latest test configuration file.

I would appreciate any help you may give me.  Thank you very much in advance.

Best regards,

Braulio J. Solano-Rojas

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=slurm-controller
#
#MailProg=/bin/mail
MpiDefault=pmix_v3
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/run/slurm/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/run/slurm/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=citic-cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
NodeName=slurm-worker CPUs=2 Boards=1 SocketsPerBoard=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1968

PartitionName=workers Nodes=slurm-worker Default=YES MaxTime=INFINITE State=UP
PartitionName=debug Nodes=slurm-worker MaxTime=INFINITE State=UP
