On Tuesday, 21 November 2017 16:38:48 CET Ing. Gonzalo E. Arroyo wrote:
> I have a problem detecting RAM and Arch (maybe some more), check this...
>
> NodeName=fisesta-21-3 Arch=x86_64 CoresPerSocket=1
> CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.01
> AvailableFeatures=rack-21,2CPUs
> ActiveFeatures=rack-21,2CPUs
> Gres=gpu:1
> NodeAddr=10.1.21.3 NodeHostName=fisesta-21-3 Version=16.05
> OS=Linux RealMemory=3950 AllocMem=0 FreeMem=0 Sockets=2 Boards=1
> State=IDLE ThreadsPerCore=1 TmpDisk=259967 Weight=20479797 Owner=N/A
> MCS_label=N/A
> BootTime=2017-10-30T16:39:22 SlurmdStartTime=2017-11-06T16:46:54
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> NodeName=fisesta-21-3-cpus CoresPerSocket=1
> CPUAlloc=0 CPUErr=0 CPUTot=6 CPULoad=0.01
> AvailableFeatures=rack-21,6CPUs
> ActiveFeatures=rack-21,6CPUs
> Gres=(null)
> NodeAddr=10.1.21.3 NodeHostName=fisesta-21-3-cpus Version=(null)
> RealMemory=1 AllocMem=0 FreeMem=0 Sockets=6 Boards=1
> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=20483797 Owner=N/A
> MCS_label=N/A
> BootTime=None SlurmdStartTime=None
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I also saw wrong values for Sockets, CPUs and Threads, but I did not notice
the wrong values for RAM. Therefore I defined Sockets, CoresPerSocket,
ThreadsPerCore and RealMemory explicitly.

I had hoped that Slurm would somehow track the memory so that it is shared
between the two partitions. I would rather not set a fixed memory limit for
each of them, because depending on the user we need between 2 and 200 GB of
RAM per GPU...

> For your problem, please share the important lines of nodes and partitions,
> you should check your users have permission to run inside very partition /
> node splitted by this new configuration

I already included these lines in my first mail:

NodeName=gpu1 NodeAddr=10.1.2.3 RealMemory=229376 Weight=998002 Sockets=2 CoresPerSocket=3 ThreadsPerCore=2 Gres=gpu:TeslaK40c:6
NodeName=gpu1-cpu NodeAddr=10.1.2.3 RealMemory=229376 Weight=998002 Sockets=2 CoresPerSocket=11 ThreadsPerCore=2
PartitionName=gpu Nodes=gpu1
PartitionName=cpu Nodes=gpu1-cpu

I get the following error if I submit to node gpu1-cpu:

[2017-11-21T09:06:55.840] launch task 999708.0 request from 1044.1000@10.1.2.3 (port 45252)
[2017-11-21T09:06:55.840] error: Invalid job 999708.0 credential for user 1044: host gpu1 not in hostset gpu1-cpu
[2017-11-21T09:06:55.840] error: Invalid job credential from 1044@10.1.2.3: Invalid job credential

The GPU node has 2 sockets with 14 cores each and 2 threads per core,
256 GB RAM and 6 Tesla K40c GPUs.

I will investigate further the next time no jobs are running, because I am
unsure what I can change without killing jobs. I have already learned that
renaming a partition or removing a node from a partition kills jobs :-(

Any suggestions on what I should look for?

regards
Markus

-- 
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeb...@tugraz.at