Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-09 Thread Werner Saar
Hi, I tried scontrol reconfigure some years ago, but this didn't work in all cases. Best regards Werner On 05/09/2018 04:27 PM, Mahmood Naderan wrote: I think, the problem was: the python script /opt/rocks/lib/python2.7/site-packages/rocks/commands/sync/slurm/__init__py, which is called by

Re: [slurm-users] Built in X11 forwarding in 17.11 won't work on local displays

2018-05-09 Thread Nathan Harper
Yep, exactly the same issue. Our dirty workaround is to ssh -X back into the same host and it will work. > On 24 Apr 2018, at 00:03, Brendan Moloney wrote: > > Hi, > > We recently upgraded to 17.11, and I was trying to setup the new integrated > X11 forwarding instead of using the spank plug
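The workaround mentioned above can be sketched as follows (a non-runnable sketch assuming a Slurm 17.11 cluster with the built-in X11 support; the --x11 option and xterm are illustrative):

```shell
# From an X2Go (or otherwise "local") session on the login node, open a
# nested SSH session back into the same host with X11 forwarding enabled,
# so $DISPLAY points at a forwarded display the built-in code accepts:
ssh -X localhost
# then, inside that nested session, request X11 forwarding for the job:
srun --x11 xterm
```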

[slurm-users] Runaway jobs issue, slurm 17.11.3

2018-05-09 Thread Christopher Benjamin Coffey
Hi, we have an issue currently where we have a bunch of runaway jobs, but we cannot clear them: sacctmgr show runaway|wc -l sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable 58588 Has anyone run into t
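For reference, the usual (non-broken) workflow for runaway jobs is sketched below; these commands assume a reachable slurmdbd, which is exactly what appears to fail in the report above:

```shell
# List runaway jobs; sacctmgr then offers interactively to fix them:
sacctmgr show runawayjobs
# Count them non-interactively (-n suppresses the header), as in the post:
sacctmgr -n show runawayjobs | wc -l
```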

[slurm-users] Built in X11 forwarding in 17.11 won't work on local displays

2018-05-09 Thread Brendan Moloney
Hi, We recently upgraded to 17.11, and I was trying to setup the new integrated X11 forwarding instead of using the spank plugin. Initially I was testing with an SSH session into our login node and things seemed fine. Then I switched to using X2Go to connect to the login node and it broke. The

[slurm-users] Understanding gres binding

2018-05-09 Thread Wiegand, Paul
Greetings, I am setting up our new GPU cluster and trying to ensure that a user may issue a request such that all the cores assigned to them are on the same socket to which the GPU is bound; however, I guess I do not fully understand the settings because I seem to be getting cores from multiple
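GPU-to-socket binding is normally expressed in gres.conf via Cores=, and enforced at submission time with --gres-flags=enforce-binding; the fragment below is a hypothetical sketch (device paths and core ranges are illustrative, not taken from the original post):

```shell
# gres.conf (hypothetical two-socket node, 14 cores per socket):
#   Name=gpu File=/dev/nvidia0 Cores=0-13     # GPU 0 local to socket 0
#   Name=gpu File=/dev/nvidia1 Cores=14-27    # GPU 1 local to socket 1
# Ask the scheduler to hand out only cores from the GPU's own socket:
srun --gres=gpu:1 --gres-flags=enforce-binding nvidia-smi
```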

[slurm-users] sacct fields AllocCPUS and ReqMem are empty

2018-05-09 Thread marcelsommer...@gmail.com
Hi, I have Slurm 17.02.10 installed in a test environment. When I use sacct -o "JobID,JobName,AllocCPUs,ReqMem,Elapsed" and AccountingStorageType = accounting_storage/filetxt, the fields AllocCPUS and ReqMem are empty. JobID  JobName  AllocCPUS  ReqMem  Elapsed
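A plausible explanation (an assumption, not confirmed in the snippet) is that the filetxt storage plugin simply records fewer fields than the database back end; with accounting_storage/slurmdbd the same query would be expected to populate them:

```shell
# Same query as in the post; with slurmdbd-backed accounting the AllocCPUS
# and ReqMem columns are expected to be filled in:
sacct -o "JobID,JobName,AllocCPUS,ReqMem,Elapsed"
```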

[slurm-users] Generic Burst Buffers

2018-05-09 Thread Soysal, Mehmet (SCC)
Hi, I'm testing (or trying to test) generic burst buffers, but it is not really clear how to configure them. I configured slurm.conf to use the generic plugin: BurstBufferType=burst_buffer/generic #Add Debug flags DebugFlags=BurstBuffer On startup it loads the module: slurmctld: debug3: Try

[slurm-users] Reservation for a partition

2018-05-09 Thread Diego Zuccato
Hello all. Is it possible to reserve some nodes so that they are used only by jobs in a specific partition? Our mini-cluster will be used for some lessons, so it will need to run the jobs submitted by students "immediately". A partition spanning the required nodes is already defined, but it overlaps wit
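One way to do this is a reservation covering the partition's nodes; the sketch below assumes a partition named teaching and an account named students, both illustrative (it also requires operator privileges on a live cluster):

```shell
# Reserve every node of the teaching partition for the lesson window;
# PART_NODES tracks the partition's node list, IGNORE_JOBS starts the
# reservation even if other jobs still run on those nodes:
scontrol create reservation ReservationName=lessons \
    PartitionName=teaching Flags=PART_NODES,IGNORE_JOBS \
    StartTime=2018-05-10T09:00:00 Duration=04:00:00 Accounts=students
# Students then submit with: sbatch --reservation=lessons job.sh
```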

[slurm-users] How to get information about job steps

2018-05-09 Thread Roger Moye
Is there a way to retrieve job step information similar to "scontrol show job"? What I want to be able to do is see all job steps associated with a particular job, whether the step is pending, running, or finished. It seems that job step information is only available as long as the step is run
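Completed steps do persist in accounting, so sacct can list them per job after the fact; a command sketch (job ID 12345 is made up):

```shell
# Each step appears as its own row (12345.batch, 12345.0, 12345.1, ...):
sacct -j 12345 --format=JobID,JobName,State,Elapsed
```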

[slurm-users] Slurm version 17.11.6 available

2018-05-09 Thread Tim Wickberg
We are pleased to announce the availability of Slurm version 17.11.6. This includes over 50 fixes made since 17.11.5 was released eight weeks ago, including a race condition within the slurmstepd that can lead to hung extern steps. Slurm can be downloaded from https://www.schedmd.com/download

[slurm-users] Splitting mpi rank output

2018-05-09 Thread Christopher Benjamin Coffey
Hi, I have a user trying to use %t to split the mpi rank outputs into different files and it's not working. I verified this too. Any idea why this might be? This is the first time I've heard of a user trying to do this. Here is an example job script file: - #!/bin/bash #SBATCH --job-name=m
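A likely cause (my reading of the sbatch/srun filename-pattern docs, not confirmed in the thread) is that %t expands per task only in srun's own --output option; the batch step is a single task, so a %t in the #SBATCH --output line cannot fan out per rank. A sketch of a job script that does split per rank (my_mpi_app is a placeholder):

```shell
#!/bin/bash
#SBATCH --job-name=mpi_split
#SBATCH --ntasks=4
# Request per-rank output files on the srun line, where %t is expanded
# once per task (rank-0.out, rank-1.out, ...):
srun --output=rank-%t.out ./my_mpi_app
```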

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-09 Thread Eric F. Alemany
Good Morning (at least for those on the West coast of the US) My nodes are no longer “down” eric@radoncmaster:~$ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST debug* up infinite 4 idle radonc[01-04] I think the NTP configuration did the trick. So one possibility there is

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-09 Thread Mahmood Naderan
> I think, the problem was: > the python script > /opt/rocks/lib/python2.7/site-packages/rocks/commands/sync/slurm/__init__py, > which is called by the command rocks sync slurm > did not restart slurmd on the Head-Node. Thanks for figuring that out. At the time I was digging, I tried rocks sync co

Re: [slurm-users] slurm reboot node with spank plugin

2018-05-09 Thread Tueur Volvo
I currently use a node-feature plugin like knl, but I don't like using node features because I must write the "feature" in the slurm.conf file. I find this solution rather cumbersome: for example, if I want to add kernel 4.4 to my program, as in srun -C kernel4.4, I must update the slurm.conf file on every node in my cluster, a

Re: [slurm-users] slurm reboot node with spank plugin

2018-05-09 Thread Chris Samuel
On Wednesday, 9 May 2018 9:16:37 PM AEST Tueur Volvo wrote: > if i use srun --reboot hostname, how to tell him to update the kernel before > rebooting ? Ah, now I understand why you mention a spank plugin, as that would allow you to specify a new command line option for sbatch to specify a kerne
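For context, the built-in --reboot path runs whatever slurm.conf's RebootProgram points at on each node, so a kernel switch could be folded into that script. Below is a purely hypothetical RebootProgram; the grub-set-default index and the script path are illustrative, not from the thread:

```shell
#!/bin/bash
# Hypothetical script, referenced from slurm.conf as, e.g.:
#   RebootProgram=/usr/local/sbin/slurm_reboot
# Select the desired default boot entry, then reboot the node:
grub-set-default 0
exec /sbin/reboot
```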

Re: [slurm-users] slurm reboot node with spank plugin

2018-05-09 Thread Tueur Volvo
I would like to update the Linux kernel, then reboot the machine and run the job. For example, I would like this: srun --chooskernel=4.1 hostname. I would like to install kernel 4.1 on my machine, then reboot the machine and run hostname. If I use srun --reboot hostname, how do I tell it to update th

Re: [slurm-users] slurm reboot node with spank plugin

2018-05-09 Thread Chris Samuel
On Wednesday, 9 May 2018 7:09:12 PM AEST Tueur Volvo wrote: > Hello, i have question, it's possible to reboot slurm node in spank plugin > before execute job ? I don't know about that, but sbatch has a --reboot flag and you could use a submit filter to set it. We do the opposite and always str

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-09 Thread Chris Samuel
On Wednesday, 9 May 2018 6:09:08 PM AEST Werner Saar wrote: > I think, the problem was: > the python script > /opt/rocks/lib/python2.7/site-packages/rocks/commands/sync/slurm/__init__py, > which is called by the command rocks sync slurm > did not restart slurmd on the Head-Node. Depending on the

[slurm-users] slurm reboot node with spank plugin

2018-05-09 Thread Tueur Volvo
Hello, I have a question: is it possible to reboot a Slurm node in a spank plugin before executing a job?

Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc

2018-05-09 Thread a . vitalis
Hi Benjamin, thanks for getting back to me! I somehow failed to ever arrive at this page. Andreas On 05/09/2018 01:20AM, Benjamin Matthews wrote: Re: [slurm-users] srun seg faults immediately fr

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-09 Thread Werner Saar
Hi Mahmood, I think, the problem was: the python script /opt/rocks/lib/python2.7/site-packages/rocks/commands/sync/slurm/__init__py, which is called by the command rocks sync slurm did not restart slurmd on the Head-Node. After the restart of slurmctld, slurmd on the Head-node had the old con

Re: [slurm-users] scancel a list of jobs

2018-05-09 Thread Bjørn-Helge Mevik
Chester Langin writes: > Is there no way to scancel a list of jobs? Like from job 120 to job 150? scancel $(seq 120 150) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
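As a footnote to the one-liner above: scancel accepts any number of job IDs, so the range can be expanded with seq or with bash brace expansion ({120..150}). The sketch below uses echo as a stand-in for scancel so it runs without a live cluster:

```shell
# Expand the job-ID range and hand the whole list to one command;
# echo stands in for scancel so the sketch is runnable anywhere:
jobids=$(seq 120 150)
echo scancel $jobids
```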