[slurm-users] Re: srun weirdness
Looks more like a runtime environment issue. Check the binaries:

ldd /mnt/local/ollama/ollama

on both clusters and compare the output; it may give some hints.

Best,

Feng

On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users wrote:
>
> I'm running into a strange issue and I'm hoping another set of brains
> looking at this might help. I would appreciate any feedback.
>
> I have two Slurm clusters. The first cluster is running Slurm 21.08.8
> on Rocky Linux 8.9 machines. The second cluster is running Slurm
> 23.11.6 on Rocky Linux 9.4 machines.
>
> This works perfectly fine on the first cluster:
>
> $ srun --mem=32G --pty /bin/bash
> srun: job 93911 queued and waiting for resources
> srun: job 93911 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
>
> and the ollama help message appears as expected.
>
> However, on the second cluster:
>
> $ srun --mem=32G --pty /bin/bash
> srun: job 3 queued and waiting for resources
> srun: job 3 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
> fatal error: failed to reserve page summary memory
> runtime stack:
> runtime.throw({0x1240c66?, 0x154fa39a1008?})
>         runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
>         runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
>         runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
> runtime.(*mheap).init(0x127b47e0)
>         runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
> runtime.mallocinit()
>         runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
> runtime.schedinit()
>         runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
> runtime.rt0_go()
>         runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
>
> If I ssh directly to the same node on that second cluster (skipping
> Slurm entirely) and run the same "/mnt/local/ollama/ollama help"
> command, it works perfectly fine.
>
> My first thought was that it might be related to cgroups. I switched
> the second cluster from cgroups v2 to v1 and tried again; no
> difference. I tried disabling cgroups on the second cluster by removing
> all cgroup references in the slurm.conf file, but that also made no
> difference.
>
> My guess is that something changed with regard to srun between these two
> Slurm versions, but I'm not sure what.
>
> Any thoughts on what might be happening and/or a way to get this to work
> on the second cluster? Essentially I need a way to request an
> interactive shell through Slurm that is associated with the requested
> resources. Should we be using something other than srun for this?
>
> Thank you,
>
> -Dj
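A quick sketch of how to act on that suggestion: a Go binary such as ollama reserves a large virtual address range at startup, so besides the linked libraries it is worth diffing the process limits and the environment between the srun-obtained shell and a plain ssh shell on the same node. The temporary file names below are arbitrary.

# Run in both the srun shell and the ssh shell, then compare:
ulimit -a > /tmp/limits.$$        # look especially at "virtual memory" (ulimit -v)
cat /proc/self/limits             # the kernel's view of the same limits
env | sort > /tmp/env.$$          # diff the two environment dumps afterwards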
[slurm-users] Re: srun weirdness
Not sure, very strange, though the two linux-vdso.so.1 lines do look different:

[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x7ffde81ee000)

[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x7fffa66ff000)

Best,

Feng

On Tue, May 14, 2024 at 3:43 PM Dj Merrill via slurm-users wrote:
>
> Hi Feng,
> Thank you for replying.
>
> It is the same binary on the same machine that fails.
>
> If I ssh to a compute node on the second cluster, it works fine.
>
> It fails when running in an interactive shell obtained with srun on that
> same compute node.
>
> I agree that it seems like a runtime environment difference between the
> SSH shell and the srun-obtained shell.
>
> This is the ldd from within the srun-obtained shell (and it gives the error
> when run):
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>         linux-vdso.so.1 (0x7ffde81ee000)
>         libresolv.so.2 => /lib64/libresolv.so.2 (0x154f732cc000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x154f732c7000)
>         libstdc++.so.6 => /lib64/libstdc++.so.6 (0x154f7300)
>         librt.so.1 => /lib64/librt.so.1 (0x154f732c2000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x154f732bb000)
>         libm.so.6 => /lib64/libm.so.6 (0x154f72f25000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x154f732a)
>         libc.so.6 => /lib64/libc.so.6 (0x154f72c0)
>         /lib64/ld-linux-x86-64.so.2 (0x154f732f8000)
>
> This is the ldd from the same exact node within an SSH shell, which runs
> fine:
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>         linux-vdso.so.1 (0x7fffa66ff000)
>         libresolv.so.2 => /lib64/libresolv.so.2 (0x14a9d82da000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x14a9d82d5000)
>         libstdc++.so.6 => /lib64/libstdc++.so.6 (0x14a9d800)
>         librt.so.1 => /lib64/librt.so.1 (0x14a9d82d)
>         libdl.so.2 => /lib64/libdl.so.2 (0x14a9d82c9000)
>         libm.so.6 => /lib64/libm.so.6 (0x14a9d7f25000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x14a9d82ae000)
>         libc.so.6 => /lib64/libc.so.6 (0x14a9d7c0)
>         /lib64/ld-linux-x86-64.so.2 (0x14a9d8306000)
>
> -Dj
>
> On 5/14/24 15:25, Feng Zhang via slurm-users wrote:
> > Looks more like a runtime environment issue. Check the binaries:
> >
> > ldd /mnt/local/ollama/ollama
> >
> > on both clusters and compare the output; it may give some hints.
> >
> > Best,
> >
> > Feng
[slurm-users] Re: srun weirdness
Do you have any container settings configured?

On Tue, May 14, 2024 at 3:57 PM Feng Zhang wrote:
>
> Not sure, very strange, though the two linux-vdso.so.1 lines do look different:
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>         linux-vdso.so.1 (0x7ffde81ee000)
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>         linux-vdso.so.1 (0x7fffa66ff000)
>
> Best,
>
> Feng
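Another hedged angle for the same symptom, given that ssh works but srun does not: by default srun propagates the submitting shell's resource limits into the job step, so it can be worth checking what the step actually runs under and how limit propagation is configured. The slurm.conf path below assumes the standard location.

srun --mem=32G cat /proc/self/limits        # limits inside an allocation
grep -i propagate /etc/slurm/slurm.conf     # PropagateResourceLimits / ...Except settings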
[slurm-users] MaxRSS reported by sacct is wrong
Hi All,

I am having trouble calculating the real RSS memory usage of a certain kind of user job, for which sacct returns wrong numbers.

Rocky Linux release 8.5, Slurm 21.08

(slurm.conf)
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux

The troubling jobs look like this:

1. Python spawns 96 threads via multithreading;
2. each thread uses scikit-learn, which again spawns 96 threads using OpenMP.

That obviously overruns the node, and I want to address it. The node has 300GB of RAM, but sacct (and seff) reports 1.2TB MaxRSS (and AveRSS), which does not look correct. I suspect that Slurm with jobacct_gather/linux repeatedly sums up the memory used by all of these threads, counting the same memory many times. For the OpenMP part, maybe Slurm handles it fine, while for Python multithreading maybe the memory accounting does not work well. If that is the case, would the real usage be roughly 1.2TB/96 = 12.5GB MaxRSS?

I want to get the right MaxRSS to report to users. Thanks!

Best,

Feng
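A hedged way to cross-check these numbers, assuming your Slurm build ships the cgroup accounting plugin: have accounting read memory from the job's cgroup instead of summing per-process RSS from /proc, then compare what sacct reports for the same workload. The cgroup path below is the usual v1 layout and may differ on your nodes.

# slurm.conf (all nodes, then restart slurmd/slurmctld)
JobAcctGatherType=jobacct_gather/cgroup

# After rerunning the test job:
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,AveRSS,MaxRSSTask

# While the job is still running, the cgroup's own peak counter on the node:
cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.max_usage_in_bytes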
[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs
Yes, the algorithm works like that: 1 CPU (core) per job (task). As someone mentioned already, you need to enable oversubscription of the CPU cores in slurm.conf, e.g. OverSubscribe=FORCE:10 on the partition, meaning up to 10 jobs per core in your case.

Best,

Feng

On Fri, Jun 21, 2024 at 6:52 AM Arnuld via slurm-users wrote:
>
> > Every job will need at least 1 core just to run
> > and if there are only 4 cores on the machine,
> > one would expect a max of 4 jobs to run.
>
> I have 3500+ GPU cores available. You mean each GPU job requires at least one
> CPU? Can't we run a job with just GPU without any CPUs? This sbatch script
> requires 100 GPU cores; can't we run 35 in parallel?
>
> #! /usr/bin/env bash
>
> #SBATCH --output="%j.out"
> #SBATCH --error="%j.error"
> #SBATCH --partition=pgpu
> #SBATCH --gres=shard:100
>
> sleep 10
> echo "Current date and time: $(date +"%Y-%m-%d %H:%M:%S")"
> echo "Running..."
> sleep 10
>
> On Thu, Jun 20, 2024 at 11:23 PM Brian Andrus via slurm-users wrote:
>>
>> Well, if I am reading this right, it makes sense.
>>
>> Every job will need at least 1 core just to run, and if there are only 4
>> cores on the machine, one would expect a max of 4 jobs to run.
>>
>> Brian Andrus
>>
>> On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote:
>> > I have a machine with a quad-core CPU and an Nvidia GPU with 3500+
>> > cores. I want to run around 10 jobs in parallel on the GPU (mostly
>> > CUDA-based jobs).
>> >
>> > PROBLEM: Each job asks for only 100 shards (and usually runs for a minute
>> > or so), so I should be able to run 3500/100 = 35 jobs in
>> > parallel, but Slurm runs only 4 jobs in parallel, keeping the rest in
>> > the queue.
>> >
>> > I have this in slurm.conf and gres.conf:
>> >
>> > # GPU
>> > GresTypes=gpu,shard
>> > # COMPUTE NODES
>> > PartitionName=pzero Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>> > PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP
>> > NodeName=hostgpu NodeAddr=x.x.x.x Gres=gpu:gtx_1080_ti:1,shard:3500
>> > CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1
>> > RealMemory=64255 State=UNKNOWN
>> > --
>> > Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Count=1
>> > Name=shard Count=3500 File=/dev/nvidia0
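A minimal sketch of that slurm.conf change, reusing the partition definition quoted in the original post; adjust the FORCE count to taste and run "scontrol reconfigure" afterwards:

PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP OverSubscribe=FORCE:10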
[slurm-users] Re: Print Slurm Stats on Login
You can also check https://github.com/prod-feng/slurm_tools

slurm_job_perf_show.py may be helpful.

I used to use slurm_job_perf_show_email.py to email users a summary of their usage, roughly monthly, but some users seemed to get confused by it, so I stopped.

Best,

Feng

On Fri, Aug 9, 2024 at 11:13 AM Paul Edmon via slurm-users wrote:
>
> We are working to make our users more aware of their usage. One of the
> ideas we came up with was to have some basic usage stats printed at
> login (usage over the past day, fairshare, job efficiency, etc.). Does anyone
> have any scripts or methods that they use to do this? Before baking my
> own I was curious what other sites do and whether they would be willing to
> share their scripts and methodology.
>
> -Paul Edmon-
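A rough sketch of the print-at-login approach, assuming a profile.d script on the login nodes and that sreport/sshare are in the default PATH; the file name and the fields printed are only placeholders:

# /etc/profile.d/slurm-usage.sh (hypothetical name)
# Skip non-interactive shells so scp, rsync, and batch jobs are unaffected.
[ -t 0 ] || return 0

echo "=== Slurm usage for $USER over the past day ==="
sreport -t hours cluster UserUtilizationByAccount user="$USER" \
    start=$(date -d yesterday +%Y-%m-%dT%H:%M:%S) end=$(date +%Y-%m-%dT%H:%M:%S) 2>/dev/null

echo "=== Fairshare ==="
sshare -U -u "$USER" 2>/dev/null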
[slurm-users] Re: /etc/passwd sync?
Keeping /etc/passwd and /etc/group synced to all the nodes should work. You will also need to set up SSH keys for MPI.

Best,

Feng

On Mon, Feb 10, 2025 at 10:29 PM mark.w.moorcroft--- via slurm-users <slurm-users@lists.schedmd.com> wrote:
>
> If you set up Slurm elastic cloud in EC2 without LDAP, what is the
> recommended method for syncing the passwd/group files? Is this necessary to
> get openmpi jobs to run? I would swear I had this working last week without
> synced passwd on two nodes, but thinking about it now I'm not sure how this
> could have worked. My home directories are on an NFS mount, but the user
> accounts don't exist on the node AMI. I'm using Ansible/Packer to manage
> the AMIs. When I ran OpenHPC/Slurm on bare metal there was a sync
> process. This is my first AWS Slurm cluster rodeo. I can't use the Amazon
> Parallel Computing tools because we are forced to be in GovCloud. I started
> with "ClusterInTheCloud", but it's all 4 years old and semi-broken out of
> the box. My manager had me ditch a lot of it (including LDAP), so I'm
> building out a fork that is getting heavily modded for our situation.
>
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
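A minimal sketch of one way to push the account files from the head node, assuming passwordless root SSH to every compute node and a hypothetical node list file; Ansible's copy module or pdcp/pdsh would do the same job within the existing Ansible/Packer setup:

# push-accounts.sh (hypothetical helper run from the head node)
for n in $(cat /etc/slurm/nodes.txt); do
    rsync -a /etc/passwd /etc/group "root@${n}:/etc/"
done
# Add /etc/shadow only if local password authentication on the nodes is required.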
[slurm-users] Re: Using more cores/CPUs that requested with sbatch
Also, in the cgroup.conf file you can add constraints on memory, devices (like GPUs), etc.

Best,

Feng

On Tue, Mar 25, 2025 at 3:20 AM megan4slurm--- via slurm-users <slurm-users@lists.schedmd.com> wrote:
>
> Hello Gestió,
>
> Yes, Slurm can restrict the resources that are available to the job using
> cgroups. I accidentally sent my first reply as a separate email in this
> mailing list, which you can find here:
>
> https://lists.schedmd.com/mailman3/hyperkitty/list/slurm-users@lists.schedmd.com/thread/IJHBUWOU5NPZQK7NYUZODTIZJRLLM3H4/
>
> Sorry about that,
> --Megan
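A minimal cgroup.conf sketch of the constraints being referred to; these are standard options, but which ones make sense depends on the site, and core/memory/device confinement also requires TaskPlugin=task/cgroup in slurm.conf:

# /etc/slurm/cgroup.conf
ConstrainCores=yes        # pin tasks to the cores they were allocated
ConstrainRAMSpace=yes     # enforce the job's memory request
ConstrainSwapSpace=yes    # also cap swap usage
ConstrainDevices=yes      # limit GPUs/devices to those granted via GRES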
[slurm-users] Re: Job running slower when using Slurm
Besides the Slurm options, you might also need to set the OpenMP environment variable:

export OMP_NUM_THREADS=32

(the number of cores, not hardware threads). Also check other similar environment variables if you use any Python libraries.

Best,

Feng

On Wed, Apr 23, 2025 at 3:22 PM Jeffrey Layton via slurm-users <slurm-users@lists.schedmd.com> wrote:
> Roger. It's the code that prints out the threads it sees - I bet it is the
> cgroups. I need to look at how that is configured as well.
>
> For the time, that comes from the code itself. I'm guessing it has a start
> time and an end time in the code and just takes the difference. But again,
> this is something in the code. Unfortunately, the code uses the time to
> compute Mop/s total and Mop/s/thread, so a longer time means slower
> performance.
>
> Thanks!
>
> Jeff
>
> On Wed, Apr 23, 2025 at 2:53 PM Michael DiDomenico via slurm-users <slurm-users@lists.schedmd.com> wrote:
>
>> the program probably says 32 threads because it's just looking at the
>> box, not what slurm cgroups allow (assuming you're using them) for cpu
>>
>> i think for an openmp program (not openmpi) you definitely want the
>> first command with --cpus-per-task=32
>>
>> are you measuring the runtime inside the program or outside it? if
>> the latter, the 10sec addition in time could be the slurm setup/node
>> allocation
>>
>> On Wed, Apr 23, 2025 at 2:41 PM Jeffrey Layton wrote:
>> >
>> > I tried using ntasks and cpus-per-task to get all 32 cores. So I added
>> > --ntasks=# --cpus-per-task=N to the sbatch command so that it now looks
>> > like:
>> >
>> > sbatch --nodes=1 --ntasks=1 --cpus-per-task=32
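A minimal sketch of the kind of job script being discussed, assuming a hypothetical OpenMP binary named ./app; tying OMP_NUM_THREADS to SLURM_CPUS_PER_TASK keeps the thread count in step with whatever the job actually requests:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32

# Match the OpenMP thread count to what Slurm granted this task.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

./app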