AFAIK, certain versions of UCX had a problem where UCX expected the OPAL memory hooks from Open MPI, but the hooks were disabled, so the physical pages went out of sync. I don't know whether that is the case here, though.
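To narrow down whether a UCX version issue like this could apply, it may help to record which UCX version is installed and whether the Open MPI build actually includes UCX. The following is only a rough sketch: it assumes ucx_info and ompi_info are on PATH on the compute node, and the function name report_ucx is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the UCX version, its memory-related settings, and the
# UCX components known to Open MPI. report_ucx is a hypothetical helper;
# it assumes ucx_info and ompi_info are on PATH and skips whichever is missing.
report_ucx() {
    if command -v ucx_info >/dev/null 2>&1; then
        ucx_info -v                        # UCX version and configure flags
        ucx_info -c | grep -i mem || true  # memory-related settings, if any are listed
    else
        echo "ucx_info not found"
    fi
    if command -v ompi_info >/dev/null 2>&1; then
        ompi_info | grep -i ucx || echo "no UCX components listed by ompi_info"
    else
        echo "ompi_info not found"
    fi
}

report_ucx
```

Running this on both a ConnectX-4 node and a ConnectX-6 node and comparing the output would show whether the two node types see the same UCX and Open MPI builds.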
Maybe you could enable dynamic debug to see whether anything useful shows up in dmesg:

    echo "module mlx5_core +p" | tee /sys/kernel/debug/dynamic_debug/control

You could also try running ucx_info in debug mode.

Cheers,
Barbara

On 6/1/20 8:37 PM, Alberto Morillas, Angelines wrote:
> Yes, I tried it, but with the same result:
> openmpi@4.0.3 -cuda +cxx_exceptions fabrics=ucx -java -legacylaunchers
> -memchecker +pmi schedulers=slurm -sqlite3 -thread_multiple +vt
>
> You can compile WRF, and when you sbatch your job it is running, but it
> doesn't do anything, and we get the same with WCHAN=hrtime:
> 0 S 4556 87383 87361 0 80 0 - 126676 hrtime ? 00:05:25 real.exe
>
> ------------------------------
>
> Message: 2
> Date: Mon, 1 Jun 2020 16:56:05 +0000
> From: "Pritchard Jr., Howard" <howa...@lanl.gov>
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3
> Message-ID: <20dc51ae-9f58-4b1c-b619-1a2077d5c...@lanl.gov>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Angelines,
>
> Could you try reinstalling with fabrics=ucx and rerunning?
> UCX is the preferred way to use InfiniBand in the Open MPI 4.0.x release
> stream.
>
> Howard
>
> On 6/1/20, 10:29 AM, "slurm-users on behalf of Alberto Morillas,
> Angelines" <slurm-users-boun...@lists.schedmd.com on behalf of
> angelines.albe...@ciemat.es> wrote:
>
> Hello Howard,
>
> I installed it with spack:
> openmpi@4.0.3 -cuda +cxx_exceptions fabrics=verbs -java
> -legacylaunchers -memchecker +pmi schedulers=slurm -sqlite3 -thread_multiple
> +vt
> where - --> not enabled
>       + --> enabled
>
> Thanks in advance.
> ________________________________________________
>
> Angelines Alberto Morillas
>
> Unidad de Arquitectura Informática
> Despacho: 22.1.32
> Telf.: +34 91 346 6119
> Fax: +34 91 346 6537
>
> skype: angelines.alberto
>
> CIEMAT
> Avenida Complutense, 40
> 28040 MADRID
> ________________________________________________
>
> ------------------------------
>
> Message: 2
> Date: Mon, 1 Jun 2020 16:13:11 +0000
> From: "Pritchard Jr., Howard" <howa...@lanl.gov>
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3
> Message-ID: <ca7fe91c-8104-476f-b9a2-528d23ed3...@lanl.gov>
> Content-Type: text/plain; charset="utf-8"
>
> Hello Angelines,
>
> Do you know how the Open MPI 4.0.3 package was configured and
> built? That information would be useful to help diagnose the problem.
>
> Thanks,
>
> Howard
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on
> behalf of "Alberto Morillas, Angelines" <angelines.albe...@ciemat.es>
> Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Date: Friday, May 29, 2020 at 4:25 AM
> To: "slurm-users@lists.schedmd.com" <slurm-users@lists.schedmd.com>
> Subject: [EXTERNAL] [slurm-users] problems with OpenMPI 4.0.3
>
> Good morning,
>
> We have a cluster with two kinds of InfiniBand cards, one
> ConnectX-4 and the other ConnectX-6.
> Open MPI 3.1.3 works fine, but when we started with ConnectX-6 we
> switched to Open MPI 4.0.3 (which supports ConnectX-6). Programs that
> have several parts, first a call to a sequential program and inside it a
> call to a parallel program (in our case the program is WRF, but we have
> others like this with the same problem), suddenly stop:
>
> ...
> 0 S 4556 87383 87361 0 80 0 - 126676 hrtime ? 00:05:25 real.exe
> 0 S 4556 87384 87361 0 80 0 - 126677 hrtime ? 00:05:33 real.exe
> 0 S 4556 87385 87361 0 80 0 - 126675 hrtime ? 00:05:28 real.exe
> ...
> The WCHAN is hrtime, and it looks like it is running, but it really
> isn't doing any work.
>
> We don't know whether this could be a problem with Slurm and this version
> of Open MPI. Any ideas?
>
> ------------------------------
>
> Message: 3
> Date: Mon, 1 Jun 2020 16:16:00 +0000
> From: Songpon Srisawai <songpons_...@vistec.ac.th>
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] Slurm Job Count Credit system
> Message-ID: <9666f3be-d648-4ee9-9ad2-80df973f87cc@Spark>
> Content-Type: text/plain; charset="utf-8"
>
> Thank you very much for your help. I will try to implement
> your suggestion.
> On 1 Jun 2020 22:23 +0700, Renfro, Michael <ren...@tntech.edu> wrote:
> Even without the slurm-bank system, you can enforce a limit on
> resources with a QOS applied to those users. Something like:
>
> =====
>
> sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit
> sacctmgr modify qos bank1 set grptresmins=cpu=1000
>
> sacctmgr add account bank1
> sacctmgr modify account name=bank1 set qos+=bank1
>
> sacctmgr add user someuser account=bank1
> sacctmgr modify user someuser set qos+=bank1
>
> =====
>
> You can do lots with a QOS, including limiting the number of
> simultaneous running jobs, simultaneous running/queued jobs, etc.
> Unfortunately, the NoDecay flag is only documented to work on GrpTRESMins,
> GrpWall, and UsageRaw, not on the job count.
>
> So if you can live with limiting the number of simultaneous jobs
> instead of a total number of jobs per time period, that's possible with
> a QOS. Otherwise, maybe someone else will have an idea.
>
> --
> Mike Renfro, PhD / HPC Systems Administrator, Information
> Technology Services
> 931 372-3601 / Tennessee Tech University
>
> On May 31, 2020, at 11:35 AM, Songpon Srisawai
> <songpons_...@vistec.ac.th> wrote:
>
> Hello all,
>
> I'm a Slurm beginner trying to set up our cluster. I would like
> to know whether there is any Slurm credit/token system plugin that is
> based on, for example, the number of jobs.
>
> I found Slurm-bank, which deposits hours to an account. But I would
> like to deposit job tokens instead of hours.
>
> Thanks for any recommendation,
> Songpon
>
> End of slurm-users Digest, Vol 32, Issue 2
> ******************************************
>
> End of slurm-users Digest, Vol 32, Issue 3
> ******************************************