As far as I know, certain versions of UCX had a problem where UCX
expected the OPAL memory hooks from Open MPI, but they were disabled, so
the physical pages went out of sync. I don't know whether that is what is
happening here, though.

You could enable dynamic debug to see whether anything useful shows up
in dmesg:

echo "module mlx5_core +p"| tee /sys/kernel/debug/dynamic_debug/control

And you could also try to run ucx_info in debug mode.
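
For example, something along these lines might do it. This is only a sketch: it assumes ucx_info from your UCX installation is in PATH, it uses UCX's UCX_LOG_LEVEL environment variable to turn on debug logging, and the log file name is just a placeholder I picked.

```shell
# Sketch only: the log file name is an arbitrary placeholder.
log_file=ucx_info_debug.log

if command -v ucx_info >/dev/null 2>&1; then
    # Print the UCX version and how the library was configured.
    ucx_info -v
    # List devices and transports with debug logging, saving everything to the log file.
    UCX_LOG_LEVEL=debug ucx_info -d > "$log_file" 2>&1
    echo "debug output written to $log_file"
else
    echo "ucx_info not found in PATH" >&2
fi
```

Running it on a ConnectX-4 node and on a ConnectX-6 node and comparing the two logs may show whether UCX picks different transports on the two kinds of cards.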

Cheers,

Barbara


On 6/1/20 8:37 PM, Alberto Morillas, Angelines wrote:
> Yes, I tried it, but with the same result:
> openmpi@4.0.3 -cuda +cxx_exceptions fabrics=ucx  -java -legacylaunchers 
> -memchecker +pmi schedulers=slurm  -sqlite3 -thread_multiple +vt
>
> WRF compiles, and when you sbatch the job it appears to be running, but it
> does nothing and we get the same result, with WCHAN=hrtime:
>             0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?        00:05:25 real.exe
>
>     ------------------------------
>
>     Message: 2
>     Date: Mon, 1 Jun 2020 16:56:05 +0000
>     From: "Pritchard Jr., Howard" <howa...@lanl.gov>
>     To: Slurm User Community List <slurm-users@lists.schedmd.com>
>     Subject: Re: [slurm-users] [EXTERNAL]  problems with OpenMPI 4.0.3
>     Message-ID: <20dc51ae-9f58-4b1c-b619-1a2077d5c...@lanl.gov>
>     Content-Type: text/plain; charset="utf-8"
>
>     Hi Angelines,
>
>     Could you try reinstalling with fabrics=ucx and rerunning?
>     UCX is the preferred way to use InfiniBand in the Open MPI 4.0.x
>     release stream.
>
>     Howard
>
>     On 6/1/20, 10:29 AM, "slurm-users on behalf of Alberto Morillas, 
> Angelines" <slurm-users-boun...@lists.schedmd.com on behalf of 
> angelines.albe...@ciemat.es> wrote:
>
>         Hello Howard,
>
>         I installed it with spack: 
>         openmpi@4.0.3 -cuda +cxx_exceptions fabrics=verbs -java 
> -legacylaunchers -memchecker  +pmi schedulers=slurm -sqlite3 -thread_multiple 
> +vt
>         where - means the variant is disabled
>               + means the variant is enabled
>
>         Thanks in advance.
>         ________________________________________________
>
>         Angelines Alberto Morillas
>
>         Unidad de Arquitectura Informática
>         Despacho: 22.1.32
>         Telf.: +34 91 346 6119
>         Fax:   +34 91 346 6537
>
>         skype: angelines.alberto
>
>         CIEMAT
>         Avenida Complutense, 40
>         28040 MADRID
>         ________________________________________________ 
>
>
>
>
>             ------------------------------
>
>             Message: 2
>             Date: Mon, 1 Jun 2020 16:13:11 +0000
>             From: "Pritchard Jr., Howard" <howa...@lanl.gov>
>             To: Slurm User Community List <slurm-users@lists.schedmd.com>
>             Subject: Re: [slurm-users] [EXTERNAL]  problems with OpenMPI 4.0.3
>             Message-ID: <ca7fe91c-8104-476f-b9a2-528d23ed3...@lanl.gov>
>             Content-Type: text/plain; charset="utf-8"
>
>             Hello Angelines,
>
>             Do you know how the Open MPI 4.0.3 package was configured and 
> built?   That information would be useful to help diagnose the problem.
>
>             Thanks,
>
>             Howard
>
>
>             From: slurm-users <slurm-users-boun...@lists.schedmd.com> on 
> behalf of "Alberto Morillas, Angelines" <angelines.albe...@ciemat.es>
>             Reply-To: Slurm User Community List 
> <slurm-users@lists.schedmd.com>
>             Date: Friday, May 29, 2020 at 4:25 AM
>             To: "slurm-users@lists.schedmd.com" 
> <slurm-users@lists.schedmd.com>
>             Subject: [EXTERNAL] [slurm-users] problems with OpenMPI 4.0.3
>
>             Good morning,
>
>             We have a cluster with two kinds of InfiniBand cards, ConnectX-4
>             and ConnectX-6. Open MPI 3.1.3 works fine, but when we started
>             with ConnectX-6 we began using Open MPI 4.0.3 (which supports
>             ConnectX-6). Programs that run in several stages, first a call
>             to a sequential program that in turn calls a parallel program
>             (in our case WRF, but we have others like it with the same
>             problem), suddenly stop:
>
>             ...
>             0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?        00:05:25 real.exe
>             0 S  4556  87384  87361  0  80   0 - 126677 hrtime ?        00:05:33 real.exe
>             0 S  4556  87385  87361  0  80   0 - 126675 hrtime ?        00:05:28 real.exe
>             ...
>             WCHAN is hrtime, and the job looks like it is running, but it
>             is not actually doing any work.
>
>             We don't know whether this could be a problem between Slurm and
>             this version of Open MPI. Any ideas?
>
>
>             ________________________________________________
>
>             Angelines Alberto Morillas
>
>             Unidad de Arquitectura Informática
>             Despacho: 22.1.32
>             Telf.: +34 91 346 6119
>             Fax:   +34 91 346 6537
>
>             skype: angelines.alberto
>
>             CIEMAT
>             Avenida Complutense, 40
>             28040 MADRID
>             ________________________________________________
>
>
>             ------------------------------
>
>             Message: 3
>             Date: Mon, 1 Jun 2020 16:16:00 +0000
>             From: Songpon Srisawai <songpons_...@vistec.ac.th>
>             To: Slurm User Community List <slurm-users@lists.schedmd.com>
>             Subject: Re: [slurm-users] Slurm Job Count Credit system
>             Message-ID: <9666f3be-d648-4ee9-9ad2-80df973f87cc@Spark>
>             Content-Type: text/plain; charset="utf-8"
>
>             Your help is greatly appreciated. I will try to implement this
>             following your suggestion.
>             On 1 Jun 2020 22:23 +0700, Renfro, Michael <ren...@tntech.edu>, 
> wrote:
>             Even without the slurm-bank system, you can enforce a limit on 
> resources with a QOS applied to those users. Something like:
>
>             =====
>
>             sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit
>             sacctmgr modify qos bank1 set grptresmins=cpu=1000
>
>             sacctmgr add account bank1
>             sacctmgr modify account name=bank1 set qos+=bank1
>
>             sacctmgr add user someuser account=bank1
>             sacctmgr modify user someuser set qos+=bank1
>
>             =====
>
>             You can do lots with a QOS, including limiting the number of 
> simultaneous running jobs, simultaneous running/queued jobs, etc. 
> Unfortunately, the NoDecay flag is only documented to work on GrpTRESMins, 
> GrpWall, and UsageRaw, not on the job count.
>
>             So if you can live with limiting the number of simultaneous jobs 
> instead of a total number of jobs per time period, that's possible with QOS. 
> Otherwise, maybe someone else will have an idea.
>
>             --
>             Mike Renfro, PhD / HPC Systems Administrator, Information 
> Technology Services
>             931 372-3601 / Tennessee Tech University
>
>             On May 31, 2020, at 11:35 AM, Songpon Srisawai 
> <songpons_...@vistec.ac.th> wrote:
>
>             Hello all,
>
>             I'm a Slurm beginner trying to set up our cluster. I would like
>             to know whether there is any Slurm credit/token system plugin
>             that counts jobs rather than hours.
>
>             I found Slurm-bank, which deposits hours to an account. But I
>             would like to deposit job tokens instead of hours.
>
>             Thanks for any recommendation
>             Songpon
>
>             End of slurm-users Digest, Vol 32, Issue 2
>             ******************************************
>
>
>
>
>     End of slurm-users Digest, Vol 32, Issue 3
>     ******************************************
>

