Re: [slurm-users] Slurm SPANK GPU Compute Mode plugin

2018-01-23 Thread Miguel Gila
Wow, this is great!

It’s difficult to find SPANK plugins built in Lua, many thanks :-)
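One thing that may be handy when trying it out: you can double-check from inside a job which compute mode actually ended up being set by querying the CUDA runtime (nvidia-smi -q reports it as well). A minimal, untested C sketch, with a made-up file name and example build line:

/* check_mode.c -- print the compute mode of each visible GPU via the
 * CUDA runtime API.  Build with something along the lines of
 * "cc check_mode.c -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart"
 * (paths are just examples for a typical CUDA install). */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    for (int i = 0; i < n; i++) {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;
        const char *mode =
            prop.computeMode == cudaComputeModeDefault          ? "Default" :
            prop.computeMode == cudaComputeModeProhibited       ? "Prohibited" :
            prop.computeMode == cudaComputeModeExclusiveProcess ? "Exclusive Process" :
                                                                  "Exclusive Thread (or unknown)";
        printf("GPU %d (%s): compute mode = %s\n", i, prop.name, mode);
    }
    return 0;
}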

M.

> On 23 Jan 2018, at 07:38, Nadav Toledo  wrote:
> 
> Thank you for sharing,
> it's indeed of interest to others...
> 
> On 23/01/2018 01:20, Kilian Cavalotti wrote:
>> Hi all,
>> 
>> We (Stanford Research Computing Center) developed a SPANK plugin which
>> allows users to choose the GPU compute mode [1] for their jobs.
>> [1] http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-modes
>> 
>> This came from the need to give our users some control on the way GPUs
>> are set, so they could run specific applications requiring a given
>> mode, while providing defaults optimized for our general environment.
>> 
>> We figured it could be of interest to others, so we released our Slurm
>> SPANK GPU Compute Mode plugin at
>> https://github.com/stanford-rc/slurm-spank-gpu_cmode 
>> 
>> 
>> Feel free to give it a try, and don't hesitate to contact us if you
>> have any question.
>> 
>> 
>> Cheers,
> 
> 





Re: [slurm-users] Slurm SPANK GPU Compute Mode plugin

2018-01-23 Thread Miguel Gila
Hi Kilian, a question on this: which version of Slurm/Lua are you running this 
against?

I don’t seem to be able to generate the RPM on 17.02.9/Lua 5.2; it throws 
errors similar to what I had seen earlier on the original files from Mark Grondo.

> rpmbuild --define "_sysconfdir /etc/opt" --define "__cc /usr/bin/cc 
> $(pkg-config --cflags slurm)" -ta slurm-spank-lua-0.38.tar.gz
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.bUiy3F
+ umask 022
+ cd /users/builduser/rpmbuild/BUILD
+ cd /users/builduser/rpmbuild/BUILD
+ rm -rf slurm-spank-lua-0.38
+ /usr/bin/gzip -dc 
/users/builduser/workarea/job_environment/slurm-spank-lua-0.38.tar.gz
+ /bin/tar -xf -
+ STATUS=0
+ '[' 0 -ne 0 ']'
+ cd slurm-spank-lua-0.38
+ /usr/bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ exit 0
Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.p3xXXA
+ umask 022
+ cd /users/builduser/rpmbuild/BUILD
+ /usr/bin/rm -rf /tmp/builduser/BUILDROOT
++ dirname /tmp/builduser/BUILDROOT
+ /usr/bin/mkdir -p /tmp/builduser
+ /usr/bin/mkdir /tmp/builduser/BUILDROOT
+ cd slurm-spank-lua-0.38
+ /usr/bin/cc -I/opt/slurm/17.02.9/include -g -o lua.o -fPIC -c lua.c
lua.c: In function ‘lua_script_create’:
lua.c:1048:31: error: ‘LUA_GLOBALSINDEX’ undeclared (first use in this function)
 lua_pushvalue (script->L, LUA_GLOBALSINDEX);
   ^
lua.c:1048:31: note: each undeclared identifier is reported only once for each 
function it appears in
error: Bad exit status from /var/tmp/rpm-tmp.p3xXXA (%build)


RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.p3xXXA (%build)

Does anybody have an idea how to get this to build and work? I’ve done some 
research, but my knowledge of Lua is very basic and the things I’ve tried ([1] 
and [2]) did not help at all. In fact, I’ve changed the source to look like 
this:

> diff -Naur slurm-spank-lua-0.38-modified/lua.c slurm-spank-lua-0.38/lua.c
--- slurm-spank-lua-0.38-modified/lua.c 2018-01-23 13:33:15.808026439 +0100
+++ slurm-spank-lua-0.38/lua.c  2018-01-22 19:56:54.0 +0100
@@ -1045,8 +1045,7 @@
  *   table.
  */
 lua_pushstring (script->L, "__index");
-lua_setglobal (script->L, "__index");
-//lua_pushvalue (script->L, LUA_GLOBALSINDEX);
+lua_pushvalue (script->L, LUA_GLOBALSINDEX);
 lua_settable (script->L, -3);

 /*  Now set metatable for the new globals table */
@@ -1055,7 +1054,7 @@
 /*  And finally replace the globals table with the (empty)  table
  *   now at top of the stack
  */
-//lua_replace (script->L, LUA_GLOBALSINDEX);
+lua_replace (script->L, LUA_GLOBALSINDEX);

 return script;
 }

And, although it initially builds, in the end it dies miserably:

> srun --help
PANIC: unprotected error in call to Lua API (attempt to index a nil value)
Aborted

Thanks,
M.

[1] https://stackoverflow.com/questions/9057943/porting-to-lua-5-2-lua-globalsindex-trouble?rq=1
[2] https://stackoverflow.com/questions/10087226/lua-5-2-lua-globalsindex-alternative

> On 23 Jan 2018, at 00:20, Kilian Cavalotti  
> wrote:
> 
> Hi all,
> 
> We (Stanford Research Computing Center) developed a SPANK plugin which
> allows users to choose the GPU compute mode [1] for their jobs.
> [1] http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-modes
> 
> This came from the need to give our users some control on the way GPUs
> are set, so they could run specific applications requiring a given
> mode, while providing defaults optimized for our general environment.
> 
> We figured it could be of interest to others, so we released our Slurm
> SPANK GPU Compute Mode plugin at
> https://github.com/stanford-rc/slurm-spank-gpu_cmode
> 
> Feel free to give it a try, and don't hesitate to contact us if you
> have any question.
> 
> 
> Cheers,
> -- 
> Kilian
> 





[slurm-users] No mysql and plugins created by rpmbuild -ta slurm-17.11.2.tar.gz

2018-01-23 Thread HM Li

Dear all,
I have tried rpmbuild -ta slurm-17.11.2.tar.gz, but no slurm-sql or 
slurm-plugins packages are created on CentOS 7.4.
However, MySQL support is detected correctly when I run 
./configure --with-mysql_config=/usr/bin, or when I build with 
rpmbuild -ta slurm-17.02.2.tar.gz on the same server.

Can you help me?
Thank you very much.





Re: [slurm-users] No mysql and plugins created by rpmbuild -ta slurm-17.11.2.tar.gz

2018-01-23 Thread Marcus Wagner

Hi Li,

the structure of the RPMs has changed in 17.11:

slurm-17.11.*.rpm contains the client/user binaries, together with the 
docs and the plugins
slurm-slurmd-17.11.*.rpm contains slurmd, slurmstepd and the 
slurmd.service unit
slurm-slurmctld-17.11.*.rpm contains slurmctld and the slurmctld.service unit
...

So: slurm goes on all Slurm hosts, including submit hosts,
slurm-slurmd on the compute nodes,
slurm-slurmctld on the master / backup controller,
slurm-slurmdbd on the master / backup database host.


Best
Marcus


On 01/23/2018 02:46 PM, HM Li wrote:

Dear all,
I have tried rpmbuild -ta slurm-17.11.2.tar.gz, but no slurm-sql or 
slurm-plugins packages are created on CentOS 7.4.
However, MySQL support is detected correctly when I run 
./configure --with-mysql_config=/usr/bin, or when I build with 
rpmbuild -ta slurm-17.02.2.tar.gz on the same server.

Can you help me?
Thank you very much.





--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de




Re: [slurm-users] Slurm SPANK GPU Compute Mode plugin

2018-01-23 Thread Kilian Cavalotti
Hi Miguel,

On Tue, Jan 23, 2018 at 4:41 AM, Miguel Gila  wrote:
> Hi Kilian, a question on this: which version of Slurm/Lua are you running
> this against??

Slurm 17.11.x and Lua 5.1

> I don’t seem able to generate the RPM on 17.02.9/Lua 5.2 ; it throws similar
> errors to what I had seen earlier on the original files from Mark Grondo.

> + /usr/bin/cc -I/opt/slurm/17.02.9/include -g -o lua.o -fPIC -c lua.c
> lua.c: In function ‘lua_script_create’:
> lua.c:1048:31: error: ‘LUA_GLOBALSINDEX’ undeclared (first use in this
> function)
>  lua_pushvalue (script->L, LUA_GLOBALSINDEX);

Ah right, LUA_GLOBALSINDEX seems to have been removed in Lua 5.2. :(

I'm not exactly sure what the best approach to fix this would be.
Since that code comes directly, unmodified, from
https://github.com/grondo/slurm-spank-plugins, maybe it would be worth
reporting the issue there?
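
In the meantime, here is a rough, untested sketch of the usual
compatibility shim for this situation (the push_globals/replace_globals
names are made up for illustration): on Lua >= 5.2 the globals table lives
in the registry under LUA_RIDX_GLOBALS instead of behind the
LUA_GLOBALSINDEX pseudo-index, so the two calls in lua_script_create()
could be wrapped along these lines:

#include <lua.h>

#if LUA_VERSION_NUM >= 502   /* Lua 5.2+: LUA_GLOBALSINDEX is gone */
/* stands in for lua_pushvalue (L, LUA_GLOBALSINDEX) */
#define push_globals(L)     lua_pushglobaltable (L)
/* stands in for lua_replace (L, LUA_GLOBALSINDEX): pops the table on top
 * of the stack and installs it as the new globals table in the registry.
 * Caveat: chunks that are already loaded keep their old _ENV upvalue, so
 * this is not a drop-in equivalent in every situation. */
#define replace_globals(L)  lua_rawseti (L, LUA_REGISTRYINDEX, LUA_RIDX_GLOBALS)
#else                        /* Lua 5.1: keep the original code path */
#define push_globals(L)     lua_pushvalue (L, LUA_GLOBALSINDEX)
#define replace_globals(L)  lua_replace (L, LUA_GLOBALSINDEX)
#endif

lua_script_create() would then call push_globals (script->L) where it
currently calls lua_pushvalue (script->L, LUA_GLOBALSINDEX), and
replace_globals (script->L) where it calls lua_replace (script->L,
LUA_GLOBALSINDEX). I haven't tried this against the plugin, so treat it as
a starting point only. (As an aside, the change in your diff sets a global
string named "__index" instead of pushing the globals table, so the
metatable never gets its __index entry, which would probably explain the
nil-index panic from srun.)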

Cheers,
-- 
Kilian



[slurm-users] One node does not terminate simple hostname job

2018-01-23 Thread Julien Tailleur

Dear all,

first of all, I am new to Slurm and this ML; please accept my apologies 
if I do not provide all the needed information. I am setting up a small 
cluster under Debian. I have Slurm & munge installed and configured, and 
the controller and daemons run fine on the master node and compute 
nodes, respectively. I have thus reached the "srun -Nx /bin/hostname" 
stage, and I have a weird problem...


I have 16 DELL servers, FX11-14, FX21-24, FX31-34 and FX41-44.

If I create a partition with every node but FX11, the command

srun -N15 /bin/hostname

runs smoothly, without any lag. When I make a partition that includes 
FX11, I see weird behaviour. If I run


srun -N16 /bin/hostname

I get the correct answer:

:~# srun -N16 /bin/hostname
FX41
FX13
FX14
FX12
FX34
FX42
FX22
FX43
FX23
FX44
FX24
FX11
FX31
FX32
FX33
FX21

But if I run sinfo, the FX11 node is stuck in the "comp" state (is this 
"completing"?)


:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
First*   up   infinite  1   comp FX11
First*   up   infinite 15   idle FX[12-14,21-24,31-34,41-44]

If I wait long enough, it becomes available again, and I can run the 
same command again. If I try to run the command twice in quick succession, I get stuck:


:~# srun -N16 /bin/hostname
FX34
FX42
FX32
FX13
FX21
FX23
FX22
FX41
FX12
FX43
FX11
FX44
FX31
FX33
FX24
FX14
root@kandinsky:~# srun -N16 /bin/hostname
srun: job 802 queued and waiting for resources

[long pause]

srun: error: Lookup failed: Unknown host
srun: job 802 has been allocated resources
FX23
FX42
FX11
FX12
FX33
FX44
FX43
FX22
FX21
FX24
FX13
FX41
FX34
FX31
FX32
FX14

If I do a sinfo during the long pause, I get:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
First*   up   infinite  1   comp FX11
First*   up   infinite 15   idle FX[12-14,21-24,31-34,41-44]

I have run slurmd in verbose mode on a node that does not cause problems 
(FX14) and on FX11. The (very long) details are below. In practice, it 
seems that FX11 does not manage to terminate the job, and I have no idea 
why. I have cut sequences of identical lines, replacing them with [...] so 
that the message remains readable.


When I run the two srun commands, up to the point where the second one 
gets stuck, this is what I see on FX14:


slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6001
slurmd-FX14: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd-FX14: launch task 801.0 request from 0.0@192.168.6.1 (port 33479)
slurmd-FX14: debug3: state for jobid 4: ctime:1505731239 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 6: ctime:1505745376 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 7: ctime:1505745381 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 20: ctime:1513002753 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 21: ctime:1516358481 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 24: ctime:1516358515 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 25: ctime:1516358518 revoked:0 
expires:0


[...]

slurmd-FX14: debug3: state for jobid 793: ctime:1516379034 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 794: ctime:1516379184 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 796: ctime:1516379297 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 797: ctime:1516379369 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 798: ctime:1516379701 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 799: ctime:1516379772 revoked:0 
expires:0
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 
revoked:1516379920 expires:1516379920

slurmd-FX14: debug3: destroying job 800 state
slurmd-FX14: debug3: state for jobid 800: ctime:1516379920 revoked:0 
expires:0

slurmd-FX14: debug:  Checking credential with 364 bytes of sig data
slurmd-FX14: debug:  task_p_slurmd_launch_request: 801.0 3
slurmd-FX14: _run_prolog: run job script took usec=4
slurmd-FX14: _run_prolog: prolog with lock for job 801 ran for 0 seconds
slurmd-FX14: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd-FX14: debug3: slurmstepd rank 3 (FX14), parent rank 1 (FX12), 
children 0, depth 2, max_depth 2

slurmd-FX14: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd-FX14: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd-FX14: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd-FX14: debug:  task_p_slurmd_reserve_resources: 801 3
slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6004
slurmd-FX14: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd-FX14: debug:  _rpc_signal_tasks: sending signal 995 to step 801.0 
flag 0

slurmd-FX14: debug3: in the service_connection
slurmd-FX14: debug2: got this type of message 6011
slurmd-FX14: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd-FX14: debug:  _rpc_terminate_job, uid = 64030
slurmd-FX14: debug:  task_p_slurmd_release_resources: 801
slurmd-FX14: debug:  credential for job 801 revoked
slur

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-23 Thread James A. Peltier
We put SSSD caches on a RAMDISK which helped a little bit with performance.
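For context, the hot path in the backtrace further down essentially boils
down to the following (a simplified illustration, not slurmctld's actual
code): for every group listed in AllowGroups=, slurmctld fetches the member
list and then resolves each member name back to a UID with its own
getpwnam_r() call, so every uncached entry can mean a round trip to sssd.

/* Simplified sketch of why AllowGroups= gets expensive with a remote
 * NSS backend: one getpwnam_r() lookup per group member. */
#include <grp.h>
#include <pwd.h>
#include <stdio.h>

static void resolve_group_members(const char *group_name)
{
    char gbuf[1 << 16], pbuf[1 << 14];
    struct group grp, *g = NULL;
    struct passwd pwd, *p = NULL;

    if (getgrnam_r(group_name, &grp, gbuf, sizeof(gbuf), &g) != 0 || !g)
        return;

    for (char **m = g->gr_mem; *m != NULL; m++) {
        /* One NSS query per member -- with sssd and a large group this
         * is where the time goes when entries are not cached. */
        if (getpwnam_r(*m, &pwd, pbuf, sizeof(pbuf), &p) == 0 && p)
            printf("%s -> uid %u\n", *m, (unsigned)p->pw_uid);
    }
}

int main(void)
{
    resolve_group_members("g2");   /* group name taken from the backtrace below */
    return 0;
}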

- On 22 Jan, 2018, at 02:38, Alessandro Federico a.feder...@cineca.it wrote:

| Hi John,
| 
| just an update...
| we do not have a solution for the SSSD issue yet, but we changed the ACL
| on the 2 partitions from AllowGroups=g2 to AllowAccounts=g2 and the
| slowdown has gone.
| 
| Thanks for the help
| ale
| 
| - Original Message -
|> From: "Alessandro Federico" 
|> To: "John DeSantis" 
|> Cc: hpc-sysmgt-i...@cineca.it, "Slurm User Community List"
|> , "Isabella Baccarelli"
|> 
|> Sent: Wednesday, January 17, 2018 5:41:54 PM
|> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv
|> operation
|> 
|> Hi John
|> 
|> thanks for the infos.
|> We are investigating the slowdown of sssd, and I found some bug
|> reports regarding slow sssd queries
|> with almost the same backtrace. Hopefully an update of sssd could
|> solve this issue.
|> 
|> We'll let you know if we find a solution.
|> 
|> thanks
|> ale
|> 
|> - Original Message -
|> > From: "John DeSantis" 
|> > To: "Alessandro Federico" 
|> > Cc: "Slurm User Community List" ,
|> > "Isabella Baccarelli" ,
|> > hpc-sysmgt-i...@cineca.it
|> > Sent: Wednesday, January 17, 2018 3:30:43 PM
|> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
|> > send/recv operation
|> > 
|> > Ale,
|> > 
|> > > As Matthieu said it seems something related to SSS daemon.
|> > 
|> > That was a great catch by Matthieu.
|> > 
|> > > Moreover, only 3 SLURM partitions have the AllowGroups ACL
|> > 
|> > Correct, which may seem negligible, but after each `scontrol
|> > reconfigure`, slurmctld restart, and/or AllowGroups= partition update,
|> > the mapping of UIDs for each group will be updated.
|> > 
|> > > So why does the UID-GID mapping take so long?
|> > 
|> > We attempted to use "AllowGroups" previously, but we found (even
|> > with
|> > sssd.conf tuning regarding enumeration) that unless the group was
|> > local
|> > (/etc/group), we were experiencing delays before the AllowGroups
|> > parameter was respected.  This is why we opted to use SLURM's
|> > AllowQOS/AllowAccounts instead.
|> > 
|> > You would have to enable debugging on your remote authentication
|> > software to see where the hang-up is occurring (if it is that at
|> > all,
|> > and not just a delay with the slurmctld).
|> > 
|> > Given the direction that this is going - why not replace the
|> > "AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="?
|> > 
|> > > @John: we defined many partitions on the same nodes but in the
|> > > production cluster they will be more or less split across the 6K
|> > > nodes.
|> > 
|> > Ok, that makes sense.  Looking initially at your partition
|> > definitions,
|> > I immediately thought of being DRY, especially since the "finer"
|> > tuning
|> > between the partitions could easily be controlled via the QOS'
|> > allowed
|> > to access the resources.
|> > 
|> > John DeSantis
|> > 
|> > On Wed, 17 Jan 2018 13:20:49 +0100
|> > Alessandro Federico  wrote:
|> > 
|> > > Hi Matthieu & John
|> > > 
|> > > this is the backtrace of slurmctld during the slowdown
|> > > 
|> > > (gdb) bt
|> > > #0  0x7fb0e8b1e69d in poll () from /lib64/libc.so.6
|> > > #1  0x7fb0e8617bfa in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
|> > > #2  0x7fb0e86185a3 in sss_nss_make_request () from /lib64/libnss_sss.so.2
|> > > #3  0x7fb0e8619104 in _nss_sss_getpwnam_r () from /lib64/libnss_sss.so.2
|> > > #4  0x7fb0e8aef07d in getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
|> > > #5  0x7fb0e9360986 in _getpwnam_r (result=, bufsiz=, buf=, pwd=, name=) at uid.c:73
|> > > #6  uid_from_string (name=0x1820e41 "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111
|> > > #7  0x0043587d in get_group_members (group_name=0x10ac500 "g2") at groups.c:139
|> > > #8  0x0047525a in _get_groups_members (group_names=) at partition_mgr.c:2006
|> > > #9  0x00475505 in _update_part_uid_access_list (x=0x7fb03401e650, arg=0x7fff07f13bf4) at partition_mgr.c:1930
|> > > #10 0x7fb0e92ab675 in list_for_each (l=0x1763e50, f=f@entry=0x4754d8 <_update_part_uid_access_list>, arg=arg@entry=0x7fff07f13bf4) at list.c:420
|> > > #11 0x0047911a in load_part_uid_allow_list (force=1) at partition_mgr.c:1971
|> > > #12 0x00428e5c in _slurmctld_background (no_data=0x0) at controller.c:1911
|> > > #13 main (argc=, argv=) at controller.c:601
|> > > 
|> > > As Matthieu said it seems something related to SSS daemon.
|> > > However we don't notice any slowdown due to SSSd in our
|> > > environment.
|> > > As I told you before, we are just testing SLURM on a small 100-node
|> > > cluster before going into production with about 6000 nodes next
|> > > Wednesday. At present the other nodes are managed by PBSPro and the
|> > > 2 PBS serve