Re: [slurm-users] Slurm Upgrade from 17.02

2020-02-20 Thread Steven Senator (slurm-dev-list)
When upgrading to 18.08 it is prudent to add the following lines to your
/etc/my.cnf, as per
  https://slurm.schedmd.com/accounting.html
  https://slurm.schedmd.com/SLUG19/High_Throughput_Computing.pdf (slide #6)

[mysqld]
innodb_buffer_pool_size=1G
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

If the node on which mysql is running has sufficient memory, you may
want to increase innodb_buffer_pool_size beyond 1G; that is just the
minimum threshold below which slurm complains. We use 8G, for example,
because it fits our churn rate for {job arrival, job dispatch to run
state} in RAM and our nodes have enough RAM to accommodate an 8G
cache. (References on tuning are listed below.)
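
After restarting mysqld/mariadb you can confirm what the server is
actually using (a quick check; adjust credentials for your site):

  mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
  mysql -e "SHOW VARIABLES LIKE 'innodb_log_file_size';"
  mysql -e "SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';"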

When you reset this, you will also need to remove the previous innodb
caches, which are probably in /var/lib/mysql. When we did this we
removed and recreated the slurm_acct_db, although that was partially
motivated by the fact that this coincided with an OS and database
patch upgrade and a major accounting and allocation cycle.
  0. Stop slurmctld, slurmdbd.
  1. Create a dump of your database. (mysqldump ...)
  2. Verify that the dump is complete and valid.
  3. Remove the slurm_acct_db. (mysql -e "drop database slurm_acct_db;")
  4. Stop your mysql instance cleanly.
  5. Check the logs. Verify that the mysql instance was stopped cleanly.
  6. rm /var/lib/mysql/ib_logfile? /var/lib/mysql/ibdata1
  7. Put the new lines as above into /etc/my.cnf with the log file
sized appropriately.
  8. Start mysql.
  9. Verify it started cleanly.
  10. Restart slurmdbd manually, possibly in non-daemon mode.
(slurmdbd -D -vv)
  11. sacctmgr create cluster <clustername>

If you want to restore the data back into the database, do it
*before* step 10 (the slurmdbd restart) so that the schema conversion
can be performed; a sketch of the dump and restore commands is below.
I like using multiple "-v" flags so that I can see some of the
messages as that conversion process proceeds.
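
For reference, a minimal sketch of the dump/verify/restore commands
(the dump path is a hypothetical example):

  # Step 1: dump the accounting database
  mysqldump --single-transaction --databases slurm_acct_db > /root/slurm_acct_db.sql
  # Step 2: quick sanity check; a complete dump ends with "-- Dump completed ..."
  tail -n 1 /root/slurm_acct_db.sql
  # Optional restore, done before step 10 so slurmdbd can convert the schema
  mysql < /root/slurm_acct_db.sql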

Some references on mysql innodb_buffer_pool_size tuning:
  
https://scalegrid.io/blog/calculating-innodb-buffer-pool-size-for-your-mysql-server/
  https://mariadb.com/kb/en/innodb-system-variables/#innodb_buffer_pool_size
  https://mariadb.com/kb/en/innodb-buffer-pool/
  https://www.percona.com/blog/2015/06/02/80-ram-tune-innodb_buffer_pool_size/
  https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool-resize.html

Hope this helps,
 -Steve Senator

On Wed, Feb 19, 2020 at 7:12 AM Ricardo Gregorio
 wrote:
>
> hi all,
>
>
>
> I am putting together an upgrade plan for slurm on our HPC. We are currently 
> running old version 17.02.11. Would you guys advise us upgrading to 18.08 or 
> 19.05?
>
>
>
> I understand we will have to also upgrade the version of mariadb from 5.5 to 
> 10.X and pay attention to 'long db upgrade from 17.02 to 18.X or 19.X' and 
> 'bug 6796' amongst other things.
>
>
>
> We would appreciate your comments/recommendations
>
>
>
> Regards,
>
> Ricardo Gregorio
>
> Research and Systems Administrator
>
> Operations ITS
>
>
>
>
>
>



Re: [slurm-users] Problem with permisions. CentOS 7.8

2020-05-28 Thread Steven Senator (slurm-dev-list)
What is in /var/log/munge/munged.log?
Munge is quite strict about permissions in its whole hierarchy of
control and configuration files, appropriately.
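
A quick sketch of what to check (paths are the usual EL7 defaults; the
account munged runs as may differ on your systems):

  ls -ld /etc/munge /var/log/munge /var/lib/munge /run/munge
  ls -l /etc/munge/munge.key
  # munged wants these owned by the user it runs as and not writable by
  # group/other; for a dedicated "munge" account that would be roughly:
  #   chown -R munge:munge /etc/munge /var/log/munge /var/lib/munge /run/munge
  #   chmod 700 /etc/munge /var/log/munge /var/lib/munge
  #   chmod 400 /etc/munge/munge.key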

On Thu, May 28, 2020 at 11:01 AM Rodrigo Santibáñez
 wrote:
>
> Hello,
>
> You could find the solution here
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
>
> Best regards
>
> On Thu., May 28, 2020 at 12:55, Ferran Planas Padros 
> wrote:
>>
>> Hello,
>>
>>
>> I have installed Slurm under CentOS 7.8 Kernel 3.10.0-1127.el7.x86_64. The 
>> Slurm version I have installed is 14.03.3 along with Munge 0.5.11.
>>
>> I know these are not the latest versions, but I wanted to have consistency 
>> in all my nodes.
>>
>>
>> I am facing a problem when I try to start Munge in the nodes with CentOS 
>> 7.8. Anytime I want to start Munge (service munge start) I get the error 
>> that I detail below.
>>
>>
>> 'munged[36385]: munged: Error: Failed to check logfile 
>> "/var/log/munge/munged.log": Permission denied'
>>
>>
>> I run the command as slurm user, and the /var/log/munge folder does belong 
>> to slurm.
>>
>>
>> I have the exact same setup on older nodes which are working under CentOS 
>> 6.6 and 6.5, and there I have no problem. At the end of the year we plan to 
>> migrate all nodes to CentOS 7.x, and I really need to understand what is 
>> happening here.
>>
>>
>> Best,
>>
>> Ferran



Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Steven Senator (slurm-dev-list)
Also consider the --no-kill ("-k") option to sbatch (and srun).
The following is from the sbatch man page:

 -k, --no-kill[=off]
      Do not automatically terminate a job if one of the nodes it has
      been allocated fails.  The user will assume the responsibilities
      for fault-tolerance should a node fail.  When there is a node
      failure, any active job steps (usually MPI jobs) on that node
      will almost certainly suffer a fatal error, but with --no-kill,
      the job allocation will not be revoked so the user may launch
      new job steps on the remaining nodes in their allocation.

      Specify an optional argument of "off" to disable the effect of
      the SBATCH_NO_KILL environment variable.

On-compute job submission could be your friend for these kinds of
cases.

An explorer job may validate every node, compose the set of nodes that
are appropriate for a given application, and then submit a subsequent
job dependent upon the explorer job. This subsequent job could have an
explicit node list (-w ...), an exclude list (-x ...), or neither.
These are most useful for much larger and/or higher-priority jobs,
such as those running with a specific license, reservation, or
partition. Alternatively, the daughter jobs can be made dependent on
the explorer job (--depend=afterok:) or the nodes which aren't
appropriate could have other actions initiated on them (ex. sbatch
--reboot -w  node-diagnostic-script.sh). A sketch of this pattern
follows.
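
A minimal sketch of the pattern (work.sh and the my_healthcheck
command are hypothetical placeholders, and it assumes hostnames match
Slurm node names):

  #!/bin/bash
  # explorer.sh -- probe the allocation, then submit the real job on-compute
  #SBATCH -N 16
  #SBATCH --no-kill
  # One task per node; a node prints its hostname only if it passes the check.
  # (The 'if' keeps each task's exit status 0, so afterok below is satisfied.)
  srun --ntasks-per-node=1 bash -c 'if my_healthcheck; then hostname; fi' > good_nodes.txt
  # Submit the real job, pinned to the nodes that passed, and dependent on
  # this explorer job finishing successfully.
  sbatch --dependency=afterok:${SLURM_JOB_ID} \
         -w "$(paste -sd, good_nodes.txt)" work.sh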



On Thu, Jun 4, 2020 at 3:51 PM Rodrigo Santibáñez <
rsantibanez.uch...@gmail.com> wrote:

> What about, instead of (automatic) requeue of the job, using --no-requeue in
> the first sbatch, and when something goes wrong with the job (why not
> something wrong with the node?) submitting the job again with --no-requeue
> and the excluded nodes?
>
> something like: sbatch --no-requeue file.sh, and then sbatch --no-requeue
> --exclude=n001 file.sh (options on the command line override the options
> inside the script)
>
> On Thu., Jun 4, 2020 at 17:40, Ransom, Geoffrey M. (<
> geoffrey.ran...@jhuapl.edu>) wrote:
>
>>
>>
>> Not quite.
>>
>> The user’s job script in question is checking the error status of the
>> program it ran while it is running. If a program fails the running job
>> wants to exclude the machine it is currently running on and requeue itself
>> in case it died due to a local machine issue that the scheduler has not
>> flagged as a problem.
>>
>>
>>
>> The current goal is to have a running job step in an array job add the
>> current host to its exclude list and requeue itself when it detects a
>> problem. I can’t seem to modify the exclude list while a job is running,
>> but once the task is requeued and back in the queue it is no longer running
>> so it can’t modify its own exclude list.
>>
>>
>>
>> I.e…. put something like the following into a sbatch script so each task
>> can run it against itself.
>>
>>
>>
>> if ! $runprogram $args ; then
>>   NewExcNodeList="${ExcNodeList},${HOSTNAME}"
>>   scontrol update job ${SLURM_JOB_ID} ExcNodeList=${NewExcNodeList}
>>   scontrol requeue ${SLURM_JOB_ID}
>>   sleep 10
>> fi
>>
>>
>>
>>
>>
>>
>>
>> *From:* slurm-users  *On Behalf
>> Of *Rodrigo Santibáñez
>> *Sent:* Thursday, June 4, 2020 4:16 PM
>> *To:* Slurm User Community List 
>> *Subject:* [EXT] Re: [slurm-users] Change ExcNodeList on a running job
>>
>>
>>
>>
>>
>>
>> Hello,
>>
>>
>>
>> Jobs can be requeue if something wrong happens, and the node with failure
>> excluded by the controller.
>>
>>
>>
>> *--requeue*
>>
>> Specifies that the batch job should be eligible for requeuing. The job
>> may be requeued explicitly by a system administrator, after node failure,
>> or upon preemption by a higher priority job. When a job is requeued, the
>> batch script is initiated from its beginning. Also see the *--no-requeue*
>> option. The *JobRequeue* configuration parameter controls the default
>> behavior on the cluster.
>>
>>
>>
>> Also, jobs can be run selecting a specific node or excluding nodes
>>
>>
>>
>> *-w*, *--nodelist*=<*node name list*>
>>
>> Request a specific list of hosts. The job will contain *all* of these
>> hosts and possibly additional hosts as needed to satisfy resource
>> requirements. The list may be specified as a comma-separated list of hosts,
>> a range of hosts (host[1-5,7,...] for example), or a filename. The host
>> list will be assumed to be a filename if it contains a "/" character. If
>> you specify a minimum node or processor count larger than can be satisfied
>> by the supplied host list, additional resources will be allocated on other
>> nodes as needed. Duplicate node names in the list will be ignored. The
>> order of the node names in the list is not important; the node names will
>> be sorted by Slurm.
>>
>>
>>
>> *-x*, *--exclude*=<*node name list*>
>>
>> Explicitly exclude certain

Re: [slurm-users] How to queue jobs based on non-existent features

2020-08-14 Thread Steven Senator (slurm-dev-list)
We use a scenario analogous to yours, using features. Features are
defined in slurm.conf and are associated with the nodes from which a
job may be submitted, as an administratively, configuration-managed
authoritative source. (NodeName=xx-login State=FUTURE
AvailableFeatures=<features>, i.e. <features>={green,blue,orange,etc.})

The job prolog sets the node's features to those specified by the
<features> tag. The slurm.conf has PrologFlags=Alloc set. It also runs
whatever configuration scripts are needed to implement the set of
features. These scripts must be able to either succeed or fail fast
within the prolog timeout, so we pre-prime configurations and just
flip the node from one to another. (A sketch follows.)
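
A rough sketch of the two pieces (the script path, the hypothetical
configure_feature helper, and the availability of SLURM_JOB_CONSTRAINTS
in the prolog environment are assumptions; check the prolog_epilog
documentation for your Slurm version):

  # slurm.conf (fragment)
  PrologFlags=Alloc
  Prolog=/etc/slurm/prolog_features.sh

  # /etc/slurm/prolog_features.sh -- runs on each allocated node
  #!/bin/bash
  want="${SLURM_JOB_CONSTRAINTS}"   # features the job asked for, e.g. "green"
  [ -n "$want" ] || exit 0          # nothing requested; leave the node alone
  # Flip the node to the pre-primed configuration; must succeed or fail fast,
  # since a non-zero exit from the prolog drains the node.
  /usr/local/sbin/configure_feature "$want" || exit 1
  # Assumes the short hostname matches the Slurm NodeName.
  scontrol update NodeName="$(hostname -s)" ActiveFeatures="$want"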

There is also an older version which required a minimal job-submit
plugin to populate the job's AdminComment with the feature from the
'AllocNode'. However, that is now used mainly for accounting and
user-interface convenience.

That said, if I were to reimplement this, I would look seriously at
the interfaces and hooks used to connect to dynamically-provisioned
nodes such as slurm's ability to provision Google Cloud provided
nodes. (https://github.com/SchedMD/slurm-gcp,
https://cloud.google.com/solutions/deploying-slurm-cluster-compute-engine,
https://slurm.schedmd.com/SLUG19/Slurm_+_GCP.pdf). Cloud-bursting into
a freshly or dynamically provisioned node matches your use case. The
major difference is that your pool of nodes is nearby and yours.

Hope this helps,
-Steve

On Thu, Aug 13, 2020 at 7:19 PM Thomas M. Payerle  wrote:
>
> I have not had a chance to look at your code, but find it intriguing, 
> although I am not sure about use cases.  Do you do anything to lock out other 
> jobs from the affected node?
> E.g., you submit a job with unsatisfiable constraint foo.
> The tool scanning the cluster detects a job queued with foo constraint, and 
> sees node7 is idle, so does something to node7 so it can satisfy foo.
> However, before your job starts, my queued job starts running on node7 (maybe 
> my job needs 2 nodes and only one was free at time the scanning tool chose 
> node7).
> If the change needed for the foo feature is harmless to my job, then it is 
> not a big deal, other than your job is queued longer (and maybe the scanning 
> tool makes another foo node) ---
> but in that case why not make all nodes able to satisfy foo all the time?
>
> Maybe add a feature "generic" and have a job plugin that adds the generic 
> feature if no other feature requested, and have the scanning tool remove 
> generic when it adds foo.
> (And presumably scanning tool will detect when no more jobs pending jobs with 
> foo feature set and remove it from any idle nodes, both in actual node 
> modification and in Slurm, and
> then add the generic feature back).
> Though I can foresee possible abuses (I have a string of jobs and the cluster 
> is busy.  My jobs don't mind a foo node, so I submit them requesting foo.  
> Once idle nodes are converted to foo nodes, I get an almost de facto 
> reservation on the foo nodes)
>
> But again, I am having trouble seeing real use cases.  Only one I can think 
> of is maybe if you want to make different OS versions available; e.g. the cluster 
> is normally all CentOS, but if a job has a ubuntu20 flag, then the scanning 
> tool can take an idle node, drain it, reimage as ubuntu20, add ubuntu20 flag, 
> and undrain.
> I
>
> On Thu, Aug 13, 2020 at 7:05 PM Raj Sahae  wrote:
>>
>> Hi All,
>>
>>
>>
>> I have developed a first solution to this issue that I brought up back in 
>> early July. I don't think it is complete enough to be the final solution for 
>> everyone but it does work and I think it's a good starting place to showcase 
>> the value of this feature and iterate for improvement. I wanted to let the 
>> list know in case anyone was interested in trying it themselves.
>>
>>
>>
>> In short, I was able to make minimal code changes to the slurmctld config 
>> and job scheduler such that I can:
>>
>> Submit HELD jobs into the queue with sbatch, with invalid constraints, 
>> release the job with scontrol, and have it stay in the queue but not 
>> allocated.
>> Scan the queue with some other tool, make changes to the cluster as needed, 
>> update features, and the scheduler will pick up the new feature changes and 
>> schedule the job appropriately.
>>
>>
>>
>> The patch of my code changes is attached 
>> (0001-Add-a-config-option-allowing-unavailable-constraints.patch). I 
>> branched from the tip of 20.02 at the time, commit 34c96f1a2d.
>>
>>
>>
>> I did attempt to do this with plugins at first but after creating skeleton 
>> plugins for a node_feature plugin and a scheduler plugin, I realized that 
>> the constraint check that occurs in the job scheduler happens before any of 
>> those plugins are called.
>>
>>
>>
>> According to the job launch logic flow 
>> (https://slurm.schedmd.com/job_launch.html) perhaps I could do something in 
>> the job submit plugin but at that point I had spent 4 days playing with the 
>> plugin code and 

Re: [slurm-users] How to throttle sinfo/squeue/scontrol show so they don't throttle slurmctld

2020-08-17 Thread Steven Senator (slurm-dev-list)
The slurm scheduler only locks out user requests when specific data
structures are locked due to modification, or potential modification.
So the most effective technique is to limit the time windows when that
happens, through a combination of efficient traversal of the main
scheduling loop (when the list itself may be modified) and limits on
the longer time windows when state may be in flux from resources, such
as node slurmd to controller slurmctld RPCs.

First, please become familiar with the sdiag command and its output
(a brief sketch follows this paragraph). There is a huge unknown in
any answer from anyone who is not on your systems: we don't know the
job mixture that is submitted to your clusters. A predictable pattern
of job submissions is ideal, because you will be able to optimize
slurm's parameters. This doesn't necessarily mean a constant pattern.
It may mean that you recognize that your daytime load is different
from your nighttime load, which is different from your end-of-semester
load. Then you can influence user behavior with reservations, QOS
and/or partitions, so that jobs in those different loads leave strong
hints, through the use of reservations, QOS, partitions or specific
accounts, that can be recognized by the scheduler.
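
For example, a quick way to sample it (the grep pattern assumes the
usual section headings in sdiag output; adjust to taste):

  sdiag | egrep -i 'server thread|agent queue|backfill|mean cycle|last cycle'
  sdiag -r    # reset the counters, then sample again after a busy period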

These presentations provide guidance:
  https://slurm.schedmd.com/SUG14/sched_tutorial.pdf <-- start here
  https://slurm.schedmd.com/SLUG19/Troubleshooting.pdf
There are also relevant tickets in bugs.schedmd.com where you may
compare your cluster's characteristics, role, job load, etc. with
similar reported situations. Since the submitted job load {type,
characteristics, frequency} dominates scheduler behavior, it is not
possible to provide one-size-fits-all guidelines. Look for ways that
your site is similar to the use cases in the tickets.

In our particular case, we did the following:
1) We found that the average number of jobs in the queue could be
traversed in 2.5 minutes, so we increased bf_interval to ~4 minutes
(to handle variability of the load)
2) there was almost always a small job that could be backfill-scheduled
3) we limited the bf_max_job_user_part to ~30, depending on whether
our users used --time-min and minnodes, which together make for
efficient backfill scheduling, so that even if a small # of users take
advantage of the backfill scheduler, they don't appear to take over
the machine. From a scheduling perspective, this isn't a bad thing,
but it makes for many headaches for user support folks who have to
answer why specific users can dominate the system.
4) Set bf_max_job_test= and default_queue_depth= so that the full
queue can be traversed, but so that there's a cutoff if the queue is
huge and there are too many potential backfillable jobs; set
bf_continue with these limits so jobs near the bottom of the queue
don't become starved.
5) Measure the number of active RPCs with both sdiag and ncat or
similar tools; consider increasing max_rpc_cnt based on these
measurements.
6) Some of the guidelines in the high-frequency/high-throughput
computing guidance may be helpful, esp. increasing SOMAXCONN and other
TCP tuning. I would suggest caution when changing these, as they
obviously affect many subsystems in addition to slurm. Unless you have
a dedicated slurm controller and scheduler node, you are increasing
risk and variability by making these changes. (A slurm.conf sketch
combining items 1-5 follows this list.)
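
As a concrete illustration of items 1-5 (the values are the
site-specific ones discussed above, not recommendations; tune them
against your own sdiag measurements):

  # slurm.conf (fragment) -- illustrative values only
  SchedulerParameters=bf_interval=240,bf_continue,bf_max_job_test=1000,bf_max_job_user_part=30,default_queue_depth=1000,max_rpc_cnt=150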

For jobs sitting in COMP state, you may need to look at what is in
your epilog, or encourage other behaviors in job scripts, applications
or caching layers. Are jobs sitting in COMP state because there's a
lot of dirty I/O to be flushed? Does your epilog do the equivalent of
an fsync()/fdatasync()? Are the applications not syncing their data
during the job run? Some tuning of end-of-job timeouts could be done
in slurm, but this seems more a symptom of unbalanced caching and
applications. Look for jobs stuck in, say, Lustre I/O or network I/O,
and spikes in I/O right as jobs finish.
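
A quick way to see whether dirty page cache is the culprit on a node
that lingers in COMP (a sketch; run it on the node, or from the
epilog):

  grep -E '^(Dirty|Writeback):' /proc/meminfo
  # Forcing a flush in the epilog is heavy-handed, but confirms the diagnosis.
  sync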

Hope this helps,
-Steve






On Mon, Aug 17, 2020 at 12:31 PM Ransom, Geoffrey M.
 wrote:
>
>
>
> Hello
>
> We are having performance issues with slurtmctld (delayed sinfo/squeue 
> results, socket timeouts for multiple sbatch calls, jobs/nodes sitting in 
> COMP state for an extended period of time).
>
>
>
> We just fully switched to Slurm from Univa and I think our problem is users 
> putting a lot of “scontrol show” calls (maybe squeue/sinfo as well) in large 
> batches of jobs and essentially DOS-ing our scheduler.
>
>
>
> Is there a built in way to throttle “squeue/sinfo/scontrol show” commands in 
> a reasonable manner so one user can’t do something dumb running jobs that 
> keep calling these commands in bulk?
>
>
>
> If I need to make something up to verify this I am thinking about making a 
> wrapper script around these commands that locks a shared  temp file on the 
> local disk (to avoid NFS locking issues) of each machine and then sleeps for 
> 5 seconds before calling the real command and releasing the lock. At least 
> this way a user w