If you check the source code (src/slurmctld/job_mgr.c), this error is indeed 
thrown when slurmctld unpacks the saved job state files.  Tracing through 
read_slurm_conf() -> load_all_job_state() -> _load_job_state():


                part_ptr = find_part_record (partition);
                if (part_ptr == NULL) {
                        char *err_part = NULL;
                        part_ptr_list = get_part_list(partition, &err_part);
                        if (part_ptr_list) {
                                part_ptr = list_peek(part_ptr_list);
                                if (list_count(part_ptr_list) == 1)
                                        FREE_NULL_LIST(part_ptr_list);
                        } else {
                                verbose("Invalid partition (%s) for JobId=%u",
                                        err_part, job_id);
                                xfree(err_part);
                                /* not fatal error, partition could have been
                                 * removed, reset_job_bitmaps() will clean-up
                                 * this job */
                        }
                }


The comment following the error implies that this is not actually a problem: it 
is expected when a partition has been removed, and reset_job_bitmaps() will 
clean up the affected job on its own.
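As a quick, non-authoritative sanity check (using the job ID from your log; 
scontrol and scancel are just the standard client commands, nothing internal), 
you could look at what that saved job still references before the next restart:


                scontrol show job 52545


If that job is no longer needed, cancelling it (scancel 52545) should let the 
controller drop the stale partition/node references itself, without having to 
down the cluster and wipe the StateSaveLocation directory.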




> On Jul 26, 2019, at 11:15 AM, Brian Andrus <toomuc...@gmail.com> wrote:
> 
> All,
> 
> I have a cloud based cluster using slurm 19.05.0-1
> I removed one of the partitions, but now every time I start slurmctld I get 
> some errors:
> 
> slurmctld[63042]: error: Invalid partition (mpi-h44rs) for JobId=52545
> slurmctld[63042]: error: _find_node_record(756): lookup failure for 
> mpi-h44rs-01
> slurmctld[63042]: error: node_name2bitmap: invalid node specified mpi-h44rs-01
> .
> .
> slurmctld[63042]: error: _find_node_record(756): lookup failure for 
> mpi-h44rs-05
> slurmctld[63042]: error: node_name2bitmap: invalid node specified mpi-h44rs-05
> slurmctld[63042]: error: Invalid nodes (mpi-h44rs-[01-05]) for JobId=52545
> 
> I suspect this is in the saved state directory and that if I were to down the 
> entire cluster and delete those files, it would clear up, but I prefer 
> not to have to down the cluster...
> 
> Is there a way to clean up "phantom" nodes and partitions that were deleted?
> 
> Brian Andrus 


::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::



