Hi Andreas,
On 2/14/19 8:56 AM, Henkel, Andreas wrote:
Hi Marcus,
More ideas:
CPUs doesn't always count cores but may take the meaning of one thread, which
makes a difference.
Maybe the behavior of CR_ONE_TASK is still not solid nor properly documented,
and ntasks and ntasks-per-node are honored ...
Hi Marcus,
We have Skylake too and it didn't work for us. We used cgroups only, and process
binding went completely haywire with sub-NUMA clustering enabled.
While searching for solutions I found that hwloc supports sub-NUMA only with
version > 2 (when looking for Skylake in hwloc you will get hits in version ...
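For anyone checking their own setup, the installed hwloc version and the
topology it reports can be inspected directly; these are generic example
commands, not taken from the thread:

hwloc-info --version   # version of the installed hwloc library/tools
lstopo --no-io         # print the detected topology, including the (sub-)NUMA nodes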
Hi Andreas,
As slurmd -C shows, it detects 4 NUMA nodes and treats these as sockets.
This is also the way we configured Slurm.
numactl -H clearly shows the four domains and which belongs to which socket:
node distances:
node   0   1   2   3
  0:  10  11  21  21
  1:  11  10  21  21
  2:  21  2...
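For illustration only, a node definition that models the four NUMA domains as
sockets, as described above, might look like the line below; the node name is a
placeholder and the counts/memory come from the slurmd -C output quoted further
down (whether CPUs= should then count cores or threads is exactly what the rest
of the thread is about):

NodeName=nodeXYZ Sockets=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191905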
Hi Marcus,
For us slurmd -C as well as numactl -H looked fine, too. But we're using
task/cgroup only, and every job starting on a Skylake node gave us

error("task/cgroup: task[%u] infinite loop broken while trying "
      "to provision compute elements using %s (bitmap:%s)", ...

from src/plugins/task/cgroup...
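For context, the task/cgroup plugin is configured via cgroup.conf; a minimal
CPU/memory confinement setup on Slurm 18.x might look like the sketch below.
This is an illustrative example, not the poster's actual file:

CgroupAutomount=yes
ConstrainCores=yes      # confine each job step to its allocated cores
ConstrainRAMSpace=yes   # confine each job step to its allocated memory
TaskAffinity=yes        # bind tasks to their cores via hwloc (per the cgroup.conf man page)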
Hi Andreas,
Might be that this is one of the bugs in Slurm 18.
I think I will open a bug report and see what they say.
Thank you very much, nonetheless.
Best
Marcus
Hi,
One job is in RH state which means JobHoldMaxRequeue.
The output file, specified by --output shows nothing suspicious.
Is there any way to analyze the stuck job?
Regards,
Mahmood
On 2/14/19 8:02 AM, Mahmood Naderan wrote:
One job is in RH state which means JobHoldMaxRequeue.
The output file, specified by --output shows nothing suspicious.
Is there any way to analyze the stuck job?
This happens when a job fails to start MAX_BATCH_REQUEUE times
(which is 5 at the moment) ...
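A few generic commands that help with a job held this way; <jobid> is a
placeholder:

scontrol show job <jobid>    # look at Reason=, Restarts= and BatchHost=
sacct -j <jobid> --format=JobID,State,ExitCode,NodeList
scontrol release <jobid>     # clear the hold once the underlying failure is fixed

The slurmd log on the node(s) where the start attempts failed usually contains
the actual error.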
On 2/14/19 12:22 AM, Marcus Wagner wrote:
CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2
RealMemory=191905
That's different to what you put in your config in the original email
though. There you had:
CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2
This config
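Put differently, both numbers describe the same hardware; the question is
whether CPUs= counts hardware threads or cores. A quick sketch:

4 sockets x 12 cores x 2 threads = 96 hardware threads  -> CPUs=96 (slurmd -C)
4 sockets x 12 cores             = 48 cores             -> CPUs=48 (slurm.conf entry)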
Hi Chris,
that can't be right, or there is some bug elsewhere:
We have configured CR_ONE_TASK_PER_CORE, so two tasks won't share a core
and its hyperthread.
According to your theory, I configured 48 threads. But then using just
--ntasks=48 would give me two nodes, right?
But Slurm schedules t
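For reference, the relevant pieces of such a setup; only CR_ONE_TASK_PER_CORE
and --ntasks=48 come from the thread, the SelectType/CR_Core_Memory lines are
an assumed sketch and job.sh is a placeholder:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE

sbatch --ntasks=48 job.sh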
I have filed a bug:
https://bugs.schedmd.com/show_bug.cgi?id=6522
Let's see what SchedMD has to tell us ;)
Best
Marcus
On 2/15/19 6:25 AM, Marcus Wagner wrote:
NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=182400M,node=1,billing=48
--
Marcus Wagner, Di