On 11/12/20 6:06 PM, Shaokun Zhang wrote: >>> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 - 3) in the 2-cpu >>> system(0 - 1). The topology of this server is followed: >> >> This is with a feature enabled that Intel calls sub-NUMA-clustering >> (SNC), right? Explaining *that* feature would also be great context for > > Correct, > >> why this gets triggered on your system and not normally on others and >> why nobody noticed this until now. > > This is on intel 6248 platform:
I have no idea what a "6248 platform" is. >>> +static void calc_node_distance(int *node_dist, int node) >>> +{ >>> + int i; >>> + >>> + for (i = 0; i < nr_node_ids; i++) >>> + node_dist[i] = node_distance(node, i); >>> +} >> >> This appears to be the only place node_dist[] is written. That means it >> always contains a one-dimensional slice of the two-dimensional data >> represented by node_distance(). >> >> Why is a copy of this data needed? > > It is used to store the distance with the @node for later, apologies that I > can't follow your question correctly. Right, the data that you store is useful. *But*, it's also a verbatim copy of the data from node_distance(). Why not just use node_distance() directly in your code rather than creating a partial copy of it in the local node_dist[] array? >>> unsigned int cpumask_local_spread(unsigned int i, int node) >>> { >>> - int cpu, hk_flags; >>> + static DEFINE_SPINLOCK(spread_lock); >>> + static int node_dist[MAX_NUMNODES]; >>> + static bool used[MAX_NUMNODES]; >> >> Not to be *too* picky, but there is a reason we declare nodemask_t as a >> bitmap and not an array of bools. Isn't this just wasteful? >> >>> + unsigned long flags; >>> + int cpu, hk_flags, j, id; >>> const struct cpumask *mask; >>> >>> hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ; >>> @@ -220,20 +256,28 @@ unsigned int cpumask_local_spread(unsigned int i, int >>> node) >>> return cpu; >>> } >>> } else { >>> - /* NUMA first. */ >>> - for_each_cpu_and(cpu, cpumask_of_node(node), mask) { >>> - if (i-- == 0) >>> - return cpu; >>> - } >>> + spin_lock_irqsave(&spread_lock, flags); >>> + memset(used, 0, nr_node_ids * sizeof(bool)); >>> + calc_node_distance(node_dist, node); >>> + /* Local node first then the nearest node is used */ >> >> Is this comment really correct? This makes it sound like there is only > > I think it is correct, that's what we want to choose the nearest node. > >> fallback to a single node. Doesn't the _code_ fall back basically >> without limit? > > If I follow your question correctly, without this patch, if the local > node is used up, one random node will be choosed, right? Now we firstly > choose the nearest node by the distance, if all nodes has been choosen, > it will return the initial solution. The comment makes it sound like the code does: 1. Do the local node 2. Do the next nearest node 3. Stop In reality, I *think* it's more of a loop where it search ever-increasing distances away from the local node. I just think the comment needs to be made more precise. >>> + for (j = 0; j < nr_node_ids; j++) { >>> + id = find_nearest_node(node_dist, used); >>> + if (id < 0) >>> + break; >>> >>> - for_each_cpu(cpu, mask) { >>> - /* Skip NUMA nodes, done above. */ >>> - if (cpumask_test_cpu(cpu, cpumask_of_node(node))) >>> - continue; >>> + for_each_cpu_and(cpu, cpumask_of_node(id), mask) >>> + if (i-- == 0) { >>> + spin_unlock_irqrestore(&spread_lock, >>> + flags); >>> + return cpu; >>> + } >>> + used[id] = 1; >>> + } >>> + spin_unlock_irqrestore(&spread_lock, flags); >> >> The existing code was pretty sparsely commented. This looks to me to >> make it more complicated and *less* commented. Not the best combo. > > Apologies for the bad comments, hopefully I describe it clearly by the above > explantion. Do you want to take another pass at submitting this patch?