The questions are:
1) Is there some way to distribute the local memory of threads analogously (I assume it has the same size for each thread) using "reasonable" NUMA allocation?

that is, not surprisingly, the default.  generally, on all NUMA machines,
the starting rule is that memory is allocated for a thread upon "first
touch": the first thread to touch a page causes a page fault, which
triggers the actual physical allocation, normally on that thread's node.
(if you allocate memory but never touch it, it remains purely virtual,
ignoring any book-keeping by your memory allocation library, if any.)

2) Is it right that using numactl for applications may improve performance in the following case: the number of application processes equals the number of cores of one CPU *AND* the RAM the application needs fits in one node's DIMMs? (I assume that RAM is allocated "continously".)

you certainly don't want to _deliberately_ create imbalances.
"numactl --hardware" is a good way to see the state of memory allocation.
of course, it reflects only size and free (where free means "wasted" to the
kernel - not the same as "freeable".)
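
if you want the same numbers from inside a program, libnuma will give
them to you; a sketch (link with -lnuma):

    /* print per-node total/free memory, roughly what
       "numactl --hardware" shows.  link with -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }
        for (int n = 0; n <= numa_max_node(); n++) {
            long long freep;
            long long size = numa_node_size64(n, &freep);
            printf("node %d: size %lld MB, free %lld MB\n",
                   n, size >> 20, freep >> 20);
        }
        return 0;
    }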



What happens to performance (when using numactl) if the required RAM size is larger than the RAM available on one node, so the program cannot take advantage of (load-balanced) simultaneous use of the memory controllers of both CPUs?

non-local memory is modestly slower than local - not dramatically.
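
easy enough to measure on your own box; a crude (untested) sketch using
libnuma that pins itself to node 0 and times writes to memory placed on
node 0 vs node 1 (link with -lnuma):

    /* local-vs-remote write probe.  assumes at least two nodes. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <numa.h>

    #define SZ (256UL * 1024 * 1024)

    static double touch(void *p)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memset(p, 1, SZ);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        if (numa_available() < 0 || numa_max_node() < 1)
            return 1;
        numa_run_on_node(0);                  /* pin to node 0 */
        void *local  = numa_alloc_onnode(SZ, 0);
        void *remote = numa_alloc_onnode(SZ, 1);
        if (!local || !remote)
            return 1;
        touch(local); touch(remote);          /* fault pages in first */
        printf("local:  %.3f s\n", touch(local));
        printf("remote: %.3f s\n", touch(remote));
        numa_free(local, SZ);
        numa_free(remote, SZ);
        return 0;
    }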

(I also assume that RAM is allocated continously.)

I'm not sure what that means - continuously in time?  or contiguously?
the latter is definitely not true - the allocated memory map for a task
will normally be pretty chopped up, and the virtual addresses will have little relation to physical addresses.
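
if you're curious where pages really landed, move_pages(2) in pure-query
mode (nodes=NULL) reports the node of each page - or just read
/proc/<pid>/numa_maps.  a sketch (link with -lnuma):

    /* ask the kernel which node each page of a buffer lives on. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <numaif.h>

    #define NPAGES 8

    int main(void)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        char *buf = malloc(NPAGES * pagesz);
        void *pages[NPAGES];
        int status[NPAGES];

        memset(buf, 0, NPAGES * pagesz);    /* fault the pages in first */
        for (int i = 0; i < NPAGES; i++)
            pages[i] = buf + i * pagesz;

        /* nodes == NULL turns move_pages into a pure query */
        if (move_pages(0, NPAGES, pages, NULL, status, 0) == 0)
            for (int i = 0; i < NPAGES; i++)
                printf("page %d -> node %d\n", i, status[i]);
        free(buf);
        return 0;
    }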

3) Is there some reason to use things like
mpirun -np N /usr/bin/numactl <numactl_parameters> my_application ?

not that I know of.

4) If I use malloc() and don't use numactl, how can I tell from which node Linux will begin the real memory allocation? (I assume again that all the RAM is free.) And how can I tell where the DIMMs corresponding to higher or lower RAM addresses are placed?

if there is free memory on the node where the thread is running, that's where the physical page will be allocated.

I don't see why userspace would need to know that. the main question is
whether non-local allocations are allowed or not, and you set that policy
with numactl --localalloc (or override it with --preferred, etc).
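
if you can't wrap the launch in numactl, the same policies can be set
from inside the process with libnuma; a sketch (link with -lnuma):

    /* per-process equivalents of numactl --localalloc / --preferred=N */
    #include <numa.h>

    void use_local_policy(void)
    {
        if (numa_available() < 0)
            return;                 /* no NUMA support; nothing to do */
        numa_set_localalloc();      /* like numactl --localalloc */
        /* or: numa_set_preferred(0);   like numactl --preferred=0 */
    }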

5) In which cases is it reasonable to switch on "Node memory interleaving" (in the BIOS) for an application which uses more memory than is present on one node?

I leave it off, since numactl --interleave lets you get the same effect from user-space. I'm not sure I've ever seen it be a win.
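
i.e. either "numactl --interleave=all my_application" for the whole
process, or per-allocation from libnuma - an (untested) sketch:

    /* interleave one big allocation across all nodes, like
       "numactl --interleave=all" but only for this buffer.
       link with -lnuma */
    #include <stddef.h>
    #include <numa.h>

    void *alloc_interleaved(size_t bytes)
    {
        if (numa_available() < 0)
            return NULL;
        /* release with numa_free(ptr, bytes) */
        return numa_alloc_interleaved(bytes);
    }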

And BTW: if I use taskset -c CPU1,CPU2,... <program_file>
and the program_file creates some new processes, will all these processes run only on the CPUs defined in the taskset command?

afaik, scheduler settings like this are indeed inherited: a child created
via fork/clone gets the parent's affinity mask, and it is preserved
across exec.
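
easy to verify with a sketch like this - run it under taskset and the
child should print the same mask as the parent:

    /* verify that the affinity mask set by taskset survives fork().
       run as: taskset -c 0,1 ./a.out */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <unistd.h>
    #include <sys/wait.h>

    static void show(const char *who)
    {
        cpu_set_t set;
        sched_getaffinity(0, sizeof set, &set);
        printf("%s:", who);
        for (int c = 0; c < CPU_SETSIZE; c++)
            if (CPU_ISSET(c, &set))
                printf(" %d", c);
        printf("\n");
    }

    int main(void)
    {
        show("parent");
        if (fork() == 0) {        /* child inherits the parent's mask */
            show("child");
            return 0;
        }
        wait(NULL);
        return 0;
    }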