Hi folks,

Over on the xCAT mailing list I've been involved in a thread about inconsistent settings of zone_reclaim_mode across the nodes of a cluster.
It starts here, with Stuart's good description of the problem and its diagnosis:

http://sourceforge.net/p/xcat/mailman/message/32841877/

I did some poking around on our systems and was able to confirm that whilst our newer iDataPlex nodes (dx360 M4s with Sandy Bridge CPUs) all had zone_reclaim_mode set to 0, our older iDataPlex nodes (dx360 M2s with Nehalems) were all set to 1, along with an older SGI UV10 (Westmere). The clincher was that the same cluster also has some IBM x3690 X5 nodes with MAX5 memory expansion units which boot an identical diskless image, and those had zone_reclaim_mode set to 0, not 1.

It turns out that this is indeed auto-tuned by older kernels; the kernel's Documentation/sysctl/vm.txt says:

# zone_reclaim_mode is set during bootup to 1 if it is determined
# that pages from remote zones will cause a measurable performance
# reduction. The page allocator will then reclaim easily reusable
# pages (those page cache pages that are currently not used) before
# allocating off node pages.

However, in 3.16 a patch was committed that disabled this auto-tuning, turning zone reclamation off by default.

It's probably worth checking your own x86-64 systems to see whether this is set for you, and benchmarking with it disabled if it is (there's a rough check script at the end of this message).

Here's that patch with its description:

commit 4f9b16a64753d0bb607454347036dc997fd03b82
Author: Mel Gorman <mgor...@suse.de>
Date:   Wed Jun 4 16:07:14 2014 -0700

    mm: disable zone_reclaim_mode by default

    When it was introduced, zone_reclaim_mode made sense as NUMA distances
    punished and workloads were generally partitioned to fit into a NUMA
    node.  NUMA machines are now common but few of the workloads are
    NUMA-aware and it's routine to see major performance degradation due
    to zone_reclaim_mode being enabled but relatively few can identify
    the problem.

    Those that require zone_reclaim_mode are likely to be able to detect
    when it needs to be enabled and tune appropriately so lets have a
    sensible default for the bulk of users.

    This patch (of 2):

    zone_reclaim_mode causes processes to prefer reclaiming memory from
    local node instead of spilling over to other nodes.  This made sense
    initially when NUMA machines were almost exclusively HPC and the
    workload was partitioned into nodes.  The NUMA penalties were
    sufficiently high to justify reclaiming the memory.  On current
    machines and workloads it is often the case that zone_reclaim_mode
    destroys performance but not all users know how to detect this.
    Favour the common case and disable it by default.  Users that are
    sophisticated enough to know they need zone_reclaim_mode will detect
    it.

    Signed-off-by: Mel Gorman <mgor...@suse.de>
    Acked-by: Johannes Weiner <han...@cmpxchg.org>
    Reviewed-by: Zhang Yanfei <zhangyan...@cn.fujitsu.com>
    Acked-by: Michal Hocko <mho...@suse.cz>
    Reviewed-by: Christoph Lameter <c...@linux.com>
    Signed-off-by: Andrew Morton <a...@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torva...@linux-foundation.org>

-- 
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au   Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/       http://twitter.com/vlsci
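P.S. If you want to script the per-node check, here's a minimal sketch in Python. The script itself, its --clear option and its output format are just my own illustration, but /proc/sys/vm/zone_reclaim_mode is the real sysctl path:

#!/usr/bin/env python3
# Report (and optionally clear) vm.zone_reclaim_mode on this node.
# Illustrative sketch only: the --clear flag and the output format
# are made up for this example; the sysctl path is the real one.
import argparse
import sys

SYSCTL = "/proc/sys/vm/zone_reclaim_mode"

def main():
    parser = argparse.ArgumentParser(
        description="Check (and optionally clear) vm.zone_reclaim_mode")
    parser.add_argument("--clear", action="store_true",
                        help="set zone_reclaim_mode to 0 (needs root)")
    args = parser.parse_args()

    try:
        with open(SYSCTL) as f:
            mode = int(f.read().strip())
    except FileNotFoundError:
        sys.exit(SYSCTL + " not present (kernel without NUMA support?)")

    # Non-zero means the kernel reclaims local pages before going off-node.
    print("zone_reclaim_mode = %d%s"
          % (mode, "  <-- zone reclaim is ON" if mode else ""))

    if args.clear and mode != 0:
        try:
            with open(SYSCTL, "w") as f:
                f.write("0\n")  # takes effect immediately; not persistent
            print("zone_reclaim_mode set to 0")
        except PermissionError:
            sys.exit("need root to write " + SYSCTL)

if __name__ == "__main__":
    main()

On an xCAT cluster something like "xdsh <noderange> cat /proc/sys/vm/zone_reclaim_mode" would do the read-only check across all your nodes in one go, and to make a 0 setting persistent you'd want vm.zone_reclaim_mode = 0 in /etc/sysctl.conf (or the equivalent in your diskless image) rather than a script like the above.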