>> Because it stopped the random out of memory conditions that we were having.
>
> aha, so basically "rebooting windows resolves my performance problems" ;)
in other words, a workaround.  I think it's important to note when behavior is
a workaround, so that it doesn't get ossified into SOP.

> Mark, I don't understand your forcefulness here.

it's very simple: pagecache and VM balancing is a very important part of the
kernel, and has received a lot of quite productive attention over the years.
I question the assumption that "rebooting the pagecache" is a sensible way to
deal with memory-tuning problems.  it seems very passive-aggressive to me: as
if there is an assumption that the kernel isn't or can't Do The Right Thing
for HPC.

drop_caches is such a brutal, sledgehammer thing.  for instance, it can't be
used if there are multiple jobs on the host.  it assumes there is absolutely
zero sharing between jobs (executables, contents of /etc/ld.so.cache, etc).
for sites where a single job is rolled onto all nodes and runs for a long
time, then is entirely removed, sure, it may make sense.  rebooting entirely
might even work better.  I'm mainly concerned with clusters which run a wide
mixture of jobs, probably with multiple jobs sharing a node at times.

> All modern compute nodes are essentially NUMA machines (I am assuming all are
> dual or more socket machines).

it depends.  we have some dual E5-2670 nodes that have two memory nodes - I
strongly suspect that they do not need any pagecache-reboot, since they have
just 2 normal zones to balance.  obviously, 4-chip nodes (including AMD
dual-G34 systems) have an increased chance of fragmentation.  similarly, if
you shell out for a MANY-node system, and run a single job at a time on it,
you should certainly be more concerned with whether the kernel can balance
all your tiny little memory zones.  standard statistics apply: if the kernel
balances a zone well .99 of the time, anyone with a few hundred zones will be
very unhappy sometimes (naively, 0.99^200 is about 0.13, so with 200 zones
you'd hit at least one badly-balanced zone something like 87% of the time).
in short, all >1-socket servers are NUMA, but that doesn't mean you should
drop_caches.

> If caches are a large fraction of memory then you have increased memory
> requests from the foreign node.

wait, slow down.  first, why are you assuming remote-node access?  do your
jobs specifically touch a file from one node, populating the pagecache, then
move to another node to perform the actual IO?  we normally have a rank wired
to a particular core for its life.  yes, it's certainly possible for high IO
to consume enough pagecache to also occupy space on remote nodes.  are you
sure this is bad, though?  pagecache is quite deliberately treated as a
low-caste memory request - normally pagecache scavenges its own current usage
under memory pressure.  and the alternative is to be doing uncached IO (or
taking pagecache misses).

I often also meet people who think that having free memory is a good thing,
when in fact it means "I bought too much ram".  that's a little over the top,
of course, but the real message is that having all your ram occupied, even or
especially by pagecache, is good, since pagecache is so efficiently scavenged.
(no IO, obviously - the Inactive fields in /proc/meminfo are lists dedicated
to this sort of easy scavenging.)

> Surely for HPC workloads resetting the system so that you get deterministic
> run times is a good thing?

who says determinism is a good thing?  I assume, for instance, you turn off
your CPU caches to obtain determinism, right?  I'm not claiming that variance
is good, but why do you assume that the normal functioning of the pagecache
will cause it?

> It would be good to see if there really is a big spread in run time - as I
> think there should be.

why?  what's your logic?
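measuring that spread is cheap, by the way.  a rough sketch, assuming you can
dump per-job wall-times into a file, one seconds value per line (the file and
its format are invented here - substitute whatever your scheduler's accounting
actually gives you):

#!/usr/bin/env python3
# rough sketch: quantify run-to-run spread from a file of per-job wall-times,
# one value (seconds) per line.  the input format is invented for illustration.
import statistics
import sys

with open(sys.argv[1]) as f:
    times = [float(line) for line in f if line.strip()]

mean = statistics.mean(times)
stdev = statistics.stdev(times) if len(times) > 1 else 0.0
print("n=%d  mean=%.1fs  stdev=%.1fs  spread=%.1f%% of mean"
      % (len(times), mean, stdev, 100.0 * stdev / mean))

run that over a pile of identical jobs with and without the drop_caches
epilogue and see whether the spread actually changes.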
afaict, you're assuming that the caches are either preventing memory
allocations, causing swapping, or are somehow expensive to scavenge.  while
that's conceivable, it would be a kernel bug.

here's the best justification I can think of for drop_caches: you run in a
dedicated-host environment, but have a write-heavy workload.  and yet each job
shares NOTHING with its successor, not even references to /bin/sh.  and each
job spews out a vast bolus of writes at its end, and you don't want to tune
vm.dirty* to ensure those writes happen smoothly.  instead, you want to idle
all your cpus and memory to ensure that the writes are synced before letting
the next job run.  (it will, of course, spend its initial seconds mainly
missing the pagecache, so cpus/memory will be mostly idle during this time as
well.)  but at least you can in good conscience charge the next user for
exclusive access, even though the whole-system throughput is lower due to all
the IO waiting and stalling.

as I said before, it's conceivable that the kernel has bugs related to
scavenging - maybe it actually does eat a lot of cycles.  since scavenging
clean pages is a normal, inherent behavior, this would be somewhat worrisome,
and should be reported/escalated as a Big Deal.  after all, all read IO
happens on this path, and all (non-O_DIRECT) writes *also* follow this path,
since a dirty page gets cleaned (synced), then scavenged like all the other
clean pages.
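if you want to watch that happen on a node, a quick /proc/meminfo reader is
plenty - here's a rough sketch (the field names are exactly what /proc/meminfo
prints; the "scavengeable" arithmetic is just my back-of-envelope, not any
official kernel formula):

#!/usr/bin/env python3
# rough sketch: show how much of "used" memory is really just easily-scavenged
# pagecache, plus the current dirty/writeback backlog.
def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])   # values are in kB
    return info

m = meminfo()
for k in ("MemTotal", "MemFree", "Cached", "Active(file)",
          "Inactive(file)", "Dirty", "Writeback"):
    print("%-15s %10d kB" % (k, m.get(k, 0)))

# clean file pages (roughly Cached minus Dirty) can be dropped with no IO at
# all; dirty pages have to be written back first, then they rejoin the clean
# pool like any other pagecache page.
print("%-15s %10d kB" % ("scavengeable~", m.get("Cached", 0) - m.get("Dirty", 0)))

run it before, during, and after one of those end-of-job write bursts and you
can see Dirty drain through Writeback and end up back on the (cheaply
reclaimable) file lists - no drop_caches required.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf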