http://www.win.tue.nl/~aeb/linux/lk/lk-9.html

9. Memory
9.1 Physical and virtual memory

Traditionally, one has physical memory, that is, memory that is actually present in the machine, and virtual memory, that is, address space. Usually the virtual memory is much larger than the physical memory, and some hardware or software mechanism makes sure that a program can transparently use this much larger virtual space while in fact only the physical memory is available.

Nowadays things are reversed: on a Pentium II one can have 64 GB of physical memory, while addresses have 32 bits, so that the virtual memory has a size of 4 GB. We'll have to wait for a 64-bit architecture to get large amounts of virtual memory again. The present situation on a Pentium with more than 4 GB is that, using PAE (Physical Address Extension), it is possible to place the addressable 4 GB anywhere in the available memory, but it is impossible to have access to more than 4 GB at once.

9.2 Kinds of memory

Kernel and user space work with virtual addresses (also called linear addresses) that are mapped to physical addresses by the memory management hardware. This mapping is defined by page tables, set up by the operating system.

DMA devices use bus addresses. On an i386 PC, bus addresses are the same as physical addresses, but other architectures may have special address mapping hardware to convert bus addresses to physical addresses. Under Linux one has

#include <asm/io.h>

phys_addr = virt_to_phys(virt_addr);
virt_addr = phys_to_virt(phys_addr);
bus_addr  = virt_to_bus(virt_addr);
virt_addr = bus_to_virt(bus_addr);

All this is about accessing ordinary memory. There is also "shared memory" on the PCI or ISA bus. It can be mapped inside a 32-bit address space using ioremap().

Life is complicated by the fact that there are various caches around, so that different ways to access the same physical address need not give the same result.

9.3 Kernel memory handling
Pages

The basic unit of memory is the page. Nobody can assume how large a page is (that is why the Linux swap-space structure, with a signature at the end of a page, is so unfortunate): it is architecture-dependent, but typically PAGE_SIZE = 4096. (PAGE_SIZE equals 1 << PAGE_SHIFT, and PAGE_SHIFT is 12, 13, 14, 15 or 16 on the various architectures.)

Usually the page size is determined by the hardware: the relation between virtual addresses and physical addresses is given by page tables, and when a virtual address is referenced that does not (yet) correspond to a physical address, a page fault occurs, and the operating system can take appropriate action. Most hardware allows only a very limited choice of page sizes. (For example, a Pentium II knows about 4 KiB and 4 MiB pages.)

Kernel memory allocation
Buddy system

The kernel uses a buddy system with power-of-two sizes. For order
0, 1, 2, ..., 9 it has lists of areas containing 2^order pages.
If a small area is needed and only a larger area is available,
the larger area is split into two halves (buddies), possibly
repeatedly.
In this way the waste is at most 50%.
Since 2.5.40, the number of free areas of each order can be seen in /proc/buddyinfo.

get_free_page

The routine __get_free_page() gives one free page; __get_free_pages() gives 2^order contiguous free pages.

kmalloc

The routine kmalloc() allocates a physically contiguous area of arbitrary (but small) size.

Priority

Each of the above routines has a flags argument indicating how the memory may be obtained. Some common flag values (see linux/gfp.h):

/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low four bits) */
#define __GFP_DMA       0x01

/* Action modifiers - doesn't change the zoning */
#define __GFP_WAIT      0x10    /* Can wait and reschedule? */
#define __GFP_HIGH      0x20    /* Should access emergency pools? */
#define __GFP_IO        0x40    /* Can start low memory physical IO? */
#define __GFP_FS        0x100   /* Can call down to low-level FS? */
#define __GFP_COLD      0x200   /* Cache-cold page required */

#define GFP_NOIO    (__GFP_WAIT)
#define GFP_NOFS    (__GFP_WAIT | __GFP_IO)
#define GFP_ATOMIC  (__GFP_HIGH)
#define GFP_USER    (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_KERNEL  (__GFP_WAIT | __GFP_IO | __GFP_FS)

Uses: GFP_KERNEL is the usual choice in process context, where sleeping is allowed; GFP_ATOMIC is for interrupt context and other places where one cannot sleep; GFP_NOIO and GFP_NOFS avoid recursing into the I/O and filesystem layers.
vmalloc

The routine vmalloc() allocates memory that is contiguous in kernel virtual address space but need not be physically contiguous: the area is obtained page by page and glued together with page tables.

bigphysarea

There is a patch around, the "bigphysarea patch",
allowing one to reserve a large area at boot time.
Some devices need a lot of room (for video frame grabbing,
scanned images, wavetable synthesis, etc.).

The slab cache

The routine kmem_cache_create() creates a cache of fixed-size objects, from which kmem_cache_alloc() then hands out objects; this is how the kernel allocates its many small, equal-sized structures. The number of special purpose caches is increasing quickly. I have here a 2.4.18 system with 63 slab caches, and a 2.5.56 one with 149 slab caches. Probably this number should be reduced a bit again. Statistics are found in /proc/slabinfo.
The slabtop utility displays this information in top-like fashion.

Question: One of the reasons given for a slab cache is that it gives areas of the precise required size, and thus reduces memory waste. In the slabinfo output, look at how many pages are used for each slab cache, and how much space it would have cost to kmalloc() everything. Does the slab cache do better?

Having a slab cache has some other advantages besides reducing waste. Allocation and freeing are faster than with the buddy system. There is an attempt at doing some cache colouring.

9.4 i386 addressing

Let us look at an explicit machine to see how addressing is done. This is only a very brief description. For all details, see some Intel manual. (I have here 24547209.pdf, entitled IA-32 Intel Architecture Software Developer's Manual Volume 3, with Chapter 3: Protected Mode Memory Management. It can be downloaded from developer.intel.com.)

Addressing goes in two stages: a logical address determined by segment and offset is translated into a linear address, and next the linear address is translated into a physical address. Real mode assembler programmers may be familiar with the first stage: one has a 16-bit segment register with value S and a 16- or 32-bit offset O, which produce a 20- or 32-bit linear address A = 16*S + O, and this is also the physical address. There are six segment registers: CS, SS, DS, ES, FS, GS. The first is used when instructions are fetched, the second for stack access, the third for data access; override prefixes allow one to specify any of the six.

Segmentation

In protected mode one still has segmentation (although most systems do not use it - the corresponding hardware is just wasted, present only for compatibility reasons), but one also has paging. The paging is optional (it can be enabled/disabled), but this unused segmentation is always present, and much more complicated than in real mode.
Instead of a single 16-bit value, the segmentation register S is now a 13-bit index I followed by the single bit value TI (Table Index) and the 2-bit value RPL (Requested Protection Level). The bit TI selects one of two tables: the GDT (Global Descriptor Table) or the LDT (Local Descriptor Table). Each can have up to 8192 entries, and the 13-bit index I selects an entry in one of these tables.
Such an entry is an 8-byte struct containing a 32-bit segment base address, a 20-bit segment limit, and various attribute bits.

Paging

When paging is enabled (as it is under Linux) we are only halfway, and this useless segmentation stuff is followed by the extremely useful paging. The linear address A is divided into three parts: a 10-bit directory index AD, a 10-bit table index AT and a 12-bit offset AO. The corresponding physical address is now found in four stages: (i) the CR3 processor register points at the page directory; (ii) index the page directory with AD to find a page table; (iii) index the page table with AT to find the page; (iv) index the page with AO to find the physical address.

Thus, the page directory is a table with up to 1024 entries. Each of these entries points to a page table that itself has up to 1024 entries, and each of these points to a page. The entries in page directory and page table have 32 bits, but only the first 20 are needed for the address, since by convention pages and page tables start at a 4 KiB boundary, so that the last 12 address bits are zero. In page directory and page table these last bits are used for bookkeeping purposes.

By far the most important bit is the "present" bit. If a user program refers to a page that is not actually present (according to this bit in the page table), a page fault is generated, and the operating system might fetch the page, map it somewhere in memory, and continue the user program. Apart from the "present" bit there is the "dirty" bit, which indicates whether the page was modified, the "accessed" bit, which indicates that the page was read, and permission bits.

The above was for 4 KiB pages. There are also 4 MiB pages, where one indexing stage is absent and the offset part has 22 bits. (With PAE one also has 2 MiB pages.) For many pages of detail, see the Intel manuals.

9.5 Reference

Recently Mel Gorman released his thesis, a detailed description and commentary of the virtual memory management of Linux 2.4.
9.6 Overcommit and OOM

Normally, a user-space program reserves (virtual) memory by calling malloc(). If the request cannot be satisfied, malloc() returns NULL and the program can react appropriately.
Linux, on the other hand, is seriously broken. It will by default answer "yes" to most requests for memory, in the hope that programs ask for more than they actually need. If the hope is fulfilled, Linux can run more programs in the same memory, or can run a program that requires more virtual memory than is available. And if not, very bad things happen.

What happens is that the OOM killer (OOM = out-of-memory) is invoked, and it will select some process and kill it. Long discussions are held about the choice of the victim. Maybe not a root process, maybe not a process doing raw I/O, maybe not a process that has already spent weeks doing some computation. And thus it can happen that one's emacs is killed when someone else starts more stuff than the kernel can handle. Ach. Very, very primitive. Of course, the very existence of an OOM killer is a bug.

A typical case:

Sep 19 00:33:10 mette kernel: Out of Memory: Killed process 8631 (xterm).
Sep 19 00:33:34 mette kernel: Out of Memory: Killed process 9154 (xterm).
Sep 19 00:34:05 mette kernel: Out of Memory: Killed process 6840 (xterm).
Sep 19 00:34:42 mette kernel: Out of Memory: Killed process 9066 (xterm).
Sep 19 00:35:15 mette kernel: Out of Memory: Killed process 9269 (xterm).
Sep 19 00:35:43 mette kernel: Out of Memory: Killed process 9351 (xterm).
Sep 19 00:36:05 mette kernel: Out of Memory: Killed process 6752 (xterm).

Random xterm windows are killed, until the xterm window that was X's console is killed. Then X exits and all user processes die, including the umount process that caused all this.
OK. This is very bad. People lose long-running processes, lose weeks of computation, just because the kernel is an optimist.

Demo program 1: allocate memory without using it.

#include <stdio.h>
#include <stdlib.h>
int main (void) {
int n = 0;
while (1) {
if (malloc(1<<20) == NULL) {
printf("malloc failure after %d MiB\n", n);
return 0;
}
printf ("got %d MiB\n", ++n);
}
}
Demo program 2: allocate memory and actually touch it all.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main (void) {
int n = 0;
char *p;
while (1) {
if ((p = malloc(1<<20)) == NULL) {
printf("malloc failure after %d MiB\n", n);
return 0;
}
memset (p, 0, (1<<20));
printf ("got %d MiB\n", ++n);
}
}
Demo program 3: first allocate, and use later.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define N 10000
int main (void) {
int i, n = 0;
char *pp[N];
for (n = 0; n < N; n++) {
pp[n] = malloc(1<<20);
if (pp[n] == NULL)
break;
}
printf("malloc failure after %d MiB\n", n);
for (i = 0; i < n; i++) {
memset (pp[i], 0, (1<<20));
printf("%d\n", i+1);
}
return 0;
}
Typically, the first demo program will get a very large amount of memory before malloc() returns NULL. The second demo program will get a much smaller amount of memory, since the memory obtained earlier is now actually used. The third program will get the same large amount as the first program, and is then killed when it wants to use its memory. (On a well-functioning system, like Solaris, the three demo programs obtain the same amount of memory, and do not crash but see malloc() return NULL.)

Other kernels have other strategies. Sometimes processes get a segfault when accessing memory that the kernel is unable to provide, sometimes they are killed, sometimes other processes are killed, sometimes the kernel hangs.

Turning off overcommit
Going in the wrong direction

Since 2.1.27 there is a sysctl, the proc file /proc/sys/vm/overcommit_memory: writing 1 there tells the kernel always to overcommit, never to refuse a request - the wrong direction.

Going in the right direction

Since 2.5.30 the values are:

0 (default): as before: guess about how much overcommitment is reasonable,
1: never refuse any malloc(),
2: be precise about overcommit - never commit a virtual address space larger than swap space plus a fraction overcommit_ratio of the physical memory.

After

# echo 2 > /proc/sys/vm/overcommit_memory

all three demo programs were able to obtain 498 MiB on this 2.6.8.1 machine (256 MiB, 539 MiB swap, lots of other active processes), very satisfactory. However, without swap, no more processes could be started - already more than half of the memory was committed. After

# echo 80 > /proc/sys/vm/overcommit_ratio

all three demo programs were able to obtain 34 MiB. (Exercise: solve two equations with two unknowns and conclude that main memory was 250 MiB, and the other processes took 166 MiB.)

One can view the currently committed amount of memory in the Committed_AS field of /proc/meminfo.
9.7 Stack overflow

Processes use memory on the stack and on the heap. Heap memory is provided by malloc(); stack space is allocated automatically by the kernel as the stack grows.

Each process has a soft and a hard limit on the stack size. The soft limit can be changed, but must stay below the hard limit. The hard limit can be decreased, but can be increased only by root.

% ulimit -s unlimited
% ulimit -s 10
% ls
Segmentation fault
% ulimit -s 100
-bash: ulimit: stack size: cannot modify limit: Operation not permitted
% su
Segmentation fault
:-)

One can get or set such limits via getrlimit/setrlimit:

/* getstklim.c */
#include <stdio.h>
#include <sys/resource.h>
int main() {
struct rlimit rlim;
if (getrlimit(RLIMIT_STACK, &rlim))
return 1;
printf("stack size: current %ld max: %ld\n",
rlim.rlim_cur, rlim.rlim_max);
return 0;
}
The bash builtin ulimit works in KiB, the system calls
getrlimit/setrlimit in bytes.
% ulimit -s 8192
% ./getstklim
stack size: current 8388608 max: -1
% ulimit -s 100
% ./getstklim
stack size: current 102400 max: 102400
% su
# ulimit -s unlimited
# ./getstklim
stack size: current -1 max: -1

The old limit of 8 MiB was introduced in kernel version 1.3.7. Today it is ridiculously low. If you get random segfaults, put "ulimit -s unlimited" in your .profile.
It is not common to worry about stack overflow, and sometimes one can compromise the security of a system by starting a setuid binary with low resource limits:

% ulimit -s 15; mount
Segmentation fault
% ulimit -s 10; mount
Killed

But a programmer can catch signals and do something about the SIGSEGV. (This was even the technique the Bourne shell used for memory management: don't use malloc() but just store stuff in the data area past the end of what was used before. If that gives a SIGSEGV, extend the data area in the signal handler and return from the signal handler. Unfortunately, as it turned out, this would not correctly restart the faulting instruction on all architectures.)

A simple demo that catches SIGSEGV:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
void segfault(int dummy) {
printf("Help!\n");
exit(1);
}
int main() {
int *p = 0;
signal(SIGSEGV, segfault);
*p = 17;
return 0;
}
Without the exit() here, this demo will loop because
the illegal assignment is restarted.
This simple demo fails to catch stack overflow, because there is no stack space left for a call frame for the segfault() signal handler. If one wants to catch stack overflow, one first must set up an alternative stack. As follows:

...
int main() {
char myaltstack[SIGSTKSZ];
struct sigaction act;
stack_t ss;
ss.ss_sp = myaltstack;
ss.ss_size = sizeof(myaltstack);
ss.ss_flags = 0;
if (sigaltstack(&ss, NULL))
errexit("sigaltstack failed");
act.sa_handler = segfault;
act.sa_flags = SA_ONSTACK;
sigemptyset(&act.sa_mask);
if (sigaction(SIGSEGV, &act, NULL))
errexit("sigaction failed");
...
This will at least allow one to print an error message or to do some
cleanup
upon stack overflow.
