"Nikita V. Youshchenko" <yo...@debian.org> writes: >> We have had to carry that patch without any upstream support (or sharing >> with Novell, which eventually released SLES 11 with 2.6.27). As a >> result, the xen-flavour kernels for lenny are very buggy, particularly >> for domains with multiple vCPUs (though that *may* be fixed now). > > Unfortunately it is not fixed. > > We here once migrated to xen and now rely on it, and that gives lots of > frustration. For any loaded domain we still have to run etch kernel, > because lenny kernel constantly crashes after several days of heavy load. > Dom0's run lenny kernel - and with a fix for #542250 they don't crash, but > those are almost unloaded.
I was having problems with multiple vCPUs too; under moderate load I
would regularly get crashes. I reported my findings in #504805. I
swapped out machines, which didn't help. When the fix for
xen_spin_wait() came out, I eagerly switched to that, but it didn't fix
my problem either. I even tried my hardest to switch to the latest
upstream Xen kernel to see if that would fix things, but it was far too
unstable and I couldn't get it to work at all.

Eventually I stumbled on a way to keep my machines from restarting.
It's not a great solution, but it stops me from having to deal with the
failure on a daily basis, and I think anyone else who is having this
problem can do the same. Obviously this is not the right fix, but it
works until we can get one.

First I made sure this was set in /etc/xen/xend-config.sxp:

  (dom0-cpus 0)

Then I pinned individual physical CPUs to specific domUs; once pinned,
the problem stops.

What does that mean? Xen does this wacky thing where it creates virtual
CPUs (VCPUs). Each domU has one of them by default (though you can have
more), and Xen moves physical CPUs between those VCPUs depending on
need. So let's say you have four physical CPUs and a domU with its one
default VCPU: physical CPUs 0, 1, 2 and 3 could all end up servicing
that VCPU, even at the same time. I found somewhere that this can be a
performance hit, because Xen has to figure out how to deal with it and
switch contexts, and I also read that it could cause some instability
(!), so pinning the physical CPUs so they don't move around seemed
worth trying.

The pinning does not stick across reboots, so it has to be redone
whenever the system is rebooted, and it isn't really possible to set it
up in a startup script, at least I don't think so (though see the rough
sketch at the end of this mail).

So how do you do this? If you look at 'xm vcpu-list' (which annoyingly
isn't listed in 'xm help'), you will see the CPU column populated with
a random CPU, depending on scheduling, and the CPU Affinity column
saying 'any cpu' for every VCPU. This means that any physical CPU can
travel between them, and will, depending on the scheduling. Once you
pin things, the individual domUs have specific CPU affinities, the
physical CPUs no longer 'travel' between them, and magically the
crashes stop.

An example:

r...@shoveler:~# xm vcpu-list
Name       ID  VCPU  CPU  State  Time(s)   CPU Affinity
Domain-0    0     0    1  -b-    283688.8  any cpu
Domain-0    0     1    1  ---     39666.3  any cpu
Domain-0    0     2    1  r--     49224.4  any cpu
Domain-0    0     3    1  -b-     75591.1  any cpu
kite        1     0    3  -b-     71411.8  any cpu
murrelet    2     0    0  -b-    472222.2  any cpu
test        3     0    0  r--    342182.3  any cpu

We want to fix that final column using 'xm vcpu-pin' (also a command
not listed in 'xm help'):

Usage: xm vcpu-pin <Domain> <VCPU|all> <CPUs|all>
Set which CPUs a VCPU can use.

r...@shoveler:~# xm vcpu-pin 0 0 0
r...@shoveler:~# xm vcpu-pin 0 1 0
r...@shoveler:~# xm vcpu-pin 0 2 0
r...@shoveler:~# xm vcpu-pin 0 3 0
r...@shoveler:~# xm vcpu-pin 1 0 1
r...@shoveler:~# xm vcpu-pin 2 0 2
r...@shoveler:~# xm vcpu-pin 3 0 3

r...@shoveler:~# xm vcpu-list
Name       ID  VCPU  CPU  State  Time(s)   CPU Affinity
Domain-0    0     0    1  -b-    283700.3  0
Domain-0    0     1    1  r--     39669.6  0
Domain-0    0     2    1  -b-     49227.4  0
Domain-0    0     3    1  -b-     75596.2  0
kite        1     0    3  -b-     71415.3  1
murrelet    2     0    0  -b-    472237.8  2
test        3     0    0  r--    342182.3  3

And voila, no more crashes... :P
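For what it's worth, here is a rough sketch of a script that would
re-apply the same pinning as the transcript above after a reboot. I
haven't actually run it this way; it assumes 'xm vcpu-pin' accepts
domain names in place of numeric IDs and the 'all' keyword from its
usage line, and that all the domUs are already running when it is
invoked. The domain names and CPU assignments are from my setup and
will differ on yours:

#!/bin/sh
# Re-apply CPU pinning after a reboot (untested sketch).
# Must run after all domains are up; domain IDs change across
# reboots, so pin by name rather than by numeric ID.

# Pin all four of dom0's VCPUs to physical CPU 0.
xm vcpu-pin Domain-0 all 0

# Pin each domU's single VCPU to its own physical CPU.
xm vcpu-pin kite     all 1
xm vcpu-pin murrelet all 2
xm vcpu-pin test     all 3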
micah