This sounds like it could be garbage-collection related, especially with a heap that large. Depending on your JVM tuning, a full GC (FGC) could take quite a while, effectively 'pausing' the JVM.
Have you looked at something like jstat -gcutil or similar to monitor the garbage collection?

On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote:

> I have a fairly classic master/slave set up.
>
> Response times on the slave are generally good with blips periodically,
> apparently when replication is happening.
>
> Occasionally however the process will have one incredibly slow query and
> will peg the CPU at 100%.
>
> The weird thing is that it will remain that way even if we stop querying
> it and stop replication and then wait for over 20 minutes. The only way
> to fix the problem at that point is to restart tomcat.
>
> Looking at slow queries around the time of the incident they don't look
> particularly bad - they're predominantly filter queries running under
> dismax and there doesn't seem to be anything unusual about them.
>
> The index file is about 266G and has 30G of disk free. The machine has
> 50G of RAM and is running with -Xmx35G.
>
> Looking at the processes running it appears to be the main Java thread
> that's CPU bound, not the child threads.
>
> Stracing the process gives a lot of brk instructions (presumably some
> sort of wait loop) with occasional blips of:
>
> mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
> futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0,
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10,
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
> 325, {1294683789, 614186000}, ffffffff) = 0
> futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
> mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
> mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
> futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0,
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
> mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
> futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0,
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
> futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
> mmap(0x7fc2e0230000, 121962496, PROT_NONE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) =
> 0x7fc2e0230000
> mmap(0x7fbca58e0000, 237568, PROT_NONE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) =
> 0x7fbca58e0000
>
> Any ideas about what's happening and if there's anyway to mitigate it?
> If the box at least recovered then I could run another slave and load
> balance between them working on the principle that the second box
> would pick up the slack whilst the first box restabilised but, as it is,
> that's not reliable.
>
> Thanks,
>
> Simon
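
If attaching jstat to the Tomcat process is awkward, the same kind of GC counters are available inside the JVM through the standard java.lang.management beans. Below is a rough sketch, not a drop-in tool: it reports on whatever JVM it runs in, so to say anything about Solr it would have to run inside the Tomcat JVM (for example from a small servlet) or be adapted to read the beans over JMX. The class name GcCheck and its two helper methods are made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {

    /** Fraction of uptime spent in GC, given cumulative GC time and uptime in milliseconds. */
    public static double gcTimeFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / uptimeMillis;
    }

    /** Sums the accumulated collection time over all collectors in this JVM. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // One line per collector: how many collections ran and how long they took in total.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC share of uptime: %.1f%%%n",
                100.0 * gcTimeFraction(totalGcMillis(), uptime));
    }
}
```

If the old-generation collector's count or time keeps climbing while the box is pegged at 100%, that would point towards back-to-back full collections rather than the queries themselves.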