This sounds like it could be garbage collection related, especially with a heap 
that large. Depending on your JVM tuning, a full GC (FGC) could take quite a 
while, effectively 'pausing' the JVM.

Have you looked at something like jstat -gcutil <pid> or similar to monitor the 
garbage collection?


On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote:

> I have a fairly classic master/slave setup.
> 
> Response times on the slave are generally good with blips periodically, 
> apparently when replication is happening.
> 
> Occasionally however the process will have one incredibly slow query and 
> will peg the CPU at 100%.
> 
> The weird thing is that it will remain that way even if we stop querying 
> it and stop replication and then wait for over 20 minutes. The only way 
> to fix the problem at that point is to restart tomcat.
> 
> Looking at slow queries around the time of the incident they don't look 
> particularly bad - they're predominantly filter queries running under 
> dismax and there doesn't seem to be anything unusual about them.
> 
> The index file is about 266G and has 30G of disk free. The machine has 
> 50G of RAM and is running with -Xmx35G.
> 
> Looking at the processes running it appears to be the main Java thread 
> that's CPU bound, not the child threads. 
> 
> Stracing the process gives a lot of brk calls (presumably some 
> sort of wait loop) with occasional blips of: 
> 
> 
> mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
> futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
> 325, {1294683789, 614186000}, ffffffff) = 0
> futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
> mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
> mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
> futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
> mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
> futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
> futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
> mmap(0x7fc2e0230000, 121962496, PROT_NONE, 
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
> 0x7fc2e0230000
> mmap(0x7fbca58e0000, 237568, PROT_NONE, 
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
> 0x7fbca58e0000
> 
> Any ideas about what's happening and if there's any way to mitigate it? 
> If the box at least recovered then I could run another slave and load 
> balance between them working on the principle that the second box 
> would pick up the slack whilst the first box restabilised but, as it is, 
> that's not reliable.
> 
> Thanks,
> 
> Simon
> 
