On Apr 24, 2013, at 12:22 PM, Timothy Potter <thelabd...@gmail.com> wrote:
> I like the idea of running a script vs. kill -9 ;-) Right now when a
> node fails, we have monitors for whether a node is up and serving
> queries. If not, that triggers some manual investigation and restart
> process. Part of the process was to capture the logs and heap dump
> file. What happened previously is that the log capture part wasn't
> scripted into the restart process and so the logs got wiped out when
> the restart happened :-(
>
> One question about this - when you say "logs the issue" from your
> script - what type of things do you log? I've been relying on the
> timestamp of the heap dump (hprof) as a way to trace back into our log
> files.

Yeah, that's pretty much it - the time of the event and the fact that an OOM occurred. If you are dropping a heap dump, that has the same info, but a log is just a nice compact little history of events.

- Mark

>
> Thanks.
> Tim
>
> On Wed, Apr 24, 2013 at 10:03 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>> On Apr 24, 2013, at 12:00 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>>> -XX:OnOutOfMemoryError="kill -9 %p" -XX:+HeapDumpOnOutOfMemoryError
>>
>> The way I like to handle this is to have the OOM trigger a little script or
>> set of cmds that logs the issue and kills the process.
>>
>> Then if you have the process supervised (via runit or something), it will
>> just start back up (what else do you do after an OOM?), but you will have
>> logged something, triggered a notification, whatever.
>>
>> - Mark
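
For anyone who wants a starting point, the "little script that logs the issue and kills the process" idea from this thread could look roughly like the sketch below. This is not the script anyone in the thread actually runs. The file name oom_handler.sh, the OOM_LOG variable, the default log path, and the function names are all assumptions made up for illustration. It only records what was described above (the time of the event and the fact that an OOM occurred) and then kills the JVM so a supervisor can restart it.

```shell
#!/bin/sh
# oom_handler.sh (hypothetical name): log an OOM event, then kill the JVM.
# Usage: oom_handler.sh <pid>

# Append one compact line per event: UTC timestamp, event type, pid.
# $1 = pid of the JVM, $2 = log file to append to
log_oom() {
    printf '%s OutOfMemoryError pid=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" >> "$2"
}

# Log first, then kill, so the record exists before the process goes away.
# OOM_LOG and its default path are assumptions; point it wherever suits you.
handle_oom() {
    pid=$1
    log=${OOM_LOG:-/var/log/myapp/oom.log}
    log_oom "$pid" "$log"
    kill -9 "$pid"
}

# Only act when a pid was passed on the command line.
if [ -n "${1:-}" ]; then
    handle_oom "$1"
fi
```

It would be wired in the same way as the kill -9 variant quoted above, for example -XX:OnOutOfMemoryError="/path/to/oom_handler.sh %p" (the path is a placeholder). The log should live somewhere the restart process does not wipe, and a notification command could go next to the log_oom call.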