I like the idea of running a script vs. kill -9 ;-) Right now we have monitors that check whether a node is up and serving queries. When a node fails, that triggers a manual investigation and restart. Part of that process is supposed to capture the logs and the heap dump file, but the log capture step was never scripted into the restart, so the logs got wiped out when the restart happened :-(
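
For what it's worth, here is a rough sketch of how the log capture could be folded into the OOM hook itself, so nothing depends on someone remembering to do it before the restart. This is only an illustration, not what we run today: the script name (oom_hook.py), the archive directory, and the file paths are all placeholders I made up, and it simply copies whatever files it is given into a timestamped directory before killing the process.

```python
#!/usr/bin/env python3
"""Hypothetical OOM hook: preserve the given files, then kill the JVM."""
import os
import shutil
import signal
import sys
import time
from pathlib import Path


def preserve_artifacts(sources, archive_root, pid, now=None):
    """Copy each existing file in `sources` into a fresh timestamped directory."""
    # now=None means "current time"; tests can pass a fixed epoch value.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    dest = Path(archive_root) / f"oom-{stamp}-pid{pid}"
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, sources):
        if src.is_file():  # silently skip paths that do not exist
            shutil.copy2(src, dest / src.name)  # copy2 keeps the original mtime
            copied.append(src.name)
    # Small marker file so the event can be matched against the logs later.
    (dest / "OOM_EVENT.txt").write_text(
        f"pid={pid} time={stamp} copied={copied}\n"
    )
    return dest


def main(argv):
    # Assumed usage: oom_hook.py PID ARCHIVE_DIR [FILE ...]
    pid = int(argv[1])
    preserve_artifacts(argv[3:], argv[2], pid)
    os.kill(pid, signal.SIGKILL)  # same effect as kill -9


if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv)
```

It would be wired in with something like -XX:OnOutOfMemoryError="/path/to/oom_hook.py %p /path/to/archive /path/to/node.log" (again, placeholder paths). With the process supervised, the restart then happens on its own and the logs are already safe by the time it does.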

One question about your approach: when you say your script "logs the issue", what type of things do you log? I've been relying on the timestamp of the heap dump (hprof) as a way to trace back into our log files.

Thanks.
Tim

On Wed, Apr 24, 2013 at 10:03 AM, Mark Miller <markrmil...@gmail.com> wrote:
>
> On Apr 24, 2013, at 12:00 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>>> -XX:OnOutOfMemoryError="kill -9 %p" -XX:+HeapDumpOnOutOfMemoryError
>
> The way I like to handle this is to have the OOM trigger a little script or
> set of cmds that logs the issue and kills the process.
>
> Then if you have the process supervised (via runit or something), it will
> just start back up (what else do you do after an OOM?), but you will have
> logged something, triggered a notification, whatever.
>
> - Mark