On Apr 24, 2013, at 12:22 PM, Timothy Potter <thelabd...@gmail.com> wrote:

> I like the idea of running a script vs. kill -9 ;-) Right now when a
> node fails, we have monitors for whether a node is up and serving
> queries. If not, that triggers some manual investigation and restart
> process. Part of the process was to capture the logs and heap dump
> file. What happened previously is that the log capture part wasn't
> scripted into the restart process and so the logs got wiped out when
> the restart happened :-(
> 
> One question about this - when you say "logs the issue" from your
> script - what type of things do you log? I've been relying on the
> timestamp of the heap dump (hprof) as a way to trace back into our log
> files.

Yeah, that's pretty much it - the time of the event and the fact that an OOM 
occurred. If you are dropping a heap dump, that has the same info, but a log is 
just a nice compact little history of events.

- Mark
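
For reference, a rough sketch of the kind of handler script described in this thread: it appends one line with the time of the event and the fact that an OOM occurred, then kills the process so a supervisor can restart it. The function names, the log line format, and the script path in the comment are all made up for illustration; nothing here is taken from an actual setup.

```shell
#!/bin/sh
# Hypothetical OOM handler. Assumed wiring on the JVM side (path is a placeholder):
#   -XX:OnOutOfMemoryError="/path/to/oom_handler.sh %p" -XX:+HeapDumpOnOutOfMemoryError

# Append one compact line: UTC timestamp, the fact that an OOM occurred, and the pid.
# $1 = pid of the JVM, $2 = log file to append to
log_oom() {
    printf '%s OutOfMemoryError pid=%s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$1" >> "$2"
}

# Log first, then kill, so the line is on disk before any supervisor restart.
# $1 = pid of the JVM, $2 = log file to append to
handle_oom() {
    log_oom "$1" "$2"
    kill -9 "$1"
}
```

A real script would end by calling handle_oom with the pid passed in as "$1" and a log path that the restart process does not wipe. A notification step (mail, pager, and so on) could go between the log line and the kill.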

> 
> Thanks.
> Tim
> 
> On Wed, Apr 24, 2013 at 10:03 AM, Mark Miller <markrmil...@gmail.com> wrote:
>> 
>> On Apr 24, 2013, at 12:00 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> 
>>>> -XX:OnOutOfMemoryError="kill -9 %p" -XX:+HeapDumpOnOutOfMemoryError
>> 
>> The way I like to handle this is to have the OOM trigger a little script or 
>> set of cmds that logs the issue and kills the process.
>> 
>> Then if you have the process supervised (via runit or something), it will 
>> just start back up (what else do you do after an OOM?), but you will have 
>> logged something, triggered a notification, whatever.
>> 
>> - Mark
