Re: ContainerBackgroundProcessor and compounding OOMEs

Christopher Schultz Tue, 15 Jul 2014 12:43:58 -0700

Mark,

On 7/15/14, 12:44 PM, Mark Thomas wrote:
> On 14 July 2014 21:58:22 CEST, Christopher Schultz 
> <ch...@christopherschultz.net> wrote:
>> On 7/14/14, 12:18 PM, Mark Thomas wrote:
> 
>>>> 2. If the exception is VirtualMachineError, it gets re-thrown with
>>>> no log. This skips the "log" call in the above code and so the only
>>>> log will come from the VM's "unhandled exception" logger which may
>>>> not go where you expect it to go.
>>>
>>> It goes to stderr unless the system admin has redirected it and if
>>> they have, they should know where to look for it. Further, they should
>>> also be monitoring it.
>>
>> True, but the background thread does die.
> 
> So? The VM is now in an unknown and potentially unstable state. Restarting 
> the VM is the only sensible way forward so the fact that the background 
> thread has died and any negative consequences of that are irrelevant.
> 
>>>> If we think that StackOverflowError is recoverable, why not
>>>> OutOfMemory?
>>>
>>> There is no assumption that StackOverflowError is recoverable.
>>
>>    public static void handleThrowable(Throwable t) {
>>        if (t instanceof ThreadDeath) {
>>            throw (ThreadDeath) t;
>>        }
>>        if (t instanceof StackOverflowError) {
>>            // Swallow silently - it should be recoverable
>>            return;
>>        }
>>        if (t instanceof VirtualMachineError) {
>>            throw (VirtualMachineError) t;
>>        }
>>        // All other instances of Throwable will be silently swallowed
>>    }
>>
>> As of April 17, 2014, markt disagrees with you (r1588269):
> 
> I disagree with him all the time. In this case though I think it is just 
> context and that probably just needs a better comment.
> 
> Stack overflow in application code should be recoverable as far as Tomcat is 
> concerned. The implication for the application that triggered it may be more 
> severe. There is no assumption that it is recoverable for the application 
> (hence the ERROR level log) but there is a possibility that it might be 
> recoverable therefore - like just about every other application level error - 
> Tomcat carries on.
> 
> 
>> http://svn.apache.org/viewvc/tomcat/trunk/java/org/apache/jasper/util/ExceptionUtils.java?r1=1588269&r2=1588268&pathrev=1588269
>>
>>> That is why it is logged at ERROR level.
>>
>> You are right, StackOverflow gets logged, but things like OOME are not
>> -- they are left to the unhandled exception handler. But they do
>> take down the background thread when they occur.
> 
> Other VM errors are assumed to be fatal. Hence I don't care if the background 
> processing thread dies in this case.
> 
> 
>>> It is guaranteed that only that
>>> thread was affected so you know (from the stack trace) exactly what
>>> failed where and you can also determine how bad things are and opt to
>>> restart Tomcat if necessary.
>>
>> For OOME, if you are using sessions you will need to schedule a restart
>> pretty much immediately, since they will never die.
>>
>>> With respect to OOME how, exactly, do you propose to differentiate
>>> between a "recoverable" OOME and a non-recoverable one?
>>
>> If the admin is monitoring (which we are), then they can make the
>> determination. The current situation is that Tomcat needs to be bounced
>> regardless of the severity of the OOME, because the background thread
>> stops.
> 
> I disagree. You have no way of knowing the state of the VM after an OOME. You 
> can hope the VM is OK but the only safe option is a restart.
> 
>>>> What about other VirtualMachineErrors?
>>>
>>> The current position is that they are all non-recoverable. If it can
>>> be demonstrated that more of them should be treated like StackOverflow
>>> then we can do so.
>>
>> Okay. I think OOME should be treated similarly to StackOverflowError.
> 
> I strongly disagree. A stack overflow is guaranteed to only affect
> the thread that reports the error. It is my understanding that no
> such guarantee exists for an OOME. I'm prepared to be convinced
> otherwise (eg by something in one of the Java specs) but until that
> happens I am -1 on treating OOME like a stack overflow.


We have had situations /in production/ where a "single thread" has
triggered an OOME .. in a particular case, that thread was trying to
pull an unbounded set of results back from a database query and cache
them in a local data structure. The OOME killed the operation, the GC
kicked-in, trashed that huge structure, and the system stabilized.

We didn't want to bounce the server in the middle of the day and
interrupt people if we didn't have to. We watched things closely and
were convinced that it was in fact stable. We waited for the weekend and
bounced the JVM -- just to be "safe".

Had the background thread died in that situation, we would have had to
bounce the JVM within about 30 minutes given our user load since the
sessions would never die. We also have to meet certain "security
requirements" about expiration of idle connections, and without the
background thread, we cannot enforce those rules. (Note that, when the
background thread is dead, you can't even use JMX to ask that expired
sessions be expired: the 'backgroundProcess' operation seems to try to
nudge the background processor which .. isn't around anymore).
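
To illustrate the concern, here is a minimal sketch of a background loop
that survives an OOME thrown by a single iteration. This is not Tomcat's
actual ContainerBackgroundProcessor code: the class name
ResilientBackgroundLoop, the method runLoop and the bounded iteration
count are all hypothetical, and a real loop would sleep between runs.

```java
// Hypothetical sketch, not Tomcat code: one failing iteration is reported
// and the loop carries on instead of letting the thread die.
final class ResilientBackgroundLoop {

    /**
     * Runs the task up to maxRuns times and returns how many runs failed
     * with an OutOfMemoryError or StackOverflowError.
     */
    static int runLoop(Runnable task, int maxRuns) {
        int failures = 0;
        for (int i = 0; i < maxRuns; i++) {
            try {
                task.run();
            } catch (OutOfMemoryError | StackOverflowError e) {
                // Assumption: the admin is monitoring stderr and decides
                // whether a restart is needed.
                failures++;
                System.err.println("Background task failed, continuing: " + e);
            }
            // Any other Error still propagates and ends the loop.
        }
        return failures;
    }
}
```

With something like this, session expiration would keep running after a
one-off OOME, and the decision to restart stays with whoever is watching
the logs.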

This discussion has become even more academic since Konstantin has begun
a bit of work already to help alleviate this. Thanks, K.

> I'll add that if it can be demonstrated that any of the other virtual
> machine errors only affect the current thread and do not indicate
> something is fundamentally broken in the VM then I'd be happy to
> treat them the same way as a stack overflow. (Given what virtual
> machine error is used for I think this is unlikely).

I checked, and I think you are right: VME is basically StackOverFlow,
OOME, and "holy shit we have no idea what in god's name is happening
here". I definitely agree that the latter cases should be treated as
completely catastrophic and the background processor can die for all I
care. (Note that I've never seen any of those other classes of errors in
the real world: perhaps I'll be upset when they take down my Tomcat
background thread in the future ;)

> I'll also add that if the current blanket 'allowing' of stack
> overflow can be shown to be harmful in some cases then I'd be in
> favour of changing that handling to not re-throw the stack overflow
> only in those cases where it is known to be safe to do so.

-chris
