jasperjiaguo commented on code in PR #11496: URL: https://github.com/apache/pinot/pull/11496#discussion_r1319073457
########## pinot-core/src/main/java/org/apache/pinot/core/transport/DataTableHandler.java: ########## @@ -83,5 +98,12 @@ protected void channelRead0(ChannelHandlerContext ctx, ByteBuf msg) { public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) { LOGGER.error("Caught exception while handling response from server: {}", _serverRoutingInstance, cause); _brokerMetrics.addMeteredGlobalValue(BrokerMeter.RESPONSE_FETCH_EXCEPTIONS, 1); + if (cause instanceof java.lang.OutOfMemoryError) { Review Comment: Hey Gonzalo, thanks for bringing this up! In fact I had similar concerns. Hope the following answers can show you the rational of why I still think this would be a valid solution. > First, they can are thrown anyway, not just in the place where most of the memory is being consumed. Exactly. However, in this case we are dealing with a out of direct memory. On broker the vast majority of direct memory is allocated to buffer response (netty inbound). Therefore, when a direct memory OOM is thrown here within the netty lib, we are pretty sure it comes from large server response. This is only a POC code now, we we finalize it we would catch the direct memory oom only and throw the heap memory oom (as opposed to catch everything before). > Second, they fail in random spots and therefore may leave the system in an unexpected state. This is very true for Heap oom, in which case the system is hard to recover. That's also the reason why we are doing heap memory accounting and try to kill query execution before they trigger ooms on heap. However, in this case if it's direct memory oom, the code execution will be less impacted. In this case, we can try to recover. >Third, while we are catching the exception we may be in a situation where we are actually out of memory, so the code that catches the exception may also throw a new OOM. In this case, for example, we are logging and formatting strings (which generates garbage) and calling markServerDown, which may allocate. In this case it's fine to generate slight heap garbage because only direct memory is ooming. >Also: Shouldn't we just drop the messages in channelRead0 when responseSize is larger than a safe value? No it is insufficient. In this case the error happens in netty code before we actually go into `channelRead0`. We used to catch and log the error through extending `exceptionCaught`. I've also evaluated accounting in `channelRead0`; however, it would be too late since the call will stuck in netty lib and never get to `channelRead0`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org