[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #11496: [POC] close channel on direct oom

via GitHub Thu, 07 Sep 2023 13:10:04 -0700


jasperjiaguo commented on code in PR #11496:
URL: https://github.com/apache/pinot/pull/11496#discussion_r1319073457



##########
pinot-core/src/main/java/org/apache/pinot/core/transport/DataTableHandler.java:
##########
@@ -83,5 +98,12 @@ protected void channelRead0(ChannelHandlerContext ctx, 
ByteBuf msg) {
   public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
     LOGGER.error("Caught exception while handling response from server: {}", 
_serverRoutingInstance, cause);
     
_brokerMetrics.addMeteredGlobalValue(BrokerMeter.RESPONSE_FETCH_EXCEPTIONS, 1);
+    if (cause instanceof java.lang.OutOfMemoryError) {

Review Comment:
   Hey Gonzalo, thanks for bringing this up! In fact I had similar concerns. 
Hope the following answers can show you the rational of why I still think this 
would be a valid solution. 
   > First, they can are thrown anyway, not just in the place where most of the 
memory is being consumed.
   Exactly. However, in this case we are dealing with a out of direct memory. 
On broker the vast majority of direct memory is allocated to buffer response 
(netty inbound). Therefore, when a direct memory OOM is thrown here within the 
netty lib, we are pretty sure it comes from large server response. This is only 
a POC code now, we we finalize it we would catch the direct memory oom only and 
throw the heap memory oom (as opposed to catch everything before).
   
   > Second, they fail in random spots and therefore may leave the system in an 
unexpected state.
   This is very true for Heap oom, in which case the system is hard to recover. 
That's also the reason why we are doing heap memory accounting and try to kill 
query execution before they trigger ooms on heap. However, in this case if it's 
direct memory oom, the code execution will be less impacted. In this case, we 
can try to recover.
   
   >Third, while we are catching the exception we may be in a situation where 
we are actually out of memory, so the code that catches the exception may also 
throw a new OOM. In this case, for example, we are logging and formatting 
strings (which generates garbage) and calling markServerDown, which may 
allocate.
   In this case it's fine to generate slight heap garbage because only direct 
memory is ooming.
   
   >Also: Shouldn't we just drop the messages in channelRead0 when responseSize 
is larger than a safe value?
   No it is insufficient. In this case the error happens in netty code before 
we actually go into `channelRead0`. We used to catch and log the error through 
extending `exceptionCaught`. I've also evaluated accounting in `channelRead0`; 
however, it would be too late since the call will stuck in netty lib and never 
get to `channelRead0`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #11496: [POC] close channel on direct oom

Reply via email to