[ https://issues.apache.org/jira/browse/HBASE-28589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ZhenyuLi updated HBASE-28589:
-----------------------------
    Description:
When an IOException occurs during response creation in ServerCall.setResponse(), the method catches the IOException, logs a warning, and sets the response to null. The client then receives no response, or sees connection errors, with no indication of what went wrong on the server side.

One instance of this behavior is a flaw in the fix for HBASE-14598.

The original fix for HBASE-14598 addressed two aspects:
# When a Scan/Get RPC attempts to allocate an excessively large array that could trigger an OutOfMemoryError (OOM), it checks the array size before allocation and throws a BufferOverflowException to prevent the OOM.
# The fix intended to stop client retries for such failures by throwing a DoNotRetryIOException when a BufferOverflowException occurs, as retrying cannot resolve the underlying issue.

*The Problem:* The DoNotRetryIOException is never propagated to the client side. Here's the issue flow:
# ByteBufferOutputStream.checkSizeAndGrow() throws BufferOverflowException
# The exception propagates up the call stack:
** ByteBufferOutputStream.checkSizeAndGrow()
** encoder.write()
** encodeCellsTo() (catches BufferOverflowException and turns it into DoNotRetryIOException)
** this.cellBlockBuilder.buildCellBlockStream()
** call.setResponse()
# The DoNotRetryIOException is ultimately caught in call.setResponse(), where it is merely logged but not sent back to the client
# As a result, the response is null and the Netty connection is closed, so the client keeps retrying indefinitely.

*Current Status:* In the latest branches (3.0 and 2.6), this issue still exists. In ServerCall.java, when ALLOCATOR_POOL_ENABLED_KEY (hbase.ipc.server.reservoir.enabled) is set to false, the setResponse() method follows the same problematic path.
If a DoNotRetryIOException is thrown in {{ByteBuffer b = this.cellBlockBuilder.buildCellBlock(this.connection.codec, this.connection.compressionCodec, cells);}}, it is swallowed in the setResponse() catch block and never reaches the client.

*Steps to Reproduce:*
# Set up a 3-node HBase cluster with 3 RegionServers
# Set hbase.ipc.server.reservoir.enabled to false to use ByteBufferOutputStream
# Inject a BufferOverflowException at ByteBufferOutputStream.checkSizeAndGrow() to simulate an OOM condition
# Send a scan request
# Observe endless client retries

*Expected Behavior:* The DoNotRetryIOException should be propagated to the client so that it stops retrying.

  was:
I have discovered that the fix for HBASE-14598 does not completely resolve the issue, and the problem persists in the latest branches (3.0 and 2.6).

The original fix for HBASE-14598 addressed two aspects:
# When a Scan/Get RPC attempts to allocate an excessively large array that could trigger an OutOfMemoryError (OOM), it checks the array size before allocation and throws a {{BufferOverflowException}} to prevent the OOM.
# The fix intended to stop client retries for such failures by throwing a {{DoNotRetryIOException}} when a {{BufferOverflowException}} occurs, as retrying cannot resolve the underlying issue.

*The Problem:* The {{DoNotRetryIOException}} is never propagated to the client side.
Here's the issue flow:
# {{ByteBufferOutputStream.checkSizeAndGrow()}} throws {{BufferOverflowException}}
# The exception propagates up the call stack:
** {{ByteBufferOutputStream.checkSizeAndGrow()}}
** {{encoder.write()}}
** {{encodeCellsTo()}} (catches {{BufferOverflowException}} and turns it into {{DoNotRetryIOException}})
** {{this.cellBlockBuilder.buildCellBlockStream()}}
** {{call.setResponse()}}
# The {{DoNotRetryIOException}} is ultimately caught in {{call.setResponse()}}, where it is merely logged but not sent back to the client
# As a result, the response is null and the Netty connection is closed, so the client keeps retrying indefinitely.

*Current Status:* In the latest branches (3.0 and 2.6), this issue still exists. In {{ServerCall.java}}, when {{ALLOCATOR_POOL_ENABLED_KEY}} ({{hbase.ipc.server.reservoir.enabled}}) is set to {{false}}, the {{setResponse()}} method follows the same problematic path. If a {{DoNotRetryIOException}} is thrown in {{ByteBuffer b = this.cellBlockBuilder.buildCellBlock(this.connection.codec, this.connection.compressionCodec, cells);}}, it is swallowed in the {{setResponse()}} catch block and never reaches the client.

*Steps to Reproduce:*
# Set up a 3-node HBase cluster with 3 RegionServers
# Set {{hbase.ipc.server.reservoir.enabled}} to {{false}} to use {{ByteBufferOutputStream}}
# Inject a {{BufferOverflowException}} at {{ByteBufferOutputStream.checkSizeAndGrow()}} to simulate an OOM condition
# Send a scan request
# Observe endless client retries

*Expected Behavior:* The {{DoNotRetryIOException}} should be propagated to the client so that it stops retrying.

> Server side DoNotRetryException not propagated to client
> --------------------------------------------------------
>
>                 Key: HBASE-28589
>                 URL: https://issues.apache.org/jira/browse/HBASE-28589
>             Project: HBase
>          Issue Type: Bug
>          Components: IPC/RPC
>    Affects Versions: 2.0.0, 2.4.0, 2.5.0, 2.6.0, 3.0.0
>            Reporter: ZhenyuLi
>            Priority: Major
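
The failure mode described in this issue can be illustrated with a small, self-contained sketch. It is not HBase code and not the proposed fix: SetResponseSketch, Response, buildCellBlock(int, int), the two setResponse variants, the toy client loop, and the buffer sizes are all hypothetical, and DoNotRetryIOException is redeclared locally as a stand-in so the example compiles without HBase on the classpath. It only contrasts swallowing the exception (response left null) with handing it back to the caller.

```java
import java.io.IOException;
import java.nio.BufferOverflowException;

public class SetResponseSketch {
    /** Local stand-in for HBase's DoNotRetryIOException (assumption: redeclared for the sketch). */
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String msg, Throwable cause) { super(msg, cause); }
    }

    /** Hypothetical response holder: either a payload or an error for the client. */
    static final class Response {
        final byte[] payload;
        final IOException error;
        Response(byte[] payload, IOException error) { this.payload = payload; this.error = error; }
    }

    /** Mimics the encoding step: an oversized request overflows and is converted. */
    static byte[] buildCellBlock(int requestedSize, int maxSize) throws IOException {
        try {
            if (requestedSize > maxSize) {
                throw new BufferOverflowException();
            }
            return new byte[requestedSize];
        } catch (BufferOverflowException e) {
            throw new DoNotRetryIOException("cell block exceeds " + maxSize + " bytes", e);
        }
    }

    /** Pattern described in the report: log a warning and leave the response null. */
    static Response setResponseSwallowing(int requestedSize, int maxSize) {
        try {
            return new Response(buildCellBlock(requestedSize, maxSize), null);
        } catch (IOException e) {
            System.err.println("WARN: exception while creating response: " + e);
            return null;
        }
    }

    /** One possible alternative: turn the failure into an error response. */
    static Response setResponsePropagating(int requestedSize, int maxSize) {
        try {
            return new Response(buildCellBlock(requestedSize, maxSize), null);
        } catch (IOException e) {
            return new Response(null, e);
        }
    }

    /** Toy client: retries when no response arrives, stops on success or DoNotRetryIOException. */
    static int attemptsUntilGiveUp(boolean propagate, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            Response r = propagate
                ? setResponsePropagating(2048, 1024)
                : setResponseSwallowing(2048, 1024);
            if (r != null && (r.error == null || r.error instanceof DoNotRetryIOException)) {
                break;
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println("swallowing: " + attemptsUntilGiveUp(false, 5) + " attempts");
        System.out.println("propagating: " + attemptsUntilGiveUp(true, 5) + " attempts");
    }
}
```

In this toy model the swallowing variant uses every allowed attempt, while the propagating variant stops after the first one, which matches the expected behavior stated in the description.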