pan3793 commented on code in PR #8526:
URL: https://github.com/apache/hadoop/pull/8526#discussion_r3339734658


##########
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/zstd/TestZStandardCompressorDecompressor.java:
##########
@@ -557,6 +557,64 @@ public void testDecompressReturnsWhenNothingToDecompress() 
throws Exception {
     assertEquals(0, result);
   }
 
+  /**
+   * Verify that {@code setInput()} does not throw {@code 
BufferOverflowException}
+   * after a previous {@code decompress()} call threw an exception.
+   *
+   * <p>When {@code decompress()} processes compressed data, it sets
+   * {@code compressedDirectBuf.limit(bytesInCompressedBuffer)} — a value that
+   * may be smaller than {@code directBufferSize}. If {@code 
decompressDirectByteBufferStream}
+   * throws (e.g. on corrupted input), the limit is never restored. A 
subsequent
+   * {@code reset()} also does not restore {@code compressedDirectBuf.limit}.
+   * So the next {@code setInput()} call will hit {@code 
BufferOverflowException}
+   * because {@code setInputFromSavedData()} tries to {@code put()} more bytes
+   * than the current limit allows.</p>
+   *
+   * <p>This scenario occurs in practice when reading multiple zstd-compressed
+   * files from a directory: a corrupted file causes an exception 
mid-decompress,
+   * the decompressor is returned to the pool and reset, but the limit stays
+   * small. The next file's {@code setInput()} then fails.</p>
+   */
+  @Test
+  public void testSetInputAfterDecompressThrowsOnCorruptedData() throws 
Exception {

Review Comment:
   after a quick glance (also assisted by AI), other codecs seem to have a 
similar issue, but
   
   1. codec that goes native code path, which operates on raw memory, ignoring 
`ByteBuffer`'s limit, similar to previous Hadoop zstd impl
   2. class `BuiltInGzipCompressor` is marked as `@DoNotPool`, so it won't be 
reused with an illegal state
   
   our internal usage for hadoop codec is limited to gzip and lzo, we haven't 
seen similar real issues so far, and we are exploring replacing some of those 
with zstd. So I tend to keep this PR scope limited to the zstd codec for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to