This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 0ecc71b [SPARK-29871][ML] Catch all exceptions for handling invalid
images in image source
0ecc71b is described below
commit 0ecc71bbf979f13e7260af93c4bffa8c133dc9ea
Author: Hyukjin Kwon <[email protected]>
AuthorDate: Fri Oct 8 09:04:13 2021 +0900
[SPARK-29871][ML] Catch all exceptions for handling invalid images in image
source
### What changes were proposed in this pull request?
This PR fixes the test failure:
```
Running tests...
----------------------------------------------------------------------
test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest) ...
ERROR (12.050s)
======================================================================
ERROR [12.050s]: test_read_images
(pyspark.ml.tests.test_image.ImageFileFormatTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests/test_image.py",
line 35, in test_read_images
self.assertEqual(df.count(), 4)
File
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py",
line 507, in count
return int(self._jdf.count())
File
"/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
line 1286, in __call__
answer, self.gateway_client, self.target_id, self.name)
File
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py",
line 98, in deco
return f(*a, **kw)
File
"/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o32.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0
(TID 1, amp-jenkins-worker-05.amp, executor driver):
javax.imageio.IIOException: Unsupported Image Type
at
com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1079)
at
com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050)
at javax.imageio.ImageIO.read(ImageIO.java:1448)
at javax.imageio.ImageIO.read(ImageIO.java:1352)
```
This exception apparently occurs when handling malformed images with the
`dropInvalid` option set: `ImageIO.read` throws `javax.imageio.IIOException`
for an invalid image, and because that exception is not a
`RuntimeException`, the existing `catch` clause does not handle it.
In fact, the bytes are already in memory, so a real I/O exception would not
happen during `ImageIO.read`. Therefore, this PR proposes to catch all
exceptions when reading an image so that malformed images are handled properly.
I am not yet sure why the test is flaky instead of consistently failing.
However, the fix should be correct.
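The reasoning above can be illustrated with a minimal standalone sketch that uses only the JDK, without Spark. `DecodeSketch` and `decodeOrNone` are hypothetical names for illustration, not part of `ImageSchema`. Note one deliberate difference: this PR catches `Throwable`, whereas the sketch uses `scala.util.control.NonFatal`, a narrower choice that is an assumption of the sketch and not what the patch does.

```scala
import java.awt.image.BufferedImage
import java.io.{ByteArrayInputStream, IOException}
import javax.imageio.{IIOException, ImageIO}
import scala.util.control.NonFatal

// Hypothetical helper for illustration only; not Spark's actual code.
object DecodeSketch {
  // IIOException extends java.io.IOException, so a handler that only
  // matches RuntimeException lets it propagate.
  val iioIsIOException: Boolean =
    classOf[IOException].isAssignableFrom(classOf[IIOException])
  val iioIsRuntimeException: Boolean =
    classOf[RuntimeException].isAssignableFrom(classOf[IIOException])

  // Decode an image from bytes that are already in memory.
  // Returns None when no reader recognizes the data (ImageIO.read returns
  // null) or when decoding throws a non-fatal exception such as IIOException.
  def decodeOrNone(bytes: Array[Byte]): Option[BufferedImage] =
    try {
      Option(ImageIO.read(new ByteArrayInputStream(bytes)))
    } catch {
      // Assumption: NonFatal instead of the Throwable used in this PR,
      // so that errors like OutOfMemoryError still propagate.
      case NonFatal(_) => None
    }
}
```

Calling `DecodeSketch.decodeOrNone(Array[Byte](1, 2, 3))` yields `None`, since the bytes are not a decodable image.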
### Why are the changes needed?
To fix the flaky tests. See https://github.com/apache/spark/runs/3802639160
for an example.
### Does this PR introduce _any_ user-facing change?
With the `dropInvalid` option enabled, users would be able to read malformed
data even when `javax.imageio.IIOException` (or another unexpected
non-runtime exception) is thrown, instead of the job failing.
### How was this patch tested?
Existing unit tests. We should track whether the tests are still flaky.
Closes #34187 from HyukjinKwon/SPARK-29871.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
index 37b7159..242496f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
@@ -133,9 +133,12 @@ object ImageSchema {
val img = try {
ImageIO.read(new ByteArrayInputStream(bytes))
} catch {
- // Catch runtime exception because `ImageIO` may throw unexpected
`RuntimeException`.
- // But do not catch the declared `IOException` (regarded as FileSystem
failure)
- case _: RuntimeException => null
+ // Note that:
+ // - At this point, the files are already read from the files as bytes.
Therefore,
+ // no real I/O exceptions are expected.
+ // - `ImageIO.read` can throw `javax.imageio.IIOException` that is
technically
+ // a runtime exception but it inherits IOException.
+ case _: Throwable => null
}
if (img == null) {