Re: [PR] CORE: [PARQUET] Log corrupted parquet filenames to trace bad nodes that may have written them. [iceberg]

via GitHub Wed, 21 May 2025 09:00:22 -0700


dhruv-pratap commented on code in PR #13108:
URL: https://github.com/apache/iceberg/pull/13108#discussion_r2100663158



##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java:
##########
@@ -120,18 +125,27 @@ public boolean hasNext() {
 
     @Override
     public T next() {
-      if (valuesRead >= nextRowGroupStart) {
-        advance();
-      }
-
-      if (reuseContainers) {
-        this.last = model.read(last);
-      } else {
-        this.last = model.read(null);
+      try {
+        if (valuesRead >= nextRowGroupStart) {
+          advance();
+        }
+
+        if (reuseContainers) {
+          this.last = model.read(last);
+        } else {
+          this.last = model.read(null);
+        }
+        valuesRead += 1;
+
+        return last;
+      } catch (ParquetDecodingException e) {
+        if (reader != null) {
+          // Knowing the exact parquet file is essential for tracing bad nodes
+          // that produced the corrupt file, parquet lib doesn't do this today.
+          LOG.error("Error decoding Parquet file {}", reader.getFile(), e);

Review Comment:
   My bad, it indeed does. Unfortunately, though, 
`org.apache.parquet.hadoop.ParquetFileReader` encapsulates the `InputFile` and 
only exposes `getFile()` to get the Parquet file location. There is also 
`getPath()`, but it has been marked as deprecated.
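
For illustration only, here is a minimal, dependency-free sketch of the pattern
under discussion: catch the decoding failure, report the file location obtained
from `getFile()`, and rethrow. `LocatedReader` and `DecodingException` are
hypothetical stand-ins for `ParquetFileReader` and `ParquetDecodingException`,
and `System.err` stands in for the logger, so that the example compiles without
Parquet or SLF4J on the classpath. It is not the code in this PR.

```java
import java.util.function.Supplier;

public class FileAwareDecoding {
  /** Hypothetical stand-in for ParquetDecodingException (an unchecked exception). */
  static class DecodingException extends RuntimeException {
    DecodingException(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for the one ParquetFileReader method used here. */
  interface LocatedReader {
    // Assumed to return the file location as a String, as getFile() does.
    String getFile();
  }

  /** Builds the message; falls back to a placeholder when no reader is open. */
  static String failureMessage(LocatedReader reader) {
    String location = reader != null ? reader.getFile() : "<unknown>";
    return "Error decoding Parquet file " + location;
  }

  /** Runs the decode step, reports the file location on failure, then rethrows. */
  static <T> T readWithLocation(LocatedReader reader, Supplier<T> decode) {
    try {
      return decode.get();
    } catch (DecodingException e) {
      // System.err replaces LOG.error(...) only to keep the sketch self-contained.
      System.err.println(failureMessage(reader) + ": " + e.getMessage());
      throw e; // the original exception still propagates to the caller
    }
  }

  public static void main(String[] args) {
    LocatedReader reader = () -> "s3://bucket/data/file.parquet"; // made-up location
    System.out.println(readWithLocation(reader, () -> "row-1"));
  }
}
```

Rethrowing the original exception keeps the caller's error handling unchanged;
the only addition is that the file location is reported before the failure
propagates.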



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


