fpetersen-gl commented on PR #13565: URL: https://github.com/apache/iceberg/pull/13565#issuecomment-3078659441
@nastra Sure! Actually, this time I wasn't writing about the internal `closed` flag, but now that you point at it, let's discuss it quickly :wink:: If any code in the try-catch block throws an exception, which would most likely happen in [`temp.flush()`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1826), the writer is set to `closed` regardless of the outcome, because this happens in the `finally` block.

What I was actually referring to is the field [`state`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L256), which nicely models the progression through the different stages of writing a Parquet file. This one is trickier than the `closed` case, where only a single line needs to be moved to another location. The field indicates which stage the writer is currently in and is only allowed to advance through the carefully designed stages. But if some IO fails, the internal `state` has already advanced into the next stage, making it impossible to go back and try again. If you look at the exception of the attached test, you can see the underlying cause of the final exception:

```
Caused by: java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: ENDED
	at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:250)
	at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:224)
	at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:586)
	at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:215)
	... 100 more
```

Being in state `ENDED` suggests that the transition into this state happened [here](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1808), where one can clearly see that the transition should only happen after everything has been processed successfully.

But how should this be solved? Since the [end() method](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1806-L1818) finalizes the file with some metadata, it is crucial not to write that metadata on every attempt the writer makes to close the output. One could try to introduce new entries in the `STATE` enum that indicate the writer is about to enter a state (e.g. `STARTING_BLOCK` etc.).
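
To make that idea a bit more concrete, here is a minimal sketch of the "about to enter a state" approach applied to `end()`. This is not the actual `ParquetFileWriter` code: the enum, the `ENDING` entry, and the method/type names (`endEagerly`, `endSafely`, `writeFooter`, `FileMetadataStub`) are simplified stand-ins I made up for illustration. The point is only that the terminal state is committed after the I/O succeeds, so a failed attempt stays retryable while the footer is still written at most once:

```java
import java.io.IOException;

// Hypothetical simplification, NOT the real parquet-java state machine.
class StateMachineSketch {

  // Simplified stand-in for ParquetFileWriter.STATE; the real enum has more states
  // and enforces transitions differently.
  enum State { STARTED, ENDING, ENDED }

  private State state = State.STARTED;

  // Current behaviour (simplified): the state advances before the I/O that can fail,
  // so a retry after an IOException is rejected because we are already ENDED.
  void endEagerly(FileMetadataStub footer) throws IOException {
    state = State.ENDED;   // transition happens first
    writeFooter(footer);   // if this throws, the state is already ENDED
  }

  // Sketch of the proposed fix: record an intermediate ENDING state and only
  // commit to ENDED once the footer has actually been written.
  void endSafely(FileMetadataStub footer) throws IOException {
    if (state == State.ENDED) {
      // Footer already written successfully; never write it a second time.
      return;
    }
    state = State.ENDING;  // we are attempting to end, but a retry is still allowed
    writeFooter(footer);   // may throw; the state then stays ENDING
    state = State.ENDED;   // only reached after successful I/O
  }

  private void writeFooter(FileMetadataStub footer) throws IOException {
    // Placeholder for the flush/footer serialization done by the real writer.
  }

  // Placeholder type standing in for the footer metadata.
  static final class FileMetadataStub {}
}
```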