nastra commented on PR #13565: URL: https://github.com/apache/iceberg/pull/13565#issuecomment-3078792691
> @nastra Sure! Actually, this time I wasn't writing about the internal `closed` flag, but now that you point at it, let's discuss it quickly 😉: if any code in the try block throws an exception, which would most likely happen in [`temp.flush()`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1826), the writer is marked `closed` regardless of whether closing succeeded, because that happens in the `finally` block.
>
> What I was actually referring to is the field [`state`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L256), which nicely models the progression through the different stages of writing a Parquet file. That one is trickier than this case, where only a single line needs to be moved to another location. The field records which stage the writer is currently in, and the writer is only allowed to advance through the carefully designed stages. But if some IO fails, the internal `state` has already advanced to the next stage, making it impossible to go back and try again. If you look at the exception in the attached test, you can see the underlying cause of the final exception:
>
> ```
> Caused by: java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: ENDED
>     at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:250)
>     at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:224)
>     at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:586)
>     at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:215)
>     ... 100 more
> ```
>
> Being in state `ENDED` suggests that the transition into this state happened [here](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1808), where one can clearly see that the transition should happen only after everything has been processed successfully.
>
> But how should this be solved? Since the [`end()` method](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1806-L1818) finalizes the file with metadata, it is crucial not to write that metadata on every attempt the writer makes to close the output. One could try to introduce new entries in the `STATE` enum which indicate that the writer is about to enter a state (e.g. `STARTING_BLOCK`, etc.).

The exception you're seeing, where the state is already set to `ENDED`, is a result of moving the `closed` flag to the end in `ParquetWriter`, since that flag was also preventing https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java#L261 from being called multiple times. I haven't fully investigated the issue yet, as I'm currently trying to understand what your expectation in the test is.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
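The retry problem discussed above can be sketched in isolation. The class and method names below are hypothetical, not the actual Parquet or Iceberg code; the sketch only illustrates the pattern: if `close()` flips its `closed` flag in a `finally` block (or advances a state enum before the IO has succeeded), a failed first attempt leaves the writer marked as closed and a retry becomes a no-op, so the footer is never written. Deferring the flag until after everything succeeds keeps `close()` retryable while a successful close stays idempotent.

```java
import java.io.IOException;

class SketchWriter {
  private boolean closed = false;
  private boolean failNextFlush; // test hook simulating a transient IO error

  int footerWrites = 0;          // the footer must be written exactly once per file

  SketchWriter(boolean failFirstFlush) {
    this.failNextFlush = failFirstFlush;
  }

  void close() throws IOException {
    if (closed) {
      return;                    // already finalized successfully: no-op
    }
    flush();                     // may throw, like temp.flush() in the discussion above
    writeFooter();
    closed = true;               // mark closed only after everything succeeded
  }

  private void flush() throws IOException {
    if (failNextFlush) {
      failNextFlush = false;     // the next attempt succeeds
      throw new IOException("simulated transient flush failure");
    }
  }

  private void writeFooter() {
    footerWrites++;
  }
}

public class CloseRetryDemo {
  public static void main(String[] args) throws Exception {
    SketchWriter w = new SketchWriter(true);
    try {
      w.close();                 // first attempt fails inside flush()
    } catch (IOException e) {
      // state was not advanced, so the writer can be closed again
    }
    w.close();                   // retry succeeds and writes the footer once
    w.close();                   // idempotent: the footer is not written again
    System.out.println(w.footerWrites);
  }
}
```

Had `closed = true` been placed in a `finally` block instead, the retry would return immediately and `footerWrites` would stay at 0, which mirrors how an eagerly advanced `state` leaves the real writer stuck in `ENDED`.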