nastra commented on PR #13565: URL: https://github.com/apache/iceberg/pull/13565#issuecomment-3078792691
> @nastra Sure! Actually, this time I wasn't writing about the internal `closed` flag, but now that you point at it, let's discuss it quickly 😉: if any code in the try block throws an exception, which would most likely happen in [`temp.flush()`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1826), the writer is marked `closed` regardless of whether closing succeeded, because that happens in the `finally` block.
>
> What I was actually referring to is the field [`state`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L256), which nicely models the progression through the different stages of writing a Parquet file. That one is trickier than this case, where only a single line needs to be moved to another location. The field records which stage the writer is currently in, and the writer is only allowed to advance through the carefully designed stages. But if some IO fails, the internal `state` has already advanced to the next stage, making it impossible to go back and try again. If you look at the exception in the attached test, you can see the underlying cause of the final exception:
>
> ```
> Caused by: java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: ENDED
>     at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:250)
>     at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:224)
>     at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:586)
>     at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:215)
>     ... 100 more
> ```
>
> Being in state `ENDED` suggests that the transition into this state happened [here](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1808), where one can clearly see that the transition should happen only after everything has been processed successfully.
>
> But how should this be solved? Since the [`end()` method](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1806-L1818) finalizes the file with metadata, it is crucial not to write that metadata on every attempt the writer makes to close the output. One could try to introduce new entries in the `STATE` enum which indicate that the writer is about to enter a state (e.g. `STARTING_BLOCK`, etc.).

The exception you're seeing, where the state is already set to `ENDED`, is a result of moving the `closed` flag to the end in `ParquetWriter`, since that flag was also preventing https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java#L261 from being called multiple times. I haven't fully investigated the issue yet, as I'm currently trying to understand what your expectation in the test is.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
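The retry problem discussed above can be sketched in isolation. The class and method names below are hypothetical, not the actual Parquet or Iceberg code; the sketch only illustrates the pattern: if `close()` flips its `closed` flag in a `finally` block (or advances a state enum before the IO has succeeded), a failed first attempt leaves the writer marked as closed and a retry becomes a no-op, so the footer is never written. Deferring the flag until after everything succeeds keeps `close()` retryable while a successful close stays idempotent.

```java
import java.io.IOException;

class SketchWriter {
  private boolean closed = false;
  private boolean failNextFlush; // test hook simulating a transient IO error

  int footerWrites = 0;          // the footer must be written exactly once per file

  SketchWriter(boolean failFirstFlush) {
    this.failNextFlush = failFirstFlush;
  }

  void close() throws IOException {
    if (closed) {
      return;                    // already finalized successfully: no-op
    }
    flush();                     // may throw, like temp.flush() in the discussion above
    writeFooter();
    closed = true;               // mark closed only after everything succeeded
  }

  private void flush() throws IOException {
    if (failNextFlush) {
      failNextFlush = false;     // the next attempt succeeds
      throw new IOException("simulated transient flush failure");
    }
  }

  private void writeFooter() {
    footerWrites++;
  }
}

public class CloseRetryDemo {
  public static void main(String[] args) throws Exception {
    SketchWriter w = new SketchWriter(true);
    try {
      w.close();                 // first attempt fails inside flush()
    } catch (IOException e) {
      // state was not advanced, so the writer can be closed again
    }
    w.close();                   // retry succeeds and writes the footer once
    w.close();                   // idempotent: the footer is not written again
    System.out.println(w.footerWrites);
  }
}
```

Had `closed = true` been placed in a `finally` block instead, the retry would return immediately and `footerWrites` would stay at 0, which mirrors how an eagerly advanced `state` leaves the real writer stuck in `ENDED`.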