fpetersen-gl commented on PR #13565:
URL: https://github.com/apache/iceberg/pull/13565#issuecomment-3078659441

   @nastra Sure!
   Actually, this time I wasn't writing about the internal `closed` flag, but now that you point it out, let's discuss it quickly :wink::
   If any code in the try block throws an exception, which would most likely happen in [`temp.flush()`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1826), the writer is still marked `closed`, because that flag is set in the `finally` block regardless of whether the close succeeded.
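   A minimal sketch of that pattern (simplified, purely illustrative; names and structure are assumptions, not the actual parquet-java or Iceberg code):

   ```java
   import java.io.IOException;

   // Sketch only: setting the flag in a finally block marks the writer as closed
   // even when the flush fails, so a later retry of close() is silently skipped.
   class ClosedFlagSketch {
     private boolean closed = false;

     void close() throws IOException {
       if (closed) {
         return;        // second attempt returns immediately, even though the first one failed
       }
       try {
         flush();       // may throw, e.g. from the underlying temp.flush()
       } finally {
         closed = true; // runs regardless of whether flush() succeeded
       }
     }

     private void flush() throws IOException {
       throw new IOException("simulated transient I/O failure");
     }
   }
   ```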
   
   What I was actually referring to is the field [`state`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L256), which nicely models the progression through the different stages of writing a Parquet file.
   That one is a bit trickier than this case, though, where just a single line needs to be moved to another location. The field tracks which stage the writer is currently in and is only allowed to advance through the carefully designed sequence of stages. But if some I/O fails, the internal `state` has already advanced to the next stage, making it impossible to go back and try again.
   If you look at the exception from the attached test, you can see the underlying cause of the final exception:
   ```
   Caused by: java.io.IOException: The file being written is in an invalid 
state. Probably caused by an error thrown previously. Current state: ENDED
        at 
org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:250)
        at 
org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:224)
        at 
org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:586)
        at 
org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:215)
        ... 100 more
   ```
   Being in state `ENDED` suggests that the transition into this state happened [here](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1808), where one can clearly see that the transition happens up front, even though it should only happen after everything has been processed successfully.
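   To make that failure mode concrete, here is a minimal sketch (simplified, hypothetical names; not the actual parquet-java code) of an `end()` that records the transition before doing the I/O that can fail:

   ```java
   import java.io.IOException;

   // Sketch only: the state moves to ENDED before the footer is written, so a failed
   // first call leaves the writer in ENDED and a retry fails immediately, matching the
   // "Current state: ENDED" error in the stack trace above.
   class EndStateSketch {
     enum State { STARTED, ENDED }

     private State state = State.STARTED;

     void end() throws IOException {
       if (state == State.ENDED) {
         throw new IOException(
             "The file being written is in an invalid state. Current state: " + state);
       }
       state = State.ENDED; // transition recorded up front ...
       writeFooter();       // ... but the actual work may still fail afterwards
     }

     private void writeFooter() throws IOException {
       throw new IOException("simulated failure while writing the footer");
     }
   }
   ```

   Calling `end()` a second time after the first attempt failed then throws the "invalid state" error instead of retrying the footer write.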
   
   But how should this be solved? Since the [end() method](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1806-L1818) seems to finalize the file with some metadata, it is crucial not to write that metadata on every attempt the writer makes to close the output. One option could be to introduce new entries in the `STATE` enum which indicate that the writer is about to enter a stage (e.g. `STARTING_BLOCK` etc.).
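   A rough sketch of that idea (hypothetical intermediate value `ENDING`, in the same spirit as `STARTING_BLOCK`; purely illustrative, not a patch against parquet-java):

   ```java
   import java.io.IOException;

   // Sketch only: an intermediate "about to end" state is entered before the footer is
   // written, and ENDED is only reached after the write succeeded, so a failed attempt
   // can be retried without writing the footer metadata twice.
   class RetryableEndSketch {
     enum State { STARTED, ENDING, ENDED }

     private State state = State.STARTED;

     void end() throws IOException {
       if (state == State.ENDED) {
         return;              // footer already written successfully, nothing left to do
       }
       state = State.ENDING;  // retrying from here is still allowed
       writeFooter();         // may throw; on failure the state stays ENDING
       state = State.ENDED;   // only reached after the footer was written successfully
     }

     private void writeFooter() throws IOException {
       // placeholder for the real metadata/footer write
     }
   }
   ```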



