[I] Can we make commits inside compaction jobs with partial-progress.enabled sequential to avoid CommitFailedException? [iceberg]

via GitHub Thu, 08 Feb 2024 08:51:20 -0800


paulpaul1076 opened a new issue, #9687:
URL: https://github.com/apache/iceberg/issues/9687


   ### Feature Request / Improvement
   
   From what I understand, if a compaction job compacts a lot of small files 
and uses `partial-progress.enabled=true`, several file groups can finish 
compacting at the same time. Their metadata commits are then made in parallel 
and conflict with one another, which leads to `CommitFailedException`. Is it 
possible to make these commits sequential instead of parallel, for the 
compaction job specifically? I don't think there's any point in them being 
parallel; it just leads to `CommitFailedException`.
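
   To make the failure mode concrete, here is a minimal simulation of 
optimistic commits. It is a sketch only: `FakeTable`, `CommitFailed`, 
`commit_in_parallel` and `commit_sequentially` are hypothetical names invented 
for this illustration, not Iceberg classes, and the model assumes that every 
file group that finishes at the same moment builds its commit on the same 
table version.

```python
from dataclasses import dataclass


class CommitFailed(Exception):
    """Stand-in for Iceberg's CommitFailedException (simulation only)."""


@dataclass
class FakeTable:
    """Toy table with optimistic concurrency control.

    A commit succeeds only if it was built on the current version.
    """
    version: int = 0

    def commit(self, base_version: int) -> None:
        if base_version != self.version:
            raise CommitFailed(
                f"base version {base_version} is stale, table is at {self.version}"
            )
        self.version += 1


def commit_in_parallel(table: FakeTable, group_count: int) -> int:
    """All groups finish together, so every commit uses the same base version."""
    base = table.version
    failures = 0
    for _ in range(group_count):
        try:
            table.commit(base)
        except CommitFailed:
            failures += 1
    return failures


def commit_sequentially(table: FakeTable, group_count: int) -> int:
    """Each commit reads the latest version immediately before committing."""
    failures = 0
    for _ in range(group_count):
        try:
            table.commit(table.version)
        except CommitFailed:
            failures += 1
    return failures


parallel_failures = commit_in_parallel(FakeTable(), 5)
sequential_failures = commit_sequentially(FakeTable(), 5)
print(parallel_failures, sequential_failures)  # 4 0
```

   In this toy model only the first of five simultaneous commits succeeds, 
while committing one after another never conflicts. A real job would retry the 
failed commits, so the actual cost depends on the retry settings.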
   
   It's very easy to reproduce. For example, set 
`partial-progress.max-commits=1000` (or another high number) and set 
`max-file-group-size-bytes` to 1 GB (or some other low value). You will see 
that there are a lot of file groups and that they all get committed in 
parallel, which leads to this exception.
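
   A reproduction along these lines could be set up through the 
`rewrite_data_files` Spark procedure. The snippet below is a sketch that only 
builds the SQL string: the helper `build_rewrite_call`, the catalog name 
`my_catalog` and the table name `db.events` are placeholders chosen for 
illustration, and 1 GB is written out in bytes as 1024^3.

```python
def build_rewrite_call(catalog: str, table: str, options: dict) -> str:
    """Render a CALL statement for the rewrite_data_files procedure.

    Hypothetical helper: it only formats a string and does not contact Spark.
    """
    pairs = ", ".join(f"'{key}', '{value}'" for key, value in options.items())
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', options => map({pairs}))"
    )


# Options from the synthetic reproduction described above.
repro_options = {
    "partial-progress.enabled": "true",
    "partial-progress.max-commits": "1000",       # deliberately high
    "max-file-group-size-bytes": str(1024 ** 3),  # 1 GB, deliberately low
}

# 'my_catalog' and 'db.events' are placeholder names.
call_sql = build_rewrite_call("my_catalog", "db.events", repro_options)
print(call_sql)
```

   The resulting string can be passed to `spark.sql(call_sql)` in a session 
that has the Iceberg SQL extensions configured.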
   
   That was a synthetic example. As for a real-world example, I had a job that 
had to compact 50k files. For that job, I used `partial-progress.enabled=true` 
and it worked fine: there were a lot of file groups, and since 
`partial-progress.max-commits=10` (the default) is small, the commits were 
infrequent and, for the most part, didn't run in parallel.
   
   However, when I ran the same compaction job on a table with 5k files, 
10 commits turned out to be very frequent for that job, because it doesn't 
have many files to compact, and the commits started running in parallel and 
conflicting. This means these settings have to be configured separately for 
each table, depending on the number of files to compact. So why not just make 
the commits sequential (inside the compaction job only)?
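
   To put rough numbers on the difference between the two jobs, here is a 
small arithmetic sketch. It rests on two assumptions that are not confirmed in 
this issue: that file groups are spread evenly across commits, so each commit 
covers about ceil(total groups / max-commits) groups, and that every file 
group holds 100 files (`FILES_PER_GROUP` is a made-up figure). The helper 
`max_commits_for` is hypothetical and only shows the kind of per-table tuning 
that is needed today.

```python
import math

FILES_PER_GROUP = 100  # hypothetical group size, for illustration only


def groups_per_commit(total_groups: int, max_commits: int) -> int:
    """Assumed batching rule: groups are spread evenly across the commits."""
    return math.ceil(total_groups / max_commits)


def max_commits_for(total_groups: int, min_groups_per_commit: int) -> int:
    """Hypothetical tuning helper: pick max-commits so each commit stays large."""
    return max(1, total_groups // min_groups_per_commit)


large_job_groups = 50_000 // FILES_PER_GROUP  # 500 file groups
small_job_groups = 5_000 // FILES_PER_GROUP   # 50 file groups

large_batch = groups_per_commit(large_job_groups, 10)
small_batch = groups_per_commit(small_job_groups, 10)
print(large_batch, small_batch)  # 50 5

# To keep 50 groups per commit, the small job would need a different setting.
print(max_commits_for(large_job_groups, 50), max_commits_for(small_job_groups, 50))  # 10 1
```

   Under these assumptions the small job commits after every 5 file groups 
instead of every 50, so commits are far more likely to overlap unless 
`partial-progress.max-commits` is lowered for that table.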
   
   @RussellSpitzer told me that he thought this had already been done, but we 
then realized that it apparently hasn't been, and I didn't find any issue 
about it here on GitHub. Is somebody already working on this?
   
   ### Query engine
   
   Spark


