hililiwei opened a new pull request, #6043: URL: https://github.com/apache/iceberg/pull/6043
# Proposal: Partial Updates

## Motivation

Take feature engineering as an example: a table may have thousands or even tens of thousands of columns, but a task updates only a few of them. Currently, to update a row we need to fetch all of its columns, which is very inefficient. If we support partial updates, a write only needs to generate data files that contain the equality columns and the updated columns. This greatly improves throughput and reduces complexity, because the writer does not need to query the values of columns that are not changing.

When reading, we combine the data file with the partial update file, which has some similarities to COR. In addition, to improve read efficiency, a background asynchronous task can merge the files when the system is idle.

### Partial Update Files

Partial update files identify updated rows in a collection of data files by one or more column values, and include one or more columns of those rows that need to be updated. Partial update files store any subset of a table's columns and use the table's field ids.

The *equality columns* are the columns of the file used to match data rows. The *partial columns* are the columns of the file used to update the corresponding columns of the matching data rows. The partial columns of a data row are updated to the new values if the row's equality column values are equal to all equality column values of any row in a partial update file that applies to the row's data file.
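The matching rule above can be sketched in a few lines of Python. This is only an illustration of the proposed read-side semantics, not the implementation in this PR. The function name `apply_partial_updates`, the representation of rows as dicts keyed by field id, and the "last update wins" behavior for duplicate keys are all assumptions made for the sketch. It also ignores the question of which partial update files apply to which data files.

```python
from typing import Any, Dict, List, Sequence

# A row is modeled as a mapping from field id to value (an assumption for this sketch).
Row = Dict[int, Any]


def apply_partial_updates(
    data_rows: Sequence[Row],
    update_rows: Sequence[Row],
    equality_ids: Sequence[int],
    partial_ids: Sequence[int],
) -> List[Row]:
    """Return copies of data_rows with the partial columns overwritten on matching rows."""
    # Index the update rows by their equality-column values.
    # Assumption: if two update rows share a key, the later one wins.
    index = {}
    for update in update_rows:
        key = tuple(update[field_id] for field_id in equality_ids)
        index[key] = update

    merged_rows = []
    for row in data_rows:
        merged = dict(row)  # never mutate the input rows
        key = tuple(row.get(field_id) for field_id in equality_ids)
        update = index.get(key)
        if update is not None:
            # Only the partial columns change; every other column keeps its old value.
            for field_id in partial_ids:
                merged[field_id] = update[field_id]
        merged_rows.append(merged)
    return merged_rows


# Field ids: 1 = id, 2 = category, 3 = name
table = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 3, 2: None, 3: "Grizzly"},
]
updates = [{1: 3, 3: "Lily"}]
merged = apply_partial_updates(table, updates, equality_ids=[1], partial_ids=[3])
print(merged[1])  # {1: 3, 2: None, 3: 'Lily'}
```

Building a hash index over the update rows keeps the merge linear in the number of data rows, which is one plausible way a reader could combine the two kinds of files.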
For example, a table with the following data:

```text
 1: id | 2: category | 3: name
-------|-------------|---------
 1     | marsupial   | Koala
 2     | toy         | Teddy
 3     | NULL        | Grizzly
 4     | NULL        | Polar
```

An update that sets `name = Lily` for the row with `id = 3` could be written as the following partial update file:

```text
equality_ids=[1]
partial_ids=[3]

 1: id | 3: name
-------|---------
 3     | Lily
```

After applying the partial update file, the table has the following data:

```text
 1: id | 2: category | 3: name
-------|-------------|---------
 1     | marsupial   | Koala
 2     | toy         | Teddy
 3     | NULL        | Lily
 4     | NULL        | Polar
```

Illustration:

In the first example, we find the rows with id1 and id3 in a.file, then update their Col1 to the new value.

In the second example, we add a new column to the table and insert a partial update file that contains only the new column. It might look more like an insert, except that we are inserting new columns for old rows rather than inserting new rows.

### Brief change log

This PR consists of two parts:

* Evolution of the table format specification, mainly the partial update file
* Write and read support for partial update files

P.S.: This is an internal feature that is still under development. I wanted to hear from the community early on, so I raised this PR before it was finished. There is also an engine integration part, but I think this PR is the core of it, and we should discuss it here first to get on the same page.

With this approach, we can cover a large number of business scenarios. Internally, we have implemented it with Flink and achieved satisfactory results in validation.

What is the community's view of it? I hope to receive your feedback.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org