hililiwei opened a new pull request, #6043:
URL: https://github.com/apache/iceberg/pull/6043

   # Proposal: Partial Updates
   
   ## Motivation
   
   
   
   Take feature engineering as an example: a table may have thousands or even 
tens of thousands of columns, but a task updates only a few of them. Currently, 
to update a row we need to fetch all of its columns, which is very inefficient. 
With partial updates, a write only needs to generate files that contain the 
equality columns and the updated columns, which greatly improves throughput and 
reduces complexity (the writer does not need to query the values of columns 
that are not being changed). When reading, we combine the data file with the 
partial update file, which has some similarities to COR. In addition, to 
improve read efficiency, a background asynchronous task can merge the files 
when the system is idle.
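
To make the write path concrete, here is a minimal Python sketch; it is not the implementation in this PR. The helper `project_partial_update` and the column names `user_id`, `feature_17`, and `feature_18` are hypothetical. The only point is that the writer emits the equality columns plus the changed columns and never reads the rest of the row.

```python
from typing import Any, Dict, Sequence


def project_partial_update(
    change: Dict[str, Any],
    equality_cols: Sequence[str],
    updated_cols: Sequence[str],
) -> Dict[str, Any]:
    """Keep only the equality columns and the updated columns of a change.

    Hypothetical helper for illustration; not an Iceberg API.
    """
    wanted = list(equality_cols) + [c for c in updated_cols if c not in equality_cols]
    missing = [c for c in wanted if c not in change]
    if missing:
        raise KeyError(f"change is missing columns: {missing}")
    return {c: change[c] for c in wanted}


# The task only knows the key and the features it recomputed.
change = {"user_id": 42, "feature_17": 0.93, "feature_18": 0.10}

# Only feature_17 is declared as updated, so feature_18 is not written.
record = project_partial_update(change, ["user_id"], ["feature_17"])
```

The projected record carries two columns regardless of how wide the table is, which is where the throughput gain on write would come from.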
   
   ### Partial Update Files
   
   Partial update files identify updated rows in a collection of data files by 
one or more column values, and include the new values for one or more columns 
of those rows.
   
   Partial update files store any subset of a table’s columns and use the 
table’s field ids. The *equality columns* are the columns of the file used to 
match data rows. The *partial columns* are the columns of the file used to 
update the corresponding columns of the matching data rows.
   
   The partial columns of a data row are updated to the new values if the 
row’s equality column values are equal to all equality column values of any row 
in a partial update file that applies to the row’s data file.
   
   For example, a table with the following data:
   
   ```text
    1: id | 2: category | 3: name
   -------|-------------|---------
    1     | marsupial   | Koala
    2     | toy         | Teddy
    3     | NULL        | Grizzly
    4     | NULL        | Polar
   ```
   
   The update `name = Lily` for the row where `id = 3` could be written as the 
following partial update file:
   
   ```text
   equality_ids=[1]
   partial_ids=[3]
   
    1: id | 3: name
   -------|---------
    3     | Lily
   ```
   
   After applying the partial update file, the table has the following data:
   
   ```text
    1: id | 2: category | 3: name
   -------|-------------|---------
    1     | marsupial   | Koala
    2     | toy         | Teddy
    3     | NULL        | Lily
    4     | NULL        | Polar
   ```
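
The matching rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the reader implemented in this PR: rows are modeled as dictionaries keyed by field id, the function name `apply_partial_updates` is hypothetical, `None` is treated as equal to `None` when matching, and if several update rows share the same key the last one wins (the proposal does not define an ordering).

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[int, Any]  # field id -> value


def apply_partial_updates(
    data_rows: List[Row],
    update_rows: List[Row],
    equality_ids: Sequence[int],
    partial_ids: Sequence[int],
) -> List[Row]:
    """Return copies of data_rows with matching partial updates applied."""
    # Index the update rows by their equality-column values.
    # Assumption: on duplicate keys, the later update row wins.
    index: Dict[Tuple[Any, ...], Row] = {}
    for update in update_rows:
        index[tuple(update[f] for f in equality_ids)] = update

    result: List[Row] = []
    for row in data_rows:
        merged = dict(row)  # never mutate the input rows
        update = index.get(tuple(row.get(f) for f in equality_ids))
        if update is not None:
            # Overwrite only the partial columns; all others are untouched.
            for fid in partial_ids:
                merged[fid] = update[fid]
        result.append(merged)
    return result


# The example table: 1 = id, 2 = category, 3 = name.
table = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 2, 2: "toy", 3: "Teddy"},
    {1: 3, 2: None, 3: "Grizzly"},
    {1: 4, 2: None, 3: "Polar"},
]
partial_update_file = [{1: 3, 3: "Lily"}]

merged = apply_partial_updates(
    table, partial_update_file, equality_ids=[1], partial_ids=[3]
)
```

Only the row with `id = 3` changes, and only its `name`; `category` is never read from the update file, so it keeps its existing value.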
   
   Illustration:
   
   
![image](https://user-images.githubusercontent.com/59213263/197661183-acb6c06d-355e-4323-8bc1-e347db32a33c.png)
   
   In this example, we find id1 and id3 in a.file, then update their Col1 to 
the new value.
   
   
![image](https://user-images.githubusercontent.com/59213263/197661874-3384e5f5-579b-4f98-baf4-47ebd4c08d76.png)
   
   In this example, we add a new column to the table and insert a partial 
update file that contains only the new column. It looks more like an insert, 
except that we are inserting new columns into existing rows rather than 
inserting new rows.
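
The new-column case can be sketched the same way. The code below is a hedged illustration, not the PR's reader: the function `read_with_backfilled_columns`, the field `4: habitat`, and its value are hypothetical, and the sketch assumes that rows with no matching update read the new column as null, as an added column normally does for older data files.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[int, Any]  # field id -> value


def read_with_backfilled_columns(
    data_rows: List[Row],
    update_rows: List[Row],
    equality_ids: Sequence[int],
    new_ids: Sequence[int],
) -> List[Row]:
    """Attach newly added columns to old rows from a partial update file."""
    index = {tuple(u[f] for f in equality_ids): u for u in update_rows}
    result: List[Row] = []
    for row in data_rows:
        match = index.get(tuple(row[f] for f in equality_ids))
        merged = dict(row)
        for fid in new_ids:
            # Assumption: unmatched rows read the new column as None (null).
            merged[fid] = match[fid] if match is not None else None
        result.append(merged)
    return result


# Old data file written before the hypothetical column 4 (habitat) existed.
old_rows = [{1: 1, 3: "Koala"}, {1: 2, 3: "Teddy"}]

# Partial update file: equality_ids=[1], partial_ids=[4].
new_column_file = [{1: 1, 4: "eucalyptus forest"}]

rows = read_with_backfilled_columns(old_rows, new_column_file, [1], [4])
```

The old data file is not rewritten; the new column exists only in the partial update file until a later merge combines the two.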
   
   ### Brief change log
   
   This PR consists of two parts:
   
   * Evolution of the table format specification, mainly the partial update file
   * Write and read support for partial update files
   
   
   P.S.:
   This is an internal feature that is still under development. I wanted to 
hear from the community early, so I raised this PR before it was finished. 
There is also the engine integration part, but I think this PR is the core 
part, and we should discuss it first to get on the same page.
   
   This approach addresses a large number of our business scenarios. 
Internally, we have implemented it with Flink and achieved satisfactory results 
in validation.
   
   What does the community think? I look forward to your feedback.

