zachdisc commented on PR #9731:
URL: https://github.com/apache/iceberg/pull/9731#issuecomment-1961864865

   ## Rev 3
   * Addressed last round of comments
   * Added a new API option to sort with a custom, user-supplied `Function<DataFile, String>`.
   
   Illustration (in unit test): say you have a table that is partitioned by `a, truncate(b, 5), bucket(10, c)`. We can choose to reorganize manifests around `c`, so that roughly half of the `c` partition values are contained in each individual `manifest_list` tree entry.
   
   So supply a `Function` like the one below, which checks whether the `c` partition value is less than 5 and returns a string accordingly.
   
   ```
   Function<DataFile, String> test =
           (Function<DataFile, String> & Serializable)
               (dataFile) -> {
                 StructLike partition = dataFile.partition();
                 // Find the ordinal index for the c3 partition column for this data file
                 int c3Index =
                     IntStream.range(0, spec.fields().size())
                         .filter(i -> spec.fields().get(i).name().contains("c3"))
                         .findFirst()
                         .getAsInt();
                 Object c3BucketValue = partition.get(c3Index, Object.class);

                 // Return one string for the lower values, one for the upper.
                 // RewriteManifests will cluster data files together in manifests
                 // according to this value.
                 return (Integer) c3BucketValue < 5 ? "cluster=LT_5" : "cluster=GTE_5";
               };
   ```
   
   The value used for repartitioning/sorting will take one of two values: one where the transformed bucket value is less than 5, and one where it is 5 or greater. The table below illustrates the partition tuple for a data file versus the newly supplied clustering column value.
   
   
   ```
   +--------------------+---------------------+
   |partition           |__clustering_column__|
   +--------------------+---------------------+
   |{0, -531806488, 0}  |cluster=LT_5         |  <-- the last partition value, for `c`, is 0, which is labeled "LT_5"
   |{1, 385955472, 7}   |cluster=GTE_5        |
   |{2, 604077840, 6}   |cluster=GTE_5        |
   |{3, 1875302972, 4}  |cluster=LT_5         |
   |{4, -1772544904, 0} |cluster=LT_5         |
   +--------------------+---------------------+
   ```
   
   Rewritten manifests will group data files so that each `manifest_file` holds data files whose `c` partition values fall in either 0-4 or 5-9.
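
To make the resulting grouping concrete, here is a minimal, self-contained sketch in plain Java (16+, for records) that applies the same "less than 5" rule to the five sample partition tuples from the table above. It does not use Iceberg at all: `ClusterSketch`, `PartitionTuple`, `CLUSTER_KEY`, and `group` are hypothetical names made up for illustration, and the grouping via `Collectors.groupingBy` only mimics the idea of clustering by the supplied key, not how RewriteManifests does it internally.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ClusterSketch {
  // Hypothetical stand-in for a data file's partition tuple:
  // (a, truncate(b, 5), bucket(10, c)). Not an Iceberg type.
  record PartitionTuple(int a, int bTruncated, int cBucket) {}

  // Same rule as the function in the comment: split on the bucket value of c.
  static final Function<PartitionTuple, String> CLUSTER_KEY =
      p -> p.cBucket() < 5 ? "cluster=LT_5" : "cluster=GTE_5";

  // Group tuples by their clustering key; TreeMap keeps the keys in a stable order.
  static Map<String, List<PartitionTuple>> group(List<PartitionTuple> tuples) {
    return tuples.stream()
        .collect(Collectors.groupingBy(CLUSTER_KEY, TreeMap::new, Collectors.toList()));
  }

  public static void main(String[] args) {
    // The five sample partition tuples from the table above.
    List<PartitionTuple> sample =
        List.of(
            new PartitionTuple(0, -531806488, 0),
            new PartitionTuple(1, 385955472, 7),
            new PartitionTuple(2, 604077840, 6),
            new PartitionTuple(3, 1875302972, 4),
            new PartitionTuple(4, -1772544904, 0));

    group(sample).forEach((key, tuples) -> System.out.println(key + " -> " + tuples));
  }
}
```

With the sample data, three tuples (bucket values 0, 4, 0) share the `cluster=LT_5` key and two (bucket values 7, 6) share `cluster=GTE_5`, matching the table.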
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

