zachdisc commented on PR #9731: URL: https://github.com/apache/iceberg/pull/9731#issuecomment-1961864865
## Rev 3

* Addressed last round of comments
* Added a new API option to sort with a custom supplied `Function<DataFile, String>`.

Illustration (in unit test): say you have a table that is partitioned by `a, truncate(b, 5), bucket(10, c)`. We can choose to reorganize manifests around `c` (named `c3` in the test code), so that roughly half of the `c` partitions are contained in each individual `manifest_list` tree entry. To do that, supply a Function like the one below, which checks whether the `c` bucket value is < 5 and returns a string.

```
Function<DataFile, String> test =
    (Function<DataFile, String> & Serializable)
        (dataFile) -> {
          StructLike partition = dataFile.partition();
          // Find the ordinal index for the c3 partition column for this data file
          int c3Index =
              IntStream.range(0, spec.fields().size())
                  .filter(i -> spec.fields().get(i).name().contains("c3"))
                  .findFirst()
                  .getAsInt();
          Object c3BucketValue = partition.get(c3Index, Object.class);

          // Return one string for the lower values, one for the upper. RewriteManifests
          // will cluster datafiles together in manifests according to this value.
          return (Integer) c3BucketValue < 5 ? "cluster=LT_5" : "cluster=GTE_5";
        };
```

The value used for repartitioning/sorting takes one of two values: one where the transformed bucket value is less than 5, and one where it is 5 or greater.

Illustration of the partition tuple for a data file vs. the new supplied clustering column value:

```
+--------------------+---------------------+
|partition           |__clustering_column__|
+--------------------+---------------------+
|{0, -531806488, 0}  |cluster=LT_5         |  <-- the last partition value (for `c`) is 0, which is labeled "LT_5"
|{1, 385955472, 7}   |cluster=GTE_5        |
|{2, 604077840, 6}   |cluster=GTE_5        |
|{3, 1875302972, 4}  |cluster=LT_5         |
|{4, -1772544904, 0} |cluster=LT_5         |
+--------------------+---------------------+
```

The rewritten manifests will group data files so that each `manifest_file` contains data files whose `c` partition values fall in either 0-4 or 5-9.
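The grouping effect described in this comment can be sketched without any Iceberg dependency. The snippet below is only an illustration under stated assumptions: `ClusterSketch`, `CLUSTER_BY_C`, and `groupByCluster` are hypothetical names that are not part of the PR, and a partition tuple is modeled as a plain `int[]` whose last element is the bucket value of `c`. It applies the same "less than 5" rule to the five partition tuples from the table and collects them by clustering key, which is roughly what one would expect the rewrite to do when it assigns data files to manifests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Standalone sketch: no Iceberg types are used. A partition tuple is modeled as
// int[]{a, truncate(b), bucket(c)}, which is an assumption made for illustration.
class ClusterSketch {

  // Hypothetical stand-in for the clustering function from the comment:
  // index 2 holds the bucket value of c.
  static final Function<int[], String> CLUSTER_BY_C =
      partition -> partition[2] < 5 ? "cluster=LT_5" : "cluster=GTE_5";

  // Groups partition tuples by clustering key, keeping input order inside each
  // group. A TreeMap keeps the keys sorted so the output order is stable.
  static Map<String, List<int[]>> groupByCluster(
      List<int[]> partitions, Function<int[], String> clusterFn) {
    Map<String, List<int[]>> groups = new TreeMap<>();
    for (int[] partition : partitions) {
      groups.computeIfAbsent(clusterFn.apply(partition), k -> new ArrayList<>()).add(partition);
    }
    return groups;
  }

  public static void main(String[] args) {
    // The five partition tuples shown in the table.
    List<int[]> partitions = Arrays.asList(
        new int[] {0, -531806488, 0},
        new int[] {1, 385955472, 7},
        new int[] {2, 604077840, 6},
        new int[] {3, 1875302972, 4},
        new int[] {4, -1772544904, 0});

    // Print each clustering key followed by the tuples assigned to it.
    groupByCluster(partitions, CLUSTER_BY_C).forEach((key, group) -> {
      StringBuilder line = new StringBuilder(key).append(" ->");
      for (int[] p : group) {
        line.append(' ').append(Arrays.toString(p));
      }
      System.out.println(line);
    });
  }
}
```

With the table's data, the tuples whose `c` bucket values are 0, 4, and 0 end up under `cluster=LT_5`, and the tuples with 7 and 6 end up under `cluster=GTE_5`. The real action works on `DataFile` objects and writes manifests rather than building an in-memory map, so this only mirrors the grouping logic.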
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org