ravindrabajpai opened a new issue, #12431:
URL: https://github.com/apache/iceberg/issues/12431

   ### Proposed Change
   
   I have recently started exploring iceberg table format, and I feel hidden 
partitioning is a very powerful feature. I would like to explore it a bit 
further to see if following case can be possible. Please ignore this proposal 
and point me to an alternate solution if there is any.
   
   I would like to explore and propose the following type of partitioning 
mechanism explained with example - 
   
   For example a table with a specific partitioning structure that combines a 
"status" partition (active/inactive) with a date partition (business_date) for 
inactive records only. 
   
   It solves my small file issue for status=active partition, where if I don't 
have business date as sub partition, then I can have a few big files right 
inside the status=active foloder.
   
   
   1. Proposed/Expected Table Schema and Partitioning Strategy
   
   Let's assume we have a table with columns like:
   
   id (INT): Unique identifier
   status (STRING): "active" or "inactive"
   business_date (DATE): Date of the business event
   data (STRING): Some data
   
   Partitioning strategy will be:
   status: Top-level partition (active/inactive)
   business_date: Sub-partition only for status = 'inactive'.
   
   Creating the Iceberg Table -
   
   CREATE TABLE my_iceberg_table (
       id INT,
       status STRING,
       business_date DATE,
       data STRING
   )
   USING iceberg
   PARTITIONED BY (status, CASE WHEN status = 'inactive' THEN business_date 
ELSE null END);
   
   Here - 
   PARTITIONED BY (status, CASE WHEN status = 'inactive' THEN business_date 
ELSE null END): This is the key part.
   It partitions first by status.
   The CASE statement creates the sub-partition:
   If status is 'inactive', it partitions by business_date.
   If status is 'active', it partitions by NULL. 
   3. Ingesting Records
   
   INSERT INTO my_iceberg_table VALUES
   (1, 'active', '2023-10-26', 'Active data 1'),
   (2, 'active', '2023-10-27', 'Active data 2'),
   (3, 'inactive', '2023-10-26', 'Inactive data 1'),
   (4, 'inactive', '2023-10-27', 'Inactive data 2'),
   (5, 'inactive', '2023-10-27', 'Inactive data 3');
   
   SELECT * FROM my_iceberg_table WHERE status = 'active';
   SELECT * FROM my_iceberg_table WHERE status = 'inactive' AND business_date = 
'2023-10-27';
   
   
   ### Proposal document
   
   _No response_
   
   ### Specifications
   
   - [x] Table
   - [ ] View
   - [ ] REST
   - [ ] Puffin
   - [ ] Encryption
   - [ ] Other


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to