ravindrabajpai opened a new issue, #12431:
URL: https://github.com/apache/iceberg/issues/12431
### Proposed Change
I have recently started exploring iceberg table format, and I feel hidden
partitioning is a very powerful feature. I would like to explore it a bit
further to see if following case can be possible. Please ignore this proposal
and point me to an alternate solution if there is any.
I would like to explore and propose the following type of partitioning
mechanism explained with example -
For example a table with a specific partitioning structure that combines a
"status" partition (active/inactive) with a date partition (business_date) for
inactive records only.
It solves my small file issue for status=active partition, where if I don't
have business date as sub partition, then I can have a few big files right
inside the status=active foloder.
1. Proposed/Expected Table Schema and Partitioning Strategy
Let's assume we have a table with columns like:
id (INT): Unique identifier
status (STRING): "active" or "inactive"
business_date (DATE): Date of the business event
data (STRING): Some data
Partitioning strategy will be:
status: Top-level partition (active/inactive)
business_date: Sub-partition only for status = 'inactive'.
Creating the Iceberg Table -
CREATE TABLE my_iceberg_table (
id INT,
status STRING,
business_date DATE,
data STRING
)
USING iceberg
PARTITIONED BY (status, CASE WHEN status = 'inactive' THEN business_date
ELSE null END);
Here -
PARTITIONED BY (status, CASE WHEN status = 'inactive' THEN business_date
ELSE null END): This is the key part.
It partitions first by status.
The CASE statement creates the sub-partition:
If status is 'inactive', it partitions by business_date.
If status is 'active', it partitions by NULL.
3. Ingesting Records
INSERT INTO my_iceberg_table VALUES
(1, 'active', '2023-10-26', 'Active data 1'),
(2, 'active', '2023-10-27', 'Active data 2'),
(3, 'inactive', '2023-10-26', 'Inactive data 1'),
(4, 'inactive', '2023-10-27', 'Inactive data 2'),
(5, 'inactive', '2023-10-27', 'Inactive data 3');
SELECT * FROM my_iceberg_table WHERE status = 'active';
SELECT * FROM my_iceberg_table WHERE status = 'inactive' AND business_date =
'2023-10-27';
### Proposal document
_No response_
### Specifications
- [x] Table
- [ ] View
- [ ] REST
- [ ] Puffin
- [ ] Encryption
- [ ] Other
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]