slfan1989 commented on issue #12762: URL: https://github.com/apache/iceberg/issues/12762#issuecomment-2960989911
During the migration process, we encountered several challenges and addressed them with the following solutions:

- **Rollback of migrated Hive tables.** After migration, downstream consumers use the production tables in different ways, and some systems that have not been updated for a long time cannot immediately read Iceberg tables. For example, listing partitions through `SHOW PARTITIONS db.table` may no longer be supported, and users have to switch to `SELECT * FROM db.table.partitions`. As a result, users wanted the ability to roll a migrated table back to a Hive table. To address this, we introduced a new stored procedure, `revert`, which directly removes the Iceberg metadata on the HMS side (see the first sketch after this list). After the rollback, the user only needs to re-run the production tasks for the affected period. For example, if table A was migrated on June 2, 2025 and a rollback is requested on June 5, 2025, then after running the stored procedure the user simply re-runs the scheduled tasks for June 3 through June 5, 2025, and table A is a Hive table again.
- **Aligning the partition layout with the Hive table.** Hive table partitions are typically stored as `/dbName/tableName/dt=20250610/`, while Iceberg places data files under `/dbName/tableName/data/dt=20250610/`. To ensure that migrated Iceberg tables retain the original partition layout, we added a configuration option that lets them keep the original Hive layout after migration (see the second sketch below).
- **Limited support for using Iceberg tables with Spark 2.** The community's Iceberg support for Spark 3 is very robust, and we have migrated 80% of our tasks to Spark 3. However, some tasks still depend on Spark 2 due to business constraints and are difficult to migrate, yet their owners still want to use Iceberg tables. For these cases we agreed with users to support only reading Iceberg tables from Spark 2, including basic partition pushdown (the third sketch below shows the core filter conversion). Since this code may be of limited interest to the Iceberg community, I plan to release it in a personal repository and share a link for users with similar needs.
- **Supporting more partition pushdown scenarios.** Some of our tables have multi-level partitions, such as year, month, and day. Writing partition filter conditions for such tables is cumbersome: filtering data between 2023-01-01 and 2025-12-31, for instance, expands into a long partition expression. To solve this, we introduced a syntax similar to `concat(year, month, day) >= '2023-01-01' and concat(year, month, day) <= '2025-12-31'`. We defined a multi-level partition pushdown filter expression in Iceberg and convert Spark's filters into it, which greatly simplifies users' queries (the last sketch below shows the bound expansion).

These optimizations have effectively addressed the challenges we encountered during migration, giving users a smooth transition to Iceberg while improving the flexibility and efficiency of the system.
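For the rollback item, here is a minimal sketch of what our `revert` procedure does on the HMS side. The class and method names are illustrative and not part of upstream Iceberg; the parameter keys (`table_type`, `metadata_location`, `previous_metadata_location`) are the ones Iceberg's HiveCatalog writes. Our production version also restores the table's original SerDe/input format and partition registrations, which is omitted here.

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class RevertIcebergTable {
  public static void revert(HiveConf conf, String dbName, String tableName) throws Exception {
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      Table table = client.getTable(dbName, tableName);
      // Iceberg's HiveCatalog marks the table with table_type=ICEBERG and keeps a
      // pointer to the current metadata file; removing these parameters makes
      // engines treat the table as a plain Hive table again.
      table.getParameters().remove("table_type");
      table.getParameters().remove("metadata_location");
      table.getParameters().remove("previous_metadata_location");
      client.alter_table(dbName, tableName, table);
    } finally {
      client.close();
    }
  }
}
```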
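For the partition-layout item, our change is a catalog-level default, but stock Iceberg already exposes a per-table property that approximates it: pointing `write.data.path` at the table root makes new partition directories land directly under it. A sketch, with a placeholder table name and HDFS path:

```java
import org.apache.spark.sql.SparkSession;

public class KeepHiveLayout {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();
    // write.data.path is a standard Iceberg table property. Pointing it at the
    // table root yields /dbName/tableName/dt=20250610/... for new data files
    // instead of the default /dbName/tableName/data/dt=20250610/... layout.
    spark.sql("ALTER TABLE db.table SET TBLPROPERTIES ("
        + "'write.data.path'='hdfs://nn/warehouse/dbName.db/tableName')");
  }
}
```

Note that this only affects newly written files; files written before the property change stay where they are.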
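For the Spark 2 item, the heart of the read path is translating the `org.apache.spark.sql.sources.Filter` objects that Spark 2 pushes into a DataSource V2 reader into Iceberg `Expression`s, much like the `SparkFilters` utility in Iceberg's Spark 3 modules. A trimmed sketch handling a few operators (our real version covers more, and anything it cannot convert is simply not pushed down):

```java
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.spark.sql.sources.And;
import org.apache.spark.sql.sources.EqualTo;
import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.sources.GreaterThanOrEqual;
import org.apache.spark.sql.sources.LessThanOrEqual;

public final class Spark2FilterConversion {
  private Spark2FilterConversion() {}

  // Converts a Spark 2 pushed filter into an Iceberg expression that can be
  // passed to TableScan#filter(...), or returns null if the filter is not
  // supported (Spark then re-evaluates it after the scan).
  public static Expression convert(Filter filter) {
    if (filter instanceof EqualTo) {
      EqualTo eq = (EqualTo) filter;
      return Expressions.equal(eq.attribute(), eq.value());
    } else if (filter instanceof GreaterThanOrEqual) {
      GreaterThanOrEqual ge = (GreaterThanOrEqual) filter;
      return Expressions.greaterThanOrEqual(ge.attribute(), ge.value());
    } else if (filter instanceof LessThanOrEqual) {
      LessThanOrEqual le = (LessThanOrEqual) filter;
      return Expressions.lessThanOrEqual(le.attribute(), le.value());
    } else if (filter instanceof And) {
      And and = (And) filter;
      Expression left = convert(and.left());
      Expression right = convert(and.right());
      return (left != null && right != null) ? Expressions.and(left, right) : null;
    }
    return null;
  }
}
```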
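Finally, for the multi-level pushdown item, the essential trick is expanding a lexicographic (tuple-order) bound over several partition columns into ordinary per-column predicates that Iceberg can already evaluate. The helper below is an illustrative reconstruction of the lower-bound case, not the exact expression type we added; column names and values are placeholders.

```java
import java.util.List;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

public final class MultiLevelPartitionBound {
  private MultiLevelPartitionBound() {}

  // Expands cols >= bound in lexicographic order. For (year, month, day) >=
  // ('2023', '01', '01') this produces:
  //   year > '2023'
  //   OR (year = '2023' AND (month > '01'
  //       OR (month = '01' AND day >= '01')))
  public static Expression greaterThanOrEqual(List<String> cols, List<String> bound) {
    return build(cols, bound, 0);
  }

  private static Expression build(List<String> cols, List<String> bound, int i) {
    String col = cols.get(i);
    String val = bound.get(i);
    if (i == cols.size() - 1) {
      // the last level keeps the equality, so the bound itself is included
      return Expressions.greaterThanOrEqual(col, val);
    }
    return Expressions.or(
        Expressions.greaterThan(col, val),
        Expressions.and(Expressions.equal(col, val), build(cols, bound, i + 1)));
  }
}
```

For example, `greaterThanOrEqual(List.of("year", "month", "day"), List.of("2023", "01", "01"))` yields an expression Iceberg can use to prune manifests and data files; the upper bound is handled symmetrically with `lessThan`/`lessThanOrEqual`.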