slfan1989 commented on issue #12762:
URL: https://github.com/apache/iceberg/issues/12762#issuecomment-2960987469

   @szehon-ho @manuzhang @pvary @deniskuzZ @dramaticlly 
   
   > Conclusion
   
   Compared to Hive, Spark offers significant advantages, especially in 
high-concurrency and large-scale migration scenarios. Its lightweight 
operation, low system load, high transparency, and simple rollback mechanism 
make it the preferred choice for our table migration. To date, we have 
successfully migrated over 500 tables using Spark, and we expect this number to 
exceed 1,000 by the end of the month.
   
   > Migration Requirements and Background:
   
   During the migration process, we perform batch migrations based on the 
specific tasks of users, adhering to the following requirements:
   
   Migration Transparency: The migration process must be fully transparent to 
the user, and the original tables must remain available without downtime.
   
   Time Sensitivity: For daily-scheduled tasks, the migration must be completed 
within 1 day; for hourly-scheduled tasks, the migration must not exceed 1 hour.
   
   > Challenges with Hive Migration:
   
   - Heavy Load in High Concurrency: When migrating 30-40 tables concurrently, 
especially tables with thousands of partitions, or even more than 10,000, 
partition conversion through HMS is very slow. This increases the load on HMS 
and may affect the response times of other queries.
   
   - DDL Operation Risks: For tables with many partitions, users are typically 
hesitant to execute DDL operations directly due to concerns about inconsistent 
states or potential disruptions during the migration. This necessitates manual 
intervention by administrators, adding complexity and risk to the migration 
process.
   
   > Advantages of Spark Migration:
   
   - Lightweight Operation: Spark's migration process is very lightweight. 
During snapshot creation, it does not impact the original table, and metadata 
migration does not affect the table's usage. The only operations required are 
updating the HMS metadata and writing the metadata JSON file. To further reduce 
the load on HMS, we configure multiple HMS instances and use a polling 
mechanism, significantly lowering the load even for large tables.
   
   - No Concerns About Table Modifications: Since Spark’s migration process has 
little to no impact on the original table, users can continue to operate on the 
table during the migration. This provides greater flexibility and control to 
users, ensuring full transparency throughout the migration process.
   
   - Simple Rollback Mechanism: In case of any issues during migration, the 
rollback process is straightforward. Since no changes are made to the user 
data, the only necessary action is to delete the metadata file and restore the 
Hive metadata configuration, eliminating the complexity and potential risks of 
data recovery.
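
   The comment does not say which Iceberg entry points were used, so the 
following is only a sketch. It assumes the stock Iceberg Spark procedures 
`snapshot` and `migrate`, called with named arguments. The catalog name, the 
table names, the helper function names, and the `_iceberg_check` suffix are all 
hypothetical. The helpers only build the SQL text, and a real job would pass 
each statement to `spark.sql(...)`.

```python
import re

# Dotted identifier such as "db.sample"; anything else is rejected so that
# a table name cannot inject extra SQL into the generated statement.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def _check(name: str) -> str:
    """Return name unchanged if it is a plain dotted identifier, else raise."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def snapshot_sql(catalog: str, source_table: str, snapshot_table: str) -> str:
    """Build a CALL that creates an Iceberg snapshot table next to the
    source table, leaving the source table itself untouched."""
    return (
        f"CALL {_check(catalog)}.system.snapshot("
        f"source_table => '{_check(source_table)}', "
        f"table => '{_check(snapshot_table)}')"
    )


def migrate_sql(catalog: str, table: str) -> str:
    """Build a CALL that converts the table to Iceberg in place."""
    return f"CALL {_check(catalog)}.system.migrate(table => '{_check(table)}')"


def migration_statements(catalog, tables, snapshot_suffix="_iceberg_check"):
    """For each table, emit a trial snapshot statement followed by the
    in-place migrate statement (the suffix is an illustrative assumption)."""
    statements = []
    for table in tables:
        statements.append(snapshot_sql(catalog, table, table + snapshot_suffix))
        statements.append(migrate_sql(catalog, table))
    return statements


if __name__ == "__main__":
    for stmt in migration_statements("spark_catalog", ["db.sample"]):
        print(stmt)  # in a Spark job: spark.sql(stmt)
```

   Running the snapshot first gives a separate Iceberg table to validate 
against before the in-place step. Cleaning up the trial snapshot table 
afterwards is left out of the sketch.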


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

