slfan1989 commented on issue #12762: URL: https://github.com/apache/iceberg/issues/12762#issuecomment-2960987469
@szehon-ho @manuzhang @pvary @deniskuzZ @dramaticlly

> Conclusion

Compared to Hive, Spark offers significant advantages, especially in high-concurrency and large-scale migration scenarios. Its lightweight operation, low system load, high transparency, and simple rollback mechanism make it the preferred choice for our table migration. To date, we have successfully migrated over 500 tables using Spark, and we expect this number to exceed 1,000 by the end of the month.

> Migration Requirements and Background:

We migrate tables in batches, grouped by the user tasks that depend on them, and adhere to the following requirements:

- Migration Transparency: The migration process must be fully transparent to the user, and the original tables must remain available without downtime.
- Time Sensitivity: For daily-scheduled tasks, the migration must be completed within 1 day; for hourly-scheduled tasks, the migration must not exceed 1 hour.

> Challenges with Hive Migration:

- Heavy Load Under High Concurrency: When migrating 30-40 tables concurrently, especially tables with thousands of partitions or even more than 10,000, converting partitions through HMS is very slow. This increases the load on HMS and may affect the response times of other queries.
- DDL Operation Risks: For tables with many partitions, users are typically hesitant to execute DDL operations directly because of concerns about inconsistent states or disruptions during the migration. Administrators then have to intervene manually, which adds complexity and risk to the migration process.

> Advantages of Spark Migration:

- Lightweight Operation: Spark's migration process is very lightweight. Snapshot creation does not impact the original table, and metadata migration does not affect the table's usage. The only operations required are updating the HMS metadata and writing the metadata JSON file. To further reduce the load on HMS, we configure multiple HMS instances and use a polling mechanism, which significantly lowers the load even for large tables.
- No Concerns About Table Modifications: Since Spark's migration process has little to no impact on the original table, users can continue to operate on the table during the migration. This gives users greater flexibility and control and keeps the migration fully transparent.
- Simple Rollback Mechanism: If any issue arises during migration, rollback is straightforward. Since no changes are made to the user data, the only necessary action is to delete the metadata file and restore the Hive metadata configuration, which avoids the complexity and risk of data recovery.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
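As a rough illustration of the multi-HMS setup described in the comment, the sketch below plans a batch migration by rotating tables across several metastore endpoints. It rests on several assumptions that are not stated in the comment: that "polling" means round-robin rotation, that each table is converted with Iceberg's `snapshot` Spark procedure, and that the endpoint URIs, the `_iceberg` suffix, and the names `MigrationStep`, `snapshot_call`, and `plan_migration` are hypothetical. It only builds the plan; it does not start Spark or contact any metastore.

```python
from itertools import cycle
from typing import List, NamedTuple, Sequence


class MigrationStep(NamedTuple):
    table: str      # fully qualified source table, e.g. "db.events"
    hms_uri: str    # metastore endpoint assigned to this table
    statement: str  # Spark SQL to run in a session pointed at hms_uri


def snapshot_call(catalog: str, source: str, target: str) -> str:
    """Build a CALL statement for Iceberg's `snapshot` Spark procedure."""
    return (
        f"CALL {catalog}.system.snapshot("
        f"source_table => '{source}', table => '{target}')"
    )


def plan_migration(
    tables: Sequence[str],
    hms_uris: Sequence[str],
    catalog: str = "spark_catalog",  # assumed catalog name
    suffix: str = "_iceberg",        # hypothetical naming convention
) -> List[MigrationStep]:
    """Assign each table to an HMS endpoint in round-robin order."""
    if not hms_uris:
        raise ValueError("at least one HMS endpoint is required")
    endpoints = cycle(hms_uris)
    return [
        MigrationStep(t, next(endpoints), snapshot_call(catalog, t, t + suffix))
        for t in tables
    ]


if __name__ == "__main__":
    # Hypothetical endpoints and table names, for illustration only.
    plan = plan_migration(
        ["sales.orders", "sales.items", "logs.events"],
        ["thrift://hms-1:9083", "thrift://hms-2:9083"],
    )
    for step in plan:
        print(step.hms_uri, "|", step.statement)
```

Each step would then be executed by a Spark session configured for the assigned endpoint. Spreading tables this way keeps any single metastore from absorbing all partition lookups for a 30-40 table batch, which is the load concern raised above.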