qqchang2nd commented on issue #12251: URL: https://github.com/apache/iceberg/issues/12251#issuecomment-2662259040
Let me share our use case at Sensors Data. We use HadoopTables to manage Iceberg tables, with Impala as our query engine. In one of our customer environments, we encountered a query that accessed only one day's worth of data but took about 12 seconds in total, with 6 seconds spent on plan analysis alone. Through Arthas tracing, we identified manifest loading as the bottleneck. After enabling manifest caching, the plan analysis time dropped to under 200ms. For these types of queries, manifest caching can improve overall query performance by up to 2x.

At Sensors Data, we currently have over 100 customers using the HadoopTables management approach, and this number is expected to grow to thousands in the future. We don't have plans to migrate to other catalog implementations like HadoopCatalog.

The performance improvement from manifest caching is significant for our use case because:

1. We have long-running Impala daemons that repeatedly access these tables
2. The manifest files remain relatively stable during query execution
3. The memory overhead of caching is acceptable given the performance benefits

Would you be interested in more detailed performance data or our specific usage patterns?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
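As a rough check of the "up to 2x" figure in the comment above, the arithmetic can be sketched as follows. This assumes that only the plan analysis phase gets faster and that the remaining 6 seconds of execution are unchanged, which the comment does not state explicitly. The class and method names (`PlanSpeedup`, `speedup`) are hypothetical and not part of Iceberg or Impala.

```java
import java.util.Locale;

public class PlanSpeedup {
    // Overall speedup when only the planning phase changes.
    // Assumption: everything outside planning takes the same time before and after.
    static double speedup(double totalBeforeMs, double planBeforeMs, double planAfterMs) {
        double totalAfterMs = totalBeforeMs - planBeforeMs + planAfterMs;
        return totalBeforeMs / totalAfterMs;
    }

    public static void main(String[] args) {
        // Figures from the comment: 12 s total, 6 s planning, planning under 200 ms after caching.
        double s = speedup(12_000, 6_000, 200);
        System.out.println(String.format(Locale.ROOT, "overall speedup: %.2fx", s));
        // prints "overall speedup: 1.94x"
    }
}
```

With planning at exactly 200ms the overall speedup is about 1.94x; it approaches 2x only as planning time approaches zero, which is consistent with the "up to 2x" wording.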