gaborkaszab commented on PR #5837: URL: https://github.com/apache/iceberg/pull/5837#issuecomment-2312634994
> most queries operate on freshmost data, so they will see Parquet files

In general this is true, but we still see users ending up with tables in mixed file formats and running queries that read more than one type of data file. The motivation for introducing these metrics is mostly observability. For example, a user complains that some queries are slow, and when checking the profiles it can stand out that they query mixed formats. That is just one example. Another is Impala-specific: we do memory estimation during query planning, and there the file formats being read (and how many files of each) are a factor.

I have been trying to get this PR merged for 2 years now, and I have put non-negligible effort into multiple iterations of the code to make it more generic for multi-dimensional metrics. Since we are still discussing whether this makes any sense, I have to ask if there is any point in pushing this any further.

The way I see it, first and foremost:
1) extra metrics never hurt
2) if someone has put effort into this, it is most probably useful for at least that someone
3) thorough metrics in production-grade software are essential for debugging, supportability and observability; query engines have 100s if not 1000s of them.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org