gaborkaszab commented on PR #5837:
URL: https://github.com/apache/iceberg/pull/5837#issuecomment-2312634994

   > most queries operate on freshmost data, so they will see Parquet files
   
   In general this is true, but we still see users ending up with tables of
mixed file formats and running queries that read more than one type of data
file. The motivation for introducing these metrics is mostly observability.
For example, a user complains that some queries are slow, and when checking
the profiles it can stand out that they query mixed formats. That is just one
example. Another is Impala-specific: we do memory estimation during query
planning, and there the file formats being read (and how many files of each)
are a factor.
   
   I have been trying to get this PR merged for 2 years now, and have spent 
that time iterating on the code multiple times to make it more generic for 
multi-dimensional metrics, putting non-negligible effort into this. Since we 
are still discussing whether this makes any sense, I have to ask if there is 
any point in pushing this any further.
   The way I see it: 1) extra metrics never hurt, 2) if someone has put 
effort into this, it is most probably useful for at least that someone, 
3) having thorough metrics in production-grade software is essential for 
debugging, supportability and observability; query engines have 100s if not 
1000s of them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

