Re: [PR] API,Core: Introduce metrics for data files by file format [iceberg]

via GitHub Fri, 23 Aug 2024 11:50:37 -0700


findepi commented on PR #5837:
URL: https://github.com/apache/iceberg/pull/5837#issuecomment-2307633354


   > for instance with Hive that used ORC format and with Impala that wrote 
Parquet files.
   
   That is likely addressed by the preferred file format being a table-level 
configuration?
   
   > Impala is more performant with Parquet but there are huge tables in 
production written in ORC hence the motivation to move from one format to 
another but they don't want to do it in one step due to the size of the table.
   
   That absolutely makes sense!
   If I have historical data, I can change the table's default format (e.g. from 
ORC to Parquet) but have no desire to rewrite old data.
   
   But then, why would anyone care, actually?
   Most queries operate on the freshest data, so they will see Parquet files. Some 
queries operate on large time windows and will see both ORC and Parquet files. It 
is unclear what problem per-format metrics would solve.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


