rdblue commented on PR #7194:
URL: https://github.com/apache/iceberg/pull/7194#issuecomment-1519124565

   I'm going to close this PR because I don't think it is an approach that 
makes sense for the Iceberg project.
   
   One of the reasons Iceberg exists is that it is important to solve 
problems in the right place. Before, we needed to solve problems in the 
processing engine or in a file format, and that produced awkward, half-baked 
solutions. Similarly, I think that this is not the right place or a good 
approach for optimization.
   
   First, the ideal approach is to write data correctly in the first place. 
That's why Iceberg defines table-level tuning settings and write order, and why 
we request distribution and ordering in engines like Spark. We want to be able 
to asynchronously optimize tables, but we don't want to require it if we don't 
need to. Focusing effort on fixing the underlying problem (creating too many 
files) is a better approach. I think we should see if we can address the 
problem in the write path by coalescing outputs and aligning write distribution 
with table partitioning.
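
   As a rough illustration of why aligning write distribution with table 
partitioning matters, here is a toy model in Python. It is not Iceberg or 
Spark code: `count_files`, the task count, and the `day=` partition values are 
all assumptions made up for the sketch. It assumes each write task produces one 
file per partition it receives rows for.

```python
def count_files(rows, num_tasks, distribute_by_partition):
    """Count output files, assuming one file per (task, partition) pair."""
    files = set()
    for i, partition in enumerate(rows):
        if distribute_by_partition:
            # Hash-style distribution: every row of a partition goes to one task.
            task = hash(partition) % num_tasks
        else:
            # No requested distribution: rows land on tasks round-robin.
            task = i % num_tasks
        files.add((task, partition))
    return len(files)


# Hypothetical workload: 1000 rows spread over 5 partitions, 8 write tasks.
rows = [f"day={i % 5}" for i in range(1000)]

unaligned = count_files(rows, num_tasks=8, distribute_by_partition=False)
aligned = count_files(rows, num_tasks=8, distribute_by_partition=True)

print(unaligned, aligned)  # 40 files without alignment, 5 with it
```

   In this model the unaligned write touches every partition from every task, 
so the file count scales with tasks times partitions, while the aligned write 
produces one file per partition.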
   
   Second, kicking off a job in a specific downstream system through an API 
intended to collect metrics is not a good design for asynchronous optimization. 
Quite a few comments question aspects of this. Those are valid concerns. But 
ignoring the specifics, I think that the choices here were made because this is 
attempting to solve a problem in the wrong place. Rather than going that 
direction and then getting pulled deeper into a mess -- adding more compute 
options or rules for how to take action -- I think the right approach is to 
have APIs that enable people to build optimizers, similar to how we handle 
catalogs. That's why we built metrics reporting as an API: to get important 
information to downstream systems.
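
   The separation described above could be sketched roughly as follows. This 
is a minimal Python model under stated assumptions, not Iceberg's actual API: 
`CommitReport`, `ReportingTable`, `ExternalOptimizer`, and the 32 MiB threshold 
are hypothetical names and values chosen for illustration. The write side only 
publishes information, and the downstream system owns the decision about 
whether to act.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CommitReport:
    """Hypothetical summary of one commit; not Iceberg's real report class."""
    table: str
    added_data_files: int
    added_bytes: int


@dataclass
class ReportingTable:
    """Write side: it only publishes reports and never starts jobs itself."""
    name: str
    reporters: List[Callable[[CommitReport], None]] = field(default_factory=list)

    def commit(self, added_data_files: int, added_bytes: int) -> None:
        report = CommitReport(self.name, added_data_files, added_bytes)
        for reporter in self.reporters:
            reporter(report)


class ExternalOptimizer:
    """Downstream side: owns the policy for when compaction is worthwhile."""

    def __init__(self, small_file_bytes: int) -> None:
        self.small_file_bytes = small_file_bytes
        self.compaction_queue: List[str] = []

    def on_report(self, report: CommitReport) -> None:
        if report.added_data_files == 0:
            return
        average = report.added_bytes / report.added_data_files
        # Queue the table only if the commit wrote small files on average.
        if average < self.small_file_bytes and report.table not in self.compaction_queue:
            self.compaction_queue.append(report.table)


optimizer = ExternalOptimizer(small_file_bytes=32 * 1024 * 1024)

events = ReportingTable("db.events", reporters=[optimizer.on_report])
events.commit(added_data_files=200, added_bytes=400 * 1024 * 1024)  # about 2 MiB per file

metrics = ReportingTable("db.metrics", reporters=[optimizer.on_report])
metrics.commit(added_data_files=2, added_bytes=512 * 1024 * 1024)  # 256 MiB per file

print(optimizer.compaction_queue)
```

   Compute options and rules for taking action live entirely in the 
hypothetical optimizer, so changing them would not require touching the 
reporting side.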


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

