RussellSpitzer commented on PR #13084:
URL: https://github.com/apache/iceberg/pull/13084#issuecomment-2959899198

   I'm really not comfortable with the scope of changes to existing behavior 
that we are proposing here. I think that moving to a remote delete is a good 
idea to avoid serialization but I think we need to punt on the idea of fitting 
it into the existing execution path and expectations.
   
   I have a few proposals for paths forward that I would be comfortable with:
   1. Just expose the join dataframe like we do in expire snapshots - 
https://github.com/apache/iceberg/pull/13289 (Example)
       This essentially punts the logic to the end user. For users who 
have async delete services this is probably ideal.
       
   2. Add a new method which does a remote delete with a different contract, 
i.e. skip doExecute and have it be part of a new method, "remoteDelete" or 
something, which creates the dataframes, performs the delete remotely, and 
returns just an Integer number of files deleted. Additionally, we could wire 
this up to the procedure with a different return type.
       
   3. Continue using doExecute but use a parameter to switch between the old 
code and the new distributed path. The default should remain the local mode of 
operation but we could use an option to opt into the new delete logic. I think 
in this case we should abandon the contract for the result set when going down 
the distributed route. Returning the whole result set seems like we are just 
bringing back the original problem and adding complexity.
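
   To make option 3 concrete, here is a rough, Spark-free sketch of the 
contract I have in mind. All names here (`DeleteModeSketch`, `Mode`, the 
`delete-mode` option key) are made up for illustration and are not existing 
Iceberg APIs. The point is only that the default stays local and returns the 
deleted paths, while the opt-in distributed mode returns just a count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Consumer;

// Hypothetical sketch only; none of these names exist in Iceberg.
class DeleteModeSketch {
  enum Mode { LOCAL, DISTRIBUTED }

  // LOCAL keeps the old contract (paths are returned);
  // DISTRIBUTED only reports how many files were deleted.
  static final class Result {
    final long deletedCount;
    final Optional<List<String>> deletedPaths;

    Result(long deletedCount, Optional<List<String>> deletedPaths) {
      this.deletedCount = deletedCount;
      this.deletedPaths = deletedPaths;
    }
  }

  // Assumed option key "delete-mode"; defaults to the existing local behavior.
  static Mode modeFrom(Map<String, String> options) {
    String raw = options.getOrDefault("delete-mode", "local");
    return Mode.valueOf(raw.toUpperCase(Locale.ROOT));
  }

  static Result execute(Iterable<String> orphanFiles, Consumer<String> deleteFunc, Mode mode) {
    if (mode == Mode.LOCAL) {
      // Old path: collect every deleted path for the caller.
      List<String> deleted = new ArrayList<>();
      for (String path : orphanFiles) {
        deleteFunc.accept(path);
        deleted.add(path);
      }
      return new Result(deleted.size(), Optional.of(deleted));
    }
    // New path: never materialize the path list, only count.
    long count = 0;
    for (String path : orphanFiles) {
      deleteFunc.accept(path);
      count++;
    }
    return new Result(count, Optional.empty());
  }

  public static void main(String[] args) {
    Map<String, String> options = new HashMap<>();
    options.put("delete-mode", "distributed");
    Result result = execute(List.of("f1", "f2"), path -> { }, modeFrom(options));
    System.out.println(result.deletedCount + " deleted, paths returned: "
        + result.deletedPaths.isPresent());
  }
}
```

   In the real action the distributed branch would run on the executors; the 
loop here only stands in for that to show the shape of the return value.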
   
   I think in any of these cases we should avoid caching if at all possible. In 
my experience caching roughly doubles or triples the execution time due to 
another round of serde.
   
   We should also avoid any actions that cause the join to be performed more 
than once, which means not calling count prematurely.
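
   As a toy illustration of the double-evaluation concern (plain Java, no 
Spark): the `Supplier` below is a stand-in for a lazily evaluated join that is 
recomputed every time an action runs on it. The class and method names are 
hypothetical. Counting first and then deleting evaluates the join twice, while 
counting inside the delete pass evaluates it once.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Hypothetical sketch; the Supplier models an uncached, lazily computed join.
class SinglePassDeleteSketch {
  // Every get() "re-runs the join" and bumps the evaluation counter.
  static Supplier<Stream<String>> lazyJoin(List<String> orphans, AtomicInteger evaluations) {
    return () -> {
      evaluations.incrementAndGet();
      return orphans.stream();
    };
  }

  // What to avoid: a separate count action followed by the delete action.
  static long countThenDelete(Supplier<Stream<String>> join, Consumer<String> deleteFunc) {
    long count = join.get().count();   // first evaluation
    join.get().forEach(deleteFunc);    // second evaluation
    return count;
  }

  // Preferred: delete and count in the same pass over the join output.
  static long deleteAndCount(Supplier<Stream<String>> join, Consumer<String> deleteFunc) {
    AtomicLong count = new AtomicLong();
    join.get().forEach(path -> {
      deleteFunc.accept(path);
      count.incrementAndGet();
    });
    return count.get();
  }

  public static void main(String[] args) {
    AtomicInteger evaluations = new AtomicInteger();
    Supplier<Stream<String>> join = lazyJoin(List.of("f1", "f2", "f3"), evaluations);
    long deleted = deleteAndCount(join, path -> { });
    System.out.println(deleted + " files, join evaluated " + evaluations.get() + " time(s)");
  }
}
```

   In the actual Spark job I would expect the equivalent to be counting as 
part of the delete itself rather than as a separate action, but that is an 
assumption about the eventual implementation, not something this sketch shows.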
   
   I'm open to other ideas as well if you have any suggestions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

