RussellSpitzer commented on PR #13084: URL: https://github.com/apache/iceberg/pull/13084#issuecomment-2959899198
I'm really not comfortable with the scope of the changes to existing behavior that we are proposing here. I think moving to a remote delete is a good idea to avoid serialization, but I think we need to punt on the idea of fitting it into the existing execution path and expectations.

I have a few proposals for paths forward that I would be comfortable with:

1. Just expose the join dataframe like we do in expire snapshots (example: https://github.com/apache/iceberg/pull/13289). This essentially punts the logic to the end user; for users who have async delete services this is probably ideal.
2. Add a new method which does a remote delete with a different contract, i.e. skip doExecute and have it be part of a new method, "remoteDelete" or something, which creates the dataframes, performs the delete remotely, and returns just an Integer number of files deleted. Additionally, we could wire this up to the procedure with a different return type.
3. Continue using doExecute but use a parameter to switch between the old code and the new distributed path. The default should remain the local mode of operation, but we could use an option to opt into the new delete logic. I think in this case we should abandon the contract for the result set when going down the distributed route. Returning the whole result set seems like we are just bringing back the original problem and adding complexity.

I think in any of these cases we should avoid caching if at all possible. In my experience this basically doubles to triples the execution time due to another round of serde. We should also avoid any actions that cause the join to be performed more than once; that means no calling count prematurely.

I'm open to other ideas as well if you have any other suggestions.
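To make the contract in option 3 a bit more concrete, here is a rough, single-process sketch. Every name in it (`DeleteModeSketch`, `DeleteMode`, `Result`, `deleteOrphans`) is hypothetical and not an existing Iceberg API, and a plain loop stands in for whatever would actually run on the executors. It only illustrates the shape being suggested: local mode stays the default and keeps returning the deleted paths, the opt-in distributed mode returns just a count, and the candidate set is walked exactly once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch only: none of these types or methods exist in Iceberg.
final class DeleteModeSketch {

  // LOCAL mirrors the current behavior; DISTRIBUTED is the opt-in path.
  enum DeleteMode { LOCAL, DISTRIBUTED }

  // Result contract: deletedFiles is populated only in LOCAL mode,
  // deletedCount is always populated.
  static final class Result {
    final List<String> deletedFiles;
    final long deletedCount;

    Result(List<String> deletedFiles, long deletedCount) {
      this.deletedFiles = Collections.unmodifiableList(deletedFiles);
      this.deletedCount = deletedCount;
    }
  }

  // Walks the orphan candidates exactly once. The loop is a stand-in for
  // the real delete; in DISTRIBUTED mode no paths are collected, so nothing
  // proportional to the number of files is carried back to the caller.
  static Result deleteOrphans(
      Iterable<String> orphanFiles, Consumer<String> deleteFunc, DeleteMode mode) {
    List<String> deleted = new ArrayList<>();
    long count = 0;
    for (String file : orphanFiles) {
      deleteFunc.accept(file);
      count++;
      if (mode == DeleteMode.LOCAL) {
        deleted.add(file);
      }
    }
    return new Result(deleted, count);
  }

  public static void main(String[] args) {
    List<String> orphans = Arrays.asList("s3://bucket/a.parquet", "s3://bucket/b.parquet");
    Result local = deleteOrphans(orphans, f -> { }, DeleteMode.LOCAL);
    Result remote = deleteOrphans(orphans, f -> { }, DeleteMode.DISTRIBUTED);
    System.out.println(local.deletedFiles.size() + " paths, count " + local.deletedCount);
    System.out.println(remote.deletedFiles.size() + " paths, count " + remote.deletedCount);
  }
}
```

The count is accumulated during the same pass that performs the deletes, rather than by a separate count over the candidates, which is the point of the "no calling count prematurely" concern.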
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org