amogh-jahagirdar commented on code in PR #10784: URL: https://github.com/apache/iceberg/pull/10784#discussion_r1699116602
########## api/src/main/java/org/apache/iceberg/actions/RepairManifests.java: ########## @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.actions; + +import org.apache.iceberg.ManifestFile; + +/** + * An action that will repair manifests. Implementations should produce a new set of manifests where + * duplicate entries are removed and entries which refer to files which no longer exist are removed. + */ +public interface RepairManifests extends SnapshotUpdate<RepairManifests, RepairManifests.Result> { + + /** + * Determine the rewritten and added manifests without actually committing the operation to the + * table + * + * @return this for method chaining + */ + RepairManifests dryRun(); + + interface Result { + /** Returns rewritten manifests. */ + Iterable<ManifestFile> rewrittenManifests(); + + /** Returns added manifests. */ + Iterable<ManifestFile> addedManifests(); + + /** Returns the number of duplicate files removed */ + long duplicateFilesRemoved(); Review Comment: I agree, I think we can just put the Iterable here. Hopefully in the average repair case, it shouldn't really be millions of duplicate files? ########## api/src/main/java/org/apache/iceberg/actions/RepairManifests.java: ########## @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.actions; + +import org.apache.iceberg.ManifestFile; + +/** + * An action that will repair manifests. Implementations should produce a new set of manifests where + * duplicate entries are removed and entries which refer to files which no longer exist are removed. + */ +public interface RepairManifests extends SnapshotUpdate<RepairManifests, RepairManifests.Result> { + + /** + * Determine the rewritten and added manifests without actually committing the operation to the + * table + * + * @return this for method chaining + */ + RepairManifests dryRun(); Review Comment: > Shouldn't this be part of "apply()" which should already exist because this inherits PendingUpdate? @RussellSpitzer this inherits [`SnapshotUpdate`](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java#L30) in the Actions API, there's no apply() in that. I think you're thinking of the other [`SnapshotUpdate`](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/SnapshotUpdate.java) that's part of the "normal" API. This brings up a good point of should this also have an interface there or not. RewriteManifests is defined in the regular API and then there's also an action for that. The purpose of the Actions API is to enable engines to provide distributed/other custom implementations of procedures, and I think here we could just start by having it as an action and if there's a desire to add it to the typical API we could do it at that time? >should this be rather part of the actual Spark procedure rather than in the API? As I'm not sure how one would examine the results of a dry run in this case. For example, RemoveOrphanFilesProcedure has a dry run option part of the procedure, but not as part of the DeleteOrphanFiles API Good point, I missed that this is what orphan file removal does. If I think through it a bit more: by defining dryRun() at the interface level, what we're saying is *implementations* must have a way of supporting that (the implementation could ignore the option technically but that would be a bad implementation of the interface). If we look at what orphan file does, the action interface has a "deleteWith" API and the procedure will pass in a no-op deletion function. What we could do is have an action API "commitRepairWith()" which will accept some sort of commit function to perform the manifest rewrite or if the procedure is a dry-run, some no-op commit. It feels like it complicates the API more to do that. I think the real fix is to return Result as part of the dryRun...let me double check if this is feasible. ########## api/src/main/java/org/apache/iceberg/actions/RepairManifests.java: ########## @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.actions; + +import org.apache.iceberg.ManifestFile; + +/** + * An action that will repair manifests. Implementations should produce a new set of manifests where + * duplicate entries are removed and entries which refer to files which no longer exist are removed. + */ +public interface RepairManifests extends SnapshotUpdate<RepairManifests, RepairManifests.Result> { Review Comment: I think `SnapshotUpdate` is the correct one to extend, here's my thinking: 1.) SnapshotUpdate already extends Action 2.) I think implementations necessarily will produce snapshots when they perform repairs (putting aside the dry run API option). The new snapshot that would be produced, at least in my mind in the reference implementation would be essentially a "rewrite manifests"-like snapshot. ########## api/src/main/java/org/apache/iceberg/actions/RepairManifests.java: ########## @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.actions; + +import org.apache.iceberg.ManifestFile; + +/** + * An action that will repair manifests. Implementations should produce a new set of manifests where + * duplicate entries are removed and entries which refer to files which no longer exist are removed. Review Comment: @RussellSpitzer @szehon-ho I largely wanted to avoid a user having to think through which options they need to specify, and make sure implementations have sane defaults but I think I agree at least at the actions API level we could have separate configuraiton methods. When the procedure implementation is there, users could pass in what they want and the procedure implementation has some sane defaults (power users who use the raw actions APIs would have to set every option they want explicitly). cc @danielcweeks in case he had any thoughts. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org