amogh-jahagirdar commented on code in PR #10784:
URL: https://github.com/apache/iceberg/pull/10784#discussion_r1699116602


##########
api/src/main/java/org/apache/iceberg/actions/RepairManifests.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.ManifestFile;
+
+/**
+ * An action that will repair manifests. Implementations should produce a new 
set of manifests where
+ * duplicate entries are removed and entries which refer to files which no 
longer exist are removed.
+ */
+public interface RepairManifests extends SnapshotUpdate<RepairManifests, 
RepairManifests.Result> {
+
+  /**
+   * Determine the rewritten and added manifests without actually committing 
the operation to the
+   * table
+   *
+   * @return this for method chaining
+   */
+  RepairManifests dryRun();
+
+  interface Result {
+    /** Returns rewritten manifests. */
+    Iterable<ManifestFile> rewrittenManifests();
+
+    /** Returns added manifests. */
+    Iterable<ManifestFile> addedManifests();
+
+    /** Returns the number of duplicate files removed */
+    long duplicateFilesRemoved();

Review Comment:
   I agree, I think we can just put the Iterable here. Hopefully in the average 
repair case, it shouldn't really be millions of duplicate files? 



##########
api/src/main/java/org/apache/iceberg/actions/RepairManifests.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.ManifestFile;
+
+/**
+ * An action that will repair manifests. Implementations should produce a new 
set of manifests where
+ * duplicate entries are removed and entries which refer to files which no 
longer exist are removed.
+ */
+public interface RepairManifests extends SnapshotUpdate<RepairManifests, 
RepairManifests.Result> {
+
+  /**
+   * Determine the rewritten and added manifests without actually committing 
the operation to the
+   * table
+   *
+   * @return this for method chaining
+   */
+  RepairManifests dryRun();

Review Comment:
   > Shouldn't this be part of "apply()" which should already exist because 
this inherits PendingUpdate?
   
   @RussellSpitzer this inherits 
[`SnapshotUpdate`](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java#L30)
 in the Actions API, there's no apply() in that. 
   
   I think you're thinking of the other 
[`SnapshotUpdate`](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/SnapshotUpdate.java)
 that's part of the "normal" API. This brings up a good point of should this 
also have an interface there or not. RewriteManifests is defined in the regular 
API and then there's also an action for that.
   
   The purpose of the Actions API is to enable engines to provide 
distributed/other custom implementations of procedures, and I think here we 
could just start by having it as an action and if there's a desire to add it to 
the typical API we could do it at that time?
   
   
   >should this be rather part of the actual Spark procedure rather than in the 
API? As I'm not sure how one would examine the results of a dry run in this 
case. For example, RemoveOrphanFilesProcedure has a dry run option part of the 
procedure, but not as part of the DeleteOrphanFiles API
   
   Good point, I missed that this is what orphan file removal does. If I think 
through it a bit more: by defining dryRun() at the interface level, what we're 
saying is *implementations* must have a way of supporting that (the 
implementation could ignore the option technically but that would be a bad 
implementation of the interface). 
   
   If we look at what orphan file does, the action interface has a "deleteWith" 
API and the procedure will pass in a no-op deletion function. 
   
   What we could do is have an action API "commitRepairWith()" which will 
accept some sort of commit function to perform the manifest rewrite or if the 
procedure is a dry-run, some no-op commit. 
   It feels like it complicates the API more to do that.
   
   I think the real fix is to return Result as part of the dryRun...let me 
double check if this is feasible.



##########
api/src/main/java/org/apache/iceberg/actions/RepairManifests.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.ManifestFile;
+
+/**
+ * An action that will repair manifests. Implementations should produce a new 
set of manifests where
+ * duplicate entries are removed and entries which refer to files which no 
longer exist are removed.
+ */
+public interface RepairManifests extends SnapshotUpdate<RepairManifests, 
RepairManifests.Result> {

Review Comment:
   I think `SnapshotUpdate` is the correct one to extend, here's my thinking:
   
   1.) SnapshotUpdate already extends Action
   2.) I think implementations necessarily will produce snapshots when they 
perform repairs (putting aside the dry run API option). The new snapshot that 
would be produced, at least in my mind in the reference implementation would be 
essentially a "rewrite manifests"-like snapshot. 



##########
api/src/main/java/org/apache/iceberg/actions/RepairManifests.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.ManifestFile;
+
+/**
+ * An action that will repair manifests. Implementations should produce a new 
set of manifests where
+ * duplicate entries are removed and entries which refer to files which no 
longer exist are removed.

Review Comment:
   @RussellSpitzer @szehon-ho  I largely wanted to avoid a user having to think 
through which options they need to specify, and make sure implementations have 
sane defaults but I think I agree at least at the actions API level we could 
have separate configuraiton methods. When the procedure implementation is 
there, users could pass in what they want and the procedure implementation has 
some sane defaults (power users who use the raw actions APIs would have to set 
every option they want explicitly). cc @danielcweeks in case he had any 
thoughts.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to