dramaticlly commented on code in PR #13720:
URL: https://github.com/apache/iceberg/pull/13720#discussion_r2308573994
##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java:
##########
@@ -230,7 +230,8 @@ public void testStartVersion() throws Exception {
.startVersion("v2.metadata.json")
.execute();
- checkFileNum(1, 1, 1, 4, result);
+ // 1 metadata JSON file, 1 snapshot, 2 manifests, 1 data file
+ checkFileNum(1, 1, 2, 5, result);
Review Comment:
@stevenzwu raised a good point: this length fix should not change the
behavior of the delta rewrite. I recall we had an earlier discussion about
this in https://github.com/apache/iceberg/pull/13720#discussion_r2292103624,
but on second thought I believe we are trying to do several things at once:
we now rewrite extra manifests that were not added by the delta snapshots,
just to fix the manifest size in the new manifest list.
```mermaid
graph TD
subgraph "Version File Metadata"
V2["v2.json"]
V3["v3.json"]
end
subgraph "Manifest Lists"
S1["snap1.avro"]
S2["snap2.avro"]
end
subgraph "Manifests"
M1["m1.avro"]
M2["m2.avro"]
end
V2 --> S1
V3 --> S2
S1 --> M1
S2 --new--> M2
S2 --existing--> M1
```
Let's use this unit test as an example. If we do an incremental rewrite
starting from v2.metadata.json:
- Before:
we scan the manifests table and keep manifests whose added_snapshot_id is in
deltaSnapshotIdSet, so we only need to rewrite 1 manifest (m2.avro)
- After:
we scan the all_manifests table and keep manifests whose
reference_snapshot_id is in deltaSnapshotIdSet, so we have to rewrite
2 manifests (m1.avro & m2.avro)
I think this might help fix the previous manifest size problem in the new
manifest list, but we are no longer doing an incremental path rewrite (where
only the added snapshots should determine the delta).
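
To make the difference concrete, here is a minimal plain-Java sketch of the
two filters applied to the layout in the diagram above. It does not use the
Iceberg or Spark APIs: the `ManifestRow` record, the snapshot ids 1 and 2, and
the helper names are all hypothetical and only mimic the added_snapshot_id and
reference_snapshot_id columns. It assumes that starting from v2.metadata.json
leaves snap2 as the only delta snapshot.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class ManifestFilterSketch {
  // Hypothetical row shape: one row per (manifest, referencing snapshot).
  record ManifestRow(String path, long addedSnapshotId, long referenceSnapshotId) {}

  // Assumed ids for snap1 and snap2 from the diagram.
  static final long SNAP1 = 1L;
  static final long SNAP2 = 2L;

  // Stand-in for the manifests table: manifests of the current snapshot (snap2).
  static List<ManifestRow> manifests() {
    return List.of(
        new ManifestRow("m1.avro", SNAP1, SNAP2),
        new ManifestRow("m2.avro", SNAP2, SNAP2));
  }

  // Stand-in for the all_manifests table: manifests of every snapshot.
  static List<ManifestRow> allManifests() {
    return List.of(
        new ManifestRow("m1.avro", SNAP1, SNAP1),
        new ManifestRow("m1.avro", SNAP1, SNAP2),
        new ManifestRow("m2.avro", SNAP2, SNAP2));
  }

  // "Before": keep manifests that a delta snapshot added.
  static List<String> byAddedSnapshot(List<ManifestRow> rows, Set<Long> delta) {
    return rows.stream()
        .filter(r -> delta.contains(r.addedSnapshotId()))
        .map(ManifestRow::path)
        .distinct()
        .sorted()
        .collect(Collectors.toList());
  }

  // "After": keep manifests that a delta snapshot references.
  static List<String> byReferenceSnapshot(List<ManifestRow> rows, Set<Long> delta) {
    return rows.stream()
        .filter(r -> delta.contains(r.referenceSnapshotId()))
        .map(ManifestRow::path)
        .distinct()
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Set<Long> deltaSnapshotIdSet = Set.of(SNAP2);
    System.out.println(byAddedSnapshot(manifests(), deltaSnapshotIdSet));
    System.out.println(byReferenceSnapshot(allManifests(), deltaSnapshotIdSet));
  }
}
```

In this toy model the first filter selects only m2.avro, while the second
also picks up m1.avro, because snap2 still references it even though snap1
added it.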
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]