BsoBird opened a new pull request, #10623: URL: https://github.com/apache/iceberg/pull/10623
# Refactor the code of HadoopTableOptions ## 1.Current problems ### 1.1.Enabling `write.metadata.delete-after-commit.enabled` may result in dirty commits. Since this option is turned on, catalog will clean up the commit history. If the client commits an older version at this point, the commit will succeed because the metadata has been deleted. However, the user will not be able to view this commit. From the user's point of view, he will encounter a very strange result: his submission is successful, but the content of the submission disappears. In addition, dirty commits generated in this case will not be cleaned up. ### 1.2.Problems with lockManager's locking policy. #### 1.2.1.Incorrect behaviour of locking and releasing But it ignores the fact that the writeVersionHint process may also need to be locked. In current implementations, users rely on versionHintFile to speed up version lookups. If the information in the versionHintFile is incorrect, then it will lead to big trouble. Example:If more than one client executes the commit method concurrently, the version information in the versionHintFile may be written to a very old version.(versionHintFile’s version < currentVersion - write.metadata.previous-versions-max). At this point, the user cannot call the `refresh` method to get the latest version. Eventually, the user's commit will **fail forever**. #### 1.2.2.Incorrect timing of locking and releasing he current version of the code is locked only for commits of a certain version, e.g., multiple clients are committing to version 3 at the same time, and the lock is in effect. However, when multiple clients submit different versions at the same time, they are not locked to each other. For object storage filesystems, such locking may cause the information in the versionHintFile to be incorrect. For normal filesystems, such locking is meaningless, since the OS provides the same functionality when calling fs.rename. And it also causes the versionHintFile content to be messed up. #### 1.2.3.Incorrect behaviour of handing exception The current implementation of the commit method is not atomic, assuming that the client calls renameMetadata successfully and then encounters an OOM exception, which triggers spark to clean up the data files from this commit. But spark doesn't clean up the metadata files generated by this commit, which ultimately leads to the problem described in [here](https://github.com/apache/iceberg/pull/9546#issue-2095166356). **So.given the large number of problems with hadoopTableOptions, we need to completely refactor hadoopTableOptions.** ## 2.List of features to be supported ### 2.1.In the case of a user using an arbitrary storage system with a global locking service, we need to support the following features: - Supports concurrency control - No dirty commits - Avoid generating versionHintFile files with misplaced content - Guarantees the atomicity of commit operations. ### 2.2.In the case where the user is using a normal file system (which supports rename operations that do not overwrite the target file) and is not using the global locking service, the following features need to be supported: - Supports concurrency control - Regardless of whether the contents of the versionHintFile are incorrect, the client can submit the correct metadata. - Support for cleaning up dirty commits - Guarantees the atomicity of commit operations. ## 3.Glossary: - Dirty-commit: `commitVersion < currentVersion - write.metadata.previous-versions-max` && `commit-success=True` - Normal file system: which supports rename operations that do not overwrite the target file,Example: `mv -n a.txt b.txt`,`HdfsFileSystem.rename(src,dst)`,`WindowsFs.rename(sec,dst)` - Atomicity of commit: After the new version of the metadata has been renamed successfully, any exceptions will be ignored. Any exceptions thrown before the metadata has been renamed will be treated as a commit failure. If the rename of the new version of the metadata fails, we will throw a `CommitStateUnknown` exception and stop everything. The operation of VersionHintFile will be disregarded. - Global locking service: All clients operating on the iceberg table must acquire the same lock. Successful acquisitions are required before subsequent commits can be performed. Example: `Distributed locking based on zk/redis/database`, `constrains all operations to the same in-memory lock`. ## 4.Flow chart: ```mermaid graph TB START[commit-start] -->TRY-LOCK{try-get-lock} TRY-LOCK --> |fail| COMMIT-FAIL[commit-fail] TRY-LOCK --> |success| CHECK-EXISTS{check-metadata-file-exists} CHECK-EXISTS -->|already-exists| COMMIT-FAIL CHECK-EXISTS -->|not-exists| GLOBAL-LOCK{useGlobalLockingService?} GLOBAL-LOCK -->|yes| FIND-LATEST-VERSION(findLatestVersion) GLOBAL-LOCK -->|no| FIND-LATEST-VERSION-NO-HINT(findLatestVersionWithOutVersionHint) FIND-LATEST-VERSION-NO-HINT --> CHECK-COMMIT-VERSION{checkCurrentVersionIsLatest} FIND-LATEST-VERSION --> CHECK-COMMIT-VERSION CHECK-COMMIT-VERSION --> |no| COMMIT-FAIL(commit-fail) CHECK-COMMIT-VERSION --> |yes| DO-COMMIT{do-commit} FAST-FAIL --> COMMIT-STATE-UNKNOWN(commit-state-unknow) DO-COMMIT --> |not-success| COMMIT-FAIL(commit-fail) DO-COMMIT --> |server-side-exception| TRY-CHECK-COMMIT(try-check) DO-COMMIT --> |commit-success| NEED-CLEAN-DIRTY{useGlobalLockingService?} DO-COMMIT --> |client-side-exception| COMMIT-STATE-UNKNOWN TRY-CHECK-COMMIT --> |check-success-and-commit-fail| COMMIT-FAIL TRY-CHECK-COMMIT --> |check-success-and-commit-success| COMMIT-SUCCESS TRY-CHECK-COMMIT --> |exception| COMMIT-STATE-UNKNOWN NEED-CLEAN-DIRTY --> |yes| CHECK-DIRTY-COMMIT{this-commit-is-a-dirty-commit?} CHECK-DIRTY-COMMIT --> |no| CLEAN-DIRTY(clean-old-dirty-commit) CHECK-DIRTY-COMMIT --> |yes| FAST-FAIL(fast-fail) CLEAN-DIRTY --> COMMIT-SUCCESS NEED-CLEAN-DIRTY --> |no| COMMIT-SUCCESS COMMIT-SUCCESS --> UNLOCK(UNLOCK) COMMIT-STATE-UNKNOWN --> UNLOCK COMMIT-FAIL --> UNLOCK UNLOCK --> END(END) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org