suxiaogang223 commented on PR #59487: URL: https://github.com/apache/doris/pull/59487#issuecomment-3747571607
I've reviewed the current implementation against the [official Apache Iceberg Spark procedures documentation](https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_manifests), and I'd like to suggest some changes to better align with the upstream specification. --- ## 1. Parameter Design ### Spark Official Implementation | Argument Name | Required? | Type | Description | |--------------|:----------:|:------|:-------------| | `table` | ✔️ | string | Name of the table to update | | `use_caching` | ☐ | boolean | Use Spark caching during operation (defaults to false). Enabling caching can increase memory footprint on executors. | | `spec_id` | ☐ | int | Spec id of the manifests to rewrite (defaults to current spec id) | ### Current Doris Implementation | Parameter | Type | Description | |-----------|:------|:-------------| | `cluster-by-partition` | boolean | Cluster manifests by partition fields | | `rewrite-all` | boolean | Rewrite all manifests when true | | `min-manifest-size-bytes` | long | Minimum manifest file size threshold | | `max-manifest-size-bytes` | long | Maximum manifest file size threshold | | `scan-thread-pool-size` | int | Thread pool size for parallel scanning | ### Issue & Recommendation **The current implementation is over-engineered compared to the official specification.** I believe we should **simplify the implementation** to only support the basic `rewrite_manifests` functionality. The extensive filtering parameters for manifest files are **unnecessary** because: 1. **Manifest files are typically very small** - They contain metadata references to data files, not the actual data itself. Most manifest files range from a few KB to a few MB. 2. **The primary purpose of `rewrite_manifests` is to merge ALL small manifest files** into larger ones for better scan planning. This is a metadata optimization operation that should process all manifests by default. 3. **Over-filtering defeats the purpose** - The operation is meant to be comprehensive, not selective. If we only rewrite some manifests based on size thresholds, we miss the optimization benefits. ### Proposed Changes I recommend **removing** the following parameters: - ❌ `cluster-by-partition` (not in Spark spec - clustering is automatic based on partition spec) - ❌ `rewrite-all` (not in Spark spec - all manifests should be rewritten by default) - ❌ `min-manifest-size-bytes` (not in Spark spec - unnecessary complexity) - ❌ `max-manifest-size-bytes` (not in Spark spec - unnecessary complexity) - ❌ `scan-thread-pool-size` (not in Spark spec - engine-specific optimization) ### Recommended Parameter Set | Parameter | Required? | Type | Description | |-----------|:----------:|:------|:-------------| | `table` | ✔️ | string | Name of the table to update (inferred from ALTER TABLE) | | `spec_id` | ☐ | int | Spec id of the manifests to rewrite (defaults to current spec id) | **Note:** We could add `spec_id` as an optional parameter if needed for advanced use cases, but the core functionality should remain simple. --- ## 2. Return Value Schema ### Spark Official Output | Output Name | Type | Description | |-------------|:------|:-------------| | `rewritten_manifests_count` | int | Number of manifests which were re-written by this command | | `added_manifests_count` | int | **Number of new manifest files which were written by this command** | ### Current Doris Output | Column Name | Type | Description | |-------------|:------|:-------------| | `rewritten_manifests_count` | int | Number of data manifests rewritten by this command | | `total_data_manifests_count` | int | **Total number of data manifests BEFORE rewrite** | ### Issue The second output column is **inconsistent** with the official specification. The Spark version returns `added_manifests_count` (the number of **new** manifests created after the rewrite), not the total count before the operation. ### Recommended Change ```java // Current implementation return Lists.newArrayList( new Column("rewritten_manifests_count", Type.INT, false, "Number of data manifests rewritten by this command"), new Column("total_data_manifests_count", Type.INT, false, // ❌ Incorrect "Total number of data manifests before rewrite") ); // Recommended change - align with Spark spec return Lists.newArrayList( new Column("rewritten_manifests_count", Type.INT, false, "Number of manifests which were re-written by this command"), new Column("added_manifests_count", Type.INT, false, // ✅ Aligned with spec "Number of new manifest files which were written by this command") ); ``` **Note:** The `added_manifests_count` should reflect the actual number of new manifest files created by the rewrite operation, which is typically smaller than the original count. --- ## 3. Summary of Recommended Changes | Aspect | Current | Recommended | Reason | |--------|---------|-------------|--------| | **Parameters** | 5 custom params | 0-1 params (optional `spec_id`) | Align with Spark spec; manifest files are small and don't need filtering | | **Return Column 2** | `total_data_manifests_count` | `added_manifests_count` | Match official output schema | | **Functionality** | Complex with filtering logic | Simple: merge all manifests | Core purpose is optimization through consolidation | | **Implementation** | Custom predicate-based selection | Direct Iceberg API usage | Leverage upstream `RewriteManifests` action | --- ## 4. Benefits of Alignment ✅ **Compatibility** - Users familiar with Spark/Iceberg will have a consistent experience across engines ✅ **Simplicity** - Easier to understand, test, and maintain ✅ **Performance** - No overhead from unnecessary filtering logic ✅ **Documentation** - Can reference official docs directly without translation ✅ **Future-proof** - Automatically benefits from upstream improvements --- ## 5. Additional Considerations ### 5.1 Engine-Specific Optimizations Parameters like `scan-thread-pool-size` are engine-specific and should be handled at the engine configuration level, not as procedure parameters. Doris can internally manage thread pools for optimal performance without exposing this to users. ### 5.2 Backward Compatibility If there are concerns about removing existing parameters, we could: 1. Deprecate them with warnings in a minor release 2. Remove them in a future major release 3. Document that they are ignored (no-op) with a deprecation warning ### 5.3 Implementation Reference The recommended approach is to use Iceberg's `RewriteManifests` action directly: ```java RewriteManifests rm = table.rewriteManifests(); if (specId != null) { rm.specId(specId); } rm.commit(); ``` This is simpler and more aligned with the official specification than the current predicate-based approach. --- ## 6. Example Usage (After Changes) ```sql -- Basic usage - rewrite all manifests for current spec CALL catalog_name.system.rewrite_manifests('db.table'); -- Or via ALTER TABLE (Doris syntax) ALTER TABLE catalog_name.db.table EXECUTE rewrite_manifests(); -- Optional: rewrite manifests for a specific spec ALTER TABLE catalog_name.db.table EXECUTE rewrite_manifests('spec_id' = '1'); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
