Re: [PR] [Feature](iceberg) Implement rewrite_manifests procedure for Iceberg [doris]

via GitHub Tue, 13 Jan 2026 19:52:19 -0800


suxiaogang223 commented on PR #59487:
URL: https://github.com/apache/doris/pull/59487#issuecomment-3747571607


   I've reviewed the current implementation against the [official Apache 
Iceberg Spark procedures 
documentation](https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_manifests),
 and I'd like to suggest some changes to better align with the upstream 
specification.
   
   ---
   
   ## 1. Parameter Design
   
   ### Spark Official Implementation
   
   | Argument Name | Required? | Type | Description |
   |--------------|:----------:|:------|:-------------|
   | `table` | ✔️ | string | Name of the table to update |
   | `use_caching` | ☐ | boolean | Use Spark caching during operation (defaults 
to false). Enabling caching can increase memory footprint on executors. |
   | `spec_id` | ☐ | int | Spec id of the manifests to rewrite (defaults to 
current spec id) |
   
   ### Current Doris Implementation
   
   | Parameter | Type | Description |
   |-----------|:------|:-------------|
   | `cluster-by-partition` | boolean | Cluster manifests by partition fields |
   | `rewrite-all` | boolean | Rewrite all manifests when true |
   | `min-manifest-size-bytes` | long | Minimum manifest file size threshold |
   | `max-manifest-size-bytes` | long | Maximum manifest file size threshold |
   | `scan-thread-pool-size` | int | Thread pool size for parallel scanning |
   
   ### Issue & Recommendation
   
   **The current implementation is over-engineered compared to the official 
specification.**
   
   I believe we should **simplify the implementation** to only support the 
basic `rewrite_manifests` functionality. The extensive filtering parameters for 
manifest files are **unnecessary** because:
   
   1. **Manifest files are typically very small** - They contain metadata 
references to data files, not the actual data itself. Most manifest files range 
from a few KB to a few MB.
   
   2. **The primary purpose of `rewrite_manifests` is to merge ALL small 
manifest files** into larger ones for better scan planning. This is a metadata 
optimization operation that should process all manifests by default.
   
   3. **Over-filtering defeats the purpose** - The operation is meant to be 
comprehensive, not selective. If we only rewrite some manifests based on size 
thresholds, we miss the optimization benefits.
   
   ### Proposed Changes
   
   I recommend **removing** the following parameters:
   
   - ❌ `cluster-by-partition` (not in Spark spec - clustering is automatic 
based on partition spec)
   - ❌ `rewrite-all` (not in Spark spec - all manifests should be rewritten by 
default)
   - ❌ `min-manifest-size-bytes` (not in Spark spec - unnecessary complexity)
   - ❌ `max-manifest-size-bytes` (not in Spark spec - unnecessary complexity)
   - ❌ `scan-thread-pool-size` (not in Spark spec - engine-specific 
optimization)
   
   ### Recommended Parameter Set
   
   | Parameter | Required? | Type | Description |
   |-----------|:----------:|:------|:-------------|
   | `table` | ✔️ | string | Name of the table to update (inferred from ALTER 
TABLE) |
   | `spec_id` | ☐ | int | Spec id of the manifests to rewrite (defaults to 
current spec id) |
   
   **Note:** We could add `spec_id` as an optional parameter if needed for 
advanced use cases, but the core functionality should remain simple.
   
   ---
   
   ## 2. Return Value Schema
   
   ### Spark Official Output
   
   | Output Name | Type | Description |
   |-------------|:------|:-------------|
   | `rewritten_manifests_count` | int | Number of manifests which were 
re-written by this command |
   | `added_manifests_count` | int | **Number of new manifest files which were 
written by this command** |
   
   ### Current Doris Output
   
   | Column Name | Type | Description |
   |-------------|:------|:-------------|
   | `rewritten_manifests_count` | int | Number of data manifests rewritten by 
this command |
   | `total_data_manifests_count` | int | **Total number of data manifests 
BEFORE rewrite** |
   
   ### Issue
   
   The second output column is **inconsistent** with the official 
specification. The Spark version returns `added_manifests_count` (the number of 
**new** manifests created after the rewrite), not the total count before the 
operation.
   
   ### Recommended Change
   
   ```java
   // Current implementation
   return Lists.newArrayList(
       new Column("rewritten_manifests_count", Type.INT, false,
           "Number of data manifests rewritten by this command"),
       new Column("total_data_manifests_count", Type.INT, false,  // ❌ Incorrect
           "Total number of data manifests before rewrite")
   );
   
   // Recommended change - align with Spark spec
   return Lists.newArrayList(
       new Column("rewritten_manifests_count", Type.INT, false,
           "Number of manifests which were re-written by this command"),
       new Column("added_manifests_count", Type.INT, false,  // ✅ Aligned with 
spec
           "Number of new manifest files which were written by this command")
   );
   ```
   
   **Note:** The `added_manifests_count` should reflect the actual number of 
new manifest files created by the rewrite operation, which is typically smaller 
than the original count.
   
   ---
   
   ## 3. Summary of Recommended Changes
   
   | Aspect | Current | Recommended | Reason |
   |--------|---------|-------------|--------|
   | **Parameters** | 5 custom params | 0-1 params (optional `spec_id`) | Align 
with Spark spec; manifest files are small and don't need filtering |
   | **Return Column 2** | `total_data_manifests_count` | 
`added_manifests_count` | Match official output schema |
   | **Functionality** | Complex with filtering logic | Simple: merge all 
manifests | Core purpose is optimization through consolidation |
   | **Implementation** | Custom predicate-based selection | Direct Iceberg API 
usage | Leverage upstream `RewriteManifests` action |
   
   ---
   
   ## 4. Benefits of Alignment
   
   ✅ **Compatibility** - Users familiar with Spark/Iceberg will have a 
consistent experience across engines
   ✅ **Simplicity** - Easier to understand, test, and maintain
   ✅ **Performance** - No overhead from unnecessary filtering logic
   ✅ **Documentation** - Can reference official docs directly without 
translation
   ✅ **Future-proof** - Automatically benefits from upstream improvements
   
   ---
   
   ## 5. Additional Considerations
   
   ### 5.1 Engine-Specific Optimizations
   
   Parameters like `scan-thread-pool-size` are engine-specific and should be 
handled at the engine configuration level, not as procedure parameters. Doris 
can internally manage thread pools for optimal performance without exposing 
this to users.
   
   ### 5.2 Backward Compatibility
   
   If there are concerns about removing existing parameters, we could:
   1. Deprecate them with warnings in a minor release
   2. Remove them in a future major release
   3. Document that they are ignored (no-op) with a deprecation warning
   
   ### 5.3 Implementation Reference
   
   The recommended approach is to use Iceberg's `RewriteManifests` action 
directly:
   
   ```java
   RewriteManifests rm = table.rewriteManifests();
   if (specId != null) {
       rm.specId(specId);
   }
   rm.commit();
   ```
   
   This is simpler and more aligned with the official specification than the 
current predicate-based approach.
   
   ---
   
   ## 6. Example Usage (After Changes)
   
   ```sql
   -- Basic usage - rewrite all manifests for current spec
   CALL catalog_name.system.rewrite_manifests('db.table');
   
   -- Or via ALTER TABLE (Doris syntax)
   ALTER TABLE catalog_name.db.table EXECUTE rewrite_manifests();
   
   -- Optional: rewrite manifests for a specific spec
   ALTER TABLE catalog_name.db.table EXECUTE rewrite_manifests('spec_id' = '1');
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [Feature](iceberg) Implement rewrite_manifests procedure for Iceberg [doris]

Reply via email to