zhuqi-lucas commented on issue #21973:
URL: https://github.com/apache/datafusion/issues/21973#issuecomment-4359777557

   ## Testing Strategy — No Regressions Allowed
   
   The most critical requirement for this architectural change: **zero test 
regressions** + comprehensive new coverage for all known production bugs.
   
   ### Existing test coverage that MUST pass
   
   1. **All existing `enforce_sorting` tests** 
(`datafusion/core/tests/physical_optimizer/enforce_sorting.rs`) — 160+ tests
   2. **All existing `enforce_distribution` tests** 
(`datafusion/core/tests/physical_optimizer/enforce_distribution.rs`) — 200+ 
tests
   3. **All sqllogictest tests** — the full SLT suite must pass unchanged
   4. **`SanityCheckPlan` tests** — all sanity checks must continue to pass
   
   ### New tests to add (from real production incidents)
   
   These are bugs we've hit in production that the merged rule must prevent:
   
   1. **Multi-partition sort + limit + SanityCheckPlan** (the bug that 
triggered this epic):
      - `OutputRequirementExec(SinglePartition)` → `StorageExec(N partitions)`
      - `EnforceSorting` must insert `SortPreservingMergeExec` between 
`SortExec(preserve_partitioning=true)` and `GlobalLimitExec`
      - Test with 2, 16, and 32 partitions
   
   2. **Sort pushdown through projection over multi-partition input**:
      - `OutputRequirementExec(SinglePartition)` → `ProjectionExec` → 
`RepartitionExec(10)`
      - Sort pushed through projection must include `SortPreservingMergeExec`
   
   3. **Sort pushdown with already-optimal plan (no extra SPM)**:
      - `OutputRequirementExec(SinglePartition)` → `SortPreservingMergeExec` → 
`SortExec(preserve=true)` → `RepartitionExec(10)`
      - Must NOT add duplicate `SortPreservingMergeExec`
   
   4. **InterleaveExec stability under multiple passes** (#21826):
      - `UnionExec` children with varying partitioning
      - Running optimizer twice must not panic 
`InterleaveExec::with_new_children`
   
   5. **Multiple EnforceDistribution preserves fetch** (#14150):
      - `SELECT * FROM t ORDER BY c LIMIT 5` with multi-partition input
      - Fetch value must survive multiple optimizer passes
   
   6. **Sort + aggregate on multi-partitioned table** (#18989):
      - `Sort → Aggregate → Sort → Aggregate` on multi-partitioned input
      - Must insert proper `RepartitionExec` between aggregates
   
   ### Idempotency tests (NEW — the core guarantee)
   
   ```rust
   /// Framework: run EnsureRequirements twice, assert same plan
   fn assert_idempotent(plan: Arc<dyn ExecutionPlan>) {
       let config = ConfigOptions::default();
       let p1 = EnsureRequirements::new().optimize(plan, &config).unwrap();
       let p2 = EnsureRequirements::new().optimize(p1.clone(), 
&config).unwrap();
       assert_eq!(
           displayable(p1.as_ref()).indent(true).to_string(),
           displayable(p2.as_ref()).indent(true).to_string(),
           "EnsureRequirements is not idempotent"
       );
       // Also verify SanityCheckPlan passes on both
       SanityCheckPlan::new().optimize(p1, &config).unwrap();
       SanityCheckPlan::new().optimize(p2, &config).unwrap();
   }
   ```
   
   Test topologies for idempotency:
   - Multi-partition sort + limit
   - Union with mixed partition counts
   - Projection over multi-partition source
   - Window functions over repartitioned data
   - Hash join with sort-merge fallback
   - Aggregate with grouping sets
   
   ### Test-first approach
   
   Each sub-task in this epic should follow:
   1. **Write the failing test first** (from production incident or known bug)
   2. Implement the fix
   3. Verify no existing tests regress
   4. Add idempotency test for the new topology
   
   This ensures the merged rule is strictly better — it passes everything the 
old rules passed, plus catches everything they missed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to