Re:Re: Large PR Merge Strategy – Recommended Practices for Apache Cloudberry (Incubating)

jiaqi.zhou Fri, 11 Apr 2025 03:40:25 -0700

Hi ed,




I think you may have missed my point, so let me show you what happens if we follow the proposed workflow:




1. git checkout main

2. git checkout -b test-commit200

3. (branch test-commit200) bash 200commit.sh

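``` 200commit.sh
#!/usr/bin/env bash
# Create 200 small commits, each appending one line to test-file.txt.
# The {1..200} brace expansion needs bash; a plain POSIX sh may not expand it.
for i in {1..200}; do
  echo "Commit $i" >> test-file.txt
  git add test-file.txt
  git commit -m "Commit $i"
done
```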

4. git push origin test-commit200

5. git checkout main

6. (branch main) git merge origin/test-commit200 --no-ff

7. (branch main) git log

```
commit 49880256dce3c17789d31b3d173686d579e99e2e (HEAD -> main)
Merge: 49d49b87eee 3a48c1a66a7
Author: zhoujiaqi <zhouji...@hashdata.cn>
Date:   Fri Apr 11 17:54:36 2025 +0800

    Merge remote-tracking branch 'origin/test-commit200' into main

commit 3a48c1a66a7931f50e0200daf2b5fb5e69df9ec9 (origin/test-commit200, test-commit200)
```




The "Merge <any branch> into main" commit is a merge commit, so the resulting history is not clean and linear. Also, I don't think it is a good idea to push directly to main.
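
To make this concrete, here is a minimal sketch that can be run in a throwaway repository. The scratch directory, the branch names (`feature`, `main-noff`, `main-ff`), and the demo identity are placeholders made up for illustration. It counts the merge commits left behind by `--no-ff` and, purely for contrast, by `--ff-only` (which is not part of the proposed workflow).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch repository; nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.name "Demo"
git config user.email "demo@example.invalid"
git commit -q --allow-empty -m "base"

# Feature branch with three commits, already sitting on top of main.
git checkout -q -b feature
for i in 1 2 3; do
  git commit -q --allow-empty -m "Commit $i"
done

# Case 1: --no-ff records a merge commit even though a fast-forward is possible.
git checkout -q -b main-noff main
git merge -q --no-ff -m "Merge branch 'feature'" feature
merges_noff=$(git rev-list --merges --count main-noff)

# Case 2 (contrast only): --ff-only just moves the branch pointer forward.
git checkout -q -b main-ff main
git merge -q --ff-only feature
merges_ff=$(git rev-list --merges --count main-ff)

echo "merge commits with --no-ff:   $merges_noff"
echo "merge commits with --ff-only: $merges_ff"
```

The first count should be 1 and the second 0, so a rebased branch merged with `--no-ff` still leaves a merge commit on main.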




I can accept two ways to merge into main:

1. Cherry-pick + push to main (pushed by an admin)

  - git checkout main

  - git cherry-pick {PAX first commit}^..{PAX last commit}

  - git push origin main  // This step should be done by a designated person, not me (I am just a developer)




2. Split PRs

  - This is what I am doing now. The reasons have already been outlined in my earlier emails, and I prefer this approach.
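
For option 1, a minimal sketch of the cherry-pick range in a throwaway repository follows. The scratch directory, file names, and the three-commit `feature` branch standing in for the PAX commits are all made up for illustration; only the `first^..last` range form is the point.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch repository standing in for the real one.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.name "Demo"
git config user.email "demo@example.invalid"
echo base > base.txt
git add base.txt
git commit -q -m "base"

# Feature branch with three commits (a stand-in for the PAX series).
git checkout -q -b feature
for i in 1 2 3; do
  echo "Commit $i" >> feature-file.txt
  git add feature-file.txt
  git commit -q -m "Commit $i"
done
first=$(git rev-parse feature~2)   # oldest commit of the series
last=$(git rev-parse feature)      # newest commit of the series

# main moves ahead in the meantime, so a fast-forward is no longer possible.
git checkout -q main
echo other > other.txt
git add other.txt
git commit -q -m "unrelated change on main"

# Replay the whole series; "first^..last" includes the first commit itself.
git cherry-pick "$first^..$last" >/dev/null

merges=$(git rev-list --merges --count main)
total=$(git rev-list --count main)
echo "merge commits on main: $merges, total commits on main: $total"
```

main ends up with the three commits replayed on top and no merge commit. Note that cherry-picked commits get new hashes, so they no longer match the hashes on the feature branch.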




Thanks
Jiaqi

At 2025-04-11 00:11:18, "Ed Espino" <esp...@apache.org> wrote:
>Hi Jiaqi,
>
>Thanks again for explaining the reasoning behind splitting the PAX PR.
>Your concerns about "merge main" are well-taken — it introduces
>non-linear history, complicates git bisect, and can lead to downstream
>integration issues. It's clear that the decision to split was made
>carefully under release pressure, and I appreciate the open dialogue
>around this.
>
>Looking forward, I’d like to propose a Git CLI-based workflow that can
>help us avoid splitting large PRs in the future — even when commit
>count exceeds GitHub’s "Rebase and Merge" limit in the UI.
>
>This approach allows us to:
>- Preserve full commit history (no squash)
>- Avoid splitting logically complete work
>- Maintain linear history for bisectability and readability
>- Cleanly integrate with downstream forks if needed
>
>Proposed Workflow:
>------------------
>
>    # 1. Rebase the feature branch onto the latest main
>    git checkout feature/your-branch
>    git fetch origin
>    git rebase origin/main
>
>    # 2. Push the rebased feature branch
>    git push --force-with-lease
>
>    # 3. After PR approval, ensure main is still current
>    git fetch origin
>    git checkout main
>    git pull origin main
>
>    # 4. If main has progressed, rebase the feature branch again
>    git checkout feature/your-branch
>    git rebase origin/main
>    git push --force-with-lease
>
>    # 5. Merge the rebased branch into main using CLI
>    git checkout main
>    git merge feature/your-branch --no-ff
>    git push origin main
>
>This process:
>- Avoids GitHub UI merge limitations
>- Keeps the commit graph clean and linear
>- Ensures CI validation is accurate and relevant
>- Preserves the full contribution context
>
>Next Steps:
>-----------
>
>If this general approach makes sense, I’d be happy to help document it
>in our committer or contributor guidelines. I’d also love to hear from
>others — especially those maintaining downstream forks or submitting
>larger features.
>
>Thanks again Jiaqi for leading the PAX work and for raising the
>trade-offs so thoughtfully. It’s through these conversations that we
>build a better process together.
>
>Best,
>-=e
>
>
>On Thu, Apr 10, 2025 at 7:51 AM jiaqi.zhou <jiaqi...@163.com> wrote:
>
>> Hi all,
>>
>>
>>
>>
>> My colleagues and I have internally discussed the option of using a "merge
>> main" approach to bypass the "100+ commit rebase and merge problem".
>>
>>
>>
>>
>> Why not "merge main"?
>>
>> - Non-linear History: Merging main would create a non-linear commit graph.
>>
>> - Impact on Git Bisect: This could complicate debugging workflows like git
>> bisect.
>>
>> - Downstream Compatibility: Projects forked from CloudBerryDB with
>> divergent codebases might face integration challenges.
>>
>>
>>
>>
>> Why choose to split the PR?
>>
>> PAX has had CI and code review internally since the project was launched,
>> and every commit is complete (that is why we did not choose to squash).
>> And after the split PRs are merged, the commits are linear.
>>
>>
>>
>>
>> With the CBDB release approaching, please let us discuss this topic as
>> soon as possible.
>>
>> Thanks
>> Jiaqi
>>
>>
>> At 2025-04-10 22:01:09, "Ed Espino" <esp...@apache.org> wrote:
>> >Hi all,
>> >
>> >I’d like to raise a contribution workflow concern we're currently
>> >encountering in Apache Cloudberry (Incubating), and propose that we
>> >establish a preferred approach for handling similar situations going
>> >forward.
>> >
>> >Contributor *@jiaqizho* submitted a significant pull request:
>> >*#1002 – Feature: introduce a high-performance hybrid row-columnar storage
>> >engine <https://github.com/apache/cloudberry/pull/1002>*
>> >
>> >The PR contains *300+ commits* and has successfully passed CI. However,
>> >due to the number of commits, GitHub's *“Rebase and Merge”* option is
>> >disabled, a known limitation when the PR size exceeds certain internal
>> >thresholds. As a result, the PR cannot be merged via the web UI, even by
>> >committers with full permissions.
>> >
>> >In response, the contributor has now *split the PR into four smaller PRs*
>> >in an attempt to work around the UI limitation and proceed with merging.
>> >------------------------------
>> >Why This May Not Be Ideal
>> >
>> >While the effort is appreciated, splitting the PR introduces several
>> >drawbacks:
>> >
>> >   - *Review context becomes fragmented* across multiple PRs.
>> >   - *Merge complexity increases*, especially when changes are
>> >     interdependent.
>> >   - *Contributor and reviewer effort multiplies*, with more overhead and
>> >     duplicated CI runs.
>> >   - *It sends a mixed message* to future contributors that PR splitting is
>> >     preferred in these cases, which isn’t necessarily true.
>> >
>> >------------------------------
>> >What Other ASF Projects Do
>> >
>> >Several other Apache projects handle large PRs by relying on *Git CLI-based
>> >merges*, rather than splitting:
>> >
>> >   - *Apache Arrow*: Encourages local rebases and merges for large
>> >     contributions.
>> >   - *Apache Spark*: Merges and squashes are typically done via CLI;
>> >     splitting is discouraged unless changes are logically separable.
>> >   - *Apache Kafka*: Maintainers use merge scripts
>> >     <https://cwiki.apache.org/confluence/display/KAFKA/Pull+Request+Workflow>
>> >     to handle large PRs manually.
>> >   - *Apache Flink* and *Apache Beam*: Default to local CLI workflows to
>> >     maintain history and bypass UI restrictions.
>> >
>> >This keeps reviews cohesive and simplifies the overall process for
>> >contributors and committers alike.
>> >------------------------------
>> >✅ Recommended Best Practice for Apache Cloudberry
>> >
>> >To align with ASF norms and improve maintainability, I propose:
>> >
>> >   1. *Using Git CLI-based merges* as the standard method for large PRs
>> >      (e.g., 100+ commits or more).
>> >   2. *Discouraging contributors from splitting PRs* to work around UI
>> >      limitations, unless explicitly requested by reviewers for clarity or
>> >      modularity.
>> >   3. *Documenting this workflow* in our committer guidelines to ensure
>> >      consistency.
>> >
>> >------------------------------
>> > Verified CLI Merge Workflow for Large PRs
>> >
>> ># 1. Fetch the PR branch directly from GitHub
>> >git fetch origin pull/1002/head:pax-merge
>> >
>> ># 2. Optionally rebase for a linear history
>> >git checkout pax-merge
>> >git rebase origin/main
>> >
>> ># 3. Merge into main
>> >git checkout main
>> >git pull origin main
>> >git merge pax-merge --no-ff
>> >
>> ># 4. Push the result to the repository
>> >git push origin main
>> >
>> ># (Optional) Clean up
>> >git branch -d pax-merge
>> >
>> >This approach avoids GitHub’s UI merge limitations, preserves commit
>> >history, and maintains a better experience for both contributors and
>> >reviewers.
>> >------------------------------
>> >
>> >Would love to hear thoughts from the community. If there's agreement, we
>> >should add contributing and committer workflows to our newly enabled wiki.
>> >
>> >Best regards,
>> >-=e
>> >Ed Espino
>> >Apache Cloudberry (Incubating) & MADlib
>>
>
>
>-- 
>Ed Espino
>Apache Cloudberry (Incubating) & MADlib
