I've been working on some CI improvements that should help with test
reliability and make debugging faster. Here's what's changed:

What Got Fixed
==============

1. Test Schedule Optimization
------------------------------
Tests were running out of disk space, particularly
autovacuum-template0-segment. That test consumes a lot of space through
WAL generation and XID consumption (~210 million XIDs). I moved it to
run early in the schedule, when ~20GB is still available, instead of
later, when only ~10GB remains. I also added comments in
greenplum_schedule explaining why it is positioned there.
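
If you want to see how much headroom is left before a space-hungry test
starts, something like the following could go into a CI step. This is
only a sketch: free_gb and check_free_gb are names I made up for
illustration and are not part of the ci-fix branch. It relies only on
POSIX "df -Pk" output, where the 4th column is the available space in KB.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the ci-fix branch): print the free space,
# in whole GB, on the filesystem that holds the given path.
free_gb() {
    local path="${1:-/}"
    # df -P prints one line per filesystem in POSIX format; -k uses 1K blocks.
    # Column 4 of the second line is the available space in KB.
    df -Pk "$path" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Hypothetical guard: warn and return non-zero when fewer than the wanted
# number of GB are free on the filesystem holding the path.
check_free_gb() {
    local want="$1" path="${2:-/}" have
    have="$(free_gb "$path")"
    if (( have < want )); then
        echo "WARNING: only ${have} GB free on ${path}, wanted ${want} GB" >&2
        return 1
    fi
    echo "OK: ${have} GB free on ${path}"
}
```

For example, "check_free_gb 20 /" before the heavy test would flag a
runner that has already dropped below the ~20GB mark.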

2. Rocky Linux Mirror Reliability (Two-Part Fix)
-------------------------------------------------
Those annoying 404 errors from Rocky Linux mirrors were causing random
CI failures. The fix came in two parts:

First attempt: refresh the repo metadata and add retry logic
(--setopt=retries=10) to handle transient mirror issues.

Still failing? It turns out the real issue was Rocky Linux 9.6 being too
new: mirrors hadn't fully synced the metadata yet. We now pin to the
stable Rocky Linux 9.x repos (--releasever=9) instead of bleeding-edge
9.6. This keeps us compatible with the 9.6 container while avoiding the
mirror sync lag.

Should see a lot fewer infrastructure-related flakes now.
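
For anyone who wants to reproduce the mirror fix by hand inside the
container, the two options above combine roughly like this. It is a
sketch, not a copy of the workflow: the dnf_pinned wrapper and the
DRY_RUN switch are made up for illustration, while --releasever,
--setopt=retries and --refresh are ordinary dnf options.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (name and DRY_RUN switch are illustrative only):
# run dnf pinned to the 9.x repos with the retry setting from this thread.
dnf_pinned() {
    local cmd=(dnf --releasever=9 --setopt=retries=10 "$@")
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
        # Print the command instead of running it, handy for checking a CI step.
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Typical use inside the container (commented out so sourcing this file
# has no side effects):
#   dnf_pinned --refresh makecache     # force a metadata refresh first
#   dnf_pinned install -y <package>    # then install as usual
```

Running it with DRY_RUN=1 just prints the command line, which is a cheap
way to confirm what a step would execute.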

3. Artifact Reuse Feature
-------------------------
While debugging the disk space issues, I realized we were wasting
~50-70 minutes rebuilding on every test iteration. I added a new workflow
input, reuse_artifacts_from_run_id, that reuses the build artifacts from
a previous run. Now you can iterate on test fixes in ~15-30 minutes
instead of over an hour.

Quick example:

  gh workflow run build-cloudberry.yml \
    --field reuse_artifacts_from_run_id=12345678901 \
    --field test_selection=ic-good-opt-off
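
The run ID is the numeric ID of an earlier run of the same workflow. One
way to look it up, assuming the gh CLI is authenticated against the repo
you are working in. The helper name recent_runs and the default branch
ci-fix are my own choices for illustration; the gh flags used are the
standard "gh run list" ones.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the most recent build-cloudberry.yml runs on a
# branch as tab-separated "run id, conclusion, creation time" lines.
recent_runs() {
    local branch="${1:-ci-fix}"
    gh run list \
        --workflow build-cloudberry.yml \
        --branch "$branch" \
        --limit 5 \
        --json databaseId,conclusion,createdAt \
        --jq '.[] | [.databaseId, .conclusion, .createdAt] | @tsv'
}
```

Pick the ID of a run whose build you want to reuse and pass it as
reuse_artifacts_from_run_id.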

4. Documentation
----------------
Created .github/workflows/README.md with step-by-step guides for:
- Using artifact reuse
- Running workflows in forks
- Manual workflow triggers
- Troubleshooting

Try It Out
==========

Changes are in the ci-fix branch:
- Fork: https://github.com/edespino/cloudberry/tree/ci-fix
- Compare with main:
https://github.com/apache/cloudberry/compare/main...edespino:cloudberry:ci-fix

I'm running several test iterations to validate that the mirror fixes
are reliable. If all goes well, I'll open a PR later today.

If you're debugging test failures or working on CI improvements, give
the artifact reuse feature a try. It's been pretty helpful for
iterating quickly.

Feedback welcome - especially if there are other workflow parameters or
features that would be useful.

Cheers,
-=e

On Mon, Oct 6, 2025 at 3:03 AM Leonid Borchuk <[email protected]> wrote:

> Great, is it reproducible? I mean, restart it two more times to be sure
> it's not an accident.
>
> On Mon, Oct 6, 2025 at 12:44 PM Ed Espino <[email protected]> wrote:
>
> > I just managed to get CI to pass. All I had to do was to move two suites
> > to the top of the greenplum schedule:
> > https://github.com/edespino/cloudberry/actions/runs/18274708119
> >
> >
> > -=e
> >
> > Ed Espino
> > 925.389.4640
> >
> >
> > On Mon, Oct 6, 2025 at 1:02 AM Leonid Borchuk <[email protected]>
> > wrote:
> >
> > > Hi, folks!
> > >
> > > I tried to investigate the issue too. Anyway, I can't finish the PR
> > > without tests.
> > >
> > > What I see:
> > >
> > > 1. There is no "evil" file or commit that broke the tests. What we
> > > stumbled upon is running out of space. I saw some generated coredump
> > > files in tests, but they also seemed to be a consequence of space
> > > exhaustion.
> > >
> > > df at the end of tests:
> > >
> > > Show disk usage info
> > > <https://github.com/open-gpdb/cloudberry/actions/runs/18248898786/job/51961002548#step:14:1015>
> > >
> > > Filesystem  Type     Size  Used  Avail  Use%  Mounted on
> > > overlay     overlay  73G   73G   99M    100%  /
> > > tmpfs       tmpfs    64M   0     64M    0%    /dev
> > > shm         tmpfs    2.0G  68K   2.0G   1%    /dev/shm
> > > /dev/root   ext4     73G   73G   99M    100%  /__w
> > > tmpfs       tmpfs    3.2G  9.2M  3.2G   1%    /run/docker.sock
> > >
> > > 2. The top space consumer is cloudberry (the debug output is in my repo:
> > > https://github.com/open-gpdb/cloudberry/actions/runs/18248898786/job/51961002548
> > > ).
> > > It looks like usage grew a little with each additional test until it
> > > finally reached the space limit.
> > >
> > > sudo du -c / | sort -n | tail -150 shows:
> > >
> > > 621320 /usr/lib64
> > > 655388 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1/pg_wal
> > > 655388 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/pg_wal
> > > 655392 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1/pg_wal
> > > 655396 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/pg_wal
> > > 663736 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/standby/base
> > > 664192 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir/demoDataDir-1/base
> > > 720932 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_wal
> > > 720936 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/pg_wal
> > > 856204 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/17018
> > > 856996 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/17018
> > > 904508 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/pg_distributedlog
> > > 910092 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_distributedlog
> > > 984940 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1/base/17018
> > > 985876 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1/base/17018
> > > 1049388 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/base/17018
> > > 1049444 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/base/17018
> > > 1225744 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base
> > > 1226736 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base
> > > 1277720 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/standby
> > > 1354452 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1/base
> > > 1355584 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1/base
> > > 1420728 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/base
> > > 1420796 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/base
> > > 1639176 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir/demoDataDir-1/pg_subtrans
> > > 1888592 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
> > > 1888596 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3
> > > 1893504 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
> > > 1893508 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3
> > > 1904984 /usr
> > > 2017300 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1
> > > 2017304 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2
> > > 2023864 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1
> > > 2023868 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2
> > > 2549268 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_subtrans
> > > 3049412 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
> > > 3049416 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir
> > > 3133184 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0
> > > 3133188 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1
> > > 5735508 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0
> > > 5735512 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1
> > > 21019248 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs
> > > 21019320 /__w/cloudberry/cloudberry/gpAux/gpdemo
> > > 21020984 /__w/cloudberry/cloudberry/gpAux
> > > 22435396 /__w/cloudberry/cloudberry
> > > 22435400 /__w/cloudberry
> > > 22454024 /__w
> > > 25110408 /
> > > 25110408 total
> > >
> > > Some observations from here:
> > >
> > > A. The total disk usage of one test job is ~25 GB
> > > B. pg_subtrans on segments is ~2.5 GB, more than the data size
> > > C. pg_distributedlog on segments is ~1 GB
> > > D. pg_wal on segments is ~700 MB
> > >
> > > So the back-of-the-envelope math here is:
> > >
> > > I. We have 3 heavy tests: ic-good-opt-on, ic-good-opt-off,
> > > ic-cbdb-parallel
> > > II. All tests are executed in parallel
> > > III. Each test has 3 segments and 3 mirrors
> > > IV. Total space needed for the tests: 3 (the number of parallel tests) x
> > > (pg_wal_size * 6 + pg_subtrans_size * 4 + pg_distributedlog * 2), which
> > > is 3 * (0.7 * 6 + 2.5 * 4 + 1 * 2) = 48.6 GB
> > >
> > > My thoughts and questions from here:
> > >
> > > 1. Could we set max-parallel in strategy to 2 (this will lengthen the
> > > tests)?
> > > 2. Could we set archive_command to /bin/true and not store WAL files?
> > > 3. How can we find out why pg_subtrans is so big? There should be a long
> > > transaction plus a lot of subtransactions (savepoints?), but 2.5 GB ...
> > >
> > > On Sat, Oct 4, 2025 at 9:54 PM Ed Espino <[email protected]> wrote:
> > >
> > > > I have an updated mechanism to free unused space in the test
> > > > environments. Unfortunately, this is not resolving the testing issues.
> > > > I will be attempting to isolate the issue to any recent code and CI
> > > > changes. Additionally after a conversation with Tushar, I will be
> > > > reaching out to the Apache Infrastructure team to identify necessary
> > > > steps to use larger CI resources (if possible).
> > > >
> > > > Stay tuned,
> > > > -=e
> > > >
> > > > On Fri, Oct 3, 2025 at 2:50 AM Dianjin Wang <[email protected]> wrote:
> > > >
> > > > > Cool, thanks Ed!
> > > > >
> > > > >
> > > > >
> > > > > Best,
> > > > > Dianjin Wang
> > > > >
> > > > >
> > > > > On Fri, Oct 3, 2025 at 5:08 PM Ed Espino <[email protected]> wrote:
> > > > >
> > > > > > I have determined that the test container is running out of disk
> > > > > > space and this is leading to the testing issues. I am trying to
> > > > > > determine if it is possible to clean up unused artifacts in the
> > > > > > test container prior to test execution.
> > > > > >
> > > > > > -=e
> > > > > >
> > > > > > On Thu, Oct 2, 2025 at 9:51 PM Ed Espino <[email protected]> wrote:
> > > > > >
> > > > > > > I'll take a look.
> > > > > > >
> > > > > > > -=e
> > > > > > >
> > > > > > > --
> > > > > > > Ed Espino
> > > > > > > Apache Cloudberry (Incubating) & MADlib
> > > > > > >
> > > > > > > On Thu, Oct 2, 2025 at 8:50 PM Dianjin Wang <[email protected]> wrote:
> > > > > > >
> > > > > > >> Hi,
> > > > > > >>
> > > > > > >> I’m wondering if the CI might be having issues. On my PR #1358,
> > > > > > >> the jobs `ic-good-opt-off`, `ic-good-opt-on`, and
> > > > > > >> `ic-cbdb-parallel` have been failing consistently, even after
> > > > > > >> multiple reruns.
> > > > > > >>
> > > > > > >> I also noticed similar failures happening on other PRs. Could
> > > > > > >> someone help check if the CI is currently down or unstable?
> > > > > > >>
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Dianjin Wang
> > > > > > >>
> > > > > > >>
> > > > > > >> ---------------------------------------------------------------------
> > > > > > >> To unsubscribe, e-mail: [email protected]
> > > > > > >> For additional commands, e-mail: [email protected]
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Ed Espino
> > > > Apache Cloudberry (Incubating) & MADlib
> > > >
> > >
> >
>


-- 
Ed Espino
Apache Cloudberry (Incubating) & MADlib
