Hi, folks!

I looked into this issue as well, since I can't finish my PR without
working tests.

What I see:

1. There is no "evil" file or commit that broke the tests. What we stumbled
upon is disk space exhaustion. I saw some coredump files generated during
the tests, but they also seem to be a consequence of running out of space.

df at the end of the tests (CI step "Show disk usage info"):

Filesystem  Type     Size  Used  Avail  Use%  Mounted on
overlay     overlay  73G   73G   99M    100%  /
tmpfs       tmpfs    64M   0     64M    0%    /dev
shm         tmpfs    2.0G  68K   2.0G   1%    /dev/shm
/dev/root   ext4     73G   73G   99M    100%  /__w
tmpfs       tmpfs    3.2G  9.2M  3.2G   1%    /run/docker.sock

2. The top space consumer is Cloudberry itself (the debug output is in this
run of mine:
https://github.com/open-gpdb/cloudberry/actions/runs/18248898786/job/51961002548).
It looks like disk usage grew a little with each additional test until it
finally hit the space limit.

sudo du -c / | sort -n | tail -150 shows:

621320 /usr/lib64
655388 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1/pg_wal
655388 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/pg_wal
655392 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1/pg_wal
655396 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/pg_wal
663736 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/standby/base
664192 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir/demoDataDir-1/base
720932 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_wal
720936 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/pg_wal
856204 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/17018
856996 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/17018
904508 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/pg_distributedlog
910092 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_distributedlog
984940 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1/base/17018
985876 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1/base/17018
1049388 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/base/17018
1049444 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/base/17018
1225744 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base
1226736 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base
1277720 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/standby
1354452 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1/base
1355584 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1/base
1420728 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/base
1420796 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/base
1639176 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir/demoDataDir-1/pg_subtrans
1888592 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
1888596 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3
1893504 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
1893508 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3
1904984 /usr
2017300 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1
2017304 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2
2023864 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1
2023868 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2
2549268 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_subtrans
3049412 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
3049416 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir
3133184 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0
3133188 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1
5735508 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0
5735512 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1
21019248 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs
21019320 /__w/cloudberry/cloudberry/gpAux/gpdemo
21020984 /__w/cloudberry/cloudberry/gpAux
22435396 /__w/cloudberry/cloudberry
22435400 /__w/cloudberry
22454024 /__w
25110408 /
25110408 total

Some observations from here:

A. The total disk usage of one test job is ~25 GB
B. pg_subtrans on segments is ~2.5 GB, which is more than the data itself
C. pg_distributedlog on segments is ~1 GB
D. pg_wal on segments is ~700 MB
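
To make observations B-D easier to re-check against a future run, here is a minimal Python sketch that picks the largest pg_wal, pg_subtrans, pg_distributedlog and base directories out of "size path" lines. The sample lines are copied from the du output above. The function name, the watched-directory set and the assumption that du reports 1 KB blocks are mine, not part of the CI setup.

```python
from collections import defaultdict

# A few "size path" lines copied from the du output in this mail.
# Assumption: du was run with its default 1 KB block size.
DU_LINES = """\
720932 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_wal
910092 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_distributedlog
2549268 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_subtrans
1639176 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir/demoDataDir-1/pg_subtrans
1420796 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/base
"""

# Hypothetical choice of directories worth tracking.
WATCHED = {"pg_wal", "pg_subtrans", "pg_distributedlog", "base"}


def largest_by_kind(du_text):
    """Return {directory name: largest size in GB} for the watched directories."""
    largest = defaultdict(float)
    for line in du_text.splitlines():
        if not line.strip():
            continue
        size_kb, path = line.split(maxsplit=1)
        # The last path component tells us which kind of directory this is.
        kind = path.rstrip("/").rsplit("/", 1)[-1]
        if kind in WATCHED:
            largest[kind] = max(largest[kind], int(size_kb) / 1e6)
    return dict(largest)


for kind, gb in sorted(largest_by_kind(DU_LINES).items()):
    print(f"{kind}: {gb:.2f} GB")
```

On these sample lines pg_subtrans comes out larger than base, which is the surprising part of observation B.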

So the back-of-the-envelope math is:

I. We have 3 heavy tests: ic-good-opt-on, ic-good-opt-off, ic-cbdb-parallel
II. All tests are executed in parallel
III. Each test has 3 segments and 3 mirrors
IV. Total space needed for the tests is 3 (the number of parallel tests) x
(pg_wal_size * 6 + pg_subtrans_size * 4 + pg_distributedlog_size * 2), which is
3 * (0.7 * 6 + 2.5 * 4 + 1 * 2) = 48.6 GB
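
The same estimate as a small Python sketch, so the number of parallel tests can be varied. The sizes and the multipliers 6, 4 and 2 are the rough figures from this mail. The function name is made up, and the estimate assumes the parallel jobs really do share one disk.

```python
# Rough per-directory sizes in GB, taken from observations B-D in this mail.
PG_WAL_GB = 0.7
PG_SUBTRANS_GB = 2.5
PG_DISTRIBUTEDLOG_GB = 1.0


def estimate_total_gb(parallel_tests):
    """Estimated space for `parallel_tests` heavy jobs (assumed to share a disk)."""
    # Multipliers as in point IV: 6 x pg_wal, 4 x pg_subtrans, 2 x pg_distributedlog.
    per_test = PG_WAL_GB * 6 + PG_SUBTRANS_GB * 4 + PG_DISTRIBUTEDLOG_GB * 2
    return parallel_tests * per_test


for n in (3, 2):
    print(f"{n} parallel tests: ~{estimate_total_gb(n):.1f} GB")
```

With 3 parallel tests this gives the 48.6 GB above. With 2 it drops to about 32.4 GB, still for these three directories only, not counting base or the rest of the image.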

My thoughts and questions from here:

1. Could we set max-parallel in the job strategy to 2 (this will lengthen
the test runs)?
2. Could we set archive_command to /bin/true so that WAL files are not stored?
3. How can we find out why pg_subtrans is so big? There would have to be a
long transaction plus a lot of subtransactions (savepoints?), but 2.5 GB ...

On Sat, Oct 4, 2025 at 9:54 PM Ed Espino <[email protected]> wrote:

> I have an updated mechanism to free unused space in the test environments.
> Unfortunately, this is not resolving the testing issues. I will be
> attempting to isolate the issue to any recent code and CI changes.
> Additionally after a conversation with Tushar, I will be reaching out to
> the Apache Infrastructure team to identify necessary steps to use larger CI
> resources (if possible).
>
> Stay tuned,
> -=e
>
> On Fri, Oct 3, 2025 at 2:50 AM Dianjin Wang <[email protected]> wrote:
>
> > Cool, thanks Ed!
> >
> >
> >
> > Best,
> > Dianjin Wang
> >
> >
> > On Fri, Oct 3, 2025 at 17:08, Ed Espino <[email protected]> wrote:
> >
> > > I have determined that the test container is running out of disk space
> > > and this is leading to the testing issues. I am trying to determine if
> > > it is possible to clean up unused artifacts in the test container prior
> > > to test execution.
> > >
> > > -=e
> > >
> > > On Thu, Oct 2, 2025 at 9:51 PM Ed Espino <[email protected]> wrote:
> > >
> > > > I'll take a look.
> > > >
> > > > -=e
> > > >
> > > > --
> > > > Ed Espino
> > > > Apache Cloudberry (Incubating) & MADlib
> > > >
> > > > On Thu, Oct 2, 2025 at 8:50 PM Dianjin Wang <[email protected]> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> I’m wondering if the CI might be having issues. On my PR #1358, the
> > > >> jobs `ic-good-opt-off`, `ic-good-opt-on`, and `ic-cbdb-parallel` have
> > > >> been failing consistently, even after multiple reruns.
> > > >>
> > > >> I also noticed similar failures happening on other PRs. Could someone
> > > >> help check if the CI is currently down or unstable?
> > > >>
> > > >>
> > > >> Best,
> > > >> Dianjin Wang
> > > >>
> > > >>
> > > >>
> > > >>
> > >
> >
>
>
> --
> Ed Espino
> Apache Cloudberry (Incubating) & MADlib
>
