Hi, folks! I tried to investigate the issue too, since I can't finish my PR without the tests anyway.

What I see:

1. There is no "evil" file or commit that broke the tests. What we stumbled upon is running out of disk space. I saw some generated core dump files in the tests, but they also seem to be a consequence of space exhaustion. df at the end of the tests (step "Show disk usage info", https://github.com/open-gpdb/cloudberry/actions/runs/18248898786/job/51961002548#step:14:1015):

    Filesystem  Type     Size  Used  Avail  Use%  Mounted on
    overlay     overlay   73G   73G    99M  100%  /
    tmpfs       tmpfs     64M     0    64M    0%  /dev
    shm         tmpfs    2.0G   68K   2.0G    1%  /dev/shm
    /dev/root   ext4      73G   73G    99M  100%  /__w
    tmpfs       tmpfs    3.2G  9.2M   3.2G    1%  /run/docker.sock

2. The top space consumer is cloudberry (the debug output can be found in my run: https://github.com/open-gpdb/cloudberry/actions/runs/18248898786/job/51961002548). It looks like disk usage grew a little with each additional test and reached the space limit at the end.

sudo du -c / | sort -n | tail -150 shows (sizes in KB, tail of the output):

      621320 /usr/lib64
      655388 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1/pg_wal
      655388 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/pg_wal
      655392 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1/pg_wal
      655396 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/pg_wal
      663736 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/standby/base
      664192 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir/demoDataDir-1/base
      720932 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_wal
      720936 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/pg_wal
      856204 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/17018
      856996 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/17018
      904508 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/pg_distributedlog
      910092 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_distributedlog
      984940 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1/base/17018
      985876 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1/base/17018
     1049388 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/base/17018
     1049444 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/base/17018
     1225744 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base
     1226736 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base
     1277720 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/standby
     1354452 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1/base
     1355584 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1/base
     1420728 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/base
     1420796 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/base
     1639176 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir/demoDataDir-1/pg_subtrans
     1888592 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
     1888596 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror3
     1893504 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
     1893508 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast3
     1904984 /usr
     2017300 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2/demoDataDir1
     2017304 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror2
     2023864 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2/demoDataDir1
     2023868 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast2
     2549268 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_subtrans
     3049412 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
     3049416 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/qddir
     3133184 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0
     3133188 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast_mirror1
     5735508 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0
     5735512 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs/dbfast1
    21019248 /__w/cloudberry/cloudberry/gpAux/gpdemo/datadirs
    21019320 /__w/cloudberry/cloudberry/gpAux/gpdemo
    21020984 /__w/cloudberry/cloudberry/gpAux
    22435396 /__w/cloudberry/cloudberry
    22435400 /__w/cloudberry
    22454024 /__w
    25110408 /
    25110408 total

Some observations from here:

A. The total disk usage of one test job is ~25 GB.
B. pg_subtrans on segments is ~2.5 GB, which is more than the data size.
C. pg_distributedlog on segments is ~1 GB.
D. pg_wal on segments is ~700 MB.

So the envelope math here is:

I. We have 3 heavy tests: ic-good-opt-on, ic-good-opt-off, ic-cbdb-parallel.
II. All tests are executed in parallel.
III. Each test has 3 segments and 3 mirrors.
IV. Total space needed for the tests is 3 (the number of parallel tests) x (pg_wal_size * 6 + pg_subtrans_size * 4 + pg_distributedlog_size * 2), which is 3 * (0.7 * 6 + 2.5 * 4 + 1 * 2) = 48.6 GB.

My thoughts and questions from here:

1. Could we set max-parallel in strategy to 2? (This will lengthen the tests.)
2. Could we set archive_command to /bin/true and not store WAL files?
3. How can we understand why pg_subtrans is so big? There should be a long transaction plus a lot of subtransactions (savepoints?), but 2.5 GB ...

On Sat, Oct 4, 2025 at 9:54 PM Ed Espino <[email protected]> wrote:

> I have an updated mechanism to free unused space in the test environments.
> Unfortunately, this is not resolving the testing issues. I will be
> attempting to isolate the issue to any recent code and CI changes.
> Additionally after a conversation with Tushar, I will be reaching out to
> the Apache Infrastructure team to identify necessary steps to use larger CI
> resources (if possible).
>
> Stay tuned,
> -=e
>
> On Fri, Oct 3, 2025 at 2:50 AM Dianjin Wang <[email protected]> wrote:
>
> > Cool, thanks Ed!
> >
> > Best,
> > Dianjin Wang
> >
> > On Fri, Oct 3, 2025 at 17:08 Ed Espino <[email protected]> wrote:
> >
> > > I have determined that the test container is running out of disk space
> > > and this is leading to the testing issues. I am trying to determine if
> > > it is possible to clean up unused artifacts in the test container prior
> > > to test execution.
> > >
> > > -=e
> > >
> > > On Thu, Oct 2, 2025 at 9:51 PM Ed Espino <[email protected]> wrote:
> > >
> > > > I'll take a look.
> > > >
> > > > -=e
> > > >
> > > > --
> > > > Ed Espino
> > > > Apache Cloudberry (Incubating) & MADlib
> > > >
> > > > On Thu, Oct 2, 2025 at 8:50 PM Dianjin Wang <[email protected]> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> I'm wondering if the CI might be having issues. On my PR #1358, the
> > > >> jobs `ic-good-opt-off`, `ic-good-opt-on`, and `ic-cbdb-parallel` have
> > > >> been failing consistently, even after multiple reruns.
> > > >>
> > > >> I also noticed similar failures happening on other PRs. Could someone
> > > >> help check if the CI is currently down or unstable?
> > > >>
> > > >>
> > > >> Best,
> > > >> Dianjin Wang
> > > >>
> > > >> ---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: [email protected]
> > > >> For additional commands, e-mail: [email protected]
> > > >>
> > > >
> > >
>
> --
> Ed Espino
> Apache Cloudberry (Incubating) & MADlib
>
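
P.S. For anyone who wants to replay the envelope math from item IV with other numbers, here is a minimal Python sketch. The per-directory sizes are my rounded readings of the du output, the multipliers (6, 4, 2) are the ones from item IV, and the function names are made up for illustration. It is only a rough estimate under those assumptions, not a measurement.

```python
# Rough per-directory sizes in GB, rounded from the du output in this mail.
# These are assumptions for a back-of-the-envelope estimate only.
PG_WAL_GB = 0.7
PG_SUBTRANS_GB = 2.5
PG_DISTRIBUTEDLOG_GB = 1.0


def per_job_gb(wal_dirs=6, subtrans_dirs=4, distributedlog_dirs=2):
    """Disk estimate for one test job, using the multipliers from item IV."""
    return (PG_WAL_GB * wal_dirs
            + PG_SUBTRANS_GB * subtrans_dirs
            + PG_DISTRIBUTEDLOG_GB * distributedlog_dirs)


def total_gb(parallel_jobs):
    """Disk estimate when `parallel_jobs` heavy jobs run at the same time."""
    return parallel_jobs * per_job_gb()


if __name__ == "__main__":
    # 3 = current situation (all heavy jobs in parallel), 2 = question 1.
    for jobs in (3, 2):
        print(f"{jobs} parallel jobs: ~{total_gb(jobs):.1f} GB")
    # prints:
    # 3 parallel jobs: ~48.6 GB
    # 2 parallel jobs: ~32.4 GB
```

With the same assumptions, limiting the heavy jobs to 2 at a time (question 1) would bring the estimate down from ~48.6 GB to ~32.4 GB.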
