Thanks for raising this thread, it's really nice that a PMC member is taking the lead on improving Hadoop CI!
I can share some of the issues I've observed so far:

1. The current CI pipeline runs in serial; I recall that, even with good luck (no crash, no OOM), it can take over 28 hours to complete the entire pipeline.
2. Some jobs run twice: once on the trunk branch (the PR target branch) and once on the PR branch. I haven't looked into the underlying reasons, but I think we might be able to omit the trunk run.
3. Hadoop CI runs on Jenkins servers maintained by the ASF, and these machines sometimes have stability issues.

I tried to improve it, but I found that I first had to figure out how Yetus works; it is a tool mainly composed of shell and other scripting languages, which is quite a challenge for me. So I changed direction and explored whether Hadoop CI could be migrated to GitHub Actions (GHA). There is some good news, and also some challenges:

1. We should test Hadoop inside a container instead of a virtual machine, for a consistent installation of the native libraries. That means we should build the Dockerfile on-the-fly, then use the resulting image as the container for testing.
2. When GHA runs inside a container, there are some limitations[1], for example, the inability to change USER to a non-root user; this causes some HDFS tests, especially those related to permissions, to not work properly.
3. The standard GitHub-hosted runner[2] has 4 cores and 16 GB of RAM, which is not sufficient for some tests.

Given the current situation, I think we can do the following immediately:

1. Move some CI jobs, e.g., the native compile tests on Debian and Rocky, from Jenkins to GHA, and run them in parallel.
2. Investigate whether we can skip running CI on the trunk branch for PRs.

Additionally, I'm also investigating whether we can get rid of some native code/dependencies in the future, for example:

- In HADOOP-19839, I found that modern JDKs already provide fast enough built-in CRC32/CRC32C implementations; do we still need to maintain Hadoop's native CRC32/CRC32C in `libhadoop`?
- In HADOOP-19855, I'm investigating replacing the native zstd C bindings with the zstd-jni library, which is the de facto choice for the Zstandard compression algorithm in JVM applications.

[1] https://docs.github.com/en/actions/reference/workflows-and-actions/dockerfile-support#user
[2] https://docs.github.com/en/actions/reference/runners/github-hosted-runners

Thanks,
Cheng Pan

> On Apr 3, 2026, at 06:28, Aaron Fabbri <[email protected]> wrote:
>
> I'd like to put some effort into improving our CI run time and reliability
> but I need your help. I don't know how everything works, and there is too
> much work to do for one person.
>
> Join me in an informal "interest group" of folks that are interested in:
>
> - Reducing runtime of existing CI / branch tests.
> - Eliminating flaky tests.
> - Improving test coverage and tooling.
>
> Please reply to this thread if you are interested in helping, or if you
> have ideas for specific technical issues to address. We can use this JIRA
> to track related efforts:
>
> https://issues.apache.org/jira/browse/HADOOP-19820
>
> You can also tag me in the #hadoop channel on ASF Slack:
> https://the-asf.slack.com/archives/CDSDT7A0H
>
> (I'll volunteer to keep this mailing list updated on any interesting
> discussions there).
>
> Thanks!
> Aaron <[email protected]>
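
P.S. To illustrate the CRC32/CRC32C point from HADOOP-19839 above: since JDK 9 the standard library ships both `java.util.zip.CRC32` and `java.util.zip.CRC32C`, and both implement the common `Checksum` interface. A minimal sketch (the class name and helper are just for illustration; how fast the intrinsics actually are depends on the JVM version and hardware):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

public class CrcSketch {
    // Compute a checksum over a byte array with any java.util.zip.Checksum.
    static long checksum(Checksum c, byte[] data) {
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "hello hadoop".getBytes(StandardCharsets.UTF_8);
        // Both algorithms are drop-in Checksum implementations, so code paths
        // that currently dispatch to libhadoop could fall back to these.
        System.out.println("CRC32  = " + Long.toHexString(checksum(new CRC32(), data)));
        System.out.println("CRC32C = " + Long.toHexString(checksum(new CRC32C(), data)));
    }
}
```

On modern JVMs both are compiler intrinsics that use hardware CRC instructions where available, which is the basis for questioning whether the native implementations in `libhadoop` are still worth maintaining.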
