Thanks for raising this thread, it's really nice that a PMC member is taking the lead on improving Hadoop CI!
I can share some of the issues I've observed so far:

1. The current CI pipeline runs in serial; I recall that, even with good luck (no crash, no OOM), it can take over 28 hours to complete the entire pipeline.
2. Some jobs run twice: once on the trunk branch (the PR target branch) and once on the PR branch. I haven't looked into the underlying reasons, but I think we might be able to omit the trunk run.
3. Hadoop CI runs on Jenkins servers maintained by the ASF, and these machines sometimes have stability issues.

I tried to improve it, but I found that I first had to figure out how Yetus works; it is a tool mainly composed of shell and other scripting languages, which is quite a challenge for me. So I changed direction and explored whether Hadoop CI could be migrated to GitHub Actions (GHA). There is some good news, and also some challenges:

1. We should test Hadoop inside a container instead of a virtual machine, for a consistent installation of the native libraries. That means we should build the Dockerfile on-the-fly, then use the resulting image as the container for testing.
2. When GHA runs inside a container, there are some limitations[1], for example, the inability to change USER to a non-root user; this causes some HDFS tests, especially those related to permissions, to not work properly.
3. The standard GitHub-hosted runner[2] has 4 cores and 16 GB of RAM, which is not sufficient for some tests.

Given the current situation, I think we can do the following immediately:

1. Move some CI jobs, e.g., the native compile tests on Debian and Rocky, from Jenkins to GHA, and run them in parallel.
2. Investigate whether we can skip running CI on the trunk branch for PRs.

Additionally, I'm also investigating whether we can get rid of some native code/dependencies in the future, for example:

- In HADOOP-19839, I found that modern JDKs already provide fast enough built-in CRC32/CRC32C implementations; do we still need to maintain Hadoop's native CRC32/CRC32C in `libhadoop`?
- In HADOOP-19855, I'm investigating replacing the native zstd C bindings with the zstd-jni library, which is the de facto choice for the Zstandard compression algorithm in JVM applications.

[1] https://docs.github.com/en/actions/reference/workflows-and-actions/dockerfile-support#user
[2] https://docs.github.com/en/actions/reference/runners/github-hosted-runners

Thanks,
Cheng Pan

> On Apr 3, 2026, at 06:28, Aaron Fabbri <[email protected]> wrote:
>
> I'd like to put some effort into improving our CI run time and reliability
> but I need your help. I don't know how everything works, and there is too
> much work to do for one person.
>
> Join me in an informal "interest group" of folks that are interested in:
>
> - Reducing runtime of existing CI / branch tests.
> - Eliminating flaky tests.
> - Improving test coverage and tooling.
>
> Please reply to this thread if you are interested in helping, or if you
> have ideas for specific technical issues to address. We can use this JIRA
> to track related efforts:
>
> https://issues.apache.org/jira/browse/HADOOP-19820
>
> You can also tag me in the #hadoop channel on ASF Slack:
> https://the-asf.slack.com/archives/CDSDT7A0H
>
> (I'll volunteer to keep this mailing list updated on any interesting
> discussions there).
>
> Thanks!
> Aaron <[email protected]>
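
P.S. To illustrate the CRC32/CRC32C point from HADOOP-19839 above: since JDK 9 the standard library ships both `java.util.zip.CRC32` and `java.util.zip.CRC32C`, and both implement the common `Checksum` interface. A minimal sketch (the class name and helper are just for illustration; how fast the intrinsics actually are depends on the JVM version and hardware):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

public class CrcSketch {
    // Compute a checksum over a byte array with any java.util.zip.Checksum.
    static long checksum(Checksum c, byte[] data) {
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "hello hadoop".getBytes(StandardCharsets.UTF_8);
        // Both algorithms are drop-in Checksum implementations, so code paths
        // that currently dispatch to libhadoop could fall back to these.
        System.out.println("CRC32  = " + Long.toHexString(checksum(new CRC32(), data)));
        System.out.println("CRC32C = " + Long.toHexString(checksum(new CRC32C(), data)));
    }
}
```

On modern JVMs both are compiler intrinsics that use hardware CRC instructions where available, which is the basis for questioning whether the native implementations in `libhadoop` are still worth maintaining.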
