Hi everyone, I'd like to start a discussion about implementing public metrics tracking for our GitHub Actions workflows. Having previously worked with the Greenplum team, I saw firsthand how valuable build and test metrics can be for project health monitoring. When Greenplum moved from their original CI system to Concourse CI, we lost this capability. I believe now is a good time to reintroduce this kind of metrics tracking for Cloudberry, with an emphasis on making it publicly accessible to benefit our entire community.
As context, being part of the Apache Software Foundation means we need to be thoughtful about resource usage, leveraging free resources where possible. We currently use GitHub-hosted runners for our container execution, which comes with certain resource constraints. Understanding our resource utilization patterns could help us:

- Identify environment-related issues
- Optimize test execution within resource limits
- Detect product performance regressions
- Highlight test inefficiencies
- Make informed decisions about infrastructure needs

Proposed Benefits:

- Transparent view of project health for users and contributors
- Track test stability over time
- Identify problematic or flaky tests
- Monitor build performance trends
- Support data-driven decisions about test infrastructure
- Enable community members to investigate test failures
- Generate metrics for project health reporting
- Optimize resource usage within GitHub-hosted runner constraints

Data Collection Overview:

We propose tracking the following categories of information:

System & Environment Data:
- OS environments and versions
- Container images and versions
- Build configurations
- Resource metrics (memory, disk usage, execution time limits)
- GitHub runner resource constraints and utilization

Workflow-Level Metrics:
- Build timestamps and duration
- Overall workflow status
- Branch type (main vs. feature branches)
- Type of trigger (merge, PR, manual)
- Resource consumption patterns

Build Metrics:
- Build status and duration
- Artifact generation success
- Configuration details
- Resource utilization
- Memory and disk space usage
- Build timeouts or resource-related failures

Test Suite Metrics:
- Suite name and configuration
- Total/passed/failed/ignored test counts
- Test duration
- Categories of test failures
- System resource metrics during test runs
- Resource constraint impacts

What We Explicitly Won't Track:
- Individual committer names or IDs
- PR author information
- Blame/attribution data
- Individual developer metrics

The goal is to focus on systemic patterns and project health:
- Identify unstable test patterns
- Track performance trends
- Monitor resource utilization
- Detect infrastructure issues
- Support release quality metrics
- Optimize resource usage

This data would allow us to answer questions like:
- Which test suites have become less stable over time? (see the sample query sketch below)
- Do certain configurations consistently show problems?
- Are there patterns in test failures across different environments?
- How do infrastructure changes impact build performance?
- What are our most resource-intensive tests?
- Where are we hitting GitHub-hosted runner limits?
- Which tests are most affected by resource constraints?

Technical Implementation:
- Store metrics in a PostgreSQL database (via ASF Infra); a schema sketch follows this list
- Public read access through a web dashboard
- Metrics collection from GitHub-hosted runner workflows; a collection sketch follows this list
- Estimated storage needs: ~250MB initially, ~100MB annual growth
- Data retention: full history preserved
- Access: public read access, write access limited to GitHub Actions
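To make the storage side concrete, here is a minimal sketch of what the schema bootstrap could look like. Every table, column, and connection string here is a hypothetical placeholder, not a finalized design; the real schema would come out of this discussion and the ASF Infra request.

```python
# Minimal schema sketch (hypothetical): all table/column names are placeholders.
import psycopg2  # assumes the standard psycopg2 PostgreSQL driver

DDL = """
CREATE TABLE IF NOT EXISTS workflow_runs (
    run_id         BIGINT PRIMARY KEY,   -- GitHub Actions run id
    workflow       TEXT NOT NULL,
    branch_type    TEXT NOT NULL,        -- 'main' or 'feature'
    trigger_type   TEXT NOT NULL,        -- maps to merge / PR / manual triggers
    status         TEXT NOT NULL,
    started_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    duration_secs  INTEGER,
    disk_used_pct  INTEGER               -- runner disk utilization at end of job
);

-- Note: no committer/author columns, per the "won't track" list above.
CREATE TABLE IF NOT EXISTS test_suite_results (
    run_id         BIGINT REFERENCES workflow_runs(run_id),
    suite_name     TEXT NOT NULL,
    total          INTEGER NOT NULL,
    passed         INTEGER NOT NULL,
    failed         INTEGER NOT NULL,
    ignored        INTEGER NOT NULL,
    duration_secs  INTEGER,
    peak_mem_mb    INTEGER               -- resource metrics during the run
);
"""

def bootstrap(dsn: str) -> None:
    """Create the metrics tables if they do not already exist."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)

if __name__ == "__main__":
    bootstrap("dbname=ci_metrics")  # placeholder DSN; ASF Infra would provide the real one
```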
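On the collection side, a post-job step in the workflow could run something like the sketch below. The GITHUB_* environment variables are the standard ones GitHub Actions sets on every runner; the table and connection string are the same hypothetical placeholders as above, and the DSN would live in a repository secret so that write access stays limited to GitHub Actions.

```python
# Hypothetical post-job collection step: a sketch, not a finished collector.
import os
import shutil
import psycopg2

def record_run(dsn: str, status: str, duration_secs: int) -> None:
    """Record one workflow run, using the standard GITHUB_* env vars."""
    disk = shutil.disk_usage("/")  # runner disk utilization at the end of the job
    row = (
        int(os.environ["GITHUB_RUN_ID"]),
        os.environ["GITHUB_WORKFLOW"],
        "main" if os.environ.get("GITHUB_REF_NAME") == "main" else "feature",
        os.environ["GITHUB_EVENT_NAME"],  # 'push', 'pull_request', 'workflow_dispatch', ...
        status,
        duration_secs,
        disk.used * 100 // disk.total,    # percent of runner disk used
    )
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO workflow_runs "
                "(run_id, workflow, branch_type, trigger_type, status, "
                " duration_secs, disk_used_pct) "
                "VALUES (%s, %s, %s, %s, %s, %s, %s) "
                "ON CONFLICT (run_id) DO NOTHING",
                row,
            )
```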
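And to illustrate how the dashboard could answer the stability question from the list above, here is a sample query sketch against the same hypothetical schema:

```python
# Sample dashboard query sketch: weekly failure rate per suite (hypothetical schema).
import psycopg2

STABILITY_SQL = """
SELECT t.suite_name,
       date_trunc('week', w.started_at) AS week,
       round(100.0 * sum(t.failed) / nullif(sum(t.total), 0), 2) AS failure_pct
FROM test_suite_results t
JOIN workflow_runs w USING (run_id)
GROUP BY t.suite_name, week
ORDER BY t.suite_name, week;
"""

def weekly_failure_rates(dsn: str):
    """Return (suite, week, failure %) rows; a rising failure_pct flags a destabilizing suite."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(STABILITY_SQL)
            return cur.fetchall()
```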
Given our project's expertise with PostgreSQL, we're well-positioned to implement and maintain this system. We could also share our experience with other ASF projects interested in similar public metrics collection, particularly those also operating under resource constraints.

Questions for Discussion:

1. Would this kind of public metrics tracking be valuable to you and our user community?
2. What specific metrics would be most useful for users and contributors?
3. How would you envision the community using this data?
4. Any concerns about implementation, maintenance, or data visibility?
5. Ideas for making the metrics more accessible and useful to the community?
6. Suggestions for dashboard features that would benefit users and contributors?
7. What resource utilization metrics would be most helpful to track?

If there's support for this initiative, I'll submit an ASF Infra request for the required PostgreSQL database. For reference, here's our current GitHub Actions workflow:

https://github.com/apache/cloudberry/blob/main/.github/workflows/build-cloudberry.yml

Looking forward to your thoughts and suggestions on making our project metrics more transparent and accessible to everyone.

Best regards,
-=e
--
Ed Espino
Apache Cloudberry (incubating) & MADlib