Hi everyone,

I'd like to start a discussion about implementing public metrics tracking
for our GitHub Actions workflows. Having previously worked with the
Greenplum team, I saw firsthand how valuable build and test metrics can be
for project health monitoring. When Greenplum moved from their original CI
system to Concourse CI, we lost this capability. I believe now is a good
time to reintroduce this kind of metrics tracking for Cloudberry, with an
emphasis on making it publicly accessible to benefit our entire community.

As context, being part of the Apache Software Foundation means we need to
be thoughtful about resource usage, leveraging free resources where
possible. We currently use GitHub-hosted runners for our container
execution, and these runners come with certain resource constraints.
Understanding our resource utilization patterns could help us:

   - Identify environment-related issues
   - Optimize test execution within resource limits
   - Detect product performance regressions
   - Highlight test inefficiencies
   - Make informed decisions about infrastructure needs

Proposed Benefits:

   - Transparent view of project health for users and contributors
   - Track test stability over time
   - Identify problematic or flaky tests
   - Monitor build performance trends
   - Support data-driven decisions about test infrastructure
   - Enable community members to investigate test failures
   - Generate metrics for project health reporting
   - Optimize resource usage within GitHub-hosted runner constraints

Data Collection Overview:

We propose tracking the following categories of information:

System & Environment Data:

   - OS environments and versions
   - Container images and versions
   - Build configurations
   - Resource metrics (memory, disk usage, execution time limits)
   - GitHub runner resource constraints and utilization
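
As a concrete sketch, most of this environment data can be captured from
inside a workflow step with a few lines of Python. The GITHUB_* and
RUNNER_* variables are standard GitHub Actions environment variables; the
record layout and the BUILD_CONTAINER_IMAGE variable are hypothetical
placeholders for discussion:

    import json
    import os
    import platform
    import shutil

    def collect_environment() -> dict:
        """Capture OS, runner, and disk details for one workflow run."""
        disk = shutil.disk_usage("/")
        return {
            "workflow_run_id": os.environ.get("GITHUB_RUN_ID"),
            "workflow_name": os.environ.get("GITHUB_WORKFLOW"),
            "runner_os": os.environ.get("RUNNER_OS"),
            "os_release": platform.platform(),
            # Hypothetical variable our workflow would need to export:
            "container_image": os.environ.get("BUILD_CONTAINER_IMAGE"),
            "disk_total_gb": round(disk.total / 1e9, 1),
            "disk_free_gb": round(disk.free / 1e9, 1),
        }

    if __name__ == "__main__":
        print(json.dumps(collect_environment(), indent=2))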

Workflow-Level Metrics:

   - Build timestamps and duration
   - Overall workflow status
   - Branch type (main vs feature branches)
   - Type of trigger (merge, PR, manual)
   - Resource consumption patterns
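
Most of these workflow-level fields are already exposed by the GitHub REST
API, so collection could be as simple as polling the public workflow-runs
endpoint. A rough sketch (the duration is an approximation, since
updated_at is not strictly the completion time):

    from datetime import datetime
    import requests

    API = "https://api.github.com/repos/apache/cloudberry/actions/runs"

    def fetch_recent_runs(per_page: int = 50) -> list[dict]:
        """Reduce recent workflow runs to the metrics we want to track."""
        resp = requests.get(API, params={"per_page": per_page}, timeout=30)
        resp.raise_for_status()
        runs = []
        for run in resp.json()["workflow_runs"]:
            started = datetime.fromisoformat(
                run["run_started_at"].replace("Z", "+00:00"))
            updated = datetime.fromisoformat(
                run["updated_at"].replace("Z", "+00:00"))
            runs.append({
                "run_id": run["id"],
                "event": run["event"],  # push, pull_request, ...
                "branch": run["head_branch"],
                "is_main": run["head_branch"] == "main",
                "status": run["conclusion"],  # success, failure, ...
                "duration_secs": (updated - started).total_seconds(),
            })
        return runs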

Build Metrics:

   - Build status and duration
   - Artifact generation success
   - Configuration details
   - Resource utilization
   - Memory and disk space usage
   - Build timeouts or resource-related failures
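
As a sketch of how a build step could report its own duration and peak
memory without extra tooling (Linux-only, via the stdlib resource module;
the make command below is only a placeholder for whatever the workflow
actually runs):

    import resource
    import subprocess
    import time

    def run_build(cmd: list[str]) -> dict:
        """Run the build; record duration and peak child-process memory."""
        start = time.monotonic()
        proc = subprocess.run(cmd)
        usage = resource.getrusage(resource.RUSAGE_CHILDREN)
        return {
            "status": "success" if proc.returncode == 0 else "failure",
            "duration_secs": round(time.monotonic() - start, 1),
            "peak_rss_mb": usage.ru_maxrss // 1024,  # KB on Linux
        }

    # Placeholder invocation:
    # metrics = run_build(["make", "-j4"])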

Test Suite Metrics:

   - Suite name and configuration
   - Total/passed/failed/ignored test counts
   - Test duration
   - Categories of test failures
   - System resource metrics during test runs
   - Resource constraint impacts
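
Assuming our regression suites emit pg_regress-style summary lines, these
counts could be scraped from existing logs rather than instrumenting the
tests themselves. A parser sketch; treat the exact summary formats as
assumptions to verify against our logs:

    import re

    # Two common pg_regress summary forms (assumed):
    ALL_PASSED = re.compile(r"All (\d+) tests passed")
    SOME_FAILED = re.compile(r"(\d+) of (\d+) tests failed")

    def parse_suite_summary(log_text: str) -> dict | None:
        """Extract total/passed/failed counts from a pg_regress summary."""
        m = ALL_PASSED.search(log_text)
        if m:
            total = int(m.group(1))
            return {"total": total, "passed": total, "failed": 0}
        m = SOME_FAILED.search(log_text)
        if m:
            failed, total = int(m.group(1)), int(m.group(2))
            return {"total": total, "passed": total - failed,
                    "failed": failed}
        return None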

What We Explicitly Won't Track:

   - Individual committer names or IDs
   - PR author information
   - Blame/attribution data
   - Individual developer metrics
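
To make the "won't track" list enforceable rather than aspirational, the
collector could allow-list fields instead of trying to strip identifying
ones, so author and actor data from the GitHub payload never reaches the
database. A minimal sketch:

    # Only these run-level fields ever reach the database; author, actor,
    # and committer fields from the GitHub payload are never copied.
    ALLOWED_FIELDS = {"id", "event", "head_branch", "status", "conclusion",
                      "run_started_at", "updated_at"}

    def sanitize(run: dict) -> dict:
        """Keep only allow-listed fields of a workflow-run payload."""
        return {k: v for k, v in run.items() if k in ALLOWED_FIELDS}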

The goal is to focus on systemic patterns and project health:

   - Identify unstable test patterns
   - Track performance trends
   - Monitor resource utilization
   - Detect infrastructure issues
   - Support release quality metrics
   - Optimize resource usage

This data would allow us to answer questions like:

   - Which test suites have become less stable over time?
   - Do certain configurations consistently show problems?
   - Are there patterns in test failures across different environments?
   - How do infrastructure changes impact build performance?
   - What are our most resource-intensive tests?
   - Where are we hitting GitHub-hosted runner limits?
   - Which tests are most affected by resource constraints?
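
Taking the first question as an example, a weekly per-suite failure rate
becomes a single SQL query once the data is in PostgreSQL. The
test_suite_runs table is hypothetical (see the schema sketch under
Technical Implementation below):

    import psycopg2

    STABILITY_SQL = """
        SELECT suite_name,
               date_trunc('week', run_started_at) AS week,
               avg((failed > 0)::int) AS failure_rate
        FROM test_suite_runs
        GROUP BY suite_name, week
        ORDER BY suite_name, week;
    """

    def weekly_failure_rates(dsn: str) -> list[tuple]:
        """Per-suite weekly failure rate; a rising rate flags instability."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(STABILITY_SQL)
            return cur.fetchall()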

Technical Implementation:

   - Store metrics in a PostgreSQL database (via ASF Infra)
   - Public read access through a web dashboard
   - Metrics collection from GitHub-hosted runner workflows
   - Estimated storage needs: ~250MB initially, ~100MB annual growth
   - Data retention: Full history preserved
   - Access: Public read access, write access limited to GitHub Actions
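
As a starting point for the ASF Infra request, here is a minimal schema
sketch. Table and column names are placeholders for discussion, not a
final design:

    import psycopg2

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS workflow_runs (
        run_id         bigint PRIMARY KEY,
        event          text,
        branch         text,
        is_main        boolean,
        status         text,
        run_started_at timestamptz,
        duration_secs  integer
    );
    CREATE TABLE IF NOT EXISTS test_suite_runs (
        run_id         bigint REFERENCES workflow_runs (run_id),
        suite_name     text,
        total          integer,
        passed         integer,
        failed         integer,
        ignored        integer,
        duration_secs  integer,
        run_started_at timestamptz
    );
    """

    def init_schema(dsn: str) -> None:
        """Create the metrics tables; the write path would be CI-only."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(SCHEMA)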

Given our project's expertise with PostgreSQL, we're well-positioned to
implement and maintain this system. We could also share our experience with
other ASF projects interested in similar public metrics collection,
particularly those also operating under resource constraints.

Questions for Discussion:

   1. Would this kind of public metrics tracking be valuable to you and our
   user community?
   2. What specific metrics would be most useful for users and contributors?
   3. How would you envision the community using this data?
   4. Any concerns about implementation, maintenance, or data visibility?
   5. Ideas for making the metrics more accessible and useful to the
   community?
   6. Suggestions for dashboard features that would benefit users and
   contributors?
   7. What resource utilization metrics would be most helpful to track?

If there's support for this initiative, I'll submit an ASF Infra request
for the required PostgreSQL database.

For reference, here's our current GitHub Actions workflow:
https://github.com/apache/cloudberry/blob/main/.github/workflows/build-cloudberry.yml

Looking forward to your thoughts and suggestions on making our project
metrics more transparent and accessible to everyone.

Best regards,
-=e
--
Ed Espino
Apache Cloudberry (incubating) & MADlib
