On Sat, Nov 30, 2024 at 11:07 PM Ed Espino <esp...@apache.org> wrote:
>
> Hi everyone,
>
> I'd like to start a discussion about implementing public metrics tracking
> for our GitHub Actions workflows. Having previously worked with the
> Greenplum team, I saw firsthand how valuable build and test metrics can be
> for project health monitoring. When Greenplum moved from their original CI
> system to Concourse CI, we lost this capability. I believe now is a good
> time to reintroduce this kind of metrics tracking for Cloudberry, with an
> emphasis on making it publicly accessible to benefit our entire community.
>
> As context, being part of the Apache Software Foundation means we need to
> be thoughtful about resource usage, leveraging free resources where
> possible. We currently use GitHub-hosted runners for our container
> execution, and those runners come with fixed limits on CPU, memory,
> disk, and job duration. Understanding our resource utilization
> patterns could help us:
>
>    - Identify environment-related issues
>    - Optimize test execution within resource limits
>    - Detect product performance regressions
>    - Highlight test inefficiencies
>    - Make informed decisions about infrastructure needs
>
> Proposed Benefits:
>
>    - Provide a transparent view of project health for users and contributors
>    - Track test stability over time
>    - Identify problematic or flaky tests
>    - Monitor build performance trends
>    - Support data-driven decisions about test infrastructure
>    - Enable community members to investigate test failures
>    - Generate metrics for project health reporting
>    - Optimize resource usage within GitHub-hosted runner constraints
>
> Data Collection Overview:
>
> We propose tracking the following categories of information (a rough
> sketch of a single record follows the lists below):
>
> System & Environment Data:
>
>    - OS environments and versions
>    - Container images and versions
>    - Build configurations
>    - Resource metrics (memory, disk usage, execution time limits)
>    - GitHub runner resource constraints and utilization
>
> Workflow-Level Metrics:
>
>    - Build timestamps and duration
>    - Overall workflow status
>    - Branch type (main vs feature branches)
>    - Type of trigger (merge, PR, manual)
>    - Resource consumption patterns
>
> Build Metrics:
>
>    - Build status and duration
>    - Artifact generation success
>    - Configuration details
>    - Resource utilization
>    - Memory and disk space usage
>    - Build timeouts or resource-related failures
>
> Test Suite Metrics:
>
>    - Suite name and configuration
>    - Total/passed/failed/ignored test counts
>    - Test duration
>    - Categories of test failures
>    - System resource metrics during test runs
>    - Resource constraint impacts
>
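> To make the categories above concrete, here is a rough sketch of what
> a single test-suite record might look like, written as a Python
> dataclass purely for illustration (every field name is a placeholder,
> not a final schema):
>
>     from dataclasses import dataclass
>     from datetime import datetime
>     from typing import Optional
>
>     @dataclass
>     class TestSuiteRun:
>         """One row per test-suite execution; all names are placeholders."""
>         workflow_run_id: int             # GitHub Actions run that produced it
>         suite_name: str                  # as reported by the test harness
>         os_image: str                    # container image and version
>         build_config: str                # configure flags / build type
>         started_at: datetime
>         duration_s: float
>         total: int                       # total/passed/failed/ignored counts
>         passed: int
>         failed: int
>         ignored: int
>         failure_category: Optional[str]  # e.g. "timeout", "oom", "assertion"
>         peak_memory_mb: Optional[float]  # resource metrics during the run
>         peak_disk_mb: Optional[float]    # disk usage high-water mark
>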
> What We Explicitly Won't Track:
>
>    - Individual committer names or IDs
>    - PR author information
>    - Blame/attribution data
>    - Individual developer metrics
>
> The goal is to focus on systemic patterns and project health:
>
>    - Identify unstable test patterns
>    - Track performance trends
>    - Monitor resource utilization
>    - Detect infrastructure issues
>    - Support release quality metrics
>    - Optimize resource usage
>
> This data would allow us to answer questions like the following (a
> sample stability query follows the list):
>
>    - Which test suites have become less stable over time?
>    - Do certain configurations consistently show problems?
>    - Are there patterns in test failures across different environments?
>    - How do infrastructure changes impact build performance?
>    - What are our most resource-intensive tests?
>    - Where are we hitting GitHub-hosted runner limits?
>    - Which tests are most affected by resource constraints?
>
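> As one illustration, assuming the data lands in a PostgreSQL table
> shaped like the record sketch above, the first question could be
> answered with a short script (the connection string, table, and
> column names are all placeholders):
>
>     import psycopg2
>
>     # Failure rate per suite per month, oldest first, so a rising
>     # rate is easy to spot; names match the record sketch above.
>     STABILITY_SQL = """
>         SELECT suite_name,
>                date_trunc('month', started_at) AS month,
>                avg(CASE WHEN failed > 0 THEN 1.0 ELSE 0.0 END)::float8
>                    AS failure_rate
>         FROM test_suite_run
>         GROUP BY suite_name, month
>         ORDER BY suite_name, month
>     """
>
>     with psycopg2.connect("dbname=cloudberry_ci_metrics") as conn:
>         with conn.cursor() as cur:
>             cur.execute(STABILITY_SQL)
>             for suite, month, rate in cur.fetchall():
>                 print(f"{suite} {month:%Y-%m}: {rate:.1%} of runs failed")
>
> Because reads are public, queries like this could back the dashboard
> directly or be run ad hoc by any community member.
>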
> Technical Implementation:
>
>    - Store metrics in a PostgreSQL database (via ASF Infra)
>    - Public read access through a web dashboard
>    - Metrics collection from GitHub-hosted runner workflows (sketched
>    below)
>    - Estimated storage needs: ~250MB initially, ~100MB annual growth
>    - Data retention: Full history preserved
>    - Access: Public read access, write access limited to GitHub Actions
>
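> On the write path, collection could be a small script run as a final
> step of the existing workflow. A minimal sketch, assuming a
> METRICS_DB_DSN repository secret and a workflow_run table (both
> hypothetical):
>
>     import os
>     import psycopg2
>
>     # Standard variables the GitHub Actions runner sets for every job.
>     run_id = int(os.environ["GITHUB_RUN_ID"])
>     workflow = os.environ["GITHUB_WORKFLOW"]
>     branch = os.environ["GITHUB_REF_NAME"]
>     event = os.environ["GITHUB_EVENT_NAME"]  # push, pull_request, ...
>
>     # The DSN comes from a repository secret, so only the workflow
>     # (not the public) can write, matching the access model above.
>     conn = psycopg2.connect(os.environ["METRICS_DB_DSN"])
>     with conn, conn.cursor() as cur:
>         cur.execute(
>             "INSERT INTO workflow_run (run_id, workflow, branch, event_name)"
>             " VALUES (%s, %s, %s, %s) ON CONFLICT (run_id) DO NOTHING",
>             (run_id, workflow, branch, event),
>         )
>     conn.close()
>
> A single INSERT per run keeps workflow overhead negligible, and the
> ON CONFLICT guard makes retried uploads idempotent.
>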
> Given our project's expertise with PostgreSQL, we're well-positioned to
> implement and maintain this system. We could also share our experience with
> other ASF projects interested in similar public metrics collection,
> particularly those also operating under resource constraints.
>
> Questions for Discussion:
>
>    1. Would this kind of public metrics tracking be valuable to you and our
>    user community?
>    2. What specific metrics would be most useful for users and contributors?
>    3. How would you envision the community using this data?
>    4. Any concerns about implementation, maintenance, or data visibility?
>    5. Ideas for making the metrics more accessible and useful to the
>    community?
>    6. Suggestions for dashboard features that would benefit users and
>    contributors?
>    7. What resource utilization metrics would be most helpful to track?
>
> If there's support for this initiative, I'll submit an ASF Infra request
> for the required PostgreSQL database.
>
> For reference, here's our current GitHub Actions workflow:
> https://github.com/apache/cloudberry/blob/main/.github/workflows/build-cloudberry.yml
>
> Looking forward to your thoughts and suggestions on making our project
> metrics more transparent and accessible to everyone.

This is a fantastic idea! That said, are there any hosted tools
(GitHub features, for example) that would let us skip maintaining a
custom PostgreSQL database?

Thanks,
Roman.
