Hi Dianjin and Max,

> 2.2 WAL-G (Backup & Restore)
> * Status: No active development.
> * Gap: Untested with PAX storage, risking backup/restore failures.
> * Limitation: Does not support incremental backups for PAX tables due
> to their unique metadata.
> * Action: Max will provide PAX documentation to help the team
> understand its mechanics for WAL-G integration.


I have prepared a PR in wal-g [1] that adds incremental backups for PAX storage.

I have used the following assumptions about PAX:
1. All files in the <id>_pax/ folder are completely immutable.
2. These files become visible in Cloudberry's heap-based metadata strictly
   after they have been properly fsync()-ed.
3. Cloudberry writes PAX-related events to the WAL, and during
   point-in-time recovery it replays changes from the WAL and repairs
   files that were created between pg_backup_start and pg_backup_stop.

I have added the following rules on top of this:
1. Only files visible in Cloudberry metadata are stored in the zone shared
   between incremental backups.
2. Only files visible in wal-g metadata (JSON) are considered uploaded to S3.

Together, these should produce reliable incremental backups in the presence
of concurrent writes and wal-g crashes.

Max, could you please verify my assumptions above?

Also, I wonder why VACUUM FULL on non-clustered PAX tables just copies the
data without consolidating small files, vacuuming deleted rows, or removing
visimaps.

[1] https://github.com/wal-g/wal-g/pull/2275


Best regards,
Nikolay


> Hi all,
> 
> Thanks for joining the first community meeting. Below is the meeting
> recap generated by AI and lightly edited by me for clarity. Please
> take it as a reference.
> 
> - Meeting Notes:
> https://docs.google.com/document/d/14NLYVvApvijsQDt7uCKblVPKhayJSxb6na9dMAp5NAM/edit?usp=sharing
> - Meeting recording:
> https://fathom.video/share/xRtnrNXVr1P_1X2kQZ96nKRaPDEWSGCc (I will
> upload the recording to ASF Cloudberry Youtube Channel later.)
> 
> ~~~~~
> 
> # Meeting Purpose
> 
> Kick off the first bi-weekly community meeting to align on progress
> and priorities.
> 
> # Key Takeaways
> 
> * PRs Blocked by Architectural Mismatch: Key PRs implementing
> Postgres-style features (e.g., parallel append) are stalled. They
> conflict with Cloudberry's MPP-style execution model, which requires
> pre-launching workers, unlike Postgres's dynamic approach.
> 
> * PXF Roadmap Defined: The PXF roadmap has three stages: 1) sync with
> upstream Greenplum PXF, 2) integrate with the latest kernel (e.g.,
> parallel foreign table scans), and 3) add pushdown capabilities
> (aggregation, join).
> 
> * New Extensions Proposed: Two new extensions were proposed:
> HooksCollector for performance monitoring and yezzey for S3 archiving
> of append-only tables to reduce storage costs.
> 
> * Release 2.1: Release 2.1 is code-complete but blocked on testing and
> documentation. The new binary swap feature is confirmed working,
> enabling zero-downtime upgrades.
> 
> # Topics
> 
> 1. Main Repo & PR Review
> 
> 1.1 Stalled PRs: A review of old, stalled PRs revealed a core
> architectural conflict.
> * Conflict: Postgres-style features (e.g., parallel append) rely on
> dynamic worker launching, which clashes with Cloudberry's MPP-style
> model of pre-launching workers before dispatching plans.
> * Action: Community reviews and feedback are encouraged to help find a 
> solution.
> 
> 1.2 Dianjin's PRs need more reviews
> 
> 2. Ecosystem Extensions
> 
> 2.1 PXF (Platform Extension Framework)
> * Status: Code synced with upstream Greenplum PXF; source cleanup is
> in progress.
> * Roadmap:
> - Sync: Catch up with the upstream Greenplum PXF branch.
> - Integrate: Leverage the latest kernel's capabilities (e.g.,
> parallel foreign table scans) via the pxf_fdw framework.
> - Pushdown: Add support for remote aggregation and join pushdown.
> - Blocker: Orca does not currently support foreign data wrappers
> (FDWs), which PXF uses. This must be addressed for full integration.
> 
> Warning: PXF's FDW implementation is not production-ready; VMware
> recommends it only in PXF 7.1.
> 
> 2.2 WAL-G (Backup & Restore)
> * Status: No active development.
> * Gap: Untested with PAX storage, risking backup/restore failures.
> * Limitation: Does not support incremental backups for PAX tables due
> to their unique metadata.
> * Action: Max will provide PAX documentation to help the team
> understand its mechanics for WAL-G integration.
> 
> 2.3 HooksCollector (Performance Monitoring)
> * Proposal: Open source the data-gathering component of Greenplum 6's
> Command Center.
> * Function: Collects query performance data via hooks and sends it
> externally via protobuf.
> * Goal: Attract community contributions and feedback.
> * Action: Dianjin will share the link for creating a formal proposal
> in GitHub Discussions.
> 
> 2.4 Yezzey (S3 Archiving)
> * Proposal: An extension to upload/download append-only table data to/from S3.
> * Rationale: To reduce storage costs by moving cold data to cheaper
> object storage.
> * Action: Leonid will post the idea to the dev mailing list for public
> discussion.
> 
> 2.5 Release & Governance
> Release 2.1:
> * Status: Code-complete on the Release 2 branch.
> * Blockers: Requires more testing and user-facing documentation for
> building from source.
> * Binary Swap: The new feature is confirmed working, enabling
> zero-downtime upgrades.
> * Release Manager: Ed volunteered but may be unavailable. Dianjin is the 
> backup.
> 
> 3. Incubation Report: Leonid and Dianjin will collaborate on drafting
> the report.
> 
> 4. Open Topics
> 
> * 2026 Roadmap: Dianjin shared a draft roadmap on the dev mailing list
> for feedback.
> * Lakehouse Support: Leonid proposed adding Lakehouse support, noting
> high community interest in Russia.
> * Russian Documentation: Leonid's team will translate documentation to
> Russian and propose hosting it on the official CloudBerry site to
> create a single source of truth.
> * TPC-DS Benchmarking:
> - Problem: Inconsistent TPC-DS test setups between teams yield
> non-comparable results, hindering effective performance tuning.
> - Proposed Solution: Integrate a TPC-DS benchmark tool directly into
> the database kernel (like DuckDB) for easy, standardized execution.
> 
> # Next Steps
> 
> - Leonid:
> * Post the yezzey S3 archiving proposal to the dev mailing list.
> * Post the Lakehouse support idea to the dev mailing list.
> * Collaborate with Dianjin on the incubation report.
> * Host the next community meeting.
> 
> - Dianjin:
> * Share the GitHub Discussions link for the HooksCollector proposal.
> * Confirm Ed's availability for the Release 2.1 manager role.
> * Share the 2026 roadmap draft on the dev mailing list.
> * Share the Shenzhen meetup materials (translated to English).
> 
> - Max:
> * Send PAX documentation to the team to aid WAL-G integration.
> 
> - All:
> * Review stalled PRs and provide feedback.
> * Discuss the TPC-DS benchmark standardization proposal on the dev
> mailing list.
> 
> Next Meeting:
> - Rescheduled to February 27th to accommodate the Chinese New Year holiday.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

