[Proposal] pg_stat_wal_records – per-record-type WAL generation statistics

SATYANARAYANA NARLAPURAM Thu, 26 Mar 2026 17:29:55 -0700

Hi Hackers,

I'd like to propose a new system view, pg_stat_wal_records, that exposes
per-resource-manager, per-record-type WAL generation counts.


*Sample Output:*
postgres=# SELECT * FROM pg_stat_wal_records ORDER BY count DESC LIMIT 10;
 resource_manager |  record_type   | count  |          stats_reset
------------------+----------------+--------+-------------------------------
 Heap             | INSERT         | 500000 | 2026-03-26 22:15:00.12345+00
 Transaction      | COMMIT         | 500000 |
 Btree            | INSERT_LEAF    |  53821 |
 Heap             | HOT_UPDATE     |  12744 |
 XLOG             | FPI            |   8923 |

*The Gap:*

PostgreSQL already has pg_stat_wal for aggregate WAL volume (bytes, full-page
images, buffers), and pg_walinspect (restricted by default to superusers and
members of pg_read_server_files) for post-hoc forensic analysis of individual
WAL segments. But I don't see a lightweight observability tool that answers,
in real time, which record types are responsible for the WAL being generated.
Additionally, pg_walinspect reads on-disk WAL files, which is expensive. This
view gives monitoring systems something they can poll cheaply.

*Use cases:*
WAL volume investigation: see which record types dominate WAL generation in
real time without touching disk.
Monitoring integration: Prometheus/Grafana can poll the view to track WAL
composition over time and alert on anomalies.
Replication tuning: identify whether WAL volume is dominated by data
changes, index maintenance, FPIs, or vacuum activity to guide tuning.
Extension debugging: custom WAL resource managers get visibility
automatically.

*Key design decisions*
*Counting mechanism:*
The counting mechanism is a single backend-local array increment in
XLogInsert():
pgstat_pending_wal_records[rmid][(info >> 4) & 0x0F]++;

This indexes into a uint64[256][16] array (32 KB per backend) using the
rmgr ID and the 4-bit record-type subfield of the WAL info byte. Counters
are flushed to shared memory via the standard pgstat infrastructure.
Per-backend pending array instead of direct shared-memory writes: the counter
is incremented in backend-local memory and flushed to shared memory by the
existing pgstat flush cycle, so I don't expect any contention in the hot path
(please see perf results below).
Fixed 256×16 matrix: all 256 possible rmgr IDs × 16 possible record types.
This accommodates core resource managers and any custom WAL resource managers
from extensions without configuration. The 32 KB per-backend cost is modest.
rm_identify() for human-readable names: the SRF calls each resource manager's
rm_identify callback to translate the info byte into a readable record type
name (for example INSERT, COMMIT, VACUUM, HOT_UPDATE).
Reset support: pg_stat_reset_shared('wal_records') resets the counters,
consistent with the existing pattern for wal, bgwriter, archiver, etc.
Zero-count entries: the view skips them, keeping the output clean.
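
As a rough illustration of the pending-then-flush pattern and of skipping
zero-count entries, here is a standalone sketch. It is an assumption-laden
model, not the patch: the Demo* types and demo_* functions are hypothetical,
the "shared" struct is ordinary process memory, and the locking that real
shared-memory stats code would need is only indicated by a comment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_RMGRS 256
#define DEMO_TYPES 16

/* Hypothetical container for one full set of counters. */
typedef struct DemoWalRecordCounts
{
    uint64_t counts[DEMO_RMGRS][DEMO_TYPES];
} DemoWalRecordCounts;

static DemoWalRecordCounts demo_local;   /* backend-local pending counts */
static DemoWalRecordCounts demo_shared;  /* stand-in for the shared totals */

/* Add the pending counts to the totals, then clear the pending counts. */
void
demo_flush_pending(DemoWalRecordCounts *pending, DemoWalRecordCounts *shared)
{
    assert(pending != NULL && shared != NULL);

    /* A real implementation would hold the stats entry lock here. */
    for (int rmid = 0; rmid < DEMO_RMGRS; rmid++)
        for (int type = 0; type < DEMO_TYPES; type++)
            shared->counts[rmid][type] += pending->counts[rmid][type];

    memset(pending, 0, sizeof(*pending));
}

/* Number of rows a view that skips zero-count entries would return. */
int
demo_visible_rows(const DemoWalRecordCounts *shared)
{
    int rows = 0;

    for (int rmid = 0; rmid < DEMO_RMGRS; rmid++)
        for (int type = 0; type < DEMO_TYPES; type++)
            if (shared->counts[rmid][type] != 0)
                rows++;

    return rows;
}
```

The hot path in this model touches only demo_local; the totals are written
once per flush, which is why no contention is expected on the increment.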

*Performance overhead*
Benchmarked with pgbench (scale 50, 16 clients, 16 threads, 30s,
synchronous_commit=off) on a 64-vCPU machine with data and WAL on NVMe:

 Configuration | Avg TPS
---------------+---------
 With patch    |  42,266
 Without patch |  42,053

The difference (~0.5%, with the patched build nominally faster) is within
measurement noise, so I see no measurable overhead. The increment hits a
backend-local, L1-hot array and is dwarfed by XLogInsert's existing CRC,
locking, and memcpy work.
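
For reference, the ~0.5% figure is just the relative difference between the
two averages above. A small sketch of that arithmetic (the function name is
made up for illustration):

```c
#include <assert.h>

/*
 * Relative difference of the patched TPS against the baseline, in percent.
 * A positive result means the patched build was faster in that run.
 */
double
demo_relative_diff_percent(double patched_tps, double baseline_tps)
{
    assert(baseline_tps > 0.0);
    return (patched_tps - baseline_tps) / baseline_tps * 100.0;
}
```

With the numbers above, demo_relative_diff_percent(42266.0, 42053.0) comes out
at roughly +0.5%, i.e. the patched run was marginally faster, which is what
one would expect from run-to-run noise rather than from the patch itself.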

I've attached a draft patch; please share your thoughts.


Thanks,
Satya

v1-0001-pg-stat-wal-records.patch
Description: Binary data
