Hi hackers,

== Motivation ==

We operate a fleet of PostgreSQL instances with logical replication. On
several occasions, we have experienced production incidents where logical
decoding spill files (pg_replslot/<slot>/xid-*.spill) grew uncontrollably —
consuming tens of gigabytes and eventually filling up the data disk. This
caused the entire instance to go read-only, impacting not just replication
but all write workloads.

The typical scenario is a large transaction (e.g. bulk data load or a
long-running DDL) combined with a subscriber that is either slow or
temporarily disconnected. The reorder buffer exceeds
logical_decoding_work_mem and starts spilling, but there is no upper bound
on how much can be spilled. The only backstop today is the OS returning
ENOSPC, at which point the damage is already done.

We looked for existing protections:

   - max_slot_wal_keep_size: limits WAL retention, but does not affect
   spill files at all.
   - logical_decoding_work_mem: controls *when* spilling starts, but not
   *how much* can be spilled.
   - There is no existing GUC, patch, or commitfest entry that addresses
   spill file disk quota.


The "Report reorder buffer size" patch (CF #6053, by Ashutosh Bapat)
improves observability of reorder buffer state, which is complementary —
but observability alone cannot prevent disk-full incidents.

== Proposed solution ==

The attached patch adds a new GUC:
logical_decoding_spill_limit (integer, unit kB, default 0)
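
For concreteness, enabling it in postgresql.conf might look like this (the
value is an arbitrary example; since the unit is kB, the usual memory-unit
suffixes are accepted as for other size GUCs):

```
# Cap per-slot logical decoding spill files at 8GB (0 = unlimited)
logical_decoding_spill_limit = 8GB
```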

When set to a positive value, it limits the total size of on-disk spill
files per replication slot. Key design points:

   1. Tracking: We add two new fields:
      - ReorderBuffer.spillBytesOnDisk — the current total on-disk spill
      size for this slot (unlike spillBytes, which is a cumulative
      statistics counter, this is a live gauge).
      - ReorderBufferTXN.serialized_size — the per-transaction on-disk
      size, so the global counter can be decremented accurately during
      cleanup.
   2. Increment: In ReorderBufferSerializeChange(), after a successful
   write(), both counters are incremented by the size written.
   3. Decrement: In ReorderBufferRestoreCleanup(), when spill files are
   unlinked, the global counter is decremented by the transaction's
   serialized_size.
   4. Enforcement: In ReorderBufferCheckMemoryLimit(), before calling
   ReorderBufferSerializeTXN(), we check:
      if (spillBytesOnDisk + txn->size > spill_limit)
          ereport(ERROR, ...);
   This check applies only on the spill-to-disk path, not on the
   streaming path (which involves no disk I/O).
   5. Behavior on limit exceeded: An ERROR is raised with
   ERRCODE_CONFIGURATION_LIMIT_EXCEEDED. The walsender exits, but the slot's
   restart_lsn and confirmed_flush are preserved. The subscriber can reconnect
   after the DBA:
      1. increases logical_decoding_spill_limit, or
      2. increases logical_decoding_work_mem (to reduce spilling), or
      3. switches to a streaming-capable output plugin (which avoids
      spilling entirely).
   6. Default 0 means unlimited — fully backward compatible.

== Why per-slot, not global? ==

Each ReorderBuffer instance lives in a single walsender process and
corresponds to exactly one replication slot. A per-slot limit is:

   - Lock-free (no shared memory coordination needed)
   - Simple to reason about (each slot has its own budget)
   - Sufficient to protect against disk-full (the DBA sets the limit based
   on available disk / number of slots)

A global (cross-slot) limit could be layered on top later if needed, but
would require shared-memory counters with spinlock/atomic protection.

== Performance impact ==

   - Hot path (in-memory change queuing): zero overhead.
   - Spill path: one integer comparison before serialization, one integer
   addition after write() — negligible compared to the I/O cost.
   - Cleanup path: one integer subtraction after unlink() — negligible.


Looking forward to feedback.
Thanks,
Shawn.
