Hi,

When a table is added to a publication without a column list:

ALTER PUBLICATION my_pub ADD TABLE my_table;

Any new column added to my_table will automatically be replicated. The
underlying catalog (pg_publication_rel.prattrs) stores NULL for this case.

When a table is added with an explicit column list:

ALTER PUBLICATION my_pub ADD TABLE my_table (id, name, status);

New columns will NOT be replicated until the publication is explicitly
altered.
The catalog stores an int2vector of the specified attribute numbers.

The problem is that pg_get_publication_tables() in pg_publication.c (the
set-returning
function backing the pg_publication_tables view) erases this distinction.
When prattrs is NULL, it opens the table, iterates all eligible attributes,
and builds a synthetic int2vector of all current columns. The view then
shows
the same attnames output for both cases.

| Scenario                 | prattrs in catalog | attrs from SRF      |
attnames in view |
|--------------------------|--------------------|---------------------|------------------|
| column list (a, b)       | {1, 2  }           | {1,2}               | {a,
b}           |
| No column list           | NULL               | {1,2,3,...} (synth) | {a,
b, c, ...}   |
| FOR ALL TABLES/IN SCHEMA | no catalog row     | {1,2,3,...} (synth) | {a,
b, c, ...}   |

This is a problem for operations: schema migrations that add columns to
published
tables may or may not replicate depending on how the publication was
originally
defined, and the only way to check is querying pg_publication_rel directly.

Additionally, tablesync.c has a workaround heuristic that tries to reverse
this synthesis by comparing array_length(gpt.attrs, 1) against c.relnatts.
This heuristic is buggy: relnatts counts all user attributes including
dropped
columns, but the synthesized list excludes dropped columns and conditionally
excludes generated columns. So for any table with a dropped column, the
heuristic incorrectly treats "no column list" as if an explicit list were
specified.

The problematic code is the result of commits that incrementally
built on each other, with later ones introducing workarounds for side
effects
of earlier ones:

1. fd0b9dcebda (06/2022) Amit Kapila (author: Hou Zhijie)
Prohibit combining publications with different column lists

Added the subscriber-side check that prevents combining publications with
different column lists for the same table. The subscriber runs
SELECT DISTINCT gpt.attrs and expects at most one row. To handle the fact
that pg_get_publication_tables() returned NULL for no-column-list
publications (which would always be distinct from any explicit list), this
commit introduced the relnatts heuristic in tablesync.c:

CASE WHEN (array_length(gpt.attrs, 1) = c.relnatts)
     THEN NULL ELSE gpt.attrs END

The idea was: if the SRF returns a synthesized list whose length equals
relnatts, it must be the "all columns" case, so collapse it back to NULL.
But at this point the SRF didn't synthesize anything yet, so the heuristic
was anticipating the next commit.

2. b7ae0395369 (01/2023) Amit Kapila (author: Shi yu)
Ignore dropped and generated columns from the column list

This is the commit that introduced the synthesis loop. It "solved" a problem
where a table in two publications (one with a column list naming all
columns,
one without a column list) was erroring with "cannot use different column
lists"
because one returned NULL and the other returned an int2vector. The fix was
to synthesize a column list when none was specified, filtering out dropped
and generated columns, so that both publications would produce identical
int2vectors and DISTINCT would collapse them to one row.

But this synthesis made it impossible to distinguish "all columns" from
"explicit list of all columns" in the view. And the relnatts heuristic from
fd0b9dcebda - which was supposed to reverse the synthesis - was broken from
the start because relnatts includes dropped columns while the synthesized
list
excludes them.

The synthesis in b7ae0395369 tried to make "no column list" and "explicit
list
of all columns" look identical. But they have genuinely different
semantics:

- No column list (NULL): all current and future columns are replicated.
  ALTER TABLE ADD COLUMN automatically replicates the new column.
- Explicit full list: only the named columns are replicated. New columns
  are NOT replicated until the publication is explicitly altered.

By making them indistinguishable, the synthesis hid a real conflict from
users
who had a table in two publications with different column semantics on the
same subscription. I am proposing a fix that restores the distinction and
correctly
(IMO) surfaces this conflict.

The fix is to stop synthesizing the full column list in
pg_get_publication_tables().
When prattrs is NULL in the catalog, let attrs remain NULL in the SRF
output.
Remove the buggy CASE WHEN heuristic in tablesync.c since it is no longer
needed.

There is one scenario where there is a change for users: one pub no list +
one pub explicit all columns. Anyone with that specific configuration will
see a new error on the
next tablesync or subscription refresh after upgrading. The fix for those
users is to
either remove the explicit column list from the second publication (making
both "all
columns"), or keep the difference and use separate subscriptions.

AFAICT this is safe except for the change in behavior I describe above.
psql and pg_dump query pg_publication_rel directly and not the affected
view. New subscriber
talking to old publisher still works (old pub synthesizes a list, new sub
handles it).
Old subscriber talking to new publisher also works (the old CASE WHEN
evaluates
array_length(NULL, 1) which returns NULL, falling through to the ELSE branch
returning NULL).

Attached is a patch with the fix, including updates to documentation, 7
updated
regression tests where existing expected outputs where attnames changes from
{a} to NULL, and added new test cases that verify both cases are
distinguishable in
the same query.

Roberto Mello
Snowflake

Attachment: v1-0001-Fix-pg_publication_tables-to-return-NULL-attnames.patch
Description: Binary data

Reply via email to