> On Mar 19, 2026, at 22:56, Etsuro Fujita <[email protected]> wrote:
>
> Hi,
>
> I got an offline report from my colleague Zhibai Song that
> close_cursor() is called for a freed PGconn, leading to a server
> crash. Here is a reproducer (the original reproducer he provided is a
> bit complex, so I simplified it):
>
> create server loopback
> foreign data wrapper postgres_fdw
> options (dbname 'postgres');
> create user mapping for current_user server loopback;
> create table t1 (id int, data text);
> create foreign table ft1 (id int, data text)
> server loopback options (table_name 't1');
> insert into ft1 values (1, 'foo');
> start transaction;
> -- This caches the remote connection's PGconn in PgFdwScanState
> declare c1 cursor for select * from ft1;
> fetch c1;
> id | data
> ----+------
> 1 | foo
> (1 row)
>
> savepoint s1;
> select * from ft1;
> id | data
> ----+------
> 1 | foo
> (1 row)
>
> select pid from pg_stat_activity
> where datname = 'postgres'
> and application_name = 'postgres_fdw';
> pid
> -------
> 91853
> (1 row)
>
> -- This terminates the remote session
> select pg_terminate_backend(91853);
> pg_terminate_backend
> ----------------------
> t
> (1 row)
>
> -- This leaves the remote connection's changing_xact_state as true
> rollback to s1;
> savepoint s1;
> -- This calls pgfdw_reject_incomplete_xact_state_change(), freeing
> -- the remote connection's PGconn as changing_xact_state is true
> select * from ft1;
> ERROR: connection to server "loopback" was lost
> rollback to s1;
> -- This calls close_cursor() on the PGconn cached in PgFdwScanState,
> -- which was freed above, leading to a server crash
> close c1;
>
> I think the root cause is that it is too early to free the PGconn in
> pgfdw_reject_incomplete_xact_state_change() even if the connection is
> in a state where we cannot use it any further; I think we should delay
> that until abort cleanup (ie, pgfdw_xact_callback()). Attached is a
> patch for that.
>
> Best regards,
> Etsuro Fujita
> <fix-connection-handling-in-postgres-fdw.patch>

Hi Etsuro-san,

I can reproduce the server crash following your procedure, and I traced the
problem.

The issue is that, during the "select * from ft1" that fails with the
connection-lost error, pgfdw_reject_incomplete_xact_state_change() calls
disconnect_pg_server(), which frees the PGconn and sets
ConnCacheEntry->conn = NULL, but does not update PgFdwScanState->conn. As a
result, when "close c1" is executed later, PgFdwScanState->conn is a dangling
pointer to freed memory.
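
To make the aliasing concrete, here is a tiny standalone model of it. This is
not postgres_fdw code: ToyConn, ToyCacheEntry, ToyScanState and the toy_*
functions are names I made up, and the "connection" lives in a static slot with
an "alive" flag so that looking at the stale pointer is well-defined in the toy.
In the real case the PGconn is actually freed, so any access through the copy is
undefined behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PGconn; "alive" models "not yet freed". */
typedef struct ToyConn { bool alive; } ToyConn;

/* Hypothetical stand-in for ConnCacheEntry. */
typedef struct ToyCacheEntry { ToyConn *conn; } ToyCacheEntry;

/* Hypothetical stand-in for PgFdwScanState: it keeps its own copy of the pointer. */
typedef struct ToyScanState { ToyConn *conn; } ToyScanState;

/* Models the disconnect: the entry is reset, but copies elsewhere are not. */
static void toy_disconnect(ToyCacheEntry *entry)
{
    if (entry->conn != NULL)
    {
        entry->conn->alive = false;   /* models freeing the connection */
        entry->conn = NULL;
    }
}

/* True if the scan state still points at a connection that is gone. */
static bool toy_scan_conn_is_stale(const ToyScanState *scan)
{
    return scan->conn != NULL && !scan->conn->alive;
}

/* Replays the sequence from the reproducer in miniature. */
static bool toy_replay(void)
{
    static ToyConn slot;
    slot.alive = true;

    ToyCacheEntry entry = { &slot };
    ToyScanState scan = { entry.conn };   /* "declare c1" / "fetch c1" */
    assert(!toy_scan_conn_is_stale(&scan));

    toy_disconnect(&entry);               /* the failing "select * from ft1" */

    /* The entry was cleared, yet the scan state's copy is now stale. */
    return entry.conn == NULL && toy_scan_conn_is_stale(&scan);
}
```

Calling toy_replay() returns true: the cache entry has been reset, but the scan
state still holds the old pointer, which is the situation "close c1" runs into.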

I am not sure we should still allow further commands to run after select * from
ft1, given that it has already raised 'ERROR: connection to server "loopback"
was lost'. Maybe we should not keep going as if the connection were still there.

I am not very familiar with the FDW code, so I am not ready to suggest a
concrete fix. But it seems wrong to let later paths keep using
PgFdwScanState->conn after select * from ft1 has already failed with connection
loss. My guess is that we either need to invalidate all dependent state when
disconnect_pg_server() runs, or otherwise prevent later cleanup paths from
touching the cached PGconn *.
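
Just to illustrate the second option, below is a rough standalone sketch, not a
proposed patch. All names (DemoConn, DemoCacheEntry, DemoScanState and the
demo_* functions) are hypothetical. It assumes the scan state could reach the
connection through the cache entry instead of caching the raw pointer, and that
the cache entry outlives the scan state; I have not checked whether either
holds in postgres_fdw.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PGconn; counts CLOSE commands "sent" on it. */
typedef struct DemoConn { int cursors_closed; } DemoConn;

/* Hypothetical stand-in for ConnCacheEntry. */
typedef struct DemoCacheEntry { DemoConn *conn; } DemoCacheEntry;

/* Hypothetical scan state: refers to the cache entry, not to the raw conn. */
typedef struct DemoScanState
{
    DemoCacheEntry *entry;
    bool            cursor_exists;
} DemoScanState;

/* Models the disconnect; real code would also free the connection here. */
static void demo_disconnect(DemoCacheEntry *entry)
{
    entry->conn = NULL;
}

/*
 * Guarded cleanup: only touch the connection if the entry still has one.
 * Returns true if a close was actually issued.
 */
static bool demo_close_cursor(DemoScanState *scan)
{
    bool sent = false;

    if (scan->cursor_exists && scan->entry->conn != NULL)
    {
        scan->entry->conn->cursors_closed++;
        sent = true;
    }
    scan->cursor_exists = false;   /* nothing left to clean up locally */
    return sent;
}

/* Disconnect first, then run the cleanup path, as in the reproducer. */
static bool demo_guarded_close_after_disconnect(void)
{
    DemoConn conn = { 0 };
    DemoCacheEntry entry = { &conn };
    DemoScanState scan = { &entry, true };

    demo_disconnect(&entry);
    bool sent = demo_close_cursor(&scan);

    /* The cleanup skipped the dead connection instead of dereferencing it. */
    return !sent && conn.cursors_closed == 0 && !scan.cursor_exists;
}
```

With the extra indirection the cleanup path sees the NULL that the disconnect
left in the entry and skips the remote close. The first option (invalidating
dependent state at disconnect time) would instead need some way to find every
scan state that copied the pointer.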

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/