> On Mar 19, 2026, at 22:56, Etsuro Fujita <[email protected]> wrote:
> 
> Hi,
> 
> I got an offline report from my colleague Zhibai Song that
> close_cursor() is called for a freed PGconn, leading to a server
> crash.  Here is a reproducer (the original reproducer he provided is a
> bit complex, so I simplified it):
> 
> create server loopback
>    foreign data wrapper postgres_fdw
>    options (dbname 'postgres');
> create user mapping for current_user server loopback;
> create table t1 (id int, data text);
> create foreign table ft1 (id int, data text)
>    server loopback options (table_name 't1');
> insert into ft1 values (1, 'foo');
> start transaction;
> -- This caches the remote connection's PGconn in PgFdwScanState
> declare c1 cursor for select * from ft1;
> fetch c1;
> id | data
> ----+------
>  1 | foo
> (1 row)
> 
> savepoint s1;
> select * from ft1;
> id | data
> ----+------
>  1 | foo
> (1 row)
> 
> select pid from pg_stat_activity
>    where datname = 'postgres'
>      and application_name = 'postgres_fdw';
>  pid
> -------
> 91853
> (1 row)
> 
> -- This terminates the remote session
> select pg_terminate_backend(91853);
> pg_terminate_backend
> ----------------------
> t
> (1 row)
> 
> -- This leaves the remote connection's changing_xact_state as true
> rollback to s1;
> savepoint s1;
> -- This calls pgfdw_reject_incomplete_xact_state_change(), freeing
> -- the remote connection's PGconn as changing_xact_state is true
> select * from ft1;
> ERROR:  connection to server "loopback" was lost
> rollback to s1;
> -- This calls close_cursor() on the PGconn cached in PgFdwScanState,
> -- which was freed above, leading to a server crash
> close c1;
> 
> I think the root cause is that it is too early to free the PGconn in
> pgfdw_reject_incomplete_xact_state_change() even if the connection is
> in a state where we cannot use it any further; I think we should delay
> that until abort cleanup (ie, pgfdw_xact_callback()).  Attached is a
> patch for that.
> 
> Best regards,
> Etsuro Fujita
> <fix-connection-handling-in-postgres-fdw.patch>

Hi Etsuro-san,

I can reproduce the server crash following your procedure, and I traced the 
problem.

The issue is that, during the failing "select * from ft1", 
pgfdw_reject_incomplete_xact_state_change() calls disconnect_pg_server(), which 
frees the PGconn and sets ConnCacheEntry->conn = NULL, but does not update 
PgFdwScanState->conn. As a result, when "close c1" is executed later, 
PgFdwScanState->conn is a dangling pointer to freed memory.

I am not sure we should still allow further commands to run after that 
"select * from ft1", given that it has already raised:
ERROR:  connection to server "loopback" was lost
Maybe we should not keep going as if the connection were still there.

I am not very familiar with the FDW code, so I am not ready to suggest a 
concrete fix. But it seems wrong to let later paths keep using 
PgFdwScanState->conn after select * from ft1 has already failed with connection 
loss. My guess is that we either need to invalidate all dependent state when 
disconnect_pg_server() runs, or otherwise prevent later cleanup paths from 
touching the cached PGconn *.
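
To make the second option a bit more concrete, here is a minimal standalone 
sketch. It is not postgres_fdw code: FakeConn, FakeCacheEntry, FakeScanState 
and the fake_* functions are all hypothetical stand-ins I made up for 
PGconn, ConnCacheEntry, PgFdwScanState, disconnect_pg_server() and 
close_cursor(). The idea is that the scan state refers to the cache entry 
rather than caching the raw connection pointer, so the cleanup path can notice 
that the connection is gone. It assumes the cache entry itself outlives the 
scan state, which I have not verified for the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for libpq's PGconn. */
typedef struct FakeConn
{
    int         cursors_closed; /* counts remote CLOSE commands "sent" */
} FakeConn;

/* Hypothetical stand-in for ConnCacheEntry: the owner of the connection. */
typedef struct FakeCacheEntry
{
    FakeConn   *conn;           /* NULL once the connection is discarded */
} FakeCacheEntry;

/*
 * Hypothetical stand-in for PgFdwScanState.  It points at the cache entry
 * instead of holding its own copy of the connection pointer.
 */
typedef struct FakeScanState
{
    FakeCacheEntry *entry;
    bool        cursor_open;
} FakeScanState;

/* Establish a connection for the entry. */
void
fake_connect(FakeCacheEntry *entry)
{
    entry->conn = calloc(1, sizeof(FakeConn));
    assert(entry->conn != NULL);
}

/* Models disconnect_pg_server(): free the connection and clear the entry. */
void
fake_disconnect(FakeCacheEntry *entry)
{
    free(entry->conn);
    entry->conn = NULL;
}

/*
 * Models the cleanup path reached by "close c1".  Returns true if a remote
 * CLOSE was issued, false if it was skipped because no cursor was open or
 * the connection had already been discarded.
 */
bool
fake_close_cursor(FakeScanState *scan)
{
    bool        sent = false;

    if (!scan->cursor_open)
        return false;

    if (scan->entry->conn != NULL)
    {
        scan->entry->conn->cursors_closed++;
        sent = true;
    }

    /* Either way, there is no usable remote cursor any more. */
    scan->cursor_open = false;
    return sent;
}
```

With this shape, the sequence from the reproducer (open a cursor, lose the 
connection, then close the cursor) skips the remote CLOSE instead of 
dereferencing freed memory. Whether something like this is preferable to 
delaying the free until abort cleanup, as your patch does, I cannot say.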

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/





