Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.
I ran into a problem with PostgreSQL 9.3.x on Windows 10, and I would like to add new delay code as an official build option.

Windows 10 sometimes (approximately once in 300 tries) hangs while the OS is starting up. The logs say it happened while the PostgreSQL service was starting. When the OS stopped, some postgres auxiliary processes had been started and some had not been started yet.

The Windows dump shows that some threads of the postgres auxiliary processes are waiting on OS-level locks, and that a thread of the logon process is also waiting on a lock. The MS help desk said that an OS-level deadlock caused by PostgreSQL froze the OS. I think that is a strange story. But, in fact, the hang did not happen in repeated tests when I removed PostgreSQL from the services that are auto-started at boot.

I tweaked PostgreSQL 9.3.x (the newest from the repository) to add a delay after each subprocess starts (3.0 seconds after each auxiliary process and 0.5 seconds after each backend), and the hang went away. This test patch is attached. It is only implemented for Windows. Also, I did not use the existing pg_usleep because it contains locking code (e.g. WaitForSingleObject and Enter/LeaveCriticalSection).

Although the Windows OS itself may have some problems, I think we should have a means to avoid it. Could PostgreSQL accept such delay code as a build-time option controlled by preprocessor variables?
Thanks,
Takatsuka Haruka

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d6fc2ed..ff03ebd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -398,6 +398,30 @@ extern int optreset; /* might not be declared by system headers */
 static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
+#define USE_AFTER_AUX_FORK_SLEEP 3000
+
+#ifdef USE_AFTER_AUX_FORK_SLEEP
+#ifndef WIN32
+#define AFTER_AUX_FORK_SLEEP()
+#else
+#define AFTER_AUX_FORK_SLEEP() do { SleepEx(USE_AFTER_AUX_FORK_SLEEP, FALSE); } while(0)
+#endif
+#else
+#define AFTER_AUX_FORK_SLEEP()
+#endif
+
+#define USE_AFTER_BACKEND_FORK_SLEEP 500
+
+#ifdef USE_AFTER_BACKEND_FORK_SLEEP
+#ifndef WIN32
+#define AFTER_BACKEND_FORK_SLEEP()
+#else
+#define AFTER_BACKEND_FORK_SLEEP() do { SleepEx(USE_AFTER_BACKEND_FORK_SLEEP, FALSE); } while(0)
+#endif
+#else
+#define AFTER_BACKEND_FORK_SLEEP()
+#endif
+
 /*
  * postmaster.c - function prototypes
  */
@@ -1709,6 +1733,7 @@ ServerLoop(void)
 				 */
 				StreamClose(port->sock);
 				ConnFree(port);
+				AFTER_BACKEND_FORK_SLEEP();
 			}
 		}
 	}
@@ -2801,11 +2826,20 @@ reaper(SIGNAL_ARGS)
 			 * situation, some of them may be alive already.
 			 */
 			if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
+			{
 				AutoVacPID = StartAutoVacLauncher();
+				AFTER_AUX_FORK_SLEEP();
+			}
 			if (XLogArchivingActive() && PgArchPID == 0)
+			{
 				PgArchPID = pgarch_start();
+				AFTER_AUX_FORK_SLEEP();
+			}
 			if (PgStatPID == 0)
+			{
 				PgStatPID = pgstat_start();
+				AFTER_AUX_FORK_SLEEP();
+			}
 
 			/* some workers may be scheduled to start now */
 			maybe_start_bgworker();
@@ -5259,6 +5293,7 @@ StartChildProcess(AuxProcType type)
 	/*
 	 * in parent, successful fork
 	 */
+	AFTER_AUX_FORK_SLEEP();
 	return pid;
 }
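In the patch above, USE_AFTER_AUX_FORK_SLEEP and USE_AFTER_BACKEND_FORK_SLEEP are #defined unconditionally right before the #ifdef that tests them, so the delays are always compiled in and can only be changed by editing postmaster.c. As a rough sketch of the build-time option being asked about, the delay could instead default to 0 (compiled out) and be overridden with a compiler flag. The names AFTER_CHILD_START_SLEEP_MS, after_child_start_sleep(), start_children() and the stub starters are hypothetical and not part of PostgreSQL; like the patch, the sketch calls SleepEx() directly on Windows and does nothing elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical build-time knob, e.g. -DAFTER_CHILD_START_SLEEP_MS=3000.
 * The default of 0 compiles the delay out, so ordinary builds are unchanged.
 */
#ifndef AFTER_CHILD_START_SLEEP_MS
#define AFTER_CHILD_START_SLEEP_MS 0
#endif

/*
 * Pause after a child process has been started. As in the patch, this calls
 * SleepEx() (non-alertable) instead of pg_usleep(), and is a no-op off Windows.
 */
static void after_child_start_sleep(void)
{
#if defined(_WIN32) && AFTER_CHILD_START_SLEEP_MS > 0
    SleepEx(AFTER_CHILD_START_SLEEP_MS, FALSE);
#endif
}

/* A hypothetical child "starter": returns a PID on success, 0 or -1 on failure. */
typedef int (*child_starter)(void);

/*
 * Start each child in order, pausing after every successful start.
 * Returns how many children started.
 */
static size_t start_children(const child_starter *starters, size_t n, int *pids)
{
    size_t started = 0;

    assert(n == 0 || (starters != NULL && pids != NULL));
    for (size_t i = 0; i < n; i++) {
        pids[i] = starters[i]();
        if (pids[i] > 0) {
            started++;
            after_child_start_sleep();
        }
    }
    return started;
}

/* Stub starters for illustration only (hypothetical). */
static int start_ok(void) { return 4242; }
static int start_fail(void) { return -1; }
```

A Windows build would then opt in with a flag such as /DAFTER_CHILD_START_SLEEP_MS=3000 (MSVC); the separate 0.5 second backend delay from the patch could get a second knob defined the same way. Unlike the reaper() hunk of the patch, which sleeps even if a launch fails, this sketch only sleeps after a successful start. Because the sleep runs in the postmaster itself, it also pauses everything else the postmaster does during that time, such as accepting connections.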
Re: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.
On Fri, 29 Jun 2018 08:34:18 +0200
Thomas Kellerer wrote:

> Did you try setting the service to "delayed start"?

We haven't tried it yet. Thanks for the idea. I think MS would already have advised us if it were a proper solution for this case. Anyway, we will try it and confirm.

Thanks,
Takatsuka Haruka

> TAKATSUKA Haruka wrote on 29.06.2018 at 08:03:
> > I got a trouble in PostgreSQL 9.3.x on Windows 10.
> > I would like to add new delay code as an official build option.
> >
> > Windows 10 sometime (approximately once in 300 tries) hung up
> > at OS starting up. The logs say it happened while the PostgreSQL
> > service was starting. When OS stopped, some postgres auxiliary
> > process were started and some were not started yet.
> >
> > The Windows dump say some threads of the postgres auxiliary process
> > are waiting OS level locks and the logon processes’thread are
> > also waiting a lock. MS help desk said that PostgreSQL’s OS level
> > deadlock caused OS freeze. I think it is strange story. But,
> > in fact, it not happened in repeated tests when I got rid of
> > PostgreSQL from the initial auto-starting services.
> >
> > I tweaked PostgreSQL 9.3.x (the newest from the repository) to add
> > 0.5 or 3.0 seconds delay after each sub process starts.
> > And then the hung up was gone. This test patch is attached.
> > It is only implemented for Windows. Also, I did not use existing
> > pg_usleep because it contains locking codes (e.g. WaitForSingleObject
> > and Enter/LeaveCriticalSection).
> >
> > Although Windows OS may have some problems, I think we should have
> > a means to avoid it. Can PostgreSQL be accepted such delay codes
> > as build-time options by preprocessor variables?
Re: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.
> On Fri, 29 Jun 2018 08:34:18 +0200
> Thomas Kellerer wrote:
>
> > Did you try setting the service to "delayed start"?
>
> We didn't try it yet. Thanks to give an idea. I think that
> MS would advise us already if it were a just solution for this case.
> Anyway, we will try and confirm it.

We tried it. "Delayed start" was ineffective. Putting a script that starts the PostgreSQL service in the "Startup" folder was also ineffective. Nothing has solved it except modifying the PostgreSQL code (inserting the delays).

Thanks,
Haruka Takatsuka

> > TAKATSUKA Haruka wrote on 29.06.2018 at 08:03:
> > > I got a trouble in PostgreSQL 9.3.x on Windows 10.
> > > I would like to add new delay code as an official build option.
> > >
> > > Windows 10 sometime (approximately once in 300 tries) hung up
> > > at OS starting up. The logs say it happened while the PostgreSQL
> > > service was starting. When OS stopped, some postgres auxiliary
> > > process were started and some were not started yet.
> > >
> > > The Windows dump say some threads of the postgres auxiliary process
> > > are waiting OS level locks and the logon processes’thread are
> > > also waiting a lock. MS help desk said that PostgreSQL’s OS level
> > > deadlock caused OS freeze. I think it is strange story. But,
> > > in fact, it not happened in repeated tests when I got rid of
> > > PostgreSQL from the initial auto-starting services.
> > >
> > > I tweaked PostgreSQL 9.3.x (the newest from the repository) to add
> > > 0.5 or 3.0 seconds delay after each sub process starts.
> > > And then the hung up was gone. This test patch is attached.
> > > It is only implemented for Windows. Also, I did not use existing
> > > pg_usleep because it contains locking codes (e.g. WaitForSingleObject
> > > and Enter/LeaveCriticalSection).
> > >
> > > Although Windows OS may have some problems, I think we should have
> > > a means to avoid it. Can PostgreSQL be accepted such delay codes
> > > as build-time options by preprocessor variables?
Hot standby failing with PANIC: WAL contains references to invalid pages
Hi folks,

I got the same failure as in the report below, on Amazon Linux 2017.09 with postgresql95-9.5.8-1.73.amzn1.x86_64. The WAL record's resource manager (RMGR) ID and type (Heap2/VISIBLE) are also the same.

The servers had been running for over 2 weeks when the standby crashed. At first we suspected some low-level corruption, but no exact cause has been found yet. Because of the other, similar report, I now think a bug is possible. Does anyone have any idea or a similar experience?

[Hot standby failing with page # of relation # is uninitialized]
https://www.postgresql.org/message-id/a46709a5-3c41-7acb-1db0-60a317283...@tracktrans.com

(log output on the standby)
2017-12-11 18:14:16 JST [5094]: [13-1] db=,user=,state=01000 WARNING: page 1646627 of relation pg_tblspc/16407/PG_9.5_201510051/16406/31164 is uninitialized
2017-12-11 18:14:16 JST [5094]: [14-1] db=,user=,state=01000 CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 210320
2017-12-11 18:14:16 JST [5094]: [15-1] db=,user=,state=XX000 PANIC: WAL contains references to invalid pages
2017-12-11 18:14:16 JST [5094]: [16-1] db=,user=,state=XX000 CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 210320
2017-12-11 18:14:16 JST [5088]: [5-1] db=,user=,state=0 LOG: startup process (PID 5094) was terminated by signal 6: Aborted
2017-12-11 18:14:16 JST [5088]: [6-1] db=,user=,state=0 LOG: terminating any other active server processes

With best regards,
Haruka Takatsuka
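When checking for low-level corruption, it can help to look at the page named in the WARNING directly on disk. A relation is stored in segment files, so block 1646627 of relation 31164 is not in the base file but in one of its ".N" segment files. The sketch below assumes a build with the default settings (8 kB BLCKSZ and RELSEG_SIZE of 131072 blocks, i.e. 1 GB segments); builds configured with --with-blocksize or --with-segsize would need other values. locate_block() and segment_path() are hypothetical helpers, not PostgreSQL functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Assumed defaults of a standard build: 8 kB pages, 1 GB segment files. */
#define DEFAULT_BLCKSZ      8192u
#define DEFAULT_RELSEG_SIZE 131072u   /* blocks per segment (1 GB / 8 kB) */

typedef struct {
    uint32_t segno;   /* 0 = base file, N = "<relfilenode>.N" */
    uint64_t offset;  /* byte offset of the page within that file */
} page_location;

/* Map a block number to its segment file number and byte offset. */
static page_location locate_block(uint32_t blkno, uint32_t blcksz, uint32_t relseg_size)
{
    page_location loc;

    assert(blcksz > 0 && relseg_size > 0);
    loc.segno = blkno / relseg_size;
    loc.offset = (uint64_t) (blkno % relseg_size) * blcksz;
    return loc;
}

/* Build the segment file path into buf; returns the snprintf() result, or -1. */
static int segment_path(char *buf, size_t len, const char *relpath, uint32_t segno)
{
    if (relpath == NULL || strlen(relpath) == 0)
        return -1;
    if (segno == 0)
        return snprintf(buf, len, "%s", relpath);
    return snprintf(buf, len, "%s.%" PRIu32, relpath, segno);
}
```

With those defaults, block 1646627 falls in segment 12, i.e. the file pg_tblspc/16407/PG_9.5_201510051/16406/31164.12 relative to the standby's data directory, at byte offset 604266496 (73763 * 8192).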