server process exited with code 1

2021-04-23 Thread Eric Hill
Hey,

We are experiencing a periodic PostgreSQL crash. It happens overnight when 
automated processes are updating content on our website.  My PostgreSQL version 
information is:

PostgreSQL 11.10, compiled by Visual C++ build 1914, 64-bit

It is running on Windows Server 2019 Standard.  We were having this same crash 
on PostgreSQL 9.6, also running on Windows Server, I believe 2012, before a 
recent upgrade.  Here is the relevant part of the log:

2021-04-23 03:20:57 EDT [5324]: LOG:  connection received: host=10.120.80.162 
port=54017
2021-04-23 03:20:57 EDT [5324]: LOG:  connection authorized: user=dba_webjmp 
database=webjmp
2021-04-23 03:21:00 EDT [15776]: LOG:  server process (PID 14820) exited with 
exit code 1
2021-04-23 03:21:00 EDT [15776]: DETAIL:  Failed process was running: SET 
client_min_messages TO warning;SET TIME ZONE INTERVAL '+00:00' HOUR TO MINUTE;
2021-04-23 03:21:00 EDT [15776]: LOG:  terminating any other active server 
processes
2021-04-23 03:21:00 EDT [5324]: WARNING:  terminating connection because of 
crash of another server process
2021-04-23 03:21:00 EDT [5324]: DETAIL:  The postmaster has commanded this 
server process to roll back the current transaction and exit, because another 
server process exited abnormally and possibly corrupted shared memory.
2021-04-23 03:21:00 EDT [5324]: HINT:  In a moment you should be able to 
reconnect to the database and repeat your command.

I would really like for PostgreSQL to produce a crashdump of this crash but one 
is not produced.  I made a crashdumps folder in the data directory and gave it 
every permission I could think of, but no joy there.

We’ve looked at various aspects of the PostgreSQL instance and the  machine it 
is running on, and we don’t seem to be running into any resource constraints 
(connections, memory).  This is what I’m logging:

# pgBadger-approved log line prefix
log_line_prefix = '%t [%p]: '

# S1561728 - Log any query that takes > 250 milliseconds to return
log_min_duration_statement = 250ms

# Additional logging to troubleshoot crashes, also using pgBadger 
recommendations
log_checkpoints = on# Default is off
log_lock_waits = on # Default is off
log_connections = on# Default is off
log_disconnections = on # Default is off
log_temp_files = 0  # log temporary files equal or larger
log_autovacuum_min_duration = 0 # log all autovacuums
log_error_verbosity = default   # default is the default

Any thoughts?  Any other information I could provide that would be helpful?

Thanks!

Eric


Re: server process exited with code 1

2021-04-23 Thread Eric Hill
Hey, Julien,

Yes, I have looked at those instructions and followed them to a T.  My data 
directory is C:\Program Files\PostgreSQL\11\data as confirmed by SHOW 
data_directory;, and I gave the postgres, pgsqladmin, and NETWORK SERVICE 
accounts (the latter is the account that is running the services) Full Control 
on the crashdumps directory I created under data.  No crash dumps have yet 
appeared in that folder.  I’m not sure how to convince crashdumps to show up.  
Maybe with all the antivirus stuff turned off it will help?

Thanks!

Eric

From: Julien Rouhaud 
Date: Friday, April 23, 2021 at 12:21 PM
To: Eric Hill 
Cc: pgsql-general@lists.postgresql.org 
Subject: Re: server process exited with code 1
EXTERNAL

On Fri, Apr 23, 2021 at 03:58:32PM +, Eric Hill wrote:
> Hey,
>
> We are experiencing a periodic PostgreSQL crash. It happens overnight when 
> automated processes are updating content on our website.  My PostgreSQL 
> version information is:
>
> PostgreSQL 11.10, compiled by Visual C++ build 1914, 64-bit
>
> It is running on Windows Server 2019 Standard.  We were having this same 
> crash on PostgreSQL 9.6, also running on Windows Server, I believe 2012, 
> before a recent upgrade.  Here is the relevant part of the log:
>
> 2021-04-23 03:20:57 EDT [5324]: LOG:  connection received: host=10.120.80.162 
> port=54017
> 2021-04-23 03:20:57 EDT [5324]: LOG:  connection authorized: user=dba_webjmp 
> database=webjmp
> 2021-04-23 03:21:00 EDT [15776]: LOG:  server process (PID 14820) exited with 
> exit code 1
> 2021-04-23 03:21:00 EDT [15776]: DETAIL:  Failed process was running: SET 
> client_min_messages TO warning;SET TIME ZONE INTERVAL '+00:00' HOUR TO MINUTE;
> 2021-04-23 03:21:00 EDT [15776]: LOG:  terminating any other active server 
> processes
> 2021-04-23 03:21:00 EDT [5324]: WARNING:  terminating connection because of 
> crash of another server process
> 2021-04-23 03:21:00 EDT [5324]: DETAIL:  The postmaster has commanded this 
> server process to roll back the current transaction and exit, because another 
> server process exited abnormally and possibly corrupted shared memory.
> 2021-04-23 03:21:00 EDT [5324]: HINT:  In a moment you should be able to 
> reconnect to the database and repeat your command.
>
> I would really like for PostgreSQL to produce a crashdump of this crash but 
> one is not produced.  I made a crashdumps folder in the data directory and 
> gave it every permission I could think of, but no joy there.

The mentioned query works just fine for me, so I doubt we could help without a
core dump.  Did you check the instructions at
https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwiki.postgresql.org%2Fwiki%2FGetting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows%23Using_crash_dumps_to_debug_random_and_unpredictable_backend_crashes&data=04%7C01%7CEric.Hill%40jmp.com%7Cddb46609d2e148820b3908d90673e86a%7Cb1c14d5c362545b3a4309552373a0c2f%7C0%7C0%7C637547917153130030%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=VJTI%2FHtgXN6u5F6k5YYcImSuS3MwQe%2BIDdLy7CQLo7g%3D&reserved=0?


Re: server process exited with code 1

2021-04-23 Thread Eric Hill
Wow, Tom, thanks, that gives us a lot to go on.

The first sign of trouble on the client is just ECONNRESET:

[[37m2021-04-23T07:21:01.073Z[39m] [35m WARN[39m: main/6716 on [redacted]:
[37m[36mPubSub lost connection to datatabase, retrying connection. Error: 
read ECONNRESET
at TCP.onStreamRead (internal/stream_base_commons.js:209:20)
at TCP.callbackTrampoline (internal/async_hooks.js:131:14) {
  errno: -4077,
  code: 'ECONNRESET',
  syscall: 'read'
}[39m
[39m[[37m2021-04-23T07:21:01.079Z[39m] [35m WARN[39m: main/6716 on [redacted]:
[37m[36mPubSub lost connection to datatabase, retrying connection. Error: 
Connection terminated unexpectedly

(Times in that log are GMT, so 07:21am here === 03:21am in the psql log).

I don’t believe we have any unusual extensions.  We do have triggers, and the 
VM does have antivirus protection.  I’ll work on exclusions for the AV, and 
we’ll look into our triggers a bit.

Thanks!

Eric


From: Tom Lane 
Date: Friday, April 23, 2021 at 12:36 PM
To: Eric Hill 
Cc: pgsql-general@lists.postgresql.org 
Subject: Re: server process exited with code 1
EXTERNAL

Eric Hill  writes:
> 2021-04-23 03:20:57 EDT [5324]: LOG:  connection received: host=10.120.80.162 
> port=54017
> 2021-04-23 03:20:57 EDT [5324]: LOG:  connection authorized: user=dba_webjmp 
> database=webjmp
> 2021-04-23 03:21:00 EDT [15776]: LOG:  server process (PID 14820) exited with 
> exit code 1
> 2021-04-23 03:21:00 EDT [15776]: DETAIL:  Failed process was running: SET 
> client_min_messages TO warning;SET TIME ZONE INTERVAL '+00:00' HOUR TO MINUTE;
> 2021-04-23 03:21:00 EDT [15776]: LOG:  terminating any other active server 
> processes

Hm.  That's fairly odd.  Exit code 1 ordinarily means that the backend
hit a "FATAL" error, which is one that causes process termination but
*not* a database-wide restart.  The fact that the postmaster is treating
it that way anyway implies that the backend failed to disarm the "dead
man switch" mechanism that exists to check that backends cleaned up
their shared-memory state before exiting.  Another thing that doesn't
square with this being an ordinary FATAL exit is that there's no message
visible in the log.

(I wonder what this looks like from the client's end --- is it getting
any sort of termination message, or just a lost network connection?)

The implication is that something in the backend did _exit(1) to force
process termination without any of the normal atexit cleanup.  There
is not, or at least isn't supposed to be, anything in Postgres proper
that does that.  So I'm wondering if you have any nonstandard code
in there, such as unusual extensions.  A badly-written event trigger
could perhaps do it too.

This being Windows, a certain amount of suspicion has to be directed
towards bogus antivirus code, too.

regards, tom lane