Hi,

Recently I got two postgres crashes on an installation that has been
running for years, without significant recent changes.

Postgres is 15.15, OS is FreeBSD 14.3

The crashes are SIGBUS, happening on different db-clusters running on
the same node from the same binary:

col: Feb  2 03:52:57 LOG:  background worker "parallel worker" (PID 79324) was terminated by signal 10: Bus error
int: Feb 12 03:38:03 LOG:  background worker "parallel worker" (PID 26340) was terminated by signal 10: Bus error
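
For comparing such events across clusters, here is a minimal Python sketch that pulls the fields out of lines shaped like the two above. The leading "col:" / "int:" tag is just how I labelled the clusters in this mail, and the names CRASH_RE and parse_crash are made up for illustration; a real log_line_prefix will most likely need a different pattern.

```python
import re

# Matches the "terminated by signal" lines as quoted in this mail.
# Assumption: a leading "<cluster>: " tag and a syslog-style timestamp.
CRASH_RE = re.compile(
    r'^(?P<cluster>\w+): (?P<when>\w+ \d+ [\d:]+) LOG: '
    r'background worker "(?P<worker>[^"]+)" \(PID (?P<pid>\d+)\) '
    r'was terminated by signal (?P<signal>\d+): (?P<reason>.+)$'
)

def parse_crash(line):
    """Return a dict of fields, or None if the line is not a crash report."""
    # Collapse runs of whitespace (including a wrapped line break) first.
    flat = " ".join(line.split())
    m = CRASH_RE.match(flat)
    if m is None:
        return None
    rec = m.groupdict()
    rec["pid"] = int(rec["pid"])
    rec["signal"] = int(rec["signal"])
    return rec
```

Collapsing the whitespace first means a line that was wrapped by the mail client still parses.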

On the second occurrence I looked into the coredump (which is sparse
because this is a production build):

* thread #1, name = 'postgres', stop reason = signal SIGBUS
 * frame #0: 0x0000000829930ac3 libc.so.7`___lldb_unnamed_symbol5890 + 131
   frame #1: 0x000000082992da28 libc.so.7`___lldb_unnamed_symbol5865 + 504
   frame #2: 0x000000082992e889 libc.so.7`___lldb_unnamed_symbol5871 + 2617
   frame #3: 0x000000082990ca84 libc.so.7`___lldb_unnamed_symbol5446 + 644
   frame #4: 0x000000082990c6b7 libc.so.7`___lldb_unnamed_symbol5445 + 839
   frame #5: 0x0000000829952945 libc.so.7`___lldb_unnamed_symbol6064 + 21
   frame #6: 0x0000000829900013 libc.so.7`___lldb_unnamed_symbol5410 + 755
   frame #7: 0x00000000009c0577 postgres`AllocSetContextCreateInternal + 199
   frame #8: 0x00000000006d588c postgres`ExecAssignExprContext + 108
   frame #9: 0x00000000006faab9 postgres`ExecInitSeqScan + 73
   frame #10: 0x00000000006cf188 postgres`ExecInitNode + 248
   frame #11: 0x00000000006c8440 postgres`standard_ExecutorStart + 1056
   frame #12: 0x00000000006cca12 postgres`ParallelQueryMain + 402
   frame #13: 0x0000000000585f79 postgres`ParallelWorkerMain + 985
   frame #14: 0x00000000007bc606 postgres`StartBackgroundWorker + 310
   frame #15: 0x00000000007c1f00 postgres`maybe_start_bgworkers + 1104
   frame #16: 0x00000000007c0a43 postgres`sigusr1_handler + 307
   frame #17: 0x00000008228aa606 libthr.so.3`___lldb_unnamed_symbol688 + 214
   frame #18: 0x00000008228a9b0a libthr.so.3`___lldb_unnamed_symbol669 + 314
   frame #19: 0x0000000821a402d3
   frame #20: 0x00000000007c2545 postgres`ServerLoop + 1605
   frame #21: 0x00000000007bffa3 postgres`PostmasterMain + 3251
   frame #22: 0x0000000000720601 postgres`main + 801
   frame #23: 0x0000000829803190 libc.so.7`__libc_start1 + 304
   frame #24: 0x00000000004ff4e4 postgres`_start + 36

I'm not sure what to make of this. A single crash might be due to
a cosmic ray or whatever; a second occurrence usually means there
is something wrong.

The function AllocSetContextCreateInternal() allocates memory, and
the libc frames above it are presumably the allocator itself. That
would fit the SIGBUS event, and shifts the balance more toward a
software issue than a hardware issue.

Forensics from the logfiles tell me that in both cases the only
running task that might use parallel workers was a routine data
collection job that runs at least every night. It was a different
job in each case, with no common parts and no special plugins or
whatever, just plain SQL.

In between the two crashes, the postgres binaries were updated and
the system subsequently rebooted.

The system does not report any hardware issues, nor are there
failures in other applications running on it. Memory ECC exists,
and does actually work - I've seen that in the past.

The two clusters use different physical disks.

Postgres configuration is mostly as recommended. I was surprised to
find that we now use *three* different shared-memory allocation tools,
but the manual is clear about that:
 * 4096 bytes from SysV shm (visible with ipcs)
 * shared buffers apparently from anonymous mmap() - nowhere visible
   in the system
 * dynamic shared memory from Posix - these segments are visible with
   posixshmcontrol.

Some sources say postgres would access the shared memory via handles
under /dev/shm. But this is not possible because /dev/shm does not
exist (by default on FreeBSD jails).
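
To illustrate that a Posix shm object is opened by name and does not have to be a file under /dev/shm, here is a minimal Python sketch. It uses multiprocessing.shared_memory, which as far as I know calls shm_open() on Unix; this is not what postgres itself does, and the function name roundtrip is made up for the example. My understanding is that /dev/shm is how Linux implements shm_open(), while FreeBSD keeps these objects in a kernel namespace of their own.

```python
from multiprocessing import shared_memory

def roundtrip(payload: bytes) -> bytes:
    """Write payload into a named shm segment and read it back by name."""
    # Create a named shared memory object; the name is generated for us.
    seg = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        seg.buf[:len(payload)] = payload
        # Attach a second handle purely by name, as another process would.
        peer = shared_memory.SharedMemory(name=seg.name)
        try:
            return bytes(peer.buf[:len(payload)])
        finally:
            peer.close()
    finally:
        seg.close()
        seg.unlink()  # remove the name so the segment does not linger
```

While the segment exists, I would expect it to show up in "posixshmcontrol list" on FreeBSD in the same way the /PostgreSQL.* entries do.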

Furthermore, the manual says postgres uses "a significant number" of
semaphores, and that these are *not* SysV sem. They also are not
Posix, because these do not exist - one would need to build a
custom kernel to get them (according to "man 4 sem").

So far, this does not shed much light on the issue, except insofar
as the "dynamic shared memory" seems historically intended specifically
for parallel workers. One could assume a kind of coincidence, but
looking closer, there are always some of these Posix shm present, on
every cluster, and right from the start, parallel workers or not:
  # posixshmcontrol list
MODE       OWNER     GROUP     SIZE     PATH
rw-------  postgres  postgres  30976    /PostgreSQL.1991522144
rw-------  postgres  postgres  2097152  /PostgreSQL.45072524
rw-------  postgres  postgres  1048576  /PostgreSQL.1450298

Here are my config adjustments, insofar as they might relate
to memory allocation:

max_connections = 60                    # (change requires restart)
shared_buffers = 40MB                   # min 128kB
temp_buffers = 20MB                     # min 800kB
work_mem = 50MB                         # min 64kB
maintenance_work_mem = 50MB             # min 1MB
max_stack_depth = 40MB                  # min 100kB
dynamic_shared_memory_type = posix      # the default is the first option
max_files_per_process = 200             # min 25
effective_io_concurrency = 5            # 1-1000; 0 disables prefetching
synchronous_commit = off                # synchronization level;
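
As a rough sanity check on whether memory pressure is even plausible with these settings, here is a back-of-the-envelope sketch in Python. It assumes every connection is busy, that each backend uses its full temp_buffers, and that each backend runs ops_per_backend sort/hash operations of work_mem size. It ignores parallel workers, hash_mem_multiplier and per-process overhead, so it is an estimate, not a bound. The name rough_peak_mb is made up for this example.

```python
# Values in MB, copied from the settings listed above.
settings_mb = {
    "max_connections": 60,   # a count, not MB
    "shared_buffers": 40,
    "temp_buffers": 20,
    "work_mem": 50,
}

def rough_peak_mb(s, ops_per_backend=1):
    """Crude peak estimate: shared buffers plus per-backend private memory."""
    per_backend = s["temp_buffers"] + ops_per_backend * s["work_mem"]
    return s["shared_buffers"] + s["max_connections"] * per_backend

print(rough_peak_mb(settings_mb))                     # one operation per backend
print(rough_peak_mb(settings_mb, ops_per_backend=2))  # two operations per backend
```

With one operation per backend this comes to 4240 MB, and with two to 7240 MB. Either figure only matters in relation to the RAM and swap actually available on the node.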

-- PMc

