ERROR: too many dynamic shared memory segments
Hi, I started seeing these errors on PostgreSQL 10.1 running on an Ubuntu 16.04.3 server (64GB RAM / 8 CPUs). The DB enters recovery mode after that.

2017-11-23 07:20:39 CET::@:[24823]: ERROR: could not attach to dynamic shared area
2017-11-23 07:20:39 CET::@:[24822]: ERROR: could not map dynamic shared memory segment
2017-11-23 07:20:40 CET:192.168.xx,xx(33974):u(at)db:[24209]: ERROR: too many dynamic shared memory segments

The errors happen when parallel execution is enabled and multiple queries are executed simultaneously. If I set max_parallel_workers_per_gather = 0 the error doesn't occur. The queries are rather big, each running over 10 seconds. I see the error when 4 or more queries are started at the same time.

my postgresql.conf:

max_connections = 100
shared_buffers = 12GB
effective_cache_size = 51GB
work_mem = 83886kB
maintenance_work_mem = 2GB
max_wal_size = 8GB
min_wal_size = 4GB
checkpoint_completion_target = 0.9
wal_buffers = 32MB
default_statistics_target = 1000
max_locks_per_transaction = 128
#max_parallel_workers_per_gather = 0

PS. I've posted this question on the pgsql-admin list but didn't get any feedback.

--
regards,
Jakub Glapa
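For completeness: while testing I found the workaround can also be applied per session instead of editing postgresql.conf (it is an ordinary GUC, so this should apply generally):

```sql
-- Disable parallel query for the current session only; the error goes away
-- because no Gather nodes (and hence no extra DSM segments) are created.
SET max_parallel_workers_per_gather = 0;
```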
Re: ERROR: too many dynamic shared memory segments
-> Bitmap Index Scan on par_7_datasource (cost=0.00..372.00 rows=12228 width=0)
     Index Cond: ((datasource)::text = 'one'::text)
-> Bitmap Index Scan on par_7_datasource (cost=0.00..4.36 rows=10 width=0)
     Index Cond: ((datasource)::text = 'two'::text)
-> Bitmap Index Scan on par_7_datasource (cost=0.00..4.36 rows=10 width=0)
     Index Cond: ((datasource)::text = 'three'::text)

In this particular query there were over _100_ partitions connected with the UNION ALL operator.

--
regards, pozdrawiam,
Jakub Glapa

On Mon, Nov 27, 2017 at 11:47 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Mon, Nov 27, 2017 at 10:54 PM, Jakub Glapa wrote:
> > The DB enters recovery mode after that.
>
> That's not good. So it actually crashes? Can you please show the
> full error messages?
>
> > 2017-11-23 07:20:39 CET::@:[24823]: ERROR: could not attach to dynamic
> > shared area
>
> From src/backend/utils/mmgr/dsa.c. The same message is used for
> dsa_attach() and for attach_internal(), but in this case it must be
> the latter because we use in-place DSA areas for parallel query. This
> means that when the worker tried to attach it found that
> control->refcnt == 0, meaning 'too late, every other backend including
> the leader has already detached'.
>
> > 2017-11-23 07:20:39 CET::@:[24822]: ERROR: could not map dynamic shared
> > memory segment
>
> From src/backend/access/transam/parallel.c, when dsm_attach returns
> null. I think this also means 'too late'.
>
> So those are both workers that started up and found that the leader
> had already abandoned the parallel query, but they discovered it at
> different stages. PID 24822 didn't manage to map the DSM segment,
> while PID 24823 did but found that the other(s) had already detached
> from the per-query DSA area inside it.
> > 2017-11-23 07:20:40 CET:192.168.xx,xx(33974):u(at)db:[24209]: ERROR: too
> > many dynamic shared memory segments
> >
> > The errors happen when parallel execution is enabled and multiple
> > queries are executed simultaneously.
> > If I set max_parallel_workers_per_gather = 0 the error doesn't occur.
> > The queries are rather big, each executes over 10sec.
> > I see the error when 4 or more queries are started at the same time.
> >
> > my postgresql.conf:
> >
> > max_connections = 100
>
> So the two questions we need to figure out are: (1) how do we manage
> to use up all 64 + 2 * 100 DSM slots (or somehow corrupt things so it
> looks that way) by running only 4 queries, and (2) why would we be
> crashing rather than failing and reporting an error? I'm not able to
> reproduce the problem from your description running lots of parallel
> queries at the same time. Can you provide reproduction steps?
> Does your query plan happen to include a Parallel Bitmap Heap Scan?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
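As a quick sketch of that arithmetic (illustrative Python, not PostgreSQL code; FIXED_SLOTS and SLOTS_PER_BACKEND are my labels for the 64 and 2 in the formula above, which correspond to compile-time constants in dsm.c):

```python
# Back-of-the-envelope model of the cluster-wide DSM slot budget.
# The constants mirror the 64 fixed slots plus 2 slots per backend
# mentioned in the thread (PostgreSQL 10 era defaults).
FIXED_SLOTS = 64
SLOTS_PER_BACKEND = 2

def dsm_slot_budget(max_connections: int) -> int:
    """Total DSM slots available to the whole cluster at startup."""
    return FIXED_SLOTS + SLOTS_PER_BACKEND * max_connections

# With the reporter's settings every Gather node and per-query DSA
# allocation across ALL sessions draws from this shared pool:
print(dsm_slot_budget(100))  # 264
```

With only 264 slots shared by the entire cluster, four concurrent queries each spanning ~100 partitioned Gather nodes can plausibly exhaust the pool.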
Re: ERROR: too many dynamic shared memory segments
Hi Thomas, doubling max_connections has made the problem go away for now! Yay!

As for the crash: I dug up the initial log and it looks like a segmentation fault...

2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: ERROR: too many dynamic shared memory segments
2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: STATEMENT: SELECT
2017-11-23 07:26:55 CET::@:[29398]: LOG: server process (PID 26992) was terminated by signal 11: Segmentation fault
2017-11-23 07:26:55 CET::@:[29398]: DETAIL: Failed process was running: SELECT .
2017-11-23 07:26:55 CET::@:[29398]: LOG: terminating any other active server processes
2017-11-23 07:26:55 CET:192.168.10.83(35242):user@db:[30009]: WARNING: terminating connection because of crash of another server process
2017-11-23 07:26:55 CET:192.168.10.83(35242):user@db:[30009]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-11-23 07:26:55 CET:192.168.10.83(35242):user@db:[30009]: HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-11-23 07:26:55 CET:192.168.10.83(35268):user@db:[30097]: WARNING: terminating connection because of crash of another server process
2017-11-23 07:26:55 CET:192.168.10.83(35268):user@db:[30097]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-11-23 07:26:55 CET:192.168.10.83(35268):user@db:[30097]: HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-11-23 07:26:55 CET:192.168.10.83(46614):user@db:[31080]: WARNING: terminating connection because of crash of another server process
2017-11-23 07:26:55 CET:192.168.10.83(46614):user@db:[31080]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-11-23 07:26:55 CET:192.168.10.83(46614):user@db: HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-11-23 07:26:55 CET:192.168.10.83(35238):user@db:[30003]: WARNING: terminating connection because of crash of another server process
2017-11-23 07:26:55 CET:192.168.10.83(35238):user@db:[30003]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-11-23 07:26:55 CET:192.168.10.83(35300):user@db:[30152]: FATAL: the database system is in recovery mode

--
regards, pozdrawiam,
Jakub Glapa

On Mon, Nov 27, 2017 at 7:53 PM, Thomas Munro wrote:
> On Tue, Nov 28, 2017 at 1:13 AM, Jakub Glapa wrote:
> > The queries are somewhat special.
> > We are still using the old-style partitioning (list type) but we abuse it a
> > bit when querying.
> > When querying a set of partitions, instead of going via the parent table we
> > stitch together the required tables with UNION ALL (this requires less
> > locking and was more performant in our benchmarks; the new native
> > partitioning might improve this but we haven't researched that yet).
> >
> > The queries are of the form:
> >
> > SELECT col1, col2, col3 FROM
> >     (SELECT *
> >        FROM par1
> >       WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
> >      UNION ALL
> >      SELECT *
> >        FROM par2
> >       WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
> >      UNION ALL
> >      SELECT *
> >        FROM par2
> >       WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
> >      UNION ALL
> >      ...
> >     ) unionalias
> > WHERE ...
> >
> > and indeed the query plan shows the usage of a Bitmap Heap Scan:
> >
> >   ->  Subquery Scan on "*SELECT* 2"  (cost=3068.58..19793.94 rows=1 width=1284)
> >         ->  Gather  (cost=3068.58..19793.93 rows=1 width=5014)
> >               Workers Planned: 2
> >               ->  Parallel Bitmap Heap Scan on par_6  (cost=2068.58..18
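The hand-stitched query shape described above can be sketched with a small generator (hypothetical illustration, not the reporter's actual code; table and column names are taken from the example query):

```python
def build_union_query(partitions, range_start, range_end, predicate="TRUE"):
    """Assemble a UNION ALL over explicitly named partition tables,
    mimicking the stitched-together queries described in the thread."""
    branch = (
        "SELECT * FROM {table} "
        "WHERE rangestart >= '{start}' AND rangeend <= '{end}'"
    )
    branches = "\n  UNION ALL ".join(
        branch.format(table=t, start=range_start, end=range_end)
        for t in partitions
    )
    return (
        f"SELECT col1, col2, col3 FROM (\n  {branches}\n"
        f") unionalias WHERE {predicate}"
    )

sql = build_union_query(["par1", "par2", "par3"],
                        "2017-01-01 00:00:00.0", "2017-11-26 23:59:59.999")
```

With 100+ partitions this yields 100+ branches, each potentially planned with its own Gather node, which is exactly what makes the DSM slot budget tight.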
Re: ERROR: too many dynamic shared memory segments
I see that the segfault is under active discussion, but just wanted to ask: is increasing max_connections to mitigate the DSM slot shortage the way to go?

--
regards,
Jakub Glapa

On Mon, Nov 27, 2017 at 11:48 PM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Tue, Nov 28, 2017 at 10:05 AM, Jakub Glapa wrote:
> > As for the crash. I dug up the initial log and it looks like a segmentation
> > fault...
> >
> > 2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: ERROR: too
> > many dynamic shared memory segments
>
> Hmm. Well, this error can only occur in dsm_create() called without
> DSM_CREATE_NULL_IF_MAXSEGMENTS. parallel.c calls it with that flag
> and dsa.c doesn't (perhaps it should, not sure, but that'd just change
> the error message), so that means the error arose from dsa.c trying
> to get more segments. That would be when Parallel Bitmap Heap Scan
> tried to allocate memory.
>
> I hacked my copy of PostgreSQL so that it allows only 5 DSM slots and
> managed to reproduce a segv crash by trying to run concurrent Parallel
> Bitmap Heap Scans.
> The stack looks like this:
>
> * frame #0: 0x0001083ace29 postgres`alloc_object(area=0x0000000000000000, size_class=10) + 25 at dsa.c:1433
>   frame #1: 0x0001083acd14 postgres`dsa_allocate_extended(area=0x0000000000000000, size=72, flags=4) + 1076 at dsa.c:785
>   frame #2: 0x000108059c33 postgres`tbm_prepare_shared_iterate(tbm=0x7f9743027660) + 67 at tidbitmap.c:780
>   frame #3: 0x000108000d57 postgres`BitmapHeapNext(node=0x7f9743019c88) + 503 at nodeBitmapHeapscan.c:156
>   frame #4: 0x000107fefc5b postgres`ExecScanFetch(node=0x7f9743019c88, accessMtd=(postgres`BitmapHeapNext at nodeBitmapHeapscan.c:77), recheckMtd=(postgres`BitmapHeapRecheck at nodeBitmapHeapscan.c:710)) + 459 at execScan.c:95
>   frame #5: 0x000107fef983 postgres`ExecScan(node=0x7f9743019c88, accessMtd=(postgres`BitmapHeapNext at nodeBitmapHeapscan.c:77), recheckMtd=(postgres`BitmapHeapRecheck at nodeBitmapHeapscan.c:710)) + 147 at execScan.c:162
>   frame #6: 0x0001080008d1 postgres`ExecBitmapHeapScan(pstate=0x7f9743019c88) + 49 at nodeBitmapHeapscan.c:735
>
> (lldb) f 3
> frame #3: 0x000108000d57 postgres`BitmapHeapNext(node=0x7f9743019c88) + 503 at nodeBitmapHeapscan.c:156
>    153      * dsa_pointer of the iterator state which will be used by
>    154      * multiple processes to iterate jointly.
>    155      */
> -> 156      pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
>    157  #ifdef USE_PREFETCH
>    158      if (node->prefetch_maximum > 0)
>    159
> (lldb) print tbm->dsa
> (dsa_area *) $3 = 0x0000000000000000
> (lldb) print node->ss.ps.state->es_query_dsa
> (dsa_area *) $5 = 0x0000000000000000
> (lldb) f 17
> frame #17: 0x00010800363b postgres`ExecGather(pstate=0x7f9743019320) + 635 at nodeGather.c:220
>    217      * Get next tuple, either from one of our workers, or by running the plan
>    218      * ourselves.
>    219      */
> -> 220      slot = gather_getnext(node);
>    221      if (TupIsNull(slot))
>    222          return NULL;
>    223
> (lldb) print *node->pei
> (ParallelExecutorInfo) $8 = {
>   planstate = 0x7f9743019640
>   pcxt = 0x7f97450001b8
>   buffer_usage = 0x000108b7e218
>   instrumentation = 0x000108b7da38
>   area = 0x0000000000000000
>   param_exec = 0
>   finished = '\0'
>   tqueue = 0x0000000000000000
>   reader = 0x0000000000000000
> }
> (lldb) print *node->pei->pcxt
> warning: could not load any Objective-C class information. This will
> significantly reduce the quality of type information available.
> (ParallelContext) $9 = {
>   node = {
>     prev = 0x00010855fb60
>     next = 0x00010855fb60
>   }
>   subid = 1
>   nworkers = 0
>   nworkers_launched = 0
>   library_name = 0x7f9745000248 "postgres"
>   function_name = 0x7f9745000268 "ParallelQueryMain"
>   error_context_stack = 0x0000000000000000
>   estimator = (space_for_chunks = 180352, number_of_keys = 19)
>   seg = 0x0000000000000000
>   private_memory = 0x000108b53038
>   toc = 0x000108b53038
>   worker = 0x0000000000000000
> }
>
> I think there are two failure modes: one of your sessions showed the
> "too many ..." error (that's good: it ran out of slots and said so, and
> our error machinery worked as it should), and another crashed with a
> segfault because it tried to use a NULL "area" pointer (bad). I think
> this is a degenerate case where we completely failed to launch the
> parallel query, but we ran the parallel query plan anyway and this
> code thinks that the DSA is available. Oops.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
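The degenerate case described here (running the parallel plan even though workers and the DSA area never came up) can be sketched abstractly. This is illustrative Python, not PostgreSQL's actual C code; the class and function names are invented and only echo the trace above:

```python
class ParallelContext:
    """Toy stand-in for a parallel query context whose shared-memory
    setup may fail when the cluster runs out of DSM segments."""
    def __init__(self, dsa_area):
        # None models the NULL "area" pointer seen in the lldb dump.
        self.dsa_area = dsa_area

def bitmap_heap_scan(ctx):
    """Sketch of the scan step that crashed. The buggy code used
    ctx.dsa_area unconditionally; checking for the degenerate case
    and falling back to a serial path is the spirit of the fix."""
    if ctx.dsa_area is None:
        return "serial fallback"
    return f"shared iterate in {ctx.dsa_area}"

# Parallel setup failed -> plan still runs, but without shared state:
print(bitmap_heap_scan(ParallelContext(None)))  # serial fallback
```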
Re: ERROR: too many dynamic shared memory segments
Thank you, Thomas!

--
regards,
Jakub Glapa

On Thu, Dec 7, 2017 at 10:30 PM, Thomas Munro wrote:
> On Tue, Dec 5, 2017 at 1:18 AM, Jakub Glapa wrote:
> > I see that the segfault is under active discussion but just wanted to ask if
> > increasing the max_connections to mitigate the DSM slots shortage is the way
> > to go?
>
> Hi Jakub,
>
> Yes. In future releases this situation will improve (maybe we'll
> figure out how to use one DSM segment for all the Gather nodes in your
> query plan, and maybe it'll be moot anyway because maybe we'll be able
> to use a Parallel Append for queries like yours, so that it uses the
> same set of workers over all the child plans instead of the
> fork()-fest you're presumably seeing). For now your only choice, if
> you want that plan to run, is to crank up max_connections so that the
> total number of concurrently executing Gather nodes is less than about
> 64 + 2 * max_connections. There is also a crash bug right now in the
> out-of-slots case as discussed, fixed in the next point release, but
> even with that fix in place you'll still need a high enough
> max_connections setting to be sure to be able to complete the query
> without an error.
>
> Thanks for the report!
>
> --
> Thomas Munro
> http://www.enterprisedb.com
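Thomas's sizing rule of thumb can also be turned around to estimate a setting (a rough sketch under the thread's "64 + 2 * max_connections" assumption; real headroom should be validated empirically, since other backends consume slots too):

```python
import math

# Constants from the rule of thumb quoted in this thread.
FIXED_SLOTS = 64
SLOTS_PER_BACKEND = 2

def min_max_connections(concurrent_gather_nodes: int) -> int:
    """Smallest max_connections such that the expected number of
    concurrently executing Gather nodes roughly fits the DSM slot
    budget of FIXED_SLOTS + SLOTS_PER_BACKEND * max_connections."""
    needed = concurrent_gather_nodes - FIXED_SLOTS
    if needed <= 0:
        return 0  # the fixed budget already suffices
    return math.ceil(needed / SLOTS_PER_BACKEND)

# e.g. ~100 partitions x 4 concurrent queries, roughly 400 Gather nodes:
print(min_max_connections(400))  # 168
```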