Re: pgbench: could not connect to server: Resource temporarily unavailable
Sorry Tom for the duplicate email. Resending with the mailing list.

> Thanks for your response. I'm using a Centos Linux environment and have
> the open files set very high:
>
> -bash-4.2$ ulimit -a|grep open
> open files                      (-n) 65000
>
> What else could be limiting the connections?
>
> Kevin
>
> On Sat, 20 Aug 2022 at 21:20, Tom Lane wrote:
>
>> Kevin McKibbin writes:
>> > What's limiting my DB from allowing more connections?
>>
>> > This is a sample of the output I'm getting, which repeats the error 52
>> > times (one for each failed connection)
>>
>> > -bash-4.2$ pgbench -c 200 -j 200 -t 100 benchy
>> > ...
>> > connection to database "benchy" failed:
>> > could not connect to server: Resource temporarily unavailable
>> >         Is the server running locally and accepting
>> >         connections on Unix domain socket
>> > "/var/run/postgresql/.s.PGSQL.5432"?
>>
>> This is apparently a client-side failure not a server-side failure
>> (you could confirm that by seeing whether any corresponding
>> failure shows up in the postmaster log). That means that the
>> kernel wouldn't honor pgbench's attempt to open a connection,
>> which implies you haven't provisioned enough networking resources
>> to support the number of connections you want. Since you haven't
>> mentioned what platform this is on, it's impossible to say more
>> than that --- but it doesn't look like Postgres configuration
>> settings are at issue at all.
>>
>> regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
On 2022-08-20 Sa 23:20, Tom Lane wrote:
> Kevin McKibbin writes:
>> What's limiting my DB from allowing more connections?
>> This is a sample of the output I'm getting, which repeats the error 52
>> times (one for each failed connection)
>> -bash-4.2$ pgbench -c 200 -j 200 -t 100 benchy
>> ...
>> connection to database "benchy" failed:
>> could not connect to server: Resource temporarily unavailable
>> Is the server running locally and accepting
>> connections on Unix domain socket
>> "/var/run/postgresql/.s.PGSQL.5432"?
> This is apparently a client-side failure not a server-side failure
> (you could confirm that by seeing whether any corresponding
> failure shows up in the postmaster log). That means that the
> kernel wouldn't honor pgbench's attempt to open a connection,
> which implies you haven't provisioned enough networking resources
> to support the number of connections you want. Since you haven't
> mentioned what platform this is on, it's impossible to say more
> than that --- but it doesn't look like Postgres configuration
> settings are at issue at all.

The first question in my mind from the above is where this postgres
instance is actually listening. Is it really /var/run/postgresql? Its
postmaster.pid will tell you. I have often seen client programs pick up
a system libpq which is compiled with a different default socket directory.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: pgbench: could not connect to server: Resource temporarily unavailable
Andrew Dunstan writes:
> On 2022-08-20 Sa 23:20, Tom Lane wrote:
>> Kevin McKibbin writes:
>>> What's limiting my DB from allowing more connections?

> The first question in my mind from the above is where this postgres
> instance is actually listening. Is it really /var/run/postgresql? Its
> postmaster.pid will tell you. I have often seen client programs pick up
> a system libpq which is compiled with a different default socket directory.

I wouldn't think that'd explain a symptom of some connections succeeding
and others not within the same pgbench run.

I tried to duplicate this behavior locally (on RHEL8) and got something
interesting. After increasing the server's max_connections to 1000,
I can do

    $ pgbench -S -c 200 -j 100 -t 100 bench

and it goes through fine. But:

    $ pgbench -S -c 200 -j 200 -t 100 bench
    pgbench (16devel)
    starting vacuum...end.
    pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5440" failed: Resource temporarily unavailable
            Is the server running locally and accepting connections on that socket?
    pgbench: error: could not create connection for client 154

So whatever is triggering this has nothing to do with the server,
but with how many threads are created inside pgbench. I notice
also that sometimes it works, making it seem like possibly a race
condition. Either that or there's some limitation on how fast
threads within a process can open sockets.

Also, I determined that libpq's connect() call is failing synchronously
(we get EAGAIN directly from the connect() call, not later). I wondered
if libpq should accept EAGAIN as a synonym for EINPROGRESS, but no:
that just makes it fail on the next touch of the socket.

The only documented reason for connect(2) to fail with EAGAIN is

       EAGAIN Insufficient entries in the routing cache.

which seems pretty unlikely to be the issue here, since all these
connections are being made to the same local address.

On the whole this is smelling more like a Linux kernel bug than
anything else.

regards, tom lane
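
To make the possible outcomes of a non-blocking connect() concrete, here is a minimal Python sketch (not libpq code) that makes one attempt on a Unix-domain socket and labels the result. The function name `classify_connect` and its return labels are invented for illustration.

```python
import errno
import socket


def classify_connect(path: str) -> str:
    """Make one non-blocking connect() to a Unix socket and label the outcome.

    Illustrative helper only; the name and return labels are hypothetical.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # connect_ex() returns the errno value instead of raising OSError.
        err = sock.connect_ex(path)
    finally:
        sock.close()

    if err == 0:
        return "connected"
    if err == errno.EINPROGRESS:
        return "in progress"   # the usual "wait for writability" case
    if err in (errno.EAGAIN, errno.EWOULDBLOCK):
        return "EAGAIN"        # the synchronous failure reported in this thread
    return errno.errorcode.get(err, str(err))
```

The point of separating "in progress" from "EAGAIN" is that only the former leaves a socket that can usefully be polled; after EAGAIN the connect() itself has to be issued again.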
Re: pgbench: could not connect to server: Resource temporarily unavailable
On 2022-08-21 Su 17:15, Tom Lane wrote:
> Andrew Dunstan writes:
>> On 2022-08-20 Sa 23:20, Tom Lane wrote:
>>> Kevin McKibbin writes:
>>>> What's limiting my DB from allowing more connections?
>> The first question in my mind from the above is where this postgres
>> instance is actually listening. Is it really /var/run/postgresql? Its
>> postmaster.pid will tell you. I have often seen client programs pick up
>> a system libpq which is compiled with a different default socket directory.
> I wouldn't think that'd explain a symptom of some connections succeeding
> and others not within the same pgbench run.

Oh, yes, I agree, I missed that aspect of it.

> I tried to duplicate this behavior locally (on RHEL8) and got something
> interesting. After increasing the server's max_connections to 1000,
> I can do
>
> $ pgbench -S -c 200 -j 100 -t 100 bench
>
> and it goes through fine. But:
>
> $ pgbench -S -c 200 -j 200 -t 100 bench
> pgbench (16devel)
> starting vacuum...end.
> pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5440" failed: Resource temporarily unavailable
>         Is the server running locally and accepting connections on that socket?
> pgbench: error: could not create connection for client 154
>
> So whatever is triggering this has nothing to do with the server,
> but with how many threads are created inside pgbench. I notice
> also that sometimes it works, making it seem like possibly a race
> condition. Either that or there's some limitation on how fast
> threads within a process can open sockets.
>
> Also, I determined that libpq's connect() call is failing synchronously
> (we get EAGAIN directly from the connect() call, not later). I wondered
> if libpq should accept EAGAIN as a synonym for EINPROGRESS, but no:
> that just makes it fail on the next touch of the socket.
>
> The only documented reason for connect(2) to fail with EAGAIN is
>
>        EAGAIN Insufficient entries in the routing cache.
>
> which seems pretty unlikely to be the issue here, since all these
> connections are being made to the same local address.
>
> On the whole this is smelling more like a Linux kernel bug than
> anything else.

*nod*

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: pgbench: could not connect to server: Resource temporarily unavailable
Andrew Dunstan writes:
> On 2022-08-21 Su 17:15, Tom Lane wrote:
>> On the whole this is smelling more like a Linux kernel bug than
>> anything else.

> *nod*

Conceivably we could work around this in libpq: on EAGAIN, just
retry the failed connect(), or maybe better to close the socket
and take it from the top with the same target server address.

On the one hand, reporting EAGAIN certainly sounds like an
invitation to do just that. On the other hand, if the failure
is persistent then libpq is locked up in a tight loop --- and
"Insufficient entries in the routing cache" doesn't seem like a
condition that would clear immediately.

It's also pretty unclear why the kernel would want to return
EAGAIN instead of letting the nonblock connection path do the
waiting, which is why I'm suspecting a bug rather than designed
behavior.

I think I'm disinclined to install such a workaround unless we
get confirmation from some kernel hacker that it's operating
as designed and application-level retry is intended.

regards, tom lane
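
For readers who want to see what "close the socket and take it from the top" could look like, here is a rough Python sketch of a bounded retry. It is not libpq behaviour and not a proposed patch; `connect_with_retry`, the attempt count and the delay are all hypothetical. The cap and the growing sleep are there precisely to avoid the tight loop that the message above warns about.

```python
import errno
import socket
import time


def connect_with_retry(path, attempts=5, delay=0.05):
    """Non-blocking connect to a Unix socket, retrying only on EAGAIN.

    Hypothetical helper. Returns (sock, 0) on success or (None, errno)
    on failure. EINPROGRESS is not handled; a real client would wait for
    the socket to become writable in that case.
    """
    err = errno.EAGAIN
    for attempt in range(attempts):
        # Start over with a fresh socket on every attempt.
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.setblocking(False)
        err = sock.connect_ex(path)
        if err == 0:
            return sock, 0
        sock.close()
        if err not in (errno.EAGAIN, errno.EWOULDBLOCK):
            break  # a different error: retrying would not help
        if attempt < attempts - 1:
            time.sleep(delay * (2 ** attempt))  # back off instead of spinning
    return None, err
```

Even with a cap, a persistent condition still costs the caller the full backoff time before the error is reported, which is one reason to be wary of building this into a library.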
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Mon, Aug 22, 2022 at 9:48 AM Tom Lane wrote:
> It's also pretty unclear why the kernel would want to return
> EAGAIN instead of letting the nonblock connection path do the
> waiting, which is why I'm suspecting a bug rather than designed
> behavior.

Could it be that it fails like that if the listen queue is full on the
other side?

https://github.com/torvalds/linux/blob/master/net/unix/af_unix.c#L1493

If it's something like that, maybe increasing
/proc/sys/net/core/somaxconn would help? I think older kernels only
had 128 here.
Re: pgbench: could not connect to server: Resource temporarily unavailable
Hi,

On 2022-08-21 17:15:01 -0400, Tom Lane wrote:
> I tried to duplicate this behavior locally (on RHEL8) and got something
> interesting. After increasing the server's max_connections to 1000,
> I can do
>
> $ pgbench -S -c 200 -j 100 -t 100 bench
>
> and it goes through fine. But:
>
> $ pgbench -S -c 200 -j 200 -t 100 bench
> pgbench (16devel)
> starting vacuum...end.
> pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5440" failed: Resource temporarily unavailable
>         Is the server running locally and accepting connections on that socket?
> pgbench: error: could not create connection for client 154
>
> So whatever is triggering this has nothing to do with the server,
> but with how many threads are created inside pgbench. I notice
> also that sometimes it works, making it seem like possibly a race
> condition. Either that or there's some limitation on how fast
> threads within a process can open sockets.

I think it's more likely to be caused by the net.core.somaxconn sysctl
limiting the size of the listen backlog. The threads part just influences
the speed at which new connections are made, and thus how quickly the
backlog is filled.

Do you get the same behaviour if you set net.core.somaxconn to higher than
the number of connections? IIRC you need to restart postgres for it to take
effect.

Greetings,

Andres Freund
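
The check suggested here (is somaxconn higher than the number of connections?) can be sketched in a few lines of Python. The procfs path is the Linux one mentioned earlier in the thread; the function names are hypothetical, and the comparison is only a rough heuristic, since the thread already shows 200 clients succeeding when they connect more slowly.

```python
from pathlib import Path

SOMAXCONN_PATH = "/proc/sys/net/core/somaxconn"  # Linux-specific location


def read_somaxconn(path=SOMAXCONN_PATH):
    """Return the kernel's listen backlog limit as an int, read from procfs."""
    return int(Path(path).read_text().strip())


def backlog_may_overflow(clients, somaxconn):
    """True if more clients could connect at once than the backlog can hold.

    A rough heuristic, not an exact model of the kernel's queue accounting:
    whether the queue really overflows depends on how fast clients connect
    and how fast the server accepts.
    """
    return clients > somaxconn


if __name__ == "__main__":
    limit = read_somaxconn()
    print(f"somaxconn={limit}, 200 clients may overflow: "
          f"{backlog_may_overflow(200, limit)}")
```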
Re: pgbench: could not connect to server: Resource temporarily unavailable
Thomas Munro writes:
> If it's something like that, maybe increasing
> /proc/sys/net/core/somaxconn would help? I think older kernels only
> had 128 here.

Bingo! I see

    $ cat /proc/sys/net/core/somaxconn
    128

by default, which is right about where the problem starts. After

    $ sudo sh -c 'echo 1000 >/proc/sys/net/core/somaxconn'

*and restarting the PG server*, I can do a lot more threads without
a problem. Evidently, the server's socket's listen queue length
is fixed at creation and adjusting the kernel limit won't immediately
change it.

So what we've got is that EAGAIN from connect() on a Unix socket can
mean "listen queue overflow" and the kernel won't treat that as a
nonblock-waitable condition. Still seems like a kernel bug perhaps,
or at least a misfeature.

Not sure what I think at this point about making libpq retry after
EAGAIN. It would make sense for this particular undocumented use
of EAGAIN, but I'm worried about others, especially the documented
reason. On the whole I'm inclined to leave the code alone;
but is there sufficient reason to add something about adjusting
somaxconn to our documentation?

regards, tom lane
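
The listen-queue overflow can be reproduced without PostgreSQL at all. The following Python sketch is an illustration, not taken from the thread: it opens a Unix-domain listener with a deliberately tiny backlog, never calls accept(), and records what each non-blocking connect returns. The function name and the default numbers are hypothetical. On Linux the later attempts are expected to report EAGAIN; other systems may report ECONNREFUSED instead.

```python
import errno
import os
import socket
import tempfile


def fill_listen_queue(backlog=2, attempts=10):
    """Open `attempts` non-blocking connections to a listener that never accepts.

    Returns the list of errno values from connect_ex() (0 means success).
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(backlog)  # the queue length is set by this call
    clients, results = [], []
    try:
        for _ in range(attempts):
            c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            c.setblocking(False)
            results.append(c.connect_ex(path))
            clients.append(c)  # keep it open so the queue stays occupied
    finally:
        for c in clients:
            c.close()
        server.close()
        os.unlink(path)
        os.rmdir(tmpdir)
    return results


if __name__ == "__main__":
    for err in fill_listen_queue():
        print("ok" if err == 0 else errno.errorcode.get(err, err))
```

Because the backlog is passed to listen() once, raising the kernel limit afterwards does not enlarge an existing queue, which matches the observation that the server had to be restarted.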
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Mon, Aug 22, 2022 at 10:55 AM Tom Lane wrote:
> Not sure what I think at this point about making libpq retry after
> EAGAIN. It would make sense for this particular undocumented use
> of EAGAIN, but I'm worried about others, especially the documented
> reason. On the whole I'm inclined to leave the code alone;
> but is there sufficient reason to add something about adjusting
> somaxconn to our documentation?

My Debian system apparently has a newer man page:

       EAGAIN For nonblocking UNIX domain sockets, the socket is nonblocking,
              and the connection cannot be completed immediately. For other
              socket families, there are insufficient entries in the routing
              cache.

Yeah retrying doesn't seem that nice. +1 for a bit of documentation,
which I guess belongs in the server tuning part where we talk about
sysctls, perhaps with a link somewhere near max_connections? More
recent Linux kernels bumped it to 4096 by default so I doubt it'll
come up much in the future, though. Note that we also call listen()
with a backlog value capped to our own PG_SOMAXCONN which is 1000. I
doubt many people benchmark with higher numbers of connections but
it'd be nicer if it worked when you do...

I was curious and checked how FreeBSD would handle this. Instead of
EAGAIN you get ECONNREFUSED here, until you crank up
kern.ipc.somaxconn, which also defaults to 128 like older Linux.
Re: pgbench: could not connect to server: Resource temporarily unavailable
Thomas Munro writes:
> Yeah retrying doesn't seem that nice. +1 for a bit of documentation,
> which I guess belongs in the server tuning part where we talk about
> sysctls, perhaps with a link somewhere near max_connections? More
> recent Linux kernels bumped it to 4096 by default so I doubt it'll
> come up much in the future, though.

Hmm. It'll be awhile till the 128 default disappears entirely
though, especially if assorted BSDen use that too. Probably
worth the trouble to document.

> Note that we also call listen()
> with a backlog value capped to our own PG_SOMAXCONN which is 1000. I
> doubt many people benchmark with higher numbers of connections but
> it'd be nicer if it worked when you do...

Actually it's 10000. Still, I wonder if we couldn't just remove
that limit now that we've desupported a bunch of stone-age kernels.
It's hard to believe any modern kernel can't defend itself against
silly listen-queue requests.

regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Mon, Aug 22, 2022 at 12:20 PM Tom Lane wrote:
> Thomas Munro writes:
> > Yeah retrying doesn't seem that nice. +1 for a bit of documentation,
> > which I guess belongs in the server tuning part where we talk about
> > sysctls, perhaps with a link somewhere near max_connections? More
> > recent Linux kernels bumped it to 4096 by default so I doubt it'll
> > come up much in the future, though.
>
> Hmm. It'll be awhile till the 128 default disappears entirely
> though, especially if assorted BSDen use that too. Probably
> worth the trouble to document.

I could try to write a doc patch if you aren't already on it.

> > Note that we also call listen()
> > with a backlog value capped to our own PG_SOMAXCONN which is 1000. I
> > doubt many people benchmark with higher numbers of connections but
> > it'd be nicer if it worked when you do...
>
> Actually it's 10000. Still, I wonder if we couldn't just remove
> that limit now that we've desupported a bunch of stone-age kernels.
> It's hard to believe any modern kernel can't defend itself against
> silly listen-queue requests.

Oh, right. Looks like that was just paranoia in commit 153f4006763,
back when you got away from using the (very conservative) SOMAXCONN
macro. Looks like that was 5 on ancient systems going back to the
original sockets stuff, and later 128 was a popular number. Yeah I'd
say +1 for removing our cap. I'm pretty sure every system will
internally cap whatever value we pass in if it doesn't like it, as
POSIX explicitly says it can freely do with this "hint".

The main thing I learned today is that Linux's connect(AF_UNIX)
implementation doesn't refuse connections when the listen backlog is
full, unlike other OSes. Instead, for blocking sockets, it sleeps and
wakes with everyone else to fight over space. I *guess* for
non-blocking sockets that introduced a small contradiction -- there
isn't the state space required to give you a working EINPROGRESS with
the same sort of behaviour (if you reified a secondary queue for that
you might as well make the primary one larger...), but they also
didn't want to give you ECONNREFUSED just because you're non-blocking,
so they went with EAGAIN, because you really do need to call again
with the sockaddr. The reason I wouldn't want to call it again is
that I guess it'd be a busy CPU burning loop until progress can be
made, which isn't nice, and failing with "Resource temporarily
unavailable" to the user does in fact describe the problem, if
somewhat vaguely. Hmm, maybe we could add a hint to the error,
though?
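
As a back-of-the-envelope model of the silent clamping discussed above, the backlog actually in force is the smallest of the requested value, any application-side cap, and the kernel limit. The Python sketch below is a simplification (it ignores details such as a kernel allowing one entry more than the nominal backlog), and the function name is hypothetical.

```python
def effective_backlog(requested, kernel_limit, app_cap=None):
    """Backlog in force after the application cap and the kernel limit.

    `app_cap=None` models an application that passes its request through
    unchanged and relies on the kernel to clamp it.
    """
    if app_cap is not None:
        requested = min(requested, app_cap)
    return min(requested, kernel_limit)


if __name__ == "__main__":
    # 128 and 4096 are the kernel defaults mentioned in this thread;
    # the requested value of 2000 is made up for the example.
    print(effective_backlog(2000, kernel_limit=128))
    print(effective_backlog(2000, kernel_limit=4096))
```

Under this model an application-side cap only matters when it is lower than the kernel limit, which is the argument for dropping it.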
Re: pgbench: could not connect to server: Resource temporarily unavailable
Thomas Munro writes:
> On Mon, Aug 22, 2022 at 12:20 PM Tom Lane wrote:
>> Hmm. It'll be awhile till the 128 default disappears entirely
>> though, especially if assorted BSDen use that too. Probably
>> worth the trouble to document.

> I could try to write a doc patch if you aren't already on it.

I haven't done anything about it yet, but could do so tomorrow or so.

(BTW, I just finished discovering that NetBSD has the same 128 limit.
It looks like they intended to make that settable via sysctl, because
it's a variable not a constant; but they haven't actually wired up the
variable to sysctl yet.)

> Oh, right. Looks like that was just paranoia in commit 153f4006763,
> back when you got away from using the (very conservative) SOMAXCONN
> macro. Looks like that was 5 on ancient systems going back to the
> original sockets stuff, and later 128 was a popular number. Yeah I'd
> say +1 for removing our cap. I'm pretty sure every system will
> internally cap whatever value we pass in if it doesn't like it, as
> POSIX explicitly says it can freely do with this "hint".

Yeah. I hadn't thought to check the POSIX text, but their listen(2)
page is pretty clear that implementations should *silently* reduce
the value to what they can handle, not fail. Also, SUSv2 says the
same thing in different words, so the requirement's been that way
for a very long time. I think we could drop this ancient bit of
paranoia.

> ... Hmm, maybe we could add a hint to the error,
> though?

libpq doesn't really have a notion of hints --- perhaps we ought to
fix that sometime. But this doesn't seem like a very exciting place
to start, given the paucity of prior complaints. (And anyway people
using other client libraries wouldn't be helped.) I think some
documentation in the "Managing Kernel Resources" section should be
plenty for this.

regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Mon, Aug 22, 2022 at 2:18 PM Tom Lane wrote:
> Thomas Munro writes:
> > On Mon, Aug 22, 2022 at 12:20 PM Tom Lane wrote:
> >> Hmm. It'll be awhile till the 128 default disappears entirely
> >> though, especially if assorted BSDen use that too. Probably
> >> worth the trouble to document.
>
> > I could try to write a doc patch if you aren't already on it.
>
> I haven't done anything about it yet, but could do so tomorrow or so.

Cool. BTW small correction to something I said about FreeBSD: it'd be
better to document the new name kern.ipc.soacceptqueue (see listen(2)
HISTORY) even though the old name still works and matches OpenBSD and
macOS.
Re: pgbench: could not connect to server: Resource temporarily unavailable
Thomas Munro writes:
> Cool. BTW small correction to something I said about FreeBSD: it'd be
> better to document the new name kern.ipc.soacceptqueue (see listen(2)
> HISTORY) even though the old name still works and matches OpenBSD and
> macOS.

Thanks. Sounds like we get to document at least three different
sysctl names for this setting :-(

regards, tom lane
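
The platform-specific names collected over the course of this thread can be summarized as a small lookup table. The Python sketch below is only a summary of what was said above, not something checked against each OS release; the helper name `backlog_sysctl` is hypothetical, and NetBSD is left out because, per the earlier message, its variable is not exposed through sysctl.

```python
import sys

# Names as mentioned in this thread; not verified against each OS release.
BACKLOG_SYSCTL = {
    "linux": "net.core.somaxconn",
    "freebsd": "kern.ipc.soacceptqueue",  # older name kern.ipc.somaxconn still works
    "openbsd": "kern.ipc.somaxconn",
    "darwin": "kern.ipc.somaxconn",       # macOS
}


def backlog_sysctl(platform=sys.platform):
    """Return the sysctl name for the listen backlog limit, or None if unknown."""
    for prefix, name in BACKLOG_SYSCTL.items():
        # sys.platform may carry a version suffix, e.g. "freebsd13".
        if platform.startswith(prefix):
            return name
    return None
```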
