On Fri, Jun 14, 2013 at 9:27 PM, Roberto De Ioris <[email protected]> wrote:
>> On Fri, Jun 14, 2013 at 6:46 AM, Roberto De Ioris <[email protected]> wrote:
>>
>>>> Hi, uwsgi users,
>>>>
>>>> I need help in trying to determine where a strange problem comes from
>>>> and how to fix it.
>>>>
>>>> TL;DR: the socket backlog fills up for one of the vassals with no
>>>> visible reason; only restarting the emperor helps.
>>>>
>>>> I've had this problem a couple of times, and only restarting uwsgi
>>>> fixed it. nginx on the frontend started giving out 502 errors, saying:
>>>>
>>>> connect() to unix:///var/lib/crm/uwsgi_includes/slontur.socket failed
>>>> (11: Resource temporarily unavailable)
>>>>
>>>> This led me to the thought that this might be due to an overfilled
>>>> socket backlog.
>>>>
>>>> "netstat" output had hundreds of lines like this:
>>>>
>>>> unix 2 [ ] STREAM CONNECTING 0 /var/lib/crm/uwsgi_includes/slontur.socket
>>>> unix 2 [ ] STREAM CONNECTING 0 /var/lib/crm/uwsgi_includes/slontur.socket
>>>> ....
>>>>
>>>> Unfortunately, I did not have time to inspect netstat's output more
>>>> carefully, experimenting with command line parameters.
>>>>
>>>> I could not understand where all of these connections were coming
>>>> from, but only nginx and the emperor could potentially make all these
>>>> connections, so I tried restarting nginx to see if it helps. It did
>>>> not.
>>>>
>>>> Touching the ".ini" file did not reload the vassal either. So only
>>>> restarting the emperor itself helped.
>>>>
>>>> The project is not under heavy load, so these connections were
>>>> definitely not coming from browsers at the same time. All the
>>>> background threads in the app (it's running background threads with
>>>> gevent) were also not running, because there was nothing in the logs.
>>>>
>>>> uwsgi.log did not have any messages either.
>>>>
>>>> The app has "heartbeat = 20" in the .ini file, but that did not make
>>>> it reload by itself.
>>>>
>>>> Here are the configs:
>>>>
>>>> The vassal, which stopped working:
>>>>
>>>> [uwsgi]
>>>> logfile-chown = crm
>>>> disable-logging =
>>>> home = /var/lib/crm/.virtualenvs/crm
>>>> logto = /var/lib/crm/homes/%n/logs/uwsgi.log
>>>> gid = crm-%n
>>>> env = LC_ALL=en_US.UTF-8
>>>> env = LANG=en_US.UTF-8
>>>> env = SERVER_SOFTWARE=gevent
>>>> env = DJANGO_SETTINGS_MODULE=%n_settings
>>>> worker-reload-mercy = 5
>>>> gevent = 100
>>>> idle = 86400
>>>> harakiri = 30
>>>> reload-mercy = 5
>>>> lazy-apps = true
>>>> cheap = true
>>>> heartbeat = 20
>>>> pythonpath = /var/lib/crm/src
>>>> pythonpath = /var/lib/crm/homes/%n
>>>> harakiri-verbose =
>>>> uid = crm-%n
>>>> chdir = /var/lib/crm/homes/%n
>>>> wsgi = crm.deploy.gevent_wsgi
>>>> die-on-idle = true
>>>>
>>>> The emperor:
>>>>
>>>> exec uwsgi --logto /var/log/uwsgi/emperor.log \
>>>>     --die-on-term --emperor "/var/lib/crm/uwsgi_includes" \
>>>>     --emperor-tyrant \
>>>>     --emperor-on-demand-directory "/var/lib/crm/uwsgi_includes"
>>>>
>>>> Thanks for your time,
>>>> Igor Katson.
>>>
>>> You have a single process, so if your app is blocked (for whatever
>>> reason) your whole instance will be blocked. The best way to understand
>>> what is going on would be adding the stats server, so when the app is
>>> blocked you can ask it for the whole server status.
>>>
>>> The heartbeat ensures the master is alive, while in your case the
>>> worker is stuck.
>>>
>>> By the way, when you experience the problem, try only touching the
>>> config file. There is no need to reload the whole emperor stack.
>>
>> I enabled the stats to be able to inspect the problem next time.
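For anyone following along, enabling the stats server is a one-line addition to the vassal config, and the status can then be dumped while the app is stuck. This is only a sketch: the socket paths below are illustrative, and `uwsgitop` is a separate package, not part of uwsgi itself.

```shell
# Example vassal .ini addition (path is illustrative):
#
#   stats = /var/lib/crm/uwsgi_includes/%n-stats.socket
#
# While the instance is stuck, dump the JSON status over the socket:
nc -U /var/lib/crm/uwsgi_includes/slontur-stats.socket
# or watch it live with the separate uwsgitop tool:
uwsgitop /var/lib/crm/uwsgi_includes/slontur-stats.socket
```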
>> Unfortunately, I still can't find a way to see who is connecting to the
>> unix socket; it seems there's no valid tool for that.
>
> I am not sure I understand what you mean; the stats server gives you the
> currently running connections. The problem is that with unix sockets you
> do not have the listen queue size, but it should not be a big problem.

I meant that I cannot reproduce the problem artificially, so when it
happens next time I will dump the stats server output. The other issue is
that I cannot find, with netstat or any other command, which process is
connecting to the unix socket, only the one which is listening. If it were
TCP, "netstat -p" would show the pid of the client socket as well as the
server socket, but for unix sockets, only the listener pid is shown.

> Another great tool for debugging is the strace command. Just
>
> strace -s 1000 -p <pid>
>
> where <pid> is the pid of the stuck worker.

Cool! I'll try it out if it happens again.

> The tracebacker generally is the silver bullet, but unfortunately
> combining it with gevent leads to pretty hard-to-debug information.
>
>> In the meantime, is there any way to make uwsgi kill the worker, or
>> anything like that which will make it self-heal in a situation like the
>> above?
>
> If the harakiri is not triggering, well, the problem could be much more
> complex (like a db problem and so on). Are you sure all of the parts of
> the app are gevent-friendly? (Pay attention to the db adapter, as they
> are generally the weak point.)

Well, I cannot be 100% sure, but as far as I know, yes. The db is psycopg2
with "gevent_psycopg2" applied. Everything else involving the network is
pure Python. I believe gdb or strace will show if the problem is inside
some other C code. The fact that all the background jobs running inside
the process were not working may show that you are right here, and the
gevent loop was blocked.
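On the "who is connecting to the unix socket" question, a couple of tools get closer than plain netstat. These are generic examples, not specific to this setup, and peer-pid resolution for unix sockets is limited even with them (root privileges are usually needed to see other users' processes):

```shell
# Show unix sockets matching the path, with owning processes where known:
ss -xp | grep slontur.socket

# List processes that have unix-domain sockets open, filtered by path
# (shows the listener; connected peers only where lsof can resolve them):
lsof -U | grep slontur.socket
```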
> Whenever a new connection is accepted by a gevent core, the harakiri is
> reset, so maybe you are not completely stuck.

Can enabling the master help with this? I.e., if the worker is stuck, the
master shouldn't be, right? I had no "master = true" line in the ini file,
but there are always 2 processes running for each ini file, so I'm not
sure if the master was running. I have now added "master = true" just in
case it may help.

> Is there anything in the uwsgi logs?

No, the logs are empty, absolutely nothing there.

Thank you, Roberto, for your advice. I will be more prepared to meet the
problem next time with strace and gdb.
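For the gdb side, a typical session against a stuck CPython worker looks roughly like this. The pid is a placeholder, and the `py-bt` command only works if the CPython gdb extensions and matching debug symbols are installed:

```shell
# Attach to the stuck worker (placeholder pid):
gdb -p <pid>
# Then, at the (gdb) prompt:
#   (gdb) bt        # C-level backtrace: shows a blocking syscall or C call
#   (gdb) py-bt     # Python-level traceback, if the gdb extensions load
#   (gdb) detach    # leave the process running when done
```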
_______________________________________________
uWSGI mailing list
[email protected]
http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi
