On Fri, Jun 14, 2013 at 9:27 PM, Roberto De Ioris <[email protected]> wrote:

>
> > On Fri, Jun 14, 2013 at 6:46 AM, Roberto De Ioris <[email protected]>
> > wrote:
> >
> >>
> >> > Hi, uwsgi users,
> >> >
> >> > I need help in trying to determine where a strange problem comes and
> >> how
> >> > to
> >> > fix it.
> >> >
> >> > TL;DR: socket backlog full for one of the vassals, with no visible
> >> > reason for that; only restarting the emperor helps.
> >> >
> >> > I've had this problem a couple of times, and only restarting
> >> > uwsgi solved it.
> >> > nginx on the frontend started returning 502 errors, saying
> >> >
> >> > connect() to unix:///var/lib/crm/uwsgi_includes/slontur.socket failed
> >> > (11: Resource temporarily unavailable)
> >> >
> >> > This led me to think that this might be due to an *overfilled
> >> > socket backlog*.
> >> >
> >> > "netstat" output had *hundreds of lines like this*:
> >> >
> >> > unix  2      [ ]         STREAM     CONNECTING    0
> >> >  /var/lib/crm/uwsgi_includes/slontur.socket
> >> > unix  2      [ ]         STREAM     CONNECTING    0
> >> >  /var/lib/crm/uwsgi_includes/slontur.socket
> >> > ....
> >> >
> >> > Unfortunately, I did not have time to inspect netstat's output more
> >> > carefully by experimenting with its command-line parameters.
> >> >
> >> > I could not understand where all of these connections were coming
> >> > from. Actually, only nginx and the emperor could potentially make
> >> > all these connections, so I tried restarting nginx to see if it
> >> > helped. It did not.
> >> >
> >> > Touching the ".ini" file did not reload the vassal either. So only
> >> > restarting the emperor itself helped.
> >> >
> >> > The project is not under heavy load, so *these connections were
> >> > definitely not coming from the browsers at the same time.*
> >> > All the background threads in the app (it runs background threads
> >> > with gevent) were also not running, because there was nothing in
> >> > the logs.
> >> >
> >> > *uwsgi.log did not have any messages either.*
> >> > The app has "heartbeat = 20" in the .ini file, but that did not
> >> > make it reload by itself.
> >> >
> >> > Here are the configs:
> >> >
> >> > *The vassal, which stopped working*
> >> > [uwsgi]
> >> > logfile-chown = crm
> >> > disable-logging =
> >> > home = /var/lib/crm/.virtualenvs/crm
> >> > logto = /var/lib/crm/homes/%n/logs/uwsgi.log
> >> > gid = crm-%n
> >> > env = LC_ALL=en_US.UTF-8
> >> > env = LANG=en_US.UTF-8
> >> > env = SERVER_SOFTWARE=gevent
> >> > env = DJANGO_SETTINGS_MODULE=%n_settings
> >> > worker-reload-mercy = 5
> >> > gevent = 100
> >> > idle = 86400
> >> > harakiri = 30
> >> > reload-mercy = 5
> >> > lazy-apps = true
> >> > cheap = true
> >> > heartbeat = 20
> >> > pythonpath = /var/lib/crm/src
> >> > pythonpath = /var/lib/crm/homes/%n
> >> > harakiri-verbose =
> >> > uid = crm-%n
> >> > chdir = /var/lib/crm/homes/%n
> >> > wsgi = crm.deploy.gevent_wsgi
> >> > die-on-idle = true
> >> >
> >> > *Emperor:*
> >> > exec uwsgi --logto /var/log/uwsgi/emperor.log \
> >> >     --die-on-term --emperor "/var/lib/crm/uwsgi_includes" \
> >> >     --emperor-tyrant \
> >> >     --emperor-on-demand-directory "/var/lib/crm/uwsgi_includes"
> >> >
> >> > Thanks for your time,
> >> > Igor Katson.
> >> > _______________________________________________
> >> >
> >>
> >>
> >> You have a single process, so if your app is blocked (for whatever
> >> reason) your whole instance will be blocked. The best way to
> >> understand what is going on would be adding the stats server, so
> >> when the app is blocked you can ask it for the whole server status.
> >>
> >> The heartbeat ensures the master is alive, while in your case the
> >> worker is stuck.
> >>
> >> By the way, when you experience the problem, try only touching the
> >> config file. There is no need to reload the whole emperor stack.
> >>
> >>
> > I enabled the stats to be able to inspect the problem next time.
> > Unfortunately, I still can't find a way to see who is connecting to the
> > unix socket; it seems there's no valid tool for that.
>
>
> I am not sure I understand what you mean; the stats server gives you the
> currently running connections. The problem is that with unix sockets you
> do not have the listen queue size, but that should not be a big problem.
>
I meant that I cannot reproduce the problem artificially, so when it
happens next time I will dump the stats server output. The other problem
was that I cannot find, with netstat or any other command, which process
is connecting to the unix socket, only the one which is listening. If it
were TCP, "netstat -p" would show the pid of the client socket as well as
the server socket, but for unix sockets only the listener's pid is shown.
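For next time, dumping the stats server takes only a few lines of Python: uWSGI writes one JSON document to the socket and closes the connection, so reading until EOF is enough. A minimal sketch (the stand-in server below only exists so the snippet is self-contained and runnable; in production the path would be whatever "stats = ..." points at in the vassal's ini file):

```python
import json
import os
import socket
import tempfile
import threading

def read_stats(path):
    """Read a uWSGI-style stats socket: the server writes a single
    JSON document and closes the connection, so read until EOF."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)
    s.close()
    return json.loads(b"".join(chunks))

# Stand-in server so the sketch runs anywhere (it is NOT uWSGI itself).
path = os.path.join(tempfile.mkdtemp(), "stats.socket")
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)
srv.listen(1)

def fake_uwsgi_stats():
    conn, _ = srv.accept()
    conn.sendall(json.dumps({"workers": [{"status": "busy"}]}).encode())
    conn.close()

t = threading.Thread(target=fake_uwsgi_stats)
t.start()
stats = read_stats(path)
t.join()
print(stats["workers"][0]["status"])  # prints "busy"
```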

>
> Another great tool for debugging is the strace command. Just
>
> strace -s 1000 -p <pid>
>
Cool! I'll try it out if it happens again.

>
> where <pid> is the pid of the stuck worker.
>
> The tracebacker generally is the silver bullet, but unfortunately
> combining it with gevent leads to pretty hard-to-debug information.
>
>
> >
> > In the meantime, is there any way to make uwsgi kill the worker, or
> > anything like that, which would make it self-heal in a situation like
> > the one above?
> >
> >
>
> If the harakiri is not triggering, the problem could be much more
> complex (like a db problem and so on). Are you sure all of the parts of
> the app are gevent-friendly? (Pay attention to the db adapter, as they
> are generally the weak point.)
>
Well, I cannot be 100% sure, but as far as I know, yes. The db is psycopg2
with "gevent_psycopg2" applied. Everything else involving the network is
pure Python.
I believe gdb or strace will show whether the problem is inside some
other C code.

The fact that all the background jobs running inside the process were not
working may show that you are right here, and the gevent loop was blocked.
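A blocked loop would also explain the nginx error: when the single worker stops calling accept(), new connections pile up in the listen queue until it is full, and further non-blocking connect() attempts fail with EAGAIN (errno 11, "Resource temporarily unavailable"). A self-contained sketch of that failure mode (the path and the tiny backlog are made up for the demo; the behavior described is Linux's for unix stream sockets):

```python
import errno
import os
import socket
import tempfile

# A listener that never calls accept(), like a stuck single-process worker.
path = os.path.join(tempfile.mkdtemp(), "demo.socket")
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(2)  # tiny backlog so it fills quickly (uWSGI's default is 100)

clients = []
err = None
for _ in range(16):
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.setblocking(False)  # non-blocking connect, as nginx does
    try:
        c.connect(path)
        clients.append(c)
    except OSError as e:
        err = e.errno  # backlog full: connect() is refused with EAGAIN
        break

# On Linux this is errno.EAGAIN, the "(11: Resource temporarily
# unavailable)" that nginx logged.
print(err == errno.EAGAIN)
```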

> Whenever a new connection is accepted by a gevent core, the harakiri is
> reset, so maybe you are not completely stuck.
>
Can enabling the master help here? I.e., if the worker is stuck, the
master shouldn't be, right?
I had no "master = true" line in the ini file, but there are always two
processes running for each ini file, so I'm not sure whether the master
was running. I have now added "master = true" in case it helps.
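For reference, these are the two lines I mean, in the vassal's ini (the stats socket path is just an example, any free unix socket or TCP address works):

```ini
master = true
stats = /var/lib/crm/uwsgi_includes/%n-stats.socket
```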

>
> Is there anything in the uwsgi logs?
>

No, the logs are empty, absolutely nothing there.

Thank you, Roberto, for your advice, I will be more prepared to meet the
problem next time with strace and gdb.
_______________________________________________
uWSGI mailing list
[email protected]
http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi
