[uWSGI] Server hangs after harakiri; debugging harakiri events in general

Jody McIntyre Fri, 21 Nov 2014 11:40:30 -0800

Hi,

Our uWSGI server hangs (stops serving all requests until it is
restarted) about once a week, generally after a harakiri event.  Can
anyone help troubleshoot this?  Also, how can I debug harakiri events
in general?  Most of them don't cause the server to hang, but I don't
understand what's causing them.  The requests printed when the worker
dies are all normal parts of our app that are accessed hundreds of
times per day without incident.


uWSGI version is 2.0.8.
OS is Ubuntu 14.04 LTS.
CPU is x86_64 - Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz on Amazon EC2.
Webserver is nginx, load balancer is haproxy.

Config is below.

Logs from a harakiri that caused the server to hang:

Thu Nov 20 15:09:29 2014 - *** HARAKIRI ON WORKER 8 (pid: 5046, try: 1) ***
HARAKIRI: -- syscall> 7 0x7fffafe0e9c0 0x1 0xffffffff 0x8 0x1040bc8 0x1 0x7fffafe0e9a0 0x7f3afea6cfbd
HARAKIRI: -- wchan> poll_schedule_timeout
Thu Nov 20 15:09:29 2014 - HARAKIRI !!! worker 8 status !!!
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 0] 127.0.0.1 - GET /acct_quota since 1416495853
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 1] 127.0.0.1 - POST /pullf/ since 1416495854
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 2] 127.0.0.1 - GET / since 1416495861
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 3] 127.0.0.1 - GET / since 1416495853
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 4] 127.0.0.1 - POST /signin/ since 1416495865
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 5] 127.0.0.1 - POST /clientresp since 1416495856
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 6] 127.0.0.1 - POST /pullf/ since 1416495858
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 7] 127.0.0.1 - GET /~Dreamshot/495/percentage-of-bachelors-degrees-conferred-to-women-in-the-usa-by-major-1970-2012/ since 1416495852
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 8] 127.0.0.1 - POST /stylethemes/ since 1416495858
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 9] 127.0.0.1 - POST /clientresp since 1416495854
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 10] 127.0.0.1 - POST /clientresp since 1416495856
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 11] 127.0.0.1 - GET /acct_quota since 1416495860
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 12] 127.0.0.1 - POST /signin/ since 1416495866
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 13] 127.0.0.1 - POST /clientresp since 1416495865
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 14] 127.0.0.1 - POST /pullf/ since 1416495853
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 15] 127.0.0.1 - GET /%7Ehianalytics/189/ since 1416495852
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 16] 127.0.0.1 - POST /pullf/ since 1416495851
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 17] 127.0.0.1 - POST /signin/ since 1416495868
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 18] 127.0.0.1 - POST /clientresp since 1416495866
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 19] 127.0.0.1 - GET /getsources?fid=&extrarefs=Doktorigi%3A8 since 1416495868
Thu Nov 20 15:09:29 2014 - HARAKIRI !!! end of worker 8 status !!!
DAMN ! worker 8 (pid: 5046) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 8 (new pid: 10985)
monitor (pid=10985): Starting stack trace monitor.
WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0xa3dd80 pid: 10985 (default app)

When the server is able to successfully restart the worker, the
message looks similar.  Here's our latest:

Fri Nov 21 18:36:12 2014 - *** HARAKIRI ON WORKER 5 (pid: 23549, try: 1) ***
HARAKIRI: -- wchan> futex_wait_queue_me
Fri Nov 21 18:36:12 2014 - HARAKIRI !!! worker 5 status !!!
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 0] 127.0.0.1 - GET /plot since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 1] 127.0.0.1 - POST /getuser/ since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 2] 127.0.0.1 - POST /user_account_actions since 1416594370
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 3] 127.0.0.1 - GET /plot since 1416594366
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 4] 127.0.0.1 - POST /pullf/ since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 5] 127.0.0.1 - POST /clientresp since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 6] 127.0.0.1 - GET /python/3d-plots-tutorial/ since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 7] 127.0.0.1 - POST /getuser/ since 1416594370
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 8] 127.0.0.1 - POST /getuser/ since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 9] 127.0.0.1 - POST /getuser/ since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 10] 127.0.0.1 - POST /getuser/ since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 11] 127.0.0.1 - POST /svgtopdf/ since 1416594371
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 12] 127.0.0.1 - POST /clientresp since 1416594366
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 13] 127.0.0.1 - GET /quandl?code=WORLDBANK/UZB_SP_RUR_TOTL_ZS since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 14] 127.0.0.1 - GET /~martin.2098/20/-line0-css-penthouse-line0-line0 since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 15] 127.0.0.1 - POST /user_account_actions since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 16] 127.0.0.1 - GET /plot since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 17] 127.0.0.1 - GET /plot since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 18] 127.0.0.1 - POST /clientresp since 1416594371
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 19] 127.0.0.1 - POST /getnotifs/ since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI !!! end of worker 5 status !!!
DAMN ! worker 5 (pid: 23549) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 5 (new pid: 24129)
monitor (pid=24129): Starting stack trace monitor.
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0xae8aa0 pid: 24129 (default app)
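
The "since" values in the harakiri dumps above are Unix timestamps, so the age of each in-flight request can be worked out and compared with the configured harakiri timeout. The following is only a minimal sketch of that arithmetic, not part of uWSGI: the function name harakiri_ages and the SAMPLE text are made up for illustration, and it assumes the log timestamps are in UTC (an assumption; adjust if the server logs in local time). The pattern tolerates entries wrapped onto two lines, as they appear in an email.

```python
import re
from datetime import datetime, timezone

# One "HARAKIRI [core N]" entry. \s+ also matches newlines, so an entry that
# was wrapped onto two lines still parses.
CORE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3}\s+\d+ \d\d:\d\d:\d\d \d{4}) - HARAKIRI \[core (?P<core>\d+)\]"
    r"\s+\S+\s+-\s+(?P<method>[A-Z]+)\s+(?P<uri>\S+)\s+since\s+(?P<since>\d+)",
    re.MULTILINE,
)


def harakiri_ages(log_text):
    """Return one dict per core entry with the request age in seconds."""
    rows = []
    for m in CORE_RE.finditer(log_text):
        logged = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
        # ASSUMPTION: the log timestamp is UTC.
        logged_epoch = int(logged.replace(tzinfo=timezone.utc).timestamp())
        rows.append({
            "core": int(m.group("core")),
            "method": m.group("method"),
            "uri": m.group("uri"),
            "age_seconds": logged_epoch - int(m.group("since")),
        })
    return rows


# Two entries copied from the first excerpt, wrapped as in the email.
SAMPLE = """\
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 0] 127.0.0.1 - GET
/acct_quota since 1416495853
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 1] 127.0.0.1 - POST /pullf/
since 1416495854
"""

for row in harakiri_ages(SAMPLE):
    print(row["core"], row["method"], row["uri"], row["age_seconds"])
```

Running it over a whole log file (for example the file named by logto) and sorting by age_seconds shows whether a single request or every core in the worker had exceeded the timeout when the worker was killed.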

Configuration from --show-config:

;uWSGI instance configuration
[uwsgi]
show-config = true
emperor = /etc/streambed_uwsgi.ini
;end of configuration

Contents of /etc/streambed_uwsgi.ini:

[uwsgi]

uid = www-data
gid = www-data

chdir = /var/www/streambed/shelly
module = apache.wsgi
socket = /var/run/streambed.sock
chown-socket = www-data
logto = /var/log/uwsgi/streambed
pidfile = /var/run/streambed.pid

master = true
# Conventional SIGTERM behaviour - needed for runit:
die-on-term = true
# Clean up on exit:
vacuum = true

# 10 processes, 20 threads each:
processes = 10
threads = 20

buffer-size = 32768

# Load the app in each worker process, rather than in the master process:
lazy = true
# Maximum time to service a request (seconds):
harakiri = 300
harakiri-verbose = true
# Reload each process after this number of requests:
max-requests = 10000
# Save HTTP bodies larger than this to disk (bytes):
post-buffering = 1000000

# Stats socket
stats = /var/run/uwsgi/streambed.stats
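
With the stats option set as above, the stats server writes a JSON document to any client that connects to that socket, which gives a way to see how many cores of each worker are busy at the moment of a hang. The following is a rough sketch rather than a definitive tool: read_stats and busy_summary are hypothetical helper names, and the JSON keys used ("workers", "id", "pid", "status", "cores", "in_request") are what the stats output is understood to contain, so they are read defensively with .get() in case a version differs.

```python
import json
import socket


def read_stats(path):
    """Connect to the stats Unix socket and parse everything it sends."""
    chunks = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after the dump
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def busy_summary(stats):
    """Return (worker id, pid, status, busy cores, total cores) per worker."""
    rows = []
    for worker in stats.get("workers", []):
        cores = worker.get("cores", [])
        # ASSUMPTION: a core handling a request reports a truthy "in_request".
        busy = sum(1 for core in cores if core.get("in_request"))
        rows.append((worker.get("id"), worker.get("pid"),
                     worker.get("status"), busy, len(cores)))
    return rows
```

A possible use while the server is hung would be busy_summary(read_stats("/var/run/uwsgi/streambed.stats")), run as a user that can read the socket; if every worker shows all 20 cores busy, that points at the application threads rather than at the master process.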


Thanks for any hints or suggestions on either of these issues!

Jody McIntyre
Plotly Engineering