On Wed, Dec 23, 2015 at 6:35 PM, Maxim Dounin <mdou...@mdounin.ru> wrote:
> Hello!
>
> On Wed, Dec 23, 2015 at 11:57:53AM +0300, Руслан Закиров wrote:
>
> > On Tue, Dec 22, 2015 at 4:49 PM, Maxim Dounin <mdou...@mdounin.ru> wrote:
> >
> > > Hello!
> > >
> > > On Tue, Dec 22, 2015 at 02:43:11PM +0300, Руслан Закиров wrote:
> > >
> > > [...]
> > >
> > > > Our error log has "ignore long locked inactive cache entry" alerts, but I
> > > > really couldn't match it to a "defreeze" event. The access log has
> > > > STALE/UPDATING requests between the alert and the EXPIRED (cache
> > > > updating) request.
> > >
> > > The "ignore long locked inactive cache entry" alerts indicate that
> > > a cache entry was locked by some request, and wasn't unlocked for
> > > a long time. The alert is expected to appear if a cache node is
> > > locked for the cache inactive time (as set by proxy_cache_path
> > > inactive=, 10 minutes by default).
> >
> > Inactive is defined in the config, but it's set to the default 10m.
> >
> > What happens with requests after this time? Do they hit the backend and
> > update the cache? Do they use the stale version?
>
> As long as an entry is locked and not accessed for the inactive
> time, it will cause the alert in question to be logged by the cache
> manager. Nothing else happens - that is, the entry is still
> locked, and if it contains information that the entry is currently
> being updated, requests to this entry will return stale content
> with UPDATING status as per proxy_cache_use_stale.
>
> This alert may in theory happen in a normal situation if a backend
> response takes longer than the inactive time set. But in general
> it indicates that there is a problem.

uwsgi on the backends has a harakiri setting of 9 seconds, and the timeout in
nginx is 10 seconds. The index page takes 0.2-0.7 seconds to render.

> (Note well that "locked" here should not be confused with
> request-level locks as in proxy_cache_lock.)
>
> > I'm going to check the "long locked" messages in the log to see how many
> > were for the "/" location.
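[For readers following the thread: the directives discussed above combine roughly as in the sketch below. The paths, zone name, sizes, and upstream name are illustrative assumptions, not taken from the poster's config.]

```nginx
# Entries not accessed for `inactive=` time become eligible for removal
# by the cache manager; a node that stays locked longer than this is
# what triggers the "ignore long locked inactive cache entry" alert.
proxy_cache_path /var/nginx/cache levels=1:2 keys_zone=main:64m
                 inactive=10m max_size=1g;

server {
    location / {
        proxy_pass            http://backend;   # illustrative upstream name
        proxy_cache           main;
        # Serve stale content (logged as STALE/UPDATING) while a single
        # request refreshes the expired entry.
        proxy_cache_use_stale updating error timeout;
        proxy_read_timeout    10s;  # the thread mentions a 10s nginx timeout
    }
}
```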
> > The hash should be the same if we didn't change the cache key, right?
>
> Any message indicates that there is a problem. If the "/" location
> is requested often, it likely won't be mentioned in the "ignore long
> locked" messages.
>
> > > Most likely reason is that a worker died or was killed
> > > while holding a lock on a cache node (i.e., while a request was
> > > waiting for a new response from a backend).
> >
> > Shouldn't there be a record in the error log? The error log level is at warn.
>
> There should be something like "worker process exited" messages,
> though at least some of them may be on the "notice" level (e.g.,
> if you force an nginx worker to quit by sending SIGTERM to it
> directly).
>
> > > Trivial things to consider:
> > >
> > > - check logs for segmentation faults;
> >
> > no seg faults in the logs
>
> Any other alerts, crit or emerg messages? Any alerts about
> socket leaks on configuration reloads ("open socket left..."),
> sockets in CLOSED state? How long do the relevant requests to the
> upstream server take (i.e., make sure that the behaviour observed isn't
> a result of an upstream server just being very slow)?
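[The log checks suggested above can be scripted. A minimal sketch, demonstrated against a synthetic log excerpt; in practice, point LOG at the path from your error_log directive instead:]

```shell
#!/bin/sh
# Synthetic error_log excerpt standing in for the real file.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2015/12/17 01:30:02 [alert] 5229#0: worker process 5230 exited on signal 11
2015/12/17 01:30:05 [alert] 5231#0: open socket #42 left in connection 7
EOF

# Worker crashes and abnormal exits (segfaults show up as "signal 11"):
grep -c 'worker process.*exited' "$LOG"    # prints 1

# Socket leaks left behind by configuration reloads:
grep -c 'open socket' "$LOG"               # prints 1

rm -f "$LOG"
```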
Looks like we have some misconfiguration and I get the following:

2015/12/17 01:49:13 [crit] 5232#0: *599273664 rename()
"/var/nginx/cache/0080683483" to
"/var/www/storage/img/ru-cyber/answers/10-5.jpg" failed (13: Permission
denied) while reading upstream, client: 188.115.151.67, server:
www.sports.ru, request: "GET /storage/img/ru-cyber/answers/10-5.jpg
HTTP/1.1", upstream:
"http://192.168.1.10:80/storage/img/ru-cyber/answers/10-5.jpg", host:
"www.sports.ru", referrer: "http://cyber.sports.ru/dota2/1034151992.html"

2015/12/17 01:50:17 [crit] 5229#0: *599313395 open()
"/var/www/storage/img/fantasy/shirts/ule/slo.png" failed (13: Permission
denied), client: 192.168.1.10, server: www.sports.ru, request: "GET
/storage/img/fantasy/shirts/ule/slo.png HTTP/1.0", host: "www.sports.ru",
referrer: "http://fantasy-h2h.ru/analytics/fantasy_statistics/liga_europa_2015"

> > > - if you are using 3rd party modules / patches, try without them;
> >
> > from the freebsd port; updated gist [1] with `nginx -V` output
> >
> > [1] https://gist.github.com/ruz/05456767750715f6b54e
>
> The only suspicious thing I see there is "ssi on" and
> "proxy_cache_lock" at the same time. There were some fixes
> related to proxy_cache_lock with subrequests in more recent
> versions.
>
> > > - make sure you don't kill worker processes yourself or using some
> > >   automation scripts (in particular, don't try to terminate old
> > >   worker processes after a configuration reload).
> >
> > One recent appearance of the problem was at 1:30AM, and I checked
> > the logs for crazy midnight deploys - nothing.
> >
> > Also, we don't use anything custom to restart nginx, just regular
> > service management tools.
>
> Just a regular kill will be enough to screw things up. On the other
> hand, it may be an nginx bug as well.
>
> Note, though, that the version you are using is quite old.
> Debugging anything but 1.9.x hardly makes sense. You may want to
> try upgrading to the latest nginx-devel port.
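[The rename() into /var/www/storage in the crit messages above suggests a proxy_store-style setup writing fetched files to disk, and the EACCES means the worker user cannot write to that tree. A hedged sketch of such a setup; only the paths mirror the log, everything else is an assumption:]

```nginx
# Sketch only: serve from disk if present, otherwise fetch and store.
location /storage/ {
    root       /var/www;
    error_page 404 = @fetch;
}

location @fetch {
    internal;
    proxy_pass         http://192.168.1.10:80;
    proxy_store        on;
    # Stored files must be writable by the worker user (see the `user`
    # directive), e.g. www:www on FreeBSD.
    proxy_store_access user:rw group:rw all:r;
    # Temp path should be on the same filesystem as the store tree,
    # since nginx moves the finished file into place with rename().
    proxy_temp_path    /var/nginx/cache;
    root               /var/www;
}
```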
As I said before, 1.8.0 made things worse. Maybe we will try nginx 1.9.x on
one of the frontends and see how it goes.

> Note well that if a problem happens, it's a good idea to either
> completely restart nginx or do the binary upgrade procedure ("service
> nginx upgrade") to make sure the shared memory structures are intact.
>
> --
> Maxim Dounin
> http://nginx.org/
>
> _______________________________________________
> nginx mailing list
> nginx@nginx.org
> http://mailman.nginx.org/mailman/listinfo/nginx

--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ <http://www.sports.ru/>