Hi, I encountered a problem with blocked MDS operations and a client becoming 
unresponsive. I dumped the MDS cache, ops, blocked ops and some further log 
information here:

https://files.dtu.dk/u/peQSOY1kEja35BI5/2010-09-03-mds-blocked-ops?l
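For reference, the dumps were collected via the MDS admin socket roughly along these lines (the daemon name `mds.ceph-01` is a placeholder; substitute your active MDS):

```shell
# Run on the host where the active MDS daemon lives.
# Daemon name "mds.ceph-01" is a placeholder.
ceph daemon mds.ceph-01 dump_blocked_ops > mds-blocked-ops.json  # the stuck op(s)
ceph daemon mds.ceph-01 ops > mds-ops.json                       # all ops in flight
ceph daemon mds.ceph-01 dump cache /tmp/mds-cache.txt            # cache dump, written on the MDS host
```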

A user of our HPC system was running a job that creates a somewhat stressful 
MDS load. This workload tends to lead to MDS warnings like "slow metadata ops" 
and "client does not respond to caps release", which usually disappear without 
intervention after a while.

He cancelled the job, and one operation from one of the clients remained stuck 
in the MDS. We had a health warning about 1 blocked metadata operation and one 
client failing to respond to caps release. I should mention that we execute 
"echo 3 > /proc/sys/vm/drop_caches" in the epilogue script that runs after 
every job, which usually releases all unused caps without problems. So, by the 
time I looked at the client cap counts, the client in question was already down 
to fewer than 100 caps due to the epilogue script. It looks like there might be 
a race condition between the cache drop and in-flight MDS requests.
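For context, the relevant part of our epilogue script is essentially just the following (a sketch; the surrounding job-cleanup logic is omitted):

```shell
# Epilogue fragment, run as root after every job.
# Flush dirty data first, then drop the page cache plus reclaimable
# slab objects (dentries and inodes); on a CephFS client this also
# releases unused caps back to the MDS.
sync
echo 3 > /proc/sys/vm/drop_caches
```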

In addition, backfill was going on while this happened. All PGs were active 
(plus various backfill/recovery states), and all storage was r/w-accessible.

On the client side, this was in the logs:

Sep  3 09:15:57 sn110 kernel: INFO: task kworker/0:1:79782 blocked for more 
than 120 seconds.
Sep  3 09:15:57 sn110 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  3 09:15:57 sn110 kernel: kworker/0:1     D ffff995cf4614100     0 79782    
  2 0x00000000
Sep  3 09:15:57 sn110 kernel: Workqueue: ceph-pg-invalid ceph_invalidate_work 
[ceph]
Sep  3 09:15:57 sn110 kernel: Call Trace:
[... see link above ...]

I did not see slow ops on any of the OSDs. All other information is in the link 
above.

We had to reboot the client to resolve this problem. It seems the MDS does not 
clean up blocked requests in certain situations where it should be possible. I 
hope the cache and ops dumps help pinpoint the cause.

Best regards,
Frank
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]