Short story:
------------
We have a new HPC installation with file systems provided by cephfs (home,
apps, ...). We have one cephfs and all client file systems are sub-directory
mounts. On this ceph file system, we have a bit more than 500 nodes with
currently 2 ceph fs mounts each, resulting in a bit over 1000 client
connections. Today, we run a benchmark for the HPC nodes, which somehow crashed
our MDS setup as a by-product. We are currently on "ceph version 13.2.2
(02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)".
We have the fs meta-data on SSD and the data on an EC data-pool on HDD. The
ceph servers are running on kernel v5.0 while the clients are on v3.10. It
looks like the most important ceph fs kernel client fixes have been
back-ported (though not all of them). Everything is on mimic. If necessary, we
will provide more details, which we omitted here in favour of keeping the
focus on the quite long event log below.
What we observed after starting the benchmark was 10,000-20,000 read ops,
probably caused by the application (running on all 500+ nodes) repeatedly
accessing the same file or a small set of files. For a while this more or less
worked, until operations started piling up and never got flushed. After quite
a bit of work, we got our cluster back and discovered what may be a serious
problem with an MDS daemon pair set up as active/standby-replay.
The main questions in this e-mail are basically (please see the long story
below for the clue we found):
1) Is there a bug with having MDS daemons acting as standby-replay?
2) Is there a manual way to cancel an "internal op fragmentdir"?
Our MDS configuration section looked like this before the incident:
[mds.ceph-08]
mds standby replay = true
mds standby for rank = 0
mds standby for fscid = 1
[mds.ceph-12]
mds standby replay = true
mds standby for rank = 0
mds standby for fscid = 1
We changed it to the following until we have more information on the issue:
[mds.ceph-08]
mds standby replay = false
#mds standby for rank = 0
#mds standby for fscid = 1
[mds.ceph-12]
mds standby replay = false
#mds standby for rank = 0
#mds standby for fscid = 1
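To double-check which daemon actually holds which state after such a config change, the mdsmap can be inspected programmatically. Below is a minimal sketch that parses `ceph fs dump -f json` output; the JSON field names are assumptions from memory (abbreviated, not copied from a real mimic cluster), so treat it as an illustration rather than a reference:

```python
import json

def mds_states(fs_dump_json):
    """Map MDS daemon name -> state from a `ceph fs dump -f json` dump."""
    dump = json.loads(fs_dump_json)
    states = {}
    for fs in dump["filesystems"]:
        # "info" is assumed to be a dict keyed by gid, one entry per daemon
        for info in fs["mdsmap"]["info"].values():
            states[info["name"]] = info["state"]
    return states

# Abbreviated, hypothetical sample (gids and field layout assumed):
sample = json.dumps({"filesystems": [{"mdsmap": {"info": {
    "gid_4242": {"name": "ceph-12", "state": "up:active"},
    "gid_4343": {"name": "ceph-08", "state": "up:standby-replay"},
}}}]})

print(mds_states(sample))
```

On a healthy active/standby (non-replay) setup, the second daemon should report `up:standby` instead of `up:standby-replay`.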
Long Story:
-----------
Disclaimer: I was a bit under stress getting our cluster back. Unfortunately, I
did not take copies of all potentially relevant information and have to recall
some from memory. We are still in a phase where we can execute tests to collect
more data - if necessary. We will certainly stress test the active/standby
configuration with the same benchmark.
It looks like there might be a bug where the standby-replay daemon is regarded
as active and a fragmentdir operation is scheduled, but never executed.
After this internal request is in the pipeline, operations keep piling up
without ever being executed. The IO to files accessed by the benchmark stopped.
Here is a protocol of what we observed and did:
- We started the benchmark and observed large IOPs (10-20K ops read) with low
MB/s (300-500MB/s read). Very little write going on.
- At some point we started seeing the health warning
1 MDSs report slow requests
and a large number of ops blocked (NNNN: some 4-digit number):
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdsceph-12(mds.0): NNNN slow requests are blocked > 30 secs
with the number NNNN increasing fast.
- Some time later, we see trimming messages in addition:
MDS_TRIM 2 MDSs behind on trimming
mdsceph-12(mds.0): Behind on trimming (292/128) max_segments: 128,
num_segments: 292
mdsceph-08(mds.0): Behind on trimming (292/128) max_segments: 128,
num_segments: 292
with num_segments increasing slowly.
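For watching this backlog grow, the numbers can be scraped from the health detail text. A small sketch below, assuming the exact wording of the warning lines shown above (the sample string is copied from them):

```python
import re

# Health detail lines as shown above:
WARNING = (
    "mdsceph-12(mds.0): Behind on trimming (292/128) max_segments: 128, num_segments: 292\n"
    "mdsceph-08(mds.0): Behind on trimming (292/128) max_segments: 128, num_segments: 292\n"
)

# Capture daemon name and the num_segments/max_segments pair
PATTERN = re.compile(
    r"(?P<mds>\S+)\(mds\.\d+\): Behind on trimming \((?P<num>\d+)/(?P<max>\d+)\)"
)

backlog = {m.group("mds"): int(m.group("num")) - int(m.group("max"))
           for m in PATTERN.finditer(WARNING)}
print(backlog)  # both daemons are 164 segments over the limit
```

Run periodically against `ceph health detail`, this makes the "increasing slowly" trend measurable instead of eyeballed.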
- That's when we started taking action since we noted that the old ops were
never executed and ops would probably just keep piling up.
- We cancelled the benchmark job that was causing the IO load and waited to see
if ceph would start recovering. I should note that this problem was already
triggered by earlier runs of the same benchmark and the health issues did
disappear after some time back then. However, in those cases the user
terminated the job as soon as he noticed strange behaviour - much earlier
than in the supervised test described here, where we wanted to see how ceph
acts under a sustained load of this type (users do have job accidents:).
- We observed that this time, ceph was not getting better. The number of stuck
operations was remaining at this level:
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdsceph-12(mds.0): 12051 slow requests are blocked > 30 secs
MDS_TRIM 2 MDSs behind on trimming
mdsceph-12(mds.0): Behind on trimming (292/128) max_segments: 128,
num_segments: 292
mdsceph-08(mds.0): Behind on trimming (292/128) max_segments: 128,
num_segments: 292
- So, we had to do something. After trying a number of less invasive things,
like remounting the ceph fs on a client node, to no avail, we decided to take
more drastic measures.
- We evicted all clients and observed that more and more OPS got marked as
"cancelled" and "cleaned up". During the eviction process, these warnings came
up temporarily:
mdsceph-12(mds.0): Client sn371.hpc.ait.dtu.dk:con-fs-hpc failing to
respond to capability release client_id: 226263
mdsceph-12(mds.0): Client sn322.hpc.ait.dtu.dk:con-fs-hpc failing to
respond to capability release client_id: 242672
mdsceph-12(mds.0): Client sn339.hpc.ait.dtu.dk:con-fs-hpc failing to
respond to capability release client_id: 242708
mdsceph-12(mds.0): Client sn335.hpc.ait.dtu.dk:con-fs-hpc failing to
respond to capability release client_id: 272728
I'm aware that there is/was a kernel (client) bug that might still be present
in 3.10. However, I have not seen this warning in any other context so far. I
did test pushing clients to release caps by changing the MDS cache memory limit
at run time and clients responded with dropping caps as needed. I have never
seen the MDS exceeding the cache memory limit.
- After kicking out all clients, we ended up with this result:
HEALTH_WARN 1 MDSs report slow requests; 2 MDSs behind on trimming
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdsceph-12(mds.0): 12051 slow requests are blocked > 30 secs
MDS_TRIM 2 MDSs behind on trimming
mdsceph-12(mds.0): Behind on trimming (294/128) max_segments: 128,
num_segments: 294
mdsceph-08(mds.0): Behind on trimming (294/128) max_segments: 128,
num_segments: 294
[root@ceph-mds:ceph-12 /]# ceph daemon mds.ceph-12 ops | grep description | wc -l
12051
[root@ceph-mds:ceph-12 /]# ceph daemon mds.ceph-12 ops | grep killing | wc -l
12050
So, there was 1 stuck operation that was not client-related. I started
filtering the descriptions with "grep -v" and got a hit:
[root@ceph-mds:ceph-12 /]# ceph daemon mds.ceph-12 ops | grep "internal op fragmentdir" -A 100
"description": "internal op fragmentdir:mds.0:169119",
"initiated_at": "2019-05-13 11:17:52.002507",
"age": 12796.457152,
"duration": 12796.666273,
"type_data": {
"flag_point": "dispatched",
"reqid": "mds.0:169119",
"op_type": "internal_op",
"internal_op": 5376,
"op_name": "fragmentdir",
"events": [
{
"time": "2019-05-13 11:17:52.002507",
"event": "initiated"
},
{
"time": "2019-05-13 11:17:52.002507",
"event": "header_read"
},
{
"time": "0.000000",
"event": "throttled"
},
{
"time": "0.000000",
"event": "all_read"
},
{
"time": "0.000000",
"event": "dispatched"
}
]
}
},
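The grep/wc counting above can also be done in one pass over the JSON. Here is a minimal sketch that pulls the non-killed ops out of `ceph daemon mds.<id> ops` output; the "killing" flag-point marker is assumed from the grep hits above, and the sample is an abbreviated, partly hypothetical reconstruction of the real dump:

```python
import json

def find_stuck_ops(ops_json):
    """Return ops that were not cancelled by the client eviction.

    Ops from evicted clients are assumed to carry a flag_point containing
    "killing" (as the grep above suggests); anything without it, e.g. the
    internal fragmentdir op, is left over in the queue.
    """
    ops = json.loads(ops_json)["ops"]
    return [op for op in ops
            if "killing" not in op["type_data"].get("flag_point", "")]

# Abbreviated, partly hypothetical sample modelled on the dump above:
sample = json.dumps({"ops": [
    {"description": "client_request:client.226263:1",
     "type_data": {"flag_point": "killing request"}},
    {"description": "internal op fragmentdir:mds.0:169119",
     "type_data": {"flag_point": "dispatched"}},
]})

for op in find_stuck_ops(sample):
    print(op["description"])  # -> internal op fragmentdir:mds.0:169119
```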
The information I found searching the web indicates that this is something
that happens in an active/active set-up, where one active MDS requests another
active MDS to take over part of the meta-data operations. In our case, there
was no MDS to accept that request. Our hypothesis is that the standby-replay
daemon was erroneously seen as active.
I should add here that (all?) the blocked operations had as their last entry
something like "waiting for directory fragmentation" or "pending ...". I
forgot to copy the message and have to recall it from memory. In any case, I
had already suspected a bit earlier that an operation like the one above was
stuck in the queue, holding everything up.
- Since all other ops were cancelled, we assumed it would be safe at this
point to restart MDS daemons without losing relevant data. The idea for
getting rid of the pending dirfrag was to restart the standby-replay daemon as
standby (only) and see what happens. If things looked good, we would fail over
the active daemon and restart it, too, with the standby (only) configuration
indicated above in the modified MDS section of ceph.conf.
Indeed, this procedure flushed the dirfrag out of the queue and all other
operations with it. The cluster came back healthy and I do not think we lost
any relevant IO.
- I should mention that ceph fs actually kept working fine for anyone except
the user running the benchmark. Only IO to particular files/a particular
directory stopped, so the problem seems to have remained isolated. Also, the load on
the servers was not high during the test. The fs remained responsive to other
users. Also, the MDS daemons never crashed. There was no fail-over except the
ones we triggered manually.
As mentioned above, we can do some more testing within reason. We already have
pilot users on the system and we need to keep it sort of working.
Best regards and thanks in advance for any help on getting a stable
active/standby-replay config working.
And for reading all that.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com