On Thu, Jun 29, 2017 at 6:22 PM, Nick Fisk <[email protected]> wrote:
>> -----Original Message-----
>> From: Ilya Dryomov [mailto:[email protected]]
>> Sent: 29 June 2017 16:58
>> To: Nick Fisk <[email protected]>
>> Cc: Ceph Users <[email protected]>
>> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
>>
>> On Thu, Jun 29, 2017 at 4:30 PM, Nick Fisk <[email protected]> wrote:
>> > Hi All,
>> >
>> > Putting out a call for help to see if anyone can shed some light on this.
>> >
>> > Configuration:
>> > Ceph cluster presenting RBDs -> XFS -> NFS -> ESXi, running 10.2.7 on
>> > the OSDs and a 4.11 kernel on the NFS gateways in a pacemaker cluster.
>> > Both the OSDs and the clients go into a pair of switches in a single L2
>> > domain (no sign from pacemaker of any network connectivity issues).
>> >
>> > Symptoms:
>> > - All RBDs on a single client randomly hang for 30s to several
>> > minutes, confirmed by pacemaker and by the ESXi hosts complaining
>>
>> Hi Nick,
>>
>> What is a "single client" here?
>
> I mean a node of the pacemaker cluster, so all RBDs on the same pacemaker
> node hang.
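
For illustration, each gateway node's stack (kernel RBD -> XFS -> NFS,
consumed by ESXi) might be put together roughly like this; the image name,
size and export options are assumptions, and only the export path and client
subnet come from the pacemaker logs further down:

  rbd create rbd/ceph-ds1 --size 10T        # backing image (name/size assumed)
  rbd map rbd/ceph-ds1                      # kernel client maps it as /dev/rbd/rbd/ceph-ds1
  mkfs.xfs /dev/rbd/rbd/ceph-ds1            # XFS on top of the mapped device
  mount /dev/rbd/rbd/ceph-ds1 /mnt/Ceph-DS1
  exportfs -o rw,no_root_squash 10.3.20.0/24:/mnt/Ceph-DS1   # NFS export the ESXi hosts mount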
>
>>
>> > - Most of the time when this happens, cluster load is minimal
>>
>> Can you post gateway syslog and point at when this happened?
>> Corresponding pacemaker excerpts won't hurt either.
>
> Jun 28 16:35:38 MS-CEPH-Proxy1 lrmd[2026]: warning:
> p_export_ceph-ds1_monitor_60000 process (PID 17754) timed out
> Jun 28 16:35:43 MS-CEPH-Proxy1 lrmd[2026]: crit:
> p_export_ceph-ds1_monitor_60000 process (PID 17754) will not die!
> Jun 28 16:43:51 MS-CEPH-Proxy1 lrmd[2026]: warning:
> p_export_ceph-ds1_monitor_60000:17754 - timed out after 30000ms
> Jun 28 16:43:52 MS-CEPH-Proxy1 IPaddr(p_vip_ceph-ds1)[28482]: INFO: ifconfig
> ens224:0 down
> Jun 28 16:43:52 MS-CEPH-Proxy1 lrmd[2026]: notice:
> p_vip_ceph-ds1_stop_0:28482:stderr [ SIOCDELRT: No such process ]
> Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]: notice: Operation
> p_vip_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=471, rc=0,
> cib-update=318, confirmed=true)
> Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO:
> Un-exporting file system ...
> Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO:
> unexporting 10.3.20.0/24:/mnt/Ceph-DS1
> Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO:
> Unlocked NFS export /mnt/Ceph-DS1
> Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO:
> Un-exported file system(s)
> Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]: notice: Operation
> p_export_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=473, rc=0,
> cib-update=319, confirmed=true)
> Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO:
> Exporting file system(s) ...
> Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO:
> exporting 10.3.20.0/24:/mnt/Ceph-DS1
> Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO:
> directory /mnt/Ceph-DS1 exported
> Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]: notice: Operation
> p_export_ceph-ds1_start_0: ok (node=MS-CEPH-Proxy1, call=474, rc=0,
> cib-update=320, confirmed=true)
>
> If I enable the read/write checks for the FS resource, they also time out
> at the same time.
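
For context, those read/write checks on an ocf:heartbeat:Filesystem resource
are normally enabled by setting OCF_CHECK_LEVEL on the monitor operation
(10 = read test, 20 = write test). A rough sketch; the resource name, device
path and timings are assumptions, not taken from this thread:

  crm configure primitive p_fs_ceph-ds1 ocf:heartbeat:Filesystem \
      params device=/dev/rbd/rbd/ceph-ds1 directory=/mnt/Ceph-DS1 fstype=xfs \
      op monitor interval=60s timeout=40s OCF_CHECK_LEVEL=20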

What about the syslog that corresponds to the above?
Thanks,
Ilya
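
(Assuming a systemd-based gateway, something like the following would pull
the syslog and kernel messages for that window; the timestamps are taken
from the pacemaker excerpt above.)

  journalctl --since "2017-06-28 16:30:00" --until "2017-06-28 16:50:00"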