Following up on a very old thread on the off-chance that anyone else encounters 
this problem...

Restarting the cluster didn’t fix the problem, but we just finished upgrading 
this cluster from Quincy to Reef, and the orchestrator is working again. It’s 
great to see valid output from `ceph orch device ls` after almost a year :-)

Cheers,
/rjg

> On Oct 28, 2024, at 5:52 PM, Bob Gibson <[email protected]> wrote:
> 
> I enabled debug logging with `ceph config set mgr 
> mgr/cephadm/log_to_cluster_level debug` and viewed the logs with `ceph -W 
> cephadm --watch-debug`. I can see the orchestrator refreshing the device 
> list, and this is reflected in the `ceph-volume.log` file on the target osd 
> nodes. When I restart the mgr, `ceph orch device ls` reports each device with 
> “5w ago” under the “REFRESHED” column. After the orchestrator attempts to 
> refresh the device list, `ceph orch device ls` stops outputting any data at 
> all until I restart the mgr again.
> 
> I discovered that I can query the cached device data using `ceph config-key 
> dump`. On the problematic cluster, the `created` attribute is stale, e.g.
> 
> ceph config-key dump | jq -r .'"mgr/cephadm/host.ceph-osd31.devices.0"' | jq 
> .devices[].created
> "2024-09-23T17:56:44.914535Z"
> "2024-09-23T17:56:44.914569Z"
> "2024-09-23T17:56:44.914591Z"
> "2024-09-23T17:56:44.914612Z"
> "2024-09-23T17:56:44.914632Z"
> "2024-09-23T17:56:44.914652Z"
> "2024-09-23T17:56:44.914672Z"
> "2024-09-23T17:56:44.914692Z"
> "2024-09-23T17:56:44.914711Z"
> "2024-09-23T17:56:44.914732Z"
> 
> whereas on working clusters the `created` attribute is set to the time the 
> device information was last cached, e.g.
> 
> ceph config-key dump | jq -r .'"mgr/cephadm/host.ceph-osd1.devices.0"' | jq 
> .devices[].created
> "2024-10-28T21:49:29.510593Z"
> "2024-10-28T21:49:29.510635Z"
> "2024-10-28T21:49:29.510657Z"
> "2024-10-28T21:49:29.510678Z"
> 
> It appears that the orchestrator is polling the devices but failing to update 
> the cache for some reason. It would be interesting to see what happens if I 
> removed one of these device entries from the cache, but the cluster is in 
> production so I’m hesitant to poke at it.
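> 
> For anyone who wants to check their own clusters for this symptom, the staleness
> check above can be scripted. The sketch below is a minimal, hedged example: the
> sample data is hypothetical but mimics the `ceph config-key dump` shape shown
> above, where each `mgr/cephadm/host.<host>.devices.0` value is itself a JSON
> string containing a `devices` list with `created` timestamps. In practice you
> would feed it the real dump, e.g. `json.load()` of `ceph config-key dump` output.
> 
> ```python
> import json
> from datetime import datetime, timedelta, timezone
> 
> # Hypothetical sample mimicking `ceph config-key dump` output: each
> # mgr/cephadm/host.<host>.devices.0 value is a JSON *string* whose
> # "devices" list entries carry a "created" timestamp.
> sample_dump = {
>     "mgr/cephadm/host.ceph-osd31.devices.0": json.dumps({
>         "devices": [{"created": "2024-09-23T17:56:44.914535Z"}]
>     }),
>     "mgr/cephadm/host.ceph-osd1.devices.0": json.dumps({
>         "devices": [{"created": "2024-10-28T21:49:29.510593Z"}]
>     }),
> }
> 
> def stale_hosts(dump, now, max_age=timedelta(hours=1)):
>     """Return hosts whose newest cached device entry is older than max_age."""
>     stale = []
>     for key, value in dump.items():
>         if not key.startswith("mgr/cephadm/host.") or ".devices." not in key:
>             continue
>         host = key.split(".")[1]  # "mgr/cephadm/host.<host>.devices.0"
>         devices = json.loads(value).get("devices", [])
>         created = [
>             datetime.fromisoformat(d["created"].replace("Z", "+00:00"))
>             for d in devices if "created" in d
>         ]
>         if created and now - max(created) > max_age:
>             stale.append(host)
>     return stale
> 
> now = datetime(2024, 10, 28, 22, 0, tzinfo=timezone.utc)
> print(stale_hosts(sample_dump, now))  # only ceph-osd31's cache is weeks old
> ```
> 
> The 1-hour threshold is arbitrary; the orchestrator normally refreshes devices
> far more often than that, so anything older points at the same stale-cache
> behaviour described above.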
> 
> We have a maintenance window scheduled in December which will provide an 
> opportunity to perform a complete restart of the cluster. Hopefully that will 
> clean things up. In the meantime, I’ve set all devices to be unmanaged, and 
> the cluster is otherwise healthy, so unless anyone has any other ideas to 
> offer I guess I’ll just leave things as-is until the maintenance window.
> 
> Cheers,
> /rjg

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]