Hi,

I added Adam in CC (Cephadm lead). Maybe he can tell us more about how this information is gathered / parsed.

On 9/17/25 11:24, Boris wrote:
Hi Stefan,

We run the 18.2.7 version.

I turned on the debug logs and this looks really weird:

[DBG] public networks ['PREFIX:22::/64']
[DBG] cluster networks []
[DBG] processing data from 12 hosts
[DBG] checking public network membership for: ['host-20', 'host-13', 'host-19', 'host-18', 'host-12', 'host-16', 'host-14', 'host-15', 'host-22', 'host-17', 'host-23', 'host-21']
[DBG] checking network PREFIX:22::/64
[DBG] subnet data - {"subnet": "PREFIX:22::/64", "mtu_map": {"9100": ["host-13", "host-12", "host-16", "host-14", "host-15", "host-17", "host-21"]}, "speed_map": {"20000": ["host-13", "host-12", "host-16", "host-14", "host-17", "host-21"], "10000": ["host-15"]}}
[DBG] processing mtu map : {"9100": ["host-13", "host-12", "host-16", "host-14", "host-15", "host-17", "host-21"]}
[DBG] MTU problems detected
[DBG] most hosts using 9100
[DBG] processing subnet : {"subnet": "PREFIX:22::/64", "mtu_map": {"9100": ["host-13", "host-12", "host-16", "host-14", "host-15", "host-17", "host-21"]}, "speed_map": {"20000": ["host-13", "host-12", "host-16", "host-14", "host-17", "host-21"], "10000": ["host-15"]}}
[DBG] linkspeed issue(s) detected
[DBG] most hosts using 10000
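As an aside, the speed_map above shows six hosts at 20000 and only host-15 at 10000, yet the log claims "most hosts using 10000", which looks inverted. For what it's worth, the kind of majority check these messages suggest could be sketched like this (a hypothetical illustration, not the actual cephadm code):

```python
from collections import Counter

def flag_outliers(value_map):
    """value_map has the shape of the mtu_map / speed_map debug output,
    e.g. {"9100": ["host-13", ...]}. Returns (majority_value, outlier_hosts).
    Hypothetical sketch, not the real cephadm implementation."""
    counts = Counter({v: len(hosts) for v, hosts in value_map.items()})
    majority, _ = counts.most_common(1)[0]
    outliers = [h for v, hosts in value_map.items()
                if v != majority for h in hosts]
    return majority, outliers

speed_map = {"20000": ["host-13", "host-12", "host-16",
                       "host-14", "host-17", "host-21"],
             "10000": ["host-15"]}
print(flag_outliers(speed_map))  # ('20000', ['host-15'])
```

With the data from your log, the majority should clearly be 20000, which is why the "most hosts using 10000" line stands out.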

But when I check one of the hosts (host-22), which does not show up in the subnet or MTU map, and run "cephadm --image quay.io/ceph/ceph@sha256:1b9158ce28975f95def6a0ad459fa19f1336506074267a4b47c1bd914a00fec0 gather-facts", I get the following for the interface that should have the prefix:

     "bond0.22": {
       "driver": "",
       "iftype": "logical",
       "ipv4_address": "",
       "ipv6_address": "fe80::.../64",
       "lower_devs_list": [
         "bond0"
       ],
       "mtu": 9100,
       "nic_type": "ethernet",
       "operstate": "up",
       "speed": 20000,
       "upper_devs_list": []
     }

But running "cephadm --image quay.io/ceph/ceph@sha256:1b9158ce28975f95def6a0ad459fa19f1336506074267a4b47c1bd914a00fec0 list-networks" I get:

...,
     "PREFIX:22::/64": {
         "bond0.22": [
             "PREFIX:22::186"
         ]
     },
...,
     "fe80::/64": {
...,
         "bond0.22": [
             "fe80::..."
         ],
...
     }
}

So why doesn't it show up in the 9100 mtu_map and in the 20000 speed_map?
And why does it use the link local address in the gather-facts section?

I don't know. Does not make sense to me.
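One thing that is clear, though: a link-local address can never satisfy a membership test against a global public network. A quick sketch with Python's ipaddress module (2001:db8:22::/64 is a documentation-prefix stand-in for your redacted PREFIX:22::/64; whether cephadm performs exactly this check is an assumption on my part):

```python
import ipaddress

# Stand-in for the redacted public network PREFIX:22::/64.
public_net = ipaddress.ip_network("2001:db8:22::/64")

global_addr = ipaddress.ip_address("2001:db8:22::186")  # as in list-networks
link_local = ipaddress.ip_address("fe80::1")            # as in gather-facts

print(global_addr in public_net)   # True
print(link_local in public_net)    # False
print(link_local.is_link_local)    # True
```

So if gather-facts only records the fe80:: address for bond0.22, as your output above shows, the public-network check for that host is bound to fail, even though list-networks sees the global address.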


I cross-checked with another host that actually does show up: in gather-facts it got the correct prefix, but its list-networks output looks the same, including the fe80::/64 network.

Is there a way to prioritize the configured IP addresses in the output and only use the link-local addresses when there is no other IP address? I really wouldn't like to disable link-local addresses, because I want the ceph nodes in different clusters configured as similarly as possible. We have some clusters that use RA on the mgmt interface and some that have a static gateway configured.
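The prioritization described above could look like this in principle: prefer globally routable addresses per interface and fall back to link-local only when nothing else is configured. A hypothetical helper, not an existing cephadm option:

```python
import ipaddress

def pick_address(addrs):
    """Given a list of 'addr/prefixlen' strings for one interface, prefer
    non-link-local addresses and fall back to link-local only if there is
    nothing else. Hypothetical sketch, not part of cephadm."""
    parsed = [ipaddress.ip_address(a.split("/")[0]) for a in addrs]
    preferred = [a for a in parsed if not a.is_link_local]
    return str((preferred or parsed)[0])

print(pick_address(["fe80::1/64", "2001:db8:22::186/64"]))  # 2001:db8:22::186
print(pick_address(["fe80::1/64"]))                         # fe80::1
```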

Disabling link-local addresses would break IPv6 networking entirely.

I repeated your tests on two different clusters and can confirm what you are seeing. "gather-facts" gives correct results for some hosts, but not for others. In some cases the management IP shows up but not the Ceph public interface; in other cases only link-local addresses show up.

Note, we do not use linux bonding or bridging but we use Open vSwitch bridges, bonds and interfaces. Each host has a dedicated "ceph" interface with the public IP on it (bound to an uplink bridge, which is bound to a bond).

One thing I noticed is how the hostname is derived. If a line like this is present in /etc/hosts:

::1     ip6-localhost ip6-loopback mon2

The hostname in "gather-facts" will be "ip6-localhost" instead of mon2.
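That is consistent with the first name on a matching /etc/hosts line winning; later aliases are ignored. A small sketch of that first-alias behavior (parsing the file contents directly rather than going through the resolver):

```python
def first_name_for(addr, hosts_text):
    """Return the canonical (first) name on the /etc/hosts line whose
    address field matches addr, or None. Comments are stripped."""
    for line in hosts_text.splitlines():
        fields = line.split("#")[0].split()
        if len(fields) > 1 and fields[0] == addr:
            return fields[1]  # first name wins; later aliases are ignored
    return None

hosts = "::1     ip6-localhost ip6-loopback mon2\n"
print(first_name_for("::1", hosts))  # ip6-localhost
```

So moving "mon2" to the front of that line (or off the ::1 line entirely) should presumably make "gather-facts" report the expected hostname.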

Another thing: the "--image" part of "cephadm --image quay.io/ceph/ceph@sha256:1b9158ce28975f95def6a0ad459fa19f1336506074267a4b47c1bd914a00f gather-facts" does not seem to work. It makes no difference if I point it at a non-existent image, so I doubt this option has any effect at all.

Gr. Stefan



On Wed., Sep 17, 2025 at 09:06, Stefan Kooman <[email protected]> wrote:

    On 9/16/25 18:34, Boris wrote:
     > Hi,
     >
     > I am currently debugging an issue with the ceph config checks.
     > We have some random hosts that alert
     >
     > "HOSTNAME does not have an interface on any public network"
     >
     > but they have. It is IPv6, static configured and, because we
    don't have a
     > cluster_network, OSDs are bound to that IP in the specific network.
     >
     > I went through the netplan config and they are basically the same
    on all
     > hosts.
     > And after rebooting the hosts some of them resolved and some didn't.
     >
     > How can I dig deeper to figure out what is going on.


    Can you find some log output related to these events (docu here [1])?

     >
     > All services are ceph-orch podman containers
     > All hosts are ubuntu 22.04 with latest HWE kernel (6.8.0-79-generic)

    What version of Ceph are you running? We have ubuntu 22.04 clusters
    configured exactly like this (IPv6 only), so I'm really curious. We
    haven't seen this behavior yet (18.2.4).

    Gr. Stefan

    [1]:
    https://docs.ceph.com/en/latest/cephadm/operations/#watching-cephadm-log-messages



--
The self-help group "UTF-8 problems" will, as an exception, meet in the large hall this time.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]