Hi,
I added Adam to CC (the Cephadm lead). Maybe he can tell us more about
how this information is gathered / parsed.
On 9/17/25 11:24, Boris wrote:
Hi Stefan,
We run the 18.2.7 version.
I turned on the debug logs and this looks really weird:
[DBG] public networks ['PREFIX:22::/64']
[DBG] cluster networks []
[DBG] processing data from 12 hosts
[DBG] checking public network membership for: ['host-20', 'host-13',
'host-19', 'host-18', 'host-12', 'host-16', 'host-14', 'host-15',
'host-22', 'host-17', 'host-23', 'host-21']
[DBG] checking network PREFIX:22::/64
[DBG] subnet data - {"subnet": "PREFIX:22::/64", "mtu_map": {"9100":
["host-13", "host-12", "host-16", "host-14", "host-15", "host-17",
"host-21"]}, "speed_map": {"20000": ["host-13", "host-12", "host-16",
"host-14", "host-17", "host-21"], "10000": ["host-15"]}}
[DBG] processing mtu map : {"9100": ["host-13", "host-12", "host-16",
"host-14", "host-15", "host-17", "host-21"]}
[DBG] MTU problems detected
[DBG] most hosts using 9100
[DBG] processing subnet : {"subnet": "PREFIX:22::/64", "mtu_map":
{"9100": ["host-13", "host-12", "host-16", "host-14", "host-15",
"host-17", "host-21"]}, "speed_map": {"20000": ["host-13", "host-12",
"host-16", "host-14", "host-17", "host-21"], "10000": ["host-15"]}}
[DBG] linkspeed issue(s) detected
[DBG] most hosts using 10000
But when I check one of the hosts (host-22) that does not show up in the
subnet or MTU map and run
"cephadm --image quay.io/ceph/ceph@sha256:1b9158ce28975f95def6a0ad459fa19f1336506074267a4b47c1bd914a00fec0 gather-facts",
I get the following for the interface which should have the prefix:
"bond0.22": {
"driver": "",
"iftype": "logical",
"ipv4_address": "",
"ipv6_address": "fe80::.../64",
"lower_devs_list": [
"bond0"
],
"mtu": 9100,
"nic_type": "ethernet",
"operstate": "up",
"speed": 20000,
"upper_devs_list": []
}
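Note that gather-facts above only reports the link-local address for
bond0.22. If the config check compares the gathered address against the
public network, such a host would silently drop out of both maps. Here is a
minimal sketch of that kind of per-subnet aggregation (my guess at the
logic, not the actual cephadm code; the documentation prefix
2001:db8:22::/64 stands in for the redacted PREFIX, and the facts dict is
made up for illustration):

```python
import ipaddress

# Hypothetical per-host interface facts, shaped like "gather-facts" output.
# host-22 only reports its link-local address (as seen above); host-21
# reports a global address inside the public network.
facts = {
    "host-21": {"ipv6_address": "2001:db8:22::185/64", "mtu": 9100, "speed": 20000},
    "host-22": {"ipv6_address": "fe80::1234/64", "mtu": 9100, "speed": 20000},
}

public_net = ipaddress.ip_network("2001:db8:22::/64")

mtu_map, speed_map = {}, {}
for host, iface in facts.items():
    addr = ipaddress.ip_interface(iface["ipv6_address"]).ip
    # A host whose reported address is not inside the public network is
    # silently skipped -- it never reaches either map.
    if addr not in public_net:
        continue
    mtu_map.setdefault(str(iface["mtu"]), []).append(host)
    speed_map.setdefault(str(iface["speed"]), []).append(host)

print(mtu_map)    # host-22 is missing: only its fe80:: address was gathered
print(speed_map)
```

If this guess is right, the question becomes why gather-facts sometimes
picks up only the link-local address in the first place.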
But running
"cephadm --image quay.io/ceph/ceph@sha256:1b9158ce28975f95def6a0ad459fa19f1336506074267a4b47c1bd914a00fec0 list-networks"
I get
...,
"PREFIX:22::/64": {
"bond0.22": [
"PREFIX:22::186"
]
},
...,
"fe80::/64": {
...,
"bond0.22": [
"fe80::..."
],
...
}
}
So why doesn't it show up in the 9100 mtu_map and in the 20000 speed_map?
And why does it use the link-local address in the gather-facts section?
I don't know. It does not make sense to me.
I cross-checked with another host that actually shows up: in
gather-facts it has the correct prefix, but its list-networks output
looks the same, including the fe80::/64 network.
Is there a way to prioritize the configured IP addresses in the output
and use the link-local addresses only when there is no other IP address?
I really wouldn't like to disable the link-local addresses, because I
want the ceph nodes in different clusters to be configured as similarly
as possible. We have some clusters that use RA on the mgmt interface and
some that have a static gateway configured.
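The kind of address selection asked for above could look like this (a
sketch of the desired behavior using the Python ipaddress module, not
existing cephadm code; the addresses are made up):

```python
import ipaddress

def pick_address(addrs):
    """Prefer a configured (global) address; use link-local only as a last resort."""
    parsed = [ipaddress.ip_address(a) for a in addrs]
    for a in parsed:
        if not a.is_link_local:
            return str(a)
    # Nothing but link-local (e.g. mgmt interface before RA kicks in):
    return str(parsed[0]) if parsed else None

# An interface carrying both a global and a link-local address:
print(pick_address(["fe80::1234", "2001:db8:22::186"]))  # -> 2001:db8:22::186
# An interface with only a link-local address left:
print(pick_address(["fe80::1234"]))                      # -> fe80::1234
```

With a rule like this, link-local addresses would never need to be
disabled on the hosts; they would just lose to configured addresses.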
Disabling link-local addresses would break IPv6 networking entirely.
I repeated your tests on two different clusters and can confirm what you
are seeing. "gather-facts" gives correct results for some hosts, but not
for others. In some cases the management IP shows up but not the Ceph
public interface; in other cases only link-local addresses show up.
Note that we do not use Linux bonding or bridging; we use Open vSwitch
bridges, bonds and interfaces. Each host has a dedicated "ceph"
interface with the public IP on it (bound to an uplink bridge, which in
turn is bound to a bond).
One thing I noticed is how the hostname is derived. If a line like this
is present in /etc/hosts:
::1 ip6-localhost ip6-loopback mon2
the hostname reported by "gather-facts" will be "ip6-localhost" instead
of "mon2".
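That would be consistent with a naive /etc/hosts reverse lookup that
takes the first name on the matching line (a guess at the mechanism, not
verified against the cephadm source; `lookup_name` is a hypothetical
helper for illustration):

```python
def lookup_name(hosts_text, addr):
    """Naive /etc/hosts reverse lookup: first name on the first matching line."""
    for line in hosts_text.splitlines():
        # Strip comments, then split into "address name [aliases...]".
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == addr:
            return fields[1]  # canonical name = first name after the address
    return None

hosts = "::1 ip6-localhost ip6-loopback mon2\n"
print(lookup_name(hosts, "::1"))  # -> ip6-localhost, not mon2
```

Putting the real hostname first on the ::1 line (or on its own line)
would sidestep a lookup of this shape.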
Another thing: the "--image" part of "cephadm --image
quay.io/ceph/ceph@sha256:1b9158ce28975f95def6a0ad459fa19f1336506074267a4b47c1bd914a00f
gather-facts" does not seem to work. Even a reference to a non-existent
image makes no difference, so I doubt this option has any effect at all.
Gr. Stefan
On Wed, Sept 17, 2025 at 09:06, Stefan Kooman <[email protected]> wrote:
On 9/16/25 18:34, Boris wrote:
> Hi,
>
> I am currently debugging an issue with the ceph config checks.
> We have some random hosts that alert
>
> "HOSTNAME does not have an interface on any public network"
>
> but they do. It is IPv6, statically configured and, because we don't
> have a cluster_network, OSDs are bound to that IP in the specific
> network.
>
> I went through the netplan config and it is basically the same on all
> hosts.
> And after rebooting the hosts some of them resolved and some didn't.
>
> How can I dig deeper to figure out what is going on?
Can you find some log output related to these events (documentation here [1])?
>
> All services are ceph-orch podman containers
> All hosts are Ubuntu 22.04 with the latest HWE kernel (6.8.0-79-generic)
What version of Ceph are you running? We have Ubuntu 22.04 clusters
configured exactly like this (IPv6 only), so I'm really curious. We
haven't seen this behavior yet (18.2.4).
Gr. Stefan
[1]:
https://docs.ceph.com/en/latest/cephadm/operations/#watching-cephadm-log-messages
--
This time, as an exception, the "UTF-8 problems" self-help group meets
in the large hall.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]