Erwan David wrote:
> Everything in /etc/network/interfaces.
>
> It is a bit complicated, let me explain the situation before going to
> configuration:

Actually your situation sounds pretty normal to me.

> # The primary network interface
> auto eth0
> iface eth0 inet static
>     address 88.190.17.120
>     netmask 255.255.255.0
>     gateway 88.190.17.1
>     up ip addr add 88.191.245.121/32 dev eth0 label eth0:0
>     up ip -6 addr add 2001:0bc8:30d3::1/64 dev eth0
>     down ip addr del 88.191.245.121/32 dev eth0 label eth0:0
>     down ip -6 addr del 2001:0bc8:30d3::1/64 dev eth0

I don't see anything unusual there. However I am not an IPv6 expert
and still need to learn the details of it. The IPv4 parts look
perfectly reasonable. I have no reason to doubt the IPv6 parts.

> 88.190.17.120 is the "private" address (if I change server I will get
> another address) 88.191.245.121 and 2001:0bc8:30d3::1 are the "public
> addresses", because I may migrate them to another machine at same
> hoster, making them more robust for public facing services (web email
> and ntp server in pool.ntp.org for this one)

Yes. A common strategy. Looks good.

> The router for IPv6 is given through the RA (I have the correct sysctl
> set up for accepting the RA *and* routing IPv6)

I will assume it is good. The important thing is that it will start
up using ifupdown. It is set to use "auto" meaning that it will start
synchronously at system boot time. If it were using "allow-hotplug"
then it would use the current standard event driven interface. The
two startup paths should both work but they are different. It is
certainly possible for them to behave differently with one path
working and one not working. For example, I have problems with NIS/yp
with the allow-hotplug event driven path but it works with the auto
path. (I need to debug that to root cause some day.)

> > Just for the purposes of debugging if you are using "allow-hotplug"
> > then try switching that to "auto". In theory allow-hotplug should
> > always work but since it is the newer event driven method sometimes
> > there are still bugs to be found. It is possible that your case is
> > one of those.
> > Try "auto" instead and see if that older start ordering
> > causes things to work in the correct way.
>
> I always use auto for fixed machines, like this server.

I see by this that you are already aware of the issues and understand
the differences between them. I will still say a lot for the archive
because it might help someone else looking at the problem later.

But then my question would be the reverse. If you were to switch to
allow-hotplug would that cause things to happen differently and
perhaps work? It would be something to try. Although I am sure you
don't want to thrash your production server. Trying these experiments
on a local victim development machine or VM would be good.

Since you are using "auto" then the dependencies declared in the LSB
headers in the /etc/init.d/* scripts should drive the placement in the
boot order in the /etc/rc2.d/S* symlinks. Things should work in that
order. If things do not work in that order then that is the problem
to find and fix.

Also when the interface starts up it will execute the scripts
registered in /etc/network/if-*.d/* and those will happen at the time
when the interface status changes. But I doubt that is the problem
here since by definition if-up.d/foo would happen after the interface
is up and your problem is something happening before then.

> resolv.conf is
>
> search rail.eu.org
> nameserver 127.0.0.1

Just to verify, no "resolvconf" installed?

> unbound listen on loopback when it is started:
>
> unbound 3048 unbound 3u IPv4 11035 0t0 UDP 127.0.0.1:domain
> unbound 3048 unbound 4u IPv4 11036 0t0 TCP 127.0.0.1:domain (LISTEN)
> unbound 3048 unbound 5u IPv6 11037 0t0 UDP [::1]:domain
> unbound 3048 unbound 6u IPv6 11038 0t0 TCP [::1]:domain (LISTEN)

I think I will guess that the problem is that "auto" is the old path
through the system boot. Something in your use of 'unbound' isn't set
up for that path. Dig into how unbound starts.

  $ ls -1 /etc/rcS.d/S*
  $ ls -1 /etc/rc2.d/S*

Look over that list and verify that it should be starting networking
in /etc/rcS.d/S*networking and that unbound starts up when it is
supposed to start up. For example for me:

  /etc/rcS.d/S15networking
  /etc/rc2.d/S03bind9

Everyone's numbers will be different of course since those are
determined by the installed set of LSB headers from the /etc/init.d/*
files. The numbers do not matter. They are set dynamically by
'insserv'.

> > The errors you showed in the log file were from dns name resolution
> > failures. How are nameservers configured for your machine? Are you
> > using DHCP to set them? Or are they statically defined? Are you
> > running a local machine nameserver daemon such as bind9 or dnsmasq or
> > other? What is in the /etc/resolv.conf file?
>
> I use 2 dns servers, on different IP addresses : NSD on public
> addresses, authoritative for the rail.eu.org zone and
> 2001:0bc8:30d3::/48 reverse zone, unbound on loopback and
> 88.190.17.120 as recursive server for my small infrastructure

Seems reasonable.

> But the problem is not here. I realize that my choice of logs was
> rather poor. Here is another excerpt that I will comment
> ...
> Dec 10 18:21:24 tee kernel: [ 15.347685] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> -> link not ready : no IPv6

Hmm... I don't know. You will have to keep digging until you
understand it. Sorry.

> Dec 10 18:21:24 tee ntpd[2361]: ntpd 4.2.6p5@1.2349-o Mon May 20 14:24:35 UTC 2013 (1)
> -> Here we see that ntpd is started before NIC is ready, and the IPv6
> address is clearly absent

For me ntp starts at /etc/rc2.d/S02ntp and of course networking
started up at /etc/rcS.d/S15networking. Just for the particular case
of ntp it could start sooner because it is quite smart and is coded to
watch interfaces. If a new interface is added after ntp starts then
ntp will notice and you will see it adopt it in the logs.
But I realize this is just for ntp and your problem is otherwise.

> Dec 10 18:21:26 tee kernel: [ 18.584193] bnx2 0000:02:00.0 eth0: NIC Copper Link is Up, 1000 Mbps full duplex
> Dec 10 18:21:26 tee kernel: [ 18.584196] , receive & transmit flow control ON
> Dec 10 18:21:26 tee kernel: [ 18.584281] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
>
> -> There the link becomes ready, but too late...

I would deeply investigate /etc/init.d/networking and see what it is
doing. It appears that it is releasing and moving on before the link
is ready. That shouldn't happen. When it releases and moves on the
link should be ready to go.

This may be confused if something in /etc/network/if-up.d/* is
actually triggering these other daemons to start, as if they are
using the event driven interface. Beware. But again, they shouldn't
be running until the interface is up. But obviously something is.

> Dec 10 18:21:26 tee ntpd_intres[2519]: host name not found: canon.inria.fr
> Dec 10 18:21:26 tee ntpd_intres[2519]: host name not found: ntp.obspm.fr
> Dec 10 18:21:26 tee ntpd_intres[2519]: host name not found: ntp.sceen.net
> Dec 10 18:21:26 tee ntpd_intres[2519]: host name not found: clock.tix.ch
> Dec 10 18:21:26 tee ntpd_intres[2519]: host name not found: ntp.proserve.nl
>
> -> Here we see that no DNS yet : unbound did not start

For ntp in particular the LSB header in /etc/init.d/ntp does not
declare a dependency upon the nameserver. Therefore they are started
together. For ntp it may start before the nameserver. I don't know
if ntp retries dns names however. I know ntp is okay with dynamic
interfaces but I don't know if it retries names. But if it required a
name then I would think the LSB header would require $named. It
doesn't and therefore if it isn't a bug then it apparently doesn't
need it. I find it hard to believe that a bug like that would have
gone unaddressed for so long.
Therefore I tend to believe that it is okay to have ntp start before a
named.

> -> IPv6 is now on, ntpd adapts...

Yes. The ntpd is actually a very solid program.

> Dec 10 18:21:33 tee unbound-anchor: /var/lib/unbound/root.key has content
> Dec 10 18:21:33 tee unbound-anchor: success: the anchor is ok
> Dec 10 18:21:33 tee unbound: [3048:0] notice: init module 0: validator
> Dec 10 18:21:33 tee unbound: [3048:0] notice: init module 1: iterator
> Dec 10 18:21:33 tee unbound: [3048:0] info: start of service (unbound 1.4.21).
>
> -> Recursive DNS resolver starts now...

Since there are no dependencies between ntp and $named provided by
unbound (I assume unbound provides $named?) then those are allowed to
run asynchronously to each other. That is okay by definition of the
dependencies. If not then the dependencies should be changed. (The
/etc/insserv/overrides/* directory is convenient for this or for
experiments.)

> In parallel let's see the logs for the DHCPv6 client :
> ...
> 2013.12.10 18:21:24 Client Critical Interface eth0/2 is down or doesn't have any link-local address.
> ...
>
> See it starts at 18:21:24 2 s before link is on, and fails : The IPv6
> network is thus not routed to me...

I haven't worked with IPv6 yet. Still stuck in the last decade here.
I didn't see where the dhcpv6 client is supposed to start. Since you
are bringing up the IPv6 interface with "up" in the
/etc/network/interfaces file I wouldn't expect anything else to happen
at that point. So how is the dhcpv6 client getting triggered to
start? That seems likely to be related to the problem.

> So the problem is clearly that several network services (ntpd, nsd,
> dibbler-client at least) are started *before* interface is
> completely up.

You will have to keep digging to understand it. Use a victim machine
close by you for development and debugging until you know enough to
fix your production server.

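One way to start that digging is to look at what the init scripts
actually declare in their LSB headers. The following is only a rough
sketch, assuming the scripts carry the usual "### BEGIN INIT INFO"
block. The helper name lsb_field is made up for illustration, and the
two script paths are just the ones under discussion here; they may not
exist or may be named differently on a given machine.

```shell
#!/bin/sh
# Hypothetical helper (not part of insserv or any Debian package):
# print the value of one field from the LSB header of an init script.
# Usage: lsb_field FILE FIELD
lsb_field() {
    # Keep only the block between the BEGIN/END INIT INFO markers,
    # then strip the "# FIELD:" prefix from the matching line.
    sed -n '/^### BEGIN INIT INFO/,/^### END INIT INFO/p' "$1" |
        sed -n "s/^# $2:[[:space:]]*//p"
}

# Show what each script provides and what it asks to be started after.
# The paths are assumptions; scripts that are not installed are skipped.
for script in /etc/init.d/ntp /etc/init.d/unbound; do
    [ -r "$script" ] || continue
    for field in Provides Required-Start Should-Start; do
        printf '%s %s: %s\n' "$script" "$field" "$(lsb_field "$script" "$field")"
    done
done
```

If $named does not show up in the ntp lines then nothing in the
headers asks for ntp to wait for the nameserver, which would be
consistent with the ordering seen in the logs.
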
On the victim machine I would be inclined to single step through
single user mode and start each init script individually. That would
make it easier to understand the parts and what should be happening
when. I imagine that at some point you will trigger an event and then
something completely unexpected will happen and at that point you will
have found the clue. Or it will all work perfectly on your victim
machine. In which case you will have an A-B comparison available
between your test setup and your production machine.

On the victim machine I would be inclined to hack on
/etc/init.d/networking and verify that networking is indeed up and
running completely at the end of the script when it passes control on.
If not then backtrack from there.

> Thanks for reading this long message, but I think this log
> examination is important to understand what happens.

Yes. That helped. Sorry no answers yet though.

Bob