Re: Network services started before NIC UP.

Bob Proulx Sat, 21 Dec 2013 13:11:37 -0800

Erwan David wrote:
> Everything in /etc/network/interfaces.
> 
> It is a bit complicated, so let me explain the situation before going to
> the configuration:


Actually your situation sounds pretty normal to me.

> # The primary network interface
> auto eth0
> iface eth0 inet static
>     address 88.190.17.120
>     netmask 255.255.255.0
>     gateway 88.190.17.1
>     up ip addr add 88.191.245.121/32 dev eth0 label eth0:0
>     up ip -6 addr add 2001:0bc8:30d3::1/64 dev eth0
>     down ip addr del 88.191.245.121/32 dev eth0 label eth0:0
>     down ip -6 addr del 2001:0bc8:30d3::1/64 dev eth0

I don't see anything unusual there.  However I am not an IPv6 expert
and still need to learn the details of it.  The IPv4 parts look
perfectly reasonable.  I have no reason to doubt the IPv6 parts.

> 88.190.17.120 is the "private" address (if I change server I will get
> another address).  88.191.245.121 and 2001:0bc8:30d3::1 are the "public
> addresses", because I may migrate them to another machine at the same
> hoster, making them more robust for public facing services (web, email
> and ntp server in pool.ntp.org for this one)

Yes.  A common strategy.  Looks good.

> The router for IPv6 is given through the RA (I have the correct sysctl
> set up for accepting the RA *and* routing IPv6)

I will assume it is good.

The important thing is that it will start up using ifupdown.  It is
set to use "auto" meaning that it will start synchronously at system
boot time.  If it were using "allow-hotplug" then it would use the
current standard event driven interface.  The two startup paths should
both work but they are different.  It is certainly possible for them
to behave differently with one path working and one not working.  I
have problems with NIS/yp with the allow-hotplug event driven path but
it works with the auto path for example.  (I need to debug that to
root cause some day.)
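To see at a glance which path each interface takes, something like the
following can help.  It is only a sketch: the function name is made up
for illustration, and it reads a single file without following any
"source" lines that file may contain.

```shell
#!/bin/sh
# Print "<mode> <interface>" for every "auto" or "allow-hotplug" line.
# list_iface_modes is an illustrative name, not an ifupdown tool.
list_iface_modes() {
    # $1: path to an interfaces(5) style file
    awk '$1 == "auto" || $1 == "allow-hotplug" {
             for (i = 2; i <= NF; i++) print $1, $i
         }' "$1"
}

# Only run against the real file if it exists on this machine.
if [ -r /etc/network/interfaces ]; then
    list_iface_modes /etc/network/interfaces
fi
```

An interface listed as "auto" is brought up by the networking init
script at boot.  One listed as "allow-hotplug" is brought up when the
kernel announces the device.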

> > Just for the purposes of debugging if you are using "allow-hotplug"
> > then try switching that to "auto".  In theory allow-hotplug should
> > always work but since it is the newer event driven method sometimes
> > there are still bugs to be found.  It is possible that your case is
> > one of those.  Try "auto" instead and see if that older start ordering
> > causes things to work in the correct way.
> 
> I always use auto for fixed machines, like this server.

I see by this that you are already aware of the issues and understand
the differences between the two.  I will still say a lot for the archive
because it might help someone else looking at the problem later.

But then my question would be the reverse.  If you were to switch to
allow-hotplug would that cause things to happen differently and
perhaps work?  It would be something to try.  Although I am sure you
don't want to thrash your production server.  Trying these experiments
on a local victim development machine or VM would be good.

Since you are using "auto" then the numbers defined in the LSB headers
in the /etc/init.d/* scripts should drive the placement in the boot
order in the /etc/rc2.d/S* symlinks.  Things should work in that
order.  If things do not work in that order then that is the problem
to find and fix.
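To check what a script declares, the LSB header can be pulled out
directly.  This is a rough sketch.  The function name is hypothetical,
and the script names in the loop are simply the ones that came up in
this thread; they may be named differently on your system.

```shell
#!/bin/sh
# Print the value of one LSB header field from an init script.
# lsb_field is an illustrative helper, not part of insserv.
lsb_field() {
    # $1: init script path, $2: field name such as Required-Start
    sed -n '/^### BEGIN INIT INFO/,/^### END INIT INFO/p' "$1" |
        sed -n "s/^# *$2:[[:space:]]*//p"
}

# Show the start dependencies of the scripts discussed here, if present.
for name in networking ntp unbound nsd; do
    if [ -r "/etc/init.d/$name" ]; then
        printf '%s: %s\n' "$name" \
            "$(lsb_field "/etc/init.d/$name" Required-Start)"
    fi
done
```

If a daemon needs the network but does not list $network (or something
that depends on it) in Required-Start, then insserv is free to start it
earlier than you expect.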

Also when the interface starts up it will execute the scripts
registered in /etc/network/if-*.d/* and those will happen at the time
when the interface status changes.  But I doubt that is the problem
here since by definition if-up.d/foo would happen after the interface
is up and your problem is something happening before then.

> resolv.conf is 
> 
> search rail.eu.org
> nameserver 127.0.0.1

Just to verify, no "resolvconf" installed?

> unbound listen on loopback when it is started:
> 
> unbound 3048 unbound    3u  IPv4  11035      0t0  UDP 127.0.0.1:domain 
> unbound 3048 unbound    4u  IPv4  11036      0t0  TCP 127.0.0.1:domain (LISTEN)
> unbound 3048 unbound    5u  IPv6  11037      0t0  UDP [::1]:domain 
> unbound 3048 unbound    6u  IPv6  11038      0t0  TCP [::1]:domain (LISTEN)

I think I will guess that the problem is that "auto" is the old path
through the system boot.  Something in your use of 'unbound' isn't set
up for that path.  Dig into how unbound starts.

  $ ls -1 /etc/rcS.d/S*
  $ ls -1 /etc/rc2.d/S*

Look over that list and verify that it should be starting networking
in /etc/rcS.d/S*networking and that unbound starts up when it is
supposed to start up.  For example for me:

  /etc/rcS.d/S15networking
  /etc/rc2.d/S03bind9

Everyone's numbers will be different of course since those are
determined by the installed set of LSB headers from the /etc/init.d/*
files.  The numbers do not matter.  They are set dynamically by
'insserv'.
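Rather than reading the whole listing by eye, the start links for the
services in question can be picked out.  A small sketch, with a made-up
function name, assuming the usual sysvinit layout where everything in
rcS.d runs before anything in rc2.d:

```shell
#!/bin/sh
# Print the start symlinks for one service in the given rc directories.
# show_start_links is an illustrative helper name.
show_start_links() {
    # $1: service name; remaining arguments: rc directories to search
    name=$1
    shift
    for dir in "$@"; do
        for link in "$dir"/S[0-9][0-9]"$name"; do
            # An unmatched glob stays literal and is skipped here.
            [ -e "$link" ] || [ -L "$link" ] || continue
            printf '%s\n' "$link"
        done
    done
}

# Services from this thread; the names may differ on your machine.
for svc in networking unbound nsd ntp; do
    show_start_links "$svc" /etc/rcS.d /etc/rc2.d
done
```

Within one directory the lower number starts first.  A link in rcS.d
starts before any link in rc2.d regardless of the number.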

> > The errors you showed in the log file were from dns name resolution
> > failures.  How are nameservers configured for your machine?  Are you
> > using DHCP to set them?  Or are they statically defined?  Are you
> > running a local machine nameserver daemon such as bind9 or dnsmasq or
> > other?  What is in the /etc/resolv.conf file?
> 
> I use 2 dns servers, on different IP addresses : NSD on public
> addresses, authoritative for the rail.eu.org zone and
> 2001:0bc8:30d3::/48 reverse zone, unbound on loopback and
> 88.190.17.120 as recursive server for my small infrastructure

Seems reasonable.

> But the problem is not here. I realize that my choice of logs was
> rather poor. Here is another excerpt that I will comment
> ...
> Dec 10 18:21:24 tee kernel: [   15.347685] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> -> link not ready : no IPv6

Hmm...  I don't know.  You will have to keep digging until you
understand it.  Sorry.

> Dec 10 18:21:24 tee ntpd[2361]: ntpd 4.2.6p5@1.2349-o Mon May 20 14:24:35 UTC 2013 (1)
> -> Here we see that ntpd is started before NIC is ready, and the IPv6
>    address is clearly absent

For me ntp starts at /etc/rc2.d/S02ntp and of course networking
started up at /etc/rcS.d/S15networking.

Just for the particular case of ntp it could start sooner because it
is quite smart and is coded to watch interfaces.  If a new interface
is added after ntp starts then ntp will notice and you will see it
adopt it in the logs.  But I realize this is just for ntp and your
problem is otherwise.

> Dec 10 18:21:26 tee kernel: [   18.584193] bnx2 0000:02:00.0 eth0: NIC Copper Link is Up, 1000 Mbps full duplex
> Dec 10 18:21:26 tee kernel: [   18.584196] , receive & transmit flow control ON
> Dec 10 18:21:26 tee kernel: [   18.584281] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> 
> -> There the link becomes ready, but too late...

I would deeply investigate /etc/init.d/networking and see what it is
doing.  It appears that it is releasing and moving on before the link
is ready.  That shouldn't happen.  When it releases and moves on the
link should be ready to go.
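One experiment for the victim machine is to wait explicitly for the
link.  The sketch below polls the carrier file in sysfs.  The function
name and the optional third argument (an alternate sysfs directory,
there only to make it testable) are my own invention.  Note that
carrier being up does not by itself guarantee that the IPv6 link-local
address is already usable.

```shell
#!/bin/sh
# Wait until an interface reports carrier, or give up after a timeout.
# wait_for_carrier is an illustrative helper, not an ifupdown feature.
wait_for_carrier() {
    # $1: interface, $2: timeout in seconds, $3: sysfs net dir (optional)
    iface=$1
    timeout=$2
    netdir=${3:-/sys/class/net}
    waited=0
    while :; do
        # Reading carrier fails while the interface is administratively
        # down, so a failed read is treated the same as "no link".
        state=$(cat "$netdir/$iface/carrier" 2>/dev/null) || state=0
        [ "$state" = 1 ] && return 0
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}

# Example: wait_for_carrier eth0 10 || echo "eth0: no link after 10s"
```

Saved as a script and called from an "up" line in the eth0 stanza, this
would make ifup block until the link is there.  That is an assumption to
verify on a test machine, not a known fix.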

The picture may be confused if something in /etc/network/if-up.d/* is
actually triggering these other daemons to start, as if they were
using the event driven interface.  Beware.  But again, they shouldn't
be running until the interface is up.  But obviously something is.

> Dec 10 18:21:26 tee ntpd_intres[2519]: host name not found: canon.inria.fr
> Dec 10 18:21:26 tee ntpd_intres[2519]: host name not found: ntp.obspm.fr
> Dec 10 18:21:26 tee ntpd_intres[2519]: host name not found: ntp.sceen.net
> Dec 10 18:21:26 tee ntpd_intres[2519]: host name not found: clock.tix.ch
> Dec 10 18:21:26 tee ntpd_intres[2519]: host name not found: ntp.proserve.nl
> 
> -> Here we see that no DNS yet : unbound did not start 

For ntp in particular the LSB header in /etc/init.d/ntp does not
declare a dependency upon the nameserver.  Therefore the two are
unordered relative to each other and ntp may start before the
nameserver.  I know ntp is okay with dynamic interfaces but I don't
know if it retries dns names.  But if it required a nameserver then I
would think the LSB header would list $named.  It doesn't, and
therefore if that isn't a bug then ntp apparently doesn't need it.  I
find it hard to believe that a bug like that would have gone
unaddressed for so long.  Therefore I tend to believe that it is okay
to have ntp start before a named.

> -> IPv6 is now on, ntpd adapts...

Yes.  The ntpd is actually a very solid program.

> Dec 10 18:21:33 tee unbound-anchor: /var/lib/unbound/root.key has content
> Dec 10 18:21:33 tee unbound-anchor: success: the anchor is ok
> Dec 10 18:21:33 tee unbound: [3048:0] notice: init module 0: validator
> Dec 10 18:21:33 tee unbound: [3048:0] notice: init module 1: iterator
> Dec 10 18:21:33 tee unbound: [3048:0] info: start of service (unbound 1.4.21).
> 
> -> Recursive DNS resolver starts now...

Since there are no dependencies between ntp and $named provided by
unbound (I assume unbound provides $named?) then those are allowed to
run asynchronously to each other.  That is okay by definition of the
dependencies.  If not then the dependencies should be changed.  (The
/etc/insserv/overrides/* directory is convenient for this or for
experiments.)
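As an experiment, an override that makes ntp wait for $named could be
generated from the existing header.  This is a sketch under
assumptions: the function name is hypothetical, and I am assuming the
override file is named after the init script and that re-running
insserv afterwards recomputes the links.  Check insserv(8) on your
system before relying on it.

```shell
#!/bin/sh
# Copy the LSB header of an init script, adding $named to Required-Start
# unless it is already listed.  add_named_dependency is an illustrative
# name.
add_named_dependency() {
    # $1: init script whose LSB header should be copied
    sed -n '/^### BEGIN INIT INFO/,/^### END INIT INFO/p' "$1" |
        sed '/^# Required-Start:/{/\$named/!s/[[:space:]]*$/ $named/;}'
}

# Example, as root on a test machine (assumed workflow, see above):
#   add_named_dependency /etc/init.d/ntp > /etc/insserv/overrides/ntp
#   insserv ntp
```

The facilities such as $named are defined in /etc/insserv.conf, so that
file is the place to check whether unbound is counted as a nameserver.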

> In parallel, let's see the logs for the DHCPv6 client:
> ...
> 2013.12.10 18:21:24 Client Critical  Interface eth0/2 is down or doesn't have any link-local address.
> ...
> 
> See it starts at 18:21:24 2 s before link is on, and fails : The IPv6
> network is thus not routed to me...

I haven't worked with IPv6 yet.  Still stuck in the last decade here.
I didn't see where the dhcpv6 client is supposed to start.  Since you
are bringing up the IPv6 interface with "up" in the /e/n/i file I
wouldn't expect anything else to happen at that point.  So how is the
dhcpv6 client getting triggered to start?  That seems likely to be
related to the problem.
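One way to hunt for whatever starts it is to search the likely places
for the daemon's name.  A sketch only: the function name is made up,
and the directories listed are my guess at where to look first.

```shell
#!/bin/sh
# List files under the given paths that mention a fixed string.
# find_references is an illustrative helper name.
find_references() {
    # $1: string to look for; remaining arguments: files or directories
    pattern=$1
    shift
    grep -rlF -- "$pattern" "$@" 2>/dev/null | sort
}

# Look for anything that mentions dibbler in the usual startup places.
# /etc/init.d is searched rather than the rc*.d symlink directories.
if [ -d /etc/network ]; then
    find_references dibbler /etc/network /etc/init.d /etc/default
fi
```

A hit under /etc/network/if-*.d/ would mean the client is being started
from the interface event hooks rather than from the rc2.d ordering.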

> So the problem is clearly that several network services (ntpd, nsd,
> dibbler-client at least) are started *before* interface is
> completely up.

You will have to keep digging to understand it.  Use a victim machine
close by you for development and debugging until you know enough to
fix your production server.

On the victim machine I would be inclined to single step through
single user mode and start each init script individually.  That would
make it easier to understand the parts and what should be happening
when.  I imagine that at some point you will trigger an event and then
something completely unexpected will happen and at that point you will
have found the clue.

Or it will all work perfectly on your victim machine.  In which case
you will have an A-B comparison available between your test setup and
your production machine.

On the victim machine I would be inclined to hack on
/etc/init.d/networking and verify that networking is indeed up and
running completely at the end of the script when it passes control
on.  If not then backtrack from there.

> Thanks for reading this long message, but I think this examination
> of the logs is important to understand what happens.

Yes.  That helped.  Sorry no answers yet though.

Bob
