Package: klibc-utils
Version: 1.5.12-2
Severity: important
Tags: patch
(I'm using X-Debbugs-Cc so that the klibc list receives a copy directly.
I'd be glad to be kept in Cc since I don't follow that list, thanks
already.)
Hi maks, hpa, Louis, and others,
I've been experiencing for a while timeouts at DHCP-time in ipconfig
when starting up several machines (say: 2 out of 4 don't boot) at the
very same time (FWIW, using the 'reboot' mechanism of our SSI system,
which means the reboots happen in sync).
Finally, I found some time to look into it, and discovered that the
automata doesn't take some cases into account. First, I've rebuilt
ipconfig with DEBUG/IPC_DEBUG, and discovered that the machines that
don't boot receive the messages broadcasted by the other dhcp clients
(be it DISCOVER or REQUEST) and from that point on, receive nothing
else.
Now, looking at the code (under usr/kinit/ipconfig for those following
at home):
packet.c: packet_recv()
| …
| if (udp->source != htons(cfg_remote_port) ||
| udp->dest != htons(cfg_local_port))
| goto free_pkt;
| …
| free_pkt:
| free(ip);
| return 0;
| …
Which means in case of source/dest mismatch (which is the case when a
message from another client is received), 0 is returned.
Now, looking at the callers:
bootp_proto.c: bootp_recv_reply() & dhcp_proto.c: dhcp_recv()
| …
| ret = packet_recv(iov, 3);
| if (ret <= 0)
| return ret;
| …
Again, 0 is returned.
dhcp_recv() is wrapped into dhcp_recv_offer() & dhcp_recv_ack().
Finally, all those are used in switch() statements in main.c, where -1
and strictly positive values are checked for, but not 0. Hence the
attached patch: 0001-trivial.patch.
I guess one could consider it a special case that might deserve a
DEVST_SOFTERROR state, which could have a shorter retry delay than
DEVST_ERROR. Especially true for some setups with a hundred machines or
more, it'd be quite a PITA to wait 10 seconds for a retry where only a
couple of machines will complete the DHCP handshake. Not to mention the
default timeout that'll bite. That's why I'm proposing a second patch:
0002-introduce-softerror.patch; and since it's probably overkill to
introduce that additional state, I think the functionally equivalent
0003-cleaner.patch will be better if you want to implement my suggestion.
Patches against master branch, tested on Debian's sid version (1.5.12).
Errm, now that I'm rebooting on a loopy fashion, it looks like those
patches don't cure the problem totally, so I guess I'm back to
debugging. Hopefully, upstream will figure this out better than I do.
Cheers,
--
Cyril Brulebois
--- Begin Message ---
If a broadcast message (e.g. a DISCOVER or REQUEST message sent by
another DHCP client) is received when waiting for an OFFER or an ACK
message, the error isn't taken into account in main().
Fix this by considering null return value as errors, so that the DHCP
handshake can be tried again.
Signed-off-by: Cyril Brulebois <cyril.bruleb...@kerlabs.com>
---
usr/kinit/ipconfig/main.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/usr/kinit/ipconfig/main.c b/usr/kinit/ipconfig/main.c
--- a/usr/kinit/ipconfig/main.c
+++ b/usr/kinit/ipconfig/main.c
@@ -178,6 +178,7 @@ static int process_receive_event(struct state *s, time_t
now)
s->restart_state = DEVST_BOOTP;
switch (bootp_recv_reply(s->dev)) {
case -1:
+ case 0:
s->state = DEVST_ERROR;
break;
case 1:
@@ -191,6 +192,7 @@ static int process_receive_event(struct state *s, time_t
now)
s->restart_state = DEVST_DHCPDISC;
switch (dhcp_recv_offer(s->dev)) {
case -1:
+ case 0:
s->state = DEVST_ERROR;
break;
case DHCPOFFER: /* Offer received */
@@ -204,6 +206,7 @@ static int process_receive_event(struct state *s, time_t
now)
s->restart_state = DEVST_DHCPDISC;
switch (dhcp_recv_ack(s->dev)) {
case -1: /* error */
+ case 0:
s->state = DEVST_ERROR;
break;
case DHCPACK: /* ACK received */
--
1.5.6.5
--- End Message ---
--- Begin Message ---
If a broadcast message (e.g. a DISCOVER or REQUEST message sent by
another DHCP client) is received while waiting for an OFFER or an ACK
message, the error isn't taken into account in main(), and timeout gets
reached.
Fix this by introducing DEVST_SOFTERROR, since one wants to retry ASAP
(after 1 second instead of 10 for DEVST_ERROR) when in the middle of a
concurrency situation, e.g. because a lots of machines get booted at the
very same time.
Signed-off-by: Cyril Brulebois <cyril.bruleb...@kerlabs.com>
---
usr/kinit/ipconfig/main.c | 23 ++++++++++++++++++++++-
usr/kinit/ipconfig/netdev.h | 1 +
2 files changed, 23 insertions(+), 1 deletions(-)
diff --git a/usr/kinit/ipconfig/main.c b/usr/kinit/ipconfig/main.c
--- a/usr/kinit/ipconfig/main.c
+++ b/usr/kinit/ipconfig/main.c
@@ -180,6 +180,9 @@ static int process_receive_event(struct state *s, time_t
now)
case -1:
s->state = DEVST_ERROR;
break;
+ case 0:
+ s->state = DEVST_SOFTERROR;
+ break;
case 1:
s->state = DEVST_COMPLETE;
DEBUG(("\n bootp reply\n"));
@@ -193,6 +196,9 @@ static int process_receive_event(struct state *s, time_t
now)
case -1:
s->state = DEVST_ERROR;
break;
+ case 0:
+ s->state = DEVST_SOFTERROR;
+ break;
case DHCPOFFER: /* Offer received */
s->state = DEVST_DHCPREQ;
dhcp_send_request(s->dev);
@@ -206,6 +212,9 @@ static int process_receive_event(struct state *s, time_t
now)
case -1: /* error */
s->state = DEVST_ERROR;
break;
+ case 0:
+ s->state = DEVST_SOFTERROR;
+ break;
case DHCPACK: /* ACK received */
s->state = DEVST_COMPLETE;
break;
@@ -224,6 +233,18 @@ static int process_receive_event(struct state *s, time_t
now)
case DEVST_ERROR:
/* error occurred, try again in 10 seconds */
s->expire = now + 10;
+ DEBUG(("\n"));
+ handled = 0;
+ break;
+
+ case DEVST_SOFTERROR:
+ /* "soft" error occurred (probably due to
+ concurrency), try again in 1 second */
+ s->expire = now + 1;
+ DEBUG(("\n"));
+ handled = 0;
+ break;
+
default:
DEBUG(("\n"));
handled = 0;
@@ -250,7 +271,7 @@ static void process_timeout_event(struct state *s, time_t
now)
* If we had an error, restore a sane state to
* restart from.
*/
- if (s->state == DEVST_ERROR)
+ if (s->state == DEVST_ERROR || s->state == DEVST_SOFTERROR)
s->state = s->restart_state;
/*
diff --git a/usr/kinit/ipconfig/netdev.h b/usr/kinit/ipconfig/netdev.h
--- a/usr/kinit/ipconfig/netdev.h
+++ b/usr/kinit/ipconfig/netdev.h
@@ -61,6 +61,7 @@ extern struct netdev *ifaces;
#define DEVST_DHCPREQ 3
#define DEVST_COMPLETE 4
#define DEVST_ERROR 5
+#define DEVST_SOFTERROR 6
int netdev_getflags(struct netdev *dev, short *flags);
int netdev_setaddress(struct netdev *dev);
--
1.5.6.5
--- End Message ---
--- Begin Message ---
If a broadcast message (e.g. a DISCOVER or REQUEST message sent by
another DHCP client) is received while waiting for an OFFER or an ACK
message, the error isn't taken into account in main(), and timeout gets
reached.
Fix this by setting state to DEVST_ERROR but with a delay of 1 second
instead of the usual 10 for the other error cases, since one wants to
retry ASAP when in the middle of a concurrency situation, e.g. because
a lots of machines get booted at the very same time.
Signed-off-by: Cyril Brulebois <cyril.bruleb...@kerlabs.com>
---
usr/kinit/ipconfig/main.c | 18 +++++++++++++++++-
1 files changed, 17 insertions(+), 1 deletions(-)
diff --git a/usr/kinit/ipconfig/main.c b/usr/kinit/ipconfig/main.c
--- a/usr/kinit/ipconfig/main.c
+++ b/usr/kinit/ipconfig/main.c
@@ -172,6 +172,7 @@ static void complete_device(struct netdev *dev)
static int process_receive_event(struct state *s, time_t now)
{
int handled = 1;
+ int delay = 0;
switch (s->state) {
case DEVST_BOOTP:
@@ -179,6 +180,11 @@ static int process_receive_event(struct state *s, time_t
now)
switch (bootp_recv_reply(s->dev)) {
case -1:
s->state = DEVST_ERROR;
+ delay = 10;
+ break;
+ case 0:
+ s->state = DEVST_ERROR;
+ delay = 1;
break;
case 1:
s->state = DEVST_COMPLETE;
@@ -192,6 +198,11 @@ static int process_receive_event(struct state *s, time_t
now)
switch (dhcp_recv_offer(s->dev)) {
case -1:
s->state = DEVST_ERROR;
+ delay = 10;
+ break;
+ case 0:
+ s->state = DEVST_ERROR;
+ delay = 1;
break;
case DHCPOFFER: /* Offer received */
s->state = DEVST_DHCPREQ;
@@ -205,6 +216,11 @@ static int process_receive_event(struct state *s, time_t
now)
switch (dhcp_recv_ack(s->dev)) {
case -1: /* error */
s->state = DEVST_ERROR;
+ delay = 10;
+ break;
+ case 0:
+ s->state = DEVST_ERROR;
+ delay = 1;
break;
case DHCPACK: /* ACK received */
s->state = DEVST_COMPLETE;
@@ -223,7 +239,7 @@ static int process_receive_event(struct state *s, time_t
now)
case DEVST_ERROR:
/* error occurred, try again in 10 seconds */
- s->expire = now + 10;
+ s->expire = now + delay;
default:
DEBUG(("\n"));
handled = 0;
--
1.5.6.5
--- End Message ---