Right, so after a day spent with Daviey and a bunch of 30MB pcap files, we think we've figured this out.
The key exchange that failed happens here:

    7418 112.051626 10.55.200.99 10.55.200.1 TFTP Read Request, File: amd64/generic/quantal/commissioning/initrd.gz, Transfer type: octet, tsize\000=0\000, blksize\000=1408\000
    7419 112.053444 10.55.200.1 10.55.200.99 TFTP Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
    7420 113.053489 10.55.200.1 10.55.200.99 TFTP Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
    7423 116.053542 10.55.200.1 10.55.200.99 TFTP Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
    7425 116.832761 10.55.200.99 10.55.200.1 TFTP Acknowledgement, Block: 0

The client requests the initrd, but something in the firmware or pxelinux itself hangs for almost five seconds. During that time, the maas tftpd sends three ACKs (option acknowledgements, specifically) and then times out. By the time the client sends the ACK-0 to start the data transfer, the session state has been discarded, and the tftpd just logs the exception as an OOPS and waits for the next session to start.

Incidentally, we spent a lot of time correlating requested and actual block sizes between this tftpd and the HPA tftpd. That turned out to be a red herring, but it seemed like a compelling lead for a while.

The solution did come from a comparison to tftpd-hpa, though. In a few places in tftp/bootstrap.py and tftp/session.py there are timeout tuples set to (1, 3, 7). The iterable is consumed by the watchdog code every time a packet is sent out, and once the iterable is empty the watchdog tells the state machine to give up on the request. We never dug too far into the units or where in the conversation these things are read, but the fact that there are three times in the tuple and that the daemon gave up after three ACKs is a compelling coïncidence.
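
The consume-and-give-up behaviour described above can be sketched as a toy model. This is only an illustration of the idea, not the code in tftp/bootstrap.py or tftp/session.py: the names ToyWatchdog and attempts are hypothetical, and the sketch deliberately models only the number of sends, since the units and exact timing semantics of the tuple entries were not verified. (For what it is worth, the gaps between the three option acknowledgements in the trace are about 1 s and 3 s, which would be consistent with the first two entries being waits in seconds.)

```python
class ToyWatchdog:
    """Hypothetical stand-in for the retry watchdog; not the real tftp/ code."""

    def __init__(self, timeouts):
        # Turn the timeout tuple into an iterator so that entries get used up.
        self._remaining = iter(timeouts)
        self.packets_sent = 0
        self.gave_up = False

    def send(self):
        """Consume one entry per send; give up once nothing is left.

        Returns the entry drawn for this send, or None after giving up.
        """
        entry = next(self._remaining, None)
        if entry is None:
            # Assumption: an exhausted iterable means the request is abandoned.
            self.gave_up = True
            return None
        self.packets_sent += 1
        return entry


def attempts(timeouts):
    """Count how many packets go out before the watchdog gives up."""
    watchdog = ToyWatchdog(timeouts)
    while not watchdog.gave_up:
        watchdog.send()
    return watchdog.packets_sent


print(attempts((1, 3, 7)))           # 3
print(attempts((1, 1, 1, 1, 1, 1)))  # 6
```

Under this model a three-entry tuple allows exactly three sends, which matches the three option acknowledgements in the capture, and a six-entry tuple allows six.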

The tftpd-hpa code sends each packet up to six times, with a default timeout of one second:

    <Daviey> Spads: #define TIMEOUT 1000000 /* Default timeout (us) */
    <Daviey> #define TRIES 6 /* Number of attempts to send each packet */
    <Daviey> #define TIMEOUT_LIMIT ((1 << TRIES)-1)

Extending the tuple at line 346 of bootstrap.py solved this situation for us, and the maas tftpd succeeded just as tftpd-hpa did. In the end we settled on:

    class RemoteOriginReadSession(TFTPBootstrap):
        """Bootstraps a L{ReadSession}, that was started remotely, - we've received
        a RRQ.

        """
        timeout = (1, 1, 1, 1, 1, 1)

...as this more closely mimics what Daviey found in the tftpd-hpa source. This timeout tuple appears in a few places, so any adjustment to this code should probably be made to all of the timeout iterables in bootstrap.py and session.py.

Finally, while it's true that this seems to be a workaround for a fault on the client side (whether the fault is in firmware or in pxelinux.0 I can't say), I believe it is also a regression against the precise maas, which used cobbler.

https://bugs.launchpad.net/bugs/1155556

Title:
  HP ProLiant DL380 G7 tftps kernel, but initrd tracebacks in tftp server. DL380 G6 succeeds.