Hello folks,

I would like to take this opportunity to discuss with a broader audience one of the long-standing problems in building Beowulf clusters and to present a potential solution for it: using NTP to synchronize the system times of the different nodes just doesn't cut it. To be fair, NTP was never designed to achieve the accuracy that one would like to have in a cluster.
For clusters it is time to replace it with a solution that works better in a LAN. The good news is that one already exists: the Precision Time Protocol (PTP), standardized as IEEE 1588 [1]. David Lombard and I have investigated the applicability of that standard and its existing open source implementation PTPd [2] for HPC clusters; so far it looks very promising.

[1] http://ieee1588.nist.gov/
[2] http://ptpd.sourceforge.net/

Those of you who were at this year's LCI conference might remember my speaker's corner presentation and the ensuing discussions we had there. For those who were not, let me summarize:

1. NTP only achieves an accuracy of ~1ms, which is several orders of magnitude larger than the latency of the messages one might want to measure, so the measurement error for individual events is way too high.

2. The frequency of clocks in typical hardware varies in a non-linear way that linear clock correction methods cannot compensate for; as one of the main developers of an MPI tracing library (VampirTrace/Intel Trace Collector) I have struggled with that for a long time. There is a non-linear clock correction in the tracing library which helps, but it only works for applications that regularly invoke an API call - not very user-friendly.

3. PTP as implemented by PTPd works in user space, requires no special hardware and achieves +/-1us accuracy; this is from a real measurement with NTP running on a head node and PTPd running on two compute nodes.

At LCI we asked the audience a few questions to get a feeling for what the community thinks about the problem, and I'd like to repeat those questions here:

* Have you heard of PTP and considered using it in clusters?
* How would applications or clusters benefit from a better cluster-wide clock?
* What obstacles did or could prevent using PTP(d) for that purpose?

It turned out that no one had heard of PTP before, so my hope is that it is new to you as well and I'm not boring you to death with old news...

Not having an accurate cluster clock was considered annoying by most people. Comparing log files from different nodes and performance analysis obviously benefit from higher accuracy, whereas applications typically do not depend on it. This is perhaps a chicken-and-egg problem: because timing is inaccurate, no one considers algorithms which depend on it, so no one builds clusters with a better clock because the demand is not there.

Regarding obstacles we had a very good discussion with Don Becker which went into the details of how PTPd implements PTP: an essential part of PTP is time stamping multicast packets as close as possible to the point in time when they go onto the wire or are received by a host. In PTPd this is done by asking the IP stack to time stamp incoming packets in the kernel, using a standard interface for that. Outgoing packets are looped back, and the "incoming" time stamp on the loopback device is assumed to be close to the point in time when the same packet also left the node. This assumption can be off by a considerable delta and, worse, due to other traffic and queues in the Ethernet hardware/drivers the delta can vary. This is expected to lead to noise in the measurements and reduced accuracy.
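To make the "standard interface" part a bit more concrete, here is a minimal sketch of how a user space program can ask the Linux kernel for receive time stamps via the SO_TIMESTAMP socket option, which I believe is the mechanism PTPd relies on. Treat it as an illustration rather than as actual PTPd code: error handling, joining the PTP multicast group and parsing the PTP messages are all omitted, and the choice of UDP port 319 (the PTP event port) is just for the example.

    /* so_timestamp.c - print kernel receive time stamps for UDP packets.
     * Simplified illustration only; not taken from PTPd. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        int on = 1;
        struct sockaddr_in addr;
        char payload[1500];
        union {                       /* aligned buffer for ancillary data */
            struct cmsghdr align;
            char buf[CMSG_SPACE(sizeof(struct timeval))];
        } control;

        /* ask the kernel to time stamp every incoming packet */
        setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(319);   /* PTP event port, needs root to bind */
        bind(sock, (struct sockaddr *)&addr, sizeof(addr));

        for (;;) {
            struct iovec iov = { payload, sizeof(payload) };
            struct msghdr msg;
            struct cmsghdr *cm;

            memset(&msg, 0, sizeof(msg));
            msg.msg_iov = &iov;
            msg.msg_iovlen = 1;
            msg.msg_control = control.buf;
            msg.msg_controllen = sizeof(control.buf);

            if (recvmsg(sock, &msg, 0) < 0)
                break;

            /* the receive time stamp arrives as ancillary data */
            for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                if (cm->cmsg_level == SOL_SOCKET &&
                    cm->cmsg_type == SCM_TIMESTAMP) {
                    struct timeval tv;
                    memcpy(&tv, CMSG_DATA(cm), sizeof(tv));
                    printf("packet received at %ld.%06ld\n",
                           (long)tv.tv_sec, (long)tv.tv_usec);
                }
            }
        }
        return 0;
    }

The important point is that the time stamp is taken inside the kernel when the packet comes up from the driver, so it avoids the scheduling delays of the user space daemon; what it cannot capture is the send side, which is why PTPd has to resort to the loopback trick for outgoing packets, with the noise described above.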
Another problem with PTP is scaling: all communication in the original 2002 revision of the standard (which PTPd is currently based on) uses multicasts, even for the client-to-master communication, which could be done point-to-point. This was apparently a deliberate decision to simplify the implementation of the standard in embedded devices. The effect is that every PTP packet in a subnet is received by all nodes and wakes up the PTPd daemon, even if that daemon just discards the packet. On the bright side, one can extrapolate from the frequency of these packets (master to clients: two packets every two seconds; each client to master: two packets in random intervals from 4 to 60 seconds) that the packet rate stays below 1000/s even for 10000 nodes - with an average client interval of roughly 32 seconds that works out to about 10000 * 2/32 = 625 packets/s, plus the master's one packet per second - so this should be fairly scalable.

There are obvious ideas for avoiding some of these shortcomings (using point-to-point communication for client->master packets; putting the PTP implementation into the kernel), but it is not clear who will pick up that work and what the benefit would be. The PTPd source is available and, although I don't know for sure, I'd expect the author to be happy to accept patches. I mailed him today and told him that we are looking at PTPd in an HPC context.

There are several next actions that I can imagine besides a general discussion of this technology:

* PTP has not been tested at scale yet. I wrote an MPI-based benchmark program which continuously measures the clock offsets between all nodes and would be happy to assist anyone who wants to try out PTPd on a larger cluster.

* The author of PTPd is working towards a first stable 1.0 release. Testing the release candidate might help to get 1.0 out and clear the way for including future patches. It might also provide more insight into which kinds of systems it works or doesn't work on.

* If someone has the time, there are some gaps in PTPd which could be filled: it has no init scripts yet; there is a more recent IEEE 1588 specification that might address some of the issues outlined above; etc.

* Ideally PTPd should come pre-packaged by Linux distributions or at least be added to HPC installations.

Regarding the last point, it is worth noting that there are patents on some of the technology. This probably has to be sorted out before Linux distributions will consider redistributing binaries. The patents can be found by searching for IEC 61588, which is the same as IEEE 1588:

* http://www.iec.ch/tctools/patent_decl.htm
* http://www.iec.ch/tctools/patents/agilent.htm

Please note that I am not speaking for Intel or any of the involved companies in this matter, and in particular when it comes to patents I cannot provide any advice.

That's all for now (and probably enough stuff, too ... although perhaps you prefer detailed emails over bullet items on a PowerPoint presentation). So what do you think?

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the issue, nor am I authorized to speak on behalf of Intel on this matter.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf