Towards a resilient NTP configuration in NTP4

NTP 4 introduces some interesting new features that few people seem to know about; they are sparsely documented and difficult to set up correctly, but they can help keep machines synchronized in the event of a total external network failure (even if you don’t have a reference time source).

Now, some reference time sources aren’t expensive (others are), but sometimes you care more about node-to-node synchronization than you do about absolute time accuracy. One example might be a large computational cluster where, if the network is disconnected from the Internet for a while, or if the primary time source is down, you don’t want the individual nodes to drift apart.

A combination of the new manycast and orphan modes can be a boon for this kind of network and create a self-organizing NTP server tree as a fallback state.

Here’s a sample NTP configuration:

# Large cluster config
manycastserver 227.221.9.75
manycastclient 227.221.9.75 key 687
server 10.255.4.150 key 687 iburst prefer
# set stratum to 7 when in orphan mode
# ignore time from anything below stratum 3
# winnow the surviving time sources down to at most 7
tos floor 3 orphan 7 minclock 7 minsane 1 cohort 1

driftfile /var/lib/ntp/drift
logfile /var/log/ntpd.log
keys /etc/ntp.keys
trustedkey 687
requestkey 687
controlkey 687

restrict default notrust
restrict 227.221.9.75 nomodify
restrict 127.0.0.1

First, we set up a multicast address at 227.221.9.75 and a shared key so that all of the servers can authenticate each other. (I could also have used the new NTP public-key authentication (Autokey), but didn’t feel like hassling with key generation and distribution.) Every machine is both a client and a server on this multicast address: machines broadcast requests and machines answer. If you have a lot of machines, you’ll see many, many peer entries at first; most will stale out after a little while.
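
The keys file at /etc/ntp.keys is just a table of key numbers, key types, and shared secrets that gets copied to every machine. A minimal sketch of what it might contain (the secret here is made up; only key number 687 comes from the config above):

# /etc/ntp.keys -- same file on every machine
# keyid  type  key   (type M = an MD5 key given as an ASCII string)
687      M     SomeSharedSecret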

The server line sets up your typical NTP client/server arrangement. In this case, I have an NTP server that sits one stratum above a server with a GPS reference clock and also synchronizes with some of the NTP pool servers on the public Internet. It uses the same key for authentication and acts as the NTP server for a network of 1054 client machines. The iburst keyword makes the client exchange an initial volley of packets with the server, which cuts initial synchronization from many minutes down to a few seconds, and the prefer keyword says that if this server is available, prefer it above all others.
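
The upstream server’s own configuration isn’t shown here, but it would look something like the sketch below. The GPS box’s hostname and the pool entries are hypothetical; only the key number and the keys file path are taken from the config above.

# Hypothetical config for the upstream server (10.255.4.150)
server gps1.nyc.des iburst prefer     # stratum-1 machine with the GPS reference clock (name made up)
server 0.pool.ntp.org iburst          # public pool servers as a sanity check
server 1.pool.ntp.org iburst
keys /etc/ntp.keys
trustedkey 687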

Now we get to the interesting line. tos is short for “type of service” and adjusts a number of internal NTP parameters. The tos options are documented at http://www.ee.udel.edu/~mills/ntp/html/manyopt.html.

floor 3: Discard replies from any server at a stratum below 3. We want this to become a self-organizing network only when our stratum-3 server is unavailable, so if any other server somehow started answering on our manycast address with a lower stratum, we want to ignore it.

orphan 7: When no outside source of synchronization is available, go into orphan mode and set my stratum to 7. All servers at stratum 7 then hold an election; the winner of the election becomes the new server and all the other servers synchronize to it so that they all keep the same relative time. When the main (non-manycast) server is available again, orphan mode is disabled and the clients synchronize with it again.

minclock 7: Continue eliminating servers via the clustering algorithm until no more than 7 remain.

minsane 1: I don’t think I really need this since 1 is the default, and I don’t remember why I put it in there, but it doesn’t hurt anything. I think I had it at 4 at one point and ran into some issues when in orphan mode.

One final tweak: our postinstall script has a perl command that sets the orphan level to a random number between 5 and 7, so when we enter orphan mode we actually have three separate self-organizing stratums. The lowest stratum elects a leader, and the higher stratums automatically pick the best time source(s) from the lower-stratum servers using the usual NTP clock selection algorithms. Why? To speed up the election and synchronization process by breaking the stratums down into more manageable chunks.
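
The postinstall itself isn’t reproduced here, but a one-liner along these lines does the job (a sketch, assuming ntp.conf lives at /etc/ntp.conf and ships with the orphan 7 line shown above):

perl -pi -e 's/\borphan \d+/"orphan " . (5 + int rand 3)/e' /etc/ntp.conf

rand 3 yields a value in [0,3), int truncates it to 0, 1, or 2, and the orphan level lands on 5, 6, or 7.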

So, what does the peer status (ntpq -p) look like on a running machine?


     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 desrad1.nyc.des 10.255.4.150     4 u   59 1024   177   0.108   -0.170   0.345
 desrad6.nyc.des 10.255.4.150     4 u   59 1024   177   0.119   -0.167   0.358
 desrad3.nyc.des 10.255.4.150     4 u   59  512    77   0.095    0.190   0.096
 drdsa07.nyc.des 10.255.4.150     4 u   59 1024    77   0.123   -0.110   0.351
 desrad11.nyc.de 10.255.4.150     4 u   59 1024    37   0.087   -0.067   0.028
 drdsa01.nyc.des 10.255.4.150     4 u   59 1024    37   0.120    0.034   0.013
 drdsa05.nyc.des 10.255.4.150     4 u   56 1024    37   0.113   -0.064   0.054
 desrad7.nyc.des 10.255.4.150     4 u   29 1024    37   0.099   -0.086   0.007
 drdsa08.nyc.des 10.255.4.150     4 u   59  512    17   0.110   -0.583   0.003
 desrad8.nyc.des 10.255.4.150     4 u   59  512    17   0.078    0.227   0.020
 drdzf095.nyc.de 10.255.4.150     4 u   59 1024     7   0.083    0.072   0.006
 desrad2.nyc.des 10.255.4.150     4 u   59 1024     7   0.100   -0.871   0.012
 drdsa04.nyc.des 10.255.4.150     4 u   59 1024     7   0.103   -0.086   0.004
 drdsa00.nyc.des 10.255.4.150     4 u   59 1024     7   0.092   -1.037   0.013
 desrad10.nyc.de 10.255.4.150     4 u   59 1024     7   0.111    0.299   0.016
 drdsa02.nyc.des 10.255.4.150     4 u   59 1024     7   0.080   -0.995   0.019
 desrad4.nyc.des 10.255.4.150     4 u   17   64     3   0.090    0.084   0.009
 desrad5.nyc.des 10.255.4.150     4 u   45   64     1   0.092    0.610   0.006
 desrad9.nyc.des 10.255.4.150     4 u    1   64     1   0.112    0.112   0.005
 227.221.9.75    .ACST.          16 u    -   64     0   0.000    0.000   0.001
*ntpmastr.nyc.de 10.249.1.1       3 u   59 1024   377   0.093   -0.653   0.258

When ntpmastr goes away, the servers take some minutes, but they eventually elect a leader and settle into a usable hierarchy. They may drift from the absolute notion of NTP time, but they remain synchronized with each other, which is more important to us.