Robust NTP environments

Due to my background as a NetWare guy, time-synchronization is something I pay attention to. Early versions of NDS were touchy about that, since the time-stamp was used in the conflicting-edits resolution process. NetWare didn't use a full up NTP client for this, Novell built their own form of it based on NTP code and called it TimeSync. Unlike NTP, TimeSync did what it could to ensure the entire environment was within a second or two of a single time. Because of the lower time resolution, it synced a lot faster than NTP did, and this was considered a good thing since out-of-sync time was considered an outage.

With that in mind, it is no surprise that I like to have a solid time-sync process in place on my networks. One of the principles of Novell's TimeSync config was the concept of a time-group. A group of servers who coordinated time between themselves, and a bunch of clients who poll the members of the time-servers for correct time. Back before internet connections were as ubiquitous as air, this was a good way for an office network to maintain a consensus time. Later on, TimeSync gained the ability to talk over TCP/IP, and could use NTP sources for external time, and this allowed TimeSync to hook into the universial time coordinated (UTC) system.

You can create much the same kind of network with NTP as you could with TimeSync. It requires more than one time server, but your clients only have to directly speak with one of the time servers in the group. Yet the same type of robustness can be had.

The concept is founded in the "peer" association for NTP. The definition of this verb is rather dry:
For type s addresses (only), this command mobilizes a persistent symmetric-active mode association with the specified remote peer.
And doesn't tell you much. This is much clearer:
Symmetric active/passive mode is intended for configurations were a clique of low-stratum peers operate as mutual backups for each other. Each peer operates with one or more primary reference sources, such as a radio clock, or a set of secondary (stratum, 2) servers known to be reliable and authentic. Should one of the peers lose all reference sources or simply cease operation, the other peers will automatically reconfigure so that time and related values can flow from the surviving peers to all hosts in the subnet. In some contexts this would be described as a "push-pull" operation, in that the peer either pulls or pushes the time and related values depending on the particular configuration.
Unlike TimeSync, if all the peers lose their upstreams (the internet connection is down) then the entire infrastructure goes out of sync. This can be mitigated somewhat through judicious use of the 'maxpoll' parameter; set it high enough, and it can be hours (or days if you set it really high) before the peer even notices it can't talk to its upstream and will continue to report in-sync time to clients.

It is also a very good idea to use ACLs in your ntp.conf file to restrict what types of connections clients can mobilize. It is quite possible to be evil to NTP servers. You can turn on enough options to allow trouble-shooting, but not allow config changes.

It is a very good idea for your peers to be cryptographically associated with each other as well. There are at least two methods for this with NTP, v3's autokey, and v4's symmetric key. Autokey is a somewhat easier to set up preshared-key system, symmetric key is more secure, either is more preferable to nothing.

Here is a pair of /etc/ntp.conf files for a hypothetical set of WWU time-servers (items like drift-file and logging options have been omitted):
server 0.north-america.pool.ntp.org maxpoll 13
server 1.north-america.pool.ntp.org maxpoll 13
peer 140.160.247.31 key 1

enable auth monitor
keys /etc/ntp.keys
trustedkey 1
requestkey 1

restrict default ignore
restrict 140.160.0.0 mask 255.255.0.0 nomodify nopeer
restrict 140.160.247.31
server 2.north-america.pool.ntp.org maxpoll 13
server 3.north-america.pool.ntp.org maxpoll 13
peer 140.160.11.86 key 1

enable auth monitor
keys /etc/ntp.keys
trustedkey 1
requestkey 1

restrict default ignore
restrict 140.160.0.0 mask 255.255.0.0 nomodify nopeer
restrict 140.160.11.86

The 'maxpoll' values ensure that once time has been synchronized for long enough, the time between polls of the upstream NTP servers will be 137 minutes. Hopefully, any internet outages should be less then that. Setting max-poll to even higher values will allow longer times between polling intervals, and therefore longer internet outage tolerance. This can get QUITE long, I've seen some NTP servers that poll twice a week.

The key settings set up an Autokey-style crypto system. The "key 1" option on the peer line indicates that the designated connection should use crypto validation. The actual data passed isn't encrypted, the crypto is used for identity validation. This prevents spoofing of time, which can lead to wildly off time values.

The 'restrict' lines tell the NTPD to ignore off campus requests for time (it'll still listen, but return access-denied to all requests), allow on-campus users to get time and do time tracing but nothing else, and allow full access to the peer time server. In theory, inbound NTP traffic should be stopped at the border firewall but just in case it'll deny any that get through.

This is a two server setup, but three or more server could easilly be involved. For a network our size (large) and complexity (simple), two to three time-servers is probably all we need. The peered time-servers will all report in-sync so long as one still considers itself in-sync with an upstream time-server.

Because peers sync time amongst themselves, clients only have to talk to a single time-server to get valid time. Of course, that introduces a single-point-of-failure in the system if that time-host ever has to go down. Because of this, I strongly recommend configuring NTP clients to use at least two upstreams.

Enjoy high quality time!