Introduction
Ceph is a distributed system, and delivers the best experience when all server and client systems have closely synchronized clocks.
Notably, the Paxos consensus algorithm used by Ceph Monitors to establish and maintain quorum requires that Monitor clocks be synchronized to within 50 milliseconds of each other.
It is straightforward to design and implement an NTP architecture that easily achieves and maintains sub-millisecond accuracy, but we often see implementations that do not provide sufficient diversity and resilience. In this document we will briefly explore common pitfalls and strategies to mitigate them. Note that we do not attempt to provide a complete reference for configuring and running NTP services, but rather concentrate on architectural decisions and certain important configuration choices.
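To make the 50 millisecond requirement concrete, the following is a minimal Python sketch of a skew check. It assumes you already have each Monitor's offset from a shared reference clock; the host names and offset values are invented for illustration, and the function names are hypothetical rather than part of Ceph or any NTP tool.

```python
# Tolerance between Monitor clocks, as stated in the text (50 ms).
MAX_SKEW_SECONDS = 0.050


def max_pairwise_skew(offsets):
    """Largest difference between any two clocks, in seconds.

    `offsets` maps a host name to that host's offset from a shared
    reference clock, in seconds. The structure is hypothetical.
    """
    if len(offsets) < 2:
        return 0.0
    values = list(offsets.values())
    # The widest pair is always the highest and the lowest offset.
    return max(values) - min(values)


def within_tolerance(offsets, limit=MAX_SKEW_SECONDS):
    """True when every pair of clocks agrees to within `limit` seconds."""
    return max_pairwise_skew(offsets) <= limit


# Illustrative values only: each host is within 50 ms of the reference.
sample = {"mon-a": 0.004, "mon-b": -0.011, "mon-c": 0.048}
print(f"max skew: {max_pairwise_skew(sample) * 1000:.1f} ms")
print("within tolerance:", within_tolerance(sample))
```

In this sample every host is individually within 50 milliseconds of the reference, yet mon-b and mon-c differ from each other by 59 milliseconds. What matters is how closely the Monitors agree with one another, not just how close each one is to reference time.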
NTP Overview
The Network Time Protocol provides a mechanism for synchronizing the clocks of computers and other devices over LAN or WAN network connections.
There are three popular NTP implementations that we see on Linux systems: systemd-timesyncd, the classic ntpd, and chrony.
Of these, systemd-timesyncd should be ruled out immediately, as it only syncs very intermittently and in every way is suited only to setting a desktop or laptop system’s clock to be vaguely accurate. Linux servers need stronger and ongoing synchronization, and systemd-timesyncd is fundamentally inadequate for IBM Storage Ceph systems.
The classic ntpd is sometimes referred to as simply NTP, but this is imprecise and can lead to confusion. NTP is the protocol, not the implementation thereof. The classic ntpd suffices for server timekeeping, but is saddled with considerable legacies. It is the default for RHEL 7.
The best NTP timekeeping daemon for Linux servers is chrony. It is configured in much the same way as the classic ntpd and operates in a similar fashion, but is more efficient and will often converge system times more quickly. While out of scope for IBM Storage Ceph, RHEL 7 systems can easily be switched from the legacy ntpd to chrony for a homogeneous fleet.
NTP Sources
Your enterprise’s systems sync against reference time sources. These include appliances that receive super-accurate time signaling from GPS satellites as well as public servers available over the Internet. Other systems within your organization can act as servers as well, including in some cases network routers or switches. All are valuable, with caveats.
Resilience and Diversity are Crucial
In order to implement a quality, resilient NTP service for your IBM Storage Ceph deployment (and your enterprise in general) you must adhere to the below design principles:
-
More sources are better. NTP is lightweight from compute and network perspectives, so there is no need to limit the number of configured sources out of concern for resource consumption. At any given time a system’s NTP daemon will select the single configured source that it considers the best available and synchronize to it. It is entirely possible that no configured source will be considered acceptable, which we must avoid. It is perfectly acceptable to have as many as twenty sources configured.
-
Public NTP pools † are fine things, but their quality varies widely especially in certain geographical regions. They are valuable components of your NTP scheme, but are ideally not the only upstream sources in the mix. This time-series from a real-world enterprise Ceph cluster tells the tale:
Here we see periods of reasonably precise synchronization of system clocks interspersed with times of severe divergence. The root cause of these wild fluctuations was inconsistent quality of the servers in a certain regional public NTP pool. The public NTP pools enact primitive load-balancing by periodically rotating the participating time source servers to which the advertised, abstracted DNS records point. At the time of writing, for example, us.pool.ntp.org rotates among nearly 600 backing servers, though only four are exposed at any given time.
Enterprise NTP daemons value stability, and during intervals when the public pool’s point-in-time selection of DNS record targets does not contain any quality sources, system times will skew wildly and rapidly as shown. Remember that Ceph Monitors want no worse than 50 millisecond synchronization among themselves: the above graph shows the time skew of each non-lead Monitor relative to the lead Monitor.
-
Modern NTP daemons implement adaptive backoff of the interval between probes of configured time sources. This helps reduce load and network traffic as a system’s clock stabilizes. The iburst attribute when configuring sources is useful for speeding initial synchronization by sending a small number of frequent time probes at startup, then falling back to less-frequent probes. This is advised for all time sources.
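The principles above (many sources, each with iburst) lend themselves to a simple automated check. The following is a minimal Python sketch that scans configuration text for server and pool lines, counts them, and reports any that lack iburst. The function name, the sample configuration, and the host time.example.com are hypothetical, and comment handling is simplified to the # character only.

```python
def audit_sources(config_text):
    """Return (source_count, hosts_missing_iburst) for server/pool lines."""
    count = 0
    missing = []
    for raw in config_text.splitlines():
        # Drop anything after '#'; a simplification of real comment rules.
        line = raw.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) < 2 or fields[0] not in ("server", "pool"):
            continue
        count += 1
        # Options follow the host name; iburst must be among them.
        if "iburst" not in fields[2:]:
            missing.append(fields[1])
    return count, missing


# Hypothetical configuration fragment for illustration.
SAMPLE = """\
# local zone
server 0.us.pool.ntp.org iburst
server 1.us.pool.ntp.org
server time.example.com iburst
driftfile /var/lib/chrony/drift
"""

count, missing = audit_sources(SAMPLE)
print("sources configured:", count)
print("missing iburst:", missing)
```

A check like this could run in configuration management or CI so that a fleet does not quietly drift toward too few sources or inconsistent options.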
Resilient NTP Architecture
The below diagram shows a generalized, highly available and highly resilient datacenter NTP architecture. Not all components of this architecture are necessary, but the more you can implement, the better results you may have. We will briefly discuss each component.
-
Local Geo Pool
This refers to public NTP pool servers abstracted through rotating DNS. For example, a server in Boring, Oregon or Intercourse, Pennsylvania might configure the below:
server 0.us.pool.ntp.org iburst
server 1.us.pool.ntp.org iburst
server 2.us.pool.ntp.org iburst
server 3.us.pool.ntp.org iburst
-
Hand-picked public servers
This might include known-quality specific source FQDNs or IP addresses, or sources provided by your organization or an associate’s company. One might run chronyc sources and chronyc sourcestats when configuring public pools to select a specific server or two with consistently low Stratum, Freq Skew, Offset, and Std Dev values and high Reach. We do not list any examples here because the best choices will vary based on your location and situation. Note as well that this approach is acceptable for a very small number of distribution servers but should not be applied directly to a large number of your internal systems.
That said, additional, static choices for diversity might be the servers run by NIST: https://tf.nist.gov/tf-cgi/servers.cgi
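The selection criteria described above can be sketched in Python as a filter and a sort. This is only an illustration: the Candidate structure, the max_stratum cutoff, the sort order, and the sample hosts and values are all assumptions, and the numbers would need to be copied by hand or parsed from the chronyc output yourself. The one detail taken from chronyc is that Reach is displayed in octal, where 377 means the last eight polls were all answered.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    stratum: int
    reach: str            # octal string as displayed, e.g. "377"
    freq_skew_ppm: float
    offset_s: float
    std_dev_s: float


def rank_candidates(candidates, max_stratum=3):
    """Keep fully reachable, low-stratum sources; steadiest first."""
    healthy = [
        c for c in candidates
        if int(c.reach, 8) == 0o377 and c.stratum <= max_stratum
    ]
    return sorted(
        healthy,
        key=lambda c: (c.stratum, c.std_dev_s, abs(c.offset_s), c.freq_skew_ppm),
    )


# Made-up observations for illustration; not real measurements.
observed = [
    Candidate("ntp-a.example.net", 2, "377", 0.05, 0.0002, 0.0001),
    Candidate("ntp-b.example.net", 1, "377", 0.02, -0.0001, 0.00005),
    Candidate("ntp-c.example.net", 1, "177", 0.01, 0.0001, 0.00004),
    Candidate("ntp-d.example.net", 4, "377", 0.03, 0.0003, 0.0002),
]

for c in rank_candidates(observed):
    print(c.name, c.stratum, c.std_dev_s)
```

Because the text asks for consistently good values, a single snapshot is not enough; one would repeat the observation over hours or days before pinning a server.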
-
Distant Geo Pool
If your organization runs servers in Africa, Latin America, or APAC regions ††, it may be especially valuable to add two entries for public servers in the US zone in addition to those in your local zone:
server 0.asia.pool.ntp.org iburst
server 1.asia.pool.ntp.org iburst
server 2.asia.pool.ntp.org iburst
server 3.asia.pool.ntp.org iburst
server 0.us.pool.ntp.org iburst
server 1.us.pool.ntp.org iburst
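The numbered pool hostnames follow a fixed pattern, 0 through 3 under each zone, so a mixed local and distant list is easy to generate. The following Python sketch shows one way to do so; the function name pool_lines is hypothetical, and the zone names and counts are simply the ones used in this example.

```python
def pool_lines(zone, count):
    """Build `count` server lines for a pool.ntp.org zone, all with iburst."""
    # Each zone publishes four numbered names: 0 through 3.
    if not 1 <= count <= 4:
        raise ValueError("count must be between 1 and 4")
    return [f"server {i}.{zone}.pool.ntp.org iburst" for i in range(count)]


# Four entries from the local zone plus two from a distant zone.
lines = pool_lines("asia", 4) + pool_lines("us", 2)
print("\n".join(lines))
```

Generating the lines from a template in configuration management keeps the local and distant zones consistent across your distribution servers.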
-
GPS Appliance
Old-school GPS appliances are dedicated hardware, often with a coax run to the data center’s roof where a specialized antenna receives signals from the constellation of visible GPS satellites. These can require expensive and lengthy site arrangements but cannot be beat for capacity and precision.
In recent years small appliances have become available for as little as USD 500. These generally can only serve a modest number of clients, but they can sit on a windowsill with line-of-sight to the sky and provide an inexpensive low-stratum and high-quality source for your distribution layer, which will share the temporal love with all your internal systems. In order to remain vendor-neutral and avoid stale advice we do not list specific appliances here but a web search engine will quickly find multiple options.
-
Internal Distribution Server
It is bad netizenship to have more than a few servers directly query external, public time sources. Larger numbers of servers doing this would present inappropriate, abusive load to these sources that provide a valuable service free of charge. Implementing an internal distribution layer respects external resources that are provided out of the goodness of someone’s heart, keeps the network traffic off of your congested WAN, and presents much lower network RTT and jitter for internal clients.
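A rough calculation shows how much an internal distribution layer reduces load on public sources. The Python sketch below uses assumed numbers throughout: 5000 internal hosts, four distribution servers, six public sources each, and a steady-state poll interval of 1024 seconds. Substitute your own figures.

```python
def upstream_queries_per_hour(hosts, sources_per_host, poll_interval_s):
    """Approximate queries per hour sent to public sources."""
    return hosts * sources_per_host * 3600 / poll_interval_s


# Assumed fleet: every host queries six public sources directly.
direct = upstream_queries_per_hour(5000, 6, 1024)

# Same fleet, but only four distribution servers query public sources.
layered = upstream_queries_per_hour(4, 6, 1024)

print(f"direct:  {direct:.0f} queries/hour")
print(f"layered: {layered:.0f} queries/hour")
print(f"reduction factor: {direct / layered:.0f}x")
```

Under these assumptions the reduction factor is simply the ratio of hosts to distribution servers. The internal clients still poll just as often, but that traffic stays on your own network.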
-
Not pictured on the above diagram but quite valuable are the below three strategies, which reflect that with IBM Storage Ceph servers and clients, close synchronization is often more important than tight adherence to reference time, though staying very close to reference time has additional benefits.