A DNS query is the first step in the critical path of most internet services. Whether it is a website, an app, or an internet-connected ‘smart’ product, almost all of them depend on DNS queries. DNS also has a unique problem: nearly all top-level domain providers set the TTL (Time to Live) of a domain’s name server records to 48 hours (the only exception that I know of is .nl, which sets it to just one hour!). The TTL instructs resolvers how long to cache that information: you can see this in step 4 of the diagram below. This means that once you have delegated your domain to a new DNS provider, it can take up to 48 hours for all the resolvers across the internet to stop sending DNS queries to the previous set of name servers.
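To make the 48-hour window concrete, here is a minimal sketch that estimates what share of resolvers might still be querying the old name servers some hours after a delegation change. The function name and the model are hypothetical: it assumes, purely for illustration, that resolvers cached the old name server records at evenly spread moments before the change and that they all honor the TTL. Real resolver behavior varies.

```python
def stale_fraction(hours_since_change: float, ttl_hours: float = 48.0) -> float:
    """Estimate the share of resolvers still holding the old delegation.

    Assumption (illustrative only): each resolver cached the old NS records
    at a uniformly random moment in the TTL window before the change, so
    the cached copies expire evenly over the following ttl_hours.
    """
    if ttl_hours <= 0:
        raise ValueError("ttl_hours must be positive")
    remaining = 1.0 - hours_since_change / ttl_hours
    # Clamp to [0, 1]: nothing is stale once a full TTL has elapsed.
    return max(0.0, min(1.0, remaining))


if __name__ == "__main__":
    for hours in (0, 12, 24, 48):
        print(f"{hours:>2} h after change: {stale_fraction(hours):.0%} still stale")
```

Under this simplified model, some queries keep reaching the previous provider until the full TTL has passed, which is why the migration period cannot be shortened from your side.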

The Hidden Danger in DNS
If your DNS provider (including a self-hosted solution) has an extended outage, there is no way to simply switch to another provider, because some DNS resolvers will keep sending queries to the non-responsive provider for up to 48 hours after the change. To put it another way: there is no such thing as a backup or “hot standby” for your DNS provider, because changing your DNS provider requires a migration period lasting up to 48 hours. To mitigate this risk of a single point of failure (SPOF), or rather a single provider of failure, you need multiple providers. Of course, this could be a self-hosted solution plus a third-party provider, but self-hosting DNS leaves you open to attacks that could overwhelm your other services, so we generally see companies moving away from this approach. The multiple-provider approach is sometimes known as multi-DNS or redundant DNS.
At NS1 Connect we often talk to our customers about multi-DNS, whether that takes the form of two or more providers, or of using our Dedicated DNS service to achieve something close to “dual provider” redundancy by delegating their domains to both our Managed network and their single-tenant Dedicated network. We have also recently launched IBM Cloud Sync to allow customers to easily sync DNS configurations between NS1 Connect and Amazon Route 53 (more cloud providers will be added in future).
However, it is still more common to use a single DNS provider and, for those customers of NS1 Connect who take this approach, we have an excellent track record on reliability and back that up with a 100% uptime SLA on DNS resolution. In reality, multi-DNS is just not a "hot topic" and that is likely because it has been a long time since any major provider has had a significant outage with their public authoritative DNS service.
Multi-CDN Steering and the Risk of Moving the Single Point of Failure
Multi-CDN (Content Delivery Network), on the other hand, is a topic that customers have been raising since Cloudflare had two significant outages in November and December of last year. Cloudflare is certainly not the only CDN to have ever had an outage, but these are the most high-profile cases in recent times.
I spoke to a prospective customer recently about the options we have for supporting a multi-CDN strategy. These include simple weighted traffic steering (aka load balancing) that uses synthetic monitoring to test for global availability. The better solution, however, is RUM-based (Real User Monitoring) traffic steering across the edge networks of the CDNs. This approach uses measurements collected from real users, aggregated into groups of users, and makes decisions based on the performance and availability of each CDN for each group. Because its decisions come much closer to optimal, this is the approach we recommend.
One concern the customer had was that moving to a multi-CDN strategy, and switching to NS1 Connect as their DNS provider to do so, would eliminate their original CDN as a SPOF but make NS1 Connect a SPOF for DNS: they would simply have moved the problem from one point to another in the critical path of their end users.
What they really need is both a multi-CDN and a multi-DNS strategy. But implementing multi-CDN steering using RUM-based traffic steering requires NS1 Connect to be the sole provider, as it relies on proprietary technology. As we have already established that there’s no such thing as a backup DNS provider, what can they do?
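As a rough illustration of the idea behind RUM-based steering, the sketch below picks a CDN for one group of users from their recent measurements. It is not NS1 Connect's algorithm, which is proprietary; the function name, the 95% availability threshold, and the use of median latency are all assumptions made for the example.

```python
from statistics import median
from typing import Dict, List, Optional


def pick_cdn(samples: Dict[str, List[Optional[float]]],
             min_availability: float = 0.95) -> Optional[str]:
    """Pick the CDN with the lowest median latency among available ones.

    samples maps a CDN name to latency measurements in milliseconds from
    one group of real users; None marks a failed request. The threshold
    and the median are illustrative choices, not a vendor specification.
    """
    best_name, best_latency = None, float("inf")
    for name, measurements in samples.items():
        succeeded = [m for m in measurements if m is not None]
        # Skip CDNs with no successful samples or too many failures.
        if not succeeded or len(succeeded) / len(measurements) < min_availability:
            continue
        latency = median(succeeded)
        if latency < best_latency:
            best_name, best_latency = name, latency
    return best_name
```

Calling `pick_cdn` separately for each user group (for example per network or region) is what lets the decision differ between groups, whereas a single global weight cannot.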
Combining Multi-CDN and Multi-DNS for Optimal Resiliency
The solution is to have one domain that you use solely for RUM-based traffic steering (see diagram below). For example, if you are customer.com, you could have customer-dns.com or customer-steering.com. In your other domains, wherever you need a multi-CDN decision to be made, you would not attach a traffic steering policy to that domain name. Instead, you would create a CNAME that points to a name in your traffic steering domain and attach the policy there. This means that all your other domains can be delegated to both NS1 Connect and another DNS provider (perhaps the DNS service from one of your CDN providers), with NS1 Connect acting as a secondary to the other provider using the standard DNS zone transfer mechanism.
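The record layout described above can be sketched as a small model. Every name here is a placeholder (www.customer.com, www.customer-steering.com, and the cdn-a.example hostname are hypothetical), and the second entry simply stands in for whatever answer the steering policy would return at a given moment.

```python
# Hypothetical records: name -> (type, target, TTL in seconds).
RECORDS = {
    # Served by BOTH DNS providers (kept in sync via zone transfer).
    "www.customer.com": ("CNAME", "www.customer-steering.com", 60),
    # Served ONLY by the provider running the RUM-based steering policy;
    # the target stands in for the CDN the policy currently prefers.
    "www.customer-steering.com": ("CNAME", "customer.cdn-a.example", 30),
}


def follow_cnames(name, records, max_hops=8):
    """Return the chain of names visited while following CNAME records."""
    chain = [name]
    # max_hops guards against accidental CNAME loops in the sample data.
    while chain[-1] in records and len(chain) <= max_hops:
        rtype, target, _ttl = records[chain[-1]]
        if rtype != "CNAME":
            break
        chain.append(target)
    return chain


if __name__ == "__main__":
    print(" -> ".join(follow_cnames("www.customer.com", RECORDS)))
```

The point of the indirection is that only the steering domain depends on a single provider, while the first hop lives in a zone that either provider can answer.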

Crucially, this CNAME to the traffic steering domain would have a relatively low TTL, say 60 seconds. This means that, in the unlikely case of an outage at NS1 Connect, you could still make a change at your other DNS provider to steer traffic directly to one or another of your CDNs. You could do this programmatically by scripting the changes via that provider’s API, which could give you a disaster recovery plan that takes only several minutes to apply, not up to 48 hours! You could even consider automating this entirely using a third-party monitoring tool. Once NS1 Connect recovers, you can reintroduce the RUM-based traffic steering to ensure optimal steering across your CDNs again.
In summary, choosing an optimized multi-CDN strategy does not mean you have to accept the risk of using a single DNS provider that you cannot quickly migrate away from in the event of a disaster.
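A scripted failover along these lines might look like the sketch below. It deliberately does not assume any real provider API: `update_record` is a hypothetical callable standing in for whatever call your other DNS provider's API or SDK offers, and the record names, CDN hostnames, and timing figures are placeholders.

```python
from typing import Callable, Dict


def build_failover_change(record_name: str, cdn_health: Dict[str, bool],
                          cdn_targets: Dict[str, str], ttl: int = 60) -> dict:
    """Build a record update pointing record_name directly at a healthy CDN."""
    for cdn, healthy in cdn_health.items():
        if healthy:
            return {"name": record_name, "type": "CNAME",
                    "target": cdn_targets[cdn], "ttl": ttl}
    raise RuntimeError("no healthy CDN available to fail over to")


def apply_failover(change: dict, update_record: Callable[[dict], None]) -> None:
    """Push the change through the other provider's API.

    update_record is injected by the caller, so this sketch makes no
    assumption about any specific provider's endpoints or SDK.
    """
    update_record(change)


def worst_case_recovery_seconds(detect_s: int, apply_s: int,
                                cname_ttl_s: int = 60) -> int:
    """Rough upper bound: detect the outage, apply the change, wait out the TTL."""
    return detect_s + apply_s + cname_ttl_s
```

For example, with two minutes to detect the outage, thirty seconds to apply the change, and the 60-second CNAME TTL, `worst_case_recovery_seconds(120, 30)` gives 210 seconds, which is in line with a recovery measured in minutes.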
About the Author
James McCarthy is the Technical Sales Manager for the IBM Network Management & Intelligence portfolio, which includes NS1 Connect. James has worked in the DNS space for over 16 years, 9 of those on NS1 Connect. James has extensive experience working with new and existing customers to build optimal DNS solutions.