
Using Multiple DNS Authoritative Vendors Does Not Work Like You Thought

By Shane Kerr posted Thu February 22, 2024 12:12 PM

  

A few months ago we did some research at IBM, and the results were surprising. We put together a presentation for DNS-OARC recently, which you can see here:

https://www.youtube.com/watch?v=EPZZIMDWUqY 

In this article, we’ll cover this topic as follows: 

  • Provide a bit of background about the part of DNS we are testing 

  • Describe our test  

  • Review what the expectations were 

  • Analyze the actual results of the test 

  • Discuss the implications of the results

Background 

In the DNS, end users typically interact with recursive DNS servers (also called resolvers). These perform the necessary queries to convert DNS names (like ibm.example.com) into IP addresses or other values (like 203.0.113.65 or 2001:DB8::41:1771:0:31a2). Resolvers are typically run by ISPs, universities, companies, and the like. 

The information about which name goes with which IP address (or other information) is held on authoritative DNS servers. In the past many companies ran their own authoritative DNS servers, but now they are much more likely to use a managed DNS service like IBM’s. 

We had a customer who was considering using multiple DNS vendors for their authoritative DNS. This is a common setup, and the idea is that if one of your DNS vendors has a problem with their service, the other DNS vendor will be used. They were a bit hesitant about how well this would work, so we offered to run an experiment to demonstrate this setup in practice.

Test Setup 

For this test we needed a few things: 

  1. A domain to test

  2. Accounts on two separate DNS vendors to publish the domain with

  3. A number of end-user devices using varied resolvers

For the domain, the only requirement is that it be visible on the public Internet. We used a vanity domain that one of our architects volunteered. The time-to-live (TTL) for all the DNS records in the domain was set high enough not to significantly affect the test, except for the record actually queried, whose TTL was set low enough that it would expire between queries.

For the accounts, we used a Route 53 account that we have mostly for R&D purposes, plus an IBM Dedicated Network. A Dedicated Network provides a routing prefix for a single customer, to avoid fate sharing with other customers. In our test we could simulate a vendor failure by withdrawing the BGP route used by the Dedicated Network.

For the end-user devices, we used the RIPE Atlas measurement network. This provides a large number of probes that can originate DNS queries using the resolver configured for the local network. The probes are worldwide, although most are in Europe since that is where the RIPE NCC – the organization that operates the RIPE Atlas network – is located. We chose 80% of our probes from the USA, and 20% from other countries around the world, all picked at random from the list of available RIPE Atlas probes. 

Performing the Test 

The failure that we want to simulate is a DNS provider losing their entire edge – meaning the network equipment and other servers used to deliver their service. This failure was chosen because it was a relatively simple condition to simulate, and is catastrophic if you have a single vendor. 

We created a RIPE Atlas measurement which would perform a DNS query using the local DNS resolver of each probe, repeating every minute: 

https://atlas.ripe.net/measurements/62455258/ 
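
For anyone who wants to set up a similar measurement, here is a rough sketch of how the same kind of recurring DNS measurement could be created through the RIPE Atlas v2 API. The API key, probe counts, and query name below are placeholders rather than the values we actually used, and the field names should be checked against the current RIPE Atlas documentation.

```python
# Sketch: creating a recurring DNS measurement via the RIPE Atlas v2 API.
# The API key, probe counts, and query name are placeholders, not the values
# used in our test.
import requests

API_URL = "https://atlas.ripe.net/api/v2/measurements/"
API_KEY = "YOUR_ATLAS_API_KEY"  # placeholder

measurement = {
    "definitions": [{
        "type": "dns",
        "af": 4,
        "description": "multi-vendor authoritative DNS failover test",
        "query_class": "IN",
        "query_type": "A",
        "query_argument": "test.example.com",  # placeholder for the record under test
        "use_probe_resolver": True,            # query each probe's local resolver
        "interval": 60,                        # repeat every minute
    }],
    "probes": [
        {"requested": 80, "type": "country", "value": "US"},  # ~80% of probes from the USA
        {"requested": 20, "type": "area", "value": "WW"},     # ~20% from the rest of the world
    ],
    "is_oneoff": False,
}

response = requests.post(API_URL, params={"key": API_KEY}, json=measurement)
response.raise_for_status()
print(response.json())  # contains the new measurement ID
```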

After letting the RIPE Atlas measurement run for a brief time, we withdrew the route from the Dedicated network, simulating a complete vendor failure.  

After some time, we advertised the route again, simulating service restoration. 

Expected Outcome 

We rely on DNS resolvers to provide answers as quickly as possible. To do this, they cache data and also track the speed at which different authoritative DNS servers respond, so they can use the fastest ones. In theory, an authoritative DNS server which does not respond at all should get few queries, as resolvers shift their traffic to other authoritative DNS servers. 
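
As a rough illustration of the server-selection idea (not the code of any particular resolver), a resolver can keep a smoothed round-trip time per authoritative address, prefer the fastest one, and penalize addresses that time out:

```python
# Simplified sketch of how a resolver might track authoritative server speed.
# Real resolvers (BIND, Unbound, Knot Resolver, ...) use more elaborate
# schemes; this only illustrates the general idea, not a specific implementation.
import random

TIMEOUT_PENALTY_MS = 2000.0  # treat a timeout as a very slow answer
DECAY = 0.7                  # weight given to the previous estimate

class ServerSelector:
    def __init__(self, addresses):
        # Start every server with the same optimistic estimate.
        self.srtt = {addr: 1.0 for addr in addresses}

    def pick(self):
        # Mostly use the fastest-looking server, but occasionally re-probe
        # others so that a recovered server can be noticed again.
        if random.random() < 0.05:
            return random.choice(list(self.srtt))
        return min(self.srtt, key=self.srtt.get)

    def record(self, addr, rtt_ms=None):
        # rtt_ms is None when the query timed out.
        sample = TIMEOUT_PENALTY_MS if rtt_ms is None else rtt_ms
        self.srtt[addr] = DECAY * self.srtt[addr] + (1 - DECAY) * sample
```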

After a previously-working edge of a DNS vendor fails, the idea is that DNS resolvers would send queries to the now-broken servers, discover that they are not working, and then move to working servers.

The expectation is that there would be a period of instability and slowness, followed by a period where occasional checks against the now-broken servers would very slightly degrade service, but mostly things would be fine.

Actual Outcome 

Clients were still able to get replies throughout the test, without any noticeable increase in failures. Good! 

Replies took longer than during normal operation. Bad! 

To our surprise, when we took an edge down, we saw increased query times as expected, but the elevated query times never went away. Even after every resolver should have been able to shift traffic to working servers, they continued to use the downed edge, no matter how long the test ran.

Analysis of Results 

RIPE Atlas provides information about each query and response in JSON. We wrote a Python program to track these based on the address of the DNS resolver used and plot this information using Matplotlib, to look for patterns. Here’s a diagram showing how to interpret our graphs:

We will show a couple of graphs arranged like this. 
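
Before looking at the graphs, here is roughly what the analysis does. The actual script is more involved, but the general approach looks like the sketch below; the JSON field names (`resultset`, `dst_addr`, `rt`) reflect our reading of the RIPE Atlas DNS result format and may need adjusting against the documentation.

```python
# Sketch of the analysis: group RIPE Atlas DNS response times by resolver
# address and plot them over time. Timeouts appear as entries without an
# "rt" value; a fuller version would plot those separately.
import json
from collections import defaultdict

import matplotlib.pyplot as plt

with open("measurement-62455258.json") as f:   # results downloaded from RIPE Atlas
    results = json.load(f)

samples = defaultdict(list)  # resolver address -> list of (timestamp, response time in ms)

for result in results:
    ts = result["timestamp"]
    for entry in result.get("resultset", []):   # one entry per local resolver queried
        resolver = entry.get("dst_addr", "unknown")
        rt = entry.get("result", {}).get("rt")  # response time in ms, if answered
        if rt is not None:
            samples[resolver].append((ts, rt))

fig, ax = plt.subplots()
for resolver, points in samples.items():
    points.sort()
    ax.plot([p[0] for p in points], [p[1] for p in points], marker=".", linestyle="none")
ax.set_xlabel("timestamp")
ax.set_ylabel("response time (ms)")
plt.show()
```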

Public DNS Resolvers 

For the first graph, let's look at the results from public DNS resolvers. Public DNS resolvers are available for anyone to query, and make up a significant minority of DNS query traffic.

All the public DNS resolvers saw service degradation, although the impact was heavier on some operators than others. In this view, you see horizontal stripes, which show where resolvers sent queries to non-responsive authoritative servers, eventually timed out, and chose another server. There were 8 IP addresses used by authoritative DNS servers for the test domain, and half were down during the outage, so an unlucky resolver might have to make up to 5 attempts before getting an answer.

Why do we see these results? 

Large public DNS operators have clusters of resolvers operating at different locations around the world. These resolvers either do not share any cache at all, or only share information about answers and not about authoritative servers. A client may end up at a different resolver on every query, and the new resolver will not know anything about the non-working edge. 

There are a few interesting side effects of this. First, the more resolvers a given DNS operator has at a location, the more likely it is that a client will end up at a resolver that does not know about the outage. So, having more capacity and resilience at a site works against getting answers quickly in this style of outage. Next, in a real-world outage with many clients, more of a given public DNS operator's resolvers would learn about the outage. So, having lots of users for a service will result in better response times from public DNS resolvers.

Resolvers Using Public IP Addresses 

Most of the RIPE Atlas probes used a DNS resolver with a private IP address (RFC 1918 addresses, link-local IPv6 addresses, and the like). Since the same address may be used by separate networks, it is difficult to learn anything about the resolvers running on these addresses.

Many of the RIPE Atlas probes used a public DNS resolver, covered above. 

The final category is probes using a non-public DNS resolver with a normal, globally-unique IP address. Several of the RIPE Atlas probes used this kind of resolver.

Each of the resolvers here had only a single RIPE Atlas probe querying it. We see different behaviors, with some DNS resolvers retrying, some seeing only a small impact, and even one with apparently no impact. A few of the DNS resolvers show what was originally expected: little impact, but with a pattern of occasional attempts to probe the offline servers to see if they are working again.

The important thing to take from this graphic is that there are several different degradation modes. This implies that there is not a single implementation that performs poorly, but rather that there are likely several different DNS resolver implementations that all perform poorly – albeit differently – in this failure scenario.

Implications of the Results

First off, the good news is that everything worked. While it took longer to get replies (sometimes several seconds), in the end DNS resolvers were able to find a working authoritative DNS server and return the correct reply to the client. 

For domains that do not change often, this means that the domain operator can set the TTL value for the records in the zone to be relatively high, and once the DNS resolvers have looked up the answer, they will be unaffected by any outage for a long time.

However, domains that change often probably do so for network traffic steering or other highly-dynamic purposes, and need clients to get fresh information. They cannot use a high TTL value. Taking half a second for a DNS lookup is bad for these domains; taking several seconds is far worse.

Can the Speed of Resolution Be Improved? 

One hope when we first saw the poor performance was that a single large public DNS resolver operator was the source of the slow answers, or that a single implementation had an issue. Since we see poor performance in this scenario across many different DNS resolvers, and their behavior is not identical, it is likely that the issue cannot be fixed easily.

It is possible that over time improvements will be made on the DNS resolver side, but there is likely no quick or easy solution. 

On the side of the authoritative DNS there is probably little that can be done. 

Any solution involving alternate setups for the authoritative DNS ultimately relies on the behavior of DNS recursive resolvers, so will just move the point of failure around. 

One improvement that will probably work is to use more authoritative DNS vendors. For example, if a company had four vendors, each providing two of eight total addresses, then an outage of one vendor would have less impact on overall service for the domain.
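
A very rough way to see why this helps: if a resolver with no prior knowledge tries the advertised addresses in a random order until one answers, the expected number of attempts depends only on how many of the addresses are down. The simplified model below ignores caching, server-speed tracking, and parallel retries, but it shows the trend:

```python
# Back-of-the-envelope model: a resolver with no history tries the NS
# addresses in random order until one responds. The expected number of
# attempts is (total + 1) / (working + 1).
def expected_attempts(total_addresses, down_addresses):
    working = total_addresses - down_addresses
    return (total_addresses + 1) / (working + 1)

# Two vendors with 4 addresses each: one vendor outage takes 4 of 8 down.
print(expected_attempts(8, 4))  # 1.8 attempts on average
# Four vendors with 2 addresses each: one vendor outage takes 2 of 8 down.
print(expected_attempts(8, 2))  # ~1.29 attempts on average
```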

We might be able to improve the situation by using the routing layer, in a similar way to how DNS works around the problem of getting closer to clients by using anycast. However, for this to work across multiple vendors, they would have to cooperate. Few organizations would be willing to coordinate on this level, and it would add significant complexity. If they did, their combined setup would no longer be two independent vendors, which is an important reason for having multiple providers in the first place.

If an organization has a specific application that they use, then it can be configured to use multiple domain names. For example, instead of just using appsrv.coolgame.example, the application could be configured to use both appsrv.coolgame.example and appsrv.coolgame.example.com. In this case there are two completely independent names, which have no single point of failure beyond the root zone, and which can be queried by the application using something like Happy Eyeballs, or even always in parallel, ensuring fast DNS resolution at all times, even during the complete failure of one of their DNS vendors. This is only possible for applications that support such operation, or can be modified to do so, so it is not useful for organizations that provide web-based services.
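
As a sketch of this client-side approach (using the example names from the paragraph above; the "resolve both names and take the first answer" strategy shown here is just one possible variant):

```python
# Sketch: resolve two independent names for the same service in parallel and
# use whichever answer arrives first. The names are the illustrative ones
# from the text; a real application would add its own timeout and fallback
# policy, and this uses the OS stub resolver rather than raw DNS queries.
import asyncio
import socket

NAMES = ["appsrv.coolgame.example", "appsrv.coolgame.example.com"]

async def resolve(name):
    loop = asyncio.get_running_loop()
    infos = await loop.getaddrinfo(name, 443, type=socket.SOCK_STREAM)
    return name, sorted({info[4][0] for info in infos})

async def first_answer():
    pending = {asyncio.create_task(resolve(name)) for name in NAMES}
    last_error = None
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is None:
                for other in pending:  # a usable answer arrived; stop waiting for the rest
                    other.cancel()
                return task.result()
            last_error = task.exception()
    raise RuntimeError("all lookups failed") from last_error

if __name__ == "__main__":
    print(asyncio.run(first_answer()))
```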

Further Research 

Like most research, this experiment produced more questions than answers.

This experiment was chosen because it was the simplest one that would help assuage our customer’s concerns. A simpler experiment could be designed using unicast IP addresses for several authoritative DNS servers, where the servers themselves just blackholed traffic, instead of relying on anycast or BGP to induce failure.

Experiments can be designed that cause failure further up the DNS hierarchy. Some DNS resolvers are parent-centric and some are child-centric, so this might reveal interesting behavior differences. 

It is much more likely that a partial failure occurs, rather than a total failure. For example, a vendor might have part of their anycast edge down, or be unable to propagate changes to some of their servers. 

Failures involving specific technologies are also not just possible, but likely. There can be IPv4 outages, or DNSSEC failures, zone truncations, and so on. 

In our experiment, we did not investigate specific implementations. Doing a deep dive into the source code of DNS resolvers might show ways to improve performance in scenarios like the one we have... although this might add complexity the software maintainers are unwilling to add to their code. 

More work on optimizing the number and layout of authoritative name servers and their addresses might yield better recommendations for domain holders. As indicated earlier there are drawbacks to having too many addresses per vendor, and benefits to having more independent vendors. 

IBM currently has no specific plans for further research on this issue, but we are happy to cooperate with anyone interested in looking into this area.

Conclusion 

Using separate authoritative DNS vendors provides redundancy, but if one fails, DNS service can still be impacted: many DNS recursive resolvers will keep going to the non-responsive authoritative servers and have to time out and retry their queries.


Comments

Tue February 27, 2024 03:13 PM

The presentation at OARC was excellent.  I highly recommend you watch this if you have any interest in this topic.